Multimodal large language models (LLMs) are models capable of processing several types of data – including text, images, audio and video. LLMs fall under the broader category of generative AI – the family of models and techniques that produce new content.
Generative AI is a powerful and increasingly necessary tool for businesses, and one of its most useful back-office functions is extracting data from PDFs.
PDF files are convenient for sending and receiving information, but finding the required data points across vast batches of files can be challenging. Take financial statements, for example. They are lengthy and laden with data, so locating a pertinent figure is a slow, tedious process. Enter generative AI.
Generative AI – in the form of multimodal LLMs – can read PDFs quickly. However, the cost of that convenience can be egregious errors, which is why human-in-the-loop review (a system incorporating direct human feedback) remains necessary for enterprise deployment.
Errors – such as incorrectly extracting data from PDFs and images – are often an unintended consequence of training the model.
During general model training, a crucial step is dividing the dataset into a larger portion for training and a smaller one for validation. The larger part trains the model, while the smaller portion tests its performance on unseen data.
In this case, we judge the model’s performance by its ability to make accurate predictions on the data it is given. If the model performs well on the training data but poorly on the validation data, that signals a problem (usually overfitting: the model has started memorising the training data instead of learning the underlying patterns).
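For readers who prefer to see this in code, the sketch below shows a train/validation split using scikit-learn. The synthetic dataset and simple classifier are illustrative assumptions, not the models discussed in this article:

```python
# A minimal sketch of a train/validation split. The toy dataset and logistic
# regression model are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset: 1,000 samples, 20 features, binary labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% of the data for validation; the rest is used for training.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # performance on data the model has seen
val_acc = model.score(X_val, y_val)        # performance on held-out, unseen data

# A high training score paired with a much lower validation score is the
# classic sign of overfitting.
print(f"training accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```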
Therefore, general model training aims to keep the validation error rate as low as possible. Consider the following example:
Looking only at the training error rates of Models 1 and 2, one might assume that Model 2 would perform better, since it has the lower training error.
However, the training error only indicates how well the model performs on the data it was trained on, while the validation error indicates how well it generalises to new, unseen data. Therefore, the model with the lower validation error would be the better-performing model in production – in this example, Model 1.
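In code, the selection rule is simply to pick the model with the lowest validation error. The error rates below are illustrative placeholders, not measured figures:

```python
# Illustrative (made-up) error rates for two hypothetical models.
models = {
    "Model 1": {"train_error": 0.08, "val_error": 0.10},
    "Model 2": {"train_error": 0.02, "val_error": 0.25},
}

# Choose the model that generalises best: the one with the lowest validation
# error, regardless of which has the lower training error.
best = min(models, key=lambda name: models[name]["val_error"])
print(f"Deploy {best}")  # -> Deploy Model 1
```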
Nevertheless, even rigorous training can result in a model processing unseen data poorly. Tests by our engineers showed that multimodal models like GPT-4 cannot consistently extract data from PDFs of financial documents. In one notable example, the model processed a financial statement but omitted a significant asset from its Property, Plant and Equipment (PP&E) calculations – an error that no (human) analyst or accountant would ever make. In another, the model misread simple high-definition tabular data, producing wildly incorrect figures from an annual report.
If left uncorrected in the output file, these types of errors can infect a business’s decision-making process, manifesting as wasted resources, inaccurate risk perception, missed opportunities, etc.
Hallucinations occur when the model loses context over a lengthy interaction or encounters information it does not know. Rather than acknowledging the gap in its knowledge, the model simply makes up a response.
During experimentation with GPT-4, we noticed issues with fabricated data. For instance, after uploading Apple’s ESG Report 2022, we asked it how many personal donations (as opposed to corporate donations) had been made. GPT-4 confidently produced a response:
These figures are incorrect: they actually correspond to the number of inclusion and diversity training hours in 2020 and 2021. Note that GPT-4’s response is also inconsistent with the fact that Apple has only 165,000 employees (as stated on page 68).
Examples like these illustrate why GPT-4 is an ineffective solution for businesses looking to extract data accurately from documents. Although human checking may mitigate some of GPT-4’s shortcomings, publicly available LLMs are generally unsuitable for PDF analysis.
Analysing single-page documents with tools like GPT-4 is a straightforward process. However, this process becomes more challenging with longer documents, mainly due to GPT-4’s limited context window. For instance, the model may struggle to maintain context if a document spans hundreds of pages and requires cross-referencing distant sections (such as the first or last page). While LLMs offer extensive capabilities with natural language, they face limitations in processing long documents and require significant computational resources and time.
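One common workaround is to split a long document into smaller, overlapping chunks so that each piece fits within the model’s context window. The sketch below is a simple character-based version; a production system would typically count tokens with the model’s own tokenizer, and the sizes shown are assumptions for illustration:

```python
def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks that each fit within a context window.

    The sizes here are illustrative; real limits depend on the model's tokenizer
    and maximum context length.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap preserves some context across chunk boundaries
    return chunks
```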
For those interested in the details, it’s worth exploring the RAG technique, which was developed recently to counter this long-document limitation. Retrieval-Augmented Generation (RAG) incorporates external knowledge sources (such as document indexes or knowledge graphs) during the generation process, resulting in more accurate and more attributable outputs.
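As a rough illustration, a minimal RAG pipeline embeds the document chunks, retrieves those most similar to the question, and passes only that context to the LLM. The embedding model used below (sentence-transformers) is an assumption for the sketch; any embedding model and any chat-style LLM could be substituted:

```python
# A minimal retrieval step for RAG: embed chunks, rank them by similarity to the
# question, and build a prompt containing only the most relevant context.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec            # cosine similarity (vectors are normalised)
    top = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top]

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(retrieve(question, chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```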
Ongoing research also aims to reduce the size of LLMs, making them more efficient without compromising their performance. Further developments will likely lead to faster, more cost-effective training processes. As generative AI evolves, LLMs will become more adept at handling complex tasks, including accurate extraction from documents.
At first glance, large multimodal models like GPT-4 seem like convenient and cost-effective solutions for capturing and structuring data from PDFs. However, as the examples above demonstrate, LLMs are prone to making errors, meaning they are not, on their own, a viable option for businesses.
You can, however, train multimodal LLMs on huge datasets to specialise them in data extraction. If you would like to find out more about Evolution AI’s data capture solution, please book a demo or email hello@evolution.ai.
We’ve also developed an LLM-based solution for capturing data from financial statements, Financial Statements AI. Book a demo for more information.