LLMs and multimodal models for document processing - A Proof of Concept
In today’s fast-paced world, businesses are constantly seeking ways to streamline processes and maximize efficiency. One area that consumes a disproportionate amount of precious time is document processing: sorting through resumes, or extracting relevant information from contracts and invoices. But what if we could do this more efficiently? Enter Large Language Models (LLMs). These models are changing the game, offering a revolutionary approach to document processing that promises to reshape how organizations handle vast amounts of textual data.
Large Language Models
For those who are not familiar with the term Large Language Models, let’s quickly explore what they are. LLMs, such as OpenAI’s GPT series, are artificial intelligence models trained on vast amounts of text data from the internet. These models have been fine-tuned on diverse datasets, allowing them to understand and generate human-like text across a wide range of topics and styles. They are capable of performing a variety of language-related tasks, including language translation, text generation, summarization, and, crucially for document processing, information extraction.
Now what can they do for us?
Let’s take the example of a recruiter. Recruiters are all too familiar with the daunting task of going through stacks of resumes to identify qualified candidates for a job opening. Traditionally, this process has been labor-intensive. With Large Language Models, however, it can be automated to a large extent. In a POC we developed internally, we used an LLM (Azure OpenAI GPT-3.5 Turbo, to be specific) to extract the required skills from one of our vacancies. Next, we used the same model to rate a set of fictitious resumes for a match on these skills. The result is a table with skills and the degree to which each resume matches the required skills.
By leveraging the model’s natural language processing capabilities, recruiters can build pipelines that scan resumes and extract relevant information based on predefined criteria. For example, if a job posting requires proficiency in certain programming languages, the LLM can be prompted to identify mentions of those languages in the resumes and flag candidates who possess the necessary skills.
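To make this concrete, here is a minimal sketch of what such a step could look like with the Azure OpenAI Python SDK. The deployment name, the prompts, and the JSON output handling are illustrative assumptions, not the exact code of our POC:

```python
import json
import os

from openai import AzureOpenAI  # pip install openai

# Hypothetical deployment name for GPT-3.5 Turbo on Azure OpenAI.
DEPLOYMENT = "gpt-35-turbo"

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def extract_skills(vacancy_text: str) -> list[str]:
    """Ask the model for the skills a vacancy requires, as a JSON array."""
    response = client.chat.completions.create(
        model=DEPLOYMENT,
        temperature=0,  # keep the extraction as deterministic as possible
        messages=[
            {"role": "system",
             "content": "List the skills required by this vacancy as a JSON array of strings."},
            {"role": "user", "content": vacancy_text},
        ],
    )
    # Assumes the model returns bare JSON; production code should validate this.
    return json.loads(response.choices[0].message.content)

def rate_resume(resume_text: str, skills: list[str]) -> dict[str, int]:
    """Score one resume from 0 (no evidence) to 10 (expert) on each skill."""
    prompt = (
        f"Rate this resume from 0 to 10 on each of these skills: {', '.join(skills)}. "
        "Respond as a JSON object mapping skill to score.\n\n"
        f"Resume:\n{resume_text}"
    )
    response = client.chat.completions.create(
        model=DEPLOYMENT,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```

Running `rate_resume` over a folder of resumes and collecting the results yields exactly the kind of skills-by-candidate table described above.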
Not only does this approach save time and resources, but it also helps mitigate bias by ensuring that all resumes are evaluated based on the same criteria. Additionally, LLMs can continuously learn and improve over time, refining their ability to accurately identify relevant information and adapt to changing job requirements.
Beyond text: Multimodal Models
We could apply the same principle to contracts or invoices, but for these document types, analyzing the text alone might not be sufficient. Enter multimodal models. GPT-4 (the model behind ChatGPT Plus) and Gemini Pro (Google’s competitor) are examples of these multimodal models: besides text, they can also handle other types of content. For GPT-4 and Gemini Pro, the list of content types is limited to text and images, but GPT-4o, the latest installment of OpenAI’s GPT family, introduced on May 13, extends the range of modalities to text, image, audio and video.
In a chat context, multimodality is a cool gimmick that extends user-friendliness and attests to the power of the models. But what if we could harness that same power to solve a specific business use case?
As an example, we chose to apply these multimodal models to automate receipt processing. The context is the following: all businesses, be it independent freelancers, neighborhood butchers, or large corporations, need to collect, analyze, process and store receipts for accounting purposes. Collecting the documents, extracting their contents, and assigning them to the correct ledger account (e.g., restaurant expenses, furniture and rolling assets…) can be a tedious manual process. But what if we could use multimodal AI for that?
At CROPLAND we have implemented a proof-of-concept system that monitors an e-mail address used specifically for invoicing. Each e-mail received at this address is scanned for PDF attachments. When an attachment is found, a process kicks in that converts the first page of the PDF to an image. This image is then analyzed with Azure OpenAI GPT-4 Turbo with Vision to determine (a code sketch of this conversion-and-extraction step follows the list):
- Whether the document is a receipt or another type of document
- The creditor party
- The creditor’s VAT number
- The receipt total excluding VAT
- The total VAT amount
- The document date
- A receipt title
- A receipt description
- The ledger account this document could belong to
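A minimal sketch of this step, using pdf2image to render the first page and sending it to the vision model as a base64 data URL; the deployment name and the exact prompt are assumptions for illustration:

```python
import base64
import io
import json
import os

from openai import AzureOpenAI            # pip install openai
from pdf2image import convert_from_path   # pip install pdf2image (requires poppler)

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

FIELDS = (
    "is_receipt, creditor, creditor_vat_number, total_excl_vat, "
    "total_vat_amount, document_date, title, description, ledger_account"
)

def first_page_as_data_url(pdf_path: str) -> str:
    """Render page 1 of the PDF to PNG and wrap it in a base64 data URL."""
    page = convert_from_path(pdf_path, dpi=200, first_page=1, last_page=1)[0]
    buffer = io.BytesIO()
    page.save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode()
    return f"data:image/png;base64,{encoded}"

def analyze_document(pdf_path: str) -> dict:
    """Ask the vision model to extract the accounting fields as JSON."""
    response = client.chat.completions.create(
        model="gpt-4-turbo-vision",  # hypothetical name of the Azure deployment
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Extract the following fields from this document as JSON: "
                         f"{FIELDS}. Use null for anything you cannot determine."},
                {"type": "image_url",
                 "image_url": {"url": first_page_as_data_url(pdf_path)}},
            ],
        }],
    )
    # Assumes the model returns bare JSON; production code should validate this.
    return json.loads(response.choices[0].message.content)
```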
If GPT-4 determines that the document is relevant for accounting purposes, the original attachment is saved to a cloud storage bucket. The information extracted from the document, in turn, is saved to an Excel Online workbook, which links directly to the source document on cloud storage.
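One possible way to append the extracted fields to an Excel Online workbook is the Microsoft Graph workbook API, assuming the workbook exposes a table (here called "Receipts"); the item ID, table name, and token acquisition below are placeholders, not our POC’s actual configuration:

```python
import requests  # pip install requests

GRAPH = "https://graph.microsoft.com/v1.0"

def append_receipt_row(token: str, workbook_item_id: str,
                       fields: dict, blob_url: str) -> None:
    """Append one extracted receipt as a row to a table in an Excel Online workbook.

    `workbook_item_id` and the table name "Receipts" are placeholders; the
    bearer token must carry a Graph permission such as Files.ReadWrite.
    """
    row = [[
        fields["creditor"], fields["creditor_vat_number"], fields["document_date"],
        fields["total_excl_vat"], fields["total_vat_amount"], fields["ledger_account"],
        f'=HYPERLINK("{blob_url}", "source document")',  # link back to cloud storage
    ]]
    response = requests.post(
        f"{GRAPH}/me/drive/items/{workbook_item_id}/workbook/tables/Receipts/rows/add",
        headers={"Authorization": f"Bearer {token}"},
        json={"values": row},
    )
    response.raise_for_status()
```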
Things to consider
Azure OpenAI is not free! For each document we process, the cost amounts to about 5 eurocents. For an SME like CROPLAND, which processes no more than 50 receipts per month, that works out to roughly 2.50 euros per month, hardly a deal breaker. Larger corporations, however, should evaluate the business case, because it may make more sense to run your own instances of open-source alternatives; in our tests, these perform on a par with GPT-3.5 or GPT-4. When processing volumes are low, though, the cost of the hardware required for smooth inference is much higher than the prices Microsoft Azure charges for OpenAI model API calls.
Another thing to consider is multi-page documents. Most receipts are just one page, so for our invoice proof of concept we could afford to cut some corners and use only the first page. However, if you want to process multi-page documents that mix text and images, you may want to look into models like Microsoft’s LayoutLM. These models have learned the regularities of where information can be found (e.g., the ship-to address is usually located in the top-left corner of an invoice) and combine that with a good understanding of the textual and tabular information contained within the document.
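As an illustration, a publicly available LayoutLM checkpoint fine-tuned for document question answering can be queried in a few lines via Hugging Face Transformers; the file name below is illustrative, and any rendered page image would do:

```python
from transformers import pipeline  # pip install transformers pillow pytesseract

# A community LayoutLM checkpoint fine-tuned for document QA; the pipeline
# runs OCR (via pytesseract) and reasons over the text *and* its layout.
doc_qa = pipeline("document-question-answering",
                  model="impira/layoutlm-document-qa")

# Ask a layout-aware question against a rendered page image.
answers = doc_qa("invoice_page1.png", "What is the ship-to address?")
print(answers[0]["answer"], round(answers[0]["score"], 2))
```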
Conclusion
In conclusion, the integration of LLMs and multimodal AI models like GPT-3.5, GPT-4 and Gemini Pro Vision into automated document processing represents a significant advancement for businesses. By harnessing the image-to-text capabilities and deep text understanding of these models, organizations can streamline tedious manual tasks such as invoice processing.
Our proofs of concept demonstrate the potential of this technology to efficiently extract key information from documents. This not only accelerates document handling but also reduces errors and enhances data accuracy.