When it comes to automating invoice processing, a key component is invoice digitization.
Invoice digitization eliminates the need to manually enter invoices into back-end systems by using technologies such as Optical Character Recognition (OCR) and machine learning.
But the question on everyone’s mind is how accurate digitization really is, which is why it is important to define what accuracy means in the first place.
Does accuracy mean extracting exactly what is shown on an invoice? If so, what happens when the invoice itself has issues, such as amounts that do not add up?
We believe the best way to define accuracy is by examining how consistent data is.
Data consistency can be viewed as how logical a piece of data is. For example, if a tax amount is extracted, does it make sense as a percentage of the total amount? Similarly, do the costs of the line items add up to the total indicated on the invoice?
With accuracy defined as data consistency, users of invoice digitization can be alerted to exceptions that arise not only from the digitization process but also from issues in the content of the invoice itself.
This ability to manage by exceptions will help users of such digitization applications save time as they will not have to look at every single invoice and field to perform verifications.
This data consistency or accuracy is crucial in helping organisations gain the efficiency they are looking for.
Application of Technology to Data Consistency
In the recent past, when companies talked about data extraction from images or scanned documents, most people would have singled out OCR technology.
Today, thanks to increased computing power, technologies such as machine learning have come to the forefront of data extraction. But to attain consistency or accuracy of extracted data, we find it is essential to use a combination of technologies.
OCR is always the first step, converting images into text.
This first step is important and the quality of the image is strongly correlated to the quality of the extraction process. While multiple algorithms are used to brighten and straighten images, if the image provided is of inferior quality then this fundamental step will cause the ensuing steps to falter.
Machine learning models, which are statistical in nature, are then applied to the text obtained from OCR. This is where the computer understands the documents, classifies them and extracts the required data.
In this process, each piece of extracted data is allocated a confidence score expressed as a percentage. If the model determines that the extracted data matches the requirements well, a high score is given; otherwise, a lower score is allocated. This is the first indication of how accurate the extracted data is.
The extracted data is then run through a battery of deterministic algorithms that check for the next level of consistency. These checks include logical validations and business validations.
Logical validations include checks such as whether an extracted date fits the format of a date. Business validations include calculations such as whether the extracted total amount equals the sum of the costs of all line items.
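To make these two kinds of checks concrete, here is a minimal sketch in Python. The function names, the accepted date formats and the rounding tolerance are our own assumptions for illustration, not a description of any particular product. It also assumes the extracted total is directly comparable to the line item amounts, that is, both are quoted on the same tax basis.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical list of accepted date formats; a real system would configure these.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def is_valid_date(text):
    """Logical validation: does the extracted string parse as a calendar date?"""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text.strip(), fmt)
            return True
        except ValueError:
            continue  # try the next accepted format
    return False


def line_items_match_total(line_amounts, total, tolerance=Decimal("0.01")):
    """Business validation: do the line item amounts sum to the extracted total?

    The tolerance (an assumed value) absorbs small rounding differences.
    """
    return abs(sum(line_amounts, Decimal("0")) - total) <= tolerance
```

Using `Decimal` rather than floating point avoids spurious mismatches caused by binary rounding when currency amounts are added together.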
The combination of the confidence scores and deterministic checks provides a view on the consistency of extracted data. Based on this, the data can be flagged for inconsistencies or can be regarded as consistent.
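One way to picture this combination is the short Python sketch below. The `ExtractedField` structure, the 0.90 threshold and the rule that any failed check flags a field are all assumptions made for illustration; an actual solution may weigh these signals differently.

```python
from dataclasses import dataclass, field

# Assumed cut-off below which a field is sent for human review.
CONFIDENCE_THRESHOLD = 0.90


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # model score between 0.0 and 1.0
    failed_checks: list = field(default_factory=list)  # names of failed validations


def is_consistent(f, threshold=CONFIDENCE_THRESHOLD):
    """A field is consistent only if no check failed and the score is high enough."""
    return not f.failed_checks and f.confidence >= threshold


def exceptions(fields, threshold=CONFIDENCE_THRESHOLD):
    """Return the names of the fields a user would need to look at."""
    return [f.name for f in fields if not is_consistent(f, threshold)]
```

With this shape, a user reviews only the fields returned by `exceptions` instead of every field on every invoice.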
Geography, Industry and Accuracy
Extracting accurate or consistent data from invoices is made more complex when invoices come from across industries and geographies. Providing accurate or consistent data becomes more challenging when the constructs that help us determine consistency vary.
For example, invoices coming from different geographies will have different tax rates applicable. This means that after extracting a tax amount and comparing it to the total amount to determine consistency, the appropriate tax rate needs to be taken into consideration (for example, 7% GST in Singapore versus VAT in Europe, where rates vary by country).
This requires the logical and business validation process to be flexible and intelligent in determining which parameters to use when ascertaining whether data is consistent.
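A simple way to sketch such a parameterised check in Python is shown below. The rate table, the country codes and the tolerance are illustrative assumptions only and may not reflect current law; the sketch also assumes the extracted total is tax-inclusive, so the expected tax is total × rate / (1 + rate).

```python
from decimal import Decimal

# Illustrative rates only; a real rule set would be maintained per jurisdiction.
EXPECTED_TAX_RATES = {
    "SG": Decimal("0.07"),  # GST figure quoted in the text above
    "DE": Decimal("0.19"),  # assumed example of a European VAT rate
}


def tax_is_consistent(tax, total, country, tolerance=Decimal("0.02")):
    """Compare the extracted tax with the tax implied by the country's rate.

    Returns None when no rate is configured, so the caller can decide
    whether to flag the invoice or skip the check.
    """
    rate = EXPECTED_TAX_RATES.get(country)
    if rate is None:
        return None
    expected = total * rate / (1 + rate)  # tax portion of a tax-inclusive total
    return abs(tax - expected) <= tolerance
```

Keeping the rates in a lookup table rather than in the check itself means that a new geography can be supported by adding data instead of changing logic.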
Also, the data that is required to be extracted from invoices differs across industries.
For example, for services-based companies, having just the total and tax amounts at the header level would suffice. For manufacturing companies, however, there will be a requirement to extract line item details in addition to the header-level content.
This difference in requirements means that the complexity of providing consistent data increases because there isn’t a one-size-fits-all process that will work.
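The industry difference described above can be expressed as configuration rather than separate processes. The Python sketch below is a hypothetical illustration: the industry keys and field names are our own and would differ in a real deployment.

```python
# Hypothetical mapping of industry to the fields that must be extracted.
REQUIRED_FIELDS = {
    "services": {"total_amount", "tax_amount"},
    "manufacturing": {"total_amount", "tax_amount", "line_items"},
}


def missing_fields(extracted, industry):
    """Return the required fields that are absent or empty for this industry."""
    required = REQUIRED_FIELDS[industry]
    return sorted(name for name in required if not extracted.get(name))
```

The same extracted invoice can therefore be complete for one industry and flagged as an exception for another.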
Across industries and geographies, the models and rule sets used in extracting data and ensuring its consistency will have to be continually updated or trained to handle multiple scenarios.
Thus, harnessing the right kinds of technologies and applying them in the appropriate ways enables invoice digitization solutions to be flexible with the extraction and delivery of accurate data.
This allows the users of these solutions to fully benefit from the efficiencies they have to offer.