Intelligent Document Processing (IDP) is a technology that helps companies in various sectors unlock the data hidden in the documents their processes depend on. It combines components such as OCR, business rules, business process management and analytics. Its primary functions, however, are understanding document types and extracting the data inside those documents.
Because the underlying technology is complex and several layers of output lead to the final result, it is difficult to understand and measure the effectiveness of an IDP system. Today various vendors offer different flavors of IDP solution. Besides the pure-play IDP products, RPA vendors and system integrators also offer IDP solutions. All of them promise accuracy as one of their key performance metrics.
Because document processing involves multiple stages, vendors may not all have the same definition in mind when they talk about accuracy. This blog tries to dissect and demystify the meaning of accuracy as used in the context of IDP.
Steps in IDP and components of accuracy
The output of each step feeds into the next, so the accuracy of one stage inevitably affects the accuracy of the next. If we focus on each step in turn and define accuracy at that stage, we end up with the following definitions of accuracy.
- OCR accuracy – Percentage of the original document digitized correctly. This is influenced by the OCR engine employed and, to a large extent, by the preprocessing steps applied before OCR. Post-processing steps on the OCR’ed data can further improve the accuracy of the text that is then fed into the classification process.
- Document identification accuracy – Percentage of document types identified correctly. This depends mainly on the method used to identify the document, and different products use different methods. The most common approach is to look for characteristics such as the document title, headings, and other typical phrases that suggest the beginning and end of a specific document.
- Indexing accuracy – Of the document types identified, the percentage where the pages are correctly identified. This depends on the logic used to attribute a given page to a certain document, for example page-begin and page-end parameters. Vendors usually handle this stage with an algorithm such as bag of words. Blank pages and similar anomalies can reduce the accuracy at this stage.
- Data value accuracy – Of the fields identified, the percentage where the extracted data value matches the original. This depends mainly on whether the value of the field was read correctly, so the output of this stage is influenced by OCR accuracy.
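To make the stage-level definitions above concrete, here is a minimal Python sketch of how they might be measured against a human-verified sample. The function names, the choice of character-level edit distance for OCR accuracy, and the example labels are all assumptions made for illustration; they are not taken from any particular IDP product.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ocr_character_accuracy(truth: str, ocr_text: str) -> float:
    """Assumed metric: 1 - edit distance / length of the true text, floored at 0."""
    if not truth:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr_text) / len(truth))


def match_rate(expected: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of items where the system output equals the verified answer.

    Usable for document types, page assignments or field values alike.
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        return 0.0
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)


# Hypothetical sample: one OCR line with a single misread character ("O" for "0")
ocr_acc = ocr_character_accuracy("Loan Amount: 250000", "Loan Amount: 25O000")

# Hypothetical sample: four documents, one of them classified wrongly
doc_acc = match_rate(["W-2", "Note", "Appraisal", "Note"],
                     ["W-2", "Note", "Note", "Note"])
```

The same `match_rate` helper can be reused for indexing accuracy (one entry per page) and data value accuracy (one entry per field), as long as the comparison is restricted to the items the previous stage got right, which is how the definitions above are phrased.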
These levels of accuracy are not independent of one another. OCR accuracy, for one, can influence indexing and data extraction accuracy. After all, if the characters are not recognized correctly, the downstream logic (page-start detection, bag of words, and so on) may not work well.
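A rough way to see how errors compound across stages is to multiply the stage-level rates. The sketch below assumes, purely for illustration, that each rate is the chance that a stage succeeds given that the previous stage succeeded; the numbers are invented, and real systems will not follow this simplification exactly.

```python
from math import prod
from typing import Mapping


def end_to_end_estimate(stage_rates: Mapping[str, float]) -> float:
    """Multiply per-stage success rates into a rough end-to-end estimate.

    Assumption: each rate is conditional on the previous stage having succeeded.
    """
    for name, rate in stage_rates.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate for {name!r} must be between 0 and 1")
    return prod(stage_rates.values())


# Invented figures, not measurements from any product
example_rates = {
    "ocr": 0.98,
    "document_identification": 0.95,
    "indexing": 0.95,
    "data_value": 0.97,
}
estimate = end_to_end_estimate(example_rates)
print(f"Rough end-to-end estimate: {estimate:.1%}")  # about 85.8%
```

Even with every stage at 95% or better, the combined figure in this example drops below 86%, which is why a single quoted accuracy number says little unless the stage it refers to is known.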
As an overall measure, we can think of an overarching accuracy promised by the entire system.
Overall accuracy – Accuracy of straight-through processing – The effective accuracy of processing a document package correctly in its entirety. This has to be measured as the percentage of documents processed correctly, that is, with 100% correct indexing and 100% correct data extraction.
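The difference between this overall measure and a field-level figure can be sketched in a few lines of Python. The `PackageResult` record and the sample counts below are hypothetical; the only thing taken from the definition above is that a package counts as correct when indexing and data extraction are both 100% correct.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PackageResult:
    """Hypothetical review record for one processed document package."""
    pages_total: int
    pages_indexed_correctly: int
    fields_total: int
    fields_extracted_correctly: int

    def is_straight_through(self) -> bool:
        # Correct only if every page and every field is right
        return (self.pages_indexed_correctly == self.pages_total
                and self.fields_extracted_correctly == self.fields_total)


def straight_through_rate(results: Sequence[PackageResult]) -> float:
    """Share of packages processed with no indexing or extraction error."""
    if not results:
        return 0.0
    return sum(r.is_straight_through() for r in results) / len(results)


def field_level_accuracy(results: Sequence[PackageResult]) -> float:
    """Share of all fields, pooled across packages, extracted correctly."""
    total = sum(r.fields_total for r in results)
    if total == 0:
        return 0.0
    return sum(r.fields_extracted_correctly for r in results) / total


# Invented sample: four packages, two of them with a single error each
packages = [
    PackageResult(10, 10, 20, 20),
    PackageResult(10, 10, 20, 19),  # one wrong field value
    PackageResult(10, 9, 20, 20),   # one wrongly indexed page
    PackageResult(10, 10, 20, 20),
]
stp = straight_through_rate(packages)        # 2 of 4 packages
field_acc = field_level_accuracy(packages)   # 79 of 80 fields
```

In this invented sample the field-level accuracy is close to 99% while only half of the packages pass straight through, so the two figures should not be compared as if they measured the same thing.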
Planned vs Observed Accuracy – About confidence levels
Let’s now look at accuracy from another angle: planned vs observed.
When an IDP is being set up for a particular document or set of data fields, various parameters are configured. Some parameters help identify the document type and the page numbers correctly; others are phrases and logic used to identify a data field.
A provider can set up confidence levels based on these parameters. For example, if 8 parameters are used to identify a document, how confident can we be that the document has been identified correctly when all 8 match: 98% or 100%? This is ascertained through human intervention, comparing the machine output with human scrutiny to determine the confidence levels.
Most of the time, this planned figure is what vendors quote as accuracy. The observed accuracy, by contrast, has to be determined by periodic sampling.
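Periodic sampling of this kind can be sketched as follows. The sketch assumes that each sampled item carries the confidence level the system was set up with and a flag recording whether a human reviewer found the output correct; the record layout and the sample counts are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def observed_by_planned(reviews: Iterable[Tuple[float, bool]]) -> Dict[float, float]:
    """Group reviewed items by planned confidence and compute observed accuracy.

    Each review is a (planned_confidence, was_correct) pair from a human check.
    """
    counts = defaultdict(lambda: [0, 0])  # planned -> [correct, reviewed]
    for planned, was_correct in reviews:
        counts[planned][1] += 1
        if was_correct:
            counts[planned][0] += 1
    return {planned: correct / reviewed
            for planned, (correct, reviewed) in counts.items()}


# Invented sample: 50 items planned at 98%, 20 items planned at 90%
sample = ([(0.98, True)] * 47 + [(0.98, False)] * 3
          + [(0.90, True)] * 18 + [(0.90, False)] * 2)

observed = observed_by_planned(sample)
for planned, actual in sorted(observed.items()):
    gap = planned - actual
    print(f"planned {planned:.0%}  observed {actual:.0%}  gap {gap:+.0%}")
```

A persistent positive gap for a given confidence level suggests that the parameters behind it are less reliable than assumed and that the planned figure should be revisited.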
White gloved Accuracy
As mentioned at the beginning, various kinds of vendors offer solutions to varying effect. It is very common for BPO providers or system integrators to offer a human-assisted solution, in which the output of the system is reviewed and corrected by human associates. They might re-index the document and capture the data that the system could not extract correctly. This results in near-perfect accuracy, which can reach 99% or more.
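The arithmetic behind such figures can be sketched with a simple model. It assumes that reviewers see every output, fix a fixed share of the machine's errors, and never introduce new errors of their own; both the model and the numbers are illustrative assumptions.

```python
def accuracy_after_review(machine_accuracy: float, reviewer_fix_rate: float) -> float:
    """Accuracy after human review under a simplified model.

    Assumptions: every output is reviewed, reviewers fix `reviewer_fix_rate`
    of the machine's errors, and reviewers never break a correct output.
    """
    for value in (machine_accuracy, reviewer_fix_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rates must be between 0 and 1")
    machine_errors = 1.0 - machine_accuracy
    return machine_accuracy + machine_errors * reviewer_fix_rate


# Invented figures: 90% machine accuracy, reviewers catch 95% of the errors
final = accuracy_after_review(0.90, 0.95)
print(f"Accuracy after review: {final:.1%}")
```

Under these assumptions the quoted figure reflects the combined system plus human effort, so it says little about how much of the work the machine handled on its own.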
Machine learning and Accuracy
Modern IDPs like DocVu.AI loop in machine learning to improve the accuracy of the system even before humans get involved. The idea is to reduce, and ultimately eliminate, manual involvement in the processing of images and files. The focus then is on attaining a high straight-through processing rate. This accuracy improves as the learning in the system improves. The learning-improvement loop consists of identifying which new parameters can be employed to make corrections when a system error is identified. For example, if the system identifies a document wrongly, it can be trained through machine learning to look for additional parameters that strengthen the business rules engine, thereby improving accuracy the next time a similar situation is encountered.
It should be noted that machine learning accuracy again depends on the underlying OCR accuracy. Another very important factor is the initial learning the system has already had. DocVu.AI specializes in the mortgage domain, so the system you get is already well trained on the document formats and data intricacies of that domain. The advantage is that you start off with a higher accuracy than a generic IDP when you use it for mortgage-specific use cases.
We hope this demystifies how accuracy can be viewed in automation involving document processing. DocVu.AI achieves higher accuracies