Imagine you are one of the most intelligent beings on Earth. People from all over the world come to you with their questions and you can answer almost all of them. One day someone comes to you holding a book. ‘Can you read it to me?’ he asks. You open the book and start to mumble … ‘the honse stood on a sliqht rise jusf on the cdge of the vinage.’
For the computer and mobile phone that you use every day, this is reality. With access to the Internet, they can help you solve most of your problems. However, it is surprisingly hard for a computer to read text from a picture.
Identifying characters, something the human eye can do from a young age, is called Optical Character Recognition (OCR). In some cases, the computer performs reasonably well at OCR, for example, if the documents have been carefully prepared before digitization. However, this can be a costly and time-consuming procedure. That’s why we decided to do it smarter. We think the rule of thumb should be that if a document is readable to the human eye, our OCR solution should be able to process it.
One look at the sheer volume of content that is processed at the enterprise level makes it immediately clear that a proper OCR solution has many benefits. We see many documents that have been scanned at a low resolution. Sometimes the original documents cannot be traced, and in general, rescanning takes a lot of effort.
Decent open-source OCR solutions are already available. However, the quality of the text they produce is still too low for enterprise standards. Therefore, we introduce a post-processing step. The OCR engine handles the initial OCR, and our custom Post-OCR algorithm learns from your content and makes the appropriate corrections. Our solution can be trained to be more effective on words specific to any business.
A typical OCR mistake is the confusion of similar-looking letters. For example, a ‘j’ is read instead of an ‘i’. However, with some knowledge of the English language, it is easy to infer that ‘fjsh’ is not correct and should probably be read as ‘fish’. In our Post-OCR step, we leverage this knowledge to the maximum.
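To make the idea concrete, here is a minimal Python sketch of dictionary-guided correction. It is not the actual Post-OCR algorithm: the confusion table `CONFUSIONS`, the tiny `VOCABULARY` and the function names are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical confusion table: for each character the OCR engine may output,
# the characters that might really have been on the page. Illustrative only.
CONFUSIONS = {
    "j": ["i"],
    "q": ["g"],
    "f": ["t"],
    "c": ["e"],
    "n": ["u"],
}

# Tiny stand-in for a real dictionary of known words.
VOCABULARY = {"the", "fish", "house", "slight", "just", "edge"}


def candidates(word):
    """Yield every spelling reachable by swapping confusable characters."""
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def correct(word):
    """Return a known spelling if one can be found, otherwise the word unchanged."""
    if word in VOCABULARY:
        return word
    for candidate in candidates(word):
        if candidate in VOCABULARY:
            return candidate
    return word


print(correct("fjsh"))    # fish
print(correct("sliqht"))  # slight
```

Enumerating every combination grows quickly with word length, so a production system would need a smarter search and a way to rank competing candidates. The sketch only shows the principle of combining known confusions with a vocabulary.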
Since we know what type of mistakes an OCR engine is likely to make, we can optimize our algorithm to focus on those mistakes. The words ‘Onior’, ‘amor’, ‘aural’ and ‘pillar’ might look very different to us, but they look quite similar to an OCR solution. All four start with a round character, followed by what looks like three bars, then another round character, followed by one more bar.
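The round-character-and-bars observation can be sketched in a few lines of Python. The `SHAPES` mapping below is a hypothetical simplification made up for this example (a real engine works on pixels, not on a lookup table), but it shows why the four words collapse into the same pattern.

```python
# Hypothetical shape classes: "O" stands for a round stroke, "|" for a
# vertical bar. Only the characters needed for the example are listed.
SHAPES = {
    "O": "O", "o": "O", "a": "O", "p": "O",
    "i": "|", "l": "|", "r": "|",
    "n": "||", "u": "||",
    "m": "|||",
}


def shape_signature(word):
    """Describe a word as a sequence of round strokes and bars."""
    # Characters without a known shape class are marked with "?".
    return "".join(SHAPES.get(ch, "?") for ch in word)


for word in ["Onior", "amor", "aural", "pillar"]:
    print(word, shape_signature(word))
# Each of the four words prints the signature O|||O|
```

Words that share a signature are natural candidates for being confused with each other, so a correction step can give such pairs extra attention.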
Taking into account as much statistical information as we can collect, we use the similarity of characters and words to correct any words that seem to have been processed incorrectly during OCR. We also take neighboring words into account; after all, “thanks for all the fish” is more plausible than “thanks far all the fish”.
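One common way to use neighboring words is a bigram model, which estimates how likely a word is given the word before it. The sketch below assumes such a model with add-one smoothing; the training text, the function names and the choice of model are illustrative assumptions, not a description of the Post-OCR internals.

```python
from collections import Counter

# Toy training text, a stand-in for the content a real system would learn from.
# Sentences are simply concatenated to keep the example short.
TRAINING_TEXT = (
    "so long and thanks for all the fish "
    "thanks for all your help "
    "we walked far from the village "
    "thanks for the book"
)

tokens = TRAINING_TEXT.split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigrams)


def bigram_prob(prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def context_score(prev, word, nxt):
    """Score a word by how well it fits between its two neighbors."""
    return bigram_prob(prev, word) * bigram_prob(word, nxt)


def pick(prev, options, nxt):
    """Choose the option that best fits the surrounding words."""
    return max(options, key=lambda w: context_score(prev, w, nxt))


print(pick("thanks", ["far", "for"], "all"))  # for
```

Both ‘far’ and ‘for’ are valid English words, so a dictionary alone cannot decide between them; only the context tips the balance.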
Doculayer is about smart content management. We offer innovative solutions, and Post-OCR is part of a larger chain of Machine Learning solutions that are available. Building this solution ourselves ensures top-level quality that is highly tunable to our customers’ needs. Furthermore, it provides easy integration with other smart components in Doculayer. And all that for less than some of the currently available solutions!
We will now show Post-OCR in action. Suppose we are processing a document that has been scanned at a relatively low resolution. It is still human-readable, and an OCR engine can process it, but we are still left with obstructive mistakes.
Illustration 1: Example of a scanned document
When we apply our algorithm to a historical newspaper article, we see some typical OCR mistakes, as shown. Post-OCR corrects these mistakes easily. Characters that are missing or were misread by the OCR engine are no problem. Because ‘Reginald’ was observed in the training data, Post-OCR was even able to correct this name. More training on similar documents will make Post-OCR even more robust as its statistical knowledge of the field increases.
Illustration 2: Post-OCR at work. Left: the result of OCR. Right: the corrections made by Post-OCR
On top of the basic Post-OCR package, more content can easily be plugged in: for example, language support for any alphabet-based language, a bootstrap so that the solution is already tuned at the time of deployment, or a dictionary better tuned to your needs. Because our Post-OCR solution is built fully in-house, we are in full control.
The development of our Post-OCR algorithm posed several challenges. The problem is linguistic in nature, but for an efficient implementation we also needed to keep the computational complexity reasonable. As the Machine Learning team at Onior, this is where we excel. We combined all our skills to come up with new, innovative ideas and implement them as efficient solutions.
Using the Post-OCR tools we created, and with the right amount of training, we managed to improve the capabilities of traditional OCR systems. In the end, we taught the computer a task that seems simple to us but is hard for it to do. Now it will be able to read without mumbling:
The house stood on a slight rise just on the edge of the village...