Written by Lilong Qian, AI-Research Scientist
The history of OCR
Figure 1: An OCR service that converts scanned images to text
Optical Character Recognition (OCR) is the technology for extracting textual and semantic data from image files. In 1929, Tauschek first proposed the concept of OCR: using a machine to read characters and numerals. However, it remained a dream until the arrival of the computer age.
Researchers began studying character recognition (mainly text recognition, and especially the recognition of numerals) in the early 1960s. In that period, the U.S. postal service was already using OCR technology in mail sorting to recognise zip codes.
However, early OCR technology had its limitations: it could only recognise high-quality text. The text had to be perfectly straight, clear, and printed in the single font that the OCR device was programmed to recognise.
To recognise a character, early OCR technology compared its shape with the character shapes previously stored in a database and selected the best match.
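The shape-matching idea described above can be sketched in a few lines of Python. This is only an illustration, not the method of any historical device: the 3x3 templates, the pixel-mismatch score, and the names `TEMPLATES`, `mismatch`, and `recognise` are all assumptions made for the example.

```python
# Hypothetical 3x3 glyph templates (assumed for illustration); 1 = ink, 0 = background.
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def mismatch(glyph, template):
    """Count the pixels where the glyph and the template disagree."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognise(glyph, templates=TEMPLATES):
    """Return the label of the stored template with the fewest mismatches."""
    return min(templates, key=lambda label: mismatch(glyph, templates[label]))


# A "T" whose stem has lost its bottom pixel still matches the "T" template best.
noisy_t = ((1, 1, 1),
           (0, 1, 0),
           (0, 0, 0))
print(recognise(noisy_t))  # T
```

Because the score compares pixels position by position, a shifted, rotated, or differently styled character quickly stops matching, which is consistent with the single-font, perfectly straight text requirement mentioned earlier.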
With the development of computer processing in the 1960s and 1970s, Omni-font OCR readers became available. Although fonts vary in design style, this kind of scanner could recognise the general form and shape of a character instead of looking for an exact match.
The OCR devices we know today were launched by Kurzweil Computer Products, Inc., which was founded in 1974. However, it was not until the 21st century that OCR came into its own. Combined with internet technology, a wide range of OCR applications became a reality. In addition, recognition algorithms improved and optical scanners became able to produce output at higher resolutions. These innovations enabled exciting new uses such as vehicle license plate recognition, identification card recognition, receipt recognition, and other customised OCR services. OCR has revolutionised the way we do business and live our daily lives, and it is now widely used around the world.
Creativity comparison: machine-learning-based OCR vs traditional OCR
Figure 2: Traditional OCR vs machine-learning-based OCR
Traditional OCR techniques have achieved great success over the last 20 years, mainly on scanned documents. The traditional OCR procedure follows three main steps:
Image pre-processing
Text line extraction
Text line recognition
The image pre-processing step reduces the complexity of OCR processing and includes geometric rectification, blur correction, and illumination rectification.
After image pre-processing, text lines are extracted via image binarisation, page layout analysis, and line segmentation. In the final step, character segmentation and character recognition convert the image into text.
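As a rough illustration of binarisation and line segmentation, the sketch below thresholds a tiny greyscale image and then finds text lines as runs of rows that contain ink (a horizontal projection). The fixed threshold of 128, the assumption of dark text on a light background, and the function names are illustrative choices, not details from any particular OCR system.

```python
def binarise(image, threshold=128):
    """Map a greyscale image (rows of 0-255 ints) to 1 = ink, 0 = background.

    Assumes dark text on a light background and a fixed threshold.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


def segment_lines(binary):
    """Return (start, end) row ranges, end exclusive, one per text line.

    A row is part of a text line if it contains at least one ink pixel.
    """
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # the current text line ends
            start = None
    if start is not None:                  # the last line touches the bottom edge
        lines.append((start, len(binary)))
    return lines


page = [
    [255, 255, 255, 255],
    [255,  20,  30, 255],
    [255,  25, 255, 255],
    [255, 255, 255, 255],
    [ 10, 255,  40, 255],
    [255, 255, 255, 255],
]
print(segment_lines(binarise(page)))  # [(1, 3), (4, 5)]
```

A sketch like this only works when blank rows separate the lines and a single threshold separates text from background, which is why it suits clean scanned pages rather than photos.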
Traditional OCR can only handle relatively straightforward cases, for example:
Simple page layout
Clearly separable foreground and background information
Easy segmentation of each text character
Figure 3: Photo taken by a camera (left) vs scanned document (right)
However, demand for text recognition in natural scenes is growing. Natural scenes are more complicated than the traditional setting: illumination and page layout vary more, and the images contain more noise.
With the development of deep learning, new algorithms can overcome the limitations of traditional OCR. A complete machine-learning-based OCR process can now be simplified into two steps: "text detection" and "text recognition". Unlike traditional OCR, a machine-learning-based OCR network automatically learns useful features for the detection and recognition models from a large amount of data, freeing engineers from manual feature engineering. It also generalises better than traditional OCR in complicated circumstances, such as fonts of different shapes, colours, and sizes, and images with varying quality, backgrounds, illumination, and geometric distortion, because the network can extract features that are invariant to these changes. In addition, its processing is much faster than traditional OCR thanks to large-scale GPU parallel computing.
Machine-learning-based OCR has more benefits than the traditional approach because data availability and network design lower the difficulty of feature engineering.
Difficulties in machine-learning-based OCR
It is widely believed that machine-learning-based OCR outperforms the traditional approach. However, deploying a complete pipeline for an OCR system is still not an easy task. The difficulties mainly lie in:
Availability of a large amount of data
Efficient network design for high accuracy
Low computation cost
Developing a traditional OCR model may need only tens or hundreds of samples. Training a network for machine-learning-based OCR that meets accuracy requirements, however, usually takes a lot of time and resources just to obtain the basic dataset. Another pain point of machine-learning-based OCR is annotation precision. If a certain kind of mistake occurs too frequently, the model will learn it during training. For example, if an annotator labels all lower-case letters as upper-case letters, the model will also predict upper-case letters with high probability. In other words, noisy data affects the accuracy of the model, and some noise is unavoidable when people annotate data.
Designing efficient networks is also full of challenges. A well-structured network achieves better performance and is easier to train. However, the possible network variants are as countless as the stars in the sky. Because we cannot try them one by one, the only way to find the most accurate OCR model is to draw on established conclusions and experimental experience. In practice, the network also needs additional design work to make it "small" and "fast", because a large network that achieves high accuracy may be slow and have a huge file size. It should also be able to run in parallel to exploit the GPU's capability; otherwise, it may be too slow for real-time applications.
In sum, bringing research technology into real-world use is essential, so we need to make a consistent effort to keep the system extensible and user-friendly.
Creativity in the ADVANCE OCR implementation
Figure 4: OCR implementation
In developing ADVANCE.AI's OCR technology, we have also faced difficulties. Fortunately, we reduced them with the following ideas:
Automatic data cleaning
Efficient network design
Pipeline integration and maintenance
As mentioned above, the most common limitation of machine-learning-based OCR is the annotation precision of the collected data, on which a high-accuracy model depends. At ADVANCE.AI, we use deep learning to improve the quality of data annotation in the following steps:
First, train baseline models and estimate a confidence score for whether each annotation is correct. Second, to guarantee data quality, select only the part of the data that needs to be re-annotated, rather than having every sample annotated several times independently. Another way to improve annotation productivity is to use the model to find hard examples for training, instead of adding data at random.
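One way to picture the confidence-based selection step is the minimal sketch below. It assumes a baseline model has already produced a probability for each candidate label, and it treats the probability assigned to the human label as the confidence score. The sample data, the 0.5 threshold, and the function names are hypothetical; the source does not describe how ADVANCE.AI computes its scores.

```python
def annotation_confidence(predicted_probs, label):
    """Probability that the baseline model assigns to the human-provided label."""
    return predicted_probs.get(label, 0.0)


def select_for_reannotation(samples, threshold=0.5):
    """Return ids of samples whose label the model finds unlikely, least confident first."""
    scored = [
        (annotation_confidence(sample["probs"], sample["label"]), sample["id"])
        for sample in samples
    ]
    return [sample_id for score, sample_id in sorted(scored) if score < threshold]


# Hypothetical baseline-model outputs for three annotated character crops.
samples = [
    {"id": "a", "label": "A", "probs": {"A": 0.95, "a": 0.05}},
    {"id": "b", "label": "B", "probs": {"b": 0.90, "B": 0.10}},  # likely a case mistake
    {"id": "c", "label": "0", "probs": {"O": 0.60, "0": 0.40}},  # ambiguous glyph
]
print(select_for_reannotation(samples))  # ['b', 'c']
```

A low score can mean either a wrong label or a genuinely hard example, so the same ranking could plausibly feed both the re-annotation queue and the hard-example selection mentioned above.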
We have also put a lot of effort into improving the ADVANCE OCR network. First, we keep track of the latest research in related fields and consider what can be adopted to improve results. For example, we can remove the LSTM module in text recognition by replacing it with a module from the NLP field, which is much faster and performs better. Second, we adjust the network to fit production. For example, we can select and compress an appropriate network to satisfy the accuracy and speed requirements while reducing cost.
A healthy and extensible OCR system is the key to providing good service for clients. With a reliable system, we can test, debug, and evolve the technology as quickly as possible, and free up a lot of human resources from repetitive tasks.
Figure 5: OCR implementation
Future directions in OCR applications
It has been quite the journey for OCR from the musical book-reading device in 1914 to today's myriad applications. We have made great progress in developing ADVANCE OCR, and we continually strive to improve the quality and the functionality of our services.
Data is precious for A.I.-related businesses. Improving the effectiveness and efficiency of data collection is undeniably essential. For efficient data collection, we can continue researching at least two methods: "synthetic data" and "auto-annotation".
Unlike traditional synthesis methods, a GAN (generative adversarial network) is a good choice for generating data that is close enough to real data, saving manual work on numerous annotation tasks. Auto-annotation is a method that combines model training with data annotation.
However, current model development faces a predicament: data annotation is separated from training, which creates several obstacles. For example, we cannot control whether the annotated data is what the model needs, and we do not know exactly how much data is needed. Currently, this is all based on the engineers' experience, which is, of course, not always as correct as expected.
Besides the data, the model is critical, and model research is also very active. In this field, two directions are worth tracking in the future: "AutoML" and "Model Compression".
AutoML automatically searches for better network structures and speeds up research and product iterations. Model Compression helps find more lightweight models that save resources and enable more services on mobile devices.
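To make the idea of model compression concrete, here is a minimal sketch of one common technique, symmetric 8-bit weight quantisation. The source does not say which compression method is intended, so the choice of quantisation, the example weights, and the function names are assumptions for illustration only.

```python
def quantise(weights):
    """Symmetric 8-bit quantisation: floats -> (ints in [-127, 127], scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantise(q_weights, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in q_weights]


# Hypothetical layer weights, assumed for the example.
weights = [0.5, -1.27, 0.003, 1.0]
q_weights, scale = quantise(weights)
restored = dequantise(q_weights, scale)

# Each restored weight differs from the original by at most half a quantisation step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, max_error <= scale / 2 + 1e-12)
```

Storing one small integer per weight instead of a 32-bit float cuts the storage for those weights to roughly a quarter, at the cost of the rounding error measured above.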
Many aspects still need improvement. Although there may be some goals we cannot achieve, we still hope to develop more ideas to improve our products. In the future, we will work with new joiners to get more out of our ADVANCE OCR system, contribute more to research in society, speed up the evolution of the technology, and strive for a world of dignity, sustainability, and prosperity.
Omni-font: any font that maintains fairly standard character shapes