Written by Jiangbo Yu, AI-R&D Lead
Optical Character Recognition (OCR) is a technology that automatically recognises text in images or videos and converts it into machine-readable, processable, structured character information. It acts as the "computer's eye" and forms the basis for machine interaction with the real world.
OCR technology is known as the most "grounded" artificial intelligence technology. It is widely used in the financial sector, government service, and various other industries such as logistics, healthcare, and education. In the era of digitalisation, OCR plays a vital role in helping businesses with their digital transformation. It improves information collection efficiency, reduces labour costs, and significantly accelerates industrial transformation.
Scientists first proposed the concept of OCR in 1929, and research began worldwide in the 1960s and 1970s. In the early stages of OCR research, scientists mainly focused on recognising digits. With the development of the Internet and the emergence of deep learning algorithms, OCR has advanced in leaps and bounds, because deep learning models can be trained on the massive amounts of data that have accumulated.
The applications of OCR have significantly broadened in modern times. In this blog, we will examine the difference between "Deep OCR" (deep-learning-based OCR) and "Traditional OCR" (the early-phase OCR technology that does not use deep learning algorithms).
Prior to AlexNet's win at the ImageNet challenge in 2012, traditional Computer Vision (CV) technology dominated OCR research. At that time, a standard OCR processing flow consisted of image pre-processing[1], text (character) detection (a "bottom-up" process)[2], character segmentation, character recognition[3], and post-processing of the recognition results.
Traditional OCR only performs well on good-quality printed documents due to the limitations of traditional CV algorithms. It lacks versatility and requires a lot of manual fine-tuning to adapt to different business scenarios. Therefore, the text recognition performance of traditional OCR in complex scenarios (e.g. low resolution, blurred images, image degradation) is less than ideal.
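To make the "bottom-up" detection and segmentation steps more concrete, here is a minimal sketch of connected-component (connected-domain) analysis on a tiny binarised image. It is an illustration under our own assumptions, not a production algorithm: the function name `connected_components`, the 4-connectivity rule, and the toy image are all hypothetical.

```python
from collections import deque


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 4-connected
    foreground regions in a binary image given as a list of rows."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    # Sort left to right so boxes follow reading order on one text line.
    return sorted(boxes, key=lambda box: box[1])


# Toy image: 1 is ink, 0 is background. It contains two separate strokes.
image = [
    [0, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
]
boxes = connected_components(image)
print(boxes)
```

Each box would then be passed to a character recogniser. The sketch also hints at why this approach is fragile: blur or noise that joins two strokes, or breaks one stroke apart, changes the components and therefore the segmentation.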
Deep learning algorithms have greatly advanced research on image classification and object recognition in the 21st century.
In 2012, scientists introduced deep learning algorithms to OCR research, using Convolutional Neural Networks (CNNs) to replace traditional manual feature design. Deep OCR methods fall into two types: the independent two-stage method and the end-to-end text recognition method.
The independent two-stage method models text detection and text recognition separately. The text detection module is mainly responsible for detecting text regions and their orientation. Commonly used text detection algorithms include general object detection algorithms (e.g. SSD), segmentation-based algorithms (e.g. PixelLink), and text-specific detection algorithms (e.g. CTPN). Different algorithms have different characteristics. For example, segmentation-based algorithms are not limited by text size and shape, but they tend to merge text lines that are close to each other.
The text recognition module performs text recognition on the detected text boxes. The primary method is to extract features from the text image with a CNN and feed them to an LSTM with CTC (Connectionist Temporal Classification), or to an attention mechanism, to recognise text of variable length.
In the LSTM+CTC approach, the LSTM captures the context dependency of the input sequence, and CTC resolves the alignment between image features and text characters. However, because CTC decoding is ambiguous (many frame-level label sequences map to the same text), a large amount of training data is required to obtain the desired accuracy.
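The alignment idea behind CTC can be illustrated with greedy (best-path) decoding: take the most likely label per frame, collapse consecutive repeats, then drop the blank label. The sketch below uses made-up probabilities, and the names `ctc_greedy_decode`, `BLANK`, and the four-symbol alphabet are assumptions for illustration only.

```python
from itertools import groupby

BLANK = "-"  # assumed symbol for the CTC blank label


def ctc_greedy_decode(frame_probs, alphabet):
    """Best-path CTC decoding for one sequence of per-frame probabilities."""
    # Pick the most likely label at each time step.
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    # Collapse consecutive repeats, then remove blanks.
    collapsed = [label for label, _ in groupby(best_path)]
    return "".join(label for label in collapsed if label != BLANK)


alphabet = [BLANK, "c", "a", "t"]
# Hypothetical per-frame probabilities over the alphabet (six frames).
frame_probs = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.60, 0.20, 0.10],  # c
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.20, 0.60],  # t
    [0.10, 0.10, 0.10, 0.70],  # t
]
text = ctc_greedy_decode(frame_probs, alphabet)
print(text)
```

Here the frame-level path "cc-att" collapses to "cat", and so would "c-a-t-" or "ccaatt". Many different paths yield the same text, which is the decoding ambiguity described above.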
Attention-based text recognition is naturally interpretable: the model learns weights that locate the feature vectors corresponding to each character, which effectively improves the feature representation capability of the OCR model.
The end-to-end text recognition method integrates text detection and recognition into one network for training, which lets the two tasks share weights and optimises the model as a whole, thereby avoiding the information loss of the independent two-stage method.
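As a rough illustration of how learned weights "locate" feature vectors, the sketch below computes attention weights for one decoding step and the resulting context vector. Dot-product scoring is our assumption (the scoring function is not specified here), and `attention_context`, `query`, and the toy features are hypothetical.

```python
import math


def attention_context(query, features):
    """Return (weights, context) for one decoding step.

    query    -- decoder state for the character being predicted
    features -- one feature vector per horizontal position of the text image
    """
    # Dot-product score between the query and each feature vector.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    # Softmax over positions (subtract the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    # The context vector is the weighted sum of the feature vectors.
    dim = len(features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return weights, context


features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three image positions
query = [4.0, 0.0]                               # decoder state for this step
weights, context = attention_context(query, features)
print([round(w, 3) for w in weights])
```

The position with the largest weight is the one the model is "looking at" for the current character. Inspecting these weights is what makes the approach interpretable.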
OCR challenges
The emergence of deep learning algorithms has significantly advanced OCR technology. However, the recognition ability of OCR still cannot match that of humans, particularly in complex usage scenarios.
Everyday scenarios are complex and variable. Factors that can affect the recognition ability of OCR include complex backgrounds, low resolution, character distortion, multilingual mixing, image degradation, and incomplete or deformed characters, among others. The growing number of OCR-embedded applications also places higher demands on OCR performance than before. For example, cloud-based OCR requires low latency and high concurrency, and mobile-based OCR requires device compatibility and operational efficiency.
Although Deep OCR can recognise text more accurately, business scenarios often also require the text in a picture to be structured (e.g. the fields on a card or the digits on a form), which calls for combining OCR with document layout analysis. Fortunately, after 2017, Natural Language Processing (NLP) technology was introduced into OCR. The combination of OCR and NLP gives OCR technology the ability to truly understand text content, and the association of semantic information can improve end-to-end OCR solutions.
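To show what "structuring" recognised text means in the simplest case, here is a small rule-based sketch that turns raw OCR lines into named fields. It is a hand-written baseline rather than an NLP model, and the card layout, the field names, and the `structure_lines` helper are all hypothetical.

```python
import re

# Hypothetical field patterns for an imaginary ID card layout.
FIELD_PATTERNS = {
    "name": re.compile(r"^Name\s*:\s*(.+)$", re.IGNORECASE),
    "id_number": re.compile(r"^ID\s*(?:No\.?|Number)?\s*:\s*([0-9 ]+)$", re.IGNORECASE),
    "date_of_birth": re.compile(
        r"^(?:DOB|Date of Birth)\s*:\s*(\d{2}-\d{2}-\d{4})$", re.IGNORECASE
    ),
}


def structure_lines(lines):
    """Map raw OCR text lines to a dict of named fields."""
    record = {}
    for line in lines:
        text = line.strip()
        for field, pattern in FIELD_PATTERNS.items():
            match = pattern.match(text)
            # Keep the first match for each field and ignore later duplicates.
            if match and field not in record:
                record[field] = match.group(1).strip()
    return record


# Made-up recogniser output for illustration.
ocr_lines = [
    "REPUBLIC OF EXAMPLE",
    "Name: Jane Doe",
    "ID No: 1234 5678 9012",
    "DOB: 01-02-1990",
]
record = structure_lines(ocr_lines)
print(record)
```

Rules like these must be rewritten for every new document type, which is the limitation that layout analysis and semantic understanding aim to remove.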
ADVANCE OCR, developed independently by ADVANCE.AI, is widely used in ID document recognition and business form recognition. It also provides mobile-OCR quality inspection services.
ADVANCE OCR has earned a solid reputation, from Southeast Asia to the global market, for being quick, highly accurate, and reliable. However, the research and development of ADVANCE OCR has not been an easy journey. As we know, ID documents in Southeast Asia vary greatly and are usually printed in low quality. The majority of local people are also still using low-end phones. These conditions place high demands on the scalability of our OCR research.
ADVANCE.AI is committed to providing top-quality OCR service. The breakthroughs of ADVANCE OCR include:
Streamlined OCR development. ADVANCE.AI has developed an in-house auto OCR system that, after data collection, brings together data annotation, model development, and complex data mining into a complete ecosystem to accelerate model iteration.
Automatic layout analysis. This replaces traditional manually designed rules to support various cards, reduces the reliance on algorithm engineers, and provides greater scalability.
OCR quality inspection SDK. The SDK completes the OCR process by covering every step from user photo capture to cloud recognition, enabling more flexible control of the capture quality and meeting users' customisation needs while improving the quality of OCR services.
As a part of the "computer's eye", OCR will be used in more fields as the technology develops. When OCR is combined with NLP technology, the machine will be able to "understand" the text content and structure the text more accurately.
In the future, OCR services will be delivered through more diverse carriers, such as smartphones, smart electronics, and cloud services, which will reduce the cost of use.
At present, Deep OCR has been widely used in traditional fields such as card recognition, receipt recognition, automobile-related recognition (e.g., driver's license, license plate) and business form recognition. In emerging industries like online advertising, Deep OCR has also begun to serve image content extraction, advertisement review, and user understanding analysis.
There is still room for improvement in end-to-end OCR technology. When computers can "understand" real-world text content better than they do today, they will gradually take over tedious manual data entry tasks.
[1] Note: The pre-processing of OCR includes image correction, geometric transformation (perspective, distortion, rotation), deblurring and light correction.
[2] Note: Traditional OCR text detection follows two approaches, connected-component (connected-domain) analysis and sliding windows, both based on manually designed features.
[3] Note: Character recognition technology of OCR includes image classification and template matching.