OCR processing with deep learning: Apply to Vietnamese documents

VIET-TRUNG TRAN, ANH PHI NGUYEN, KHUYEN NGUYEN

OUTLINE
• OCR overview
• History
• Pipelining
• Deep learning for OCR
• Motivation
• Connectionist temporal classification (CTC) network
• LSTM + CTC for sequence recognition

WHAT IS OCR
• Optical character recognition (OCR), also called optical character reading,
is the mechanical or electronic conversion of images of typed, handwritten or
printed text into machine-encoded text

OCR TYPES
• Optical Character Recognition (OCR)
• Targets typewritten text, one character at a time
• Optical Word Recognition (OWR)
• Typewritten text, one word at a time
• Intelligent Character Recognition (ICR)
• Handwritten print script, one character at a time
• Intelligent Word Recognition (IWR)
• Handwritten, one word at a time

HISTORY OF OCR: TESSERACT OCR ENGINE
TIMELINE

PAGE LAYOUT ANALYSIS
Smith, Ray. "Hybrid page layout analysis via tab-stop
detection." Document Analysis and Recognition, 2009. ICDAR'09. 10th
International Conference on. IEEE, 2009.

IMAGE LEVEL PAGE LAYOUT ANALYSIS
• Using the morphological processing from Leptonica
• http://www.slideshare.net/versae/javier-de-larosacs9883-5912825

TESSERACT WORD RECOGNIZER
http://www.slideshare.net/temsolin/2-architecture-anddatastructures

FEATURES AND WORD CLASSIFIER
Classical character classification

CHAR SEGMENTATION, LANGUAGE MODEL AND
BEAM SEARCH

OCR CHALLENGES
1. Font specifics
Classical engines never overcame their restriction to a limited number of fonts
and page formats
2. Character bounding boxes
3. Unreliable feature extraction
4. Slow performance

RECENT IMPROVEMENTS
1. Multi-language support
2. Full layout analysis
3. Table detection
4. Equation detection
5. Better language models
6. Hand-written text

MOTIVATION
• Segmentation is difficult for cursive or unconstrained text
• R. Smith, “History of the Tesseract OCR engine: what worked and
what didn’t,” in DRR XX, San Francisco, USA, Feb. 2013.
• No single method proposed for OCR could achieve very low error rates
without relying on sophisticated post-processing techniques.

RESEARCH BREAKTHROUGH
A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J.
Schmidhuber, “A Novel Connectionist System for Unconstrained
Handwriting Recognition,” IEEE Trans. on Pattern Analysis and Machine
Intelligence, vol. 31, no. 5, pp. 855–868, May 2009.

MOTIVATION
• Real-world sequence learning tasks
• OCR (Optical character recognition)
• ASR (Automatic speech recognition)
• These tasks require
• prediction of sequences of labels from noisy, unsegmented input data
• Recurrent neural networks (RNN) can be used for sequence learning, but
they require
• pre-segmented training data
• post-processing to transform outputs into label sequences

CONNECTIONIST TEMPORAL CLASSIFICATION
(CTC)
• Graves, Alex, et al. "Connectionist temporal classification: labelling
unsegmented sequence data with recurrent neural
networks." Proceedings of the 23rd international conference on Machine
learning. ACM, 2006.
• What is CTC all about?
• A novel method for training RNNs to label
unsegmented sequences directly

THE SPEECH RECOGNITION PROBLEM

DYNAMIC TIME WARPING
• The length of the network output sequence y may differ from (and is often
longer than) that of the label sequence l, so inferring l from y is
effectively a dynamic time warping problem.
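
To make the alignment idea concrete, here is a minimal sketch of generic dynamic time warping between a long frame-level sequence and a shorter label sequence. It is only an illustration of the warping problem, not the alignment procedure CTC itself uses; the function name `dtw_distance` and the toy integer sequences are assumptions made for this example.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Either sequence may "stall" while the other advances,
            # or both advance together.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


# Toy example (hypothetical data): six frames, three labels.
frames = [1, 1, 2, 2, 2, 3]
labels = [1, 2, 3]
print(dtw_distance(frames, labels))
```

Several consecutive frames can map to the same label, which is why the two sequences align at zero cost here even though their lengths differ.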

CONNECTIONIST TEMPORAL CLASSIFICATION
• To transform the network outputs into a conditional probability
distribution over label sequences
• A CTC network has a softmax output layer with one more unit than there
are labels in L
• activations of the first |L| units are interpreted as the probabilities of observing the
corresponding labels at particular times
• activation of the extra unit is the probability of observing a ‘blank’, or no label
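
The output layer described above can be sketched in a few lines of NumPy. The example below assumes a two-label alphabet, places the blank at index 0 (an arbitrary choice for this sketch), and uses random numbers in place of real network activations. `collapse`, `best_path_decode`, and `label_probability` are illustrative names, and the brute-force sum over paths is only practical for tiny inputs; real implementations use the forward-backward recursion instead.

```python
import itertools
import numpy as np

BLANK = 0  # assumption: unit 0 is the blank; labels occupy units 1..|L|


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def collapse(path):
    """Map a frame-level path to a labelling: merge repeats, drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def best_path_decode(probs):
    """Greedy decoding: take the most active unit per frame, then collapse."""
    return collapse(probs.argmax(axis=1).tolist())


def label_probability(probs, labels):
    """Brute-force p(l | x): sum the probabilities of every path that
    collapses to `labels`. Exponential in T; for illustration only."""
    T, K = probs.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path) == tuple(labels):
            total += float(np.prod(probs[np.arange(T), list(path)]))
    return total


# Hypothetical activations: T = 3 frames, |L| = 2 labels plus one blank unit.
rng = np.random.default_rng(0)
probs = softmax(rng.normal(size=(3, 3)))
print(best_path_decode(probs), label_probability(probs, (1, 2)))
```

Because every path collapses to exactly one labelling, the probabilities of all labellings of length at most T sum to one, which is what makes the result a proper conditional distribution over label sequences.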

PREFIX SEARCH DECODING ON THE LABEL
ALPHABET X,Y

LONG SHORT-TERM MEMORY (LSTM)
• A type of recurrent neural network (RNN)
• The RNN vanishing gradient problem
• the influence of a given input on the hidden layer, and therefore on the network
output, either decays or blows up exponentially as it cycles around the network’s
recurrent connections
• LSTM is designed to address the vanishing gradient problem
• An LSTM hidden layer consists of recurrently connected subnets, called
memory blocks
• Each block contains a set of internal units, or cells, whose activation is
controlled by three multiplicative gates: the input gate, forget gate and
output gate
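
As a rough illustration of how the three gates control the cell, here is a single time step of a simplified LSTM cell in NumPy. It assumes one cell per block and no peephole connections, so it may differ from the exact variant used in the cited work; the weight layout (the four gate blocks stacked in `W`) and the name `lstm_step` are choices made for this sketch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of a simplified LSTM cell (no peepholes).

    x: input vector (D,); h_prev, c_prev: previous hidden and cell state (H,)
    W: weights of shape (4*H, D + H); b: bias of shape (4*H,)
    Rows of W are stacked as [input gate, forget gate, output gate, candidate].
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])            # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old state to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate cell content
    c = f * c_prev + i * g         # additive update of the cell state
    h = o * np.tanh(c)
    return h, c


# Hypothetical sizes: 4 input features, 3 hidden units.
D, H = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)
```

The additive update `c = f * c_prev + i * g` is the key design point: when the forget gate stays close to 1 and the input gate close to 0, the cell state is carried forward almost unchanged, so information can survive across many time steps.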

DEMO TIME: OCR FOR VIETNAMESE DOCUMENTS
Thank you!

REFERENCES - CREDITS
• https://github.com/yiwangbaidu/notes/blob/master/CTC/CTC.pdf
• http://colah.github.io/posts/2015-08-Understanding-LSTMs/
• Ray Smith. Everything you always wanted to know about
Tesseract. Tesseract tutorial @ DAS 2014
