IMPROVING CNN-RNN HYBRID NETWORKS
FOR HANDWRITING RECOGNITION
Kartik Dutta, Praveen Krishnan, Minesh Mathew
and C.V. Jawahar
CVIT, IIIT Hyderabad, India
Problem
(Sample handwritten text: "conference in london will it end with")
Word Recognition
Line Recognition
Prior works
• Using BLSTMs
 Recognition formulated as a
sequence-to-sequence problem.
 Bidirectional LSTM networks
with a CTC layer for recognition
[Bluche ICDAR'15, Sueiras
Neurocomputing'18].
• Shi et al. (2016) proposed a
state-of-the-art hybrid CNN-RNN
architecture for scene text recognition.
Prior works
• Variations of BLSTMs
 MDLSTM
[Voigtlaender et al.,
ICFHR'16]
 SepMDLSTM
[Chen et al., ICDAR'17]
• Puigcerver et al.,
ICDAR 2017, analyze
the effectiveness of
BLSTMs vs. MDLSTMs.
Prior works
• Wigington et al., ICDAR'17 introduce new pre-processing
& augmentation strategies:
Profile Normalization
Elastic Distortion
Our prior work (DAS'18)
CNN-RNN hybrid network
The feature sequence from the last CNN layer is fed to the RNN
STN
• Corrects geometric distortions in the input
• End-to-end trainable
• Components
 Localization Network
 Grid Generator
 Sampler
Jaderberg et al., NIPS, 2015
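The grid generator and sampler above can be sketched in NumPy. This is an illustrative sketch only, not the paper's implementation: the actual STN is trained end-to-end inside the network, Jaderberg et al. use a differentiable bilinear sampler, and the nearest-neighbour lookup here is a simplification.

```python
import numpy as np

def affine_grid(theta, h, w):
    """Grid generator: map each output pixel, with coordinates
    normalized to [-1, 1], through a 2x3 affine matrix theta to get
    the location in the input where it should be sampled."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (h, w, 3)
    return coords @ theta.T                                 # (h, w, 2)

def sample(img, grid):
    """Sampler: read the input image at the grid locations
    (nearest-neighbour here; the real STN samples bilinearly)."""
    h, w = img.shape
    xs = np.clip(np.rint((grid[..., 0] + 1) * (w - 1) / 2), 0, w - 1).astype(int)
    ys = np.clip(np.rint((grid[..., 1] + 1) * (h - 1) / 2), 0, h - 1).astype(int)
    return img[ys, xs]

# The identity transform reproduces the input; in the STN, a
# localization network predicts theta from the image instead.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
```

Because every step is a smooth function of theta (with bilinear sampling), gradients flow back to the localization network, which is what makes the module end-to-end trainable.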
Contributions
• Pre-training, Data Augmentation & Normalization
 Word/Line Normalization
 Multi-Scale
 Elastic Distortion
 Synthetic Data
Pre-processing
• We use the algorithm of
Vinciarelli et al., PR, 2001
• Shear the input image and
evaluate a histogram over the
contours of nearly vertical
strokes
• No parameter tuning required
Pre-processing
• Image de-slanting
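As a rough illustration (not the exact Vinciarelli formulation), shear-based deslanting can be sketched as follows, assuming a binary NumPy image and a simplified score that rewards columns whose ink forms a single contiguous vertical run:

```python
import numpy as np

def deslant(img, shears=None):
    """Try several horizontal shears and keep the one that best
    straightens near-vertical strokes. Score: each column whose ink
    pixels form one contiguous vertical run contributes
    (run length)^2, rewarding long unbroken vertical strokes."""
    if shears is None:
        shears = np.linspace(-0.5, 0.5, 11)  # candidate slant range
    h, w = img.shape
    best, best_score = img, -1.0
    for s in shears:
        # Horizontal shear: row i is shifted by s * (h - 1 - i) pixels.
        offset = s * (h - 1 - np.arange(h))
        cols = np.clip(np.rint(np.arange(w) - offset[:, None]),
                       0, w - 1).astype(int)
        sheared = img[np.arange(h)[:, None], cols]
        score = 0.0
        for col in sheared.T:
            idx = np.flatnonzero(col)
            if idx.size and idx[-1] - idx[0] + 1 == idx.size:
                score += float(idx.size) ** 2
        if score > best_score:
            best, best_score = sheared, score
    return best
```

Slanted strokes get split across columns under the wrong shear, so the score peaks at the shear that makes strokes vertical, which is why no per-image parameter tuning is needed.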
Multi-Scale Training
• Trains the network to predict
characters at multiple
scales.
• Fix a 2D canvas size
 Scale the input image to larger
or smaller sizes
 Translate the
transformed image.
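A minimal sketch of the scale-and-translate idea, assuming a grayscale NumPy image and a white (255) background; the function name and parameters are illustrative, not from the paper:

```python
import numpy as np

def multi_scale_place(img, box_h, box_w, scale, dy=0, dx=0, fill=255):
    """Rescale img by `scale` (nearest-neighbour) and paste it into a
    fixed-size canvas at offset (dy, dx), padding with the background
    value; anything that overflows the canvas is cropped."""
    h, w = img.shape
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    ys = np.arange(nh) * h // nh  # nearest source row per target row
    xs = np.arange(nw) * w // nw  # nearest source column per target column
    scaled = img[ys[:, None], xs[None, :]]
    canvas = np.full((box_h, box_w), fill, dtype=img.dtype)
    ch, cw = min(nh, box_h - dy), min(nw, box_w - dx)
    canvas[dy:dy + ch, dx:dx + cw] = scaled[:ch, :cw]
    return canvas
```

During training, `scale`, `dy`, and `dx` would be drawn at random per sample so that the same word is seen at several sizes and positions inside the fixed canvas.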
Data Augmentation
• Affine Transformation
 Combination of Rotation, Scaling & Translation
• Elastic Distortion
 Each pixel is resampled through a random displacement field.
 The field is smoothed using a Gaussian filter.
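The displacement-field recipe above can be sketched in pure NumPy. Assumptions in this sketch: a separable Gaussian convolution stands in for the usual `scipy.ndimage.gaussian_filter`, resampling is nearest-neighbour rather than interpolated, and `alpha`/`sigma` defaults are illustrative.

```python
import numpy as np

def _gauss1d(sigma):
    """Normalized 1-D Gaussian kernel with radius 3*sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def _smooth(field, sigma):
    """Separable Gaussian blur: filter rows, then columns."""
    k = _gauss1d(sigma)
    field = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, field)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, field)

def elastic_distort(img, alpha=8.0, sigma=4.0, seed=0):
    """Elastic distortion: a random displacement field, smoothed with a
    Gaussian filter, tells each output pixel where to read from."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    dx = _smooth(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = _smooth(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Nearest-neighbour resampling at the displaced coordinates
    # (real implementations interpolate, e.g. bilinearly).
    yy = np.clip(np.rint(ys + dy), 0, h - 1).astype(int)
    xx = np.clip(np.rint(xs + dx), 0, w - 1).astype(int)
    return img[yy, xx]
```

The smoothing step is what separates a plausible handwriting wobble from pixel-level noise: without it, neighbouring pixels are displaced independently and the image just looks corrupted.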
Pre-Training
• IIIT-HWS dataset
 10M word images
 10K vocabulary words
• Rendered using open-source handwritten
fonts.
• Rendering parameters:
 kerning level, stroke width
 foreground and background pixel intensities
sampled from Gaussian distributions
P Krishnan and CV Jawahar, Generating Synthetic Data for Text Recognition, arXiv 2016
Datasets
Dataset | Historical | #Lines | #Words  | #Writers
Rimes   | No         | 12,093 | 66,982  | 1,300
GW      | Yes        | 656    | 4,894   | 1
IAM     | No         | 13,353 | 115,320 | 657
Sample word images
Evaluation Protocol
• Lexicon-based (constrained) and lexicon-free decoding
• Evaluation
 Mean word error rate (WER)
 Mean character error rate (CER), based on the Levenshtein
distance between the predicted and ground-truth words.
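The two error rates can be computed as below (a minimal sketch: WER for a single word pair is 0/1 depending on an exact match, and both metrics are averaged over the test set outside these functions):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(gt: str, pred: str) -> float:
    """Character error rate for one word pair."""
    return levenshtein(gt, pred) / max(len(gt), 1)

def wer(gt: str, pred: str) -> float:
    """Word error rate for one word pair: 1 unless the words match."""
    return 0.0 if gt == pred else 1.0
```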
Ablation Study-I
• Let's check the performance of the original CRNN
network trained only on IAM train data
WER: 22.86
We know deep learning architectures are data hungry.
Let us try pre-training with synthetic data.
Ablation Study-II
• Let's check the performance of the original CRNN
network on IAM train data, pre-trained on
IIIT-HWS data
WER: 22.86 → 20.10
Let us try a few architectural improvements.
First, an STN layer.
Ablation Study-III
• Let us add an STN layer & use the same training
scheme as before
WER: 22.86 → 20.10 → 18.3
Let us add residual blocks to our network.
Ablation Study-IV
• Let us add more (residual) conv. layers & use the
same training scheme as before
WER: 22.86 → 20.10 → 18.3 → 16.19
Using a deeper network helps with HWR.
Let us try slant correction.
Ablation Study-V
• Let us add pre-processing to the previous
architecture and training strategy
WER: 22.86 → 20.10 → 18.3 → 16.19 → 15.79
A small improvement.
Now let us see the gain from our augmentation strategies.
Ablation Study-VI
• Let's check the performance of the previous network,
with the same strategy, but with our data
augmentation strategies added
WER: 22.86 → 20.10 → 18.3 → 16.19 → 15.79 → 13.16
Data augmentation makes a huge difference in HWR.
Ablation Study-VII
• Let's check the performance of the previous network,
with the same strategy, finally adding
test-time augmentation
WER: 22.86 → 20.10 → 18.3 → 16.19 → 15.79 → 13.16 → 12.61
Ablation: IAM Isolated HWR
Method                         | WER   | CER
CRNN                           | 22.86 | 11.08
CRNN-Synth                     | 20.10 | 9.31
SCRNN-Synth                    | 18.3  | 7.82
SDCRNN-Synth                   | 16.19 | 6.34
PP-SDCRNN-Synth                | 15.79 | 5.98
PP-SDCRNN-Synth + Augmentation | 12.61 | 4.88
Isolated Word Recognition-I (IAM)
Method (lexicon-free)             | WER   | CER
Krishnan et al., DAS'18           | 16.19 | 6.34
Wigington et al., ICDAR'17        | 19.07 | 6.07
Sueiras et al., Neurocomputing'18 | 23.8  | 8.8
This Work                         | 12.61 | 4.88
Method (full lexicon)             | WER   | CER
Sueiras et al., Neurocomputing'18 | 12.7  | 6.2
Stuner et al., CoRR'16            | 5.93  | 2.78
Poznanski et al., CVPR'16         | 6.45  | 3.44
Krishnan et al., DAS'18           | 5.1   | 2.66
Wigington et al., ICDAR'17        | 5.71  | 3.03
This Work                         | 4.8   | 2.52
Isolated Word Recognition-II (RIMES)
Method (lexicon-free)             | WER   | CER
Wigington et al., ICDAR'17        | 11.29 | 3.09
Sueiras et al., Neurocomputing'18 | 15.9  | 4.8
This Work                         | 7.04  | 2.32
Method (complete lexicon)         | WER   | CER
Sueiras et al., Neurocomputing'18 | 6.6   | 2.6
Poznanski et al., CVPR'16         | 3.9   | 1.9
Wigington et al., ICDAR'17        | 2.85  | 1.36
Stuner et al., CoRR'16            | 3.48  | 1.34
This Work                         | 1.86  | 0.65
Line Level Recognition-I (IAM)
Method                      | WER   | CER
Pham et al., ICFHR'14       | 35.1  | 10.8
Krishnan et al., DAS'18     | 32.89 | 9.78
Chen et al., ICDAR'17       | 34.55 | 11.15
Puigcerver et al., ICDAR'17 | 18.4  | 5.8
This Work                   | 17.82 | 5.7
Filter Visualizations
• First column is the input; the rest are activations,
taken from the 2nd conv layer
Qualitative Results-IAM
Conclusion
• We present a state-of-the-art deep learning architecture for
handwriting recognition
 CNN-RNN encoder-decoder with an STN module
 Pre-training with synthetic data
 Pre-processing with slant correction
 Various augmentations
o Multi-scale
o Elastic + Affine
o Test-time
Thank You


Editor's Notes

  • #2: Good Evening! Hi, I am
  • #4: In the space of offline handwritten word recognition, RNNs (recurrent neural networks), especially the BLSTM with a CTC layer proposed by Graves et al., have been the most successful, with the underlying problem of text recognition formulated as a sequence-to-sequence mapping. Many follow-up works, such as Bluche et al., further improved its applicability to word recognition. Shi 2016 hybrid architecture: a very successful hybrid CNN+RNN scene text recognition architecture which was successfully adapted for HWR.
  • #5: Various variations of BLSTMs, such as MDLSTM, MDirLSTM, etc., have been proposed to further improve word recognition results. Recently, Puigcerver et al., ICDAR'17 questioned the effectiveness of these variations over BLSTMs for HWR. Also, language models are used to aid line-level recognition, along with a lexicon for word-level recognition.
  • #6: Stuner: a cascade of LSTMs, similar to a cascade of weak classifiers. If a word is rejected through the whole cascade, Viterbi decoding is used. However, it can only do lexicon-based decoding. Wigington used a network similar to the original CRNN network, but added profile normalization and a variation of the elastic distortion augmentation scheme to achieve the previous state-of-the-art results for unconstrained HWR.
  • #7: Now we come to our architecture. The top figure shows the main components of our model. We first have a spatial transformer network, to remove geometric distortion in the input. Then we have a convolutional block arranged like ResNet-18. Refer to the bottom figure now. Suppose we had many 2D feature maps as shown here, and we reshaped each feature map into a 1D vector. The feature maps from the CNN then become a temporal sequence of feature vectors. Coming back to the CNN-RNN hybrid network, the feature sequence from the last conv layer is given as input to the BLSTM layer. At the end of our network we have the CTC loss function, which we backpropagate to train our network.
  • #8: The Spatial Transformer Network, or STN module, was introduced by Jaderberg et al. in 2015, who showed how it could be useful for removing distortions in scene text. Since handwritten data also has distortion due to variable hand movements, we decided to include this layer in our architecture. It is an end-to-end trainable layer and does not require a separate loss function. It consists of the 3 components listed: the localization network gives us the parameters of the transformation, say affine, that we wish to apply; the grid generator and the sampler generate the output feature map by applying that transformation.
  • #10: We use the image de-slanting and de-sloping technique proposed by Vinciarelli et al. in PR, 2001. We use it during both word- and line-level recognition. This method requires no parameter tuning and is applied directly to both isolated word-level and line-level images.
  • #12: The idea of the multi-scale transformation is to learn to predict characters at multiple scales. The scale of a character depends on the context in which it occurs in a word. If the initial input image is larger than the fixed box size, we can only augment it at smaller scales. If it is smaller than the fixed box size, we do larger or smaller size augmentation and pad the remaining space. Wigington et al. also present a method which addresses this issue by normalizing the scale of all images in a dataset using profile normalization. Our approach addresses the same issue through data augmentation while training the network.
  • #13: Human handwriting has a high degree of oscillation. These variations can be captured to a certain extent using elastic distortions. The basic idea is to generate a random displacement field which dictates the computation of a new location for each pixel through interpolation. The top row shows the input images, the 2nd row shows the output after applying the affine transformation, the 3rd row shows the output after applying elastic distortion, and the last row shows the output after applying affine + elastic distortion.
  • #16: There are two ways in which we decode the output of our network. In the lexicon-based or constrained setting, the network chooses the word from the provided dictionary that minimizes the CTC loss. In the unconstrained or lexicon-free setting, the network is not constrained by any such dictionary. We use 2 evaluation metrics, the mean character and word error rate. The CER is based on the Levenshtein distance between the ground-truth and predicted sequences. The WER is 1 for a prediction if the CER is non-zero. We take the mean of both across all the test samples.
  • #17: Let's perform an ablation study using the original CRNN architecture, training it from scratch on the IAM train set and testing on the IAM test corpus. The performance is reported using WER for word recognition, with decoding in an unconstrained setting.
  • #18: Let's keep the architecture the same but pre-train it on synthetic data before fine-tuning on the IAM train set, testing on the corresponding test corpus. The performance is reported using WER in an unconstrained setting. There is a clear improvement in performance.
  • #19: Continuing the story
  • #20: Continuing the story
  • #21: Continuing the story
  • #22: In addition to the above: from the original CRNN model we progressively reduce the error, obtaining an absolute reduction in WER and CER of more than 45% and 56% respectively.
  • #23: In addition to the above: from the original CRNN model we progressively reduce the error, obtaining an absolute reduction in WER and CER of more than 45% and 56% respectively.
  • #24: This table shows the summary of the last 6 slides where we talk about the ablation study that was performed by us in this paper.
  • #27: Clarify that all are unconstrained
  • #28: To get further insight into the workings of the convolutional layers, we visualize the activations of an initial convolution layer on passing an image through the trained network. The first column shows the pre-processed input image taken from the IAM dataset. The next two columns show filters which activate on the foreground and background respectively, and the following two columns show filters which act as horizontal and vertical line detectors respectively.
  • #29: Here we see qualitative results on the IAM dataset for isolated HWR. We achieve accurate results even on ambiguously written words. Most failures suffer from ambiguity in the visual space, improper segmentation, or the presence of an extra character at the end of a word.
  • #30: To recap, we presented a state-of-the-art model for handwriting recognition. There are 6 main components in our model: the CNN-RNN hybrid network, the STN module, a deep network with residual layers, and pre-training with synthetic data. We also apply the slant and slope correction pre-processing technique, along with multi-scale, affine, elastic, and test-time augmentation.
  • #31: Any questions?