SlideShare a Scribd company logo
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 92
Optical Character Recognition System for Urdu (Naskh Font)
Using Pattern Matching Technique
Tabassam Nawaz drtnawaz@uettaxila.edu.pk
Faculty of Software Engineering,
University of Engineering & Technology
Taxila, Pakistan
Syed Ammar Hassan Shah Naqvi ammar_hassan13@yahoo.com
Department of Software Engineering
University of Engineering & Technology
Taxila, Pakistan
Habib ur Rehman habieb_rehman@yahoo.com
Department of Software Engineering
University of Engineering & Technology
Taxila, Pakistan
Anoshia Faiz an_oshia@yahoo.com
Department of Software Engineering
University of Engineering & Technology
Taxila, Pakistan
Abstract
The offline optical character recognition (OCR) for different languages has been
developed over the recent years. Since 1965, the US postal service has been
using this system for automating their services. The range of the applications
under this area is increasing day by day, due to its utility in almost major areas of
government as well as private sector. This technique has been very useful in
making paper free environment in many major organizations as far as the backup
of their previous file record is concerned. Our this system has been proposed for
the Offline Character Recognition for Isolated Characters of Urdu language, as
Urdu language forms words by combining Isolated Characters. Urdu is a cursive
language, having connected characters making words. The major area of utility
for Urdu OCR will be digitizing of a lot of literature related material already
stocked in libraries. Urdu language is famous and spoken in more than 3 big
countries including Pakistan, India and Bangladesh. A lot of work has been done
in Urdu poetry and literature up to the recent century. Creation of OCR for Urdu
language will make an important role in converting all those work from physical
libraries to electronic libraries. Most of the stuff already placed on internet is in
the form of images having text, which took a lot of space to transfer and even
read online. So the need of an Urdu OCR is a must. The system is of training
system type. It consists of the image preprocessing, line and character
segmentation, creation of xml file for training purpose. While Recognition system
includes taking xml file, the image to be recognized, segment it and creation of
chain codes for character images and matching with already stored in xml file.
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 93
The system has been implemented and it has 89% recognition accuracy with a
15 char/sec recognition rate.
Keywords: Pattern matching, chain code creation, morphology, segmentation, training system, recognition
system, digital image processing.
1. INTRODUCTION
An Optical Character Recognition System is software engineered to convert hand-written or
typewritten text (usually scanned) documents into machine editable text formats.
FIGURE 1: Urdu Characters set
This paper describes a training based offline [11] optical character recognition system for a Naskh
font of Urdu language. The main idea behind this recognition is matching the pixel values of the
samples already stored with pixel values of those character-images to be recognized. The major
difficulty while the recognition of Urdu language is its cursive nature, i.e. the characters joined to
create new words. The research in this has been greatly increased during last decade, as the
applications of this area are increasing. It has improved Human Computer Interaction; other
examples include paperless environment, online newspapers, old literature online availability,
paper checking, automating official tasks, reading bank receipts, postal addresses and data entry
forms. Many people are now a day working on Urdu OCR research. As Standard Urdu has
approximately the twentieth largest population of native speakers, among all languages. Due to
technical issues induced by the cursive nature of Urdu language, its OCR has not been
developed completely. If Urdu OCR system is available, it will be very useful and will have great
commercial value. The overall flow of our OCR is shown in Fig.2. Section II describes the Urdu
language specifications. Some related work done for Urdu OCR is described very briefly in
Section III. Section IV and V describes the Training and Recognition Systems respectively.
Section VI describes the process of creating Unicode file [3] from already recognized characters.
All the algorithms used in this paper are described in Section VII. Finally, this paper is concluded
in Section VIII.
2. URDU LANGUAGE SPECIFICATIONS
Urdu language is the old language of Indo-Pak Sub-continent, now it is the national language of
Pakistan. This language is a combination of characteristics of Arabic, Farsi and Sanskrit
languages, as it is the language of troops. The Urdu language has the characteristics of all these
languages mentioned above, all the characters in this language are picked from these languages.
Urdu language is a more cursive and complex language than Arabic and Farsi language as it
contains the connected characters to make words. This cursive nature [9] makes it very difficult to
be recognized through usual Character Recognition Methods.
Urdu character set consists of 40 characters. The characters contain single loop, double loops
and incomplete loops. Dots and diacritics (i.e., Telda (~) and Nuktay ( )) are also included in the
character set. Dots include single, double and triple dots. The recognition of Urdu language is
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 94
very difficult due to the different multiple shapes of a single character. In Urdu language, every
letter is of minimum 2 shapes and maximum 4 shapes. These shapes are based on their
occurrence in the given word. The locations in the Urdu word are isolated, initial, medium and
last. Here all isolated shapes of the characters are mentioned in Fig.1. So the Urdu language
character recognition is very difficult still. Another characteristic of this language is; if we read or
write something is from right to left.
3. RELATED WORK
One of the oldest techniques for pattern recognition is used for character recognition, but through
all days, more focus was on Latin, Chinese [24] and Japanese [25] languages, though connected.
First, we applied Hilditch’s method, which consists of removing the pixels that lie on the edge of
the binary image until only one-pixel-wide line remains. This is followed by some conditions
suggested by Al-Emami to reduce the junction points to one junction point [23]. The matching
procedure is executed based on an image based matching algorithm. From a practical viewpoint
however, the matching time must be reduced as much as possible through the classification
techniques [3].
In our studies we analyzed the shape and visual properties of Urdu characters and define a set of
features which can distinguish one character to another. In Urdu Qaeda system, the online
character recognition technique is used for isolated characters which is some how based on the
strokes by the user runtime [17].
FIGURE 2: Block Diagram of System
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 95
4. SYSTEM
4.1 Training System
• Requirements
As we will first train our system for a specific font and then matching algorithm will match both the
sampled image information and the actual image (input image) information. So the system needs
to know the font and size of the text to be recognized. Thus we have to select a specific font and
specific size in our system. In our case, the following demonstration of training and matching is
only for NASKH font and it can be extended to any font of any size. Sample font size to
demonstrate is taken to be 36. The images used for training should be noise-free and gray scale.
We will first convert the gray scaled image into binary image as shown in Fig.3, and remove extra
noise from the image. Now we can apply morphological transforms [12] easily on these images to
get required results.
Original Image
Binary Image
FIGURE 3: Binarization of input image
• How to Take Character’s Sample
A gray scaled image comprising of the whole character set of Urdu language (Naskh font) is
taken as sample and individual characters are separated out. All these individually separated
characters are created in such a way that from each side one totally white column or row is left for
the ease of chain code [2] creation. As stated earlier, only isolated characters will be recognized
so the samples are taken in such a way. This technique can be applied to any font having any
size.
• How to Create Chain code of characters
The individual character images are then passed into the training software. Each character image
is then scanned from top to bottom and then to the next column, a chain code [18] of each
column of the image is generated. The chain code based on the sum of consecutive one's or
zeros, as the image is only a binary image. According to our assumption, Zero is assumed to be
the first entry of the column. If this Zero appears zero times, then we will place 00 in the string,
then for example 1’s appear for 23 times, we will add 23 in the string. If we continue scanning,
now its turn of zero, if zero appears for 12 times, we will add 12 to the string. Now, if the column
is finished, we will add @ to the string to make sure that column is ended (as shown in Fig.4) the
same process will be repeated for next column and so on. It should be noticed that when the
number is less than 10 e.g., 3 zeros then we write “03”; we add this because maximum no. of
pixels can't exceed 99. As these columns are of only individual characters. Thus by calculating all
columns we get a chain code (from the final string) of set of the character image. This chain code
is generated by calculating alternating on and off pixels as shown in the Fig.4.
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 96
FIGURE 4: Calculated String
Now the chain code is available, the task is to store this chain code, we can use different options
for storing this string including database, collection objects of C# and xml file. We have chosen
the xml file for storing this file, as it will take least space and reading/writing into the file will be
easy. The Fig.5 explains the syntax of our xml file. This xml file is described in next section.
FIGURE 5: Classxml.xml
• How to create classes
• Xml file
Second step during training phase is creation of xml file. Xml file contains all the 21 classes [3] of
Urdu alphabets as parent nodes or elements. Each class contains character set belonging to that
particular class, where every character makes a single child node of parent class node. Each
child node has three attributes. One, the name of the character and the other, chain code of that
character, calculated from its image earlier. Unicode of the character is saved in xml as third
attribute of the child node, which will be assigned to the identified character at the end of the
matching procedure.
o Classification Criteria
Classification of characters is the key technique in our pattern matching method of optical
character recognition. Urdu script has a large character set, consisting of 40 characters. Pattern
matching technique is of no use if system has to traverse all the 40 characters and match their
chain codes. This special matching technique [14] is made very efficient by optimization at every
level, including logical methods and programming techniques. This classification is done on the
bases of height and width of the character without diacritics and dots. Urdu script consists of total
21 shapes, occupying specific dimensions. These 21 shapes are declared as the classes in which
our system divides the characters for recognition. All the recognize-able characters are input to
the training module to train our OCR system for a specific font and size. Training system can train
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 97
the OCR for any font and size character set, as mentioned earlier. System will store their
dimensions as individual classes of characters. Fig.6 describes all the 21 classes.
FIGURE 6: Character’s Classification
o How to remove diacritics
The character's dots and diacritics are removed using morphology techniques [12] of binary
images. The dots are removed using pepper noise [6] removal process and diacritics like ~ are
removed using thinning process [5]. Special filters are designed for removing diacritics. In Urdu
script characters contain a variation of diacritics, like single Nukta, double Nukta or triple Nukta
and “Chota Tuay” as in Fig.7.
FIGURE 7: Shows Chota Tuay, Double, triple and Double Dots respectively from left.
o Filters
Different special filters are designed particularly to remove diacritics from the isolated character
images. When the shape of the character is obtained after removing diacritics, we calculated the
dimensions (height and width) of each character and place it into particular class. Fig.8, Fig.9a,
Fig.9b and Fig.9c describe the filter for removing ttuay’ and single dot, double dot, triple dot
respectively.
FIGURE 8: Ttuay Filter
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 98
FIGURE 9:. Single dot Filter
FIGURE 9a: Double dots Filter
FIGURE 9b: Triple dot Filter
o Chain code
After creating the xml for 21 classes, measured the chain code for each class characters. All the
characters of a class are added to the xml file as children to the parent classes. All the character
set classes are shown in the Fig.1. This xml file will act as an input for the recognition system.
4.2 Recognition System
o Segmentation
When an image containing text is given as the input to the character recognition system, system
performs some preprocessing steps on it; including converting grayscale image into binary image
and enhancing the image by removing pepper noise. Binary image is need for applying filters and
5
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 99
calculating chain code. Pepper noise must be removed because it will lead the system to
erroneous classification and character recognition. Error margin (described later in this section)
methods are very sensitive to extra noise in image as noise can be located at random locations
disturbing the chain code and specially dimensions. Remember that “character image variation” is
always symmetric, not random. System then divides the image into segments. As at this stage
this paper is describing the recognition of isolated characters, thus segmentation is required at
two levels instead of three levels as required for connected characters in recursive script i.e. Urdu
script.
Two levels of segmentation are:
• Line segmentation
The image is first divided into different lines of text by checking for whole row of white pixels
consecutively from right to left. Each separated line will be saved separately in the recognition
system. Fig.10 shows an image after the lines have been segmented from an image.
FIGURE 10: Line Segmentation Results
• Isolated character segmentation
The already separated text line images are to be processed now, for isolated character
recognition, by converting them into individual characters. During processing, we check for full
white pixels column at the starting and ending point of the isolated character. Start of the
character is identified when any black pixel is scanned in the column. Scan continues until
another white pixel column is identified. Image in-between the white columns, is saved, and
starting and ending white column of pixels is preserved in the image. These results into different
small images of individual characters are shown in Fig.11 which will be processed separately.
Every separate image of isolated character is now completely void on four sides. Leaving one
pixel margin on each side of the image is because of the assumption that in chain code of each
column of the image will start with white pixels count.
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 100
FIGURE 11: Character Segmentation Results
• How to Identify Class of a Character
After segmentation of whole document image into small images of isolated characters, next step
is to identify class of each character. Following Steps will describe whole recognition steps;
• Diacritics Removal
To obtain an image of recognizable class, dots (Nuka’s) and diacritics (ttuay) are removed from
the above, below or inside the character image using especially designed filters (as shown earlier
in Figs[8,9a,9b,9c] ) specific to them. This dot-less image is stored as another copy of the
character image. Now, the height and width of this new character image are measured with only
considering black pixel, as was done before while taking samples and matched with the height
and width of different classes in the classxml.xml file.
When a class gets matched, then the image's string will be calculated again and matched with
the already calculated strings of the matched class characters each. This classification helps in
providing more efficient approach in pattern matching.
• Chain code calculation
At this stage, we knew the class to which the character belongs; now the original image will be
used for recognition. And this original image is scanned from top to bottom for each column to
obtain its' chain code. We will call this chain code, the "calculated chain code" and the one that is
saved in characterset.xml file called "sampled chain code". The calculated chain code is
calculated using the scanning the input image and generating the calculated string of on and off
pixels as described earlier in section 3.1.
• How to Recognize Character
After determining the class of the character, we have to just match the calculated chain code with
only some character's sampled chain codes. This makes the process of identification more
reliable and efficient. The calculated chain code is matched column by column with every
character's sampled chain code in the identified class. This process is repeated until all the
columns are matched with some extent of error margin. As we have already set the error margin,
character is identified up to the margin because different images can have little different
properties, so exact pattern matching cannot be so efficient, to identify the exact character.
o How to check Error Margin
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 101
As any image can be error prone, we must satisfy the “chain code matching method” about the
correct and exact character to recognize. In any case error count (mismatches) for each column
is calculated, while matching total number of columns that is the width of the image. If width
varies, equalize sample image width with the calculated image width. To equalize the width just
add some characters like ‘.’ in the shorter image as new columns. Now both images contain equal
number of columns to match with each other. Once we are done with width it’s time to check
whether both columns are of equal height or not. To equalize the heights of sample column and
calculated column again add characters like ‘.’ in shorter image as a new row in affected column.
Now both images contain equal number of columns and in each column equal number of rows.
Match column by column the chain code and calculate the mismatches as error count. If error
count increases the already set value, get to the next image in the identified class only. Character
is identified when error is less than error margin.
5. ALGORITHMS USED
Chain Code Calculation
• Start at right top of the image.
• Scan from top to bottom of the first column of segmented image.
• Assume that first pixels scanned are “off”, so count number of “off” pixels and add to
chain code.
• Continue scan, if pixel value changes from “off” to “on”, start counting “on” pixels and add
to chain code.
• Continue scan, counting “on” and “off” pixels till bottom of the image is reached and add
to chain code.
• At the end of the first column code add “@” as the end of the column.
• Continue scanning the next columns to the width of the segmented image. Chain codes
of all columns are calculated and saved as a single chain code in the xml file, in the
corresponding class.
Segmentation
Line Segmentation:
•••• Load input image.
•••• Start scan from right top of the original image.
•••• Scan up to the image width, on the same Y component.
•••• Scan the first row to check any “on” pixels. If no “on” pixel is found go to next row.
•••• Continue scan until a row containing “on” pixels come across. Save the row before this
row as the top edge of the image text line.
•••• Continue scan until another row with all “off” pixels is found. Set this row as the bottom
edge of the image text line.
•••• Save image as first text line and continue scan to find next line.
•••• Continue till the bottom of the image is reached. Save all text line images to be input to
character segmentation
Character Segmentation:
•••• Input text line image.
•••• Start scan from right top of the text line image.
•••• Scan up to the image height, on the same X component.
7
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 102
•••• Scan the first column to check any “on” pixels. If no “on” pixel is found go to next column.
•••• Continue scan until a column containing “on” pixels come across. Save the column
before this column as the right hand edge of the isolated character image.
•••• Continue scan until another column with all “off” pixels is found. Set this column as the
left hand edge of the isolated character image.
•••• Save image as first character and continue scan to find next character.
•••• Continue till the left hand edge of the image text line is reached. Save all character
images.
Classification
• Load the segmented character image.
• Apply filters to remove diacritics, and get the shape of the corresponding class.
• Measure the dimensions of the shape i.e. the width and height of the character image
without diacritics, by scanning the image from top right to left bottom pixel by pixel.
• Open xml file and traverse all the class nodes. Match “width” and “height” attribute values
with the calculated values.
• When a dimension match occurs according to some already set error margin (in this case
error margin is 1, i.e. if width or height is one less or on greater than the sample values,
match occurs.) set that class as the key class for that character.
Character Matching
•••• Restore the “original segmented character image” with diacritics and dots. Calculate its
chain code according to the algorithm described earlier in this section.
•••• Open xml file. Traverse all child nodes in the identified class and match “code” attribute
values with the calculated chain code.
•••• When a chain code match occurs according to Error Margin, described in next algorithm
(Testing), set the “name” attribute of that node as the name of the character. And set the
“Unicode” attribute of that node as the Unicode of the input character.
Testing
•••• Get the sampled chain codes in the identified class from xml file and the calculated chain
code of the classified character image.
•••• Start matching the calculated chain code with each of the sampled chain codes.
•••• Chain code of each column is compared.
•••• In each column every single code is compared with actual value (sampled value). In each
column every single mismatch is counted as “error count”. If error count exceeds the Error
Margin, set according to font and size, disqualify the node to be the exact match.
•••• Continue with the next node chain code and check the Error Margin. Continue until a
character is confirmed. Confirmation occurs when “total error count” of the character is less
than the “Error Margin Limit”.
Unicode File Creation
•••• Characters are recognized through the above process.
•••• Xml file includes all the information about the recognized character.
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 103
•••• This recognized character is written into a Unicode file, according to the information
written in the xml file.
•••• The Unicode file is written from right to left.
6. RESULTS
Unicode is simply a character encoding system. It has nothing to do with how these characters
finally get displayed on computer screen. Thus it cannot be Naskh based only, its only type style
issues relating to rendering which is handled by operating system and application not by
encoding scheme [4]. After getting the identification of exact character, we create a new Unicode
file that will include all the characters identified. All the Unicode characters are according to the
Unicode Consortium [3]. We will place the Unicode of those characters in the same position from
right to left as in the Urdu text image file. The output Unicode file is now editable and searchable.
This file is written from right to left as Urdu language is right to left.
7. CONCLUSION
This training system has recognized different printed Urdu text image files, which include isolated
characters. The training system has been tested for different sizes of fonts, the results are quite
impressive. Our system has achieved an accuracy of 89% for the isolated characters with
accuracy of a 15 char/sec recognition rate. Different sizes of filters are used for different sizes,
and this system has proven good for multi font sized characters. The system has been tested for
different images and efficient results are found. Fig12.a shows the input image file, after being
processed by our OCR, the screen shot of our output text file is shown in Fig.12b.
FIGURE 12a: Input image
FIGURE 13: Output text file
8. REFERENCES
[1]. Afzal, M. and Hussain, S., “Urdu Computing Standards: Urdu Zabta Takhti (UZT) 1.01”, in
the Proceedings of International IEEE Multi topic Conference (INMIC), Lahore University
of Management Sciences (LUMS), Lahore, Pakistan, 2001.
[2]. Ethnologue, Languages of Pakistan,
http://www.ethnologue.com/show_country.asp?name=Pakistan
Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz
International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 104
[3]. See the Unicode Consortium website at http://unicode.org
[4]. Bhurgari, A. M. 2007. Enabling Pakistani Languages through Unicode, published at
http://download.microsoft.com/download/1/4/2/142aef9f-1a74-4a24-b1f4-
782d48d41a6d/PakLang.pdf
[5]. Thresholding, Image Segmentation, Digital Image Processing 2/e Rafael C. Gonzalez,
Richard E. Woods.
[6]. Fast, Bruce B., Allen, Dana R. OCR image preprocessing method for image enhancement
of scanned documents.
[7]. Zaheer Ahmad, Jehanzeb Khan Orakzai, Inam Shamsher, and Awais Adnan. “Urdu
Nastaleeq Optical Character Recognition”, “Proceedings of world academy of science,
engineering and technology volume 26 december 2007”.
[8]. U. Pal and Anirban Sarkar, “Recognition of Printed Urdu Script”, “Proceedings of the
Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)”.
[9]. Khalid Saeed, “New Approaches for Cursive Languages Recognition: Machine and Hand
Written Script and Texts”.
[10]. T. Sari and M. Sellami, “Cursive Arabic Script Segmentation and Recognition System”.
[11]. Bozinovic, R.M.; Srihari, S.N, “Off-line cursive script word recognition”.
[12]. Soille, P. [2003]. “Morphological Image Analysis: Principles and Applications”, 2nd ed.,
Springer-Verlag, NY.
[13]. Dougherty. E. R. and Lotufo, R. A. [2003]. “Hands-on Morphological Image Processing”,
SPIE--The International Society for Optical Engineering, Bellingham, WA.
[14]. International Journal of Pattern Recognition and Artificial Intelligence.
[15]. Alasdari McAndrew, Anne Venables, “A ‘Secondary’ Look at Digital Image Processing”.
[16]. Ganapathy, V., Lean, C.C.H., “Optical Character Recognition Program for Images of
Printed Text using a Neural Network”.
[17]. Nabeel Shahzad, Brandon Paulson, Tracy Hammond, “Urdu Qaeda: Recognition System
for Isolated Urdu Characters”.
[18]. Hermilo, Ernesto, Ramon M. “Efficiency of chain codes to represent binary objects”.
[19]. Yong Kui Liua and Borut Žalik, “An efficient chain code with Huffman coding”.
[20]. Shah, Z.A., “Ligature based optical character recognition of Urdu- Nastaleeq
font”.
[21]. Inam Shamsher, Zaheer Ahmad, Jahenzeb Khan Orakzai and Awais Adnan, “OCR For
Printed Urdu Script Using Feed Forward Neural Network”.
[22]. “The Origin of Urdu Language”
http://www.essortment.com/all/urdulanguage_rguo.htm
[23]. T.S El-Sheikh and R.M Guindi, “computer Recognition of Arabic Cursive Script,” Pattern
Recognition, Vol.21, No, 4, 1988, pp.293-302.
[24]. G. Nagy Rensselaer Polytechnic Institute Troy, New York, “Chinese Character
Recognition A Twenty Five Year Retrospective”. Tsuyoshi Kitani t, riguchi and Masami
Ilara Yoshio, “Pattern Matching in the Textract Information Extraction System”.
9

More Related Content

What's hot (20)

PPTX
Handwriting Recognition
Bindu Karki
 
PPTX
Offline Omni Font Arabic Optical Text Recognition System using Prolog Classif...
GBM
 
PPT
optical character recognition system
Vijay Apurva
 
PPTX
OCR Presentation (Optical Character Recognition)
Neeraj Neupane
 
PDF
Handwritten Character Recognition: A Comprehensive Review on Geometrical Anal...
iosrjce
 
PPT
05a
Badri Patro
 
PPTX
Handwritten Character Recognition
Constantine Priemski
 
PPTX
Machine learning
Amit Gupta
 
PDF
Handwritten character recognition in
ijaia
 
PPTX
Text Detection and Recognition
Badruz Nasrin Basri
 
PDF
FREEMAN CODE BASED ONLINE HANDWRITTEN CHARACTER RECOGNITION FOR MALAYALAM USI...
acijjournal
 
PDF
Z04405149151
IJERA Editor
 
PPTX
Handwriting Recognition Using Deep Learning and Computer Version
Naiyan Noor
 
PPTX
PDF OCR
OliviaSmith160
 
PDF
CHARACTER RECOGNITION USING NEURAL NETWORK WITHOUT FEATURE EXTRACTION FOR KAN...
Editor IJMTER
 
PPTX
Automatic handwriting recognition
BIJIT GHOSH
 
PPTX
Handwritten character recognition using artificial neural network
Harshana Madusanka Jayamaha
 
PPTX
Final Report on Optical Character Recognition
Vidyut Singhania
 
PPTX
offline character recognition for handwritten gujarati text
Bhumika Patel
 
PDF
Optical Character Recognition (OCR) System
iosrjce
 
Handwriting Recognition
Bindu Karki
 
Offline Omni Font Arabic Optical Text Recognition System using Prolog Classif...
GBM
 
optical character recognition system
Vijay Apurva
 
OCR Presentation (Optical Character Recognition)
Neeraj Neupane
 
Handwritten Character Recognition: A Comprehensive Review on Geometrical Anal...
iosrjce
 
Handwritten Character Recognition
Constantine Priemski
 
Machine learning
Amit Gupta
 
Handwritten character recognition in
ijaia
 
Text Detection and Recognition
Badruz Nasrin Basri
 
FREEMAN CODE BASED ONLINE HANDWRITTEN CHARACTER RECOGNITION FOR MALAYALAM USI...
acijjournal
 
Z04405149151
IJERA Editor
 
Handwriting Recognition Using Deep Learning and Computer Version
Naiyan Noor
 
CHARACTER RECOGNITION USING NEURAL NETWORK WITHOUT FEATURE EXTRACTION FOR KAN...
Editor IJMTER
 
Automatic handwriting recognition
BIJIT GHOSH
 
Handwritten character recognition using artificial neural network
Harshana Madusanka Jayamaha
 
Final Report on Optical Character Recognition
Vidyut Singhania
 
offline character recognition for handwritten gujarati text
Bhumika Patel
 
Optical Character Recognition (OCR) System
iosrjce
 

Similar to Optical Character Recognition System for Urdu (Naskh Font)Using Pattern Matching Technique (20)

PDF
International Journal of Image Processing (IJIP) Volume (3) Issue (3)
CSCJournals
 
PDF
IRJET - A Survey on Recognition of Strike-Out Texts in Handwritten Documents
IRJET Journal
 
PDF
Ijetcas14 371
Iasir Journals
 
PDF
Isolated Kannada Character Recognition using Chain Code Features
International Journal of Science and Research (IJSR)
 
PDF
A Novel Approach for Bilingual (English - Oriya) Script Identification and Re...
CSCJournals
 
PDF
Two Methods for Recognition of Hand Written Farsi Characters
CSCJournals
 
PDF
Fuzzy rule based classification and recognition of handwritten hindi
IAEME Publication
 
PDF
Fuzzy rule based classification and recognition of handwritten hindi
IAEME Publication
 
PDF
An Optical Character Recognition for Handwritten Devanagari Script
IJERA Editor
 
PDF
SEGMENTATION OF CHARACTERS WITHOUT MODIFIERS FROM A PRINTED BANGLA TEXT
cscpconf
 
PDF
E123440
IJRES Journal
 
PDF
An Efficient Segmentation Technique for Machine Printed Devanagiri Script: Bo...
iosrjce
 
PDF
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
PDF
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
cscpconf
 
PDF
A Survey of Modern Character Recognition Techniques
ijsrd.com
 
PDF
O45018291
IJERA Editor
 
PDF
Applsci 09-02758
FahadJabbar13
 
PDF
ARABIC ONLINE HANDWRITING RECOGNITION USING NEURAL NETWORK
ijaia
 
PDF
Preprocessing Phase for Offline Arabic Handwritten Character Recognition
Editor IJCATR
 
PDF
Volume 2-issue-6-2009-2015
Editor IJARCET
 
International Journal of Image Processing (IJIP) Volume (3) Issue (3)
CSCJournals
 
IRJET - A Survey on Recognition of Strike-Out Texts in Handwritten Documents
IRJET Journal
 
Ijetcas14 371
Iasir Journals
 
Isolated Kannada Character Recognition using Chain Code Features
International Journal of Science and Research (IJSR)
 
A Novel Approach for Bilingual (English - Oriya) Script Identification and Re...
CSCJournals
 
Two Methods for Recognition of Hand Written Farsi Characters
CSCJournals
 
Fuzzy rule based classification and recognition of handwritten hindi
IAEME Publication
 
Fuzzy rule based classification and recognition of handwritten hindi
IAEME Publication
 
An Optical Character Recognition for Handwritten Devanagari Script
IJERA Editor
 
SEGMENTATION OF CHARACTERS WITHOUT MODIFIERS FROM A PRINTED BANGLA TEXT
cscpconf
 
E123440
IJRES Journal
 
An Efficient Segmentation Technique for Machine Printed Devanagiri Script: Bo...
iosrjce
 
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
cscpconf
 
A Survey of Modern Character Recognition Techniques
ijsrd.com
 
O45018291
IJERA Editor
 
Applsci 09-02758
FahadJabbar13
 
ARABIC ONLINE HANDWRITING RECOGNITION USING NEURAL NETWORK
ijaia
 
Preprocessing Phase for Offline Arabic Handwritten Character Recognition
Editor IJCATR
 
Volume 2-issue-6-2009-2015
Editor IJARCET
 
Ad

Recently uploaded (20)

PPTX
STAFF DEVELOPMENT AND WELFARE: MANAGEMENT
PRADEEP ABOTHU
 
PPTX
Cultivation practice of Litchi in Nepal.pptx
UmeshTimilsina1
 
PDF
LAW OF CONTRACT (5 YEAR LLB & UNITARY LLB )- MODULE - 1.& 2 - LEARN THROUGH P...
APARNA T SHAIL KUMAR
 
PDF
SSHS-2025-PKLP_Quarter-1-Dr.-Kerby-Alvarez.pdf
AishahSangcopan1
 
PDF
DIGESTION OF CARBOHYDRATES,PROTEINS,LIPIDS
raviralanaresh2
 
PDF
Isharyanti-2025-Cross Language Communication in Indonesian Language
Neny Isharyanti
 
PPTX
2025 Winter SWAYAM NPTEL & A Student.pptx
Utsav Yagnik
 
PPTX
How to Create a PDF Report in Odoo 18 - Odoo Slides
Celine George
 
PPTX
Mathematics 5 - Time Measurement: Time Zone
menchreo
 
PPTX
PPT on the Development of Education in the Victorian England
Beena E S
 
PPTX
MENINGITIS: NURSING MANAGEMENT, BACTERIAL MENINGITIS, VIRAL MENINGITIS.pptx
PRADEEP ABOTHU
 
PPSX
HEALTH ASSESSMENT (Community Health Nursing) - GNM 1st Year
Priyanshu Anand
 
PPTX
How to Set Maximum Difference Odoo 18 POS
Celine George
 
PPTX
Growth and development and milestones, factors
BHUVANESHWARI BADIGER
 
PDF
Dimensions of Societal Planning in Commonism
StefanMz
 
PPTX
SPINA BIFIDA: NURSING MANAGEMENT .pptx
PRADEEP ABOTHU
 
PDF
The Constitution Review Committee (CRC) has released an updated schedule for ...
nservice241
 
PDF
Generative AI: it's STILL not a robot (CIJ Summer 2025)
Paul Bradshaw
 
PPSX
Health Planning in india - Unit 03 - CHN 2 - GNM 3RD YEAR.ppsx
Priyanshu Anand
 
PPTX
Quarter1-English3-W4-Identifying Elements of the Story
FLORRACHELSANTOS
 
STAFF DEVELOPMENT AND WELFARE: MANAGEMENT
PRADEEP ABOTHU
 
Cultivation practice of Litchi in Nepal.pptx
UmeshTimilsina1
 
LAW OF CONTRACT (5 YEAR LLB & UNITARY LLB )- MODULE - 1.& 2 - LEARN THROUGH P...
APARNA T SHAIL KUMAR
 
SSHS-2025-PKLP_Quarter-1-Dr.-Kerby-Alvarez.pdf
AishahSangcopan1
 
DIGESTION OF CARBOHYDRATES,PROTEINS,LIPIDS
raviralanaresh2
 
Isharyanti-2025-Cross Language Communication in Indonesian Language
Neny Isharyanti
 
2025 Winter SWAYAM NPTEL & A Student.pptx
Utsav Yagnik
 
How to Create a PDF Report in Odoo 18 - Odoo Slides
Celine George
 
Mathematics 5 - Time Measurement: Time Zone
menchreo
 
PPT on the Development of Education in the Victorian England
Beena E S
 
MENINGITIS: NURSING MANAGEMENT, BACTERIAL MENINGITIS, VIRAL MENINGITIS.pptx
PRADEEP ABOTHU
 
HEALTH ASSESSMENT (Community Health Nursing) - GNM 1st Year
Priyanshu Anand
 
How to Set Maximum Difference Odoo 18 POS
Celine George
 
Growth and development and milestones, factors
BHUVANESHWARI BADIGER
 
Dimensions of Societal Planning in Commonism
StefanMz
 
SPINA BIFIDA: NURSING MANAGEMENT .pptx
PRADEEP ABOTHU
 
The Constitution Review Committee (CRC) has released an updated schedule for ...
nservice241
 
Generative AI: it's STILL not a robot (CIJ Summer 2025)
Paul Bradshaw
 
Health Planning in india - Unit 03 - CHN 2 - GNM 3RD YEAR.ppsx
Priyanshu Anand
 
Quarter1-English3-W4-Identifying Elements of the Story
FLORRACHELSANTOS
 
Ad

Optical Character Recognition System for Urdu (Naskh Font)Using Pattern Matching Technique

  • 1. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 92 Optical Character Recognition System for Urdu (Naskh Font) Using Pattern Matching Technique Tabassam Nawaz [email protected] Faculty of Software Engineering, University of Engineering & Technology Taxila, Pakistan Syed Ammar Hassan Shah Naqvi [email protected] Department of Software Engineering University of Engineering & Technology Taxila, Pakistan Habib ur Rehman [email protected] Department of Software Engineering University of Engineering & Technology Taxila, Pakistan Anoshia Faiz [email protected] Department of Software Engineering University of Engineering & Technology Taxila, Pakistan Abstract The offline optical character recognition (OCR) for different languages has been developed over the recent years. Since 1965, the US postal service has been using this system for automating their services. The range of the applications under this area is increasing day by day, due to its utility in almost major areas of government as well as private sector. This technique has been very useful in making paper free environment in many major organizations as far as the backup of their previous file record is concerned. Our this system has been proposed for the Offline Character Recognition for Isolated Characters of Urdu language, as Urdu language forms words by combining Isolated Characters. Urdu is a cursive language, having connected characters making words. The major area of utility for Urdu OCR will be digitizing of a lot of literature related material already stocked in libraries. Urdu language is famous and spoken in more than 3 big countries including Pakistan, India and Bangladesh. A lot of work has been done in Urdu poetry and literature up to the recent century. Creation of OCR for Urdu language will make an important role in converting all those work from physical libraries to electronic libraries. Most of the stuff already placed on internet is in the form of images having text, which took a lot of space to transfer and even read online. So the need of an Urdu OCR is a must. The system is of training system type. It consists of the image preprocessing, line and character segmentation, creation of xml file for training purpose. While Recognition system includes taking xml file, the image to be recognized, segment it and creation of chain codes for character images and matching with already stored in xml file.
  • 2. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 93 The system has been implemented and it has 89% recognition accuracy with a 15 char/sec recognition rate. Keywords: Pattern matching, chain code creation, morphology, segmentation, training system, recognition system, digital image processing. 1. INTRODUCTION An Optical Character Recognition System is software engineered to convert hand-written or typewritten text (usually scanned) documents into machine editable text formats. FIGURE 1: Urdu Characters set This paper describes a training based offline [11] optical character recognition system for a Naskh font of Urdu language. The main idea behind this recognition is matching the pixel values of the samples already stored with pixel values of those character-images to be recognized. The major difficulty while the recognition of Urdu language is its cursive nature, i.e. the characters joined to create new words. The research in this has been greatly increased during last decade, as the applications of this area are increasing. It has improved Human Computer Interaction; other examples include paperless environment, online newspapers, old literature online availability, paper checking, automating official tasks, reading bank receipts, postal addresses and data entry forms. Many people are now a day working on Urdu OCR research. As Standard Urdu has approximately the twentieth largest population of native speakers, among all languages. Due to technical issues induced by the cursive nature of Urdu language, its OCR has not been developed completely. If Urdu OCR system is available, it will be very useful and will have great commercial value. The overall flow of our OCR is shown in Fig.2. Section II describes the Urdu language specifications. Some related work done for Urdu OCR is described very briefly in Section III. Section IV and V describes the Training and Recognition Systems respectively. Section VI describes the process of creating Unicode file [3] from already recognized characters. All the algorithms used in this paper are described in Section VII. Finally, this paper is concluded in Section VIII. 2. URDU LANGUAGE SPECIFICATIONS Urdu language is the old language of Indo-Pak Sub-continent, now it is the national language of Pakistan. This language is a combination of characteristics of Arabic, Farsi and Sanskrit languages, as it is the language of troops. The Urdu language has the characteristics of all these languages mentioned above, all the characters in this language are picked from these languages. Urdu language is a more cursive and complex language than Arabic and Farsi language as it contains the connected characters to make words. This cursive nature [9] makes it very difficult to be recognized through usual Character Recognition Methods. Urdu character set consists of 40 characters. The characters contain single loop, double loops and incomplete loops. Dots and diacritics (i.e., Telda (~) and Nuktay ( )) are also included in the character set. Dots include single, double and triple dots. The recognition of Urdu language is
  • 3. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 94 very difficult due to the different multiple shapes of a single character. In Urdu language, every letter is of minimum 2 shapes and maximum 4 shapes. These shapes are based on their occurrence in the given word. The locations in the Urdu word are isolated, initial, medium and last. Here all isolated shapes of the characters are mentioned in Fig.1. So the Urdu language character recognition is very difficult still. Another characteristic of this language is; if we read or write something is from right to left. 3. RELATED WORK One of the oldest techniques for pattern recognition is used for character recognition, but through all days, more focus was on Latin, Chinese [24] and Japanese [25] languages, though connected. First, we applied Hilditch’s method, which consists of removing the pixels that lie on the edge of the binary image until only one-pixel-wide line remains. This is followed by some conditions suggested by Al-Emami to reduce the junction points to one junction point [23]. The matching procedure is executed based on an image based matching algorithm. From a practical viewpoint however, the matching time must be reduced as much as possible through the classification techniques [3]. In our studies we analyzed the shape and visual properties of Urdu characters and define a set of features which can distinguish one character to another. In Urdu Qaeda system, the online character recognition technique is used for isolated characters which is some how based on the strokes by the user runtime [17]. FIGURE 2: Block Diagram of System
  • 4. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 95 4. SYSTEM 4.1 Training System • Requirements As we will first train our system for a specific font and then matching algorithm will match both the sampled image information and the actual image (input image) information. So the system needs to know the font and size of the text to be recognized. Thus we have to select a specific font and specific size in our system. In our case, the following demonstration of training and matching is only for NASKH font and it can be extended to any font of any size. Sample font size to demonstrate is taken to be 36. The images used for training should be noise-free and gray scale. We will first convert the gray scaled image into binary image as shown in Fig.3, and remove extra noise from the image. Now we can apply morphological transforms [12] easily on these images to get required results. Original Image Binary Image FIGURE 3: Binarization of input image • How to Take Character’s Sample A gray scaled image comprising of the whole character set of Urdu language (Naskh font) is taken as sample and individual characters are separated out. All these individually separated characters are created in such a way that from each side one totally white column or row is left for the ease of chain code [2] creation. As stated earlier, only isolated characters will be recognized so the samples are taken in such a way. This technique can be applied to any font having any size. • How to Create Chain code of characters The individual character images are then passed into the training software. Each character image is then scanned from top to bottom and then to the next column, a chain code [18] of each column of the image is generated. The chain code based on the sum of consecutive one's or zeros, as the image is only a binary image. According to our assumption, Zero is assumed to be the first entry of the column. If this Zero appears zero times, then we will place 00 in the string, then for example 1’s appear for 23 times, we will add 23 in the string. If we continue scanning, now its turn of zero, if zero appears for 12 times, we will add 12 to the string. Now, if the column is finished, we will add @ to the string to make sure that column is ended (as shown in Fig.4) the same process will be repeated for next column and so on. It should be noticed that when the number is less than 10 e.g., 3 zeros then we write “03”; we add this because maximum no. of pixels can't exceed 99. As these columns are of only individual characters. Thus by calculating all columns we get a chain code (from the final string) of set of the character image. This chain code is generated by calculating alternating on and off pixels as shown in the Fig.4.
  • 5. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 96 FIGURE 4: Calculated String Now the chain code is available, the task is to store this chain code, we can use different options for storing this string including database, collection objects of C# and xml file. We have chosen the xml file for storing this file, as it will take least space and reading/writing into the file will be easy. The Fig.5 explains the syntax of our xml file. This xml file is described in next section. FIGURE 5: Classxml.xml • How to create classes • Xml file Second step during training phase is creation of xml file. Xml file contains all the 21 classes [3] of Urdu alphabets as parent nodes or elements. Each class contains character set belonging to that particular class, where every character makes a single child node of parent class node. Each child node has three attributes. One, the name of the character and the other, chain code of that character, calculated from its image earlier. Unicode of the character is saved in xml as third attribute of the child node, which will be assigned to the identified character at the end of the matching procedure. o Classification Criteria Classification of characters is the key technique in our pattern matching method of optical character recognition. Urdu script has a large character set, consisting of 40 characters. Pattern matching technique is of no use if system has to traverse all the 40 characters and match their chain codes. This special matching technique [14] is made very efficient by optimization at every level, including logical methods and programming techniques. This classification is done on the bases of height and width of the character without diacritics and dots. Urdu script consists of total 21 shapes, occupying specific dimensions. These 21 shapes are declared as the classes in which our system divides the characters for recognition. All the recognize-able characters are input to the training module to train our OCR system for a specific font and size. Training system can train
  • 6. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 97 the OCR for any font and size character set, as mentioned earlier. System will store their dimensions as individual classes of characters. Fig.6 describes all the 21 classes. FIGURE 6: Character’s Classification o How to remove diacritics The character's dots and diacritics are removed using morphology techniques [12] of binary images. The dots are removed using pepper noise [6] removal process and diacritics like ~ are removed using thinning process [5]. Special filters are designed for removing diacritics. In Urdu script characters contain a variation of diacritics, like single Nukta, double Nukta or triple Nukta and “Chota Tuay” as in Fig.7. FIGURE 7: Shows Chota Tuay, Double, triple and Double Dots respectively from left. o Filters Different special filters are designed particularly to remove diacritics from the isolated character images. When the shape of the character is obtained after removing diacritics, we calculated the dimensions (height and width) of each character and place it into particular class. Fig.8, Fig.9a, Fig.9b and Fig.9c describe the filter for removing ttuay’ and single dot, double dot, triple dot respectively. FIGURE 8: Ttuay Filter
  • 7. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 98 FIGURE 9:. Single dot Filter FIGURE 9a: Double dots Filter FIGURE 9b: Triple dot Filter o Chain code After creating the xml for 21 classes, measured the chain code for each class characters. All the characters of a class are added to the xml file as children to the parent classes. All the character set classes are shown in the Fig.1. This xml file will act as an input for the recognition system. 4.2 Recognition System o Segmentation When an image containing text is given as the input to the character recognition system, system performs some preprocessing steps on it; including converting grayscale image into binary image and enhancing the image by removing pepper noise. Binary image is need for applying filters and 5
  • 8. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 99 calculating chain code. Pepper noise must be removed because it will lead the system to erroneous classification and character recognition. Error margin (described later in this section) methods are very sensitive to extra noise in image as noise can be located at random locations disturbing the chain code and specially dimensions. Remember that “character image variation” is always symmetric, not random. System then divides the image into segments. As at this stage this paper is describing the recognition of isolated characters, thus segmentation is required at two levels instead of three levels as required for connected characters in recursive script i.e. Urdu script. Two levels of segmentation are: • Line segmentation The image is first divided into different lines of text by checking for whole row of white pixels consecutively from right to left. Each separated line will be saved separately in the recognition system. Fig.10 shows an image after the lines have been segmented from an image. FIGURE 10: Line Segmentation Results • Isolated character segmentation The already separated text line images are to be processed now, for isolated character recognition, by converting them into individual characters. During processing, we check for full white pixels column at the starting and ending point of the isolated character. Start of the character is identified when any black pixel is scanned in the column. Scan continues until another white pixel column is identified. Image in-between the white columns, is saved, and starting and ending white column of pixels is preserved in the image. These results into different small images of individual characters are shown in Fig.11 which will be processed separately. Every separate image of isolated character is now completely void on four sides. Leaving one pixel margin on each side of the image is because of the assumption that in chain code of each column of the image will start with white pixels count.
  • 9. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 100 FIGURE 11: Character Segmentation Results • How to Identify Class of a Character After segmentation of whole document image into small images of isolated characters, next step is to identify class of each character. Following Steps will describe whole recognition steps; • Diacritics Removal To obtain an image of recognizable class, dots (Nuka’s) and diacritics (ttuay) are removed from the above, below or inside the character image using especially designed filters (as shown earlier in Figs[8,9a,9b,9c] ) specific to them. This dot-less image is stored as another copy of the character image. Now, the height and width of this new character image are measured with only considering black pixel, as was done before while taking samples and matched with the height and width of different classes in the classxml.xml file. When a class gets matched, then the image's string will be calculated again and matched with the already calculated strings of the matched class characters each. This classification helps in providing more efficient approach in pattern matching. • Chain code calculation At this stage, we knew the class to which the character belongs; now the original image will be used for recognition. And this original image is scanned from top to bottom for each column to obtain its' chain code. We will call this chain code, the "calculated chain code" and the one that is saved in characterset.xml file called "sampled chain code". The calculated chain code is calculated using the scanning the input image and generating the calculated string of on and off pixels as described earlier in section 3.1. • How to Recognize Character After determining the class of the character, we have to just match the calculated chain code with only some character's sampled chain codes. This makes the process of identification more reliable and efficient. The calculated chain code is matched column by column with every character's sampled chain code in the identified class. This process is repeated until all the columns are matched with some extent of error margin. As we have already set the error margin, character is identified up to the margin because different images can have little different properties, so exact pattern matching cannot be so efficient, to identify the exact character. o How to check Error Margin
  • 10. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 101 As any image can be error prone, we must satisfy the “chain code matching method” about the correct and exact character to recognize. In any case error count (mismatches) for each column is calculated, while matching total number of columns that is the width of the image. If width varies, equalize sample image width with the calculated image width. To equalize the width just add some characters like ‘.’ in the shorter image as new columns. Now both images contain equal number of columns to match with each other. Once we are done with width it’s time to check whether both columns are of equal height or not. To equalize the heights of sample column and calculated column again add characters like ‘.’ in shorter image as a new row in affected column. Now both images contain equal number of columns and in each column equal number of rows. Match column by column the chain code and calculate the mismatches as error count. If error count increases the already set value, get to the next image in the identified class only. Character is identified when error is less than error margin. 5. ALGORITHMS USED Chain Code Calculation • Start at right top of the image. • Scan from top to bottom of the first column of segmented image. • Assume that first pixels scanned are “off”, so count number of “off” pixels and add to chain code. • Continue scan, if pixel value changes from “off” to “on”, start counting “on” pixels and add to chain code. • Continue scan, counting “on” and “off” pixels till bottom of the image is reached and add to chain code. • At the end of the first column code add “@” as the end of the column. • Continue scanning the next columns to the width of the segmented image. Chain codes of all columns are calculated and saved as a single chain code in the xml file, in the corresponding class. Segmentation Line Segmentation: •••• Load input image. •••• Start scan from right top of the original image. •••• Scan up to the image width, on the same Y component. •••• Scan the first row to check any “on” pixels. If no “on” pixel is found go to next row. •••• Continue scan until a row containing “on” pixels come across. Save the row before this row as the top edge of the image text line. •••• Continue scan until another row with all “off” pixels is found. Set this row as the bottom edge of the image text line. •••• Save image as first text line and continue scan to find next line. •••• Continue till the bottom of the image is reached. Save all text line images to be input to character segmentation Character Segmentation: •••• Input text line image. •••• Start scan from right top of the text line image. •••• Scan up to the image height, on the same X component. 7
  • 11. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 102 •••• Scan the first column to check any “on” pixels. If no “on” pixel is found go to next column. •••• Continue scan until a column containing “on” pixels come across. Save the column before this column as the right hand edge of the isolated character image. •••• Continue scan until another column with all “off” pixels is found. Set this column as the left hand edge of the isolated character image. •••• Save image as first character and continue scan to find next character. •••• Continue till the left hand edge of the image text line is reached. Save all character images. Classification • Load the segmented character image. • Apply filters to remove diacritics, and get the shape of the corresponding class. • Measure the dimensions of the shape i.e. the width and height of the character image without diacritics, by scanning the image from top right to left bottom pixel by pixel. • Open xml file and traverse all the class nodes. Match “width” and “height” attribute values with the calculated values. • When a dimension match occurs according to some already set error margin (in this case error margin is 1, i.e. if width or height is one less or on greater than the sample values, match occurs.) set that class as the key class for that character. Character Matching •••• Restore the “original segmented character image” with diacritics and dots. Calculate its chain code according to the algorithm described earlier in this section. •••• Open xml file. Traverse all child nodes in the identified class and match “code” attribute values with the calculated chain code. •••• When a chain code match occurs according to Error Margin, described in next algorithm (Testing), set the “name” attribute of that node as the name of the character. And set the “Unicode” attribute of that node as the Unicode of the input character. Testing •••• Get the sampled chain codes in the identified class from xml file and the calculated chain code of the classified character image. •••• Start matching the calculated chain code with each of the sampled chain codes. •••• Chain code of each column is compared. •••• In each column every single code is compared with actual value (sampled value). In each column every single mismatch is counted as “error count”. If error count exceeds the Error Margin, set according to font and size, disqualify the node to be the exact match. •••• Continue with the next node chain code and check the Error Margin. Continue until a character is confirmed. Confirmation occurs when “total error count” of the character is less than the “Error Margin Limit”. Unicode File Creation •••• Characters are recognized through the above process. •••• Xml file includes all the information about the recognized character.
  • 12. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 103 •••• This recognized character is written into a Unicode file, according to the information written in the xml file. •••• The Unicode file is written from right to left. 6. RESULTS Unicode is simply a character encoding system. It has nothing to do with how these characters finally get displayed on computer screen. Thus it cannot be Naskh based only, its only type style issues relating to rendering which is handled by operating system and application not by encoding scheme [4]. After getting the identification of exact character, we create a new Unicode file that will include all the characters identified. All the Unicode characters are according to the Unicode Consortium [3]. We will place the Unicode of those characters in the same position from right to left as in the Urdu text image file. The output Unicode file is now editable and searchable. This file is written from right to left as Urdu language is right to left. 7. CONCLUSION This training system has recognized different printed Urdu text image files, which include isolated characters. The training system has been tested for different sizes of fonts, the results are quite impressive. Our system has achieved an accuracy of 89% for the isolated characters with accuracy of a 15 char/sec recognition rate. Different sizes of filters are used for different sizes, and this system has proven good for multi font sized characters. The system has been tested for different images and efficient results are found. Fig12.a shows the input image file, after being processed by our OCR, the screen shot of our output text file is shown in Fig.12b. FIGURE 12a: Input image FIGURE 13: Output text file 8. REFERENCES [1]. Afzal, M. and Hussain, S., “Urdu Computing Standards: Urdu Zabta Takhti (UZT) 1.01”, in the Proceedings of International IEEE Multi topic Conference (INMIC), Lahore University of Management Sciences (LUMS), Lahore, Pakistan, 2001. [2]. Ethnologue, Languages of Pakistan, http://www.ethnologue.com/show_country.asp?name=Pakistan
  • 13. Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz International Journal of Image Processing, (IJIP)Volume (3) : Issue (3) 104 [3]. See the Unicode Consortium website at http://unicode.org [4]. Bhurgari, A. M. 2007. Enabling Pakistani Languages through Unicode, published at http://download.microsoft.com/download/1/4/2/142aef9f-1a74-4a24-b1f4- 782d48d41a6d/PakLang.pdf [5]. Thresholding, Image Segmentation, Digital Image Processing 2/e Rafael C. Gonzalez, Richard E. Woods. [6]. Fast, Bruce B., Allen, Dana R. OCR image preprocessing method for image enhancement of scanned documents. [7]. Zaheer Ahmad, Jehanzeb Khan Orakzai, Inam Shamsher, and Awais Adnan. “Urdu Nastaleeq Optical Character Recognition”, “Proceedings of world academy of science, engineering and technology volume 26 december 2007”. [8]. U. Pal and Anirban Sarkar, “Recognition of Printed Urdu Script”, “Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)”. [9]. Khalid Saeed, “New Approaches for Cursive Languages Recognition: Machine and Hand Written Script and Texts”. [10]. T. Sari and M. Sellami, “Cursive Arabic Script Segmentation and Recognition System”. [11]. Bozinovic, R.M.; Srihari, S.N, “Off-line cursive script word recognition”. [12]. Soille, P. [2003]. “Morphological Image Analysis: Principles and Applications”, 2nd ed., Springer-Verlag, NY. [13]. Dougherty. E. R. and Lotufo, R. A. [2003]. “Hands-on Morphological Image Processing”, SPIE--The International Society for Optical Engineering, Bellingham, WA. [14]. International Journal of Pattern Recognition and Artificial Intelligence. [15]. Alasdari McAndrew, Anne Venables, “A ‘Secondary’ Look at Digital Image Processing”. [16]. Ganapathy, V., Lean, C.C.H., “Optical Character Recognition Program for Images of Printed Text using a Neural Network”. [17]. Nabeel Shahzad, Brandon Paulson, Tracy Hammond, “Urdu Qaeda: Recognition System for Isolated Urdu Characters”. [18]. Hermilo, Ernesto, Ramon M. “Efficiency of chain codes to represent binary objects”. [19]. Yong Kui Liua and Borut Žalik, “An efficient chain code with Huffman coding”. [20]. Shah, Z.A., “Ligature based optical character recognition of Urdu- Nastaleeq font”. [21]. Inam Shamsher, Zaheer Ahmad, Jahenzeb Khan Orakzai and Awais Adnan, “OCR For Printed Urdu Script Using Feed Forward Neural Network”. [22]. “The Origin of Urdu Language” http://www.essortment.com/all/urdulanguage_rguo.htm [23]. T.S El-Sheikh and R.M Guindi, “computer Recognition of Arabic Cursive Script,” Pattern Recognition, Vol.21, No, 4, 1988, pp.293-302. [24]. G. Nagy Rensselaer Polytechnic Institute Troy, New York, “Chinese Character Recognition A Twenty Five Year Retrospective”. Tsuyoshi Kitani t, riguchi and Masami Ilara Yoshio, “Pattern Matching in the Textract Information Extraction System”. 9