How to create a corpus of machine-readable texts: challenges and solutions


What is OCR and how does it work?
Definition of OCR according to the Oxford Dictionary of Computer Science, p. 379:
“OCR = optical character recognition; a process in which a machine scans, recognizes, and encodes information printed or typed in alphanumerical characters. (…) OCR software is now readily available for many low-cost scanners giving good recognition rates for printed material using the Latin alphabet. The more difficult problems posed by other character sets and handwriting are areas of ongoing research.”

When was OCR software invented?
• ca. 1955: early OCR devices recognised only a limited set of characters in a machine-optimised font
• mid-1970s: OCR A font and OCR B font (similar to normal letter-press appearance)

Do we encounter OCR in everyday life?
High accuracy rates have popularised OCR in the following areas:
• banking (machines “reading” paper cheques and transfer forms)
• public administration
• health care (e.g. machine-readable prescriptions)
NOTE:
In cases where absolute perfection is needed, OCR A and OCR B fonts are still used.
If sensitive information is handled, OCR technology can be combined with so-called MICR technology (magnetic-ink character recognition), which checks the legitimacy or originality of paper documents.

Are humanities tools using OCR?
• Google Books: full-text search + highlighting of text results
• HathiTrust: full-text view

What are OCR problems in historical research?
„Hannoverisches Magazin”, 1776

Best historical OCR results:
texts in standardised formats (e.g. periodicals)

Improving results for minority languages and old fonts: an ongoing challenge

Recent innovation: merging OCR and handwriting recognition technologies (HWR/HTR)
“Handwriting recognition (HWR), also known as Handwritten Text Recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning (optical character recognition) or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most plausible words.”
Source: Wikipedia.org

The machine learning revolution in OCR

How does machine learning work?
Cf. Stanford OCR pipeline:
• text detection (layout recognition)
• character segmentation (using “sliding window” technique)
• character classification
• spell correction
(http://doremi2016.logdown.com/posts/2017/01/20/standford-machine-learning-photo-ocr-machine-learning-pipeline)
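
The character segmentation step can be illustrated with a small sketch. This is a toy under stated assumptions, not the Stanford pipeline itself: the text line is a hypothetical binary image (1 = ink, 0 = background), and the function `is_gap` is a hand-written stand-in for the trained classifier that a real system would use to decide whether a window is centred on a gap between two characters.

```python
from typing import Callable, Iterator, List, Tuple

Image = List[List[int]]  # rows of pixels: 1 = ink, 0 = background


def sliding_windows(line_image: Image, width: int, step: int = 1) -> Iterator[Tuple[int, Image]]:
    """Yield (x_offset, window) pairs, moving left to right across a text line."""
    n_cols = len(line_image[0])
    for x in range(0, n_cols - width + 1, step):
        yield x, [row[x:x + width] for row in line_image]


def is_gap(window: Image) -> bool:
    """Toy stand-in for a trained classifier: the centre column contains no ink."""
    centre = len(window[0]) // 2
    return all(row[centre] == 0 for row in window)


def find_split_points(line_image: Image, width: int = 3,
                      classifier: Callable[[Image], bool] = is_gap) -> List[int]:
    """Return the x positions at which the classifier sees a gap between characters."""
    return [x + width // 2
            for x, window in sliding_windows(line_image, width)
            if classifier(window)]


# Hypothetical 3x5 text line: two blobs of ink separated by one empty column.
LINE = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]

print(find_split_points(LINE))  # [2]
```

Swapping `is_gap` for a learned model leaves the windowing logic unchanged; only the decision per window becomes statistical.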

New OCR tools based on machine learning
E.g. OCR-D project in Germany:
• improved visual character recognition
• context analysis of n-grams
• trainer feedback to exclude potential mistakes
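
How n-gram context can help to choose between competing readings is sketched below. This is an illustration only and not OCR-D's implementation: the reference word list, the candidate readings and all function names are assumptions made up for the example, and the model is a simple character-bigram count with add-one smoothing.

```python
import math
from collections import Counter
from typing import Iterable, List


def char_ngrams(word: str, n: int = 2) -> List[str]:
    """Split a word into character n-grams, with ^ and $ marking word boundaries."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(words: Iterable[str], n: int = 2) -> Counter:
    """Count how often each character n-gram occurs in a reference word list."""
    counts: Counter = Counter()
    for word in words:
        counts.update(char_ngrams(word, n))
    return counts


def score(word: str, counts: Counter, n: int = 2) -> float:
    """Log-probability of a word under the n-gram counts (add-one smoothing)."""
    total = sum(counts.values())
    vocab = len(counts) + 1
    return sum(math.log((counts[gram] + 1) / (total + vocab))
               for gram in char_ngrams(word, n))


def pick_best(candidates: Iterable[str], counts: Counter) -> str:
    """Return the candidate reading that fits the n-gram statistics best."""
    return max(candidates, key=lambda word: score(word, counts))


# Hypothetical reference vocabulary and two competing readings of one word image.
MODEL = train(["Haus", "Maus", "aus", "Hausse", "raus"])
print(pick_best(["Hauf", "Haus"], MODEL))  # Haus
```

A production system would train on a large corpus of the relevant period and language rather than on five words.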

Current range of OCR tools for researchers
• Transkribus.eu (free of charge, cloud-based, each user contributes training data to the community)
• OCR4all (free command-line OCR software for desktop installation, difficult set-up, does not run smoothly on Windows)
• KRAKEN (Python package for OCR, usage not monitored, data do not need to be shared with developers or other users)
• ABBYY FineReader (one of the most popular proprietary OCR tools)
• Tesseract (originally developed as proprietary software at Hewlett Packard labs in England and the US, released as open source in 2005, supported by Google since 2006, available for Linux as well as Windows and Mac OS X, high pre-processing requirements)
• PICCL/TICCL (free corpus building and corpus clean-up system performing spelling correction and OCR post-correction, developed for Linux, requires a virtual machine on Windows)
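
As an illustration of command-line operation, the following sketch calls Tesseract from Python. It assumes that the `tesseract` executable and the requested language data (here `deu`) are installed; the file names and both function names are hypothetical. The command follows Tesseract's basic pattern `tesseract IMAGE OUTPUTBASE -l LANG`, which writes the recognised text to `OUTPUTBASE.txt`.

```python
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(image_path: str, output_base: str, lang: str = "deu") -> List[str]:
    """Assemble the command line; Tesseract appends the .txt extension itself."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path: str, output_base: str, lang: str = "deu") -> str:
    """Run Tesseract on one page image and return the recognised plain text."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Hypothetical usage (requires a local Tesseract installation and a scanned page):
# text = run_ocr("page_001.png", "page_001", lang="deu")
print(build_tesseract_command("page_001.png", "page_001"))
```

Pre-processing (deskewing, binarisation, cropping) is not covered here and would normally happen before the image is passed to the OCR engine.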

GBV-Verbund: Intranda OCR Service

And the development continues…

PROs and CONs of open-source OCR software:
CONs:
• takes up a lot of storage space
• difficult installation
• often limited performance on Windows and Mac
• usually requires command-line operation (no GUI)
• conducting own training can be time-consuming
• copyright issues if the software provider requires you to ingest your (training) data into a public pool
PROs:
• flexible integration of historical texts in different digital formats
• adaptable to multiple languages and new fonts / page layouts
Photo by Luca Bravo on Unsplash

How reliable are current OCR tools?
Results of an OCR test using the US driver's licence, published on September 18, 2019:
https://mobidev.biz/blog/ocr-machine-learning-implementation
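
One common way to quantify reliability on your own material is the character error rate (CER): the edit distance between the OCR output and a manually checked transcription, divided by the length of that transcription. The sketch below is a minimal pure-Python version; the sample strings are invented for illustration and do not come from the test cited above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance relative to the length of the checked transcription."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


# Invented example: two misread characters in a seven-character word.
print(levenshtein("Maqazln", "Magazin"))                     # 2
print(round(character_error_rate("Maqazln", "Magazin"), 3))  # 0.286
```

A CER of 0.286 means roughly 29 errors per 100 characters; the same function can be applied page by page to compare tools or training runs.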

How can we integrate OCR into our own workflow?
Original manuscript or print
• Export scan as PDF or image files to perform OCR!
• Export scan as PDF or image file to let humans extract metadata and transcribe the text!
• Use transcriptions to train OCR software and improve results for similar sources (e.g. issues of the same newspaper)!
• Analyse non-coded plain text with topic modelling or stylometry tools not requiring structured data!
• Code information (e.g. in XML/TEI or JSON) and use software to analyse networks between specific tagged entities or visualise geographic data!
• Perform quantitative analysis on more texts in less time and generate more reliable results!
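
The step of coding information in JSON could look roughly like the sketch below. It is a minimal illustration, not a recommended schema: the field names (`text`, `entities`, `type`, `start`, `end`), the sample sentence and the gazetteer of place names are all assumptions, and matching is done by plain string search.

```python
import json
import re
from typing import Dict, Iterable


def annotate_places(text: str, places: Iterable[str]) -> Dict:
    """Record every occurrence of a known place name with its character offsets."""
    entities = [
        {"type": "place", "text": match.group(0),
         "start": match.start(), "end": match.end()}
        for place in places
        for match in re.finditer(re.escape(place), text)
    ]
    entities.sort(key=lambda entity: entity["start"])
    return {"text": text, "entities": entities}


# Invented sample sentence and a one-entry gazetteer.
RECORD = annotate_places("The magazine was printed in Hannover.", ["Hannover"])
print(json.dumps(RECORD, indent=2, ensure_ascii=False))
```

Keeping the offsets next to the untouched text lets network or mapping software pick up the tagged entities without altering the transcription.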

Testing a cloud-based OCR tool: transkribus.eu
