How to create
a corpus of
machine-readable texts:
challenges and solutions
What is OCR and how does it work?
Definition of OCR according to the Oxford
Dictionary of Computer Science, p. 379:
„OCR = optical character recognition; a
process in which a machine scans,
recognizes, and encodes information
printed or typed in alphanumerical
characters. (…) OCR software is now
readily available for many low-cost
scanners giving good recognition rates for
printed material using the Latin
alphabet. The more difficult problems
posed by other character sets and
handwriting are areas of ongoing
research.“
When was OCR software invented?
Ca. 1955: early OCR
devices recognised only a
limited set of characters in
machine-optimised fonts
mid-1970s: OCR A font and
OCR B font (similar to
normal letter-press
appearance)
Do we encounter OCR in everyday life?
High accuracy rates have popularised OCR in the following areas:
• banking (machines „reading“ paper cheques and transfer forms)
• public administration
• health-care (e.g. machine-readable prescriptions)
NOTE:
In cases where absolute perfection is needed,
OCR A and OCR B fonts are still used.
If sensitive information is handled, OCR
technology can be combined with
MICR technology (magnetic-ink character
recognition) to check the legitimacy or
originality of paper documents.
Do humanities tools use OCR?
Google Books: full-text search +
highlighting of text results
HathiTrust full-text view
What are OCR problems in historical research?
„Hannoverisches Magazin“, 1776
Best historical OCR results:
texts in standardised formats (e.g. periodicals)
Improving results for minority languages and old fonts –
an ongoing challenge
Recent innovation: merging OCR and handwriting
recognition technologies (HWR/HTR)
“Handwriting recognition (HWR), also known as Handwritten Text
Recognition (HTR), is the ability of a computer to receive and interpret
intelligible handwritten input from sources such as paper documents,
photographs, touch-screens and other devices. The image of the written
text may be sensed "off line" from a piece of paper by optical scanning
(optical character recognition) or intelligent word recognition. Alternatively,
the movements of the pen tip may be sensed "on line", for example by a
pen-based computer screen surface, a generally easier task as there are
more clues available. A handwriting recognition system handles formatting,
performs correct segmentation into characters, and finds the most plausible
words.”
Wikipedia.org
The machine learning revolution in OCR
How does machine learning work?
Cf. Stanford OCR pipeline:
• text detection (layout recognition)
• character segmentation (using
„sliding window“ technique)
• character classification
• spell correction
(http://doremi2016.logdown.com/posts/2017/01/20/standford-machine-learning-photo-ocr-machine-learning-pipeline)
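The character segmentation step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the toy image, the function names and the rule "a window with no ink is a gap" are assumptions standing in for a trained classifier.

```python
from typing import List


def column_ink(image: List[str]) -> List[int]:
    """Count ink pixels ('#') in each column of a binary text-line image."""
    return [sum(row[x] == "#" for row in image) for x in range(len(image[0]))]


def looks_like_gap(window: List[int]) -> bool:
    """Toy stand-in for a trained classifier: a window without ink is a gap."""
    return sum(window) == 0


def find_split_points(image: List[str], width: int = 1, step: int = 1) -> List[int]:
    """Slide a window of `width` columns along the line and collect gap centres."""
    ink = column_ink(image)
    splits = []
    for start in range(0, len(ink) - width + 1, step):
        if looks_like_gap(ink[start:start + width]):
            splits.append(start + width // 2)
    return splits


# Hypothetical 3-row text line: '#' is ink, '.' is background.
line = [
    "##..#.##",
    "#...#.#.",
    "##..#.##",
]
print(find_split_points(line, width=1))  # prints [2, 3, 5]
```

In a real system the gap test would be replaced by a classifier trained on examples of windows that do and do not straddle two characters.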
New OCR tools based on machine learning
E.g. OCR-D project
in Germany:
• improved visual
character
recognition
• context analysis of
n-grams
• trainer feedback to
exclude potential
mistakes
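To illustrate what context analysis of n-grams can mean in practice, here is a small sketch that ranks two candidate readings of a word by how common their character bigrams are. It is not taken from OCR-D; the training words, the function names and the add-one smoothing are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split a word into overlapping character n-grams, with start/end markers."""
    padded = "^" * (n - 1) + text + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(words, n=2):
    """Count how often each character n-gram occurs in the training words."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word, n))
    return counts


def score(word, counts, n=2):
    """Crude plausibility score: sum of smoothed log relative frequencies."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # add-one smoothing keeps unseen n-grams above zero
    return sum(
        math.log((counts[gram] + 1) / (total + vocab))
        for gram in char_ngrams(word, n)
    )


def pick_reading(candidates, counts):
    """Return the candidate reading with the highest score."""
    return max(candidates, key=lambda word: score(word, counts))


# Hypothetical training vocabulary; a real model would use a large corpus.
model = train(["the", "then", "there", "other", "mother"])
print(pick_reading(["tbe", "the"], model))  # prints the
```

The reading "tbe" loses because the bigrams "tb" and "be" never occur in the training words, while "th" and "he" occur in all of them.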
Current range of OCR tools for researchers
• Transkribus.eu (free of charge, cloud-based, each user contributes training data to the
community)
• OCR4all (free command-line OCR software for desktop-installation, difficult set-up,
does not run smoothly on Windows)
• KRAKEN (Python package for OCR, usage not monitored, data do not need to be
shared with developers or other users)
• ABBYY FineReader (one of the most popular proprietary OCR tools)
• Tesseract (originally developed as proprietary software at Hewlett Packard labs in
England and the US, released as open source in 2005, supported by Google since 2006,
available for Linux as well as Windows and Mac OS X, high pre-processing
requirements)
• PICCL/TICCL (free corpus building and corpus clean-up system performing spelling
correction and OCR post-correction, developed for Linux, requires a virtual machine
on Windows)
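One common pre-processing step of the kind the Tesseract entry alludes to is binarisation, i.e. turning a greyscale scan into pure ink and background. The sketch below uses Otsu's method on a flat list of grey values; it is written in plain Python for illustration, the sample values are invented, and it does not call Tesseract or any imaging library.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that best separates dark from light pixels.

    Pixels with value <= t form the dark class. The chosen t maximises the
    between-class variance (Otsu's method).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of grey values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no separation possible
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarise(pixels, threshold):
    """Map each grey value to 1 (ink) if it is at or below the threshold, else 0."""
    return [1 if p <= threshold else 0 for p in pixels]


# Invented sample: three dark (ink) pixels and three light (paper) pixels.
scan = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(scan)
print(t, binarise(scan, t))  # prints 12 [1, 1, 1, 0, 0, 0]
```

Historical prints with uneven paper tone often need a local rather than a global threshold, so this should be read as a starting point only.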
GBV-Verbund: Intranda OCR Service
And the development continues…
PROs and CONs of open-source OCR software:
CONs:
• takes up a lot of storage space
• difficult installation
• often limited performance on
Windows and Mac
• usually requires command-line
operation (no GUI)
• training your own models can be
time-consuming
• copyright issues if software
provider requires you to ingest
your (training) data into a public
pool
PROs:
• flexible integration of historical
texts in different digital formats
• adaptable to multiple languages
and new fonts / page layouts
Photo by Luca Bravo on Unsplash
How reliable are current OCR tools?
Results of an OCR test using the US
driver‘s licence, published on September 18,
2019:
https://mobidev.biz/blog/ocr-machine-learning-implementation
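Reliability is often reported as a character error rate (CER): the edit distance between the OCR output and a correct transcription, divided by the length of the transcription. The sketch below shows one way to compute it; the function names and the sample strings are illustrative and are not taken from the test cited above.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Invented example: the OCR engine reads "in" as "m".
print(levenshtein("Magazin", "Magazm"))  # prints 2
print(round(character_error_rate("Magazin", "Magazm"), 3))  # prints 0.286
```

A CER of 0.286 means that roughly 29 corrections are needed per 100 characters of the reference text.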
How can we integrate OCR into our own workflow?
Export scan as PDF or
image files to perform OCR!
Analyse non-coded
plain text with topic
modelling or
stylometry tools not
requiring structured
data!
Code information (e.g. in XML/TEI or
JSON) and use software to analyse
networks between specific tagged
entities or visualise geographic data!
Export scan as PDF or image file to let
humans extract metadata and
transcribe the text!
Original manuscript
or print
Use transcriptions to train OCR-
software and improve results for
similar sources (e.g. issues of the
same newspaper)!
Perform quantitative analysis on
more texts in less time and
generate more reliable results!
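The step of coding information in JSON could look roughly like the sketch below, which records where known names occur in an OCR'd page. The page text, the gazetteer and the field names are invented for illustration; a real project would follow its own schema or the TEI guidelines.

```python
import json


def tag_entities(text, gazetteer):
    """Find every occurrence of each gazetteer name and record its position."""
    entities = []
    for name, kind in gazetteer.items():
        start = text.find(name)
        while start != -1:
            entities.append({
                "text": name,
                "type": kind,
                "start": start,
                "end": start + len(name),
            })
            start = text.find(name, start + 1)
    return sorted(entities, key=lambda entity: entity["start"])


# Invented OCR output and a tiny hand-made gazetteer.
page = "Printed in Hannover in 1776."
gazetteer = {"Hannover": "place"}

record = {
    "source": "example page",
    "text": page,
    "entities": tag_entities(page, gazetteer),
}
print(json.dumps(record, indent=2))
```

Records of this kind can then be loaded by network or mapping software, since each tagged entity carries a type and an exact position in the text.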
Testing a cloud-based OCR tool: transkribus.eu
