Computer Lexica in OCR and Retrieval
Katrien Depuydt (Instituut voor Nederlandse Lexicologie, Leiden)
Overview
  • What is a computer lexicon
  • Lexica in IMPACT
  • Tools for lexicon building and applying lexica
  • Some results
  • Searching
  • Demonstration
IMPACT <Demo Day BL, 12 July 2011>
What is a computer lexicon?
Computer lexicon vs electronic dictionary (1)
An electronic dictionary is:
  • Digitised full text (no pictures)
  • For human use
  • Ideally: searchable, with explicitly coded material (XML) such as lemma, part of speech (PoS), meaning, quotes, etc.
  • Examples: OED online, WNT online
Dictionary XML (example)
Computer Lexicon vs Electronic Dictionary (2)
A computer lexicon is:
  • Always in a structured digital format (XML, relational database)
  • Main purpose: computer application
  • Explicitly coded information (e.g. lemma, part of speech, morphology, syntax)
  • Examples of use:
    - Linguistic enrichment of text material
    - ‘Advanced’ searching (words with all spelling variants and inflections)
    - Automatic summarization, keyword extraction…
Lexica in IMPACT
The OCR lexicon
An OCR lexicon is:
  • A checked list of words in a language
  • Based on a corpus (collection) of dated texts (selection!)
  • Preferably with frequency information
  • Preferably from the same time period or of the same text type as the texts you wish to digitize
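The frequency part of such a word list can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `build_ocr_lexicon`, the tokenisation rule, and the three sample sentences are made up for this sketch, and the manual checking step that makes the list a "checked" list is not modelled.

```python
import re
from collections import Counter

def build_ocr_lexicon(texts, min_count=1):
    """Count word forms in a corpus; return (word, frequency) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        # Runs of letters only. Case is kept on purpose, so "De" and "de"
        # stay separate entries, as "Do" and "Water" do in a historical list.
        counts.update(re.findall(r"[^\W\d_]+", text))
    return [(word, n) for word, n in counts.most_common() if n >= min_count]

# Hypothetical mini-corpus; a real lexicon would be built from dated texts
# of the same period and text type as the material to be digitised.
corpus = ["De werelt is groot", "de wereld is groot", "De wereld"]
lexicon = build_ocr_lexicon(corpus)
```

Splitting the corpus by period before counting would give one list per period, which is how a period-specific lexicon differs from a general one.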
OCR lexicon: example
  1550-1750            > 1900
  song      820        television    418
  rihte     818        electronic    375
  theire    818        video         194
  manye     818        hormone       176
  sume      815        jazz          162
  Do        814        eco           142
  Whiche    811        software      136
  fyrst     811        vitamin       128
  while     811        movie         121
  Water     810        taxi          113
  wt        809        isotopic      108
  shalbe    808        electronics    95
  thingis   807        radar          86
  again     806        basically      71
  sona      806        sabotage       71
  wa        805        homozygote     70
  mode      804        psychedelic    67
  work      802        phonemic       66
  between   801        insulin        64
  law       799        zap            64
  moder     798        antibody       61
  mis       798        fungicidal     61
  softe     798
The IR lexicon
IR lexicon: most important information categories
  • Word forms (lists of words), plus:
    - frequency information
    - quotes (dated sources) from corpora or electronic dictionaries
    - MODERN LEMMA (cf. the headword of a dictionary entry), linked to spelling variants and inflected forms of the same word
  • The modern lemma is used for searching in texts
  • Standard use in corpus linguistics and modern historical lexicography
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
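To show how an entry like this links a modern lemma to its attested word forms, here is a minimal Python sketch that reads the element names used on the slide into a lemma-to-forms index. The sample string is a shortened copy of the slide's entry; the closing `</lexical_entry>` and `</lexicon>` tags are an assumption, since the slide cuts off before them, and `lemma_index` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened copy of the slide's entry. The two closing tags at the end are
# assumed; the slide shows only the beginning of the file.
SAMPLE = """<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>"""

def lemma_index(xml_text):
    """Map each modern lemma to the set of written forms attested for it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(set)
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma", default="").strip()
        for form in entry.iter("written_form"):
            if form.text and form.text.strip():
                index[lemma].add(form.text.strip())
    return dict(index)

index = lemma_index(SAMPLE)
```

A search on the modern lemma can then be widened to every form in `index[lemma]`, which is the role the IR lexicon gives to the modern lemma.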
Tools for lexicon building and application of lexica
Types of variation (spelling, inflection…)
I. uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
II. werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
(patterns to predict variation)
(a number are predictable with patterns, others need to be taken from a lexicon)
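The split between pattern-predictable variants and variants that must come from a lexicon can be illustrated with a small Python sketch. The four rewrite rules are assumptions chosen for this example only; they are not the patterns used in IMPACT, and a real rule set would be larger and derived from data.

```python
import re

# Illustrative rewrite rules (historical spelling -> modern spelling), applied in order.
RULES = [
    (r"^wt", "uit"),   # wterlijk   -> uiterlijk
    (r"uy", "ui"),     # uyterlijck -> uiterlijck
    (r"ck", "k"),      # uiterlijck -> uiterlijk
    (r"y", "ij"),      # uiterlyk   -> uiterlijk
]

def normalise(form):
    """Apply every rewrite rule in order and return the resulting spelling."""
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form

def split_by_predictability(forms, modern_form):
    """Separate forms the rules map onto modern_form from those they do not."""
    predictable, lexicon_only = [], []
    for form in forms:
        target = predictable if normalise(form) == modern_form else lexicon_only
        target.append(form)
    return predictable, lexicon_only

forms = ["uyterlijck", "wterlijk", "uiterlyk", "uyterlyck", "ulterlijk"]
predictable, lexicon_only = split_by_predictability(forms, "uiterlijk")
```

With these rules the first four forms normalise to "uiterlijk", while "ulterlijk" does not and would have to be listed explicitly in the lexicon.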
Neil Fitzgerald, 7th July 2011
Computer lexica
  • For OCR and OCR post-correction
  • Improving searchability of historical text material
    - by building a lexicon with variants
    - by using a modern lemma as a search entry
  • Tools for lexicon building
  • Tools for application of the lexicon in search engines
  • Lexicon cookbook
  • Guidelines and tools to use the lexica in OCR
Tools (more specific)
  • Lexicon building from corpus material and dictionaries
  • Use of lexica in search engines
  • Tool to extract spelling variation patterns from historical material
  • Tool to relate previously unrecognised spelling variants to their standard form
  • Tool to reduce previously unrecognised inflected forms to their base form
Ordinary words vs Names (NEs)
  • Tools for the automatic recognition, classification and finding of variant names
  • Wish of the libraries
  • Separate regular vocabulary from names
  • Reduce unpleasant results: Abimelech → apemelk! (b/p; i/e; e/0; k/ch) (apemelk means "monkey milk")
  • NE lexica
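The Abimelech example can be reproduced with a short Python sketch: applying the four patterns from the slide one occurrence at a time turns the name into the ordinary word in four steps. The search procedure, the function names, and the guard that switches off pattern matching for known names are assumptions made for illustration; they are not the IMPACT tools.

```python
# The four patterns from the slide, read here as "left side may be spelled as right side".
RULES = [("b", "p"), ("i", "e"), ("e", ""), ("ch", "k")]

def apply_once(word, old, new):
    """Yield every string obtained by replacing one occurrence of old with new."""
    start = word.find(old)
    while start != -1:
        yield word[:start] + new + word[start + len(old):]
        start = word.find(old, start + 1)

def reachable(word, rules, max_steps):
    """Return all strings reachable from word in at most max_steps rule applications."""
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        next_frontier = set()
        for current in frontier:
            for old, new in rules:
                for variant in apply_once(current, old, new):
                    if variant not in seen:
                        seen.add(variant)
                        next_frontier.add(variant)
        frontier = next_frontier
    return seen

def fuzzy_match(query, candidate, name_lexicon, max_steps=4):
    """Pattern-based matching that is switched off for known names (hypothetical guard)."""
    if query in name_lexicon:
        return query == candidate
    return candidate in reachable(query, RULES, max_steps)

names = {"abimelech"}                                       # stand-in for an NE lexicon
unguarded = "apemelk" in reachable("abimelech", RULES, 4)   # the unpleasant result
guarded = fuzzy_match("abimelech", "apemelk", names)        # suppressed by the name list
```

The path is abimelech, apimelech, apemelech, apemelch, apemelk. Keeping names in a separate NE lexicon lets a search engine treat them differently from regular vocabulary.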
A number of results for Dutch and German
Ground truth data: Dutch
  Type and genre                                # words
  Gold Standard Book                            300k
  Random Set Books                              340k
  Random Set Staten Generaal (Legal Papers)     2.5M
  Gold Standard Staten Generaal                 500k
  Gold Standard Newspapers 1                    3.4M
  Gold Standard Newspapers 2                    170k
  Random Set Newspapers                         3.2M
  Total                                         13.1M
Lexicon coverage (1: ground truth books)
                                     Type coverage   Token coverage
  Modern lexicon (e-Lex)             46%             76%
  Core general lexicon               56%             84%
  1 + 2                              63%             89%
  Expansion with corpus material     78%             95%
Lexicon coverage (2: GT newspapers, 18th-19th C.)
                                     Type coverage   Token coverage
  Modern lexicon (e-Lex)             40%             83%
  Core general lexicon               41%             84%
  1 + 2                              51%             89%
  Expansion with corpus material     62%             95%
Lexicon coverage (3: GT Staten Generaal, 19th C.)
                                     Type coverage   Token coverage
  Modern lexicon (e-Lex)             51%             89%
  Core general lexicon               47%             88%
  1 + 2                              58%             93%
  Expansion with corpus material     68%             97%
Lexicon coverage (4: GT Staten Generaal, 20th C.)
                                     Type coverage   Token coverage
  Modern lexicon (e-Lex)             70%             93%
  Core general lexicon               66%             93%
  1 + 2                              76%             96%
  Expansion with corpus material     81%             98%
Lexicon coverage (5: Genesis, 1637 bible)
                                     Type coverage   Token coverage
  Modern lexicon (e-Lex)             31%             61%
  Core lexicon                       62%             83%
  1 + 2                              65%             89%
  Expansion with corpus material     87%             98.6%
Lexicon coverage (6: P.C. Hooft, histories)
                                     Type coverage   Token coverage
  Modern lexicon (e-Lex)             26%             67%
  Core lexicon                       47%             88%
  1 + 2                              50%             90%
  Expansion with corpus material     58%             96%
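Type coverage and token coverage, as reported in the coverage tables, can be computed as in the following Python sketch. The sentence and the small lexicon are invented for illustration, and `coverage` is a hypothetical helper name.

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a lexicon over a tokenised text."""
    counts = Counter(tokens)
    covered_types = [word for word in counts if word in lexicon]
    # Type coverage: share of distinct word forms found in the lexicon.
    type_coverage = len(covered_types) / len(counts)
    # Token coverage: share of running words found in the lexicon.
    token_coverage = sum(counts[word] for word in covered_types) / sum(counts.values())
    return type_coverage, token_coverage

# Invented example: 9 running words, 7 distinct forms.
tokens = "de werelt is groot en de wereld is rond".split()
lexicon = {"de", "is", "groot", "en", "wereld"}
type_cov, token_cov = coverage(tokens, lexicon)
```

Here 5 of 7 types but 7 of 9 tokens are covered. Frequent words are more likely to be in a lexicon, which is why token coverage is higher than type coverage in every table.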
Evaluation of OCR
  • Finereader SDK (version 9, 10)
  • External dictionary interface (implementation module)
  • Challenges
    - Translation of corpus frequencies to weights 0-100
    - Broken words, case sensitivity, …
    - Problem with long ‘s’ (workaround)
  • Lexicon data
    - IMPACT OCR lexicon for Dutch
    - Finereader internal lexicon
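The slide does not say how corpus frequencies were translated to weights between 0 and 100. One plausible option, shown here purely as an assumption, is logarithmic scaling against the most frequent word, so that the long tail of rare words is not squeezed to zero. The function name and the choice of formula are hypothetical; the three counts are taken from the "> 1900" column of the OCR lexicon example.

```python
import math

def frequency_to_weight(freq, max_freq):
    """Map a corpus frequency to an integer weight in 0..100 (assumed log scaling)."""
    if freq <= 0 or max_freq <= 0:
        return 0
    scaled = 100 * math.log(freq + 1) / math.log(max_freq + 1)
    return min(100, round(scaled))

counts = {"television": 418, "jazz": 162, "zap": 64}
max_freq = max(counts.values())
weights = {word: frequency_to_weight(n, max_freq) for word, n in counts.items()}
```

A linear mapping would be simpler, but it would give most words in a large corpus a weight of 0 or 1, which is the reason a log scale is used in this sketch.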
OCR results: word recognition rate (dataset DPO35)
  With ABBYY internal Dutch lexicon                                      88.8%
  With IMPACT lexicon for Dutch (case, hyphenation)                      90.9%
  With IMPACT lexicon for Dutch (case, hyphenation + long s problem)     93.5%
An example:
OCR at the beginning of the project:
  A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde; de derde de zwaarde, daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.
Results:
  A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
Number of word errors and reduction of error rate, per dictionary and century
                            16th century          18th century          19th century
  Dictionary                Errors   Reduction    Errors   Reduction    Errors   Reduction
  No Lexicon                1306     -            827      -            2074     -
  Optimal Lexicon           756      42%          395      52%          612      70%
  Modern Lexicon            1096     16%          501      39%          888      57%
  W.Historical Lexicon      938      28%          481      42%          856      59%
  Modern + Virtual H.L.     1011     25%          480      42%          849      59%
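The "reduction of error rate" figures appear to be the share of the no-lexicon word errors that disappear when a lexicon is used. The Python sketch below recomputes the "Optimal Lexicon" row under that assumption; the function name is hypothetical.

```python
def error_reduction(baseline_errors, errors):
    """Percentage of baseline word errors removed by using a lexicon."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 100 * (baseline_errors - errors) / baseline_errors

# Word error counts from the table: no lexicon vs optimal lexicon.
BASELINE = {"16th": 1306, "18th": 827, "19th": 2074}
OPTIMAL = {"16th": 756, "18th": 395, "19th": 612}

reductions = {
    century: round(error_reduction(BASELINE[century], OPTIMAL[century]))
    for century in BASELINE
}
```

This reproduces 42%, 52% and 70%. Recomputing the 16th-century "Modern + Virtual H.L." cell the same way gives about 23% rather than the 25% on the slide; the slide gives no indication of which of the two numbers is off.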
Languages in IMPACT
  • Dutch, German, English, Spanish, French
  • Polish, Czech, Slovene and Bulgarian
  • Cross-language perspective paper
  • Parallel OCR and IR experiments
  • GT datasets
  • Language tools: language independent
  • Except for the 3 core languages: proof-of-concept lexica
English in IMPACT
  • Lexicon building using OED
    - OCR lexicon from quotations full text, possibly supplemented with corpus material
    - IR lexicon from headword variants in quotations (small demo)
  • Named Entity Recognition on newspaper material
    - NE lexicon
    - Gold standard corpus
    - NE recognition (CONLL) (Named Entity Recognition Task Definition, by N. Chinchor, E. Brown, L. Ferro, and P. Robinson, Version 1.4 (1999))
    - PER, LOC, ORG
  • Research into the possible benefits of excluding modern words from the OCR lexicon
An indemnity shall be granted to the surfer…. … bikini …
Retrieval demonstrator
  • Indexing and retrieval library (Java) implemented on the Lucene search engine
  • Lexicon in MySQL database
  • OCR with Finereader SDK and external dictionary interface of about 2000 images of the Dutch Ground Truth selection
  • Page XML output [in framework]
  • NE tagging
  • Indexing and retrieval while using lexicon and NE tagging
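Retrieval with a lexicon comes down to expanding a query on a modern lemma into a query on all its attested forms. The Python sketch below builds such an expanded query string in the style of Lucene's classic query syntax. It is a simplified stand-in: the demonstrator itself is a Java library with the lexicon in MySQL, whereas here the lexicon is a plain dictionary, and the function name and the field name `text` are assumptions.

```python
def expand_query(lemma, lexicon, field="text"):
    """Build an OR query over a modern lemma and all variants listed for it."""
    forms = sorted({lemma, *lexicon.get(lemma, ())})
    # Quote each form so that it is matched as a literal term.
    alternatives = " OR ".join(f'"{form}"' for form in forms)
    return f"{field}:({alternatives})"

# Hypothetical lexicon fragment with three of the variants listed for "wereld".
lexicon = {"wereld": {"werelt", "weerelt", "waereld"}}
query = expand_query("wereld", lexicon)
```

A user typing the modern form "wereld" would then also retrieve pages containing the historical spellings, without needing to know them.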

Language Tools for OCR with Katrien Depuydt

Editor's Notes

  • #5: This presentation is based on how the INL works with language. An electronic dictionary is not what we need for OCR and simple retrieval, but it is introduced anyway because we can (and do) use our dictionaries for lexicon construction.
  • #6: This is what an XML-based electronic dictionary looks like.
  • #7: This is the XML of the Oxford English dictionary. The horizontal lines mark a place where part of the structure has been folded in.
  • #8: <ed> We need further explanation of what ‘lemma’, ‘part of speech’ and ‘morphology’ mean. Lemma: headword, like the entry in an ordinary dictionary. Morphology: morphological analysis is done for compounds and derivatives: which parts are to be distinguished in a word, e.g. apple pie: apple + pie.
  • #9: This is a small part of a computational lexicon (of a certain type; there are many types of computational lexica).
  • #13: <ed> Again, unsure of what LEMMA means. Be, was, am, is, etc. are all forms of the same word BE (and that is an example of a lemma).
  • #16: Two types of variation, examples for Dutch from the lexicon
  • #17: To give an indication of possible spelling variants of the word ‘world’ for English, a screenshot from the OED online...
  • #18: These are some of the ways in which we are using Computer lexica as building blocks.
  • #32: These are results with a rather limited historical lexicon of German.
  • #34: CONLL stands for Computational Natural Language Learning.
  • #35: 322445 (fourth column, in the middle) 424979