[ RMLL 2013, Bruxelles – Thursday 11th
July 2013 ]
Presentation of OpenNLP
Presenter : Dr Ir Robert Viseur
What is OpenNLP ?
• Toolkit for the processing of natural language text.
• Project of the Apache Foundation.
• Developed in Java.
• Under Apache License, Version 2.
• Download and documentation:
http://opennlp.apache.org/.
What are the features ?
• For common NLP tasks :
• tokenization,
• sentence segmentation,
• part-of-speech tagging,
• named entity extraction,
• chunking.
What is part-of-speech tagging?
• Example :
• See more:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html.
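To illustrate what a tagger produces, the sketch below parses text in the `token_TAG` form printed by the OpenNLP command-line POS tagger. It uses Python with the standard library only. The sample sentence and the helper name `parse_tagged` are assumptions made for this example, and the tags are assumed to follow the Penn Treebank convention of the English models.

```python
def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Split 'token_TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last underscore so tokens containing '_' survive.
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


# Illustrative tagger output (not taken from the slides).
sample = "Pierre_NNP Vinken_NNP will_MD join_VB the_DT board_NN ._."
tagged = parse_tagged(sample)

# Keep only the tokens tagged as singular proper nouns.
proper_nouns = [token for token, tag in tagged if tag == "NNP"]
```

Filtering on the tag, as in the last line, is the kind of step that terminology extraction builds on.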
What is named entity extraction?
• Example :
• See more:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html.
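The OpenNLP name finder marks entities inline with `<START:type> ... <END>` annotations. As a rough sketch, the Python snippet below pulls those entities out of annotated text with a regular expression. The sample sentence and the function name `extract_entities` are assumptions for illustration, not part of the presentation.

```python
import re

# Inline annotation style: <START:type> entity text <END>
ENTITY = re.compile(r"<START:(\w+)>\s*(.*?)\s*<END>")


def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, entity_text) pairs found in annotated text."""
    return ENTITY.findall(text)


# Illustrative annotated sentence (not taken from the slides).
annotated = "<START:person> Pierre Vinken <END> will join the board ."
entities = extract_entities(annotated)
```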
How does it work ? (1/2)
• Each feature relies on a pre-trained model.
• Each pre-trained model is built for one language and for one type of task.
• Supported languages: da, de, en, es, nl, pt, se.
• Warnings :
– Functional coverage varies between languages.
– French is not supported!
• See http://opennlp.sourceforge.net/models-1.5/.
• Usable from the command line or as a Java library.
• Warning: model loading time is noticeable when using the command-line interface (CLI).
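One way to limit the cost of model loading when the toolkit is used as a library is to load each model once and reuse it. The Python sketch below only illustrates that idea with a cache keyed on language and task: `load_model` is a hypothetical stand-in for a real loader, and the file-name pattern is an assumption made for the example.

```python
from functools import lru_cache

load_calls = []  # records every real load, to make the caching visible


@lru_cache(maxsize=None)
def load_model(language: str, task: str) -> str:
    """Hypothetical loader: stands in for reading a model file from disk."""
    load_calls.append((language, task))
    return f"{language}-{task}.bin"  # assumed naming, for illustration only


first = load_model("en", "sent")
second = load_model("en", "sent")  # served from the cache, no second load
other = load_model("de", "sent")   # another language needs its own model
```

Because one model covers one language and one task, the cache key has to include both.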
How does it work ? (2/2)
• Example (English vs Spanish languages) :
What are the selection criteria?
• Support of the product.
• License.
• Available languages.
• Precision / Recall.
• Speed of text processing.
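Precision and recall can be measured by comparing the entities a tool returns with a hand-made reference list. The Python sketch below shows the computation; the two sets are made-up data for illustration and do not come from the presentation.

```python
def precision_recall(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Precision = correct / returned, recall = correct / expected."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Made-up example: what a tool returned vs. what a human annotated.
predicted = {"Ada Lovelace", "Alan Turing", "GNAT Pro"}
expected = {"Ada Lovelace", "Alan Turing", "Grace Hopper", "Edsger Dijkstra"}
precision, recall = precision_recall(predicted, expected)
```

Here two of the three returned names are correct (precision 2/3) and two of the four expected names are found (recall 1/2).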
Are there free (as in freedom) alternative tools?
• Other lightweight tools:
• Stanford Log-linear Part-Of-Speech Tagger (POST),
• Stanford Named Entity Recognizer (NER),
• TagEN,
• Java Automatic Term Extraction toolkit.
• Frameworks :
• In Java: UIMA, GATE.
• In other languages : NLTK (Python).
Example:
tag cloud creation (1/6)
• Starting point: website.
• Example: www.adacore.com.
• What we want (from website content):
• common tag cloud,
• circular tag cloud.
• Main steps: crawl, cleaning of the HTML documents,
extraction of named entities (persons) and of terminology
(then merge), and display (tag cloud).
Example:
tag cloud creation (2/6)
• Cleaning:
• Remove the HTML tags and keep only the useful
content.
• Warnings:
• NLP tools are sensitive to noise in raw data.
• Pay attention to the language of the document.
• Use of an HTML boilerplate removal tool (HTML -> TXT).
• Tool: Boilerpipe.
• See http://code.google.com/p/boilerpipe/.
• Next: normalization of the text.
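The cleaning and normalization steps can be sketched with the Python standard library. This is not Boilerpipe: it only strips tags and skips `<script>` and `<style>` content, and it does not detect menus, footers or other boilerplate. The class name `TextExtractor` and the sample page are assumptions for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalization step: collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Ada</h1><p>Safe   and\nsecure software.</p></body></html>"
)
text = html_to_text(page)
```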
Example:
tag cloud creation (3/6)
• Named entity extraction.
• Standard behavior in OpenNLP: tags are added inline in the text.
• Here: extraction of Person named entities.
• Terminology extraction.
• First: part-of-speech tagging (POST).
• Next: identification and filtering (threshold) of:
• collocations (e.g. Noun_Noun, Adjective_Noun, ...),
• proper names (often: brands or people).
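The collocation step can be sketched as a scan over adjacent tagged tokens, keeping Noun_Noun and Adjective_Noun pairs whose frequency reaches a threshold. The Python sketch below assumes Penn Treebank tags (`NN*` for nouns, `JJ*` for adjectives). The threshold value, the function names and the sample tokens are illustrative choices, not the author's implementation.

```python
from collections import Counter


def is_noun(tag: str) -> bool:
    return tag.startswith("NN")  # assumed Penn Treebank noun tags


def is_adjective(tag: str) -> bool:
    return tag.startswith("JJ")  # assumed Penn Treebank adjective tags


def collocations(tagged, threshold=2):
    """Count Noun_Noun and Adjective_Noun bigrams, keep the frequent ones."""
    counts = Counter()
    for (word1, tag1), (word2, tag2) in zip(tagged, tagged[1:]):
        if is_noun(tag2) and (is_noun(tag1) or is_adjective(tag1)):
            counts[f"{word1.lower()} {word2.lower()}"] += 1
    return {term: n for term, n in counts.items() if n >= threshold}


# Made-up POS-tagged tokens for illustration.
tagged = [
    ("Static", "JJ"), ("analysis", "NN"), ("finds", "VBZ"), ("bugs", "NNS"),
    ("and", "CC"), ("static", "JJ"), ("analysis", "NN"), ("saves", "VBZ"),
    ("time", "NN"), (".", "."),
]
terms = collocations(tagged)
```

Lowercasing before counting merges "Static analysis" and "static analysis" into one term, which is one simple form of the filtering mentioned above.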
Example:
tag cloud creation (4/6)
• Process :
[Figure: process diagram. Labels, in pipeline order: Website (Internet), Crawl, Website (local), Raw HTML document, Conversion to text, Normalization, POS tagging, Terminology extraction, NE extraction, Tags, Merge, Tag cloud (for a website).]
Example:
tag cloud creation (5/6)
• Result: common tag cloud.
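The merge and display steps can be approximated as follows: tags from named entity extraction and from terminology extraction are merged by summing their weights, then each weight is mapped to a font size. This Python sketch uses linear scaling between a minimum and a maximum size. The sizes, the function names and the sample weights are assumptions for illustration.

```python
from collections import Counter


def merge_tags(*sources: dict[str, int]) -> Counter:
    """Sum the weights of tags coming from several extraction steps."""
    merged = Counter()
    for source in sources:
        merged.update(source)
    return merged


def font_sizes(tags, min_size=10, max_size=40):
    """Map each tag to a font size, scaled linearly on its weight."""
    low, high = min(tags.values()), max(tags.values())
    span = (high - low) or 1  # avoid dividing by zero when all weights match
    return {
        tag: round(min_size + (weight - low) * (max_size - min_size) / span)
        for tag, weight in tags.items()
    }


# Made-up weights for illustration.
people = {"Ada Lovelace": 4}
terminology = {"static analysis": 10, "Ada Lovelace": 1, "code coverage": 1}
sizes = font_sizes(merge_tags(people, terminology))
```

The heaviest tag gets the maximum size and the lightest the minimum; `font_sizes` expects at least one tag.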
Example:
tag cloud creation (6/6)
• Result: circular tag cloud.
Thanks for your attention.
Any questions ?
Contact
Dr Ir Robert Viseur
Email (@CETIC) : robert.viseur@cetic.be
Email (@UMONS) : robert.viseur@umons.ac.be
Phone : 0032 (0) 479 66 08 76
Website : www.robertviseur.be
This presentation is covered by the « CC-BY-ND » license.
