Semantic Text Processing Powered by Wikipedia
Maxim Grinev

Technology Overview

Next-Generation Text Analysis bootstrapped by Wikipedia

Wikipedia is a new enabling resource for NLP:
  • Comprehensive coverage (6M terms versus 65K in Britannica)
  • Continuously brought up to date
  • Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, infoboxes)

New Algorithms:
  • Advanced NLP: Word Sense Disambiguation, Keywords Extraction, Topic Inference
  • Automatic Ontology Management: Organizing Concepts into Thematically Grouped Tag Clouds
  • Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation
  • Improved Recommendations: Semantic Document Similarity

Zero-cost deployment and customization: no machine learning techniques that require human labor, no "cold start"

Basic Technique: Semantic Relatedness of Terms
  • We analyse the Wikipedia link structure to compute the semantic relatedness of Wikipedia terms
  • We use the Dice measure with weighted links (bi-directional links, direct links, "see also" links, etc.)

Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov. "Accuracy Estimate and Optimization Techniques for SimRank Computation". VLDB 2008
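
A minimal sketch of one way a Dice measure could be extended to weighted links. The slides do not give the exact formula or the weight values, so `LINK_WEIGHTS`, `weighted_dice`, and the toy link data below are illustrative assumptions. With all weights equal to 1, the function reduces to the classic Dice coefficient 2|A∩B| / (|A| + |B|).

```python
# Hypothetical weights per link type; the slides do not state actual values.
LINK_WEIGHTS = {"bidirectional": 1.0, "see_also": 0.8, "direct": 0.5}


def weighted_dice(links_a, links_b, weights=LINK_WEIGHTS):
    """Weighted Dice relatedness of two terms.

    links_a and links_b map a neighbouring article title to the type of
    link that connects it to the term.
    """
    total = (sum(weights[t] for t in links_a.values())
             + sum(weights[t] for t in links_b.values()))
    if total == 0:
        return 0.0
    shared = links_a.keys() & links_b.keys()
    # A shared neighbour contributes the weight of its link on both sides.
    overlap = sum(weights[links_a[n]] + weights[links_b[n]] for n in shared)
    return overlap / total


# Toy link sets (made up for illustration, not real Wikipedia data).
term_a = {"HTTP": "bidirectional", "Open source": "direct", "Web server": "bidirectional"}
term_b = {"HTTP": "direct", "Web server": "bidirectional", "Proxy server": "see_also"}

print(round(weighted_dice(term_a, term_b), 3))
```

The score is symmetric and lies between 0 (no shared neighbours) and 1 (identical link sets).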

Terms Detection and Disambiguation
  • Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians
  • We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text
  • Example: Platform is mentioned in the context of implementation, open-source, web-server, HTTP

Denis Turdakov, Pavel Velikhov. "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation". SYRCoDIS, 2008
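
The Platform example can be sketched as follows: among the candidate meanings listed on a disambiguation page, pick the one most related to the unambiguous terms around it. This is a simplified illustration, not the authors' implementation; the candidate names, the `TOY_RELATEDNESS` scores, and the function names are assumptions.

```python
# Toy relatedness scores (made-up numbers for illustration only).
TOY_RELATEDNESS = {
    frozenset(("Computing platform", "HTTP")): 0.6,
    frozenset(("Computing platform", "Web server")): 0.7,
    frozenset(("Computing platform", "Open source")): 0.5,
    frozenset(("Oil platform", "Open source")): 0.05,
}


def toy_relatedness(a, b):
    """Look up a symmetric relatedness score; unknown pairs score 0."""
    return TOY_RELATEDNESS.get(frozenset((a, b)), 0.0)


def disambiguate(candidates, context, rel):
    """Return the candidate meaning with the highest total relatedness
    to the context terms (the first candidate wins on ties)."""
    return max(candidates, key=lambda c: sum(rel(c, t) for t in context))


candidates = ["Railway platform", "Oil platform", "Computing platform"]
context = ["Open source", "Web server", "HTTP"]
print(disambiguate(candidates, context, toy_relatedness))
```

In a real system the relatedness function would be computed from the Wikipedia link structure rather than read from a table.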

Keywords Extraction
  • Build the document semantic graph using semantic relatedness between the Wikipedia terms detected in the doc
  • Discover the community structure of the document semantic graph
      • Community: a densely interconnected group of nodes in a graph
      • Girvan-Newman algorithm for detecting community structure in networks
  • Select the "best" communities:
      • Dense communities contain key terms
      • Sparse communities contain unimportant terms and possible disambiguation mistakes

Maria Grineva, Maxim Grinev, Dmitry Lizorkin. "Extracting Key Terms From Noisy and Multitheme Documents". WWW2009: 18th International World Wide Web Conference
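
The pipeline above can be sketched in plain Python under simplifying assumptions: the graph is unweighted (the slides describe edges derived from semantic relatedness), only one Girvan-Newman split is performed, and communities are scored by edge density alone. The toy graph and all function names are illustrative, not taken from the paper.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, set()).add(w)
        adj.setdefault(w, set()).add(v)
    return adj


def components(adj):
    """Connected components, found by breadth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps


def edge_betweenness(adj):
    """Brandes-style edge betweenness (each node pair is counted twice)."""
    score = {}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds.get(w, []):
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + c
                delta[v] += c
    return score


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)
        if not score:
            break
        v, w = max(score, key=score.get)
        adj[v].discard(w)
        adj[w].discard(v)
    return components(adj)


def density(adj, comp):
    """Fraction of possible edges that are present inside a community."""
    n = len(comp)
    if n < 2:
        return 0.0
    inside = sum(1 for v in comp for w in adj[v] if w in comp) / 2
    return inside / (n * (n - 1) / 2)


# Toy semantic graph (made-up edges, no self-loops).
graph = build_graph([
    ("Apple Inc.", "iTunes"), ("Apple Inc.", "iPod"), ("iTunes", "iPod"),
    ("iTunes", "Accessibility"),
    ("Accessibility", "Screen reader"), ("Screen reader", "Braille"),
])
for comp in sorted(girvan_newman_split(graph), key=lambda c: -density(graph, c)):
    print(round(density(graph, comp), 2), sorted(comp))
```

The full algorithm keeps splitting and chooses among the resulting partitions; a real ranking might also combine density with other signals.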

Keywords Extraction (Example)
Semantic graph built from the news article "Apple to Make iTunes More Accessible For the Blind"

Advantages of the Keywords Extraction Method
  • No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia
  • Noise and multi-theme stability. Good at filtering out noise and discovering topics in Web pages
  • Thematically grouped key terms. Significantly improve further inference of document topics
  • High accuracy. Evaluated using human judgments

Other Methods
  • General topic inference for a doc using spreading activation over the Wikipedia category graph
      • Example: Amazon EC2, Microsoft Azure, Google MapReduce => Cloud Computing
  • Building thematically grouped tag clouds for many docs
      • Girvan-Newman algorithm to split into thematic groups
      • Topic inference for each group
  • Document classification
      • Semantic similarity is used to identify indirect relationships between terms (e.g. a doc about collaborative filtering is classified under recommender system)
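
A rough sketch of spreading activation for the cloud computing example: each detected term sends activation up to its categories, the activation decays at every step, and the category that accumulates the most is taken as the topic. The category graph, the `decay` and `max_depth` parameters, and the function name are assumptions made for illustration; the slides do not describe the actual procedure in this detail.

```python
from collections import defaultdict


def infer_topic(terms, parents, decay=0.5, max_depth=3):
    """Spread activation from terms upward through a category graph.

    parents maps an article or category to the categories containing it.
    Returns (category, activation) pairs, strongest first.
    """
    activation = defaultdict(float)
    frontier = {t: 1.0 for t in terms}
    for _ in range(max_depth):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for cat in parents.get(node, []):
                next_frontier[cat] += energy * decay
        for cat, energy in next_frontier.items():
            activation[cat] += energy
        frontier = next_frontier
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)


# Toy category graph (made up; not the real Wikipedia category structure).
toy_parents = {
    "Amazon EC2": ["Cloud computing", "Amazon"],
    "Microsoft Azure": ["Cloud computing", "Microsoft"],
    "Google MapReduce": ["Cloud computing", "Google"],
    "Cloud computing": ["Computing"],
}

ranking = infer_topic(["Amazon EC2", "Microsoft Azure", "Google MapReduce"], toy_parents)
print(ranking[0])
```

The decay keeps very general categories such as "Computing" from outranking the more specific category that the terms share directly.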

Semantic Search & Navigation
  • Search by Concept
      • Advantages of query and in-doc terms disambiguation
      • Result: documents about the concept and related concepts, ordered by relevance (keywordness)
  • Smart Faceted Navigation: query-relevant facets using semantic relatedness
  • Concept-tips to grasp the result documents
      • Each document in the result is accompanied by concept-tips that explain how the document is relevant to the query

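
One possible reading of search by concept, sketched under stated assumptions: each document carries its extracted concepts with a keywordness score, a document's relevance is the sum of relatedness(query, concept) × keywordness, and the concepts that contribute most serve as concept-tips. The scoring formula, the toy data, and all names here are hypothetical; the slides do not specify how ranking is computed.

```python
# Toy relatedness table (made-up values for illustration only).
TOY_REL = {frozenset(("Recommender system", "Collaborative filtering")): 0.8}


def toy_rel(a, b):
    """A concept is fully related to itself; unknown pairs score 0."""
    return 1.0 if a == b else TOY_REL.get(frozenset((a, b)), 0.0)


def search_by_concept(query, docs, rel, top_tips=2):
    """Rank documents for a query concept.

    docs maps a document id to {concept: keywordness}.
    Returns (doc_id, score, concept_tips) tuples, best first.
    """
    results = []
    for doc_id, concepts in docs.items():
        contrib = {c: rel(query, c) * k for c, k in concepts.items()}
        score = sum(contrib.values())
        if score > 0:
            tips = sorted((c for c in contrib if contrib[c] > 0),
                          key=contrib.get, reverse=True)[:top_tips]
            results.append((doc_id, score, tips))
    return sorted(results, key=lambda r: r[1], reverse=True)


toy_docs = {
    "doc1": {"Recommender system": 0.9, "Netflix": 0.4},
    "doc2": {"Collaborative filtering": 0.7},
    "doc3": {"Oil platform": 0.9},
}
for doc_id, score, tips in search_by_concept("Recommender system", toy_docs, toy_rel):
    print(doc_id, round(score, 2), tips)
```

Here `doc2` is returned even though it never mentions the query concept, because it is about a related concept, and its concept-tip explains the match.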
Facets Generation
Facets Generation (cont.)
Facets Generation (cont.)
Facets Generation (cont.)
Thank You!





Editor's Notes

  • #3: We've developed a new technology for semantic text analysis and semantic search. The main idea behind our technology is that we use knowledge extracted from Wikipedia to facilitate text analysis. Wikipedia has by now grown into the biggest database of concepts and their relationships that has ever existed. Wikipedia is great for a number of reasons: 1) Comprehensive coverage (it contains very general concepts such as car, computer, government, etc. and a lot of niche concepts such as small new startup companies or people known only in some communities). 2) Continuously brought up to date (it is often updated within minutes of an announcement). 3) It is well structured: it has redirects (Ivan the Terrible redirects to Ivan IV of Russia), which are synonyms, and it has disambiguation pages (homonyms), which list the different meanings of a term (IBM may stand for International Business Machines or International Brotherhood of Magicians). Using Wikipedia as a big knowledge base allows us to significantly improve a number of techniques and to develop new techniques that were not possible before. Here is the list of techniques that we developed: advanced NLP, etc. It is just a list of techniques. I will explain how it all works.
  • #6: Betweenness: how much an edge is "in between" different communities. Modularity: a partition is a good one if there are many edges within communities and only a few between them.
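
The modularity idea in note #6 can be made concrete with Newman's standard definition for an unweighted graph: Q is the sum over communities of (edges inside the community / m) minus (total degree of the community / 2m) squared, where m is the number of edges. The sketch below is an illustration on a made-up graph, not code from the presentation.

```python
def modularity(adj, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    adj maps each node to the set of its neighbours.
    """
    m = sum(len(ns) for ns in adj.values()) / 2  # number of edges
    q = 0.0
    for comm in communities:
        inside = sum(1 for v in comm for w in adj[v] if w in comm) / 2
        degree = sum(len(adj[v]) for v in comm)
        q += inside / m - (degree / (2 * m)) ** 2
    return q


# Toy graph: two triangles joined by a single bridge edge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(round(modularity(toy, [{"a", "b", "c"}, {"d", "e", "f"}]), 3))  # two triangles
print(round(modularity(toy, [set(toy)]), 3))                          # one big community
```

Splitting the graph at the bridge scores higher than keeping everything in one community, which matches the intuition in the note: many edges within communities and only a few between them.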