Semantic Text Processing Powered by Wikipedia


The document discusses using Wikipedia as a resource for semantic text processing and natural language processing techniques. It describes using Wikipedia's comprehensive coverage of terms, rich structure of links and categories, and ability to be continuously updated to power text analysis algorithms. These include word sense disambiguation, keyword extraction, topic inference, ontology management, semantic search, and improved recommendations. The techniques analyze Wikipedia's link structure and build semantic graphs of documents to discover related concepts and group keywords.


Semantic Text Processing Powered by Wikipedia
Maxim Grinev

Technology Overview

Next-generation text analysis bootstrapped by Wikipedia.

Wikipedia is a new enabling resource for NLP:
- Comprehensive coverage (6M terms versus 65K in Britannica)
- Continuously brought up to date
- Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, infoboxes)

New algorithms:
- Advanced NLP: word sense disambiguation, keyword extraction, topic inference
- Automatic ontology management: organizing concepts into thematically grouped tag clouds
- Semantic search: concept-based similarity search, smart faceted navigation
- Improved recommendations: semantic document similarity

Zero-cost deployment and customization: no machine learning techniques that require human labor, no “cold start”.

Basic Technique: Semantic Relatedness of Terms
- We analyse the Wikipedia link structure to compute the semantic relatedness of Wikipedia terms.
- We use the Dice measure with weighted links (bi-directional links, direct links, “see also” links, etc.).
Reference: Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov. “Accuracy Estimate and Optimization Techniques for SimRank Computation”, VLDB 2008.
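A Dice measure over weighted links can be sketched as follows. This is only an illustration: the slides do not give the exact formula or the weights, so the function name `weighted_dice`, the link weights, and the toy link sets are all assumptions. With every weight set to 1 the function reduces to the classic Dice coefficient 2|A∩B| / (|A| + |B|).

```python
def weighted_dice(links_a, links_b):
    """Weighted Dice similarity between two terms.

    Each argument maps a linked article title to a link weight.
    The weighting scheme is an assumption, not the authors' formula.
    """
    shared = links_a.keys() & links_b.keys()
    # Shared neighbours contribute the weight of both links.
    numerator = sum(links_a[n] + links_b[n] for n in shared)
    denominator = sum(links_a.values()) + sum(links_b.values())
    return numerator / denominator if denominator else 0.0


# Hypothetical weights: 1.0 for a bi-directional link, 0.5 for a one-way link.
itunes_links = {"Apple Inc.": 1.0, "iPod": 1.0, "Music": 0.5}
ipod_links = {"Apple Inc.": 1.0, "Music": 0.5, "Sony Walkman": 0.5}

score = weighted_dice(itunes_links, ipod_links)
```

Terms that share many heavily weighted neighbours score close to 1, and terms with no common neighbours score 0.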

Terms Detection and Disambiguation
- Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians.
- We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text.
- Example: “Platform” is mentioned in the context of implementation, open-source, web server, HTTP.
Reference: Denis Turdakov, Pavel Velikhov. “Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation”, SYRCoDIS, 2008.
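One simple way to use relatedness for disambiguation is to pick the candidate sense that is most related to the surrounding context terms. The sketch below assumes this strategy; the candidate senses, the relatedness values, and the function names are hypothetical toy data, not taken from Wikipedia or from the cited paper.

```python
# Toy relatedness scores between concepts (made-up values for illustration).
RELATEDNESS = {
    ("Computing platform", "Implementation"): 0.6,
    ("Computing platform", "Open-source software"): 0.7,
    ("Computing platform", "Web server"): 0.8,
    ("Computing platform", "HTTP"): 0.5,
    ("Railway platform", "Implementation"): 0.1,
}


def relatedness(a, b):
    """Symmetric lookup in the toy table; unknown pairs score 0."""
    return RELATEDNESS.get((a, b), RELATEDNESS.get((b, a), 0.0))


def disambiguate(candidates, context):
    """Return the candidate sense with the highest total relatedness to the context."""
    return max(candidates, key=lambda sense: sum(relatedness(sense, c) for c in context))


# Candidate senses as they might be listed on a disambiguation page (hypothetical).
candidates = ["Railway platform", "Oil platform", "Computing platform"]
context = ["Implementation", "Open-source software", "Web server", "HTTP"]

best_sense = disambiguate(candidates, context)
```

In a real system the candidates would come from redirect and disambiguation pages, and `relatedness` would be computed from the link structure.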

Keywords Extraction
- Build the document semantic graph using semantic relatedness between the Wikipedia terms detected in the document.
- Discover the community structure of the document semantic graph.
  - Community: a densely interconnected group of nodes in a graph.
  - Girvan-Newman algorithm for detecting community structure in networks.
- Select the “best” communities:
  - Dense communities contain key terms.
  - Sparse communities contain unimportant terms and possible disambiguation mistakes.
Reference: Maria Grineva, Maxim Grinev, Dmitry Lizorkin. “Extracting Key Terms From Noisy and Multitheme Documents”, WWW2009: 18th International World Wide Web Conference.
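The community detection step can be illustrated with a minimal, unweighted version of the Girvan-Newman idea: repeatedly remove the edge with the highest betweenness until the graph splits. The sketch below is an assumption-laden simplification (it ignores relatedness weights, and the toy graph and function names are made up); it is not the authors' implementation.

```python
from collections import deque


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist, paths = {source: 0}, {source: 1.0}
        preds = {n: [] for n in adj}
        order, queue = [], deque([source])
        while queue:  # BFS that counts shortest paths
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    paths[v] = 0.0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
                    preds[v].append(u)
        dep = {n: 0.0 for n in order}
        for v in reversed(order):  # push path counts back towards the source
            for u in preds[v]:
                share = paths[u] / paths[v] * (1.0 + dep[v])
                score[frozenset((u, v))] += share
                dep[u] += share
    return score


def components(adj):
    """Connected components as a list of node sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        result.append(comp)
    return result


def split_once(adj):
    """Remove highest-betweenness edges until the component count grows."""
    adj = {n: set(nb) for n, nb in adj.items()}
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)
        if not score:
            break
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy semantic graph: two dense groups joined by a single edge (hypothetical).
graph = {
    "Apple Inc.": {"iTunes", "iPod"},
    "iTunes": {"Apple Inc.", "iPod", "Accessibility"},
    "iPod": {"Apple Inc.", "iTunes"},
    "Accessibility": {"iTunes", "Screen reader", "Blindness"},
    "Screen reader": {"Accessibility", "Blindness"},
    "Blindness": {"Accessibility", "Screen reader"},
}
groups = split_once(graph)
```

Here the single edge between the two groups carries all shortest paths between them, so it is removed first and the graph separates into the two thematic groups.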

Keywords Extraction (Example)
Semantic graph built from the news article “Apple to Make iTunes More Accessible For the Blind”.

Advantages of the Keywords Extraction Method
- No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia.
- Noise and multi-theme stability. Good at filtering out noise and discovering topics in Web pages.
- Thematically grouped key terms. These significantly improve subsequent inference of document topics.
- High accuracy. Evaluated using human judgments.

Other Methods
- General topic inference for a document using spreading activation over the Wikipedia category graph.
  - Example: Amazon EC2, Microsoft Azure, Google MapReduce => Cloud Computing
- Building thematically grouped tag clouds for many documents:
  - Girvan-Newman algorithm to split terms into thematic groups
  - Topic inference for each group
- Document classification: semantic similarity is used to identify indirect relationships between terms (e.g. a document about collaborative filtering is classified under recommender system).
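Spreading activation over a category graph can be sketched as below: each detected term sends energy up to its parent categories, the energy decays at every step, and the category that accumulates the most energy is proposed as the topic. The category graph, the decay factor, the depth limit, and the function name are all assumptions chosen for illustration; they are not real Wikipedia categories or the authors' parameters.

```python
from collections import defaultdict


def infer_topic(terms, parents, decay=0.5, max_depth=3):
    """Return (best_category, activations) after spreading activation upwards."""
    activation = defaultdict(float)
    for term in terms:
        frontier = {term: 1.0}  # each term starts with one unit of energy
        for _ in range(max_depth):
            next_frontier = defaultdict(float)
            for node, energy in frontier.items():
                for parent in parents.get(node, ()):
                    next_frontier[parent] += energy * decay
            for node, energy in next_frontier.items():
                activation[node] += energy
            frontier = next_frontier
    return max(activation, key=activation.get), dict(activation)


# Made-up category graph: child -> parent categories.
toy_parents = {
    "Amazon EC2": ["Cloud infrastructure", "Amazon Web Services"],
    "Microsoft Azure": ["Cloud platforms", "Microsoft software"],
    "Google MapReduce": ["Distributed computing", "Google software"],
    "Cloud infrastructure": ["Cloud computing"],
    "Cloud platforms": ["Cloud computing"],
    "Distributed computing": ["Cloud computing"],
    "Amazon Web Services": ["Amazon"],
}

topic, activations = infer_topic(
    ["Amazon EC2", "Microsoft Azure", "Google MapReduce"], toy_parents
)
```

In this toy graph the three terms have no direct category in common, but their energy converges on one shared ancestor, which is why it wins over the more specific categories.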

Semantic Search & Navigation
- Search by concept:
  - Benefits from disambiguation of both query terms and in-document terms
  - Result: documents about the concept and related concepts, ordered by relevance (keywordness)
- Smart faceted navigation: query-relevant facets generated using semantic relatedness
- Concept tips to grasp the result documents: each document in the result is accompanied by concept tips that explain how the document is relevant to the query
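A rough sketch of search by concept with concept tips follows. It assumes each document has already been reduced to key concepts with a keywordness score, and it scores a document by summing relatedness to the query concept multiplied by keywordness. The scoring formula, the toy index, and all names here are hypothetical; the slides do not specify how ranking is actually computed.

```python
# Made-up relatedness between distinct concepts.
TOY_RELATEDNESS = {
    frozenset(("Recommender system", "Collaborative filtering")): 0.8,
}


def relatedness(a, b):
    """1.0 for the same concept, a table lookup otherwise, 0.0 if unknown."""
    return 1.0 if a == b else TOY_RELATEDNESS.get(frozenset((a, b)), 0.0)


def search_by_concept(query, docs, top_tips=3):
    """Rank documents by relatedness of their key concepts to the query concept.

    Returns (doc_id, score, concept_tips) tuples, best first. The tips are the
    concepts that contributed most to the score.
    """
    results = []
    for doc_id, key_terms in docs.items():
        contributions = {
            concept: relatedness(query, concept) * keywordness
            for concept, keywordness in key_terms.items()
        }
        contributions = {c: s for c, s in contributions.items() if s > 0}
        if not contributions:
            continue  # nothing in this document relates to the query
        tips = sorted(contributions, key=contributions.get, reverse=True)[:top_tips]
        results.append((doc_id, sum(contributions.values()), tips))
    results.sort(key=lambda r: r[1], reverse=True)
    return results


# Hypothetical index: document -> {key concept: keywordness}.
docs = {
    "doc1": {"Recommender system": 0.9, "Collaborative filtering": 0.8},
    "doc2": {"Collaborative filtering": 0.4, "Web server": 0.7},
    "doc3": {"Web server": 0.9},
}
results = search_by_concept("Recommender system", docs)
```

Note that the second document never mentions the query concept itself; it is returned only because one of its key concepts is related to the query, and that concept is shown as its tip.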

Facets Generation (slides 10 to 13; no text in the transcript)

14. Thank You!

Editor's Notes

#3: We've developed a new technology for semantic text analysis and semantic search. The main idea behind it is that we use knowledge extracted from Wikipedia to facilitate text analysis. Wikipedia has by now grown into the biggest database of concepts and their relationships that has ever existed. It is valuable for several reasons: 1) Comprehensive coverage: it contains very general concepts such as car, computer, and government, as well as many niche concepts such as small new startup companies or people known only in some communities. 2) It is continuously brought up to date, often within minutes of an announcement. 3) It is well structured: it has redirects (Ivan the Terrible redirects to Ivan IV of Russia), which capture synonyms, and disambiguation pages (homonyms), which list the different meanings of a term (IBM may stand for International Business Machines or International Brotherhood of Magicians). Using Wikipedia as a large knowledge base allows us to significantly improve a number of existing techniques and to develop new ones that were not possible before. Here is the list of techniques that we developed: advanced NLP, etc. It is just a list of techniques; I will explain how it all works.
#6: Betweenness: how much an edge lies “in between” different communities. Modularity: a partition is a good one if there are many edges within communities and only a few between them.
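The modularity idea in the note above can be made concrete with the standard Newman modularity score, Q = sum over communities of (internal edges / m) minus (community degree sum / 2m) squared, where m is the number of edges. The sketch below assumes an unweighted, undirected graph given as an edge list; the function name and the toy graph are illustrative only.

```python
def modularity(edges, communities):
    """Newman modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    label = {node: i for i, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)  # edges with both ends in the community
    degree = [0] * len(communities)    # sum of node degrees per community
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(
        internal[i] / m - (degree[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )


# Two triangles joined by one edge (toy graph).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
q_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_whole = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
```

Splitting the toy graph into its two triangles scores higher than keeping everything in one community, which matches the intuition of many edges inside communities and few between them.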