Information Extraction
from Web-Scale N-Gram Data
Niket Tandon and Gerard de Melo
2010-07-23
Max Planck Institute for Informatics
Saarbrücken, Germany
1 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
2 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
Information Extraction
Users generally want
information,
not documents
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
Information Extraction
Users generally want
information,
not documents
Structured data
Direct, instant answers
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
Other Applications
Query expansion
Semantic analysis
Faceted search
Entity Tracking
Document Enrichment
Mobile Services
Visual Object Recognition
etc.
Information Extraction
Users generally want
information,
not documents
Structured data
Direct, instant answers
... and more
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
Information Extraction
Users generally want
information,
not documents
Structured data
Direct, instant answers
... and more
Where do we obtain
such data?
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
and the love of friends' [p] Happy as the grass was green' [p] Come live with me, and be my
lawns swoop around the sunken garden. The grass is emerald green and perfect-a tribute to
overlooking the silver river. All round her the grass stretched green, but stunted, browning in the
the ground steadied beneath them, and the grass turned green, swishing high around their
to see the sun shine, the flowers blossom, the grass grow green. I could not bear to hear the
are quite dwarf. M. sinensis. Chinese silver grass. Ample green- and silver-striped foliage but
in either of them." It was summer and the grass was green. Clive Rappaport was a solicitor,
however, each bank is lined with stands of grass that remain green and stand taller than the
groaned and farted and schemed for snatches of grass that showed green at the corners of his bits,
the flowers were blossoming profusely and the grass was richly green. The people of the village
Song. [f] He is dead and gone; At his head a grass-green turf, At his heels a stone." O, ho! [f]
hard thoughts I stand by popple scrub, in tall grass, blown over and harsh, green and dry. From my
Well the sky is blue and er [tc text=pause] the grass is green and [tc text=pause] there's
Yes. Yes. [F01] Dreadful things. Erm so the grass was never quite as green [ZF1] as [ZF0] as
be beautiful on there really beautiful. All the grass lush and green not a car parked on it
Information Extraction
Users generally want
information,
not documents
Structured data
Direct, instant answers
... and more
Where do we obtain
such data?
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Structured Data
isA(Guggenheim,Museum)
locatedIn(Guggenheim,Manhattan)
partOf(Manhattan,NewYork)
. . .
4 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Pattern-Based Approaches
Use simple textual patterns to extract information
(Lyons 1977, Cruse 1986, Hearst 1992)
5 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Pattern-Based Approaches
Use simple textual patterns to extract information
(Lyons 1977, Cruse 1986, Hearst 1992)
e.g. “<Y> such as <X>”
“cities such as Salem” → isA(Salem,City)
5 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Pattern-Based Approaches
Use simple textual patterns to extract information
(Lyons 1977, Cruse 1986, Hearst 1992)
e.g. “<Y> such as <X>”
“cities such as Salem” → isA(Salem,City)
e.g. “<X> and other <Y>”
“Lausanne and other cities” → isA(Lausanne,City)
5 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
One Possibility: Sophisticated NLP (1990s)
MUC evaluation initiative
CRF-style segmentation methods
etc.
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
Alternative: Use Larger Corpora
American National Corpus: 22 million words
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
Alternative: Use Larger Corpora
American National Corpus: 22 million words
British National Corpus: 100 million words
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
Alternative: Use Larger Corpora
American National Corpus: 22 million words
British National Corpus: 100 million words
English Wikipedia: 1 000 million words
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
Alternative: Use Larger Corpora
American National Corpus: 22 million words
British National Corpus: 100 million words
English Wikipedia: 1 000 million words
Agichtein (2005), Pantel (2004): scalable IE, but still only a
small fraction of the entire Web
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Web Search Engines
7 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Web Search Engines
Problems
Need to know what you’re looking for.
Can only retrieve top-k results
Very slow: days instead of minutes – Cafarella (2005)
7 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Web Search Engines
Problems
Need to know what you’re looking for.
Can only retrieve top-k results
Very slow: days instead of minutes – Cafarella (2005)
Instead
Use n-gram statistics derived
from very large parts of the
Web!
7 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
8 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Data
Web-Scale N-Gram Datasets
Web-scale n-gram statistics derived from around 10^12 words of
text are available
9 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Data
Web-Scale N-Gram Datasets
Web-scale n-gram statistics derived from around 10^12 words of
text are available
Provides: Frequencies/Language model for strings
Example: f(“cities such as Geneva”)=...
f(“Zürich and other cities”)=...
f(“Lausanne and other Swiss cities”)=...
9 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
ok:
if independently extractable, e.g. founding year and location of
an organization
not ok:
“<V> imported <W> dollars worth of <X> from <Y>
in year <Z>”
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
short items of interest
ok:
birthYear(Mozart,1756)
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
short items of interest
ok:
birthYear(Mozart,1756)
not:
fatherOf(Wolfgang Amadeus
Mozart,F. X. Mozart)
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
short items of interest
no way:
fatherOf(Johannes
Chrysostomus Wolfgangus
Theophilus Mozart,
Franz Xaver Wolfgang
Mozart)
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
short items of interest
short patterns
ok:
“<X> and other <Y>”
not:
“<X> has an inflation rate of <Y>”
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Risks
Influence of spam and boilerplate text
11 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Risks
Influence of spam and boilerplate text
Less control over the selection of input documents
11 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Risks
Influence of spam and boilerplate text
Less control over the selection of input documents
Less context information (WSD, POS tagging, parsing)
11 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Then why use n-grams?
much larger input (petabytes of original data)
better coverage
higher precision (more evidence, more redundancy)
Pantel (2004): more data allows a rather simple technique to
outperform much more sophisticated algorithms
12 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Then why use n-grams?
much larger input (petabytes of original data)
better coverage
higher precision (more evidence, more redundancy)
Pantel (2004): more data allows a rather simple technique to
outperform much more sophisticated algorithms
availability
larger than available document collections
crawling the Web: slow, requires link farm detection, high
bandwidth
12 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
e.g. for isA relation: (dogs,animals), (gold,metal)
e.g. for partOf: (finger,hand), (leaves,trees),
(windows,houses)
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
query the n-gram dataset: “dogs * animals” (and “animals * dogs”)
alternatively: “dogs ? animals”, “dogs ? ? animals”, . . .
alternatively: fall back to a separate document collection
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
generalize to textual patterns
(dogs,animals) found in
“... dogs and other animals ...” → “<X> and other <Y>”
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
generalize to textual patterns
2 Search for patterns in the n-gram data → candidate tuples
“<X> and other <Y>” finds
“Zürich and other cities” → (Zürich,cities)
“apples and other fruits” → (apples,fruits)
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
generalize to textual patterns
2 Search for patterns in the n-gram data → candidate tuples
3 Finally, rank the candidate tuples, choose output tuples
Supervised learning based on a labeled set of tuples
Output: Accepted tuples like (Geneva,city).
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
generalize to textual patterns
2 Search for patterns in the n-gram data → candidate tuples
3 Finally, rank the candidate tuples, choose output tuples
Features: for a tuple (x, y)
f_i(p(x, y)) for each data source i and pattern p
Σ_{p∈P} f_i(p(x, y)) for each data source i
13 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
14 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
contains n-gram statistics for n = 1 . . . 5
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
contains n-gram statistics for n = 1 . . . 5
generated from around 10^12 words of text
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
contains n-gram statistics for n = 1 . . . 5
generated from around 10^12 words of text
positive: distributed (around 60GB uncompressed)
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
contains n-gram statistics for n = 1 . . . 5
generated from around 10^12 words of text
positive: distributed (around 60GB uncompressed)
negative: cut-off frequency 40
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
currently 3- and 4-grams, smoothed language models
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
currently 3- and 4-grams, smoothed language models
generated from around 1.4T tokens (the complete English-US version
of the Bing index)
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
currently 3- and 4-grams, smoothed language models
generated from around 1.4T tokens (the complete English-US version
of the Bing index)
also: statistics from titles (12.5G tokens) and anchor texts (357G
tokens)
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
currently 3- and 4-grams, smoothed language models
generated from around 1.4T tokens (the complete English-US version
of the Bing index)
also: statistics from titles (12.5G tokens) and anchor texts (357G
tokens)
WSDL-based web service
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
3 ClueWeb09 5-grams
500 million web pages, 700M 5-grams
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Seeds and Patterns
Relation      Seeds   Patterns discovered
isA           100     2991
partOf        100     3883
hasProperty   100     3175
seeds from MIT ConceptNet
even among the highest-ranked:
partOf(children,parents) and isA(winning,everything)
16 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Pattern Examples: isA
Pattern PMI range
<X> and almost any <Y> high
<X> betting basketball betting <Y> high
<X> is my favorite <Y> high
<X> shoes online shoes <Y> high
<X> is a <Y> medium
<X> is the best <Y> medium
<X> or any other <Y> medium
<X> , and <Y> medium
<X> and other smart <Y> medium
<X> and grammar <Y> low
<X> content of the <Y> low
<X> when it changes <Y> low
17 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Pattern Examples: partOf
Pattern PMI range
<X> with the other <Y> high
<X> of the top <Y> high
<X> online <Y> high
<X> shoes online shoes <Y> high
<X> from the <Y> medium
<X> or even entire <Y> medium
<X> of host <Y> medium
<X> from <Y> medium
<X> of a different <Y> medium
<X> entertainment and <Y> low
<X> Download for thou <Y> low
<X> company home in <Y> low
18 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Patterns: Microsoft Document Body 3-grams vs. Anchor 3-grams
(each point represents the sum of pattern scores for a tuple)
19 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Patterns: Microsoft Document Body 3-grams vs. Title 3-grams
(each point represents the sum of pattern scores for a tuple)
20 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Patterns: Microsoft Document Body 3-grams vs. Google Body 3-grams
(each point represents the sum of pattern scores for a tuple)
21 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Approach
learning:
RBF-kernel SVMs, also: random forests, C4.5, AdaBoost
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Approach
learning:
RBF-kernel SVMs, also: random forests, C4.5, AdaBoost
∼ 500 random labelled examples per relation
(matching any of the patterns)
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Approach
learning:
RBF-kernel SVMs, also: random forests, C4.5, AdaBoost
∼ 500 random labelled examples per relation
(matching any of the patterns)
10-fold leave-one-out cross-validation
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Approach
learning:
RBF-kernel SVMs, also: random forests, C4.5, AdaBoost
∼ 500 random labelled examples per relation
(matching any of the patterns)
10-fold leave-one-out cross-validation
=⇒ Recall is relative to union of pattern matches
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Relation      Precision   Recall   F1      Output per million n-grams¹
isA           88.9%       8.1%     14.8%   983
partOf        80.5%       34.0%    47.8%   7897
hasProperty   75.3%       99.3%    85.6%   26180
¹ the expected number of distinct accepted tuples per million input n-grams
(the total number of 5-grams in the Google Web 1T dataset is ∼1,176 million)
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Relation      Precision   Recall   F1      Output per million n-grams¹
isA           88.9%       8.1%     14.8%   983
partOf        80.5%       34.0%    47.8%   7897
hasProperty   75.3%       99.3%    85.6%   26180
¹ the expected number of distinct accepted tuples per million input n-grams
(the total number of 5-grams in the Google Web 1T dataset is ∼1,176 million)
linguistic information implicitly captured via combinations of
patterns!
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Detailed Results (partOf relation)
Dataset Source Prec. Recall F1
Google 3-grams Document Body 55.9% 38.5% 45.6%
Google 4-grams Document Body 52.6% 43.3% 47.5%
Google 5-grams Document Body 48.1% 42.8% 45.3%
ClueWeb 5-grams Document Body 51.7% 35.6% 42.2%
Google 3-/4-grams Document Body 53.9% 42.8% 47.7%
Google 3-/4-/5-grams Document Body 58.7% 43.8% 50.1%
23 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Detailed Results (partOf relation)
Dataset Source Prec. Recall F1
Microsoft 3-grams Document Body 58.5% 33.2% 42.3%
Microsoft 3-grams Document Title 51.7% 29.8% 37.8%
Microsoft 3-grams Anchor Text 57.3% 36.1% 44.2%
Microsoft 3-grams Body / Title / Anchor 40.4% 100.0% 57.5%
Google 3-grams Document Body 55.9% 38.5% 45.6%
Microsoft 3-/4-grams Body (3-grams only) / Title / Anchor 40.5% 98.1% 57.3%
Google 3-/4-grams Document Body 53.9% 42.8% 47.7%
Google 3-/4-/5-grams Document Body 58.7% 43.8% 50.1%
All 3-/4-/5-grams Body / Title / Anchor 80.5% 34.0% 47.8%
24 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Example: hasProperty
Properties of “flowers”
25 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
26 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Summary
Lessons Learnt
N-gram datasets allow for
Information Extraction from
petabytes of original data
27 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Summary
Lessons Learnt
N-gram datasets allow for
Information Extraction from
petabytes of original data
Requirements: short entity
names, short patterns
27 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Summary
Lessons Learnt
N-gram datasets allow for
Information Extraction from
petabytes of original data
Requirements: short entity
names, short patterns
more data helps (even at very
large scales)
27 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Summary
Lessons Learnt
N-gram datasets allow for
Information Extraction from
petabytes of original data
Requirements: short entity
names, short patterns
more data helps (even at very
large scales)
diversity of data sources helps
27 / 27
Information Extraction from Web-Scale N-Gram Data
