Information Extraction
from Web-Scale N-Gram Data
Niket Tandon and Gerard de Melo
2010-07-23
Max Planck Institute for Informatics
Saarbrücken, Germany
1 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
2 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
Information Extraction
Users generally want
information,
not documents
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
Information Extraction
Users generally want
information,
not documents
Structured data
Direct, instant answers
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
Other Applications
Query expansion
Semantic analysis
Faceted search
Entity Tracking
Document Enrichment
Mobile Services
Visual Object Recognition
etc.
Information Extraction
Users generally want
information,
not documents
Structured data
Direct, instant answers
... and more
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
Information Extraction
Users generally want
information,
not documents
Structured data
Direct, instant answers
... and more
Where do we obtain
such data?
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Introduction
and the love of friends' [p] Happy as the grass was green' [p] Come live with me, and be my
lawns swoop around the sunken garden. The grass is emerald green and perfect-a tribute to
overlooking the silver river. All round her the grass stretched green, but stunted, browning in the
the ground steadied beneath them, and the grass turned green, swishing high around their
to see the sun shine, the flowers blossom, the grass grow green. I could not bear to hear the
are quite dwarf. M. sinensis. Chinese silver grass. Ample green- and silver-striped foliage but
in either of them." It was summer and the grass was green. Clive Rappaport was a solicitor,
however, each bank is lined with stands of grass that remain green and stand taller than the
groaned and farted and schemed for snatches of grass that showed green at the corners of his bits,
the flowers were blossoming profusely and the grass was richly green. The people of the village
Song. [f] He is dead and gone; At his head a grass-green turf, At his heels a stone." O, ho! [f]
hard thoughts I stand by popple scrub, in tall grass, blown over and harsh, green and dry. From my
Well the sky is blue and er [tc text=pause] the grass is green and [tc text=pause] there's
Yes. Yes. [F01] Dreadful things. Erm so the grass was never quite as green [ZF1] as [ZF0] as
be beautiful on there really beautiful. All the grass lush and green not a car parked on it
Information Extraction
Users generally want
information,
not documents
Structured data
Direct, instant answers
... and more
Where do we obtain
such data?
3 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Structured Data
isA(Guggenheim,Museum)
locatedIn(Guggenheim,Manhattan)
partOf(Manhattan,NewYork)
. . .
4 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Pattern-Based Approaches
Use simple textual patterns to extract information
(Lyons 1977, Cruse 1986, Hearst 1992)
5 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Pattern-Based Approaches
Use simple textual patterns to extract information
(Lyons 1977, Cruse 1986, Hearst 1992)
e.g. “<Y> such as <X>”
“cities such as Salem” → isA(Salem,City)
5 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Pattern-Based Approaches
Use simple textual patterns to extract information
(Lyons 1977, Cruse 1986, Hearst 1992)
e.g. “<Y> such as <X>”
“cities such as Salem” → isA(Salem,City)
e.g. “<X> and other <Y>”
“Lausanne and other cities” → isA(Lausanne,City)
5 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
One Possibility: Sophisticated NLP (1990s)
MUC evaluation initiative
CRF-style segmentation methods
etc.
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
Alternative: Use Larger Corpora
American National Corpus: 22 million words
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
Alternative: Use Larger Corpora
American National Corpus: 22 million words
British National Corpus: 100 million words
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
Alternative: Use Larger Corpora
American National Corpus: 22 million words
British National Corpus: 100 million words
English Wikipedia: 1 000 million words
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
How do we get Structured Data?
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
Alternative: Use Larger Corpora
American National Corpus: 22 million words
British National Corpus: 100 million words
English Wikipedia: 1 000 million words
Agichtein (2005), Pantel (2004): scalable IE, but still only a
small fraction of the entire Web
6 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Web Search Engines
7 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Web Search Engines
Problems
Need to know what you’re looking for.
Can only retrieve top-k results
Very slow: days instead of minutes – Cafarella (2005)
7 / 27
Information Extraction from Web-Scale N-Gram Data
Information Extraction
Web Search Engines
Problems
Need to know what you’re looking for.
Can only retrieve top-k results
Very slow: days instead of minutes – Cafarella (2005)
Instead
Use n-gram statistics derived
from very large parts of the
Web!
7 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
8 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Data
Web-Scale N-Gram Datasets
Web-scale n-gram statistics derived from around 10^12 words of
text are available
9 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Data
Web-Scale N-Gram Datasets
Web-scale n-gram statistics derived from around 10^12 words of
text are available
Provides: Frequencies/Language model for strings
Example: f(“cities such as Geneva”)=...
f(“Zürich and other cities”)=...
f(“Lausanne and other Swiss cities”)=...
9 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
ok:
if independently extractable, e.g. founding year and location of
an organization
not ok:
“<V> imported <W> dollars worth of <X> from <Y>
in year <Z>”
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
short items of interest
ok:
birthYear(Mozart,1756)
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
short items of interest
ok:
birthYear(Mozart,1756)
not:
fatherOf(Wolfgang Amadeus
Mozart,F. X. Mozart)
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
short items of interest
no way:
fatherOf(Johannes
Chrysostomus Wolfgangus
Theophilus Mozart,
Franz Xaver Wolfgang
Mozart)
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Requirements
usually binary relationships between entities
short items of interest
short patterns
ok:
“<X> and other <Y>”
not:
“<X> has an inflation rate of <Y>”
10 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Risks
Influence of spam and boilerplate text
11 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Risks
Influence of spam and boilerplate text
Less control over the selection of input documents
11 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Risks
Influence of spam and boilerplate text
Less control over the selection of input documents
Less context information (WSD, POS tagging, parsing)
11 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Then why use n-grams?
much larger input (petabytes of original data)
better coverage
higher precision (more evidence, more redundancy)
Pantel (2004): more data allows a rather simple technique to
outperform much more sophisticated algorithms
12 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
N-Gram Information Extraction
Then why use n-grams?
much larger input (petabytes of original data)
better coverage
higher precision (more evidence, more redundancy)
Pantel (2004): more data allows a rather simple technique to
outperform much more sophisticated algorithms
availability
larger than available document collections
crawling the Web: slow, requires link farm detection, high
bandwidth
12 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
e.g. for isA relation: (dogs,animals), (gold,metal)
e.g. for partOf: (finger,hand), (leaves,trees),
(windows,houses)
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
query the n-gram dataset: “dogs * animals” (and “animals * dogs”)
alternatively: “dogs ? animals”, “dogs ? ? animals”, . . .
alternatively: fall back to a separate document collection
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
generalize to textual patterns
(dogs,animals) found in
“... dogs and other animals ...” → “<X> and other <Y>”
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
generalize to textual patterns
2 Search for patterns in the n-gram data → candidate tuples
“<X> and other <Y>” finds
“Zürich and other cities” → (Zürich,cities)
“apples and other fruits” → (apples,fruits)
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
generalize to textual patterns
2 Search for patterns in the n-gram data → candidate tuples
3 Finally, rank the candidate tuples, choose output tuples
Supervised learning based on a labeled set of tuples
Output: Accepted tuples like (Geneva,city).
13 / 27
Information Extraction from Web-Scale N-Gram Data
N-Gram Information Extraction
Information Extraction
Algorithm
1 collect patterns
input: seed tuples for a relation
find n-grams containing seeds
generalize to textual patterns
2 Search for patterns in the n-gram data → candidate tuples
3 Finally, rank the candidate tuples, choose output tuples
Features: for a tuple (x, y)
f_i(p(x, y)) for each data source i and pattern p
Σ_{p∈P} f_i(p(x, y)) for each data source i
13 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
14 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
contains n-gram statistics for n = 1 . . . 5
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
contains n-gram statistics for n = 1 . . . 5
generated from around 10^12 words of text
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
contains n-gram statistics for n = 1 . . . 5
generated from around 10^12 words of text
positive: distributed (around 60GB uncompressed)
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
contains n-gram statistics for n = 1 . . . 5
generated from around 10^12 words of text
positive: distributed (around 60GB uncompressed)
negative: cut-off frequency 40
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
currently 3- and 4-grams, smoothed language models
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
currently 3- and 4-grams, smoothed language models
generated from around 1.4T tokens (the complete English-US version
of the Bing index)
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
currently 3- and 4-grams, smoothed language models
generated from around 1.4T tokens (the complete English-US version
of the Bing index)
also: statistics from titles (12.5G tokens) and anchor texts (357G
tokens)
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
currently 3- and 4-grams, smoothed language models
generated from around 1.4T tokens (the complete English-US version
of the Bing index)
also: statistics from titles (12.5G tokens) and anchor texts (357G
tokens)
WSDL-based web service
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
2 Microsoft Web N-gram Corpus
3 ClueWeb09 5-grams
500 million web pages, 700M 5-grams
15 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Seeds and Patterns
Relation      Seeds   Patterns discovered
isA           100     2991
partOf        100     3883
hasProperty   100     3175
seeds from MIT ConceptNet
even among the highest-ranked:
partOf(children,parents) and isA(winning,everything)
16 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Pattern Examples: isA
Pattern PMI range
<X> and almost any <Y> high
<X> betting basketball betting <Y> high
<X> is my favorite <Y> high
<X> shoes online shoes <Y> high
<X> is a <Y> medium
<X> is the best <Y> medium
<X> or any other <Y> medium
<X> , and <Y> medium
<X> and other smart <Y> medium
<X> and grammar <Y> low
<X> content of the <Y> low
<X> when it changes <Y> low
17 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Pattern Examples: partOf
Pattern PMI range
<X> with the other <Y> high
<X> of the top <Y> high
<X> online <Y> high
<X> shoes online shoes <Y> high
<X> from the <Y> medium
<X> or even entire <Y> medium
<X> of host <Y> medium
<X> from <Y> medium
<X> of a different <Y> medium
<X> entertainment and <Y> low
<X> Download for thou <Y> low
<X> company home in <Y> low
18 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Patterns: Microsoft Document Body 3-grams vs. Anchor 3-grams
(each point represents the sum of pattern scores for a tuple)
19 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Patterns: Microsoft Document Body 3-grams vs. Title 3-grams
(each point represents the sum of pattern scores for a tuple)
20 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Patterns: Microsoft Document Body 3-grams vs. Google Body 3-grams
(each point represents the sum of pattern scores for a tuple)
21 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Approach
learning:
RBF-kernel SVMs, also: random forests, C4.5, AdaBoost
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Approach
learning:
RBF-kernel SVMs, also: random forests, C4.5, AdaBoost
∼ 500 random labelled examples per relation
(matching any of the patterns)
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Approach
learning:
RBF-kernel SVMs, also: random forests, C4.5, AdaBoost
∼ 500 random labelled examples per relation
(matching any of the patterns)
10-fold leave-one-out cross-validation
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Approach
learning:
RBF-kernel SVMs, also: random forests, C4.5, AdaBoost
∼ 500 random labelled examples per relation
(matching any of the patterns)
10-fold leave-one-out cross-validation
=⇒ Recall is relative to union of pattern matches
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Relation      Precision   Recall   F1      Output per million n-grams¹
isA           88.9%       8.1%     14.8%   983
partOf        80.5%       34.0%    47.8%   7897
hasProperty   75.3%       99.3%    85.6%   26180
¹ the expected number of distinct accepted tuples per million input n-grams
(the total number of 5-grams in the Google Web 1T dataset is ∼1,176 million)
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Overall Results
(all data sources simultaneously)
Relation      Precision   Recall   F1      Output per million n-grams¹
isA           88.9%       8.1%     14.8%   983
partOf        80.5%       34.0%    47.8%   7897
hasProperty   75.3%       99.3%    85.6%   26180
¹ the expected number of distinct accepted tuples per million input n-grams
(the total number of 5-grams in the Google Web 1T dataset is ∼1,176 million)
linguistic information implicitly captured via combinations of
patterns!
22 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Detailed Results (partOf relation)
Dataset Source Prec. Recall F1
Google 3-grams Document Body 55.9% 38.5% 45.6%
Google 4-grams Document Body 52.6% 43.3% 47.5%
Google 5-grams Document Body 48.1% 42.8% 45.3%
ClueWeb 5-grams Document Body 51.7% 35.6% 42.2%
Google 3-/4-grams Document Body 53.9% 42.8% 47.7%
Google 3-/4-/5-grams Document Body 58.7% 43.8% 50.1%
23 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Detailed Results (partOf relation)
Dataset Source Prec. Recall F1
Microsoft 3-grams Document Body 58.5% 33.2% 42.3%
Microsoft 3-grams Document Title 51.7% 29.8% 37.8%
Microsoft 3-grams Anchor Text 57.3% 36.1% 44.2%
Microsoft 3-grams Body / Title / Anchor 40.4% 100.0% 57.5%
Google 3-grams Document Body 55.9% 38.5% 45.6%
Microsoft 3-/4-grams Body (3-grams only) / Title / Anchor 40.5% 98.1% 57.3%
Google 3-/4-grams Document Body 53.9% 42.8% 47.7%
Google 3-/4-/5-grams Document Body 58.7% 43.8% 50.1%
All 3-/4-/5-grams Body / Title / Anchor 80.5% 34.0% 47.8%
24 / 27
Information Extraction from Web-Scale N-Gram Data
Experiments
Example: hasProperty
Properties of “flowers”
25 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
26 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Summary
Lessons Learnt
N-gram datasets allow for
Information Extraction from
petabytes of original data
27 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Summary
Lessons Learnt
N-gram datasets allow for
Information Extraction from
petabytes of original data
Requirements: short entity
names, short patterns
27 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Summary
Lessons Learnt
N-gram datasets allow for
Information Extraction from
petabytes of original data
Requirements: short entity
names, short patterns
more data helps (even at very
large scales)
27 / 27
Information Extraction from Web-Scale N-Gram Data
Conclusion
Summary
Lessons Learnt
N-gram datasets allow for
Information Extraction from
petabytes of original data
Requirements: short entity
names, short patterns
more data helps (even at very
large scales)
diversity of data sources helps
27 / 27
Information Extraction from Web-Scale N-Gram Data
