Temporal Web Dynamics: Implications from Search Perspective
Speaker: Nattiya Kanhabua
Advanced Methods for IR Course
L3S Research Center, University of Hannover
26 June 2014
Outline
• Temporal Web Dynamics
• Research Problems
– Temporal Information Extraction
– Temporal Query Analysis
– Time-aware Retrieval and Ranking
• Application to Temporal Search
Temporal Web Dynamics
• The Web changes over time in many aspects:
– Size: web pages are added/deleted all the time
– Content: web pages are edited/modified
– Query: users’ information needs change
[Ke et al., CN 2006; Risvik et al., CN 2002]
[Dumais, SIAM-SDM 2012; WebDyn 2010]
Web and Index Sizes
• Timeline, 1995 to 2012:
– 2000: First billion-URL index; the world’s largest, on ≈5000 PCs in clusters
– 2004: Index grows to 4.2 billion pages
– 2008: Google counts 1 trillion unique URLs
– 2009: TBs or PBs of data/index, on tens of thousands of PCs
• Source: http://www.worldwidewebsize.com/
Impacts: crawling, indexing, and caching
Content Dynamics
• WayBack Machine
– Web archive search by the Internet Archive
• [Figure: Wayback Machine snapshots from 1998, 2006, and 2012]
Impacts: document representation and retrieval
Query Dynamics
• Search queries exhibit temporal patterns
– Spikes or seasonality
Impacts: search intent and query representation
http://www.google.com/insights/search
Temporal Query Examples
• A temporal query consists of:
– Query keywords
– Temporal expressions
• A document consists of:
– Terms, i.e., bag-of-words
– Publication time and temporal expressions
[Berberich et al., ECIR 2010]
Query/Document Matching
• Query side: query → Determining Search Intent → time-sensitive query
– Term: {Germany, World, Cup}
– Time: {06/2006, 07/2006}
• Document side: Temporal Web → Semantic Annotation → annotated documents
– Term: {w1, w2, …, wn}
– Time: {PubTime(di), ContentTime(di)}
• Matching the two sides yields the retrieved results D2006
Temporal Information Extraction
Two Time Aspects
• Two time dimensions:
1. Publication or modified time
2. Content or event time
• [Figure: a document annotated with its content time and publication time]
Problem Statements
• It is difficult to find a trustworthy timestamp for web documents
– Time gap between crawling and indexing
– Decentralization and relocation of web documents
– No standard metadata for time/date
Document Dating
• Motivating dialogue:
– “I found a bible-like document, but I have no idea when it was created.”
– “Let me see… This document was probably written in 850 A.C., with 95% confidence.”
• Research question: “For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”
Probabilistic Approach
• Temporal language models built from a timestamped reference corpus:

Timestamp   Word         Freq
1999        tsunami      1
1999        Japan        1
1999        tidal wave   1
2004        tsunami      1
2004        Thailand     1
2004        earthquake   1

• A non-timestamped document: {tsunami, Thailand}
• Similarity scores:
– Score(1999) = 1
– Score(2004) = 1 + 1 = 2
– Most likely timestamp is 2004
Temporal Language Models
• Based on the statistical usage of words over time
• Compare each word of a non-timestamped document with a reference corpus
• Tentative timestamp: the time partition whose word usage overlaps most with the document
[de Jong et al., AHC 2005; Kanhabua et al., ECDL 2008]
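The toy example above can be reproduced with a few lines of Python. This is only a sketch of the counting intuition on the slide: it sums raw word frequencies per time partition, whereas the cited papers use a probabilistic similarity measure. The function names are illustrative assumptions, not taken from the papers.

```python
from collections import Counter, defaultdict

# Reference corpus from the slide: (timestamp, word) pairs, each seen once.
CORPUS = [
    (1999, "tsunami"), (1999, "Japan"), (1999, "tidal wave"),
    (2004, "tsunami"), (2004, "Thailand"), (2004, "earthquake"),
]

def build_temporal_models(corpus):
    """Build one word-frequency table per time partition."""
    models = defaultdict(Counter)
    for timestamp, word in corpus:
        models[timestamp][word] += 1
    return models

def date_document(doc_words, models):
    """Score every partition by summed word frequency; return the best one."""
    scores = {
        timestamp: sum(model[word] for word in doc_words)
        for timestamp, model in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

models = build_temporal_models(CORPUS)
best, scores = date_document(["tsunami", "Thailand"], models)
print(best, scores)
```

For the document {tsunami, Thailand}, the partition 2004 receives the higher score, matching the slide.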
Extracting Content Time
• How to determine which of the temporal expressions tagged in a document are relevant?
– Not all temporal expressions associated with an event are equally relevant
• Example: “Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak in Uganda since the beginning of July 2012”
Approach
• The task of identifying relevant time is
regarded as a classification problem
– Two classes: (1) relevant and (2) irrelevant
• Definition: a temporal expression is relevant if it overlaps the starting, ending, or ongoing time of the event
• Machine learning: three classes of features
– Sentence-based features
– Document-based features
– Corpus-specific features
[Kanhabua et al., TAIA 2012; Strötgen et al., TempWeb 2012]
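The relevance definition above is an interval-overlap test, which can be sketched as follows. This illustrates only how a label could be derived, not the classifier from the cited papers. The function name, the choice of 1 July 2012 as the start of the outbreak, and the October 2000 expression are assumptions made for the example.

```python
from datetime import date

def is_relevant(expr_start, expr_end, event_start, event_end=None):
    """Return True if the expression interval overlaps the event interval.

    event_end=None models an ongoing event (open-ended interval).
    """
    if event_end is None:
        return expr_end >= event_start
    return expr_start <= event_end and expr_end >= event_start

# Assumed event interval: outbreak ongoing since the beginning of July 2012.
outbreak_start = date(2012, 7, 1)

# "29 July 2012" falls inside the ongoing event.
print(is_relevant(date(2012, 7, 29), date(2012, 7, 29), outbreak_start))
# A hypothetical mention of October 2000 does not overlap the event.
print(is_relevant(date(2000, 10, 1), date(2000, 10, 31), outbreak_start))
```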
Features
• Sentence-based
– senLen, senPos, isContext, cntEntityInS, cntTExpInS,
cntTPointInS, cntTPeriodInS, entityPos, entityPosDist,
TExpPos, TExpPosDist, timeDist, entityTExpPosDist
• Document-based
– cntEntityInD, cntEntitySen, cntTExpInD, cntTPointInD,
cntTPeriodInD
• Domain-specific
– isNeg, isHistory
[Kanhabua et al., TAIA 2012]
Temporal Query Analysis
Determining Search Intent
• Two types of temporal queries:
1. Explicit: time is provided, e.g., “US President 2012”
2. Implicit: time is not provided, e.g., “Germany FIFA World Cup”
• Temporal intent can be implicitly inferred
• Previous studies on temporal queries:
– 1.5% of web queries are explicit
– ~7% of web queries are implicit
[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009]
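A minimal sketch of spotting explicit temporal queries, assuming that a four-digit year is the only kind of temporal expression considered. The regular expression and function names are illustrative assumptions. Note that a query without a year may be either implicitly temporal or not temporal at all; telling those apart needs further evidence such as query logs.

```python
import re

# Assumption: years between 1000 and 2099 written as four digits.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def explicit_years(query):
    """Return all four-digit years mentioned in the query."""
    return [int(year) for year in YEAR.findall(query)]

def is_explicit(query):
    """A query is treated as explicit if it mentions at least one year."""
    return bool(explicit_years(query))

print(explicit_years("US President 2012"))
print(is_explicit("Germany FIFA World Cup"))
```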
Query Log Analysis
• Leverage real-world query logs
– Search query frequencies over time
• Apply time-series analysis
– Time-series decomposition for detecting seasonal
queries
[Metzler et al., SIGIR 2009; Shokouhi, SIGIR 2011]
Time-series Decomposition
Query: Easter
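Time-series decomposition can be sketched with a classical additive scheme: a centered moving average estimates the trend, and the average detrended value per calendar position estimates the seasonal component. This is a generic textbook procedure, not necessarily the method of the cited papers, and the monthly query counts below are invented for illustration.

```python
def decompose(series, period):
    """Classical additive decomposition into trend and seasonal components."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: half weight on both end points of the window.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period
    # Average the detrended values for every position within the period.
    buckets = [[] for _ in range(period)]
    for i, value in enumerate(trend):
        if value is not None:
            buckets[i % period].append(series[i] - value)
    raw = [sum(bucket) / len(bucket) for bucket in buckets]
    mean = sum(raw) / period
    seasonal = [value - mean for value in raw]
    return trend, seasonal

# Hypothetical monthly counts for a query that spikes every April, 3 years.
counts = [10 + (90 if month % 12 == 3 else 0) for month in range(36)]
trend, seasonal = decompose(counts, period=12)
peak_month = seasonal.index(max(seasonal))  # 0 = January
print(peak_month, seasonal[peak_month])
```

A large seasonal peak at the same position every period suggests a seasonal query; the series must cover at least two full periods for every position to receive an estimate.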
Matching: Re-visited
• Query side: query → Determining Search Intent → time-sensitive query
– Term: {Germany, World, Cup}
– Time: {06/2006, 07/2006}
• Document side: Temporal Web → Semantic Annotation → annotated documents
– Term: {w1, w2, …, wn}
– Time: {PubTime(di), ContentTime(di)}
• Matching the two sides yields the retrieved results D2006
• Ranking the retrieved results yields the ranked results D2006
Time-aware Retrieval and Ranking
Searching the Past
• Searching documents created/edited over time
– E.g., web archives, news archives, blogs, or emails (“temporal document collections”)
– A journalist wants to write a timeline of a news article
– A Wikipedia contributor searches for historical information about an entity of interest
• Example: retrieve documents about Pope Benedict XVI written before 2005
– Term-based IR approaches may give unsatisfactory results
Challenges
• Time must be explicitly modeled in order to increase the effectiveness of ranking
– To order search results so that the most relevant ones are ranked higher
• Time uncertainty should be taken into account
– Two temporal expressions can refer to the same time period even though they are written differently
– E.g., the query “Independence Day 2011”
• A retrieval model relying on term matching only will fail to retrieve documents mentioning “July 4, 2011”
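One way to handle differently written expressions is to normalize each of them to a calendar date before matching. The sketch below covers only two surface forms and a one-entry holiday table; both the table and the function name are assumptions for illustration, and a real system would rely on a temporal tagger.

```python
import re
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

# Hypothetical lookup table: holiday name -> (month, day), US calendar.
HOLIDAYS = {"independence day": (7, 4)}

def normalize(expression):
    """Map a temporal expression to a date, or None if it is not understood."""
    text = expression.strip().lower()
    # Form 1: "<holiday name> <year>", e.g. "Independence Day 2011".
    match = re.fullmatch(r"(.+?)\s+(\d{4})", text)
    if match and match.group(1) in HOLIDAYS:
        month, day = HOLIDAYS[match.group(1)]
        return date(int(match.group(2)), month, day)
    # Form 2: "<month name> <day>, <year>", e.g. "July 4, 2011".
    match = re.fullmatch(r"([a-z]+)\s+(\d{1,2}),\s*(\d{4})", text)
    if match and match.group(1) in MONTHS:
        month = MONTHS.index(match.group(1)) + 1
        return date(int(match.group(3)), month, int(match.group(2)))
    return None

print(normalize("Independence Day 2011") == normalize("July 4, 2011"))
```

After normalization the two strings compare equal, although they share no term apart from the year.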
Query/Document Models
• A temporal query consists of:
– Query keywords
– Temporal expressions
• A document consists of:
– Terms, i.e., bag-of-words
– Publication time and temporal expressions
Time-aware Ranking Models
• Two main approaches
1. Mixture model [Kanhabua et al., ECDL 2010]
• Linearly combining textual- and temporal similarity
2. Probabilistic model [Berberich et al., ECIR 2010]
• Generating a query from the textual part and temporal part
of a document independently
Mixture Model
• Linearly combine textual- and temporal similarity
– α indicates the importance of similarity scores
• Both scores are normalized before combining
– Textual similarity can be determined using any term-
based retrieval model
• E.g., tf.idf or a unigram language model
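The linear combination can be sketched as below. The slide does not show the formula, so the exact form is an assumption: here α weights the temporal score and (1 − α) the textual score, and both score sets are min-max normalized first. All names are illustrative.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 0.0 for doc in scores}
    return {doc: (score - low) / (high - low) for doc, score in scores.items()}

def mixture_scores(text_scores, time_scores, alpha=0.5):
    """Combine normalized textual and temporal similarity linearly.

    Assumed form: (1 - alpha) * textual + alpha * temporal.
    """
    text = min_max(text_scores)
    time = min_max(time_scores)
    return {doc: (1 - alpha) * text[doc] + alpha * time[doc] for doc in text}

# Hypothetical scores: d1 matches the terms better, d2 matches the time better.
text_scores = {"d1": 2.0, "d2": 1.0}
time_scores = {"d1": 0.1, "d2": 0.9}
for alpha in (0.0, 0.5, 1.0):
    print(alpha, mixture_scores(text_scores, time_scores, alpha))
```

With α = 0 the ranking is purely textual, with α = 1 purely temporal; values in between trade the two off.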
How to determine temporal similarity?
[Kanhabua et al., ECDL 2010]
Temporal Similarity
• Assume that temporal expressions in the query are generated independently from a two-step generative model:
– P(tq|td) can be estimated based on publication time using an exponential decay function
– Linear interpolation smoothing is applied to eliminate zero probabilities
• I.e., for a temporal expression tq that is unseen in d
• [Figure: similarity score plotted against time; documents d1 and d2 lie at distances Dist(d1,q) and Dist(d2,q) from the query time <q>]
[Kanhabua et al., ECDL 2010]
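A minimal sketch of the two ingredients named above, under stated assumptions: the decay function uses a generic form, decay_rate raised to the scaled time distance, and the smoothing mixes the document-level estimate with a background probability. The parameter names and default values are illustrative and not taken from the paper.

```python
def decay_similarity(t_query, t_doc, decay_rate=0.5, scale=1.0):
    """Score that shrinks exponentially with the distance |t_query - t_doc|."""
    return decay_rate ** (abs(t_query - t_doc) / scale)

def smoothed(p_doc, p_background, weight=0.1):
    """Linear interpolation: never zero as long as the background is non-zero."""
    return (1 - weight) * p_doc + weight * p_background

# d1 is published closer to the query time than d2, so it scores higher.
print(decay_similarity(2006, 2005))
print(decay_similarity(2006, 2000))
# An unseen temporal expression still receives a small probability.
print(smoothed(0.0, 0.2))
```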
Application to Temporal IR
Problem Statements
• Queries of named entities (people, companies, places)
– Highly dynamic in appearance, i.e., relationships between terms change over time
– E.g., changes of roles, name alterations, or semantic shift
Named Entity Evolution
• Scenario 1
– Query: “Pope Benedict XVI”, restricted to documents written before 2005
– Documents about “Joseph Alois Ratzinger” are relevant
• Scenario 2
– Query: “Hillary R. Clinton”, restricted to documents written from 1997 to 2002
– Documents about “New York Senator” and “First Lady of the United States” are relevant
Examples of Name Changes
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
Current Approaches
• Temporal co-occurrence
• Temporal association rule mining
• Temporal knowledge extraction
– Ontology
– Wikipedia history
[Berberich et al., WebDB 2009; Kanhabua et al., JCDL 2010]
[Kaluarachchi et al., CIKM 2010; Tahmasebi et al., COLING 2012]
Temporal Co-occurrence
• Temporal co-occurrence
– Measures the degree of relatedness of two entities at different times by comparing term contexts
– Requires recurrent computation at query time, which reduces efficiency and scalability
[Berberich et al., WebDB 2009]
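Comparing term contexts across time can be sketched with sentence-level co-occurrence vectors and cosine similarity. This is a generic illustration rather than the method of the cited paper, and the tokenized sentences are invented.

```python
from collections import Counter
from math import sqrt

def context_vector(sentences, entity):
    """Count the terms that co-occur with the entity in the same sentence."""
    context = Counter()
    for tokens in sentences:
        if entity in tokens:
            context.update(token for token in tokens if token != entity)
    return context

def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical time slices of a document collection.
docs_2004 = [["ratzinger", "cardinal", "vatican", "rome"]]
docs_2006 = [["benedict", "pope", "vatican", "rome"],
             ["clinton", "senator", "new_york"]]

old = context_vector(docs_2004, "ratzinger")
print(cosine(old, context_vector(docs_2006, "benedict")))
print(cosine(old, context_vector(docs_2006, "clinton")))
```

Because the context vectors depend on the chosen time slices, they have to be computed per query, which is the efficiency concern noted above.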
Association Rule Mining
• Temporal association rule mining
– Discover semantically identical concepts (or named entities) that are used at different times
– Two entities are semantically related if their
associated events occur multiple times in a collection
– Events are represented as sentences containing a
subject, a verb, objects, and nouns
[Kaluarachchi et al., CIKM 2010]
Temporal Knowledge Bases
• YAGO ontology
– Extract named entities from the YAGO ontology
– Track named entity evolution using the New York
Times Annotated Corpus
• Wikipedia history
– Define a time-based synonym as a term semantically
related to a named entity at a particular time period
– Extract synonyms of named entities from anchor texts
in article links using the whole history of Wikipedia
[Mazeika et al., CIKM 2011; Kanhabua et al., JCDL 2010]
Search with Name Changes
• Extract time-based synonyms from Wikipedia
– Synonyms are words with similar meanings
– In this context, synonyms refer to name variants (name changes, titles, or roles) of a named entity
• E.g., "Cardinal Joseph Ratzinger" is a synonym of
"Pope Benedict XVI" before 2005
• Two types of time-based synonyms
1. Time-independent
2. Time-dependent
[Kanhabua et al., JCDL 2010]
Recognize Named Entities
[Kanhabua et al., JCDL 2010]
Find Synonyms
• Find a set of entity-synonym relationships at time tk
• For each ei ∈ Etk, extract anchor texts from article links:
– Entity: President_of_the_United_States
– Synonym: George W. Bush
– Time: 11/2004
• [Figure: anchor texts “George W. Bush”, “President George W. Bush”, and “President Bush (43)” linking to the article President_of_the_United_States]
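Anchor-text extraction can be sketched over wikitext snapshots, where a piped link has the form [[Target|anchor text]]. The snapshot content and the function name below are assumptions for illustration; the sketch ignores links without a pipe and any namespace handling.

```python
import re

# Piped wikitext link: [[Target article|anchor text]]
LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")

def extract_synonyms(snapshots):
    """Collect (entity, synonym, time) triples from (time, wikitext) pairs."""
    triples = set()
    for time, wikitext in snapshots:
        for target, anchor in LINK.findall(wikitext):
            entity = target.strip().replace(" ", "_")
            triples.add((entity, anchor.strip(), time))
    return triples

# Hypothetical snapshot text dated 11/2004.
snapshots = [
    ("11/2004",
     "The summit was opened by "
     "[[President of the United States|George W. Bush]] on Monday."),
]
print(extract_synonyms(snapshots))
```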
Initial Results
• Time periods are not accurate
– Note: the time periods of synonyms are the timestamps of Wikipedia articles (8 years)
[Kanhabua et al., JCDL 2010]
Enhancement using NYT
• Analyze the NYT Corpus to discover accurate time periods
– 20-year time span (1987-2007)
• Use a burst detection algorithm
– Time periods of synonyms = burst intervals
[Kanhabua et al., JCDL 2010]
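The idea of taking burst intervals as synonym time periods can be illustrated with a simple threshold detector: a year is bursty when its mention count exceeds the mean by k standard deviations, and consecutive bursty years form an interval. This stand-in is not the burst detection algorithm used in the paper, and the yearly counts are invented.

```python
from statistics import mean, pstdev

def burst_intervals(counts, k=1.0):
    """Return (first_year, last_year) runs whose counts exceed mean + k * std.

    Assumes `counts` has one entry for every year in the observed range.
    """
    threshold = mean(counts.values()) + k * pstdev(counts.values())
    intervals = []
    start = previous = None
    for year in sorted(counts):
        if counts[year] > threshold:
            if start is None:
                start = year
            previous = year
        elif start is not None:
            intervals.append((start, previous))
            start = None
    if start is not None:
        intervals.append((start, previous))
    return intervals

# Hypothetical yearly mention counts of one synonym in a news corpus.
mentions = {2000: 1, 2001: 20, 2002: 22, 2003: 2,
            2004: 1, 2005: 1, 2006: 1, 2007: 1}
print(burst_intervals(mentions))
```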
Query Expansion
1. A user enters an entity as a query
2. The system retrieves synonyms with respect to the query
3. The user selects synonyms to expand the query
[Kanhabua et al., ECML PKDD 2010]
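The three steps can be sketched with a small in-memory synonym table. The table contents, the validity years, and the OR-style query syntax are assumptions for illustration; only the Ratzinger example itself comes from the slides.

```python
# Hypothetical time-based synonyms: entity -> [(synonym, first_year, last_year)]
SYNONYMS = {
    "Pope Benedict XVI": [
        ("Cardinal Joseph Ratzinger", 1987, 2005),
        ("Joseph Alois Ratzinger", 1987, 2007),
    ],
}

def synonyms_for(entity, start_year, end_year):
    """Step 2: synonyms whose validity period overlaps the query period."""
    return [
        synonym
        for synonym, first, last in SYNONYMS.get(entity, [])
        if first <= end_year and last >= start_year
    ]

def expand_query(entity, selected):
    """Step 3: join the entity and the user-selected synonyms with OR."""
    return " OR ".join(f'"{term}"' for term in [entity, *selected])

candidates = synonyms_for("Pope Benedict XVI", 1987, 2004)  # steps 1 and 2
print(expand_query("Pope Benedict XVI", candidates[:1]))    # user picks one
```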
References
• [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum:
A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
• [Radinsky et al., WWW 2012] Kira Radinsky, Krysta Svore, Susan T. Dumais, Jaime Teevan, Alex
Bocharov, Eric Horvitz: Modeling and predicting behavioral dynamics on the web. WWW 2012: 599-608
• [Dumais, SIAM-SDM 2012] Susan T. Dumais: Temporal Dynamics and Information Retrieval. SIAM-SDM 2012
• [de Jong et al., AHC 2005] Franciska de Jong, Henning Rode, Djoerd Hiemstra: Temporal language
models for the disclosure of historical text. AHC 2005: 161-168
• [Kaluarachchi et al., CIKM 2010] Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J.
Bedathur, Gerhard Weikum, Jing Peng, Anna Feldman: Incorporating terminology evolution for query
translation in text retrieval with association rules. CIKM 2010: 1789-1792
• [Kanhabua et al., JCDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Exploiting time-based synonyms in
searching document archives. JCDL 2010: 79-88
• [Kanhabua et al., ECDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Determining Time of Queries for
Re-ranking Search Results. ECDL 2010: 261-272
• [Kanhabua et al., TAIA 2012] Nattiya Kanhabua, Sara Romano, Avaré Stewart: Identifying Relevant
Temporal Expressions for Real-World Events. Time-aware Information Access Workshop 2012
• [Ke et al., CN 2006] Yiping Ke, Lin Deng, Wilfred Ng, Dik Lun Lee: Web dynamics and their
ramifications for the development of Web search engines. Computer Networks 50(10): 1430-1447
(2006)
• [Metzler et al., SIGIR 2009] Donald Metzler, Rosie Jones, Fuchun Peng, Ruiqiang Zhang:
Improving search relevance for implicitly temporal queries. SIGIR 2009: 700-701
• [Nunes et al., ECIR 2008] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Use of Temporal
Expressions in Web Search. ECIR 2008: 580-584
• [Risvik et al., CN 2002] Knut Magne Risvik, Rolf Michelsen: Search engines and Web dynamics.
Computer Networks 39(3): 289-302 (2002)
• [Shokouhi, SIGIR 2011] Milad Shokouhi: Detecting Seasonal Queries by Time-Series Analysis.
SIGIR 2011: 1171-1172
• [Strötgen et al., TempWeb 2012] Jannik Strötgen, Omar Alonso, Michael Gertz: Identification of
top relevant temporal expressions in documents. Temporal Web Workshop 2012.
• [Tahmasebi et al., COLING 2012] Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge
Holzmann, Thomas Risse: NEER: An Unsupervised Method for Named Entity Evolution
Recognition. COLING 2012
• [WebDyn 2010] Web Dynamics course: http://www.mpi-inf.mpg.de/departments/d5/teaching/ss10/dyn/,
Max-Planck Institute for Informatics, Saarbrücken, Germany, 2010
• [Zhang et al., EMNLP 2010] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang,
Zhaohui Zheng: Learning Recurrent Event Queries for Web Search. EMNLP 2010: 1129-1139
Further Reading
1. Tu Ngoc Nguyen, Nattiya Kanhabua: Leveraging Dynamic Query Subtopics for Time-Aware
Search Result Diversification. ECIR 2014: 222-234
2. Miles Efron: Query representation for cross-temporal information retrieval. SIGIR 2013: 383-392

Temporal Web Dynamics: Implications from Search Perspective

  • 1. Temporal Web Dynamics Implications from Search Perspective Speaker: Nattiya Kanhabua Advanced Methods for IR Course L3S Research Center, University of Hannover 26 June 2014
  • 2. Outline • Temporal Web Dynamics • Research Problems – Temporal Information Extraction – Temporal Query Analysis – Time-aware Retrieval and Ranking • Application to Temporal Search
  • 3. Temporal Web Dynamics • Web is changing over time in many aspects: – Size: web pages are added/deleted all the time – Content: web pages are edited/modified – Query: users’ information needs changes [Ke et al., CN 2006; Risvik et al., CN 2002] [Dumais, SIAM-SDM 2012; WebDyn 2010]
  • 4. 2000 First billion-URL index The world’s largest! ≈5000 PCs in clusters!2004 Index grows to 4.2 billion pages 1995 2012 2008 Google counts 1 trillion unique URLs Web and Index Sizes 2009 TBs or PBs of data/index Tens of thousands of PCs http://www.worldwidewebsize.com/ Impacts: crawling, indexing, and caching
  • 5. Content Dynamics • WayBack Machine – Web archive search by the Internet Archive
  • 7. Query Dynamics • Search queries exhibit temporal patterns – Spikes or seasonality Impacts: search intent and query representation http://www.google.com/insights/search
  • 8. Temporal Query Examples • A temporal query consists of: – Query keywords – Temporal expressions • A document consists of: – Terms, i.e., bag-of-words – Publication time and temporal expressions [Berberich et al., ECIR 2010]
  • 9. Query/Document Matching query Temporal Web Determining Search Intent Term: {Germany, World, Cup} Time: {06/2006, 07/2006} D2006 Retrieved results matching Time-sensitive queries Semantic Annotation Annotated documents Term: {w1, w2, …, wn} Time: {PubTime(di), ContentTime(di)}
  • 11. Two Time Aspects Two time dimensions 1. Publication or modified time 2. Content or event time content time publication time
  • 12. Problem Statements • Difficult to find the trustworthy time for web documents – Time gap between crawling and indexing – Decentralization and relocation of web documents – No standard metadata for time/date Document Dating Let’s me see… This document is probably written in 850 A.C. with 95% confidence. I found a bible-like document. But I have no idea when it was created? “ For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence? ”
  • 13. Probabilistic Approach Timestamp Word 1999 tsunami 1999 Japan 1999 tidal wave 2004 tsunami 2004 Thailand 2004 earthquake Temporal Language Models tsunami Thailand A non-timestamped document Similarity Scores Score(1999) = 1 Score(2004) = 1 + 1 = 2 Most likely timestamp is 2004 Temporal Language Models • Based on the statistic usage of words over time • Compare each word of a non-timestamped document with a reference corpus • Tentative timestamp -- a time partition mostly overlaps in word usage [de Jong et al., AHC 2005; Kanhabua et al., ECDL 2008] Freq 1 1 1 1 1 1
  • 14. Extracting Content Time • How to determine relevant temporal expressions tagged in a document? – Not all temporal expressions associated to an event are equally relevant Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak in Uganda since the beginning of July 2012
  • 15. Approach • The task of identifying relevant time is regarded as a classification problem – Two classes: (1) relevant and (2) irrelevant • Definition: relevant if overlaps the starting, ending or ongoing time of the event • Machine learning: three classes of features – Sentence-based features – Document-based features – Corpus-specific features [Kanhabua et al., TAIA 2012; Strötgen et al., TempWeb 2012]
  • 16. Features • Sentence-based – senLen, senPos, isContext, cntEntityInS, cntTExpInS, cntTPointInS, cntTPeriodInS, entityPos, entityPosDist, TExpPos, TExpPosDist, timeDist, entityTExpPosDist • Document-based – cntEntityInD, cntEntitySen, cntTExpInD, cntTPointInD, cntTPeriodInD • Domain-specific – isNeg, isHistory [Kanhabua et al., TAIA 2012]
  • 18. Determining Search Intent • Two types of temporal queries: 1. Explicit: time is provided, “US President 2012“ 2. Implicit: time is not provided, "Germany FIFA World Cup" • Temporal intent can be implicitly inferred • Previous studies on temporal queries: – 1.5% of web queries are explicit – ~7% of web queries are implicit [Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009]
  • 19. Query Log Analysis • Leverage real-world query logs – Search query frequencies over time • Apply time-series analysis – Time-series decomposition for detecting seasonal queries [Metzler et al., SIGIR 2009; Shokouhi, SIGIR 2011]
  • 21. Matching: Re-visited D2006 Ranked results query Temporal Web Determining Search Intent Term: {Germany, World, Cup} Time: {06/2006, 07/2006} D2006 Retrieved results matching Time-sensitive Queries Semantic Annotation Annotated documents Term: {w1, w2, …, wn} Time: {PubTime(di), ContentTime(di)}
  • 23. Searching the Past • Searching documents created/edited over time – E.g., web archives, news archives, blogs, or emails – A journalist wants to write a timeline of a news article – A Wikipedia contributor searches for historical information about an entity of interests Web archives news archives blogs emails “temporal document collections” Retrieve documents about Pope Benedict XVI written before 2005 Term-based IR approaches may give unsatisfied results
  • 24. • Time must be explicitly modeled in order to increase the effectiveness of ranking – To order search results so that the most relevant ones are ranked higher • Time uncertainty should be taken into account – Two temporal expressions can refer to the same time period even though they are not equally written – E.g. the query “Independence Day 2011” • A retrieval model relying on term-matching only will fail to retrieve documents mentioning “July 4, 2011” Challenges
  • 25. Query/Document Models • A temporal query consists of: – Query keywords – Temporal expressions • A document consists of: – Terms, i.e., bag-of-words – Publication time and temporal expressions
  • 26. Time-aware Ranking Models • Two main approaches 1. Mixture model [Kanhabua et al., ECDL 2010] • Linearly combining textual- and temporal similarity 2. Probabilistic model [Berberich et al., ECIR 2010] • Generating a query from the textual part and temporal part of a document independently
  • 27. Mixture Model • Linearly combine textual- and temporal similarity – α indicates the importance of similarity scores • Both scores are normalized before combining – Textual similarity can be determined using any term- based retrieval model • E.g., tf.idf or a unigram language model
  • 28. Mixture Model • Linearly combine textual- and temporal similarity – α indicates the importance of similarity scores • Both scores are normalized before combining – Textual similarity can be determined using any term- based retrieval model • E.g., tf.idf or a unigram language model How to determine temporal similarity? [Kanhabua et al., ECDL 2010]
  • 29. Temporal Similarity • Assume that temporal expressions in the query are generated independently from a two-step generative model: – P(tq|td) can be estimated based on publication time using an exponential decay function [Kanhabua et al., ECDL 2010] – Linear interpolation smoothing is applied to eliminates zero probabilities • I.e., an unseen temporal expression tq in d Similarityscore Time d1 d2 <q> Dist(d1,q) Dist(d2,q) [Kanhabua et al., ECDL 2010]
  • 30. Temporal Similarity • Assume that temporal expressions in the query are generated independently from a two-step generative model: – P(tq|td) can be estimated based on publication time using an exponential decay function – Linear interpolation smoothing is applied to eliminates zero probabilities • I.e., an unseen temporal expression tq in d [Kanhabua et al., ECDL 2010]
  • 32. Problem Statements • Queries of named entities (people, company, place) – Highly dynamic in appearance, i.e., relationships between terms changes over time – E.g. changes of roles, name alterations, or semantic shift Named Entity Evolution Scenario 1 Query: “Pope Benedict XVI” and written before 2005 Documents about “Joseph Alois Ratzinger” are relevant Scenario 2 Query: “Hillary R. Clinton” and written from 1997 to 2002 Documents about “New York Senator” and “First Lady of the United States” are relevant
  • 33. Examples of Name Changes QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
  • 34. Current Approaches • Temporal co-occurrence • Temporal association rule mining • Temporal knowledge extraction – Ontology – Wikipedia history [Berberich et al., WebDB 2009; Kanhabua et al., JCDL 2010] [Kaluarachchi et al., CIKM 2010; Tahmasebi et al., COLING 2012]
  • 35. Temporal Co-occurrence • Temporal co-occurrence – Measure the degree of relatedness of two entities at different times by comparing term contexts – Require a recurrent computation at querying time, which reduce efficiency and scalability [Berberich et al., WebDB 2009]
  • 36. Association Rule Mining • Temporal association rule mining – Discover semantically identical concepts (or named entities) that are used in different time – Two entities are semantically related if their associated events occur multiple times in a collection – Events are represented as sentences containing a subject, a verb, objects, and nouns [Kaluarachchi et al., CIKM 2010]
  • 37. Temporal Knowledge Bases • YAGO ontology – Extract named entities from the YAGO ontology – Track named entity evolution using the New York Times Annotated Corpus • Wikipedia history – Define a time-based synonym as a term semantically related to a named entity at a particular time period – Extract synonyms of named entities from anchor texts in article links using the whole history of Wikipedia [Mazeika et al., CIKM 2011; Kanhabua et al., JCDL 2010]
  • 38. Search with Name Changes • Extract time-based synonyms from Wikipedia – Synonyms are words with similar meanings – In this context, synonyms refer to name variants (name changes, titles, or roles) of a named entity • E.g., "Cardinal Joseph Ratzinger" is a synonym of "Pope Benedict XVI" before 2005 • Two types of time-based synonyms 1. Time-independent 2. Time-dependent [Kanhabua et al., JCDL 2010]
  • 42. Find Synonyms • Find a set of entity-synonym relationships at time tk • For each ei ϵ Etk, extract anchor texts from article links: – Entity: President_of_the_United_States – Synonym: George W. Bush – Time: 11/2004 • Figure: anchor texts in article links, e.g., "George W. Bush", "President George W. Bush", "President Bush (43)"
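The extraction step can be sketched as a grouping of anchor texts by link target and snapshot time. This is an illustrative sketch only: the `(snapshot, anchor_text, target)` tuple layout and the `George_W._Bush` link targets are assumptions for the example; only the first triple is taken from the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[str, str, str]  # (snapshot time, anchor text, target entity)


def find_synonyms(links: Iterable[Link]) -> Dict[Tuple[str, str], Set[str]]:
    """Map (entity, snapshot) to the anchor texts that link to that entity."""
    table: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for snapshot, anchor, target in links:
        table[(target, snapshot)].add(anchor)
    return dict(table)


# Toy links; the last two targets are hypothetical.
links = [
    ("2004-11", "George W. Bush", "President_of_the_United_States"),
    ("2004-11", "President George W. Bush", "George_W._Bush"),
    ("2004-11", "President Bush (43)", "George_W._Bush"),
]
synonyms = find_synonyms(links)
```

Repeating this for every snapshot tk of the article history yields one entity-synonym set per time point.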
  • 43. Initial Results • Time periods are not accurate • Note: the time periods of synonyms are taken from the timestamps of Wikipedia articles (8 years) [Kanhabua et al., JCDL 2010]
  • 44. Enhancement using NYT • Analyze the NYT Corpus to discover accurate time periods – 20-year time span (1987-2007) • Use the burst detection algorithm – Time periods of synonyms = burst intervals [Kanhabua et al., JCDL 2010]
  • 46. Enhancement using NYT • Analyze the NYT Corpus to discover accurate time periods – 20-year time span (1987-2007) • Use the burst detection algorithm – Time periods of synonyms = burst intervals • Figure: initial results
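Turning term frequencies into burst intervals can be sketched as below. The slides do not specify which burst detection algorithm is used, so this example substitutes a simple stand-in: a year counts as bursty when its frequency exceeds the mean plus `k` standard deviations, and consecutive bursty years are merged into one interval. The yearly counts are made-up numbers, not corpus statistics.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def burst_intervals(counts: Dict[int, int], k: float = 1.0) -> List[Tuple[int, int]]:
    """Return inclusive (start_year, end_year) runs of above-threshold years."""
    threshold = mean(counts.values()) + k * pstdev(counts.values())
    intervals: List[Tuple[int, int]] = []
    start = prev = None
    for year in sorted(counts):
        if counts[year] > threshold:
            if start is None:
                start = year
            elif year != prev + 1:
                # A gap in the years closes the current run.
                intervals.append((start, prev))
                start = year
            prev = year
        elif start is not None:
            intervals.append((start, prev))
            start = None
    if start is not None:
        intervals.append((start, prev))
    return intervals


# Hypothetical yearly frequencies of one synonym in a news corpus.
toy_counts = {1999: 1, 2000: 2, 2001: 30, 2002: 35, 2003: 3, 2004: 2}
periods = burst_intervals(toy_counts)
```

The resulting intervals would then replace the coarser time periods derived from article timestamps.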
  • 47. Query Expansion 1. A user enters an entity as a query [Kanhabua et al., ECML PKDD 2010]
  • 48. Query Expansion 1. A user enters an entity as a query 2. The system retrieves synonyms with respect to the query [Kanhabua et al., ECML PKDD 2010]
  • 49. Query Expansion 1. A user enters an entity as a query 2. The system retrieves synonyms with respect to the query 3. The user selects synonyms to expand the query [Kanhabua et al., ECML PKDD 2010]
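The three steps above can be sketched as a lookup followed by a query rewrite. Everything here is illustrative: the table layout, the year ranges, the overlap test, and the quoted `OR` syntax are assumptions for the example and do not describe the actual system.

```python
from typing import Dict, List, Tuple

# entity -> list of (synonym, first_year, last_year); years are illustrative.
SynonymTable = Dict[str, List[Tuple[str, int, int]]]


def retrieve_synonyms(entity: str, table: SynonymTable,
                      start: int, end: int) -> List[str]:
    """Step 2: synonyms whose time period overlaps the query period."""
    return [syn for syn, s0, s1 in table.get(entity, [])
            if s0 <= end and start <= s1]


def expand_query(entity: str, selected: List[str]) -> str:
    """Step 3: combine the entity and the user-selected synonyms."""
    terms = [entity] + [s for s in selected if s != entity]
    return " OR ".join(f'"{t}"' for t in terms)


toy_table: SynonymTable = {
    "Pope Benedict XVI": [
        ("Cardinal Joseph Ratzinger", 1987, 2005),
        ("Pope Benedict XVI", 2005, 2007),
    ],
}

# Step 1: the user enters an entity, here with a period of interest.
candidates = retrieve_synonyms("Pope Benedict XVI", toy_table, 1990, 2000)
expanded = expand_query("Pope Benedict XVI", candidates)
```

Keeping the selection step with the user, rather than expanding automatically, lets the user discard synonyms that do not match the intended period or sense.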
  • 50. References
    • [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
    • [Radinsky et al., WWW 2012] Kira Radinsky, Krysta Svore, Susan T. Dumais, Jaime Teevan, Alex Bocharov, Eric Horvitz: Modeling and predicting behavioral dynamics on the web. WWW 2012: 599-608
    • [Dumais, SIAM-SDM 2012] Susan T. Dumais: Temporal Dynamics and Information Retrieval. SIAM-SDM 2012
    • [de Jong et al., AHC 2005] Franciska de Jong, Henning Rode, Djoerd Hiemstra: Temporal language models for the disclosure of historical text. AHC 2005: 161-168
    • [Kaluarachchi et al., CIKM 2010] Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J. Bedathur, Gerhard Weikum, Jing Peng, Anna Feldman: Incorporating terminology evolution for query translation in text retrieval with association rules. CIKM 2010: 1789-1792
    • [Kanhabua et al., JCDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010: 79-88
    • [Kanhabua et al., ECDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Determining Time of Queries for Re-ranking Search Results. ECDL 2010: 261-272
    • [Kanhabua et al., TAIA 2012] Nattiya Kanhabua, Sara Romano, Avaré Stewart: Identifying Relevant Temporal Expressions for Real-World Events. Time-aware Information Access Workshop 2012
    • [Ke et al., CN 2006] Yiping Ke, Lin Deng, Wilfred Ng, Dik Lun Lee: Web dynamics and their ramifications for the development of Web search engines. Computer Networks 50(10): 1430-1447 (2006)
  • 51. References (cont’d)
    • [Metzler et al., SIGIR 2009] Donald Metzler, Rosie Jones, Fuchun Peng, Ruiqiang Zhang: Improving search relevance for implicitly temporal queries. SIGIR 2009: 700-701
    • [Nunes et al., ECIR 2008] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Use of Temporal Expressions in Web Search. ECIR 2008: 580-584
    • [Risvik et al., CN 2002] Knut Magne Risvik, Rolf Michelsen: Search engines and Web dynamics. Computer Networks 39(3): 289-302 (2002)
    • [Shokouhi, SIGIR 2011] Milad Shokouhi: Detecting Seasonal Queries by Time-Series Analysis. SIGIR 2011: 1171-1172
    • [Strötgen et al., TempWeb 2012] Jannik Strötgen, Omar Alonso, Michael Gertz: Identification of top relevant temporal expressions in documents. Temporal Web Workshop 2012
    • [Tahmasebi et al., COLING 2012] Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge Holzmann, Thomas Risse: NEER: An Unsupervised Method for Named Entity Evolution Recognition. COLING 2012
    • [WebDyn 2010] Web Dynamics course: http://www.mpi-inf.mpg.de/departments/d5/teaching/ss10/dyn/, Max-Planck Institute for Informatics, Saarbrücken, Germany, 2010
    • [Zhang et al., EMNLP 2010] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang, Zhaohui Zheng: Learning Recurrent Event Queries for Web Search. EMNLP 2010: 1129-1139
  • 52. Further Reading 1. Tu Ngoc Nguyen, Nattiya Kanhabua: Leveraging Dynamic Query Subtopics for Time-Aware Search Result Diversification. ECIR 2014: 222-234 2. Miles Efron: Query representation for cross-temporal information retrieval. SIGIR 2013: 383-392