Searching over the past, present and future
Roi Blanco (roi@yahoo-inc.com) 
http://labs.yahoo.com/Yahoo_Labs_Barcelona
Yahoo! Research Barcelona 
Established January, 2006 
Led by Ricardo Baeza-Yates 
Research areas 
• Web Mining 
• Social Media 
• Distributed Web retrieval 
• Geo information retrieval 
• NLP and Semantics
Agenda 
• Natural Language retrieval 
• Time and search engines 
• Searching over web archives 
• Searching on real time information 
• Caching! 
• Time-based exploratory search 
• Searching over future events 
• Future directions
Natural Language Retrieval 
• How to exploit the structure and meaning of 
natural language text to improve search 
• Current search engines perform only limited NLP 
(tokenization, stemming) 
• Automated tools exist for deeper analysis 
• Applications to diversity-aware search 
• Source, Location, Time, Language, Opinion, Ranking… 
• Search over semi-structured data, semantic search 
• Roll-out user experiences that use higher layers of 
the NLP stack 
• In this talk, focus on the time dimension
Searching over the past, present and future
High-level Architecture of WSEs 
[Figure: architecture diagram with the components Cache, Query results, Runtime system, Parser/Tokenizer, Index terms, Engine, queries, Indexing pipeline, WWW]
Web Search and time 
• Information freshness adds constraints/tensions in 
every layer of WSE 
• Architecture 
• Crawling 
• Indexing 
• Caching 
• Serving system 
• Modeling 
• Time-dependent user intent 
• UI (how to let the user take control) 
Adding the time dimension 
• Some solutions don’t scale up anymore 
Review your architecture 
Review your algorithms 
Add more machines (~$$$) 
• Some solutions don’t apply anymore 
Caching 
Evolution 
• 1999 
• Index updated ~once per month 
• Disk-based updates/indexing 
• 2001 
• In-memory indexes 
• Changes the whole game!
• 2007 
• Indexing time < 1 minute 
• Accept updates while serving 
• Now 
• Focused crawling, delayed transactions, etc. 
• Batch Updates -> Incremental processing 
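The move from batch updates to incremental processing can be illustrated with a toy in-memory inverted index that accepts new documents between queries. This is only a minimal sketch: the class name `IncrementalIndex`, whitespace tokenization, and conjunctive (AND) matching are assumptions made for illustration, not details from the talk.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy in-memory inverted index that can be updated while it is queried."""

    def __init__(self):
        # term -> set of ids of documents that contain the term
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Incremental update: only the postings of this document's terms change.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # Conjunctive query: return documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

A document added with `add` is visible to the very next `search` call, with no rebuild of the whole index in between.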
Some landmarks 
• Reliable distributed storage 
• Some models/processes require millions of accesses 
• Massive parallelization 
• Map/Reduce – Hadoop 
• Semi-structured storage systems 
• Asynchronous item updates 
What’s going on “right now”? 
Query temporal profiles 
• Modeling 
• Time-dependent user intent 
• Implicitly time-qualified search queries 
• SIGIR 
• Dream theater barcelona 
• Barcelona vs Madrid 
• …. 
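One simple way to picture a query temporal profile is as the distribution of a query's occurrences over time buckets. The sketch below assumes a hypothetical log of dates on which the query was issued and uses monthly buckets; both choices are illustrative and not taken from the talk.

```python
from collections import Counter
from datetime import date


def temporal_profile(timestamps):
    """Share of a query's occurrences that fall in each (year, month) bucket."""
    counts = Counter((t.year, t.month) for t in timestamps)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


# Hypothetical log for one query: three occurrences in July, one in January.
log = [date(2010, 7, 19), date(2010, 7, 20), date(2010, 7, 21), date(2010, 1, 5)]
profile = temporal_profile(log)
```

A profile concentrated in a few buckets hints that the query is implicitly time-qualified, while a flat profile suggests a time-insensitive intent.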
Caching for Real-Time Indexes 
• Queries are redundant (heavy-tail) and bursty 
• Caching search results avoids executing ~30/60% of the queries 
• Tens of machines do the work of 1000s 
• Dilemma: Freshness versus Computation 
• Extreme #1: do not cache at all – evaluate all queries 
• 100% fresh results, lots of redundant evaluations 
• Extreme #2: never invalidate the cache 
• A majority of stale results – results refreshed only due to 
cache replacement, no redundant work 
• Middle ground: invalidate periodically (TTL) 
• A time-to-live parameter is applied to each cached entry
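The TTL middle ground can be sketched as a result cache whose entries expire after a fixed age. This is a minimal illustration under stated assumptions: the class `TTLResultCache`, the `evaluate` callback standing in for the search engine, and the injectable `clock` are hypothetical names, not part of any system described in the talk.

```python
import time


class TTLResultCache:
    """Query-result cache in which every entry expires after a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the behaviour can be tested
        self.entries = {}       # query -> (results, time the entry was stored)

    def get(self, query, evaluate):
        now = self.clock()
        hit = self.entries.get(query)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]       # still within its TTL: skip the engine
        results = evaluate(query)   # missing or expired: evaluate again
        self.entries[query] = (results, now)
        return results
```

A TTL of zero reproduces extreme #1 (every query is evaluated), and an infinite TTL reproduces extreme #2 (entries are never invalidated), so the parameter trades freshness against computation.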
Caching for Incremental Indexes 
• Problem: 
• With fast crawling, cached results are not always up to date (stale) 
• Solution: 
• Cache Invalidator Predictor: looks into new documents and invalidates queries accordingly 
• Using synopses reduces the number of refreshes by up to 30% compared to a time-to-live baseline
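The idea of invalidating cached queries from newly indexed documents can be sketched as follows. This is a deliberately simplified stand-in, not the predictor from the cited SIGIR 2010 paper: it assumes that a cached query becomes stale whenever all of its terms occur in a new document, and it matches against the full document text rather than a synopsis. The class and method names are hypothetical.

```python
class InvalidatingCache:
    """Result cache that drops entries a newly indexed document might affect."""

    def __init__(self):
        self.results = {}   # query string -> cached results

    def put(self, query, results):
        self.results[query] = results

    def on_new_document(self, text):
        """Invalidate every cached query whose terms all occur in the document."""
        doc_terms = set(text.lower().split())
        stale = [q for q in self.results
                 if set(q.lower().split()) <= doc_terms]
        for q in stale:
            del self.results[q]
        return stale
```

Unlike a TTL, entries that no new document touches stay cached indefinitely, which is where the saving in refreshes would come from.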
Time(ly) opportunities 
Can we create new user experiences based on a deeper 
analysis and exploration of the time dimension? 
Goals: 
Build an application that helps users to explore, interact 
and ultimately understand existing information about 
the past and the future. 
Help the user cope with the information overload and 
eventually find/learn about what she’s looking for
Original Idea 
R. Baeza-Yates, Searching the Future, MF/IR 2005 
On December 1st 2003, on Google News, there were more than 100K 
references to 2004 and beyond. 
E.g. 2034: 
The ownership of Dolphin Square in London must revert to an 
insurance company. 
Voyager 2 should run out of fuel. 
Long-term care facilities may have to house 2.1 million people 
in the USA. 
A human base on the Moon would be in operation.
Time Explorer 
• Public demo since August 2010 
• For exploring news through time and into the 
future 
• Using 1.8M news articles from the New York Times 
Annotated Corpus 
• Try it at 
http://fbmya01.barcelonamedia.org:8080/future/
Time Explorer
Time Explorer - Motivation 
• Time is important to search 
• Recency, particularly in news, is highly related to relevance 
• But what about evolution over time? 
• How has a topic evolved over time? 
• How did the entities (people, places, etc.) evolve with respect to the topic over time? 
• How will this topic continue to evolve in the future? 
• How do bias and sentiment in blogs and news change over time? 
• Google Trends, Yahoo! Clues, RecordedFuture … 
• Great research playground
Time Explorer
Collections 
• New York Times (1.8 million documents) 
• Well structured 
• Manual annotations 
• Publicly available 
• But not diverse 
• Web Crawl Collection (100 news sources and 500 blog sites) 
• Great for diversity 
• Challenging because of format, languages, structure, etc. 
• Custom Collections 
• Yahoo! News
Analysis Pipeline 
• Tokenization, sentence splitting, part-of-speech tagging, chunking with OpenNLP 
• Entity extraction with SuperSense tagger 
• Time expressions extracted with TimeML 
• Explicit dates (August 23rd, 2008) 
• Relative dates (next year, resolved with the publication date) 
• Sentiment analysis with LivingKnowledge 
• Ontology matching with Yago 
• Image analysis – sentiment and face detection
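Resolving a relative date against the publication date can be sketched as below. This is a toy illustration that handles only three hand-picked expressions and returns a year; real temporal taggers cover far more cases, and the function name `resolve_year` is hypothetical.

```python
from datetime import date


def resolve_year(expression, pub_date):
    """Map a few relative time expressions to a calendar year.

    Returns None for expressions this sketch does not cover.
    """
    offsets = {"last year": -1, "this year": 0, "next year": 1}
    key = expression.strip().lower()
    if key in offsets:
        return pub_date.year + offsets[key]
    return None


# "Next year" in an article published on August 23rd, 2008 refers to 2009.
resolved = resolve_year("Next year", date(2008, 8, 23))
```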
Indexing/Search 
• Lucene/Solr search platform to index and search 
– Sentence level 
– Document level 
• Facets for entity types 
• Index publication date and content date (extracted dates if they exist, otherwise the publication date) 
• Solr faceting aggregates entity counts over the query results (for entity ranking) and counts over time 
• Content date allows searching into the future
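The content-date rule and the aggregation of counts over time can be sketched in plain Python. This stands in for what a date facet would compute and does not use Solr; the document layout (a dict with `pub_date` and `extracted_dates`) and the function names are assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def content_dates(doc):
    """Extracted dates if any exist, otherwise the publication date."""
    return doc["extracted_dates"] or [doc["pub_date"]]


def counts_per_year(docs):
    """Number of documents whose content dates fall in each year."""
    counts = Counter()
    for doc in docs:
        # Count each document at most once per year it mentions.
        for year in {d.year for d in content_dates(doc)}:
            counts[year] += 1
    return counts


docs = [
    {"pub_date": date(2003, 12, 1), "extracted_dates": [date(2034, 1, 1)]},
    {"pub_date": date(2003, 12, 1), "extracted_dates": []},
    {"pub_date": date(2005, 6, 1),
     "extracted_dates": [date(2034, 5, 1), date(2034, 7, 1)]},
]
timeline = counts_per_year(docs)
```

Because content dates may lie after the publication date, the resulting timeline extends past the newest article, which is what makes a search "into the future" possible.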
UI - Demo 
Time Explorer
UI - Demo
Timeline
Timeline - Document
Facets
Timeline – Facet Trend
Timeline – Future
Timeline – Oil Spill
Oil Spill – Gulf of Mexico
Oil Spill – Predictions 2011
UI - Snippets 
Snippet – With Source Summary 
Snippet – With image support – Negative Image
Ongoing Work 
• Better Sentiment Detection 
– How has sentiment towards a particular topic changed 
over time 
• Better Bias Detection 
– How does Fox News differ from NYT on presenting global 
warming 
• Future Mentions to Future Prediction 
– Which opinions to trust? 
– How to aggregate? 
• Move to web dataset 
– Domain shift – news to blogs 
– Noisy data – boilerplate, more date formats, etc. 
• Integrating multimedia data
Any Questions? 
Thanks for your attention 
Joint work with Mike 
Matthews, Peter Mika, Jordi 
Atserias, Hugo Zaragoza and 
many others
References 
• Roi Blanco, Edward Bortnikov, Flavio Junqueira, Ronny Lempel, Luca Telloli, Hugo Zaragoza. Caching Search Engine Results over Incremental Indices. SIGIR 2010. 
• Michael Matthews, Pancho Tolchinsky, Roi Blanco, Jordi Atserias, Peter Mika, Hugo Zaragoza. Searching through time in the New York Times. HCIR 2010. 
• Nattiya Kanhabua, Roi Blanco, Michael Matthews. Ranking Related News Predictions. SIGIR 2011. 
• Ricardo Baeza-Yates. Searching the future. MF/IR workshop, 2005.


Editor's Notes

  • #11: And now that we have the data online, what do we do with it?