Effective and Efficient Entity Search in RDF data 
Roi Blanco1, Peter Mika1 and Sebastiano Vigna2 
1 Yahoo! Research 
2 Università degli Studi di Milano
Semantic Search 
• Unstructured or hybrid search over RDF data 
– Supporting end-users 
• Users who cannot express their need in SPARQL 
– Dealing with large-scale data 
• Giving up query expressivity for scale 
– Dealing with heterogeneity 
• Users who are unaware of the schema of the data 
• No single schema to the data 
– Example: 2.6m classes and 33k properties in Billion Triples 2009 
• Entity search 
– Queries where the user is looking for a single entity named or 
described in the query 
– e.g. kaz vaporizer, hospice of cincinnati, mst3000
Use cases in web search 
• Top-1 entity with structured data 
• Related entities 
• Structured data extracted from HTML
Information access in the Semantic Web 
• Database-style indexing of RDF data 
– Triple stores 
– Structural queries (SPARQL) 
– No ranking 
– Evaluation focused on efficiency 
• IR-style indexing of RDF data 
– Search engines 
– Keyword queries 
– Ranking 
– Evaluation focused on effectiveness 
• Combined methods 
– Keyword matching and limited join processing
Related work 
• Ranking methods on RDF data 
– Wang et al. Semplore: A scalable IR approach to search the 
Web of Data. ISWC 2007, JWS 7(3) 
– Pérez-Agüera et al. Using BM25F for semantic search. 
SemSearch 2010 
– (many others) 
• Evaluation campaigns 
– SemSearch Challenge 2010, 2011 
– Question-Answering over Linked Data (QALD) 2011 
– TREC Entity Track 2010, 2011 
• Keyword search in databases 
– No open evaluation campaigns
Architecture overview 
1. Download, uncompress, convert (if needed) 
2. Sort quads by subject 
3. Compute Minimal Perfect Hash (MPH) 
4. Each mapper reads part of the collection 
5. Each reducer builds an index for a subset of the vocabulary 
6. Optionally, we also build an archive (forward-index) 
7. The sub-indices are merged into a single index 
8. Serving and ranking 
[Figure: Doc → map / reduce tasks → Index, annotated "1st part of the talk" and "2nd part"]
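
The pre-processing steps 1 to 3 can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes quads are already parsed into (subject, predicate, object, context) tuples, and it replaces the minimal perfect hash with a plain dictionary that assigns dense integer ids. The function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_by_subject(quads):
    """Sort (subject, predicate, object, context) tuples and group them per subject."""
    ordered = sorted(quads, key=itemgetter(0))  # stable sort on the subject only
    docs = {}
    for subject, group in groupby(ordered, key=itemgetter(0)):
        docs[subject] = [(p, o, c) for _, p, o, c in group]
    return docs


def assign_doc_ids(subjects):
    """Stand-in for the minimal perfect hash: map each subject to a dense integer id."""
    return {subject: i for i, subject in enumerate(sorted(subjects))}
```

A real minimal perfect hash gives the same subject-to-integer mapping without storing the keys, which matters at the scale of hundreds of millions of URIs.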
RDF indexing using MapReduce 
• Text indexing using MapReduce 
– Map: parse input into (term, doc) pairs 
• Pre-processing such as stemming, blacklisting 
• To support phrase queries values are (doc, position) pairs 
– Reduce: collect all values for the same key: (term, {doc1,doc2…}), 
output posting-list 
• Secondary sort to pre-sort document ids before iteration 
• RDF indexing using MapReduce (see Mika, SemSearch 2009) 
– Document is all triples with a given subject 
• Variations: index also RDF molecules, triples where the URI is an object 
– Index terms in property-values 
• Keys are (field, term) pairs 
• Variation: distinguish values for the same property 
– Index terms in the subject URI 
• Variation: index also terms in object URIs 
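
The map and reduce phases described above can be sketched as two plain Python functions. This is a single-process illustration under stated assumptions: a document is a list of (property, value) pairs for one subject, tokenization is a simple lowercase split on alphanumerics, and `sorted` stands in for the secondary sort. The function names are hypothetical and no Hadoop API is used.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def map_document(doc_id, triples):
    """Emit ((field, term), (doc_id, position)) pairs for one subject's triples."""
    position = 0
    for prop, value in triples:
        for term in TOKEN.findall(value.lower()):
            yield (prop, term), (doc_id, position)
            position += 1


def reduce_postings(pairs):
    """Collect the values of each key into a posting list sorted by document id."""
    postings = defaultdict(list)
    for key, value in pairs:
        postings[key].append(value)
    return {key: sorted(values) for key, values in postings.items()}
```

Keying on (field, term) rather than on the term alone is what lets a query later restrict a keyword to a given property.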
Horizontal index structure 
• One field per position 
– one for object (token), one for predicates (property), optionally one for context 
• For each term, store the property on the same position in the 
property index 
– Positions are required even without phrase queries 
• Query engine needs to support fields and the alignment operator 
✓ Dictionary is number of unique terms + number of properties 
✓ Occurrences is number of tokens * 2
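
A small sketch may help to see why positions are needed here. It assumes two parallel fields that share position numbers, and implements the alignment operator as an intersection of position lists. The data layout and function names are illustrative assumptions, not the query engine used in the talk.

```python
from collections import defaultdict


def build_horizontal(docs):
    """Build parallel 'token' and 'property' fields that share positions."""
    token_index = defaultdict(lambda: defaultdict(list))     # term -> doc -> positions
    property_index = defaultdict(lambda: defaultdict(list))  # property -> doc -> positions
    for doc_id, triples in docs.items():
        position = 0
        for prop, value in triples:
            for term in value.lower().split():
                token_index[term][doc_id].append(position)
                property_index[prop][doc_id].append(position)
                position += 1
    return token_index, property_index


def align(token_index, property_index, term, prop):
    """Return documents where `term` occurs at a position labelled with `prop`."""
    hits = []
    for doc_id, positions in token_index.get(term, {}).items():
        labelled = set(property_index.get(prop, {}).get(doc_id, []))
        if labelled.intersection(positions):
            hits.append(doc_id)
    return sorted(hits)
```

Every token is stored twice, once per field, which matches the "number of tokens * 2" occurrence count above.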
Vertical index structure 
• One field (index) per property 
• Positions are not required 
• Query engine needs to support fields 
✓ Dictionary is number of unique terms 
✓ Occurrences is number of tokens 
✗ Number of fields is a problem for merging, query performance 
• In experiments we index the N most common properties
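
For comparison, the vertical layout can be sketched as one small inverted index per property, limited to the N most frequent properties. This is an illustrative in-memory version with hypothetical function names; it assumes a document is a list of (property, value) pairs.

```python
from collections import Counter, defaultdict


def build_vertical(docs, n_fields):
    """Build one inverted index per property, keeping only the n most common properties."""
    counts = Counter(prop for triples in docs.values() for prop, _ in triples)
    kept = {prop for prop, _ in counts.most_common(n_fields)}
    fields = {prop: defaultdict(set) for prop in kept}
    for doc_id, triples in docs.items():
        for prop, value in triples:
            if prop in kept:
                for term in value.lower().split():
                    fields[prop][term].add(doc_id)
    return fields


def search_field(fields, prop, term):
    """Look up `term` in the index of a single property."""
    return sorted(fields.get(prop, {}).get(term, set()))
```

No positions are stored, but a keyword query without a field restriction has to consult every field, which is where the cost of a large number of fields shows up.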
Efficiency improvements 
• r-vertical (reduced-vertical) index 
– One field per weight vs. one field per property 
– More efficient for keyword queries but loses the ability to 
restrict per field 
– Example: three weight levels 
• Pre-computation of alignments 
– Additional term-to-field index 
– Used to quickly determine which fields contain a term (in any 
document)
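
Both ideas can be illustrated together. The sketch below indexes each term under the weight level of its property and also fills a term-to-field map while indexing. The `LEVELS` mapping is a hypothetical example, and the slide does not say which index layout the term-to-field index is paired with, so combining them here is an assumption made for brevity.

```python
from collections import defaultdict

# Hypothetical property-to-weight-level assignment, for illustration only.
LEVELS = {"name": "important", "label": "important", "comment": "unimportant"}


def build_r_vertical(docs, levels, default="neutral"):
    """Index terms under their property's weight level instead of the property itself."""
    fields = defaultdict(lambda: defaultdict(set))
    term_to_fields = defaultdict(set)  # pre-computed: which fields contain a term at all
    for doc_id, triples in docs.items():
        for prop, value in triples:
            level = levels.get(prop, default)
            for term in value.lower().split():
                fields[level][term].add(doc_id)
                term_to_fields[term].add(level)
    return fields, term_to_fields
```

With three levels a keyword query touches at most three fields, however many properties the data has, at the price of no longer knowing which property a match came from.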
Indexing efficiency 
• Billion Triples 2009 dataset 
– 249 GB in uncompressed N-Quads format 
– 114 million URIs and 274 million triples with datatype properties 
– 2.9B / 1.4B occurrences (horiz/vert) 
• Selected 300 most frequent datatype properties for vertical indexing 
• Resulting index is 9-10GB in size 
• Horizontal and vertical indexing using Hadoop 
– Scale is only limited by number of machines 
– Number of reducers is a trade-off between speed and number of sub-indices to be merged
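
The reducer trade-off can be made concrete with a small sketch: terms are partitioned across reducers, each reducer emits a sub-index sorted by term, and the sub-indices are merged at the end. The CRC-based partitioner and the function names are assumptions for illustration; Hadoop's own partitioning is not reproduced here.

```python
import heapq
import zlib


def reducer_for(term, n_reducers):
    """Assign a term to a reducer with a deterministic hash (an assumed partitioner)."""
    return zlib.crc32(term.encode("utf-8")) % n_reducers


def merge_sub_indices(sub_indices):
    """Merge per-reducer (term, postings) lists, each already sorted by term."""
    return list(heapq.merge(*sub_indices, key=lambda entry: entry[0]))
```

More reducers shorten the indexing phase but leave more sorted lists for the final merge to interleave.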
Run-time efficiency 
• Measured average execution time (including ranking) 
– Using 150k queries that lead to a click on Wikipedia 
– Avg. length 2.2 tokens 
– Baseline is plain text indexing with BM25 
• Results 
– Some cost for field-based retrieval compared to plain text indexing 
– AND is always faster than OR 
• Except in horizontal, where alignment time dominates 
– r-vertical significantly improves execution time in OR mode 
             AND mode   OR mode 
plain text      46 ms     80 ms 
horizontal     819 ms    847 ms 
vertical        97 ms    780 ms 
r-vertical      78 ms    152 ms
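
The relative costs are easier to read as ratios against the plain-text baseline. The snippet below only does arithmetic on the figures from the table; the helper name is hypothetical.

```python
# Average execution times in milliseconds, copied from the table: (AND mode, OR mode).
TIMES_MS = {
    "plain text": (46, 80),
    "horizontal": (819, 847),
    "vertical": (97, 780),
    "r-vertical": (78, 152),
}


def slowdown(index, mode):
    """Ratio of an index's average time to the plain-text baseline (mode 0=AND, 1=OR)."""
    return TIMES_MS[index][mode] / TIMES_MS["plain text"][mode]
```

In OR mode the vertical index is 9.75 times slower than plain text, while r-vertical is 1.9 times slower, roughly a five-fold reduction between the two.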
BM25F Ranking 
BM25(F) uses a term-frequency (tf) that accounts for the 
decreasing marginal contribution of terms 
where 
vs is the weight of the field 
tfsi is the frequency of term i in field s 
Bs is the document length normalization factor: 
ls is the length of field s 
avls is the average length of s 
bs is a tunable parameter
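
The formulas on this slide were images and are missing from the text. The standard BM25F definitions that match the symbols listed above are given below. This is a reconstruction from the usual BM25F formulation, not a copy of the slide, and the tilde notation for the field-weighted term frequency is an assumption.

```latex
\widetilde{tf}_i = \sum_{s} v_s \cdot \frac{tf_{si}}{B_s},
\qquad
B_s = (1 - b_s) + b_s \cdot \frac{l_s}{avl_s}
```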
BM25F ranking cont. 
• Final term score is a combination of tf and idf 
where 
k1 is a tunable parameter 
wIDF is the inverse-document frequency: 
• Finally, the score of a document D is the sum of the scores 
of query terms q
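
The formulas on this slide are also missing from the text. In the standard BM25F formulation they read as follows, writing \widetilde{tf}_i for the field-weighted term frequency of term i. The symbols N (number of documents in the collection) and n_i (number of documents containing term i) are not defined on the slide and are assumed here; any document-level weight the authors may apply is left out.

```latex
w_i^{BM25F} = \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i} \cdot w_i^{IDF},
\qquad
w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5},
\qquad
score(q, D) = \sum_{i \in q} w_i^{BM25F}
```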
Effectiveness evaluation 
• Semantic Search Challenge 2010 
– Data, queries, assessments available online 
• Billion Triples Challenge 2009 dataset 
• 92 entity queries from web search 
– Queries where the user is looking for a single entity 
– Sampled randomly from Microsoft and Yahoo! query logs 
• Assessed using Amazon’s Mechanical Turk 
– Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010 
– Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR 2011
Evaluation form
Implementation 
• Simplified model to reduce the number of parameters 
– Three levels of vs: important, neutral, unimportant 
– Assign weights to domains instead of individual doc weights wD 
– Single parameter b for all bs 
– Single parameter ls for all l, bounded by a maximum lmax=10 
• Manually classified a small number of properties and 
domains into important, neutral, unimportant 
– Future work to learn this classification 
– Weights are learned (see next)
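
A scorer for this simplified model might look as follows. It is a sketch of standard BM25F with one field per weight level and a single b, not the authors' implementation: the weight values, the defaults for b and k1, and the function name are all hypothetical, and the domain weights and the length cap are left out.

```python
import math

# Hypothetical weights for the three levels; in the talk these values are learned.
LEVEL_WEIGHTS = {"important": 3.0, "neutral": 1.0, "unimportant": 0.3}


def bm25f_score(query_terms, doc_fields, avg_len, doc_freq, n_docs, b=0.75, k1=1.2):
    """Score one document; doc_fields maps a weight level to its list of tokens."""
    score = 0.0
    for term in query_terms:
        tf = 0.0
        for level, tokens in doc_fields.items():
            # Length normalization with one shared b; avg_len values must be positive.
            norm = (1.0 - b) + b * len(tokens) / avg_len[level]
            tf += LEVEL_WEIGHTS[level] * tokens.count(term) / norm
        n_i = doc_freq.get(term, 0)
        idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5))
        score += tf / (k1 + tf) * idf  # saturating tf combined with idf
    return score
```

Note that this form of idf turns negative for terms that occur in more than half of the documents, so practical systems often clamp it at zero.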
Effectiveness results 
• Individual features 
– Positive, stat. significant improvement from each feature 
– Even a manual classification of properties and domains helps 
• Combination 
– Positive, stat. significant marginal improvement from each additional feature 
– Total improvement of 53% over the baseline 
– Different signals of relevance
Comparison to SemSearch’10 
• Two-fold cross validation 
• Tuning all parameters at the same time 
– Promising directions algorithm (Robertson and Zaragoza) 
• 42% improvement over the best method submitted 
• Performs well on short, specific queries with many results 
– Negative examples: the morning call lehigh valley pa 
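
The evaluation protocol can be sketched as follows. Two-fold cross validation is implemented as described; the tuning step, however, uses an exhaustive search over a small candidate list as a plainly simpler stand-in for the promising directions algorithm, which is not reproduced here. The `evaluate` callback and the function name are hypothetical.

```python
def two_fold_cv(queries, candidates, evaluate):
    """Tune on one half of the queries, test on the other, then swap; return the mean test score."""
    half = len(queries) // 2
    folds = [queries[:half], queries[half:]]
    test_scores = []
    for i in (0, 1):
        train, test = folds[i], folds[1 - i]
        # Stand-in tuner: pick the candidate parameters that score best on the training fold.
        best = max(candidates, key=lambda params: evaluate(params, train))
        test_scores.append(evaluate(best, test))
    return sum(test_scores) / 2
```

Here `evaluate(params, queries)` is expected to return an effectiveness measure, such as mean average precision, for the given parameters on the given queries.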
Conclusions 
• Indexing and ranking RDF data 
– Novel index structures 
– Ranking method based on BM25F 
• Future work 
– Ranking documents with metadata 
• e.g. microdata/RDFa 
– Exploiting more semantics 
• e.g. sameAs 
– Ranking triples for display 
– Question-answering
