Linked Data Query Processing
Tutorial at the 22nd International World Wide Web Conference (WWW 2013)
May 14, 2013
http://db.uwaterloo.ca/LDQTut2013/
3. Source Selection
Olaf Hartig
University of Waterloo
WWW 2013 Tutorial on Linked Data Query Processing [ Source Selection ] 2
“Ingredients” for LD Query Execution
● Data retrieval approach
● Data source selection
● Data source ranking
(optional, for optimization)
● Result construction approach
● i.e., query-local data processing
● Combining data retrieval
and result construction
[Figure: GET http://.../movie2449 adds retrieved data to the query-local data; the result has variables ?actor and ?loc with rows (http://mdb.../Paul, http://geo.../Berlin) and (http://mdb.../Ric, http://geo.../Rome)]
Query-Specific Relevance of URIs
● Definition: A URI is relevant for a given query if looking up
this URI gives us data that contributes to the query result.
● Example:
● Conjunctive query (BGP): { (Bob, lives in, ?x) , (?y, lives in, ?x) }
● Looking up URI Bob gives us: { (Bob, lives in, Berlin) , ... }
● Looking up URI Alice gives us: { (Alice, lives in, Berlin) , ... }
● Hence, μ = { ?x → Berlin , ?y → Alice } is a solution
● Thus, URIs Bob and Alice are relevant for the query
● Simply contributing a matching triple is not sufficient:
● Suppose URI Charles gives us { (Charles, lives in, London) , ... }
● This triple matches (?y, lives in, ?x), but London does not join with
the binding ?x → Berlin; since the triple cannot be used for computing
a solution, URI Charles is not relevant.
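The example above can be replayed with a minimal Python sketch. The dictionary `WEB` stands in for real URI lookups, and the helper names (`match`, `solutions`, `relevant_uris`) are illustrative assumptions, not part of the tutorial. A URI counts as relevant here only if one of its triples is used in at least one complete solution.

```python
# Hypothetical lookup results: URI -> set of RDF triples (stands in for HTTP GET).
WEB = {
    "Bob": {("Bob", "lives in", "Berlin")},
    "Alice": {("Alice", "lives in", "Berlin")},
    "Charles": {("Charles", "lives in", "London")},
}

BGP = [("Bob", "lives in", "?x"), ("?y", "lives in", "?x")]


def match(pattern, triple, mu):
    """Return mu extended so that pattern matches triple, or None if impossible."""
    mu = dict(mu)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if mu.setdefault(p, t) != t:
                return None  # variable already bound to a different term
        elif p != t:
            return None
    return mu


def solutions(bgp, web):
    """Return (solution mapping, URIs whose data was used) for every solution."""
    tagged = [(t, uri) for uri, data in web.items() for t in data]
    partial = [({}, frozenset())]
    for pattern in bgp:
        extended = []
        for mu, used in partial:
            for triple, uri in tagged:
                mu2 = match(pattern, triple, mu)
                if mu2 is not None:
                    extended.append((mu2, used | {uri}))
        partial = extended
    return partial


def relevant_uris(bgp, web):
    """URIs that contributed a triple to at least one solution."""
    return set().union(*(used for _, used in solutions(bgp, web)))
```

Under these assumptions, `relevant_uris(BGP, WEB)` contains Bob and Alice but not Charles, although the Charles triple matches the second triple pattern in isolation.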
Objective of Source Selection
● Source selection: Given a Linked Data query,
determine a set of URIs to look up
● Ideal source selection approach:
● For any query, selects all relevant URIs
● For any query, selects relevant URIs only
● Irrelevant URIs are not required to answer the query
● Avoiding their lookup reduces cost of query executions
significantly!
● Caveat:
● Which URIs are relevant (resp. irrelevant) is unknown
before the query execution has been completed.
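The two properties of an ideal approach can be checked after the fact, once the relevant URIs are known. The sketch below is an assumption-laden illustration: the names `selection_quality`, "recall", and "precision" are borrowed labels, not terms used in the slides.

```python
def selection_quality(selected, relevant):
    """Return (recall, precision) of a selected URI set w.r.t. the relevant URIs.

    recall == 1.0    -> all relevant URIs were selected
    precision == 1.0 -> only relevant URIs were selected
    """
    selected, relevant = set(selected), set(relevant)
    hits = selected & relevant
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(selected) if selected else 1.0
    return recall, precision
```

Because relevance is only known after query execution, such a measure is useful for evaluating a source selection approach, not for steering it at runtime.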
Outline
● Objectives of Source Selection √
● Index-Based Strategy
➢ General Idea
➢ Possible Index Structures
● Live Exploration Strategy
● Comparison of both Strategies
● Combining both Strategies
Idea of Index-Based Source Selection
● Use a pre-populated index structure to determine relevant
URIs (and to avoid as many irrelevant ones as possible)
● Example: triple-pattern-based indexes
● For single triple pattern queries, source
selection using such an index structure is
sound and complete (w.r.t. the indexed URIs)
[Figure: Key: tp → Entry: { uri_1, uri_2, … , uri_n }; the data retrieved by GET uri_i matches tp]
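A triple-pattern-based index can be sketched as a dictionary from triple patterns to URI sets. This is one possible layout, not the structure of any particular system: every triple is registered under all eight patterns it matches, with `None` standing for a variable. The names `build_index` and `select_sources` and the sample `DATA` are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

WILDCARD = None  # stands for "any variable" in an index key


def patterns_of(triple):
    """All 8 triple patterns matched by the triple (each position kept or wildcarded)."""
    return product(*[(term, WILDCARD) for term in triple])


def build_index(data_by_uri):
    """Map each triple pattern to the set of URIs whose data contains a matching triple."""
    index = defaultdict(set)
    for uri, triples in data_by_uri.items():
        for triple in triples:
            for tp in patterns_of(triple):
                index[tp].add(uri)
    return index


def select_sources(index, tp):
    """Look up a single triple pattern; variable names are dropped for the key."""
    key = tuple(WILDCARD if term.startswith("?") else term for term in tp)
    return set(index.get(key, set()))


# Hypothetical data of three indexed URIs.
DATA = {
    "Bob": {("Bob", "lives in", "Berlin")},
    "Alice": {("Alice", "lives in", "Berlin")},
    "Charles": {("Charles", "lives in", "London")},
}
INDEX = build_index(DATA)
```

For a single triple pattern with distinct variables, the lookup returns exactly the indexed URIs that have a matching triple. A pattern that repeats a variable, such as (?x, p, ?x), is only approximated because the key no longer records which positions share a variable.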
General Properties of Lookup Indexes
● Index entries:
● Usually, a set of URIs
● Each URI in such an entry may be paired
with a cardinality (utilized for source ranking)
● Indexed URIs may appear multiple times
(i.e., associated with multiple index keys)
● Type of index keys depends on the
particular index structure used
● e.g., triple patterns
● Represent a summary of the data from all indexed URIs
● Perfect summary: index keys are individual elements
● Approximate summary: index keys may range over elements
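The pairing of URIs with cardinalities can be sketched as follows. For brevity the sketch assumes that index keys are predicates (individual elements, hence a perfect summary in the sense above) and that the cardinality is the number of triples with that predicate; both choices and all names are illustrative.

```python
from collections import Counter, defaultdict


def build_counting_index(data_by_uri):
    """Map each predicate to a Counter: URI -> number of triples with that predicate."""
    index = defaultdict(Counter)
    for uri, triples in data_by_uri.items():
        for _s, p, _o in triples:
            index[p][uri] += 1
    return index


def ranked_sources(index, key):
    """URIs of the entry for key, highest cardinality first (source ranking)."""
    entry = index.get(key, Counter())
    return [uri for uri, _count in entry.most_common()]


# Hypothetical data: uriA mentions "lives in" twice, uriB once.
DATA = {
    "uriA": {("Bob", "lives in", "Berlin"), ("Alice", "lives in", "Berlin"),
             ("Bob", "knows", "Alice")},
    "uriB": {("Charles", "lives in", "London")},
}
COUNTING_INDEX = build_counting_index(DATA)
```

Note that uriA appears under two keys ("lives in" and "knows"), which illustrates that indexed URIs may be associated with multiple index keys.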
Perfect Summaries
● Triple-pattern-based indexes
● “Inverted URI Indexing” [UHK+11]
● “Schema-level Indexing” [UHK+11]
● Index keys: schema elements
● Like a triple-pattern-based index that considers only two types
of triple patterns: ( ?s, property, ?o ) and ( ?s, rdf:type, class )
● Tian et al. [TUY11]
● Index keys: Unique encodings of combinations of triple
patterns (i.e., BGPs) frequently found in a query workload
[Figure: Key: uri → Entry: { uri_1, … , uri_n }; uri is mentioned in the data retrieved by GET uri_i]
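Schema-level indexing can be sketched as an index with two kinds of keys, one per supported pattern type. The key encoding (`("property", p)` and `("class", c)`), the decision to file rdf:type triples under class keys only, and the function names are assumptions of this sketch, not details taken from [UHK+11].

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def is_var(term):
    return term.startswith("?")


def build_schema_index(data_by_uri):
    """Index URIs by the properties and classes that occur in their data."""
    index = defaultdict(set)
    for uri, triples in data_by_uri.items():
        for _s, p, o in triples:
            if p == RDF_TYPE:
                index[("class", o)].add(uri)      # covers ( ?s, rdf:type, class )
            else:
                index[("property", p)].add(uri)   # covers ( ?s, property, ?o )
    return index


def schema_select(index, tp):
    """Return candidate URIs for a triple pattern, or None if the index cannot help."""
    _s, p, o = tp
    if is_var(p) or (p == RDF_TYPE and is_var(o)):
        return None  # pattern has no schema element to look up
    key = ("class", o) if p == RDF_TYPE else ("property", p)
    return set(index.get(key, set()))


# Hypothetical data of two indexed URIs.
DATA = {
    "u1": {("Bob", RDF_TYPE, "Person"), ("Bob", "lives in", "Berlin")},
    "u2": {("Berlin", RDF_TYPE, "City")},
}
SCHEMA_INDEX = build_schema_index(DATA)
```

Constants in subject position (and in object position of non-type patterns) are ignored by this lookup, so for a pattern such as (Bob, lives in, ?x) the result is a superset of the URIs that actually contain a matching triple.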
Approximate Summaries
● Recall, index keys may range over elements
● Advantage: approximation reduces index size
● Disadvantage: index lookup may return false positives
● Examples of data structures used:
● Multidimensional histogram [UHK+11]
● QTree [HHK+10, UHK+11]
Multidimensional Histograms
● Transform RDF triples to points in a 3-dimensional space
(Bob, lives in, Berlin) → hash function → (422, 247, 143)
● Buckets partition that space into disjoint regions
● Indexing: Each bucket contains entries for all URIs whose
data includes an RDF triple in the corresponding region
● Source selection:
● Transform triple patterns to lines / planes in the space
(Bob, lives in, ?x) → (422, 247, ?)
● Any URI relevant for the triple pattern
may only be contained in buckets whose
region is touched by the line / plane
● Pruning due to non-overlapping regions
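The indexing and source selection steps can be sketched in a few lines of Python. The hash function (CRC32 modulo an assumed coordinate range), the grid of equally sized buckets, and all names are assumptions chosen for a compact example; the slides do not prescribe them.

```python
import zlib
from collections import defaultdict

SPACE = 1024            # assumed coordinate range per dimension
CELLS = 4               # assumed number of buckets per dimension
WIDTH = SPACE // CELLS  # side length of one bucket region


def coord(term):
    """Deterministic hash of an RDF term to a coordinate in [0, SPACE)."""
    return zlib.crc32(term.encode("utf-8")) % SPACE


def bucket_of(triple):
    """Grid cell (bucket) that contains the point of the triple."""
    return tuple(coord(term) // WIDTH for term in triple)


def build_histogram(data_by_uri):
    """Each bucket records the URIs whose data has a triple in its region."""
    buckets = defaultdict(set)
    for uri, triples in data_by_uri.items():
        for triple in triples:
            buckets[bucket_of(triple)].add(uri)
    return buckets


def mdh_select(buckets, tp):
    """Collect URIs from all buckets touched by the line / plane of the pattern."""
    wanted = [None if term.startswith("?") else coord(term) // WIDTH for term in tp]
    selected = set()
    for cell, uris in buckets.items():
        if all(w is None or w == c for w, c in zip(wanted, cell)):
            selected |= uris  # region overlaps; other buckets are pruned
    return selected


# Hypothetical data of three indexed URIs.
DATA = {
    "Bob": {("Bob", "lives in", "Berlin")},
    "Alice": {("Alice", "lives in", "Berlin")},
    "Charles": {("Charles", "lives in", "London")},
}
HISTOGRAM = build_histogram(DATA)
```

The selection never misses an indexed URI that has a matching triple, but it may return false positives: any URI with some triple in a touched bucket is selected, whether or not that triple matches the pattern.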
QTree
● Combination of histograms and R-trees (i.e., hierarchical)
● Leaf nodes are the buckets
● Different buckets may represent regions of different size
(in contrast to fixed-sized regions used for MDH)
● Non-populated regions are ignored
● Deals more efficiently with a space that is populated sparsely or
contains many clusters
[Figure: regions A (with A1, A2), B (with B1, B2), and C in the data space, next to the corresponding tree: Root → A, B, C; A → A1, A2; B → B1, B2]
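The hierarchical pruning idea can be sketched with a small hand-built tree of bounding boxes. This is not the QTree construction algorithm from [HHK+10]; the tree, the box coordinates, and the names `Node`, `overlaps`, and `search` are assumptions that only illustrate how a lookup descends into overlapping regions and skips the rest.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    box: tuple                                    # ((lo, hi), (lo, hi), (lo, hi))
    children: list = field(default_factory=list)  # inner nodes only
    uris: set = field(default_factory=set)        # leaf nodes (buckets) only


def overlaps(a, b):
    """True if two axis-aligned boxes intersect in every dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1 for (lo1, hi1), (lo2, hi2) in zip(a, b))


def search(node, query_box):
    """Collect URIs from all buckets whose region overlaps the query box."""
    if not overlaps(node.box, query_box):
        return set()                 # prune the whole subtree
    if not node.children:
        return set(node.uris)        # leaf node: a bucket
    found = set()
    for child in node.children:
        found |= search(child, query_box)
    return found


# Hypothetical tree: Root -> A (-> A1, A2), B. Buckets have different sizes.
A1 = Node(box=((400, 450), (200, 300), (100, 200)), uris={"Bob"})
A2 = Node(box=((460, 500), (200, 300), (100, 200)), uris={"Alice"})
A = Node(box=((400, 500), (200, 300), (100, 200)), children=[A1, A2])
B = Node(box=((0, 100), (0, 100), (0, 100)), uris={"Charles"})
ROOT = Node(box=((0, 500), (0, 300), (0, 200)), children=[A, B])

# The pattern (422, 247, ?) becomes a line: fixed subject and predicate, open object.
QUERY = ((422, 422), (247, 247), (0, 1023))
```

With these assumed boxes, `search(ROOT, QUERY)` visits A and A1 only; A2 and B are pruned because their regions do not contain subject coordinate 422.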
Index Construction
● Given a set of URIs to index, each of these URIs needs to
be looked up and its data needs to be retrieved
● Alternative: crawl the Web to obtain URIs and their data
● Alternative: populate index as a by-product of executing
queries using live-exploration-based source selection
Index Maintenance
● Adding additionally discovered URIs
● Keeping the index in sync with original data
● Still an open research problem
● Similar to index maintenance in
information retrieval and
view maintenance in
database systems
Live Exploration
● General idea: Perform a recursive URI lookup process
at query execution runtime
● Start from a set of seed URIs
● Explore the queried Web by traversing data links
● Retrieved data serves two purposes:
(1) Discover further URIs
(2) Construct query result
● Lookup of URIs may be constrained
(i.e., not all links need be traversed)
● Natural support of reachability-based query semantics
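The recursive lookup process can be sketched as a breadth-first traversal. The in-memory `WEB` dictionary replaces real HTTP lookups, the `http://example.org/` URIs are made up, and the parameters `may_follow` and `max_lookups` are assumptions that model the optional constraint on which links are traversed.

```python
from collections import deque

EX = "http://example.org/"  # hypothetical namespace

# Stand-in for the Web: URI -> triples obtained by looking up that URI.
WEB = {
    EX + "Bob": {(EX + "Bob", EX + "livesIn", EX + "Berlin"),
                 (EX + "Bob", EX + "knows", EX + "Alice")},
    EX + "Alice": {(EX + "Alice", EX + "livesIn", EX + "Berlin")},
    EX + "Dave": {(EX + "Dave", EX + "livesIn", EX + "Berlin")},
}


def lookup(uri):
    """Simulated URI lookup; unknown URIs yield no data."""
    return WEB.get(uri, set())


def live_exploration(seeds, may_follow=lambda uri, triple: True, max_lookups=50):
    """Traverse data links from the seed URIs; return (query-local data, looked-up URIs)."""
    queue = deque(seeds)
    scheduled = set(seeds)
    looked_up = []
    data = set()
    while queue and len(looked_up) < max_lookups:
        uri = queue.popleft()
        triples = lookup(uri)
        looked_up.append(uri)
        data |= triples                      # purpose (2): input for result construction
        for triple in triples:               # purpose (1): discover further URIs
            for term in triple:
                if (term.startswith("http://") and term not in scheduled
                        and may_follow(term, triple)):
                    scheduled.add(term)
                    queue.append(term)
    return data, looked_up
```

Starting from the seed `EX + "Bob"`, the traversal reaches Alice's data through the `knows` link. Dave's data is never retrieved because no traversed link leads to it, which is the behavior a reachability-based query semantics describes. Passing `may_follow=lambda uri, triple: uri != triple[1]` is one way to skip predicate URIs.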
Comparison to Focused Crawling
Focused Crawling:
● Separate pre-runtime (or background) process
● Crawler populates a search index or a local database
● URIs qualify for lookup because of their high relevance for a topic
Live Exploration:
● Essential part of the query execution process itself
● Live exploration aims to discover data for answering a particular query
● Relevance of URIs is tied to the query at hand
Live Exploration vs. Index-Based
Live Exploration:
● Possibilities for parallelized data retrieval are limited
● Data retrieval adds to query execution time significantly
● Usable immediately
● Most suitable for “on-demand” querying scenarios
● Depends on the structure of the network of data links
Index-Based:
● Data retrieval can be fully parallelized
● Reduces the impact of data retrieval on query execution time
● Usable only after an initialization phase
● Depends on what has been selected for the index
● May miss new data sources
Neither strategy is superior to the other w.r.t.
result completeness (under full-Web query semantics).
● Both strategies may miss (different) solutions for a query
Hybrid Source Selection
Why not get the best of both strategies by combining them?
● Ideas:
● Use index to obtain seed URIs for live exploration
(e.g., “mixed strategy” [LT10])
● Feed back information discovered by live exploration
to update, to expand, or to reorganize the index
● Use data summary for controlling a live exploration process
(e.g., by prioritizing the URIs scheduled for lookup)
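Two of these ideas, index-provided seeds and summary-based prioritization, can be combined in a small sketch. It is not the “mixed strategy” of [LT10]; the inputs `index_scores` (an estimated number of matching triples per URI) and `web_links` (a stand-in for the links found by real lookups) as well as the lookup `budget` are assumptions.

```python
import heapq


def hybrid_lookup_order(index_scores, web_links, budget):
    """Return the order in which URIs would be looked up.

    index_scores: URI -> estimated cardinality taken from an index / data summary
    web_links:    URI -> URIs mentioned in its data (simulates link discovery)
    budget:       maximum number of lookups
    """
    # Seeds: all URIs the index considers promising, best first (min-heap on -score).
    heap = [(-score, uri) for uri, score in index_scores.items() if score > 0]
    heapq.heapify(heap)
    scheduled = {uri for _, uri in heap}
    order = []
    while heap and len(order) < budget:
        _, uri = heapq.heappop(heap)
        order.append(uri)  # the actual lookup would happen here
        for linked in web_links.get(uri, ()):
            if linked not in scheduled:
                scheduled.add(linked)
                # URIs unknown to the index get the lowest priority (score 0).
                heapq.heappush(heap, (-index_scores.get(linked, 0), linked))
    return order


# Hypothetical inputs: the index knows A and B; C and D are discovered live.
INDEX_SCORES = {"A": 3, "B": 1}
WEB_LINKS = {"A": ["C", "B"], "B": [], "C": ["D"]}
```

In this setup, URIs known to the index are looked up first in order of their score, and URIs discovered only by traversal (C, then D) follow as long as the budget allows. Feeding the discovered URIs back into the index, the second idea on the slide, is left out of the sketch.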
Next part: 4. Execution Process ...
These slides have been created by
Olaf Hartig
for the
WWW 2013 tutorial on
Linked Data Query Processing
Tutorial Website: http://db.uwaterloo.ca/LDQTut2013/
This work is licensed under a
Creative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)
(Slides 10, 11, and 12 are inspired by slides
from Andreas Harth [HHK+10]. Thanks!)
