Linked Data Query Processing
Tutorial at the 22nd International World Wide Web Conference (WWW 2013)
May 14, 2013
http://db.uwaterloo.ca/LDQTut2013/
5. Query Planning and Optimization
Olaf Hartig
University of Waterloo
WWW 2013 Tutorial on Linked Data Query Processing [ Query Planning and Optimization ] 2
Query Plan Selection
● Possible assessment criteria:
● Benefit (size of computed query result)
● Cost (overall query execution time)
● Response time (time for returning k solutions)
● To select from candidate plans, criteria must be estimated
● For index-based source selection: estimation may be
based on information recorded in the index [HHK+10]
● For (pure) live exploration: estimation impossible
● No a-priori information available
● Use heuristics instead
Outline
● Heuristics-Based Planning
● Optimizing Link Traversing Iterators
➢ Prefetching
➢ Postponing
● Source Ranking
➢ Harth et al. [HHK+10, UHK+11]
➢ Ladwig and Tran [LT10]
Heuristics-Based Plan Selection [Har11a]
● Four rules:
● DEPENDENCY RULE
● SEED RULE
● INSTANCE SEED RULE
● FILTER RULE
● Tailored to LTBQE implemented by link traversing iterators
● Assumptions about queries:
● Query pattern refers to instance data
● URIs mentioned in the query pattern are the seed URIs
DEPENDENCY RULE
Use a dependency respecting query plan
● Dependency: each triple pattern (except the first) contains a variable
that already occurs in one of the preceding triple patterns
Query:
?p ex:affiliated_with <http://.../orgaX>
?p ex:interested_in ?b
?b rdf:type <http://.../Book>
Dependency respecting plan:
tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX> ) I1
tp2 = ( ?p , ex:interested_in , ?b ) I2
tp3 = ( ?b , rdf:type , <http://.../Book> ) I3
DEPENDENCY RULE
Use a dependency respecting query plan
● Rationale: avoid Cartesian products
● Example of a plan that is not dependency respecting (tp2 shares no
variable with tp1, so their join is a Cartesian product):
tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX> ) I1
tp2 = ( ?b , rdf:type , <http://.../Book> ) I2
tp3 = ( ?p , ex:interested_in , ?b ) I3
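The dependency check can be sketched in a few lines of Python. This is only an illustration, not code from [Har11a]: the tuple representation of triple patterns and the function names are assumptions made for this example.

```python
from typing import List, Set, Tuple

# Assumed representation: a triple pattern is a (subject, predicate, object)
# tuple of strings; variables start with "?".
TriplePattern = Tuple[str, str, str]


def variables(tp: TriplePattern) -> Set[str]:
    """Return the query variables that occur in a triple pattern."""
    return {term for term in tp if term.startswith("?")}


def is_dependency_respecting(plan: List[TriplePattern]) -> bool:
    """True if every pattern after the first shares a variable with an earlier one."""
    seen: Set[str] = set()
    for position, tp in enumerate(plan):
        tp_vars = variables(tp)
        if position > 0 and not (tp_vars & seen):
            return False  # this pattern would require a Cartesian product
        seen |= tp_vars
    return True


tp_affiliated = ("?p", "ex:affiliated_with", "<http://.../orgaX>")
tp_interested = ("?p", "ex:interested_in", "?b")
tp_book = ("?b", "rdf:type", "<http://.../Book>")

good_plan = [tp_affiliated, tp_interested, tp_book]
bad_plan = [tp_affiliated, tp_book, tp_interested]

print(is_dependency_respecting(good_plan))
print(is_dependency_respecting(bad_plan))
```

The first plan passes the check; the second fails at its second pattern, which introduces ?b without any link to ?p.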
SEED RULE
Use a plan with a seed triple pattern
● Recall assumption: seed URIs = URIs in the query
● Seed triple pattern of a plan
… is the first triple pattern in the plan, and
… contains at least one HTTP URI
● Rationale: Good starting point
Query (√ marks a possible seed triple pattern):
?p ex:affiliated_with <http://.../orgaX> √
?p ex:interested_in ?b √
?b rdf:type <http://.../Book> √
INSTANCE SEED RULE
Avoid a seed triple pattern with vocabulary terms
● Patterns to avoid:
✗ ?s ex:any_property ?o
✗ ?s rdf:type ex:any_class
● Rationale: URIs for vocabulary terms usually resolve to
vocabulary definitions with little instance data
Query (√ marks the preferred seed triple pattern):
?p ex:affiliated_with <http://.../orgaX> √
?p ex:interested_in ?b
?b rdf:type <http://.../Book>
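A rough Python sketch of how the SEED RULE and the INSTANCE SEED RULE could classify triple patterns follows. It rests on simplifying assumptions that are not part of the slides: every non-variable term is treated as an HTTP URI (no literals), a URI in subject position is treated as instance data, and all function names are made up for this example.

```python
from typing import List, Tuple

TriplePattern = Tuple[str, str, str]


def is_variable(term: str) -> bool:
    return term.startswith("?")


def is_seed_candidate(tp: TriplePattern) -> bool:
    """SEED RULE: the pattern mentions at least one URI (assumed: any non-variable)."""
    return any(not is_variable(term) for term in tp)


def is_instance_seed_candidate(tp: TriplePattern) -> bool:
    """INSTANCE SEED RULE: reject patterns whose only URIs are vocabulary terms."""
    subject, predicate, obj = tp
    if not is_variable(subject):
        return True   # assumed: a subject URI identifies instance data
    if is_variable(obj):
        return False  # shape "?s ex:any_property ?o"
    return predicate != "rdf:type"  # shape "?s rdf:type ex:any_class" is rejected


query: List[TriplePattern] = [
    ("?p", "ex:affiliated_with", "<http://.../orgaX>"),
    ("?p", "ex:interested_in", "?b"),
    ("?b", "rdf:type", "<http://.../Book>"),
]

print([is_seed_candidate(tp) for tp in query])
print([is_instance_seed_candidate(tp) for tp in query])
```

Under these assumptions all three patterns qualify as seed triple patterns, but only the first one survives the INSTANCE SEED RULE.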
FILTER RULE
Use a plan where all filtering triple patterns are
as close to the first triple pattern as possible
● Filtering triple pattern: each variable already occurs in one
of the preceding triple patterns
● For each valuation consumed as input, a filtering triple pattern
can report only 1 or 0 valuations as output
● Rationale: Reduce cost
Example (tp3 is a filtering triple pattern):
tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX> ) I1
{ ?p = <http://.../alice> }
tp2 = ( ?p , ex:interested_in , ?b ) I2
tp2' = ( <http://.../alice> , ex:interested_in , ?b )
{ ?p = <http://.../alice> , ?b = <http://.../b1> }
tp3 = ( ?b , rdf:type , <http://.../Book> ) I3
tp3' = ( <http://.../b1> , rdf:type , <http://.../Book> )
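To make the notion of a filtering triple pattern concrete, the following Python sketch lists the positions of filtering triple patterns in a plan. The tuple representation, the function names, and the fourth pattern with ex:knows are hypothetical additions for this illustration only.

```python
from typing import List, Set, Tuple

TriplePattern = Tuple[str, str, str]


def variables(tp: TriplePattern) -> Set[str]:
    """Return the query variables (terms starting with '?') of a triple pattern."""
    return {term for term in tp if term.startswith("?")}


def filtering_positions(plan: List[TriplePattern]) -> List[int]:
    """Indexes of patterns whose variables all occur in preceding patterns."""
    seen: Set[str] = set()
    positions: List[int] = []
    for index, tp in enumerate(plan):
        tp_vars = variables(tp)
        if index > 0 and tp_vars <= seen:
            positions.append(index)
        seen |= tp_vars
    return positions


tp1 = ("?p", "ex:affiliated_with", "<http://.../orgaX>")
tp2 = ("?p", "ex:interested_in", "?b")
tp3 = ("?b", "rdf:type", "<http://.../Book>")
tp4 = ("?p", "ex:knows", "?q")  # hypothetical extra pattern, not in the slides

print(filtering_positions([tp1, tp2, tp3]))
print(filtering_positions([tp1, tp2, tp3, tp4]))
print(filtering_positions([tp1, tp2, tp4, tp3]))
```

Of the two four-pattern plans, the FILTER RULE would favor the one in which tp3 comes at index 2 rather than index 3, because the filter then prunes valuations before tp4 is evaluated.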
Link Traversing Iterators May Block!
Pipeline of iterators (each iterator requests its input via “Next?”):
tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX> ) I1
tp2 = ( ?p , ex:interested_in , ?b ) I2
tp3 = ( ?b , rdf:type , <http://.../Book> ) I3
● I2 receives the input { ?p = <http://.../alice> } and obtains
tp2' = ( <http://.../alice> , ex:interested_in , ?b )
● Before tp2' can be matched against the query-local dataset, I2 must
initiate look-up(s) and wait
Prefetching of URIs [HBF09]
tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX> ) I1
tp2 = ( ?p , ex:interested_in , ?b ) I2
tp3 = ( ?b , rdf:type , <http://.../Book> ) I3
● Input of I2: { ?p = <http://.../alice> }, which yields
tp2' = ( <http://.../alice> , ex:interested_in , ?b )
● Without prefetching: initiate look-up(s) and wait
● With prefetching: initiate look-up in the background; later, only
wait until the look-up is finished (i.e., ensure that it is finished)
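The difference between "initiate and wait" and "initiate in the background, ensure later" can be sketched with Python's standard thread pool. This is a minimal illustration, not the implementation from [HBF09]: the `LookupManager` class, the `fetch` stand-in (which only sleeps instead of issuing an HTTP request), and the example URI are assumptions.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List


def fetch(uri: str) -> List[str]:
    """Hypothetical stand-in for an HTTP look-up; returns fake triples."""
    time.sleep(0.05)  # simulate network latency
    return [f"triple from {uri}"]


class LookupManager:
    """Tracks look-ups so that each URI is retrieved at most once."""

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: Dict[str, Future] = {}

    def prefetch(self, uri: str) -> None:
        """Initiate the look-up in the background unless it was already started."""
        if uri not in self._futures:
            self._futures[uri] = self._pool.submit(fetch, uri)

    def ensure(self, uri: str) -> List[str]:
        """Block until the look-up of the URI is finished and return its data."""
        self.prefetch(uri)
        return self._futures[uri].result()

    def shutdown(self) -> None:
        self._pool.shutdown()


manager = LookupManager()
# Start the look-up as soon as the URI is known ...
manager.prefetch("http://example.org/alice")
# ... and only wait for it when its data is actually needed.
data = manager.ensure("http://example.org/alice")
manager.shutdown()
print(data)
```

If other work happens between `prefetch` and `ensure`, the look-up latency overlaps with that work, so the later call may return without blocking at all.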
Postponing Iterator [HBF09]
● Idea: temporarily reject an input solution
if processing it would cause blocking
● Enabled by an extension of the iterator paradigm:
● New function POSTPONE: treat the element most recently reported by
GETNEXT as if it has not yet been reported (i.e., “take back” this element)
● Adjusted GETNEXT: either return a (new) next element or
return a formerly postponed element
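The extended iterator interface could look roughly like the following Python sketch. It is not the implementation from [HBF09]: the class name, the use of None as an end-of-input marker, and the policy of preferring new elements over postponed ones are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Generic, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


class PostponingIterator(Generic[T]):
    """Iterator with GETNEXT and POSTPONE; None signals that no element is left."""

    def __init__(self, source: Iterable[T]) -> None:
        self._source: Iterator[T] = iter(source)
        self._postponed: Deque[T] = deque()
        self._last: Optional[T] = None
        self._has_last = False

    def get_next(self) -> Optional[T]:
        """Return a new element if one exists, otherwise a formerly postponed one."""
        try:
            self._last = next(self._source)
        except StopIteration:
            if not self._postponed:
                self._has_last = False
                return None
            self._last = self._postponed.popleft()
        self._has_last = True
        return self._last

    def postpone(self) -> None:
        """Take back the element most recently reported by get_next."""
        if not self._has_last:
            raise RuntimeError("no element to postpone")
        self._postponed.append(self._last)
        self._has_last = False


solutions = PostponingIterator(["s1", "s2", "s3"])
first = solutions.get_next()  # "s1"; suppose processing it would block
solutions.postpone()          # so it is taken back for later
order = [solutions.get_next() for _ in range(4)]
print(first, order)
```

In this run "s1" is reported again only after "s2" and "s3", which gives a pending look-up for "s1" time to complete in the background.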
General Idea of Source Ranking
Rank the URIs resulting from source selection
such that
the ranking represents a priority for lookup
● Possible objectives:
● Report first solutions as early as possible
● Minimize time for computing the first k solutions
● Maximize the number of solutions computed in a
given amount of time
Harth et al. [HHK+10, UHK+11]
For any URI u (selected by the QTree-based approach), let:
rank(u) := estimated number of solutions that u contributes to
● For triple patterns this number is directly available:
● Recall, each QTree bucket stores a set of (URI,count)-pairs
● All query-relevant buckets are known after source selection
[Figure: example QTree with nodes Root, A, B, C, A1, A2, B1, B2]
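For the single-triple-pattern case, a ranking could be derived from the (URI,count)-pairs roughly as sketched below. This is an assumption-laden illustration: it simply sums the stored counts over all query-relevant buckets, whereas the approach in [HHK+10] may weight counts differently (for example by how much a bucket overlaps the query region). The bucket contents and example URIs are invented.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed representation: a bucket maps each URI to its stored count.
Bucket = Dict[str, int]


def rank_for_triple_pattern(relevant_buckets: List[Bucket]) -> Dict[str, int]:
    """Sum the counts of every URI over all query-relevant buckets."""
    ranks: Dict[str, int] = defaultdict(int)
    for bucket in relevant_buckets:
        for uri, count in bucket.items():
            ranks[uri] += count
    return dict(ranks)


def lookup_order(ranks: Dict[str, int]) -> List[str]:
    """Order URIs by descending rank; ties are broken alphabetically."""
    return sorted(ranks, key=lambda uri: (-ranks[uri], uri))


# Hypothetical contents of two query-relevant buckets.
bucket_a1 = {"http://example.org/src1": 5, "http://example.org/src2": 1}
bucket_b2 = {"http://example.org/src2": 3, "http://example.org/src3": 2}

ranks = rank_for_triple_pattern([bucket_a1, bucket_b2])
print(lookup_order(ranks))
```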
● For BGPs, estimate the number recursively:
● Recursively determine regions of join-able data
(based on overlapping QTree buckets for each triple pattern)
● For each of these regions, recursively estimate number of
triples the URI contributes to the region
● Factor in the estimated join result cardinality of these regions
(estimated based on overlap between contributing buckets)
Ladwig and Tran [LT10]
● Multiple scores
● Triple pattern cardinality
● Triple frequency – inverse source frequency (TF–ISF)
● (URI-specific) join pattern cardinality
● Incoming links
● Assumption: pre-populated index that stores triple pattern
cardinalities and join pattern cardinalities for each URI
● Aggregation of the scores to obtain ranks
● For indexed URIs: weighted summation of all scores
● For non-indexed URIs: weighting of (currently known) in-links
● Ranking is refined at run-time
Metric: Triple Pattern Cardinality [LT10]
● Rationale: data that contains many matching triples
is likely to contribute to many solutions
● Requirement: pre-populated index that stores the cardinalities
● Caveat: some triple patterns have a high
cardinality for almost all URIs
● Example: (?x, rdf:type, ?y)
● These patterns do not discriminate URIs
For a selected URI u and a triple pattern tp (from the query), let:
card(u, tp) := number of triples in the data of u that match tp
Metric: TF–ISF [LT10]
● Idea: adapt the TF-IDF concept to weight triple patterns
● Triple Frequency – Inverse Source Frequency (TF–ISF)
● Rationale:
● Importance positively correlates to the number of matching
triples that occur in the data for a URI
● Importance negatively correlates to how often matching
triples occur for all known URIs (i.e., all indexed URIs)
For a selected URI u, a triple pattern tp, and the set of all known
URIs U_known, let:
\[ \mathrm{tf.isf}(u, tp) := \mathrm{card}(u, tp) \cdot \log\left( \frac{|U_{known}|}{|\{\, r \in U_{known} \mid \mathrm{card}(r, tp) > 0 \,\}|} \right) \]
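A small Python sketch of this score follows. The nested-dictionary index, the short pattern labels, and the choice of the natural logarithm are assumptions for illustration; the slides do not fix a log base or an index layout.

```python
import math
from typing import Dict

# Assumed index layout: card_index[uri][pattern] = card(uri, pattern).
CardIndex = Dict[str, Dict[str, int]]


def tf_isf(u: str, tp: str, card_index: CardIndex) -> float:
    """card(u, tp) * log(|U_known| / number of known URIs with a match for tp)."""
    sources_with_match = sum(
        1 for r in card_index if card_index[r].get(tp, 0) > 0
    )
    if sources_with_match == 0:
        return 0.0  # no known URI matches tp, so it cannot discriminate
    return card_index[u].get(tp, 0) * math.log(len(card_index) / sources_with_match)


# Hypothetical index over four known URIs and two triple patterns.
index: CardIndex = {
    "u1": {"type": 10, "affiliated": 3},
    "u2": {"type": 7},
    "u3": {"type": 2},
    "u4": {"type": 9},
}

print(tf_isf("u1", "type", index))
print(tf_isf("u1", "affiliated", index))
```

The "type" pattern matches in every known URI, so its score is zero despite the high cardinality, while the rarer "affiliated" pattern receives a positive score. This mirrors the caveat about patterns such as (?x, rdf:type, ?y).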
Metric: Join Pattern Cardinality [LT10]
● Rationale: data that matches pairs of (joined) triple patterns
is highly relevant, because it matches a larger
part of the query
● Requirement: these join cardinalities are also pre-computed
and stored in a pre-populated index
For a selected URI u, two triple patterns tpi and tpj, and a
query variable v, let:
card(u, tpi , tpj , v) := number of solutions produced
by joining tpi and tpj on variable v
using only the data from u
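To illustrate the definition, the following Python sketch computes such a join cardinality directly from a small set of triples. In [LT10] these values are pre-computed and read from an index; computing them on the fly here, as well as the data, the abbreviated term names, and the function names, are assumptions for this example. The sketch also assumes that v occurs in both triple patterns.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def match(tp: Triple, triple: Triple) -> Optional[Dict[str, str]]:
    """Return the variable binding if the triple matches the pattern, else None."""
    binding: Dict[str, str] = {}
    for pattern_term, data_term in zip(tp, triple):
        if pattern_term.startswith("?"):
            if binding.setdefault(pattern_term, data_term) != data_term:
                return None  # same variable bound to two different terms
        elif pattern_term != data_term:
            return None
    return binding


def matches(tp: Triple, data: List[Triple]) -> List[Dict[str, str]]:
    """All bindings obtained by matching the pattern against the data."""
    found = []
    for triple in data:
        binding = match(tp, triple)
        if binding is not None:
            found.append(binding)
    return found


def join_cardinality(data: List[Triple], tp_i: Triple, tp_j: Triple, v: str) -> int:
    """Number of binding pairs from tp_i and tp_j that agree on variable v."""
    left = matches(tp_i, data)
    right = matches(tp_j, data)
    return sum(1 for l in left for r in right if l[v] == r[v])


# Hypothetical data retrieved from a single URI u.
data_of_u: List[Triple] = [
    ("alice", "ex:interested_in", "b1"),
    ("alice", "ex:interested_in", "b2"),
    ("b1", "rdf:type", "Book"),
    ("b2", "rdf:type", "Book"),
    ("b3", "rdf:type", "Book"),
]

result = join_cardinality(
    data_of_u,
    ("?p", "ex:interested_in", "?b"),
    ("?b", "rdf:type", "Book"),
    "?b",
)
print(result)
```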
Refinement at Run-Time [LT10]
● During query execution, additional information becomes available:
(1) intermediate join results, (2) more incoming links
● Use it to adjust scores & ranking (for integrated execution)
● Re-estimate join pattern cardinalities based on samples of
intermediate results (available from the hash tables of the
symmetric hash join (SHJ) operators)
● Parameters for influencing behavior of ranking process:
● Invalid score threshold: re-rank when the number of URIs
with invalid scores passes this threshold
● Sample size: larger samples give better estimates, but make
the process more costly
● Re-sampling threshold: reuse cached estimates unless the
hash table of join operators grows past this threshold
Tutorial Outline
(1) Introduction
(2) Theoretical Foundations
(3) Source Selection Strategies
(4) Execution Process
(5) Query Planning and Optimization
… Thanks!
These slides have been created by
Olaf Hartig
for the
WWW 2013 tutorial on
Linked Data Query Processing
Tutorial Website: http://db.uwaterloo.ca/LDQTut2013/
This work is licensed under a
Creative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)
(Some of the slides in this slide set have been inspired by
slides from Günter Ladwig [LT10] – Thanks!)
(Slides 24 - 26, 33, and 34 are inspired by slides
from Günter Ladwig [LT10] – Thanks!)
Backup Slides
Metric: Links to Results [LT10]
● Rationale: a URI is more relevant if data from
many relevant URIs mention it
● Links are only discovered at run-time
The “links to results” of a selected URI u is defined by:
\[ \mathrm{links}(u) := \{\, l \in \mathrm{links}(u, u_{processed}) \mid u_{processed} \in U_{processed} \,\} \]
where U_processed is the set of URIs whose data has already been
processed and links( u1 , u2 ) are the links to URI u1 mentioned
in the data from URI u2.
Metric: Retrieval Cost [LT10]
● Rationale: URIs are more relevant the faster their data can
be retrieved
● Size is available in the pre-populated index
● Bandwidth for any particular host can be approximated
based on past experience or average performance
recorded during the query execution process
The retrieval cost of a selected URI u is defined by:
cost(u) := Agg( size(u) , bandwidth(u) )
where size(u) is the size of the data from u, and bandwidth(u) is the
bandwidth of the Web server that hosts u.

More Related Content

What's hot (20)

PPTX
A Workshop on R
Ajay Ohri
 
PPTX
Predicting the relevance of search results for e-commerce systems
Universiti Technologi Malaysia (UTM)
 
PPTX
Hacktoberfest 2020 - Intro to Knowledge Graphs
ArangoDB Database
 
PPTX
Querying the Web of Data
Rinke Hoekstra
 
PDF
ParlBench: a SPARQL-benchmark for electronic publishing applications.
Tatiana Tarasova
 
PPT
Benchmarking graph databases on the problem of community detection
Symeon Papadopoulos
 
PDF
OQGraph at MySQL Users Conference 2011
Antony T Curtis
 
PDF
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
PDF
How to use Parquet as a basis for ETL and analytics
Julien Le Dem
 
PDF
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Julien Le Dem
 
PDF
Data science at the command line
Sharat Chikkerur
 
PPT
Mapreduce in Search
Amund Tveit
 
PDF
Multimodal Features for Search and Hyperlinking of Video Content
Petra Galuscakova
 
PDF
Rdf conjunctive query selectivity estimation
INRIA-OAK
 
PDF
Graph Analytics with ArangoDB
ArangoDB Database
 
PDF
Data Wrangling and Visualization Using Python
MOHITKUMAR1379
 
PDF
Introduction to data analysis using R
Victoria López
 
PPTX
LD4KD 2015 - Demos and tools
Vrije Universiteit Amsterdam
 
PPT
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Research Data Alliance
 
PDF
Optimization Techniques
Joud Khattab
 
A Workshop on R
Ajay Ohri
 
Predicting the relevance of search results for e-commerce systems
Universiti Technologi Malaysia (UTM)
 
Hacktoberfest 2020 - Intro to Knowledge Graphs
ArangoDB Database
 
Querying the Web of Data
Rinke Hoekstra
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
Tatiana Tarasova
 
Benchmarking graph databases on the problem of community detection
Symeon Papadopoulos
 
OQGraph at MySQL Users Conference 2011
Antony T Curtis
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
How to use Parquet as a basis for ETL and analytics
Julien Le Dem
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Julien Le Dem
 
Data science at the command line
Sharat Chikkerur
 
Mapreduce in Search
Amund Tveit
 
Multimodal Features for Search and Hyperlinking of Video Content
Petra Galuscakova
 
Rdf conjunctive query selectivity estimation
INRIA-OAK
 
Graph Analytics with ArangoDB
ArangoDB Database
 
Data Wrangling and Visualization Using Python
MOHITKUMAR1379
 
Introduction to data analysis using R
Victoria López
 
LD4KD 2015 - Demos and tools
Vrije Universiteit Amsterdam
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Research Data Alliance
 
Optimization Techniques
Joud Khattab
 

Similar to Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimization" (WWW 2013 Ed.) (20)

PDF
Tutorial "Linked Data Query Processing" Part 2 "Theoretical Foundations" (WWW...
Olaf Hartig
 
PPTX
Strategies for Processing and Explaining Distributed Queries on Linked Data
Rakebul Hasan
 
PPTX
Why do they call it Linked Data when they want to say...?
Oscar Corcho
 
PDF
Keyword-Based Navigation and Search over the Linked Data Web
Luca Matteis
 
PPT
Friday talk 11.02.2011
Jürgen Umbrich
 
PDF
Linked Data Fragments
Ruben Verborgh
 
PDF
LDQL: A Query Language for the Web of Linked Data
Olaf Hartig
 
PPTX
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Sören Auer
 
PDF
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...
Olaf Hartig
 
PDF
How Caching Improves Efficiency and Result Completeness for Querying Linked Data
Olaf Hartig
 
PPTX
Consuming Linked Data 4/5 Semtech2011
Juan Sequeda
 
PDF
Sustainable queryable access to Linked Data
Ruben Verborgh
 
PPS
How web searching engines work
VNIT-ACM Student Chapter
 
PDF
dexa08linli
Hiroshi Ono
 
PDF
Web of Data Usage Mining
Markus Luczak-Rösch
 
PPTX
A Machine Learning Approach to SPARQL Query Performance Prediction
Rakebul Hasan
 
PDF
inteSearch: An Intelligent Linked Data Information Access Framework
National Inistitute of Informatics (NII), Tokyo, Japann
 
PDF
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Stefan Dietze
 
PDF
Linked Data 1st Edition David Wood Marsha Zaidman Luke Ruth Michael Hausenblas
juradorurua
 
PDF
Introduction to Linked Data - Part 1
Itza Carbajal
 
Tutorial "Linked Data Query Processing" Part 2 "Theoretical Foundations" (WWW...
Olaf Hartig
 
Strategies for Processing and Explaining Distributed Queries on Linked Data
Rakebul Hasan
 
Why do they call it Linked Data when they want to say...?
Oscar Corcho
 
Keyword-Based Navigation and Search over the Linked Data Web
Luca Matteis
 
Friday talk 11.02.2011
Jürgen Umbrich
 
Linked Data Fragments
Ruben Verborgh
 
LDQL: A Query Language for the Web of Linked Data
Olaf Hartig
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Sören Auer
 
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 3 (...
Olaf Hartig
 
How Caching Improves Efficiency and Result Completeness for Querying Linked Data
Olaf Hartig
 
Consuming Linked Data 4/5 Semtech2011
Juan Sequeda
 
Sustainable queryable access to Linked Data
Ruben Verborgh
 
How web searching engines work
VNIT-ACM Student Chapter
 
dexa08linli
Hiroshi Ono
 
Web of Data Usage Mining
Markus Luczak-Rösch
 
A Machine Learning Approach to SPARQL Query Performance Prediction
Rakebul Hasan
 
inteSearch: An Intelligent Linked Data Information Access Framework
National Inistitute of Informatics (NII), Tokyo, Japann
 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Stefan Dietze
 
Linked Data 1st Edition David Wood Marsha Zaidman Luke Ruth Michael Hausenblas
juradorurua
 
Introduction to Linked Data - Part 1
Itza Carbajal
 
Ad

More from Olaf Hartig (20)

PDF
A Context-Based Semantics for SPARQL Property Paths over the Web
Olaf Hartig
 
PDF
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 2 (...
Olaf Hartig
 
PDF
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 1 (...
Olaf Hartig
 
ODP
An Overview on PROV-AQ: Provenance Access and Query
Olaf Hartig
 
PDF
(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)
Olaf Hartig
 
PDF
Zero-Knowledge Query Planning for an Iterator Implementation of Link Traversa...
Olaf Hartig
 
PDF
A Main Memory Index Structure to Query Linked Data
Olaf Hartig
 
PDF
Towards a Data-Centric Notion of Trust in the Semantic Web (A Position Statem...
Olaf Hartig
 
PDF
Brief Introduction to the Provenance Vocabulary (for W3C prov-xg)
Olaf Hartig
 
PDF
Querying Linked Data with SPARQL (2010)
Olaf Hartig
 
PDF
Answers to usual issues in getting started with consuming Linked Data (2010)
Olaf Hartig
 
PDF
Linked Data on the Web
Olaf Hartig
 
PDF
Executing SPARQL Queries of the Web of Linked Data
Olaf Hartig
 
PDF
Using Web Data Provenance for Quality Assessment
Olaf Hartig
 
PDF
Answers to usual issues in getting started with consuming Linked Data
Olaf Hartig
 
PDF
Querying Linked Data with SPARQL
Olaf Hartig
 
PDF
Querying Trust in RDF Data with tSPARQL
Olaf Hartig
 
PDF
Database Researchers Map
Olaf Hartig
 
PDF
Provenance Information in the Web of Data
Olaf Hartig
 
PDF
The SPARQL Query Graph Model for Query Optimization
Olaf Hartig
 
A Context-Based Semantics for SPARQL Property Paths over the Web
Olaf Hartig
 
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 2 (...
Olaf Hartig
 
Tutorial "An Introduction to SPARQL and Queries over Linked Data" Chapter 1 (...
Olaf Hartig
 
An Overview on PROV-AQ: Provenance Access and Query
Olaf Hartig
 
(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)
Olaf Hartig
 
Zero-Knowledge Query Planning for an Iterator Implementation of Link Traversa...
Olaf Hartig
 
A Main Memory Index Structure to Query Linked Data
Olaf Hartig
 
Towards a Data-Centric Notion of Trust in the Semantic Web (A Position Statem...
Olaf Hartig
 
Brief Introduction to the Provenance Vocabulary (for W3C prov-xg)
Olaf Hartig
 
Querying Linked Data with SPARQL (2010)
Olaf Hartig
 
Answers to usual issues in getting started with consuming Linked Data (2010)
Olaf Hartig
 
Linked Data on the Web
Olaf Hartig
 
Executing SPARQL Queries of the Web of Linked Data
Olaf Hartig
 
Using Web Data Provenance for Quality Assessment
Olaf Hartig
 
Answers to usual issues in getting started with consuming Linked Data
Olaf Hartig
 
Querying Linked Data with SPARQL
Olaf Hartig
 
Querying Trust in RDF Data with tSPARQL
Olaf Hartig
 
Database Researchers Map
Olaf Hartig
 
Provenance Information in the Web of Data
Olaf Hartig
 
The SPARQL Query Graph Model for Query Optimization
Olaf Hartig
 
Ad

Recently uploaded (20)

PDF
How do you fast track Agentic automation use cases discovery?
DianaGray10
 
PDF
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
PDF
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
PDF
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PPTX
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
PPTX
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PDF
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PDF
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
DOCX
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
PDF
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
How do you fast track Agentic automation use cases discovery?
DianaGray10
 
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 

Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimization" (WWW 2013 Ed.)

  • 1. Linked Data Query Processing Tutorial at the 22nd International World Wide Web Conference (WWW 2013) May 14, 2013 http://db.uwaterloo.ca/LDQTut2013/ 5. Query Planning and Optimization Olaf Hartig University of Waterloo
  • 2. WWW 2013 Tutorial on Linked Data Query Processing [ Query Planning and Optimization ] 2 Query Plan Selection ● Possible assessment criteria: ● Benefit (size of computed query result) ● Cost (overall query execution time) ● Response time (time for returning k solutions) ● To select from candidate plans, criteria must be estimated ● For index-based source selection: estimation may be based on information recorded in the index [HHK+10] ● For (pure) live exploration: estimation impossible ● No a-priori information available ● Use heuristics instead
  • 3. WWW 2013 Tutorial on Linked Data Query Processing [ Query Planning and Optimization ] 3 Outline  Heuristics-Based Planning  Optimizing Link Traversing Iterators ➢ Prefetching ➢ Postponing  Source Ranking ➢ Harth et al. [HHK+10, UHK+11] ➢ Ladwig and Tran [LT10]
  • 4. WWW 2013 Tutorial on Linked Data Query Processing [ Query Planning and Optimization ] 4 Heuristics-Based Plan Selection [Har11a] ● Four rules: ● DEPENDENCY RULE ● SEED RULE ● INSTANCE SEED RULE ● FILTER RULE ● Tailored to LTBQE implemented by link traversing iterators ● Assumptions about queries: ● Query pattern refers to instance data ● URIs mentioned in the query pattern are the seed URIs
  • 5. WWW 2013 Tutorial on Linked Data Query Processing [ Query Planning and Optimization ] 5 ?p ex:affiliated_with <http://.../orgaX> ?p ex:interested_in ?b ?b rdf:type <http://.../Book> Query DEPENDENCY RULE ● Dependency: a variable from each triple pattern already occurs in one of the preceding triple patterns tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX>) I1 tp2 = ( ?p , ex:interested_in , ?b ) I2 tp3 = ( ?b , rdf:type , <http://.../Book> ) I3 Use a dependency respecting query plan √
  • 6. WWW 2013 Tutorial on Linked Data Query Processing [ Query Planning and Optimization ] 6 ?p ex:affiliated_with <http://.../orgaX> ?p ex:interested_in ?b ?b rdf:type <http://.../Book> Query DEPENDENCY RULE ● Dependency: a variable from each triple pattern already occurs in one of the preceding triple patterns tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX>) I1 tp2 = ( ?p , ex:interested_in , ?b ) I2 tp3 = ( ?b , rdf:type , <http://.../Book> ) I3 Use a dependency respecting query plan
  • 7. WWW 2013 Tutorial on Linked Data Query Processing [ Query Planning and Optimization ] 7 ?p ex:affiliated_with <http://.../orgaX> ?p ex:interested_in ?b ?b rdf:type <http://.../Book> Query DEPENDENCY RULE ● Dependency: a variable from each triple pattern already occurs in one of the preceding triple patterns ● Rationale: Avoid cartesian products tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX>) I1 tp2 = ( ?b , rdf:type , <http://.../Book> ) I2 tp3 = ( ?p , ex:interested_in , ?b ) I3 Use a dependency respecting query plan
  • 8. WWW 2013 Tutorial on Linked Data Query Processing [ Query Planning and Optimization ] 8 Recall assumption: seed URIs = URIs in the query SEED RULE ● Seed triple pattern of a plan … is the first triple pattern in the plan, and … contains at least one HTTP URI ● Rationale: Good starting point Use a plan with a seed triple pattern ?p ex:affiliated_with <http://.../orgaX> ?p ex:interested_in ?b ?b rdf:type <http://.../Book> Query √ √ √
  • 9. WWW 2013 Tutorial on Linked Data Query Processing [ Query Planning and Optimization ] 9 INSTANCE SEED RULE ● Patterns to avoid: ✗ ?s ex:any_property ?o ✗ ?s rdf:type ex:any_class ● Rationale: URIs for vocabulary terms usually resolve to vocabulary definitions with little instance data Avoid a seed triple pattern with vocabulary terms ?p ex:affiliated_with <http://.../orgaX> ?p ex:interested_in ?b ?b rdf:type <http://.../Book> Query √
  • 10. FILTER RULE: Use a plan where all filtering triple patterns are as close to the first triple pattern as possible. ● Filtering triple pattern: each of its variables already occurs in one of the preceding triple patterns ● For each valuation consumed as input, a filtering triple pattern can report only 1 or 0 valuations as output ● Rationale: reduce cost ● Example: tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX> ) [I1]; tp2 = ( ?p , ex:interested_in , ?b ) [I2] receives { ?p = <http://.../alice> } and evaluates tp2' = ( <http://.../alice> , ex:interested_in , ?b ); tp3 = ( ?b , rdf:type , <http://.../Book> ) [I3] receives { ?p = <http://.../alice> , ?b = <http://.../b1> } and evaluates tp3' = ( <http://.../b1> , rdf:type , <http://.../Book> ), which contains no variable, so tp3 is a filtering triple pattern
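To see which positions of a plan hold filtering triple patterns, one can track the variables bound so far. The sketch below assumes the same tuple representation as the slides' examples; `filtering_positions` is an illustrative name, not part of [Har11a].

```python
from typing import List, Set, Tuple

TriplePattern = Tuple[str, str, str]  # (subject, predicate, object)

def variables(tp: TriplePattern) -> Set[str]:
    return {term for term in tp if term.startswith("?")}

def filtering_positions(plan: List[TriplePattern]) -> List[int]:
    """Return the 0-based positions of filtering triple patterns in a plan.

    A pattern is filtering if all of its variables occur in preceding patterns.
    """
    seen: Set[str] = set()
    positions: List[int] = []
    for position, tp in enumerate(plan):
        if position > 0 and variables(tp) <= seen:
            positions.append(position)
        seen |= variables(tp)
    return positions

plan = [
    ("?p", "ex:affiliated_with", "<http://.../orgaX>"),
    ("?p", "ex:interested_in", "?b"),
    ("?b", "rdf:type", "<http://.../Book>"),
]
print(filtering_positions(plan))  # [2]
```

A planner following the FILTER RULE would prefer, among the dependency-respecting plans, those in which these positions are as small as possible.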
  • 11. Outline: (1) Heuristics-Based Planning √ (2) Optimizing Link Traversing Iterators ➢ Prefetching ➢ Postponing (3) Source Ranking ➢ Harth et al. [HHK+10, UHK+11] ➢ Ladwig and Tran [LT10]
  • 13. Link Traversing Iterators May Block! ● Plan: tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX> ) [I1], tp2 = ( ?p , ex:interested_in , ?b ) [I2], tp3 = ( ?b , rdf:type , <http://.../Book> ) [I3]; each iterator asks its predecessor for the next solution (“Next?”) ● I2 receives { ?p = <http://.../alice> } and has to evaluate tp2' = ( <http://.../alice> , ex:interested_in , ?b ) over the query-local dataset ● Before it can do so, it must initiate the look-up(s) and wait
  • 15. Prefetching of URIs [HBF09] ● Plan: tp1 = ( ?p , ex:affiliated_with , <http://.../orgaX> ) [I1], tp2 = ( ?p , ex:interested_in , ?b ) [I2], tp3 = ( ?b , rdf:type , <http://.../Book> ) [I3]; I2 receives { ?p = <http://.../alice> } and evaluates tp2' = ( <http://.../alice> , ex:interested_in , ?b ) over the query-local dataset ● Instead of “initiate look-up(s) and wait”, the work is split into two steps: initiate the look-up in the background, and later only ensure that the look-up is finished, waiting only if it is not
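The split into "initiate in the background" and "ensure finished" maps naturally onto futures. The following is only a sketch of that idea, not the implementation from [HBF09]: the class `PrefetchingLookups`, the `fetch` callback, and the in-memory `FAKE_WEB` are all hypothetical stand-ins for real HTTP look-ups, and `prefetch` is assumed to be called from a single thread.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]

class PrefetchingLookups:
    """Starts URI look-ups in the background and blocks only when data is needed."""

    def __init__(self, fetch: Callable[[str], List[Triple]], workers: int = 4) -> None:
        self._fetch = fetch
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending: Dict[str, Future] = {}

    def prefetch(self, uri: str) -> None:
        """Initiate the look-up in the background (no-op if already initiated)."""
        if uri not in self._pending:
            self._pending[uri] = self._pool.submit(self._fetch, uri)

    def ensure_finished(self, uri: str) -> List[Triple]:
        """Return the data for uri, waiting only if the look-up is still running."""
        self.prefetch(uri)
        return self._pending[uri].result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)

# Hypothetical stand-in for the Web: URI -> triples obtained by looking it up.
ALICE = "<http://.../alice>"
FAKE_WEB = {ALICE: [(ALICE, "ex:interested_in", "<http://.../b1>")]}
calls: List[str] = []

def fake_fetch(uri: str) -> List[Triple]:
    calls.append(uri)
    return FAKE_WEB.get(uri, [])

lookups = PrefetchingLookups(fake_fetch)
lookups.prefetch(ALICE)                 # initiate the look-up in the background
data = lookups.ensure_finished(ALICE)   # ensure the look-up is finished
lookups.close()
print(data, calls)
```

Because `ensure_finished` reuses the pending future, each URI is fetched at most once even if several iterators ask for it.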
  • 17. Postponing Iterator [HBF09] ● Idea: temporarily reject an input solution if processing it would cause blocking ● Enabled by an extension of the iterator paradigm: ➢ New function POSTPONE: treat the element most recently reported by GETNEXT as if it had not yet been reported (i.e., “take back” this element) ➢ Adjusted GETNEXT: return either a (new) next element or a formerly postponed element
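The extended iterator interface can be sketched as a small wrapper class. This is an illustration under stated assumptions rather than the design from [HBF09]: the policy of preferring new elements and falling back to postponed ones only when the input is exhausted is a choice made here (the slide only says "either"), and `None` is used to signal exhaustion, which assumes solutions are never `None`.

```python
from collections import deque
from typing import Any, Deque, Iterable, Iterator, Optional

_NOTHING = object()  # marks "nothing reported yet" or "already taken back"

class PostponingIterator:
    """Iterator with GETNEXT and POSTPONE, wrapping any iterable of solutions."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source: Iterator[Any] = iter(source)
        self._postponed: Deque[Any] = deque()
        self._last: Any = _NOTHING

    def get_next(self) -> Optional[Any]:
        """Return a new element, else a formerly postponed one, else None."""
        try:
            self._last = next(self._source)
        except StopIteration:
            if not self._postponed:
                self._last = _NOTHING
                return None
            self._last = self._postponed.popleft()
        return self._last

    def postpone(self) -> None:
        """Take back the element most recently reported by get_next."""
        if self._last is _NOTHING:
            raise RuntimeError("no element to postpone")
        self._postponed.append(self._last)
        self._last = _NOTHING

it = PostponingIterator(["s1", "s2", "s3"])
first = it.get_next()  # "s1": suppose processing it would block
it.postpone()          # so it is taken back for now
order = []
while True:
    solution = it.get_next()
    if solution is None:
        break
    order.append(solution)
print(order)  # ['s2', 's3', 's1']
```

Once only postponed elements remain, the consumer has no alternative left and would have to wait for the pending look-up after all.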
  • 18. Outline: (1) Heuristics-Based Planning √ (2) Optimizing Link Traversing Iterators ➢ Prefetching ➢ Postponing √ (3) Source Ranking ➢ Harth et al. [HHK+10, UHK+11] ➢ Ladwig and Tran [LT10]
  • 19. General Idea of Source Ranking: Rank the URIs resulting from source selection such that the ranking represents a priority for look-up. ● Possible objectives: ➢ Report first solutions as early as possible ➢ Minimize time for computing the first k solutions ➢ Maximize the number of solutions computed in a given amount of time
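Treating the ranking as a look-up priority amounts to processing URIs from a priority queue. A minimal sketch, assuming the scores are already given as a dictionary (the example.org URIs and score values are made up for illustration):

```python
import heapq
from typing import Dict, List

def lookup_order(ranks: Dict[str, float]) -> List[str]:
    """Return URIs in look-up order: highest rank first, ties broken by URI."""
    heap = [(-score, uri) for uri, score in ranks.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

ranks = {
    "http://example.org/a": 2.0,
    "http://example.org/b": 5.0,
    "http://example.org/c": 2.0,
}
print(lookup_order(ranks))
# ['http://example.org/b', 'http://example.org/a', 'http://example.org/c']
```

A heap rather than a one-off sort is used because new entries can be pushed cheaply when further URIs or updated scores become known during execution.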
  • 21. Harth et al. [HHK+10, UHK+11] ● For any URI u (selected by the QTree-based approach), let rank(u) := estimated number of solutions that u contributes to ● For triple patterns this number is directly available: ➢ Recall that each QTree bucket stores a set of (URI, count) pairs ➢ All query-relevant buckets are known after source selection ● For basic graph patterns (BGPs), estimate the number recursively: ➢ Determine regions of joinable data (based on overlapping QTree buckets for each triple pattern) ➢ For each of these regions, estimate the number of triples the URI contributes to the region ➢ Factor in the estimated join result cardinality of these regions (estimated based on the overlap between contributing buckets)
  • 22. Ladwig and Tran [LT10] ● Multiple scores: ➢ Triple pattern cardinality ➢ Triple frequency – inverse source frequency (TF–ISF) ➢ (URI-specific) join pattern cardinality ➢ Incoming links ● Assumption: pre-populated index that stores triple pattern cardinalities and join pattern cardinalities for each URI ● Aggregation of the scores to obtain ranks: ➢ For indexed URIs: weighted summation of all scores ➢ For non-indexed URIs: weighting of (currently known) in-links ● Ranking is refined at run-time
  • 23. Metric: Triple Pattern Cardinality [LT10] ● For a selected URI u and a triple pattern tp (from the query), let card(u, tp) := number of triples in the data of u that match tp ● Rationale: data that contains many matching triples is likely to contribute to many solutions ● Requirement: pre-populated index that stores the cardinalities ● Caveat: some triple patterns have a high cardinality for almost all URIs ➢ Example: ( ?x , rdf:type , ?y ) ➢ These patterns do not discriminate between URIs
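The definition of card(u, tp) can be made concrete by counting matches over the triples retrieved from one URI. In [LT10] these values come from a pre-populated index; the sketch below instead computes them directly from a small, made-up set of triples, and the helper names are illustrative.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]

def matches(triple: Triple, tp: Triple) -> bool:
    """True if the triple matches the pattern; repeated variables must agree."""
    binding: Dict[str, str] = {}
    for term, pattern_term in zip(triple, tp):
        if pattern_term.startswith("?"):
            if binding.setdefault(pattern_term, term) != term:
                return False
        elif pattern_term != term:
            return False
    return True

def card(data_of_u: Iterable[Triple], tp: Triple) -> int:
    """Number of triples in the data of u that match tp."""
    return sum(1 for triple in data_of_u if matches(triple, tp))

# Made-up data for one URI.
data_of_alice = [
    ("<http://.../alice>", "ex:interested_in", "<http://.../b1>"),
    ("<http://.../alice>", "ex:interested_in", "<http://.../b2>"),
    ("<http://.../alice>", "rdf:type", "ex:Person"),
]
print(card(data_of_alice, ("?p", "ex:interested_in", "?b")))  # 2
print(card(data_of_alice, ("?x", "rdf:type", "?y")))          # 1
```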
  • 24. Metric: TF–ISF [LT10] ● Idea: adapt the TF–IDF concept to weight triple patterns ➢ Triple Frequency – Inverse Source Frequency (TF–ISF) ● Rationale: ➢ Importance correlates positively with the number of matching triples that occur in the data for a URI ➢ Importance correlates negatively with how often matching triples occur for all known URIs (i.e., all indexed URIs) ● For a selected URI u, a triple pattern tp, and the set of all known URIs U_known, let tf.isf(u, tp) := card(u, tp) · log( |U_known| / |{ r ∈ U_known | card(r, tp) > 0 }| )
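A small worked example shows how TF–ISF dampens patterns that match almost everywhere. This is a sketch under assumptions: the cardinalities are made up, the index is a plain nested dictionary, the natural logarithm is used (the slide does not fix the base), and a score of 0 is returned when u has no matching triples.

```python
import math
from typing import Dict

# Hypothetical index: URI -> {triple pattern key -> card(u, tp)}
CardIndex = Dict[str, Dict[str, int]]

def tf_isf(u: str, tp: str, index: CardIndex) -> float:
    """card(u, tp) * log(|U_known| / |{r in U_known : card(r, tp) > 0}|)."""
    cardinality = index[u].get(tp, 0)
    if cardinality == 0:
        return 0.0  # also avoids dividing by zero when no URI matches tp
    sources_with_match = sum(1 for r in index if index[r].get(tp, 0) > 0)
    return cardinality * math.log(len(index) / sources_with_match)

index = {
    "u1": {"(?x, rdf:type, ?y)": 10, "(?p, ex:interested_in, ?b)": 3},
    "u2": {"(?x, rdf:type, ?y)": 8},
    "u3": {"(?x, rdf:type, ?y)": 12},
    "u4": {"(?x, rdf:type, ?y)": 9},
}
print(tf_isf("u1", "(?x, rdf:type, ?y)", index))          # 0.0
print(tf_isf("u1", "(?p, ex:interested_in, ?b)", index))  # 3 * log(4)
```

The rdf:type pattern matches in all four sources, so its weight is log(4/4) = 0 despite the high cardinality, whereas the pattern matched by only one source keeps a positive weight.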
  • 25. Metric: Join Pattern Cardinality [LT10] ● For a selected URI u, two triple patterns tpi and tpj, and a query variable v, let card(u, tpi, tpj, v) := number of solutions produced by joining tpi and tpj on variable v using only the data from u ● Rationale: data that matches pairs of (joined) triple patterns is highly relevant, because it matches a larger part of the query ● Requirement: these join cardinalities are also pre-computed and stored in a pre-populated index
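The join pattern cardinality can be illustrated by joining the solutions of two patterns over the data of a single URI. In [LT10] the values are pre-computed and stored in an index; the sketch below computes one directly from made-up triples and assumes that v occurs in both patterns.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

def solutions(data: Iterable[Triple], tp: Triple) -> List[Dict[str, str]]:
    """All variable bindings under which tp matches a triple in data."""
    result = []
    for triple in data:
        binding: Dict[str, str] = {}
        ok = True
        for term, pattern_term in zip(triple, tp):
            if pattern_term.startswith("?"):
                if binding.setdefault(pattern_term, term) != term:
                    ok = False
                    break
            elif pattern_term != term:
                ok = False
                break
        if ok:
            result.append(binding)
    return result

def join_card(data: List[Triple], tp_i: Triple, tp_j: Triple, v: str) -> int:
    """Number of solution pairs of tp_i and tp_j that agree on variable v."""
    right_counts = Counter(mu[v] for mu in solutions(data, tp_j))
    return sum(right_counts[mu[v]] for mu in solutions(data, tp_i))

data_of_u = [
    ("<http://.../alice>", "ex:interested_in", "<http://.../b1>"),
    ("<http://.../alice>", "ex:interested_in", "<http://.../b2>"),
    ("<http://.../b1>", "rdf:type", "<http://.../Book>"),
    ("<http://.../b2>", "rdf:type", "<http://.../Book>"),
    ("<http://.../b3>", "rdf:type", "<http://.../Book>"),
]
tp_int = ("?p", "ex:interested_in", "?b")
tp_type = ("?b", "rdf:type", "<http://.../Book>")
print(join_card(data_of_u, tp_int, tp_type, "?b"))  # 2
```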
  • 27. Refinement at Run-Time [LT10] ● During query execution, new information becomes available: (1) intermediate join results, (2) more incoming links ● Use it to adjust scores and ranking (for integrated execution) ● Re-estimate join pattern cardinalities based on samples of intermediate results (available from the hash tables of the symmetric hash join, SHJ) ● Parameters for influencing the behavior of the ranking process: ➢ Invalid score threshold: re-rank when the number of URIs with invalid scores passes this threshold ➢ Sample size: larger samples give better estimates, but make the process more costly ➢ Re-sampling threshold: reuse cached estimates unless the hash table of a join operator grows past this threshold
  • 28. Outline: (1) Heuristics-Based Planning √ (2) Optimizing Link Traversing Iterators ➢ Prefetching ➢ Postponing √ (3) Source Ranking ➢ Harth et al. [HHK+10, UHK+11] ➢ Ladwig and Tran [LT10] √
  • 29. Tutorial Outline: (1) Introduction (2) Theoretical Foundations (3) Source Selection Strategies (4) Execution Process (5) Query Planning and Optimization … Thanks!
  • 31. These slides have been created by Olaf Hartig for the WWW 2013 tutorial on Linked Data Query Processing. Tutorial website: http://db.uwaterloo.ca/LDQTut2013/ This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/). (Slides 24 - 26, 33, and 34 are inspired by slides from Günter Ladwig [LT10] – Thanks!)
  • 32. Backup Slides
  • 33. Metric: Links to Results [LT10] ● Rationale: a URI is more relevant if data from many relevant URIs mentions it ● Links are only discovered at run-time ● The “links to results” of a selected URI u are defined by links(u) := { l ∈ links(u, u_processed) | u_processed ∈ U_processed }, where U_processed is the set of URIs whose data has already been processed and links(u1, u2) are the links to URI u1 mentioned in the data from URI u2
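The set links(u) can be collected incrementally as data from processed URIs arrives. The sketch below makes an assumption that the slide leaves open: a "link to u" is taken to be any triple in the data from a processed URI that mentions u in subject or object position, and each link is recorded together with its source. All data is made up.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

def links(u: str, processed: Iterable[str],
          data_by_uri: Dict[str, List[Triple]]) -> Set[Tuple[str, Triple]]:
    """Links to u found in the data of already processed URIs.

    Assumption: a link is a triple mentioning u as subject or object.
    """
    found: Set[Tuple[str, Triple]] = set()
    for source in processed:
        for triple in data_by_uri.get(source, []):
            if u in (triple[0], triple[2]):
                found.add((source, triple))
    return found

data_by_uri = {
    "ex:alice": [("ex:alice", "ex:knows", "ex:bob"),
                 ("ex:alice", "ex:interested_in", "ex:b1")],
    "ex:carol": [("ex:carol", "ex:knows", "ex:bob")],
    "ex:dave": [("ex:dave", "ex:knows", "ex:bob")],
}
processed = ["ex:alice", "ex:carol"]  # ex:dave has not been processed yet
print(len(links("ex:bob", processed, data_by_uri)))  # 2
```

The count grows as more URIs are processed, which is why this score can only be refined at run-time.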
  • 34. Metric: Retrieval Cost [LT10] ● Rationale: URIs are more relevant the faster their data can be retrieved ● The retrieval cost of a selected URI u is defined by cost(u) := Agg( size(u) , bandwidth(u) ), where size(u) is the size of the data from u, and bandwidth(u) is the bandwidth of the Web server that hosts u ● Size is available in the pre-populated index ● Bandwidth for any particular host can be approximated based on past experience or on the average performance recorded during the query execution process
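The slide leaves the aggregation function Agg unspecified. One plausible instantiation, shown here purely as an assumption, is the estimated transfer time size(u) / bandwidth(u); the units and example numbers are made up.

```python
def retrieval_cost(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Estimated transfer time in seconds.

    Assumption: Agg(size, bandwidth) = size / bandwidth; [LT10] may aggregate differently.
    """
    if bandwidth_bytes_per_s <= 0:
        return float("inf")  # unknown or unreachable host: treat as most expensive
    return size_bytes / bandwidth_bytes_per_s

# Hypothetical index entries: URI -> (size in bytes, bandwidth in bytes per second)
sources = {
    "http://example.org/a": (200_000, 50_000),
    "http://example.org/b": (30_000, 60_000),
}
costs = {uri: retrieval_cost(size, bw) for uri, (size, bw) in sources.items()}
cheapest_first = sorted(costs, key=costs.get)
print(costs["http://example.org/a"], cheapest_first[0])
# 4.0 http://example.org/b
```

Under this reading a lower cost means a more relevant URI, so the value would enter a combined rank with a negative weight or after inversion.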