PRINCIPLES OF DATA INTEGRATION
CHAPTER 14: DATA PROVENANCE
ANHAI DOAN, ALON HALEVY, ZACHARY IVES
“Where Did this Data Come from?”
Challenge: integrated data may come from many sources and mappings, of differing quality or trustworthiness!
• How did I get this particular result?
• What mappings produced it?
• How much should I trust (believe) it?
Data provenance (lineage) captures the relationships between tuples in a set of data instances
An Example: View Tuple Derivations
Source relations:

R:  A  B
    1  2
    1  4

S:  B  C
    2  3
    3  2
    4  3

View V1 = R ⋈ S ∪ S ⋈ S

V1: A  C  directly derivable by
    1  3  R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3)
    2  2  S(2,3) ⋈ ρ[B→A, C→B] S(3,2)
    3  3  S(3,2) ⋈ ρ[B→A, C→B] S(2,3)
Formulating a Provenance Model
Conceptually, provenance captures the operations and operands going into a result
There are many options to do this, and many levels of detail!
A “good” provenance model should:
• Have a formal semantics
• Have equivalence properties such that equivalent query plans produce equivalent provenance
• Connect to notions of value, quality, or score
Outline
• The two views of provenance
• Applications of data provenance
• Provenance semirings: one ring to rule them all
• Storing provenance
Provenance as Annotations on Data
• Annotate each derivation with an “explanation” in terms of relational algebra and the tuple operands
• Lets us “look up” the derivation of a result

R:  A  B
    1  2
    1  4

S:  B  C
    2  3
    3  2
    4  3

View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)
V1(x,x) :- S(x,y), S(y,x)

V1: A  C  provenance annotation
    1  3  R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3)
    2  2  S(2,3) ⋈ ρ[B→A, C→B] S(3,2)
    3  3  S(3,2) ⋈ ρ[B→A, C→B] S(2,3)
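The annotations above can be reproduced mechanically. The following is a minimal Python sketch, not the book's implementation: it evaluates the two Datalog rules for V1 by nested loops and records, for each view tuple, every combination of source tuples that derives it. The names `derive_v1` and `explain` are hypothetical helpers introduced only for this illustration.

```python
# Source relations from the running example, as sets of tuples.
R = {(1, 2), (1, 4)}
S = {(2, 3), (3, 2), (4, 3)}


def derive_v1(r, s):
    """Map each V1 tuple to the list of derivations that produce it.

    A derivation is a pair of labelled source tuples, e.g.
    (("R", (1, 2)), ("S", (2, 3))).
    """
    derivations = {}
    # Rule 1: V1(x,z) :- R(x,y), S(y,z)
    for (x, y) in sorted(r):
        for (y2, z) in sorted(s):
            if y == y2:
                derivations.setdefault((x, z), []).append(
                    (("R", (x, y)), ("S", (y, z))))
    # Rule 2: V1(x,x) :- S(x,y), S(y,x)
    for (x, y) in sorted(s):
        for (y2, x2) in sorted(s):
            if y == y2 and x == x2:
                derivations.setdefault((x, x), []).append(
                    (("S", (x, y)), ("S", (y, x))))
    return derivations


def explain(derivs):
    """Render alternative derivations as 'join ∪ join' text."""
    return " ∪ ".join(
        " ⋈ ".join(f"{rel}{tup}" for rel, tup in d) for d in derivs)


V1 = derive_v1(R, S)
for tup in sorted(V1):
    print(tup, explain(V1[tup]))
```

The tuple V1(1,3) ends up with two alternative derivations, while V1(2,2) and V1(3,3) have one each, matching the annotation column.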
Provenance as a Graph of Relationships
• Bipartite graph: tuple nodes connected via “derivation nodes”
• Encodes a hypergraph (hyperedges = derivations)
• Makes direct derivation relationships more explicit
[Figure: provenance graph. Source tuple nodes R(1,2), R(1,4), S(2,3), S(3,2), S(4,3) connect through four “derives via V1” derivation nodes to view tuple nodes V1(1,3), V1(2,2), V1(3,3).]
Making the Two Interchangeable
• We can make these equivalent by introducing provenance tokens (equiv. node IDs) for each tuple
• Derived tuples’ annotations = expressions over tokens

R:  A  B  ann
    1  2  r1
    1  4  r2

S:  B  C  ann
    2  3  s1
    3  2  s2
    4  3  s3

V1: A  C  ann
    1  3  v1 = r1 ⋈ s1 ∪ r2 ⋈ s3
    2  2  v2 = s1 ⋈ s2
    3  3  v3 = s2 ⋈ s1

[Figure: the same provenance graph with token nodes r1, r2, s1, s2, s3 connected through four V1 derivation nodes to v1, v2, v3.]
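To make the interchangeability concrete, here is a small Python sketch under simple assumptions: the graph is stored as a list of hyperedges (source tokens, target token), and annotations are plain strings using ⋈ and ∪. The function names are hypothetical and the string format is chosen only for this illustration.

```python
from collections import defaultdict

# Graph view: one hyperedge per derivation node,
# written as (source tokens, derived token).
derivation_nodes = [
    (("r1", "s1"), "v1"),
    (("r2", "s3"), "v1"),
    (("s1", "s2"), "v2"),
    (("s2", "s1"), "v3"),
]


def graph_to_annotations(edges):
    """Join the sources of each derivation; union the alternatives."""
    alternatives = defaultdict(list)
    for sources, target in edges:
        alternatives[target].append(" ⋈ ".join(sources))
    return {t: " ∪ ".join(alts) for t, alts in alternatives.items()}


def annotations_to_graph(annotations):
    """Inverse direction: split each expression back into hyperedges."""
    edges = []
    for target, expr in annotations.items():
        for alternative in expr.split(" ∪ "):
            edges.append((tuple(alternative.split(" ⋈ ")), target))
    return edges


annotations = graph_to_annotations(derivation_nodes)
for token, expr in annotations.items():
    print(token, "=", expr)
```

Converting to annotations and back returns the original hyperedge list, which is the sense in which the two views carry the same information.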
Where Can We Use Provenance?
Explanations
• Help the user understand why an item exists
Scoring
• Provide a ranked list of “most relevant” results
Reasoning about interactions
• Help the user understand data relationships
Examples of Provenance’s Utility
Schema mapping debugging:
• We may have a bad result
• Determine why that result exists and what is faulty
Bioinformatics data integration:
• Different sources have different levels of reliability or authoritativeness
• Rank results by score!
Probabilistic databases:
• We may need to know that results are correlated
• Encode the relationships and use them to assign probabilities
The Notion of Provenance as Annotations
• Many formalisms were defined for using query computations to produce annotations
  • Each captured certain subtleties
• The key question: Is there one “most powerful” model that captures the properties of the relational algebra*?
  • Equivalent queries should produce equivalent provenance
* over multi-sets or bags, as used by “real” systems
The Provenance Semiring Model
To represent provenance, use:
• A set of provenance tokens or tuple IDs, K
• Abstract operators representing combination of tuples
  • Abstract sum operator, ⊕, for union or projection
    • has identity element 0 (a ⊕ 0 ≡ 0 ⊕ a ≡ a)
  • Abstract product operator, ⊗, for join
    • has identity element 1 (a ⊗ 1 ≡ 1 ⊗ a ≡ a)
    • also, 0 annihilates (a ⊗ 0 ≡ 0 ⊗ a ≡ 0)
This is formally a commutative semiring: ⊕ and ⊗ are both associative and commutative, and ⊗ distributes over ⊕
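As an illustration of these laws, the sketch below packages a candidate (0, 1, ⊕, ⊗) in a small Python class and spot-checks the standard commutative semiring laws (a ⊕ 0 = a, a ⊗ 1 = a, a ⊗ 0 = 0, associativity, commutativity, distributivity) on a finite sample. The `Semiring` class and `check_laws` function are assumptions made for this example, and a finite check is only a sanity test, not a proof.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Semiring:
    zero: Any                          # identity for ⊕, annihilator for ⊗
    one: Any                           # identity for ⊗
    plus: Callable[[Any, Any], Any]    # abstract sum ⊕ (union, projection)
    times: Callable[[Any, Any], Any]   # abstract product ⊗ (join)


def check_laws(sr: Semiring, samples: Iterable[Any]) -> bool:
    """Spot-check the commutative semiring laws on a finite sample."""
    vals = list(samples)
    for a in vals:
        identities = (
            sr.plus(a, sr.zero) == a and sr.plus(sr.zero, a) == a
            and sr.times(a, sr.one) == a and sr.times(sr.one, a) == a
            and sr.times(a, sr.zero) == sr.zero
            and sr.times(sr.zero, a) == sr.zero
        )
        if not identities:
            return False
    for a, b, c in product(vals, repeat=3):
        laws = (
            sr.plus(a, b) == sr.plus(b, a)
            and sr.times(a, b) == sr.times(b, a)
            and sr.plus(sr.plus(a, b), c) == sr.plus(a, sr.plus(b, c))
            and sr.times(sr.times(a, b), c) == sr.times(a, sr.times(b, c))
            and sr.times(a, sr.plus(b, c))
                == sr.plus(sr.times(a, b), sr.times(a, c))
        )
        if not laws:
            return False
    return True


counting = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
boolean = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
tropical = Semiring(float("inf"), 0, min, lambda a, b: a + b)
# Same operators, but with 0 wrongly chosen as the identity for min.
bad_tropical = Semiring(0, 0, min, lambda a, b: a + b)

print(check_laws(counting, range(4)))
print(check_laws(boolean, [False, True]))
print(check_laws(tropical, [0, 1, 3, float("inf")]))
print(check_laws(bad_tropical, [0, 1, 3]))
```

The last structure fails because min(a, 0) is 0 rather than a, which shows why the identity for ⊕ has to be chosen per semiring (here, infinity for min).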
The Provenance Semiring Model
• We can re-express our example as below, using the semiring operators instead of the relational algebra ones

R:  A  B  ann
    1  2  r1
    1  4  r2

S:  B  C  ann
    2  3  s1
    3  2  s2
    4  3  s3

V1: A  C  ann
    1  3  v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3
    2  2  v2 = s1 ⊗ s2
    3  3  v3 = s2 ⊗ s1

[Figure: provenance graph with token nodes r1, r2, s1, s2, s3 connected through four V1 derivation nodes to v1, v2, v3.]
Tokens for Mappings
• Sometimes we would like to assign a token to the actual mapping or rule used, so we can assign it a value

View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)    (call this m1)
V1(x,x) :- S(x,y), S(y,x)    (call this m2)

R:  A  B  ann
    1  2  r1
    1  4  r2

S:  B  C  ann
    2  3  s1
    3  2  s2
    4  3  s3

V1: A  C  ann
    1  3  v1 = m1 ⊗ [r1 ⊗ s1] ⊕ m1 ⊗ [r2 ⊗ s3]
    2  2  v2 = m2 ⊗ [s1 ⊗ s2]
    3  3  v3 = m2 ⊗ [s2 ⊗ s1]
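One way to see why mapping tokens are useful is to give them values. The Python sketch below is an illustration under stated assumptions: expressions are encoded as nested tuples (an encoding invented for this example), both derivations of V1(1,3) carry m1 because both come from the first rule, and the trust assignment is hypothetical. Trust is evaluated with ⊗ as logical AND and ⊕ as logical OR.

```python
# Provenance expressions with mapping tokens, as nested tuples:
# ("+", e1, e2, ...) is ⊕, ("*", e1, e2, ...) is ⊗, a string is a token.
provenance = {
    "v1": ("+", ("*", "m1", "r1", "s1"), ("*", "m1", "r2", "s3")),
    "v2": ("*", "m2", "s1", "s2"),
    "v3": ("*", "m2", "s2", "s1"),
}


def trusted(expr, trust):
    """Evaluate an expression with ⊗ = and, ⊕ = or."""
    if isinstance(expr, str):
        return trust[expr]
    op, *args = expr
    results = [trusted(arg, trust) for arg in args]
    return any(results) if op == "+" else all(results)


# Hypothetical assignment: all base tuples and m1 are trusted, m2 is not.
trust = {t: True for t in ("r1", "r2", "s1", "s2", "s3", "m1")}
trust["m2"] = False

for view_tuple, expr in provenance.items():
    print(view_tuple, trusted(expr, trust))
```

Distrusting m2 removes trust from V1(2,2) and V1(3,3) without touching any base tuple, which is exactly what a per-mapping token allows.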
Example Application: Provenance Visualization
[Figure: screenshot of a provenance graph visualization. Callouts label the tuple nodes, a base tuple derivation (token not shown), and a derivation by mapping M5.]
Example Application: Tuple Scoring
• For ranked query results, we may adopt the following model, commonly used in ranking:
  • Assign a score to each base tuple = −log2(probability)
  • Use arithmetic sum as ⊗
  • Use min as ⊕
• Suppose prob(r1) = 0.5, prob(s1) = 0.5, and all others are 1.0, so score(r1) = score(s1) = 1 and every other score is 0

V1: A  C  ann
    1  3  v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3 = min((1+1),(0+0)) = 0
    2  2  v2 = s1 ⊗ s2 = 1+0 = 1
    3  3  v3 = s2 ⊗ s1 = 0+1 = 1
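The scoring model can be checked with a few lines of Python. This sketch takes the probabilities stated above, converts each to a score of −log2(probability) (so r1 and s1 score 1 and the remaining base tuples score 0), and evaluates the three annotations with sum as ⊗ and min as ⊕. The helper names `times` and `plus` are assumptions for this illustration.

```python
import math

# Probabilities from the example; every unlisted tuple has probability 1.0.
prob = {"r1": 0.5, "r2": 1.0, "s1": 0.5, "s2": 1.0, "s3": 1.0}

# log2(1/p) equals -log2(p) and avoids a negative zero for p = 1.0.
score = {token: math.log2(1 / p) for token, p in prob.items()}


def times(*xs):
    """⊗: joining tuples adds their scores."""
    return sum(xs)


def plus(*xs):
    """⊕: among alternative derivations, keep the best (lowest) score."""
    return min(xs)


v1 = plus(times(score["r1"], score["s1"]), times(score["r2"], score["s3"]))
v2 = times(score["s1"], score["s2"])
v3 = times(score["s2"], score["s1"])

# Lower score means higher probability, so sort ascending.
ranked = sorted({"v1": v1, "v2": v2, "v3": v3}.items(), key=lambda kv: kv[1])
print(ranked)
```

V1(1,3) ranks first because its second derivation uses only certain tuples; the other two view tuples each depend on s1.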
Useful Semirings
Use case                Base value                    Product R ⊗ S       Sum R ⊕ S
Derivability            True                          R ∧ S               R ∨ S
Trust                   Trust condition result        R ∧ S               R ∨ S
Confidentiality level   Tuple confidentiality level   More_secure(R, S)   Less_secure(R, S)
Weight / cost           Base tuple weight             R + S               min(R, S)
Lineage                 Tuple ID                      R ∪ S               R ∪ S
Probabilistic event     Tuple probabilistic event     R ∧ S               R ∨ S
Number of derivations   1                             R ⋅ S               R + S
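The table suggests that a single provenance expression can be re-evaluated under different semirings. Below is a hedged Python sketch of that idea for three of the rows: derivability, number of derivations, and confidentiality level. The list-of-lists encoding, the `evaluate` helper, and all base values (which tuple is missing, which levels are assigned) are hypothetical choices for this example.

```python
from functools import reduce

# Each view tuple maps to a list of alternative derivations (⊕),
# each derivation being a list of joined tokens (⊗).
provenance = {
    "v1": [["r1", "s1"], ["r2", "s3"]],
    "v2": [["s1", "s2"]],
    "v3": [["s2", "s1"]],
}
tokens = ["r1", "r2", "s1", "s2", "s3"]


def evaluate(poly, base, plus, times):
    """Fold ⊗ over each derivation, then ⊕ over the alternatives."""
    products = [reduce(times, (base[t] for t in d)) for d in poly]
    return reduce(plus, products)


# Derivability: suppose s1 is no longer present in the source.
present = {t: True for t in tokens}
present["s1"] = False
derivable = {v: evaluate(p, present, lambda a, b: a or b,
                         lambda a, b: a and b)
             for v, p in provenance.items()}

# Number of derivations: every base tuple counts as 1.
ones = {t: 1 for t in tokens}
count = {v: evaluate(p, ones, lambda a, b: a + b, lambda a, b: a * b)
         for v, p in provenance.items()}

# Confidentiality: 0 = public, 1 = confidential, 2 = secret.
# A join is as secret as its most secret input (max); alternatives
# expose the tuple at the least secret derivation (min).
level = {"r1": 2, "r2": 0, "s1": 1, "s2": 0, "s3": 0}
clearance = {v: evaluate(p, level, min, max) for v, p in provenance.items()}

print(derivable)
print(count)
print(clearance)
```

Only the base values and the two operators change between the three evaluations; the provenance expressions themselves are computed once.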
Storing Provenance
• Use tuple keys as tokens
• Encode provenance graph as relations

R:  A  B
    1  2
    1  4

S:  B  C
    2  3
    3  2
    4  3

V1: A  C
    1  3
    2  2
    3  3

View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)
V1(x,x) :- S(x,y), S(y,x)

Relate tuples with table Pv1:

Pv1-1:
R.A  R.B  S.B  S.C  V1.A  V1.C
1    2    2    3    1     3
1    4    4    3    1     3

Pv1-2:
S.B  S.C  S.B’  S.C’  V1.A  V1.C
2    3    3     2     2     2
3    2    2     3     3     3
Some of these columns are redundant if we know the Datalog rules, so the provenance tables can be stored more compactly:
Pv1-1:
A  B  C
1  2  3
1  4  3

Pv1-2:
B  C  C’
2  3  2
3  2  3
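A minimal sketch of this storage scheme, using Python's built-in `sqlite3` module, is shown below. It is one possible encoding rather than the book's system: the table names `Pv1_1` and `Pv1_2`, the column name `C2` (standing in for C’), and the choice of SQLite are all assumptions. Each provenance table holds one row per derivation of the corresponding rule, and the view is recovered by projecting those tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE R (A INTEGER, B INTEGER, PRIMARY KEY (A, B));
    CREATE TABLE S (B INTEGER, C INTEGER, PRIMARY KEY (B, C));
    INSERT INTO R VALUES (1, 2), (1, 4);
    INSERT INTO S VALUES (2, 3), (3, 2), (4, 3);

    -- Rule 1: V1(x,z) :- R(x,y), S(y,z); one row per derivation.
    CREATE TABLE Pv1_1 AS
        SELECT R.A AS A, R.B AS B, S.C AS C
        FROM R JOIN S ON R.B = S.B;

    -- Rule 2: V1(x,x) :- S(x,y), S(y,x); C2 plays the role of C'.
    CREATE TABLE Pv1_2 AS
        SELECT S1.B AS B, S1.C AS C, S2.C AS C2
        FROM S AS S1 JOIN S AS S2
          ON S1.C = S2.B AND S2.C = S1.B;
""")

# The view is a projection of the provenance tables.
view = cur.execute("""
    SELECT A, C FROM Pv1_1
    UNION
    SELECT B AS A, C2 AS C FROM Pv1_2
    ORDER BY 1, 2
""").fetchall()

# "Why is V1(1,3) in the view?" becomes a lookup in Pv1_1.
why = cur.execute(
    "SELECT A, B, C FROM Pv1_1 WHERE A = ? AND C = ? ORDER BY B",
    (1, 3),
).fetchall()

pv1_2 = cur.execute("SELECT B, C, C2 FROM Pv1_2 ORDER BY B").fetchall()

print(view)
print(why)
print(pv1_2)
conn.close()
```

Because the rule fixes which columns must be equal (for example R.B = S.B in the first rule), only three columns per derivation need to be stored.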
Data Provenance Wrap-up
• Provenance is critical to understanding and assessing the believability of data, and in debugging
• Two equivalent representations: annotations vs. graph
• The provenance semiring model preserves the “expected” equivalences of the relational algebra
• We can take semiring provenance and evaluate it with different semirings to get useful scores
• We can store provenance using relations
• Recent work beyond the scope of the book:
  • Extending provenance to more complex queries, e.g., with aggregation
  • Languages for querying provenance (primarily as a graph)