Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms
Simon Price and Peter Flach
ILP 2008
Our Aim
Query heterogeneous data sources as if their data were conveniently held in a single relational database.
Example data sources:
• web pages
• digital libraries
• knowledge bases
• Semantic Web
• databases
Outline of this paper
1. Relational Joins (a quick review)
2. Basic Terms
3. Relational Joins for Basic Terms
4. Basic Term Proximity-Join
5. Application
Contribution of this work
Relational Algebra for Basic Terms
Basic Term Proximity-Join
Application to bibliographic data
1. Relational Joins (a quick review)
θ-Join

publications
publication#  title  venue  year
1             a      b      c
2             d      e      f

authors
author#  publication#  name
1        1             p
2        1             q
3        2             r

Result of joining publications with authors
publication#  title  venue  year  author#  publication#  name
1             a      b      c     1        1             p
2             d      e      f     3        2             r
1             a      b      c     2        1             q
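The join on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The join condition (equal publication# in both tables) is an assumption inferred from the result rows, and the name theta_join is hypothetical.

```python
from itertools import product

# Rows copied from the slide: (publication#, title, venue, year)
publications = [
    (1, "a", "b", "c"),
    (2, "d", "e", "f"),
]
# Rows copied from the slide: (author#, publication#, name)
authors = [
    (1, 1, "p"),
    (2, 1, "q"),
    (3, 2, "r"),
]


def theta_join(left, right, theta):
    """Concatenate every pair of rows (l, r) for which theta(l, r) holds."""
    return [l + r for l, r in product(left, right) if theta(l, r)]


# Assumed condition: publications.publication# == authors.publication#
joined = theta_join(publications, authors, lambda p, a: p[0] == a[1])
for row in joined:
    print(row)
```

Passing the condition as a function keeps the sketch general: any comparison between a left row and a right row can be used, which is what distinguishes a θ-join from a plain equi-join.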
2. Basic Terms
Basic Terms
• Proposed by John Lloyd
• Family of typed terms in higher-order logic
• Based on Church’s simple theory of types
• Data types representing:
  • tuples
  • structures, e.g. trees and graphs
  • abstractions, e.g. sets and multisets (bags)
• Basic Terms and the “individuals-as-terms” model
Lloyd, J. W.: Logic for Learning. Springer, New York (2003)
Representing Individuals as Basic Terms
1. Define basic type structure
   e.g. an academic publication with the following basic type structure
2. Transform data instances to basic terms of that type
   e.g. a publication record from the CORA bibliographic database
   ( { “Mitchell, T.”, “Thrun, S.” },
     “Explanation-Based Learning: A Comparison of Symbolic and Neural Network Approaches.”,
     “In Proceedings of the Tenth International Conference on Machine Learning”,
     “1993” )
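One way to picture such a term in code is a tuple whose first component is a set. The sketch below is an approximation in Python, not Lloyd's typed higher-order formalism. The class name Publication and the field names coauthors, title, venue and year are assumptions chosen to match the labels used elsewhere in the talk.

```python
from typing import FrozenSet, NamedTuple


class Publication(NamedTuple):
    """Hypothetical stand-in for the publication basic type: a tuple
    whose first component is an abstraction (a set of author names)."""
    coauthors: FrozenSet[str]
    title: str
    venue: str
    year: str


# The CORA record from the slide, written as a nested immutable value.
record = Publication(
    coauthors=frozenset({"Mitchell, T.", "Thrun, S."}),
    title=("Explanation-Based Learning: A Comparison of Symbolic "
           "and Neural Network Approaches."),
    venue=("In Proceedings of the Tenth International Conference "
           "on Machine Learning"),
    year="1993",
)

print(record.coauthors)
print(record[1])  # positional access works too, as for any tuple
```

Using frozenset rather than set keeps the whole term hashable, so terms can themselves be collected into sets, which is convenient when a relation is a set of terms.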
3. Relational Joins for Basic Terms
Upgrading Relational Joins for Basic Terms
1. Restate (a subset of) traditional relational algebra
   1. Remove schematic metadata from the data itself
   2. Make explicit the tuple item indexing function
2. Replace sets of tuples (relations) with sets of basic terms (“basic term relations”)
3. Upgrade indexing function to index all types of basic terms:
   1. basic tuples (tuples of basic terms)
   2. basic structures (e.g. lists and trees of basic terms)
   3. basic abstractions (e.g. sets and multisets of basic terms)
   4. any combination of the above three
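The idea of an explicit indexing function that reaches into nested terms can be sketched as follows. This is a rough positional approximation, not the type-tree indexing defined in the paper. The function name subterm and the choice to return sets whole (since they have no positional order) are assumptions made for illustration.

```python
def subterm(term, path=()):
    """Follow `path`, a sequence of 0-based positions, into a nested term.

    Tuples and lists are indexed by position. Sets are treated as
    unordered wholes, so a path may end at a set but not go inside it.
    """
    for position in path:
        if not isinstance(term, (tuple, list)):
            raise TypeError(
                f"cannot index into {type(term).__name__} at position {position}"
            )
        term = term[position]
    return term


# A term mixing a set, atoms, a tuple and a list.
term = (frozenset({"p", "q"}), "a", ("b", ["c", "d"]))

print(subterm(term))             # the whole term
print(subterm(term, (0,)))       # the set component
print(subterm(term, (2, 1, 0)))  # an atom inside the nested list
```

With an empty path the function returns the term itself, so comparing whole terms and comparing components can go through the same code path.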
Basic Term θ-Join (Example 1)
A = { ({p, q}, a, b, c),
      ({r}, d, e, f),
      ({s, t}, d, g, h) }
B = { ({p, q}, a, b, c),
      ({p, q}, d, e, f) }
$A \bowtie_{\mathit{Title} = \mathit{Title}} B$ =
    { ( ({p, q}, a, b, c), ({p, q}, a, b, c) ),
      ( ({r}, d, e, f), ({p, q}, d, e, f) ),
      ( ({s, t}, d, g, h), ({p, q}, d, e, f) ) }
Basic Term θ-Join (Example 2)
A = { ({p, q}, a, b, c),
      ({r}, d, e, f),
      ({s, t}, g, h, k) }
B = { ({p, q}, a, b, c),
      ({p, q}, d, e, f) }
$A \bowtie_{\mathit{Coauthors} = \mathit{Coauthors}} B$ =
    { ( ({p, q}, a, b, c), ({p, q}, a, b, c) ),
      ( ({p, q}, a, b, c), ({p, q}, d, e, f) ) }
Basic Term θ-Join (Example 3)
A = { ({p, q}, a, b, c),
      ({r}, d, e, f),
      ({s, t}, g, h, k) }
B = { ({p, q}, a, b, c),
      ({p, q}, d, e, f) }
$A \bowtie_{\mathit{Publication} = \mathit{Publication}} B$ =
    { ( ({p, q}, a, b, c), ({p, q}, a, b, c) ) }
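The three examples can be reproduced with a small Python sketch, assuming terms are modelled as tuples whose first component is a frozenset. The helper names fs and basic_term_join and the position constants are hypothetical. Only equality on one component, or on the whole term, is covered here.

```python
from itertools import product


def fs(*items):
    """Shorthand for a set-valued component such as {p, q}."""
    return frozenset(items)


# Assumed component positions within a publication term.
COAUTHORS, TITLE = 0, 1

A1 = {(fs("p", "q"), "a", "b", "c"), (fs("r"), "d", "e", "f"),
      (fs("s", "t"), "d", "g", "h")}          # A from Example 1
A2 = {(fs("p", "q"), "a", "b", "c"), (fs("r"), "d", "e", "f"),
      (fs("s", "t"), "g", "h", "k")}          # A from Examples 2 and 3
B = {(fs("p", "q"), "a", "b", "c"), (fs("p", "q"), "d", "e", "f")}


def basic_term_join(left, right, index=None):
    """Pair up terms whose component at `index` is equal.

    With index=None the whole terms are compared.
    """
    def pick(term):
        return term if index is None else term[index]
    return {(s, t) for s, t in product(left, right) if pick(s) == pick(t)}


example1 = basic_term_join(A1, B, TITLE)      # join on Title
example2 = basic_term_join(A2, B, COAUTHORS)  # join on Coauthors
example3 = basic_term_join(A2, B)             # join on the whole Publication
print(len(example1), len(example2), len(example3))
```

Because frozensets compare by content, the join on Coauthors matches {p, q} against {p, q} regardless of element order, which a join on a flattened relational table could not express in a single comparison.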
4. Basic Term Proximity Join
Replacing Equality with Proximity
[Formula: proximity, defined in terms of a distance and a threshold]
Properties of Proximity
• Proximity is a dependency relation, but not an equivalence relation, i.e. proximity is reflexive and symmetric but not necessarily transitive.
Normalisation
• dist is not constrained to have an upper bound.
• Some normalising function may be used, usually into the closed interval [0, 1].
• Or can normalise in feature space (e.g. normalising kernels).
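A small numeric sketch may help show why proximity is reflexive and symmetric yet not transitive. It assumes proximity means "distance at most a threshold", which is a guess at the slide's definition. The names proximate, normalise and abs_dist are hypothetical, and d / (1 + d) is just one possible normalising function.

```python
def proximate(x, y, dist, threshold):
    """Assumed definition: x and y are proximate when dist(x, y) <= threshold."""
    return dist(x, y) <= threshold


def normalise(d):
    """Squash an unbounded non-negative distance into [0, 1).

    This is one possible choice of normalising function, not
    necessarily the one used by the authors.
    """
    return d / (1.0 + d)


def abs_dist(x, y):
    """A simple distance on numbers, used only for the demonstration."""
    return abs(x - y)


threshold = 1.0
reflexive = proximate(0.0, 0.0, abs_dist, threshold)
symmetric = (proximate(0.0, 1.0, abs_dist, threshold)
             == proximate(1.0, 0.0, abs_dist, threshold))
# 0 is close to 1 and 1 is close to 2 ...
chain = (proximate(0.0, 1.0, abs_dist, threshold)
         and proximate(1.0, 2.0, abs_dist, threshold))
# ... but 0 is not close to 2, so proximity is not transitive.
end_to_end = proximate(0.0, 2.0, abs_dist, threshold)

print(reflexive, symmetric, chain, end_to_end)
print(normalise(0.0), normalise(1.0), normalise(1e9))
```

The chain 0, 1, 2 is the usual counterexample: each neighbouring pair is within the threshold while the endpoints are not.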
Basic Term Proximity Join
• Basic Term Projection
• Basic Term θ-Restriction
• Basic Term Proximity Join
s is a basic subterm at type tree index i
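The three operators named on this slide can be sketched in Python as follows. The formal definitions are not reproduced here, so the shapes below are assumptions. Components are indexed by tuple position rather than by type tree index, and Jaccard distance is swapped in as a simple set distance in place of the kernel-derived distance used in the talk.

```python
from itertools import product


def project(relation, index):
    """Basic term projection (sketch): collect the subterm at `index`."""
    return {term[index] for term in relation}


def restrict(relation, predicate):
    """Basic term θ-restriction (sketch): keep terms satisfying `predicate`."""
    return {term for term in relation if predicate(term)}


def proximity_join(left, right, index, dist, threshold):
    """Pair terms whose subterms at `index` lie within `threshold` of each other."""
    return {(s, t) for s, t in product(left, right)
            if dist(s[index], t[index]) <= threshold}


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; a stand-in distance on sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


A = {(frozenset({"p", "q"}), "a"), (frozenset({"r"}), "d")}
B = {(frozenset({"p", "q", "s"}), "a2")}

# {p, q} and {p, q, s} are at Jaccard distance 1/3, so they join at 0.5.
result = proximity_join(A, B, 0, jaccard_distance, 0.5)
print(result)
```

With the threshold set to 0 and a distance that is zero only for equal subterms, this reduces to the equality join on the same component.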
5. Application
Proximity joins on bibliographic publications data
• Given the ground truth where
• The goal is to reconstruct V as V’ by choosing an appropriate s
CORA Data Set
• Currently, ground truths for pairs of data sets are rare. So, we choose with
• CORA-REFS publications data set (1881 instances)
• Data type is the one used in examples throughout this talk
• Example pair of instances that requires an approximate join:
( { “Mitchell, T.”, “Thrun, S.” },
  “Explanation-Based Learning: ...”,
  “In Proceedings of the Tenth ...”,
  “1993” )
( { “Tom Mitchell”, “Sven Thrun” },
  “Explanation based learning: ...”,
  “Proceedings of the 10th ...”,
  “ ’93 ” )
Experiments on CORA data set
For each join:
1. Calculate pairwise distances between all basic terms in CORA
2. Construct a dendrogram
3. Calculate precision-recall at each node in the dendrogram, i.e. plot a point on the p-r chart for each node in the dendrogram
e.g. threshold = 120
Confusion Matrix
TP FN
FP TN
• TP is no. pairs in same cluster that should be in the same cluster.
• TN is no. pairs in different clusters that should be in different clusters.
• FP is etc...
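The pairwise confusion matrix described above can be computed as in the sketch below. It assumes the dendrogram has already been cut at some node to give a cluster label per instance. The function names and the toy labels are hypothetical, and this is not the authors' evaluation code.

```python
from itertools import combinations


def pairwise_confusion(predicted, truth):
    """Count TP, FP, FN, TN over all unordered pairs of instances.

    A pair is 'positive' when both instances share a predicted cluster,
    and 'should be positive' when they share a ground-truth cluster.
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_predicted = predicted[i] == predicted[j]
        same_truth = truth[i] == truth[j]
        if same_predicted and same_truth:
            tp += 1
        elif same_predicted:
            fp += 1
        elif same_truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision and recall; defined as 1.0 when the denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy labels: five instances, ground truth versus one cut of the dendrogram.
truth = [0, 0, 1, 1, 2]
predicted = [0, 0, 0, 1, 2]

tp, fp, fn, tn = pairwise_confusion(predicted, truth)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall)
```

Repeating this for the clustering at each node of the dendrogram yields one precision-recall point per node, which is how the chart on the following slide is described as being built.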
Proximity Joins on CORA Dataset
[Figure: precision-recall curves for proximity joins on Publication, Publication.Coauthors, Publication.Title and Publication.Venue]
• Distance derived from a kernel in the usual way
• Basic term kernel = default kernel for basic terms
• String kernel = p-spectrum kernel
• Default kernel = matching kernel
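The kernels named in these bullets can be sketched as follows, reading "the usual way" as the standard kernel-induced distance d(x, y) = sqrt(k(x, x) - 2 k(x, y) + k(y, y)). The value p = 2 and all function names are assumptions, and the default kernel for basic terms that combines these pieces is not reproduced.

```python
from collections import Counter
from math import sqrt


def p_spectrum_kernel(s, t, p=2):
    """Sum, over every length-p substring, of its count in s times its count in t."""
    counts_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    counts_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(n * counts_t[u] for u, n in counts_s.items())


def matching_kernel(x, y):
    """1 when the two values are equal, otherwise 0."""
    return 1.0 if x == y else 0.0


def kernel_distance(kernel, x, y):
    """Distance induced by a kernel: sqrt(k(x,x) - 2 k(x,y) + k(y,y))."""
    squared = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return sqrt(max(0.0, squared))  # guard against tiny negative rounding


print(p_spectrum_kernel("abab", "ab"))
print(kernel_distance(matching_kernel, "1993", "1993"))
print(kernel_distance(p_spectrum_kernel,
                      "Explanation-Based Learning",
                      "Explanation based learning"))
```

Under the matching kernel any two distinct values are at the same distance, so it only distinguishes equal from unequal, whereas the p-spectrum kernel gives a graded distance between strings that share substrings.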
And finally...
Conclusion
• Relational Algebra for Basic Terms
  • Combines relational model and basic terms in a single formalism
• Basic Term Proximity-Join
  • Enables approximate querying and merging of basic terms
• Application to bibliographic data
  • Shows potential for data integration
❦ ❦ ❦
Default Kernel for Basic Terms