Instance Matching

WWW2012 Tutorial
Practical Cross-Dataset Queries on the Web of Data

Robert Isele
Freie Universität Berlin

Outline
 Motivation
 Link Discovery Tools
 Linking Workflow
 Silk Workbench


Motivation
 The Web of Data is a single global data space because data sources are
connected by links
 Over 31 billion triples published as Linked Open Data and growing
 But:
● Less than 500 million links
● Most publishers only link to one other dataset


Use Case 1: Publishing a New Dataset
 A data provider wants to publish a new dataset
 Wants to interlink with existing data sets from the same
domain
 Example
● A data publisher wants to publish a new dataset about movies
● Interlink movies with LinkedMDB (Linked Movie Data Base)
● Interlink directors with DBpedia (Wikipedia)


Use Case 2: Linked Data Application
 Linked Data application integrates multiple data sources from
the same domain
 In the decentralized Web of Data, many data sources use
different URIs for the same real world object.
 Identifying these URI aliases is a central problem in Linked
Data.


Challenges for Link Discovery

 The Web of Data is heterogeneous
● Many different vocabularies are in use
● Different data formats
● Many different ways to represent the same information

Figure: Distribution of the most widely used vocabularies

 Large range of domains
● 256 data sources in the LOD cloud from a variety of domains
● Linkage Rules are different in each domain
● Writing a linkage rule for each of these domains is usually not
trivial

Figure: Distribution of triples by domain


 Scalability
● The current LOD cloud contains 277 datasets (August 2011)
● 30 billion triples in total
● Infeasible to compare every possible entity pair


Link Discovery Tools
 Tools enable data publishers to set links
 Most tools generate links based on user-defined linkage rules
 A linkage rule specifies the conditions data items must fulfill
in order to be interlinked
 Popular Link Discovery Tools:
● Silk Link Discovery Framework
● LIMES
● Others: http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining


Silk Link Discovery Framework
 Tool for discovering links between data items within different
Linked Data sources.
 The Silk Link Specification Language (Silk-LSL) allows users to
express complex linkage rules
 Can be used to generate owl:sameAs links as well as other
relationships
 Scalability and high performance through efficient data
handling


Silk Versions
 Silk Single Machine
● Generate links on a single machine
● Local or remote data sets
 Silk MapReduce
● Generate RDF links using a cluster of multiple machines
● Based on Hadoop (Can be run on Amazon Elastic MapReduce)
 Silk Server
● Provides an HTTP API for matching instances from an incoming
stream of RDF data while keeping track of known entities
● Can be used as an identity resolution component within
applications that consume Linked Data from the Web

Silk Workbench
 Silk Workbench is a web application which guides the user
through the process of interlinking different data sources.
 Enables the user to manage different sets of data sources
and linking tasks.
 Offers a graphical editor which enables the user to easily
create and edit linkage rules
 Offers tools to evaluate the current linkage rule
 Includes experimental support for learning linkage rules


Linking Workflow


Typical linkage rule
 Select the values to be compared
● Example: Select labels and dates of a music record
 Normalize the values
● Example: Transform dates to a common format
 Compare different values using similarity measures
● Example: Compare labels and dates of a music record
 Aggregate the results of multiple comparisons
● Example: Compute the average of the label and date similarity
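
The four steps above can be sketched in plain Python. This is only an illustration of the select / normalize / compare / aggregate structure, not Silk-LSL and not Silk's implementation: the record layout, the helper names, the accepted date formats, and the use of simple equality as the similarity measure are all assumptions made for the example.

```python
from datetime import datetime

def normalize_date(value):
    """Normalize a date written in one of a few assumed formats to ISO 8601."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value  # leave unrecognized values untouched

def equality(a, b):
    """Simplest possible similarity measure: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0

def linkage_rule(record_a, record_b):
    # 1. Select the values to be compared (label and date of a music record)
    label_a, label_b = record_a["label"], record_b["label"]
    date_a, date_b = record_a["date"], record_b["date"]
    # 2. Normalize the values
    label_a, label_b = label_a.strip().lower(), label_b.strip().lower()
    date_a, date_b = normalize_date(date_a), normalize_date(date_b)
    # 3. Compare the values using a similarity measure
    label_sim = equality(label_a, label_b)
    date_sim = equality(date_a, date_b)
    # 4. Aggregate: average of the label and date similarity
    return (label_sim + date_sim) / 2

# Made-up records describing the same album in two hypothetical sources
a = {"label": "Abbey Road", "date": "1969-09-26"}
b = {"label": "abbey road ", "date": "26.09.1969"}
c = {"label": "Abbey Road", "date": "09/27/1969"}

same = linkage_rule(a, b)       # labels and dates agree after normalization
different = linkage_rule(a, c)  # labels agree, dates differ
```

In practice the equality comparison would be replaced by one of the similarity measures discussed later, so that small differences lower the score instead of dropping it to zero.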


Value selectors
 Values in the graph around the entities can be used for comparison
 Property path languages have been developed for that purpose
 Examples (SPARQL 1.1 Property Paths Language):
● Entity label: rdfs:label
● Movie director name: dbpedia-owl:director/foaf:name
● All movies of a director: ^dbpedia-owl:director/rdfs:label
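
To make the path examples concrete, here is a minimal sketch of a path evaluator over an in-memory list of triples. It handles only two features of SPARQL 1.1 property paths, sequences (`/`) and inverse steps (`^`); the function name, the `ex:` identifiers, and the toy data are assumptions for illustration.

```python
def evaluate_path(triples, start, path):
    """Follow a '/'-separated property path starting from `start`.

    A step prefixed with '^' is traversed in the inverse direction
    (from object to subject). Returns the set of reached nodes/values.
    """
    nodes = {start}
    for step in path.split("/"):
        if step.startswith("^"):
            prop = step[1:]
            nodes = {s for (s, p, o) in triples if p == prop and o in nodes}
        else:
            nodes = {o for (s, p, o) in triples if p == step and s in nodes}
    return nodes

# Toy graph as (subject, predicate, object) triples
triples = [
    ("ex:Inception", "rdfs:label", "Inception"),
    ("ex:Inception", "dbpedia-owl:director", "ex:Nolan"),
    ("ex:Memento", "rdfs:label", "Memento"),
    ("ex:Memento", "dbpedia-owl:director", "ex:Nolan"),
    ("ex:Nolan", "foaf:name", "Christopher Nolan"),
]

director_name = evaluate_path(triples, "ex:Inception", "dbpedia-owl:director/foaf:name")
movies_of_director = evaluate_path(triples, "ex:Nolan", "^dbpedia-owl:director/rdfs:label")
```

Splitting on `/` works here only because the steps are prefixed names; full IRIs in angle brackets would need a real parser.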


Data Transformations
 Different data sets may use different data formats
 Data sets may be noisy
⇒ Values must be normalized prior to comparison


Common Transformations
 Case normalization

 Structural transformation

 Extract values from URIs
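
One possible shape for each of these three transformations is sketched below. The function names and the sample inputs are assumptions for illustration; they are not taken from Silk's transformation library.

```python
from urllib.parse import unquote, urlparse

def lower_case(value):
    """Case normalization."""
    return value.lower()

def swap_name_order(value):
    """Structural transformation: 'Doe, John' becomes 'John Doe'."""
    if "," in value:
        last, first = (part.strip() for part in value.split(",", 1))
        return f"{first} {last}"
    return value

def label_from_uri(uri):
    """Extract a readable value from the fragment or last path segment of a URI."""
    parsed = urlparse(uri)
    local = parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return unquote(local).replace("_", " ")

examples = [
    lower_case("BERLIN"),
    swap_name_order("Doe, John"),
    label_from_uri("http://dbpedia.org/resource/Sean_Connery"),
]
```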


Similarity Measures
 A similarity measure compares two values
 It returns a value between 0 (no similarity) and 1 (equality)
 Formally, a similarity measure is a function:
$$\mathrm{sim} : \Sigma^* \times \Sigma^* \to [0, 1]$$

 Various similarity measures have been proposed
● Character-based measures
● Token-based measures
● Domain-specific measures


Character-Based Similarity Measures
 Usually rely on character edit operations
 Often used for catching typographical errors
 Most popular
● Levenshtein
● Jaro/Jaro-Winkler


Levenshtein Distance
 The minimum number of edits needed to transform one
string into the other
 Allowed edit operations:
● insert a character into the string
● delete a character from the string
● replace one character with a different character
 Examples:
● levenshtein('Table', 'Cable') = 1 (1 substitution)
● levenshtein('Table', 'able') = 1 (1 deletion)
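
A compact way to compute this distance is the standard dynamic-programming formulation, sketched below. Turning the distance into a similarity in [0, 1] by dividing by the length of the longer string is one common choice and is an assumption here, not something prescribed by the slides.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace char_a by char_b (or keep it)
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Map the distance to [0, 1]; normalizing by the longer string is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

substitution = levenshtein("Table", "Cable")
deletion = levenshtein("Table", "able")
```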


Token-Based Similarity Measures
 Character-based measures work well for typographical
errors, but fail when word arrangements differ
 Example: 'John Doe', 'Doe, John', 'Mr. John Doe'

 Token-based measures split the values into tokens before
computing the similarity
 Example: tokenize('Mr. John Doe') = {'Mr.', 'John', 'Doe'}

 Most popular: Jaccard, Dice


Jaccard coefficient
 Intuition: Measure the fraction of the tokens which are
shared by both strings
 Defined as the number of matching words divided by the
total number of distinct words:

$$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

 Example:
$$\mathrm{Jaccard}(\{\text{Thomas}, \text{Sean}, \text{Connery}\}, \{\text{Sir}, \text{Sean}, \text{Connery}\}) = \frac{2}{4} = 0.5$$
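
The token-based measures can be sketched with Python sets. Whitespace tokenization and the convention that two empty token sets are fully similar are assumptions made for this example; the Dice coefficient is included because it is named above as the other popular measure.

```python
def tokenize(value):
    """Split a value into a set of whitespace-separated tokens."""
    return set(value.split())

def jaccard(a, b):
    """Shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient: twice the shared tokens divided by the sum of the set sizes."""
    if not a and not b:
        return 1.0  # same convention as above
    return 2 * len(a & b) / (len(a) + len(b))

left = tokenize("Thomas Sean Connery")
right = tokenize("Sir Sean Connery")
jaccard_score = jaccard(left, right)  # 2 shared tokens out of 4 distinct ones
dice_score = dice(left, right)        # 2 * 2 shared tokens over 3 + 3 tokens
```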


Domain-Specific Similarity Measures

 Geographic distance
 Date/Time
 Numbers
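
As an example of a domain-specific measure, geographic distance can be turned into a similarity value. The sketch below uses the haversine great-circle distance; the linear mapping from distance to similarity and the `max_distance_km` parameter are assumptions for illustration, not a description of how any particular tool does it.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def geo_similarity(point_a, point_b, max_distance_km):
    """1.0 for identical points, falling linearly to 0.0 at max_distance_km."""
    distance = haversine_km(*point_a, *point_b)
    return max(0.0, 1.0 - distance / max_distance_km)

berlin = (52.52, 13.405)
paris = (48.8566, 2.3522)
far_apart = geo_similarity(berlin, paris, max_distance_km=10.0)
identical = geo_similarity(berlin, berlin, max_distance_km=10.0)
```

Date and number comparisons can follow the same pattern: compute an absolute difference and scale it by a maximum tolerated difference.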


Aggregating Similarity Values
 In order to determine if two entities are duplicates it is
usually not sufficient to compare a single property
 Aggregation Functions aggregate the similarity of multiple
comparisons
 Example: Interlinking geographical datasets
● Compare by label and geographic coordinates
● Aggregate similarity values


Popular Aggregation Functions
 Minimum
● Choose the lowest value
● ⇒ All values must exceed the threshold
 Maximum
● Choose the highest value
● ⇒ At least one value must exceed the threshold
 Weighted Average
● Assign a weight to each comparison
● Compute the weighted mean
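
The three aggregation functions are short enough to write out directly. In this sketch the similarity values, the weights, and the threshold of 0.7 are made-up numbers for the geographic example (label similarity and coordinate similarity).

```python
def minimum(similarities):
    """All comparisons must exceed the threshold for the result to exceed it."""
    return min(similarities)

def maximum(similarities):
    """At least one comparison must exceed the threshold."""
    return max(similarities)

def weighted_average(similarities, weights):
    """Weighted mean of the similarity values."""
    return sum(s * w for s, w in zip(similarities, weights)) / sum(weights)

label_sim, geo_sim = 0.9, 0.5  # hypothetical comparison results
threshold = 0.7                # hypothetical link threshold

link_by_min = minimum([label_sim, geo_sim]) > threshold
link_by_max = maximum([label_sim, geo_sim]) > threshold
link_by_avg = weighted_average([label_sim, geo_sim], weights=[2, 1]) > threshold
```

With these numbers the minimum rejects the pair, while the maximum and the label-heavy weighted average accept it, which shows how much the choice of aggregation affects the generated links.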


Putting it all together


Example
 Interlink cities in different data sources:


Evaluating Linkage Rules
 Gold standard in the form of reference links
● Positive links (definitive matches)
● Negative links (definitive non-matches)
 Based on the reference links, we can determine the number
of correct and incorrect matches
 We distinguish between 4 cases:

                        Positive link     Negative link
match(a,b) = link       True positive     False positive
match(a,b) = nonlink    False negative    True negative
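
Counting the four cases is a matter of set operations when links are represented as pairs of identifiers. The representation and the sample identifiers below are assumptions for illustration.

```python
def confusion_counts(generated_links, positive_refs, negative_refs):
    """Count the four cases for a set of generated links against reference links.

    All arguments are sets of (source_id, target_id) pairs.
    """
    return {
        "true_positives": len(positive_refs & generated_links),
        "false_positives": len(negative_refs & generated_links),
        "false_negatives": len(positive_refs - generated_links),
        "true_negatives": len(negative_refs - generated_links),
    }

# Hypothetical reference links and rule output
positive_refs = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
negative_refs = {("a1", "b2"), ("a2", "b3")}
generated = {("a1", "b1"), ("a2", "b2"), ("a1", "b2")}

counts = confusion_counts(generated, positive_refs, negative_refs)
```

Generated links that appear in neither reference set are ignored here, since the gold standard says nothing about them.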


Evaluating Linkage Rules
 Recall: Ratio of correct links compared to all known links
$$\mathrm{recall} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false negatives}|}$$

 Precision: Ratio of correct links compared to all found links
$$\mathrm{precision} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false positives}|}$$

 F-measure: Harmonic mean of precision and recall
$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
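
A short sketch of the three measures, computed from the counts of true positives, false positives, and false negatives. Recall divides by all known positive links (found or missed), precision by all links the rule produced. Returning 0.0 when a denominator is zero is a convention assumed here, and the counts in the example are made up.

```python
def precision(tp, fp):
    """Correct links among all found links."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Correct links among all known (reference) links."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical result: 8 correct links, 2 wrong links, 4 missed links
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=4)
f = f_measure(tp=8, fp=2, fn=4)
```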


Recall-Precision diagram
 A recall-precision diagram visualizes the trade-off between
maximizing the recall and maximizing the precision

From: Creating probabilistic databases from duplicated data, Oktie Hassanzadeh · Renée J. Miller (VLDBJ)

Silk Workbench

 Silk Workbench offers a GUI for:
● Managing different data sources and linkage rules
● Creating linkage rules
● Executing linkage rules
● Evaluating linkage rules
● Learning Linkage Rules


Workspace
The Workspace holds a set of projects
consisting of:

 Data Sources
● Holds all information that is needed
by Silk to retrieve entities from it.
● Usually a file dump or a SPARQL
endpoint
 Linking Tasks
● Interlinks a type of entity between
two data sources
● e.g. interlinking movies in DBpedia
and LinkedMDB


Linkage Rule Editor
 Allows viewing and editing linkage rules
 Linkage Rules are shown as a tree
 Editing using drag & drop.


Generating Links


Managing Reference Links


Conclusion
 In order to publish a new data set or to consume an existing
dataset we need to generate links
 A linkage rule specifies the conditions which must hold true
for two entities in order to be considered the same real-
world object.
 The Silk Workbench provides a graphical user interface to
create and edit linking tasks
 The hands-on session will cover a simple example interlinking
musical artists in Freebase and DBpedia


Q&A

