Slides @
www.jakequist.com/go/dataengconf
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
Entity Resolution
Talk Structure
Layer 1: Naive ER
Layer 2: Graphical ER
Layer 3: Big Data ER
Layer 4: Temporal ER
Layer 5: Learned ER
Naive ER
Entity Resolution
Suppose we have the following data:
ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
D   Joes Cookies   facebook.com     San Francisco, CA
E   Joes Cookies   NULL             New York, NY
Fundamental Concept
Match entities on the similarity of
their properties
Example: Company
Similarity
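The slide's own similarity function did not survive extraction. As a minimal sketch of the idea (match on the similarity of properties), the following combines per-property similarities with weights. The weights, the `field_similarity` helper, and the use of `difflib` are all assumptions for illustration, not the speaker's scoring.

```python
from difflib import SequenceMatcher

# Hypothetical weights; the talk does not give actual values.
WEIGHTS = {"name": 0.5, "website": 0.3, "geo": 0.2}

def field_similarity(a, b):
    """Return a 0..1 similarity; a missing value contributes nothing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def company_similarity(x, y):
    """Weighted sum of per-property similarities."""
    return sum(w * field_similarity(x.get(f), y.get(f)) for f, w in WEIGHTS.items())

companies = {
    "A": {"name": "Facebook", "website": "facebook.com", "geo": "Menlo Park, CA"},
    "B": {"name": "FB", "website": "facebook.com", "geo": "CA"},
    "E": {"name": "Joes Cookies", "website": None, "geo": "New York, NY"},
}
score_ab = company_similarity(companies["A"], companies["B"])
score_ae = company_similarity(companies["A"], companies["E"])
```

Comparing every pair this way is what makes the naive approach quadratic in the number of entities.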
Problems
• What if a match involves more than two entities (arity != 2)?
• An entity must not appear in more than one match
• Comparing every pair is O(N^2), which scales poorly
Graphical ER
Think Like a Graph
[Figure: entities A, B, C, D, E drawn as graph nodes, next to the data table for A-E]
Think Like a Graph
[Figure: graph on A, B, C, D, E with weighted edges. Weights shown: 150, 50, -100, -100, 50, 50, 50, 50, -150, -150]
Key Concept: Cliques
Think Like a Clique
[Figure: the weighted graph on A, B, C, D, E]
possible cliques => every non-empty subset of {A, B, C, D, E} (2^5 - 1 = 31):
size 1: {A} {B} {C} {D} {E}
size 2: {E, A} {E, B} {E, C} {E, D} {A, B} {A, C} {A, D} {B, C} {B, D} {C, D}
size 3: {E, A, B} {E, A, C} {E, A, D} {E, B, C} {E, B, D} {E, C, D} {A, B, C} {A, B, D} {A, C, D} {B, C, D}
size 4: {E, A, B, C} {E, A, B, D} {E, A, C, D} {E, B, C, D} {A, B, C, D}
size 5: {E, A, B, C, D}
Recurring Theme:
Powerset
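A short sketch of the powerset step, assuming entities are identified by simple labels. The function name `candidate_cliques` is hypothetical.

```python
from itertools import combinations

def candidate_cliques(entities):
    """Yield every non-empty subset of `entities`, smallest subsets first."""
    for size in range(1, len(entities) + 1):
        for subset in combinations(entities, size):
            yield frozenset(subset)

cliques = list(candidate_cliques(["A", "B", "C", "D", "E"]))
```

For n entities this produces 2^n - 1 candidates, so the count doubles with every added entity.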
Scoring Cliques
from above
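The scoring formula on this slide was an image and is not recoverable. One plausible reading, given that the slide refers back to the weighted graph, is to sum the weights of the edges inside a clique. The sketch below assumes exactly that, and the weights in `EDGE_WEIGHTS` are hypothetical placeholders rather than the figure's actual edge assignment.

```python
from itertools import combinations

# Hypothetical pairwise weights, keyed by alphabetically sorted entity pairs.
EDGE_WEIGHTS = {
    ("A", "B"): 150,
    ("C", "D"): 50,
    ("A", "C"): -100,
}

def clique_score(clique, weights=EDGE_WEIGHTS):
    """Sum the weights of all edges inside the clique; unknown pairs count as 0."""
    return sum(weights.get(pair, 0) for pair in combinations(sorted(clique), 2))

score_ab = clique_score({"A", "B"})
score_abc = clique_score({"A", "B", "C"})
```

Under this assumption a single strongly negative edge is enough to pull down an otherwise attractive clique.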
Overlapping Cliques
[Figure: two overlapping cliques drawn on the graph of A, B, C, D, E, labeled A = 0.75 and B = 0.55]
Overlapping Cliques
An entity can’t belong to more
than one clique.
When we choose a clique, we
must ensure no other cliques
use any of those entities
Clique Choosing
Recap
• Given a dataset of entities…
• Take the powerset of those entities => every
possible clique
• Score all the cliques
• In descending score order, choose each clique whose entities
have not already been used by a chosen clique
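The recap can be sketched as a greedy selection. The scores below are made up for illustration, and `choose_cliques` is a hypothetical name.

```python
def choose_cliques(scored_cliques):
    """Greedily pick disjoint cliques, highest score first.

    `scored_cliques` is an iterable of (score, clique) pairs.
    """
    touched = set()   # entities already claimed by a chosen clique
    chosen = []
    for score, clique in sorted(scored_cliques, key=lambda sc: sc[0], reverse=True):
        if touched.isdisjoint(clique):
            chosen.append(clique)
            touched.update(clique)
    return chosen

scored = [
    (0.75, frozenset({"A", "B"})),
    (0.55, frozenset({"B", "C", "D"})),   # rejected: B is already taken
    (0.40, frozenset({"C", "D"})),
    (0.10, frozenset({"E"})),
]
result = choose_cliques(scored)
```

A greedy pass like this is simple and serial; it is not guaranteed to find the selection with the highest total score.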
ER on Bigger Data
Challenges
• Get potential matches on the same machine
• Avoid using powerset(n) for large n
Locality-Sensitive Hashing
(LSH)
Basic Idea: Use MapReduce to get likely matches onto the
same machines
“Johnathon” => “John”
“Sequoia Capital, LLC” => “Sequoia”
[37.773972, -122.431297] => [37.73, -122.43]
“app.example.com” => “example.com”
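A rough sketch of the bucketing idea: map each value to a coarse key, then group records that share a key (the grouping plays the role of the shuffle that lands likely matches on the same machine). The key functions here are crude stand-ins invented for illustration; they are not the talk's hash functions and are not a true LSH family such as MinHash.

```python
import re
from collections import defaultdict

def name_key(name):
    """Coarse key: first alphanumeric token, lowercased (hypothetical rule)."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return tokens[0] if tokens else ""

def domain_key(host):
    """Coarse key: the last two labels of a hostname."""
    return ".".join(host.lower().split(".")[-2:])

def geo_key(lat, lon, places=2):
    """Coarse key: coordinates rounded to a grid cell."""
    return (round(lat, places), round(lon, places))

def bucket(records, key_fn):
    """Group record ids by key, like a map (emit key) followed by a reduce (group)."""
    buckets = defaultdict(list)
    for record_id, value in records:
        buckets[key_fn(value)].append(record_id)
    return dict(buckets)

names = [(1, "Sequoia Capital, LLC"), (2, "Sequoia"), (3, "Joe's Cookies")]
name_buckets = bucket(names, name_key)
```

Only records that land in the same bucket need to be compared, which avoids the all-pairs comparison.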
Locality-Sensitive Hashing
Problems
• What if our entities have missing properties?
Locality-Sensitive Hashing
[Figure: entities A, B, C. Names shown: “Joe’s Cookies”, “Joe’s Cookie’s”. Websites shown: joescookies.com, joescookies.com]
LSH on “name”: A => “Joe Cookie”, B => “Joe Cookie”, C => “”
Multilevel LSH
• Basic Idea: Use LSH multiple times on converging
cliques
[Figure: entities A, B, C. Names shown: “Joe’s Cookies”, “Joe’s Cookie’s”. Websites shown: joescookies.com, joescookies.com]
LSH on “name”: A => “Joe Cookie”, B => “Joe Cookie”, C => “”
LSH on “website”: “joescookies.com”, “joescookies.com”
Resulting cliques: Clique #1, Clique #2, Clique #3
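One way to read "LSH multiple times on converging cliques" is to bucket on each property in turn and merge buckets that share an entity. The sketch below assumes that reading. It also assumes a particular assignment of names and websites to A, B and C (the figure's layout is not recoverable), and the key function `name_key` is hypothetical.

```python
from collections import defaultdict

def name_key(name):
    """Hypothetical key: drop apostrophes and trailing 's' from each token."""
    tokens = name.lower().replace("'", "").replace("\u2019", "").split()
    return " ".join(t.rstrip("s") for t in tokens)

def buckets_for(entities, prop, key_fn):
    """Group entity ids by the key of one property, skipping missing values."""
    groups = defaultdict(set)
    for entity_id, props in entities.items():
        value = props.get(prop)
        if value:
            groups[key_fn(value)].add(entity_id)
    return [g for g in groups.values() if len(g) > 1]

def merge_overlapping(groups):
    """Union groups that share an entity until all groups are disjoint."""
    merged = []
    for group in groups:
        group = set(group)
        remaining = []
        for other in merged:
            if group & other:
                group |= other
            else:
                remaining.append(other)
        merged = remaining + [group]
    return merged

# Assumed layout: A has only a name, C has only a website, B has both.
entities = {
    "A": {"name": "Joe's Cookies"},
    "B": {"name": "Joe's Cookie's", "website": "joescookies.com"},
    "C": {"website": "joescookies.com"},
}
by_name = buckets_for(entities, "name", name_key)
by_website = buckets_for(entities, "website", str.lower)
candidates = merge_overlapping(by_name + by_website)
```

Here C never shares a name bucket with anyone, yet it still reaches A through B once the website pass is merged in.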
Clique Choosing
• We now have all potential cliques, spread across
the cluster
• We now need to choose the best cliques
• Remember: choosing one clique invalidates every other
clique that shares an entity with it
• Fundamentally a serial algorithm!
Clique Choosing
RDD[T].toLocalIterator() : Iterator[T]
• Produces an iterator on the Driver that seamlessly
iterates every partition
Clique Choosing
uh oh
Challenge
• We need to keep track of which entities we’ve
“touched”
• But using a HashSet means we will start eating a lot of
memory
Primer: Bloom Filters
BloomFilter {
def mightContain(T obj)
def put(T obj)
}
example: 1 MB @ 0.5% error => 130 KB
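For reference, the standard Bloom filter sizing formulas are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n items at false-positive rate p. The sketch below evaluates them; the item count of 100,000 is an assumption chosen for illustration and is not taken from the slide.

```python
import math

def bloom_size(n_items, error_rate):
    """Return (bits, hash_count) from the standard Bloom filter sizing formulas."""
    bits = math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes

bits, hashes = bloom_size(100_000, 0.005)
kilobytes = bits / 8 / 1000
```

At a 0.5% error rate this works out to about 11 bits per item regardless of how large each item is, which is where the memory saving over a HashSet comes from.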
Clique Choosing w/ Bloom Filters
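The code on these slides was not captured. A minimal sketch of the idea, assuming cliques arrive already sorted best-first from a single iterator: the "touched" set is replaced by a small Bloom filter. The class below is a toy implementation built on `hashlib`, with method names adapted from the `mightContain` / `put` interface above; it is not the library the talk used.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: fixed bit array, k hash positions derived from SHA-256."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, obj):
        data = repr(obj).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def put(self, obj):
        for pos in self._positions(obj):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, obj):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(obj))

def choose_cliques(sorted_cliques, touched):
    """Stream cliques (best first); keep those with no possibly-touched entity."""
    for clique in sorted_cliques:
        if not any(touched.might_contain(e) for e in clique):
            for e in clique:
                touched.put(e)
            yield clique

stream = iter([("A", "B"), ("B", "C"), ("C", "D"), ("E",)])
chosen = list(choose_cliques(stream, BloomFilter(num_bits=8192, num_hashes=5)))
```

A Bloom filter has no false negatives, so an entity is never assigned to two cliques. A false positive only means that a valid clique is occasionally skipped, which is the price paid for the smaller memory footprint.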
Recap
• Challenge: Get data to the right machine.
Solution: Use Locality-Sensitive Hashing
• Challenge: Choose the best cliques.
Solution: Use a serial iterator and Bloom filters to
keep memory low
Temporal ER
Temporal Entity Resolution
T1               T2
Ms Sally Smith   Mrs Sally Doe
thefacebook.com  facebook.com
Zen Payroll      Gusto
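Temporal Entity Resolution
[Figure: A (Zen Payroll, zenpayroll.com) and B (Gusto, gusto.com) joined by an edge of weight -1000]
Temporal Entity Resolution
[Figure: A (Zen Payroll, zenpayroll.com) and B (Gusto, gusto.com) joined by an edge of weight -1000, plus C (Zen Payroll <=> Gusto, zenpayroll.com <=> gusto.com) joined to A and to B by edges of weight +100]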
Iterative Poison Pills
• Basic Idea: Use ER techniques we’ve already
established
• Introduce “poison pills” that can break up cliques if
temporal properties don’t match
• Iteratively use the poison pills to match on
increasingly temporally-aware entities
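A heavily simplified sketch of the "kick out" step, under stated assumptions: each entity carries one domain and one year, and a temporal rule gives the years in which a domain is considered to belong to the entity being resolved. The function name, the data layout, and the year windows are all hypothetical; the talk does not specify how poison pills are represented.

```python
def apply_poison_pills(clique, entities, valid_range):
    """Split a clique into members that satisfy the temporal window and those kicked out.

    `valid_range` maps a domain to an inclusive (start_year, end_year) window;
    None on either side means unbounded.
    """
    kept, kicked = set(), set()
    for entity_id in clique:
        domain, year = entities[entity_id]
        start, end = valid_range.get(domain, (None, None))
        ok = (start is None or year >= start) and (end is None or year <= end)
        (kept if ok else kicked).add(entity_id)
    return kept, kicked

# Hypothetical observations: (domain, year seen).
entities = {
    "A": ("gusto.com", 2010),
    "B": ("gusto.com", 2016),
    "C": ("zenpayroll.com", 2014),
}
# Hypothetical rule for the payroll company: zenpayroll.com up to 2015, gusto.com from 2015.
payroll_windows = {"gusto.com": (2015, None), "zenpayroll.com": (None, 2015)}
kept, kicked = apply_poison_pills({"A", "B", "C"}, entities, payroll_windows)
```

The kicked-out entities would go back into the pool for the next round of regular ER, which is what makes the process iterative.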
[Figure: Temporal Poison Pills, worked example on entities A, B, C, D, E]
Entity labels shown: gusto.com (Travel) 2010; gusto.com (Payroll) 2016; zenpayroll.com (Payroll) 2014; gusto.com < 2015; gusto.com, zenpayroll.com > 2015
Step 1: Perform Regular ER => {A, C, D, E} {B, E}
Step 2: Kick Out Entities That Don’t Match Temporal Requirements => {A, D}: gusto.com < 2015; {B, E}: gusto.com > 2015, zenpayroll < 2014; {C, E}: gusto, 2016
Step 3: Perform Regular ER (now with more temporal fields available) => {A, C, D} {B, C, E}
Temporal Entity Resolution
• Very computationally expensive
• Requires significant tuning and tweaking to keep
tractable
• Considered one of the Holy Grails of ER
Learned ER
Recap
• Gorilla in the room: All of our scoring has been
manual
Supervised Learning ER
• Basic Idea: Use a training set to learn the weights
in our scoring functions
• Disclaimer: Only proceed with this if you have very
complex scoring properties
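As a minimal sketch of learning the weights from a training set, the following fits a logistic regression with plain batch gradient descent. The feature layout (name, website and geo similarity), the tiny training set, and the hyperparameters are assumptions for illustration; the talk's actual model is not shown in the extracted slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_weights(examples, epochs=500, lr=0.5):
    """Fit logistic-regression weights with batch gradient descent.

    `examples` is a list of (features, label): features are per-property
    similarities in 0..1, label is 1 for a true match and 0 otherwise.
    """
    n = len(examples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for features, label in examples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label
            grad_b += error
            for i, x in enumerate(features):
                grad_w[i] += error * x
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(examples)
    return weights, bias

def match_probability(features, weights, bias):
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

# Hypothetical labeled pairs: [name_sim, website_sim, geo_sim] -> is_match
training = [
    ([0.9, 1.0, 0.8], 1),
    ([0.4, 1.0, 0.3], 1),
    ([0.8, 0.0, 0.9], 0),
    ([0.1, 0.0, 0.2], 0),
]
weights, bias = train_weights(training)
```

In this toy data the website similarity separates matches from non-matches, so its learned weight ends up positive; with real data the learned weights replace the hand-tuned ones in the scoring function.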
More Learning Opts
• Gradient Descent: What if we viewed the system
as having an overall “error”? We can then use
Gradient Descent to find an optimal solution.
• Very computationally intensive
Questions?
Thanks!
jakequist@gmail.com
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
