Efficient Duplicate Detection Over Massive Data Sets
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 4.
April 21, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Duplicate Detection 1 / 21
Dedoop: Efficient Deduplication with Hadoop
Introduction
Blocking
Grouping of entities that are “somehow similar”.
Comparisons restricted to entities from the same block.
Entity Resolution (ER; also called object matching or deduplication)
Costly.
Traditional blocking approaches are not effective.
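To make the cost argument concrete, the following is a minimal arithmetic sketch of how restricting comparisons to blocks changes the number of pairs. The data set size and the perfectly even block sizes are assumptions chosen for illustration, not figures from the papers.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n entities: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Assumed workload: one million entities, split evenly into 1,000 blocks.
n_entities = 1_000_000
block_sizes = [1_000] * 1_000

without_blocking = pair_count(n_entities)
with_blocking = sum(pair_count(size) for size in block_sizes)

print(without_blocking)  # 499999500000
print(with_blocking)     # 499500000
```

Real blocks are rarely this even; a few very large blocks can still dominate the total, which is one reason the remaining work is worth parallelizing.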
Dedoop: Efficient Deduplication with Hadoop
Motivation
Advantages of leveraging parallel and cloud environments.
Manual tuning of ER parameters is facilitated as ER results can be
quickly generated and evaluated.
Lower execution times for large data sets ⇒ faster common data
management processes.
Dedoop: Efficient Deduplication with Hadoop
Dedoop
http://dbs.uni-leipzig.de/dedoop
MapReduce-based entity resolution of large datasets.
Pair-wise similarity computation [O(n²)] executed in parallel.
Automatic transformation:
Workflow definition ⇒ Executable MapReduce workflow.
Avoids redundant entity-pair comparisons
that result from using multiple blocking keys.
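When an entity has several blocking keys, the same pair can meet in more than one block. One common way to compare such a pair only once is to let a block handle the pair only if its key is the smallest key the two entities share. The sketch below illustrates that idea; the function name `compare_once` and the toy keys are hypothetical, and this is not claimed to be Dedoop's exact mechanism.

```python
from collections import defaultdict
from itertools import combinations


def compare_once(entity_keys):
    """entity_keys maps an entity id to its set of blocking keys.

    Returns (pairs met across all blocks, pairs actually emitted once).
    """
    blocks = defaultdict(list)
    for entity_id, keys in entity_keys.items():
        for key in keys:
            blocks[key].append(entity_id)

    met = 0
    emitted = []
    for key in sorted(blocks):
        for a, b in combinations(sorted(blocks[key]), 2):
            met += 1
            common = entity_keys[a] & entity_keys[b]
            # Only the block of the smallest shared key handles this pair.
            if key == min(common):
                emitted.append((a, b))
    return met, emitted


# Toy input: e1 and e2 share two keys, so they meet in two blocks.
entity_keys = {
    "e1": {"sam", "gal"},
    "e2": {"sam", "gal"},
    "e3": {"gal"},
}
met, emitted = compare_once(entity_keys)
print(met, emitted)  # 4 [('e1', 'e2'), ('e1', 'e3'), ('e2', 'e3')]
```

The check needs only the key sets of the two entities, so each block can decide locally without coordinating with other blocks.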
Dedoop: Efficient Deduplication with Hadoop
Features
Several load balancing strategies
In combination with its blocking techniques.
To achieve balanced workloads across all employed nodes of the cluster.
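The slides do not detail Dedoop's load balancing strategies, so the following is only a generic sketch of the underlying problem: blocks differ in size, and the work per block grows quadratically. It assigns whole blocks greedily (largest first) to the currently least loaded reducer. The function names and block sizes are assumptions for illustration.

```python
import heapq


def comparisons(size: int) -> int:
    """Pairwise comparisons needed inside a block of the given size."""
    return size * (size - 1) // 2


def assign_blocks(block_sizes, num_reducers):
    """Greedy assignment: largest block first, to the least loaded reducer."""
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    by_cost = sorted(block_sizes.items(),
                     key=lambda kv: comparisons(kv[1]), reverse=True)
    for key, size in by_cost:
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + comparisons(size), reducer))
    loads = [0] * num_reducers
    for load, reducer in heap:
        loads[reducer] = load
    return assignment, loads


# Assumed block sizes (number of entities per blocking key).
block_sizes = {"a": 100, "b": 60, "c": 50, "d": 40, "e": 10}
assignment, loads = assign_blocks(block_sizes, num_reducers=2)
print(loads)  # [4950, 3820]
```

Even in this small example the single largest block accounts for more than half of all comparisons, so assigning whole blocks cannot fully balance the load; that limitation is what motivates strategies that split the work of large blocks across nodes.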
Dedoop: Efficient Deduplication with Hadoop
User Interface
Users can easily specify advanced ER workflows in a web browser.
Choose from a rich toolset of common ER components.
Blocking techniques.
Similarity functions.
Machine learning for automatically building match classifiers.
Visualization of the ER results and the workload of all cluster nodes.
Dedoop: Efficient Deduplication with Hadoop
Solution Architecture
Map determines blocking keys for each entity and outputs (blockkey,
entity) pairs.
Reduce compares entities that belong to the same block.
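The map and reduce roles described above can be simulated in a single process. This is a minimal sketch, not Dedoop code: the blocking key (first three letters of the title), the similarity function (`difflib.SequenceMatcher`) and the 0.8 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def map_phase(entities):
    """Map: emit (blocking key, entity) for every entity."""
    for entity_id, title in entities:
        yield title[:3].lower(), (entity_id, title)


def shuffle(pairs):
    """Group map output by key, as the framework would before reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(block, threshold=0.8):
    """Reduce: compare only entities that share a block."""
    for (id_a, a), (id_b, b) in combinations(block, 2):
        similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if similarity >= threshold:
            yield id_a, id_b


# Toy entities (id, title).
entities = [
    (1, "Samsung Galaxy S5"),
    (2, "samsung galaxy s 5"),
    (3, "Sony Xperia Z2"),
    (4, "Nokia Lumia 930"),
]
blocks = shuffle(map_phase(entities))
matches = [m for block in blocks.values() for m in reduce_phase(block)]
print(matches)  # [(1, 2)]
```

On a real cluster each block would be processed by some reducer task, which is why skewed block sizes translate directly into skewed reducer workloads.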
MapDupReducer: Detecting Near Duplicates ..
Near Duplicate Detection (NDD)
Multi-Processor Systems are more effective.
MapReduce Platform.
Ease of use.
High Efficiency.
MapDupReducer: Detecting Near Duplicates ..
System Architecture
Non-trivial generalization of the PPJoin algorithm into the
MapReduce framework.
Redesigning the position and prefix filtering.
Document signature filtering to further reduce the candidate size.
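As background for the filtering steps above, here is a single-machine sketch of plain prefix filtering for a Jaccard threshold t: tokens are sorted by a global order (rare tokens first), and two records can only reach the threshold if their prefixes of length |x| - ceil(t·|x|) + 1 share a token. It does not include positional filtering, document signature filtering, or the MapReduce redesign; record contents and function names are assumptions.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def prefix_length(size: int, t: float) -> int:
    """Prefix length that is safe for Jaccard threshold t."""
    return size - math.ceil(t * size) + 1


def prefix_filter_join(records, t):
    """records maps a record id to its token list. Returns (candidates, results)."""
    freq = defaultdict(int)
    for tokens in records.values():
        for tok in set(tokens):
            freq[tok] += 1
    # Global order: ascending document frequency, ties broken alphabetically.
    ordered = {rid: sorted(set(tokens), key=lambda tok: (freq[tok], tok))
               for rid, tokens in records.items()}

    index = defaultdict(set)  # prefix token -> ids of records seen so far
    candidates = set()
    for rid in sorted(ordered):
        tokens = ordered[rid]
        for tok in tokens[:prefix_length(len(tokens), t)]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].add(rid)
    results = sorted(p for p in candidates
                     if jaccard(records[p[0]], records[p[1]]) >= t)
    return candidates, results


records = {
    "r1": "efficient parallel set similarity joins".split(),
    "r2": "efficient parallel set similarity joins mapreduce".split(),
    "r3": "near duplicate detection massive datasets".split(),
}
candidates, results = prefix_filter_join(records, t=0.75)
print(candidates, results)  # {('r1', 'r2')} [('r1', 'r2')]
```

Only one of the three possible pairs becomes a candidate here; the pairs involving r3 are pruned without computing their similarity.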
MapDupReducer: Detecting Near Duplicates ..
Evaluation
Data sets.
MEDLINE documents.
Finding plagiarized documents.
18.5 million records.
BING.
Web pages with an aggregated size of 2TB.
Hotspot.
High update frequency.
Altering the arguments.
Different number of map() and reduce() params.
Efficient Similarity Joins for Near Duplicate Detection
Similarity Definitions
Efficient Similarity Joins for Near Duplicate Detection
Efficient Similarity Join Algorithms
Efficient similarity join algorithms by exploiting the ordering of tokens
in the records.
Positional filtering and suffix filtering are complementary to the
existing prefix filtering technique.
Commonly used strategy depends on the size of the document.
Text documents: Edit distance and Jaccard similarity.
Edit distance: Minimum number of edits required to transform one
string to another.
An insertion, deletion, or substitution of a single character.
Web documents: Jaccard or overlap similarity on small or fixed-size
sketches.
The near-duplicate object detection problem is a generalization of the
well-known nearest neighbor problem.
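The similarity measures named above can be written down in a few lines. This is a minimal sketch using the standard definitions: edit distance as the classic dynamic program over single-character insertions, deletions and substitutions, and Jaccard and overlap similarity on token sets. Treating two empty sets as identical is an assumption made here to avoid a division by zero.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]


def overlap(a: set, b: set) -> int:
    """Overlap similarity: number of shared tokens."""
    return len(a & b)


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


x = {"near", "duplicate", "detection"}
y = {"duplicate", "detection", "hadoop"}
print(edit_distance("kitten", "sitting"))  # 3
print(overlap(x, y), jaccard(x, y))        # 2 0.5
```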
Efficient Parallel Set-Similarity Joins Using MapReduce
Introduction
Efficiently perform set-similarity joins in parallel using the popular
MapReduce framework.
A 3-stage approach for end-to-end set-similarity joins.
Efficiently partition the data across nodes.
Balance the workload.
Reduce the need for replication.
Efficient Parallel Set-Similarity Joins Using MapReduce
MapReduce
Efficient Parallel Set-Similarity Joins Using MapReduce
Parallel Set-Similarity Joins Stages
1 Token Ordering:
Computes data statistics in order to generate good signatures.
The techniques in later stages utilize these statistics.
2 RID-Pair Generation:
Extracts the record IDs (“RID”) and the join-attribute value from
each record.
Distributes the RID and the join-attribute value pairs.
The pairs sharing a signature go to at least one common reducer.
Reducers compute the similarity of the join-attribute values and output
RID pairs of similar records.
3 Record Join:
Generates actual pairs of joined records.
It uses the list of RID pairs from the second stage and the original data
to build the pairs of similar records.
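The three stages can be sketched as three functions in a single process. This is an illustration under stated assumptions, not the authors' implementation: signatures are taken to be the prefix tokens of each record under the stage-1 ordering, similarity is Jaccard with threshold t, each signature group stands in for one reducer's input, and the records and function names are made up for the example.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def token_ordering(records):
    """Stage 1: rank tokens by ascending frequency (rare tokens first)."""
    counts = Counter(tok for _, value in records.values()
                     for tok in set(value.split()))
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return {tok: rank for rank, (tok, _) in enumerate(ranked)}


def rid_pair_generation(records, rank, t):
    """Stage 2: route (RID, join value) by signature, verify per group."""
    groups = defaultdict(list)  # signature -> [(rid, token set)]
    for rid, (_, value) in records.items():
        tokens = sorted(set(value.split()), key=rank.__getitem__)
        prefix = len(tokens) - math.ceil(t * len(tokens)) + 1
        for signature in tokens[:prefix]:
            groups[signature].append((rid, set(tokens)))
    rid_pairs = set()
    for members in groups.values():
        for (ra, a), (rb, b) in combinations(members, 2):
            if len(a & b) / len(a | b) >= t:
                rid_pairs.add((min(ra, rb), max(ra, rb)))
    return sorted(rid_pairs)


def record_join(records, rid_pairs):
    """Stage 3: turn RID pairs back into pairs of full records."""
    return [(records[a], records[b]) for a, b in rid_pairs]


# Toy records: rid -> (other attribute, join-attribute value).
records = {
    1: ("source-a", "dedoop efficient deduplication with hadoop"),
    2: ("source-b", "dedoop efficient deduplication hadoop"),
    3: ("source-c", "efficient parallel set similarity joins using mapreduce"),
}
rank = token_ordering(records)
rid_pairs = rid_pair_generation(records, rank, t=0.75)
joined = record_join(records, rid_pairs)
print(rid_pairs)  # [(1, 2)]
```

Stage 2 moves only RIDs and join-attribute values, which is why a separate third stage is needed to bring the remaining attributes of each matched record back in.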
Efficient Parallel Set-Similarity Joins Using MapReduce
Token Ordering
Efficient Parallel Set-Similarity Joins Using MapReduce
Handling Insufficient Memory
Efficient Parallel Set-Similarity Joins Using MapReduce
Speedup
Efficient Parallel Set-Similarity Joins Using MapReduce
Scalability
Conclusion
Conclusion
MapReduce frameworks offer an effective platform for near duplicate
detection.
Distributed execution frameworks can be leveraged for scalable data
cleaning.
Efficient partitioning for data that cannot fit in the main memory.
Software-Defined Networking and later advances in networking can
lead to better data solutions.
Conclusion
References
Kolb, L., Thor, A., & Rahm, E. (2012). Dedoop: efficient deduplication
with Hadoop. Proceedings of the VLDB Endowment, 5(12), 1878-1881.
Vernica, R., Carey, M. J., & Li, C. (2010, June). Efficient parallel
set-similarity joins using MapReduce. In Proceedings of the 2010 ACM
SIGMOD International Conference on Management of data (pp. 495-506).
ACM.
Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., ... & Li, R.
(2010, June). MapDupReducer: detecting near duplicates over massive
datasets. In Proceedings of the 2010 ACM SIGMOD International
Conference on Management of data (pp. 1119-1122). ACM.
Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient
similarity joins for near-duplicate detection. ACM Transactions on Database
Systems (TODS), 36(3), 15.
Thank you!
