Efficient Duplicate Detection Over Massive Data Sets

Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 4.
April 21, 2015.

Dedoop: Efficient Deduplication with Hadoop
Introduction
Blocking
Grouping of entities that are “somehow similar”.
Comparisons restricted to entities from the same block.
Entity Resolution (ER, Object matching, deduplication)
Costly.
Traditional blocking approaches not effective.

Motivation
Advantages of leveraging parallel and cloud environments.
Manual tuning of ER parameters is facilitated as ER results can be
quickly generated and evaluated.
Lower execution times for large data sets, which speeds up common data
management processes.

Dedoop
http://dbs.uni-leipzig.de/dedoop
MapReduce-based entity resolution of large datasets.
Pair-wise similarity computation, O(n^2), executed in parallel.
Automatic transformation:
Workflow definition ⇒ Executable MapReduce workflow.
Avoid unnecessary entity pair comparisons
That result from the utilization of multiple blocking keys.
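
The redundancy problem with multiple blocking keys can be sketched in a few lines of Python. This is an illustration only: the rule used here (a pair is compared only in the block of its smallest shared key) is one common way to avoid repeated comparisons, and whether Dedoop implements exactly this rule is an assumption. The entity IDs, keys, and the helper name compare_in_block are made up for the example.

```python
from itertools import combinations

def compare_in_block(block_key, members):
    """Yield the pairs this block is responsible for.

    members: list of (entity_id, frozenset of ALL blocking keys of that entity).
    A pair that shares several keys appears in several blocks; only the block
    whose key is the smallest shared key emits it (assumed rule, see lead-in).
    """
    for (id_a, keys_a), (id_b, keys_b) in combinations(members, 2):
        if min(keys_a & keys_b) == block_key:
            yield id_a, id_b

# Hypothetical entities, each with two blocking keys.
entity_keys = {
    1: frozenset({"can", "5d"}),
    2: frozenset({"can", "5d"}),
    3: frozenset({"nik", "5d"}),
}

# Build the blocks: one block per key, holding every entity with that key.
blocks = {}
for entity_id, keys in entity_keys.items():
    for key in keys:
        blocks.setdefault(key, []).append((entity_id, keys))

# Comparisons without the rule: every pair in every block.
naive = sum(len(m) * (len(m) - 1) // 2 for m in blocks.values())

# Comparisons with the rule: each pair is emitted by exactly one block.
pairs = sorted(p for key, m in blocks.items() for p in compare_in_block(key, m))
```

In this toy input the naive scheme performs 4 comparisons because entities 1 and 2 meet in two blocks, while the rule leaves the 3 distinct pairs.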

Features
Several load balancing strategies
In combination with its blocking techniques.
To achieve balanced workloads across all employed nodes of the cluster.
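
To make the load balancing problem concrete, here is a minimal sketch that assigns whole blocks to reducers greedily, largest block first, using the number of pair comparisons as the cost. It is not one of Dedoop's own strategies; the function names, block sizes, and reducer count are assumptions chosen for illustration.

```python
import heapq

def pair_count(block_size):
    # Number of pairwise comparisons inside one block.
    return block_size * (block_size - 1) // 2

def assign_blocks(block_sizes, num_reducers):
    """Greedy assignment: give the next-largest block to the least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (assigned pairs, reducer index)
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    for key, size in sorted(block_sizes.items(), key=lambda kv: -kv[1]):
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(key)
        heapq.heappush(heap, (load + pair_count(size), reducer))
    loads = {r: sum(pair_count(block_sizes[k]) for k in keys)
             for r, keys in assignment.items()}
    return assignment, loads

# Hypothetical block sizes (entities per block).
block_sizes = {"a": 100, "b": 60, "c": 50, "d": 40, "e": 10}
assignment, loads = assign_blocks(block_sizes, 2)
```

With these sizes, block "a" alone costs 4950 comparisons, more than the other four blocks together (3820), so assigning whole blocks cannot fully even out the load. Skew of this kind is why the blocking and load balancing steps have to be designed together.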

User Interface
Users easily specify advanced ER workflows in a web browser.
Choose from a rich toolset of common ER components.
Blocking techniques.
Similarity functions.
Machine learning for automatically building match classifiers.
Visualization of the ER results and the workload of all cluster nodes.

Solution Architecture
Map determines blocking keys for each entity and outputs (blockkey,
entity) pairs.
Reduce compares entities that belong to the same block.
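
The map and reduce roles described above can be simulated in a single Python process. This is a sketch, not Dedoop code: the blocking key (first three letters of the name), the record layout, and the function names are assumptions, and the shuffle function stands in for the grouping that the MapReduce framework performs between the two phases.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(entity):
    # Hypothetical blocking key: first three letters of the lower-cased name.
    return entity["name"].lower()[:3]

def map_phase(entities):
    # Map: emit one (blockkey, entity) pair per entity.
    for entity in entities:
        yield blocking_key(entity), entity

def shuffle(pairs):
    # Group entities by key, as the framework does between map and reduce.
    blocks = defaultdict(list)
    for key, entity in pairs:
        blocks[key].append(entity)
    return blocks

def reduce_phase(blocks):
    # Reduce: form candidate pairs only among entities of the same block.
    for key, members in blocks.items():
        for a, b in combinations(members, 2):
            yield key, a["id"], b["id"]

# Made-up input records.
entities = [
    {"id": 1, "name": "Canon EOS 5D"},
    {"id": 2, "name": "canon eos-5d"},
    {"id": 3, "name": "Nikon D90"},
    {"id": 4, "name": "Nikon D-90"},
]
candidates = list(reduce_phase(shuffle(map_phase(entities))))
```

For these four records, blocking leaves 2 candidate pairs instead of the 6 that an all-pairs comparison would produce. A real reducer would then apply a similarity function to each candidate pair.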

MapDupReducer: Detecting Near Duplicates ..
Near Duplicate Detection (NDD)
Multi-Processor Systems are more effective.
MapReduce Platform.
Ease of use.
High Efficiency.

System Architecture
Non-trivial generalization of the PPJoin algorithm into the
MapReduce framework.
Redesigning the position and prefix filtering.
Document signature filtering to further reduce the candidate size.
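
As background for the filtering steps, the sketch below shows plain prefix filtering for a Jaccard threshold in single-process Python. It assumes the standard prefix length |x| - ceil(t * |x|) + 1 over tokens sorted rarest first; it does not include positional filtering, document signatures, or any MapReduce distribution, and it is not the MapDupReducer implementation. All names and the sample records are hypothetical.

```python
import math
from collections import Counter, defaultdict

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix(tokens, order, threshold):
    # Sort tokens by the global order (rarest first) and keep the prefix.
    ordered = sorted(set(tokens), key=lambda t: order[t])
    length = len(ordered) - math.ceil(threshold * len(ordered)) + 1
    return ordered[:length]

def candidate_pairs(records, threshold):
    # Global token order: ascending document frequency, ties broken by name.
    freq = Counter(t for toks in records.values() for t in set(toks))
    order = {t: i for i, t in enumerate(sorted(freq, key=lambda t: (freq[t], t)))}
    index = defaultdict(list)  # prefix token -> record ids seen so far
    candidates = set()
    for rid in sorted(records):
        for tok in prefix(records[rid], order, threshold):
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)
    return candidates

# Made-up records.
records = {
    1: "efficient duplicate detection over massive data sets".split(),
    2: "efficient duplicate detection over massive datasets".split(),
    3: "parallel set similarity joins using mapreduce".split(),
}
threshold = 0.6
candidates = candidate_pairs(records, threshold)
# Verification step: compute the exact similarity only for candidates.
matches = {p for p in candidates
           if jaccard(records[p[0]], records[p[1]]) >= threshold}
```

Only records whose prefixes share a token become candidates, so record 3 is never compared with the other two. Additional filters reduce the candidate set further before the exact similarity is computed.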

Evaluation
Data sets.
MEDLINE documents.
Finding plagiarized documents.
18.5 million records.
BING.
Web pages with an aggregated size of 2TB.
Hotspot.
High update frequency.
Altering the arguments.
Different numbers of map() and reduce() params.

Efficient Similarity Joins for Near Duplicate Detection
Similarity Definitions

Efficient Similarity Joins for Near Duplicate Detection
Efficient Similarity Join Algorithms
Efficient similarity join algorithms by exploiting the ordering of tokens
in the records.
Positional filtering and suffix filtering are complementary to the
existing prefix filtering technique.
Commonly used strategy depends on the size of the document.
Text documents: Edit distance and Jaccard similarity.
Edit distance: Minimum number of edits required to transform one
string to another.
An insertion, deletion, or substitution of a single character.
Web documents: Jaccard or overlap similarity on small or fixed-size
sketches.
Near duplicate object detection problem is a generalization of the
well-known nearest neighbor problem.
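
The similarity measures named above can be written down directly. The following Python sketch uses the textbook definitions (Levenshtein edit distance with unit costs, Jaccard as intersection over union, overlap as intersection size); the function names and the sample inputs are chosen for illustration.

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def overlap(a, b):
    # Number of tokens the two sets have in common.
    return len(set(a) & set(b))

def jaccard(a, b):
    # Shared tokens divided by all distinct tokens; two empty sets count as equal.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

x = ["near", "duplicate", "detection"]
y = ["duplicate", "detection", "join"]
distance = edit_distance("kitten", "sitting")
shared = overlap(x, y)
similarity = jaccard(x, y)
```

Here "kitten" and "sitting" are 3 edits apart, and the two token sets share 2 of their 4 distinct tokens, giving a Jaccard similarity of 0.5.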

Efficient Parallel Set-Similarity Joins Using MapReduce
Introduction
Efficiently perform set-similarity joins in parallel using the popular
MapReduce framework.
A 3-stage approach for end-to-end set-similarity joins.
Efficiently partition the data across nodes.
Balance the workload.
Reduce the need for replication.

MapReduce

Parallel Set-Similarity Joins Stages
1 Token Ordering:
Computes data statistics in order to generate good signatures.
The techniques in later stages utilize these statistics.
2 RID-Pair Generation:
Extracts the record IDs (“RID”) and the join-attribute value from
each record.
Distributes the RID and the join-attribute value pairs.
The pairs sharing a signature go to at least one common reducer.
Reducers compute the similarity of the join-attribute values and output
RID pairs of similar records.
3 Record Join:
Generates actual pairs of joined records.
It uses the list of RID pairs from the second stage and the original data
to build the pairs of similar records.
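
The three stages can be traced end to end with a small single-process simulation in Python. This is a sketch under assumptions: signatures are taken to be the prefix tokens for a Jaccard threshold, reducers are modeled as dictionary groups, and the records, threshold, and function names are invented for the example. It is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Made-up input: RID -> (join-attribute value, rest of the record).
records = {
    "r1": ("efficient parallel set similarity joins", "payload A"),
    "r2": ("efficient parallel set similarity join", "payload B"),
    "r3": ("dedoop efficient deduplication with hadoop", "payload C"),
}
THRESHOLD = 0.6  # assumed Jaccard threshold

def token_ordering(records):
    # Stage 1: token frequencies define a global order, rarest token first.
    freq = Counter(t for value, _ in records.values() for t in set(value.split()))
    return {t: rank for rank, t in enumerate(sorted(freq, key=lambda t: (freq[t], t)))}

def rid_pair_generation(records, order, threshold):
    # Stage 2: route (RID, join-attribute tokens) to one group per signature,
    # then compute similarities inside each group and emit similar RID pairs.
    reducers = defaultdict(list)
    for rid, (value, _) in records.items():
        tokens = sorted(set(value.split()), key=order.get)
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens)) + 1
        for signature in tokens[:prefix_len]:
            reducers[signature].append((rid, frozenset(tokens)))
    rid_pairs = set()  # the set removes pairs found under several signatures
    for group in reducers.values():
        for (rid_a, a), (rid_b, b) in combinations(group, 2):
            if len(a & b) / len(a | b) >= threshold:
                rid_pairs.add((rid_a, rid_b))
    return sorted(rid_pairs)

def record_join(records, rid_pairs):
    # Stage 3: turn RID pairs back into pairs of full records.
    return [(records[a], records[b]) for a, b in rid_pairs]

order = token_ordering(records)
rid_pairs = rid_pair_generation(records, order, THRESHOLD)
joined = record_join(records, rid_pairs)
```

Only RIDs and join-attribute values travel through stage 2; the rest of each record is fetched again in stage 3, which keeps the data shuffled in the expensive middle stage small.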

Token Ordering

Handling Insufficient Memory

Speedup

Scalability

Conclusion
MapReduce frameworks offer an effective platform for near duplicate
detection.
Distributed execution frameworks can be leveraged for scalable data
cleaning.
Efficient partitioning for data that cannot fit in the main memory.
Software-Defined Networking and later advances in networking can
lead to better data solutions.

Conclusion
References
Kolb, L., Thor, A., & Rahm, E. (2012). Dedoop: efficient deduplication
with Hadoop. Proceedings of the VLDB Endowment, 5(12), 1878-1881.
Vernica, R., Carey, M. J., & Li, C. (2010, June). Efficient parallel
set-similarity joins using MapReduce. In Proceedings of the 2010 ACM
SIGMOD International Conference on Management of data (pp. 495-506).
ACM.
Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., ... & Li, R.
(2010, June). MapDupReducer: detecting near duplicates over massive
datasets. In Proceedings of the 2010 ACM SIGMOD International
Conference on Management of data (pp. 1119-1122). ACM.
Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient
similarity joins for near-duplicate detection. ACM Transactions on Database
Systems (TODS), 36(3), 15.
Thank you!
