Benchmark MinHash + LSH Algorithm on Spark
Insight Data Engineering Fellow Program, Silicon Valley
Xiaoqian Liu
June 2016
What’s next?
Post Recommendation
● Data: Reddit posts and titles in 12/2014
● Similarity metric: Jaccard Similarity
○ Share of words two titles have in common (intersection over union)
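A minimal sketch of Jaccard similarity on titles, using two example titles from the MinHash slide. The `tokenize` helper is a simplification assumed here: it only lowercases and splits on whitespace, and does not remove stopwords as the deck's preprocessing step does.

```python
def tokenize(title):
    """Lowercase a title and split it into a set of word tokens."""
    return set(title.lower().split())


def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B| for two token sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty titles
    return len(a & b) / len(a | b)


post1 = tokenize("Dave Grohl tells a story")
post2 = tokenize("Dave Grohl shares a story with Taylor Swift")

# 4 shared words out of 9 distinct words
print(round(jaccard(post1, post2), 2))  # 0.44
```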
Pairwise Similarity Calculation is Expensive!!
● ~700k posts in 12/2014
● Individual lookup: 700K times, O(n)
● Pairwise calculation: 490B times, O(n^2)
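The counts above can be checked with a few lines of arithmetic. The 490B figure corresponds to n^2; if each unordered pair is compared only once (an assumption, not stated on the slide), the count is roughly half of that.

```python
n = 700_000  # approximate number of posts, from the slide

ordered_pairs = n * n               # the n^2 figure quoted on the slide
unordered_pairs = n * (n - 1) // 2  # distinct pairs, skipping self-pairs

print(f"{ordered_pairs:,}")    # 490,000,000,000
print(f"{unordered_pairs:,}")  # 244,999,650,000
```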
MinHash: Dimension Reduction
Post 1: Dave Grohl tells a story
Post 2: Dave Grohl shares a story with Taylor Swift
Post 3: I knew it was trouble when they drove by

         Min hash 1   Min hash 2   Min hash 3   Min hash 4
Post 1       932378        11070       107000       195512
Post 2        20930       213012       107000       195512
Post 3        27698        14136       104464       154376

(4 hash functions)
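A minimal MinHash sketch, assuming a seeded MD5-based hash family. `token_hash` and its seeding scheme are hypothetical choices made for illustration, not the hash functions used in the deck, so the values will not match the table above.

```python
import hashlib


def token_hash(token, seed):
    """Deterministic 32-bit hash of a token under a seed (hypothetical scheme)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def minhash_signature(tokens, num_hashes=4):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    return [min(token_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


post1 = set("dave grohl tells a story".split())
post2 = set("dave grohl shares a story with taylor swift".split())

sig1 = minhash_signature(post1)
sig2 = minhash_signature(post2)
print(sig1, sig2, estimate_jaccard(sig1, sig2))
```

With only 4 hash functions the estimate can take just five values (0, 0.25, 0.5, 0.75, 1), which is one reason a larger k gives a more accurate estimate.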
LSH (Locality Sensitive Hashing)
● Further reduce the dimension
● Suppose the signature table is divided into 2 bands, each 2 hash values wide
● Rehash each band of each item
● Use (Band id, Band hash) to find similar items
         Band 1                  Band 2
Post 1   Hash(932378, 11070)     Hash(107000, 195512)
Post 2   Hash(20930, 213012)     Hash(107000, 195512)
Post 3   Hash(27698, 14136)      Hash(104464, 154376)

Topics: Dave Grohl, Dave Grohl, Trouble
*Algorithm source: Mining of Massive Datasets (Leskovec, Rajaraman, Ullman)
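The banding step can be sketched with the signature values from the MinHash slide. Python's built-in `hash()` on a tuple stands in for the band hash here; that choice, and the `band_keys` helper, are assumptions for illustration.

```python
from collections import defaultdict

# MinHash signatures copied from the table on the MinHash slide.
signatures = {
    "Post 1": [932378, 11070, 107000, 195512],
    "Post 2": [20930, 213012, 107000, 195512],
    "Post 3": [27698, 14136, 104464, 154376],
}


def band_keys(signature, rows_per_band=2):
    """Split a signature into bands and return (band id, band hash) keys."""
    keys = []
    for start in range(0, len(signature), rows_per_band):
        band = tuple(signature[start:start + rows_per_band])
        band_id = start // rows_per_band + 1  # 1-based, as on the slide
        keys.append((band_id, hash(band)))
    return keys


# Group posts by (band id, band hash).
buckets = defaultdict(list)
for post, sig in signatures.items():
    for key in band_keys(sig):
        buckets[key].append(post)

# Posts that share at least one bucket become candidate pairs.
candidates = [posts for posts in buckets.values() if len(posts) > 1]
print(candidates)
```

Post 1 and Post 2 agree on Band 2, so they land in the same bucket; Post 3 shares no band with either and is never compared against them.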
Infrastructure for Evaluation
● Batch implementation+Eval
● Real-time implementation+Eval
[Pipeline diagram. Data sources: Reddits (12/2014), Reddits (1/2015). Stages: Preprocessing (tokenize, remove stopwords); Minhash+LSH (batch version); Minhash+LSH (online version); Export LSH+post info; Group lookup+update; Evaluation & Lookup. Cluster labels: 6 nodes, 3 nodes, 6 nodes, 1 node (all m4.xlarge).]
Batch Processing Optimization on Spark
● SparkSQL join, cartesian product
● Reduce shuffling when joining two different datasets:
○ Co-partition before joining
● Persist the data before actions
○ Storage level depends on the RDD size
● Filter results before joining and calculating similarities
○ filter(), reduceByKey()
Batch Processing: Brute-force vs Minhash+LSH (10 hash funcs, 2 bands)
100k entries, 12/2014 Reddits
Precision and Recall
● 100k entries, estimated threshold = 0.44
Parameters    Items >= threshold   Total count   Time (sec)   Precision   Recall   Num partitions
Brute-force   16,046               9.99B         29,880       1           1        3,600
k=10, b=2     585                  65,353        7.68         0.009       0.036    60
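The precision and recall figures can be reproduced from the counts in the table, assuming that "Items >= threshold" in the MinHash+LSH row counts the returned pairs that are truly above the threshold and that "Total count" is the number of pairs returned. That reading is an interpretation, but it is consistent with the reported values.

```python
true_pairs = 16_046       # pairs at or above the threshold (brute force)
candidate_pairs = 65_353  # pairs returned by MinHash+LSH (k=10, b=2)
true_positives = 585      # returned pairs at or above the threshold

precision = true_positives / candidate_pairs
recall = true_positives / true_pairs
speedup = 29_880 / 7.68   # brute-force time / MinHash+LSH time, in seconds

print(round(precision, 3), round(recall, 3), round(speedup))  # 0.009 0.036 3891
```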
780k reddit posts, precision vs k values
K = # hash functions
780k reddit posts, time vs k values
Streaming: Average Time
● Throughput: 315 events/sec, 10 sec time window
● 8 sec/microbatch, 6 nodes,
Conclusion
● MinHash+LSH effectively speeds up batch processing
● Use 400-500 hash functions and set the threshold above 0.65
○ Filter out pairs w/ low similarities
○ Fall back to a linear scan for posts w/ 0 neighbors
● MinHash works only for Jaccard similarity
○ For cosine similarity: LSH + random projection
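For the cosine-similarity case, a minimal sketch of random-projection (random hyperplane) signatures is shown below. The deck does not implement this; the function names, the dimension, and the number of planes are all assumptions for illustration.

```python
import random


def random_hyperplanes(dim, num_planes, seed=0):
    """Draw random Gaussian normal vectors, one per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def signature(vector, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return [
        1 if sum(v * p for v, p in zip(vector, plane)) >= 0 else 0
        for plane in planes
    ]


planes = random_hyperplanes(dim=3, num_planes=16)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a: cosine similarity 1
c = [-1.0, -2.0, -3.0]  # opposite direction: cosine similarity -1

sig_a, sig_b, sig_c = (signature(v, planes) for v in (a, b, c))
print(sig_a == sig_b)                              # same direction, same bits
print(sum(x != y for x, y in zip(sig_a, sig_c)))   # opposite direction, all bits differ
```

The bit signatures can then be split into bands and bucketed in the same way as MinHash signatures.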
About Me
● BS, MS in Systems Engineering
(CS minor), UVA
● Operations/Data Science Intern,
Samsung Austin R&D Center
● ML, NLP at scale
● Music, Singing
“We can have a party, just listening to music”
Backup Slides
Limits & Future Work
● Investigate recall values vs parameters/time/...
○ More recall and precision comparisons between Brute-Force and LSH+MinHash
○ More comparisons between different parameter settings
● Benchmark for batch processing:
○ Size vs Time
● More detailed benchmark on real-time processing
● More runs of experiments:
○ More representative data
● Optimize resource utilization
MapReduce version of MinHash+LSH
● Mapper side: for each post
○ Calculate min hash values
○ Create bands and band hashes
● Reducer side:
○ Get similar items grouped by (band id, band hash)
○ Calculate Jaccard similarity on each item combination -> find the most similar pair
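The mapper and reducer steps above can be sketched in plain Python, with a dictionary standing in for the shuffle/group-by-key stage. The hash scheme, the helper names, and the toy posts (including the duplicated title) are hypothetical and chosen only to make the flow visible.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def min_hashes(tokens, k):
    """k seeded hash functions; keep the minimum value per function."""
    def h(token, seed):
        digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")
    return [min(h(t, seed) for t in tokens) for seed in range(k)]


def mapper(post_id, tokens, k=4, rows=2):
    """Emit ((band id, band hash), post id) for every band of the signature."""
    sig = min_hashes(tokens, k)
    for band_id, start in enumerate(range(0, k, rows)):
        yield (band_id, hash(tuple(sig[start:start + rows]))), post_id


def reducer(grouped, token_sets):
    """Score every pair inside each bucket and return the most similar pair."""
    best = None
    for post_ids in grouped.values():
        for a, b in combinations(sorted(set(post_ids)), 2):
            union = token_sets[a] | token_sets[b]
            sim = len(token_sets[a] & token_sets[b]) / len(union)
            if best is None or sim > best[0]:
                best = (sim, a, b)
    return best


# Toy data (made up): posts 1 and 2 have identical token sets.
posts = {
    1: {"dave", "grohl", "tells", "story"},
    2: {"dave", "grohl", "tells", "story"},
    3: {"knew", "trouble", "drove"},
}

grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
for post_id, tokens in posts.items():
    for key, value in mapper(post_id, tokens):
        grouped[key].append(value)

print(reducer(grouped, posts))  # (1.0, 1, 2)
```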
Threshold of MinHash + LSH
● Estimated Similarity Lower bound for each band:
○ ~(1/#bands)^(1/#rows)
● e.g. k = 4 with 2 bands of 2 rows each: (1/2)^(1/2) ≈ 0.71, so pairs need to be roughly 0.70 similar to collide
● Collision
● Higher k, more accurate, but slower
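The threshold estimate can be evaluated directly. The second function is the standard LSH collision probability 1 - (1 - s^r)^b from Mining of Massive Datasets; it is included here as context and is not taken from the slide.

```python
def lsh_threshold(bands, rows):
    """Approximate similarity at which a pair becomes likely to collide."""
    return (1.0 / bands) ** (1.0 / rows)


def collision_probability(similarity, bands, rows):
    """Probability that two items agree on at least one full band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


print(round(lsh_threshold(bands=2, rows=2), 2))               # 0.71
print(round(collision_probability(0.7, bands=2, rows=2), 2))  # 0.74
```

With a fixed band width, adding hash functions adds bands, which lowers the threshold; widening the bands raises it.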
Streaming: Kafka
● Throughput: 146 events/sec, 10 ms time window
Streaming: Average Time
● Throughput: 120 events/sec, 10 ms time window
● 8 sec/microbatch, 6 nodes, 1024 MB memory/node
Streaming: Kafka
● Throughput: 315.30 events/sec, 10 ms time window
780k reddit posts, precision vs time
Threshold: 0.4-0.5 Threshold: 0.6-0.7 Threshold: 0.8-0.9
CPU usage
Task Diagram
