Benchmark MinHash+LSH algorithm on Spark

Insight Data Engineering Fellow Program, Silicon Valley
Xiaoqian Liu
June 2016

Post Recommendation
● Data: Reddit posts and titles in 12/2014
● Similarity metric: Jaccard Similarity
○ Share of words two titles have in common: |A ∩ B| / |A ∪ B|
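A minimal sketch of the Jaccard metric on titles, using two of the example titles that appear later in the deck. The `tokenize` helper is a hypothetical stand-in: it only lowercases and splits on whitespace, whereas the deck's preprocessing also removes stopwords.

```python
def tokenize(title):
    """Lowercase a title and split it into a set of word tokens.

    Assumption: whitespace tokenization only, no stopword removal.
    """
    return set(title.lower().split())


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


post1 = tokenize("Dave Grohl tells a story")
post2 = tokenize("Dave Grohl shares a story with Taylor Swift")
post3 = tokenize("I knew it was trouble when they drove by")

sim_12 = jaccard(post1, post2)  # 4 shared words out of 9 distinct words
sim_13 = jaccard(post1, post3)  # no shared words
```

With this tokenization, Post 1 and Post 2 share 4 of 9 distinct words, while Post 1 and Post 3 share none.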

Pairwise Similarity Calculation is Expensive!
● ~700k posts in 12/2014
● Individual lookup: 700K times, O(n)
● Pairwise calculation: 490B times, O(n^2)
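The counts above follow from simple arithmetic. The sketch below assumes n = 700,000 posts; the 490B figure corresponds to all ordered pairs (n^2), and counting each unordered pair once roughly halves it without changing the O(n^2) growth.

```python
n = 700_000  # approximate number of posts in 12/2014

# Every post compared against every post (what the 490B figure counts).
ordered_pairs = n * n

# Each unordered pair counted once, excluding self-comparisons.
unordered_pairs = n * (n - 1) // 2
```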

MinHash: Dimension Reduction
● Example titles:

Post 1   Dave Grohl tells a story
Post 2   Dave Grohl shares a story with Taylor Swift
Post 3   I knew it was trouble when they drove by

● Signatures with 4 hash functions:

         Min hash 1   Min hash 2   Min hash 3   Min hash 4
Post 1   932378       11070        107000       195512
Post 2   20930        213012       107000       195512
Post 3   27698        14136        104464       154376
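The signature table can be reproduced in spirit with a short sketch. The hash family here (MD5 salted with a seed) is an assumption made for illustration; the deck does not say which hash functions were used, so the values will differ from the table.

```python
import hashlib


def hash_token(token, seed):
    """Deterministic 32-bit hash of a token under a given seed.

    Assumption: seeded MD5 stands in for the deck's unspecified hash family.
    """
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def minhash_signature(tokens, num_hashes=4):
    """For each hash function, keep the minimum hash over all tokens.

    `tokens` must be non-empty.
    """
    return [min(hash_token(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two posts agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


post1 = "dave grohl tells a story".split()
post2 = "dave grohl shares a story with taylor swift".split()

sig1 = minhash_signature(post1)
sig2 = minhash_signature(post2)
estimate = estimate_jaccard(sig1, sig2)
```

The fraction of matching positions estimates the Jaccard similarity of the two token sets; with only 4 hash functions the estimate is coarse, and it tightens as the number of hash functions grows.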

LSH (Locality Sensitive Hashing)
● Further reduce the dimension
● Suppose the signature table is divided into 2 bands, each 2 min hashes wide
● Rehash each band of every item
● Use (band id, band hash) to find similar items

         Band 1                  Band 2                  Title keyword
Post 1   Hash(932378, 11070)     Hash(107000, 195512)    Dave Grohl
Post 2   Hash(20930, 213012)     Hash(107000, 195512)    Dave Grohl
Post 3   Hash(27698, 14136)      Hash(104464, 154376)    Trouble

*Algorithm source: Mining of Massive Datasets (Leskovec, Rajaraman, Ullman)
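The banding step can be sketched with the four-value signatures from the MinHash slide. Python's built-in `hash` on a tuple of integers is used as the band hash purely for illustration; the hash actually used in the benchmark is not stated.

```python
from collections import defaultdict


def band_keys(signature, num_bands):
    """Split a signature into equal-width bands and return (band id, band hash) keys."""
    rows = len(signature) // num_bands
    return [
        (band, hash(tuple(signature[band * rows:(band + 1) * rows])))
        for band in range(num_bands)
    ]


# Signatures copied from the MinHash example (4 hash functions).
signatures = {
    "Post 1": [932378, 11070, 107000, 195512],
    "Post 2": [20930, 213012, 107000, 195512],
    "Post 3": [27698, 14136, 104464, 154376],
}

# Group post ids by (band id, band hash).
buckets = defaultdict(list)
for post_id, sig in signatures.items():
    for key in band_keys(sig, num_bands=2):
        buckets[key].append(post_id)

# Buckets holding more than one post are candidate groups of similar items.
candidates = [ids for ids in buckets.values() if len(ids) > 1]
```

Post 1 and Post 2 agree on every value in Band 2, so they land in the same bucket and become a candidate pair; Post 3 shares no band with either and is never compared.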

Infrastructure for Evaluation
● Batch implementation+Eval
● Real-time implementation+Eval
[Architecture diagram. Inputs: Reddits (12/2014), Reddits (1/2015). Components: Preprocessing (tokenize, remove stopwords); Minhash+LSH (batch version); Minhash+LSH (online version); Export LSH+post info; Group lookup+update; Evaluation & Lookup. Cluster sizes shown: 6 nodes, 3 nodes, 6 nodes, 1 node (all m4.xlarge).]

Batch Processing Optimization on Spark
● SparkSQL join, cartesian product
● Reduce the number of shuffles when joining two different datasets:
○ Co-partition before joining
● Persist the data before actions
○ Storage level depends on the RDD size
● Filter results before joining and calculating similarities
○ filter(), reduceByKey()

Batch Processing: Brute-force vs Minhash+LSH (10 hash funcs, 2 bands)
100k entries, 12/2014 Reddits

Precision and Recall
● 100k entries, estimated threshold = 0.44

Parameters    Items >= threshold   Total count   Time (sec)   Precision   Recall   Num partitions
Brute-force   16,046               9.99B         29,880       1           1        3,600
k=10, b=2     585                  65,353        7.68         0.009       0.036    60
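The precision and recall in the table can be recomputed from the counts. This sketch assumes that all 585 above-threshold pairs found by MinHash+LSH also appear among the 16,046 pairs found by brute force, so they count as true positives.

```python
def precision_recall(true_positives, retrieved, relevant):
    """Precision = TP / retrieved pairs, recall = TP / relevant pairs."""
    return true_positives / retrieved, true_positives / relevant


# Counts from the table above (k=10, b=2 vs brute force).
precision, recall = precision_recall(
    true_positives=585,   # LSH pairs at or above the threshold (assumed true positives)
    retrieved=65_353,     # all candidate pairs returned by MinHash+LSH
    relevant=16_046,      # pairs at or above the threshold found by brute force
)
```

Rounded to three decimals, these match the 0.009 and 0.036 reported in the table.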

780k reddit posts, precision vs k values
K = # hash functions

780k reddit posts, time vs k values

Streaming: Average Time
● Throughput: 315 events/sec, 10 sec time window
● 8 sec/microbatch, 6 nodes

Conclusion
● MinHash+LSH effectively speeds up batch processing
● Use 400-500 hash functions, set the threshold above .65
○ Filter out pairs w/ low similarities
○ Linear scan for pairs w/ 0 neighbors
● MinHash applies only to Jaccard similarity
○ For cosine similarity: LSH + random projection

About Me
● BS, MS in Systems Engineering
(CS minor), UVA
● Operations/Data Science Intern,
Samsung Austin R&D Center
● ML, NLP at scale
● Music, Singing
“We can have a party, just listening to music”

Limits & Future Work
● Investigate recall values vs parameters/time/...
○ More recall and precision comparisons between brute force and MinHash+LSH
○ More comparisons across different parameter settings
● Benchmark for batch processing:
○ Size vs Time
● More detailed benchmark on real-time processing
● More runs of experiments:
○ More representative data
● Optimize resource utilization

MapReduce version of MinHash+LSH
● Mapper side: for each post
○ Calculate min hash values
○ Create bands and band hashes

● Reducer side:
○ Get similar items grouped by (band id, band hash)
○ Calculate Jaccard similarity on each item combination -> find the most similar pair
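The mapper and reducer steps above can be simulated in plain Python, without Spark or Hadoop. This is a sketch under stated assumptions: the seeded MD5 hash family, the built-in `hash` for bands, and the `run` driver that stands in for the shuffle are all hypothetical choices, not the deck's implementation.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash(tokens, num_hashes):
    """Min hash values under seeded MD5 (assumed hash family)."""
    def h(token, seed):
        digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")
    return [min(h(t, seed) for t in tokens) for seed in range(num_hashes)]


def mapper(post_id, tokens, num_hashes=4, num_bands=2):
    """Emit ((band id, band hash), (post id, token set)) for each band."""
    sig = minhash(tokens, num_hashes)
    rows = num_hashes // num_bands
    for band in range(num_bands):
        band_hash = hash(tuple(sig[band * rows:(band + 1) * rows]))
        yield (band, band_hash), (post_id, frozenset(tokens))


def reducer(items):
    """Return the most similar pair (id_a, id_b, jaccard) within one group."""
    best = None
    for (id_a, a), (id_b, b) in combinations(items, 2):
        sim = len(a & b) / len(a | b)
        if best is None or sim > best[2]:
            best = (id_a, id_b, sim)
    return best


def run(posts):
    """Simulate the shuffle: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for post_id, tokens in posts.items():
        for key, value in mapper(post_id, tokens):
            groups[key].append(value)
    return {k: reducer(v) for k, v in groups.items() if len(v) > 1}


posts = {
    "a": {"dave", "grohl"},
    "b": {"dave", "grohl"},
    "c": {"trouble", "drove"},
}
result = run(posts)
```

Posts "a" and "b" have identical token sets, so they collide in every band and each group reports them as its most similar pair. Jaccard similarity is only computed inside a group, which is where the saving over the full pairwise comparison comes from.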

Threshold of MinHash + LSH
● Estimated Similarity Lower bound for each band:
○ ~(1/#bands)^(1/#rows)
● e.g. k = 4 with 2 bands of 2 rows: pairs must be at least ~0.70 similar
● Collision
● Higher k: more accurate, but slower
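The threshold estimate can be checked numerically. The `lsh_threshold` function below follows the formula on this slide; `candidate_probability` adds the standard banding S-curve, 1 - (1 - s^r)^b, which the slide does not show. Both function names are hypothetical.

```python
def lsh_threshold(num_bands, rows_per_band):
    """Approximate similarity threshold (1 / #bands) ** (1 / #rows)."""
    return (1.0 / num_bands) ** (1.0 / rows_per_band)


def candidate_probability(s, num_bands, rows_per_band):
    """Probability that two items with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows_per_band) ** num_bands


# k = 4 hash functions split into 2 bands of 2 rows, as in the example above.
threshold = lsh_threshold(num_bands=2, rows_per_band=2)
p_at_threshold = candidate_probability(threshold, num_bands=2, rows_per_band=2)
```

For 2 bands of 2 rows the threshold is about 0.707, consistent with the "at least 0.70" figure, and a pair exactly at that similarity becomes a candidate with probability 0.75.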

Streaming: Kafka
● Throughput: 146 events/sec, 10 ms time window

Streaming: Average Time
● Throughput: 120 events/sec, 10 ms time window
● 8 sec/microbatch, 6 nodes, 1024 MB memory/node

Streaming: Kafka
● Throughput: 315.30 events/sec, 10 ms time window

780k reddit posts, precision vs time
Threshold: 0.4-0.5 Threshold: 0.6-0.7 Threshold: 0.8-0.9
