Going Beyond k-means
Developments in the ≈60 years since its publication
J Singh and Teresa Brooks
March 17, 2015
Hello Bulgaria
• A website with thousands of pages...
– Some pages identical to other pages
– Some pages nearly identical to other pages
• We want smart indexing of the collection
– Save just one copy of the duplicate pages
– Save one copy of the nearly duplicate pages
– Filter out similar documents when returning search results
• And we want to keep the index up to date
– Detect content changes quickly, possibly without reading old copies from slow storage

© DataThinks 2013-15
The Naïve Way to Address this Challenge
• Represent each document as a dot in d-dimensional space
• Run a k-means algorithm on the document set
– Resulting in k clusters
• When presented with a new document
– Find the “nearest cluster”
– Find the documents within the nearest cluster that are nearest to the document in question
• This step can be skipped if the cluster is small enough, i.e., k is large enough that everything in the cluster is close!
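The lookup the slides describe can be sketched in a few lines of NumPy. The function names are ours, and the centroids and labels are assumed to come from a prior k-means run; this is a toy illustration, not the authors' implementation:

```python
import numpy as np

def nearest_cluster(query, centroids):
    """Index of the centroid closest to the query vector."""
    return int(np.linalg.norm(centroids - query, axis=1).argmin())

def nearest_in_cluster(query, docs, labels, centroids):
    """Doc indices in the query's nearest cluster, sorted by distance."""
    c = nearest_cluster(query, centroids)
    members = np.where(labels == c)[0]
    order = np.linalg.norm(docs[members] - query, axis=1).argsort()
    return members[order].tolist()
```

With k large, each cluster is small and the final sort touches only a handful of documents, but the O(nk) clustering cost the deck notes remains.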
The Naïve Way has conceptual problems
• No good way to decide optimal k
• All documents have to be re-clustered if we want to
change k
• A document may “belong” to multiple clusters
• All clusters are roughly the same size
– In practice, this terrain is lumpy: some documents are one-of-a-kind and others are similar to many others.
The Naïve Way has technical problems
• End result is subject to initial choice of centroids
– Leads to results not being repeatable
• Performance is O(nk), or worse!
– Especially unfortunate because we want k to be large
• Algorithm is not easily adapted to map/reduce
– We need a pipeline of map/reduce jobs to compute it
Any Evolutionary Alternatives?
• Clustering has been picked over quite well due to its combination of interesting math and wide applicability
• Two dominant types have emerged:
– Hierarchical clustering
– Partitional clustering (e.g., k-means)
• k-Means Variations based on
– Choice of Initial Centroids
– Choice of k
– Parameters at each iteration
Another line of inquiry: Nearest Neighbor
• Based on partitioning the search space
– Quad Trees
– kd-Trees
– Locality-Sensitive Hashing
• Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q:
– Pr[h(p)=h(q)] is “high” if p is “close” to q
– Pr[h(p)=h(q)] is “low” if p is “far” from q
More on Nearest Neighbor…
• Locality-Sensitive Hashing†
– Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q:
• Pr[h(p)=h(q)] is “high” if p is “close” to q
• Pr[h(p)=h(q)] is “low” if p is “far” from q

† Indyk–Motwani ’98
The LSH Idea
• Treat items as vectors in d-dimensional space.
• Draw k random hyper-planes in that space.
• For each hyper-plane:
– Is each vector on the (0) side of the hyperplane or the (1) side?
• Hash(Item1) = 000
• Hash(Item3) = 101
• Hashes each item into a number
• The magic is in choosing h1, h2, …

[Figure: items 1–7 plotted in the plane, cut by three random hyperplanes h1, h2, h3]
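The bit-per-hyperplane hash above can be sketched as follows. This is our own toy code: the planes are fixed for illustration, where the slides draw them at random.

```python
import numpy as np

def lsh_hash(item, hyperplanes):
    """One bit per hyperplane: '1' if the vector lies on the
    non-negative side of the plane's normal, '0' otherwise."""
    return ''.join('1' if np.dot(h, item) >= 0 else '0' for h in hyperplanes)

# Three hyperplanes in d = 2, given by their normal vectors
# (in practice these would be drawn at random).
planes = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
```

Nearby vectors rarely straddle a plane, so they usually share a hash code; distant vectors tend to differ in at least one bit.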
The LSH Hash Code Idea…
• …Breaks d-dimensional space into proximity-polyhedra.
• Each purple block in the figure represents a document
– Each bucket represents a group of alike docs
• Docs within each bucket still need to be compared to see which ones are the “closest”
A Brief History of LSH
• Origins at Stanford (1998)
• Continuing research in universities
– Stanford, MIT, Rutgers, Cornell, …
• Continuing research in Industry
– Intel, Microsoft, Google, …
• Textbook:
– A. Rajaraman and J. Ullman, Mining of Massive Datasets (2010). (http://goo.gl/8AJDgI)
• Our contribution:
– An extensible implementation for large datasets
Choosing hash functions
• Introducing minhash
1. Sample each document to get its “shingles” – small fragments
• “Mary had a ” → “mary”, “ary ”, “ry h”, “y ha”, “ had”, …
• “CTAGTATAAA” → “CTAGTATA”, “TAGTATAA”, “AGTATAAA”
• “now is the time” → “now is”, “is the”, “the time”
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with 199 more hash algorithms to get a total of 200 minhash values.
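Steps 1–4 can be sketched as follows. This is a toy version under our own assumptions: character shingles, and the 200 “different hash algorithms” simulated by seeding MD5.

```python
import hashlib

def shingles(text, k=4):
    """Step 1: the set of all k-character fragments of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=200, k=4):
    """Steps 2-4: for each seeded hash function, keep the minimum
    hash value over all of the document's shingles."""
    shingle_set = shingles(text, k)
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    ]
```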
Interesting thing about minhashes
• The resulting minhashes are 200 integer values
representing a random selection of shingles.
– Property of minhashes: If the minhashes for two docs are the same, their shingles are likely to be the same
– If the shingles for two docs are the same, the docs themselves are likely to be the same
• Beware…
– Minhash is specific to a particular similarity measure – Jaccard similarity
– Other hash families exist for other similarity measures
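The “likely to be the same” claims rest on a standard property: the fraction of agreeing minhash positions estimates the Jaccard similarity |A∩B| / |A∪B| of the two shingle sets. A sketch, with our own helper names:

```python
def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

def estimated_jaccard(sig1, sig2):
    """Fraction of minhash positions that agree - an estimate of
    the Jaccard similarity of the underlying shingle sets."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```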
All 200 minhashes must match?
• If all minhashes match, it implies a strong similarity
between docs.
• To catch most cases with weaker similarity
– Don’t compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for ≥ 1 band.
– Sometimes one band will reject a pair and another band will consider it a candidate.
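The banding step can be sketched as below. The split of 200 minhashes into 50 bands of 4 rows is one common choice, and the bucket layout is our own simplification:

```python
from collections import defaultdict

def candidate_pairs(signatures, bands=50, rows=4):
    """Split each signature into bands; docs whose values collide
    in at least one band become candidate pairs."""
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(doc_id)   # same band values -> same bucket
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    pairs.add(tuple(sorted((docs[i], docs[j]))))
    return pairs
```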
LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and
the number of rows per band to balance false
positives/negatives.
– False positives ⇒ need to examine more pairs that are not really similar. More processing resources, more time.
– False negatives ⇒ failed to examine pairs that were similar, so we didn’t find all similar results. But got done faster!
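The tradeoff is quantified in the Rajaraman–Ullman text the deck cites: a pair with Jaccard similarity s becomes a candidate with probability 1 − (1 − sʳ)ᵇ for b bands of r rows.

```python
def candidate_probability(s, bands=50, rows=4):
    """Probability that a pair with Jaccard similarity s shares a
    bucket in at least one band: 1 - (1 - s**rows) ** bands."""
    return 1 - (1 - s ** rows) ** bands

# The curve is S-shaped: low-similarity pairs rarely become candidates
# (few false positives), high-similarity pairs almost always do
# (few false negatives). Tuning bands/rows slides the threshold.
```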
Summary
• Mine the data and place
members into hash buckets
• When you need to find a match, hash it and possible nearest neighbors will be in one of b buckets.
• Algorithm performance: O(n)
Going Beyond k-means
Demo
J Singh and Teresa Brooks
March 17, 2015
Peerbelt Results Example
[Screenshot: Peerbelt demo results – redacted page titles (#####) listed alongside their document-ID hashes]
Database Architecture Requirements
• Need a very large range of bucket numbers
– Bucket numbers in our implementation are -2³¹ to +2³¹-1
• Most buckets are empty
– Empty buckets must not take any space in the database
– Some buckets have a lot of documents in them; we need to be able to locate all of them
• To find documents similar to a given document,
– Bucketize the document, then find other documents in the same buckets
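An in-memory stand-in for these requirements (our own sketch, not the OpenLSH schema): a dictionary keyed by bucket number, so empty buckets cost nothing and non-empty ones are found directly.

```python
from collections import defaultdict

class BucketStore:
    """Sparse bucket index: only non-empty buckets occupy space."""
    def __init__(self):
        self.buckets = defaultdict(set)      # bucket number -> doc ids
        self.doc_buckets = defaultdict(set)  # doc id -> its bucket numbers

    def add(self, doc_id, bucket_nums):
        for b in bucket_nums:
            assert -2**31 <= b <= 2**31 - 1  # signed 32-bit bucket range
            self.buckets[b].add(doc_id)
            self.doc_buckets[doc_id].add(b)

    def similar_to(self, doc_id):
        """All docs sharing at least one bucket with doc_id."""
        result = set()
        for b in self.doc_buckets[doc_id]:
            result |= self.buckets[b]
        result.discard(doc_id)
        return result
```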
Implementation: OpenLSH
• We started OpenLSH to provide a framework for LSH
• Factor out the database
– Started on Google App Engine
– Virtualized interface to make it work on Cassandra
• Factor out the calculation engine
– Started on Google App Engine
– Can plug in Google MapReduce
– Ported to run in Batch mode on Cassandra
Using OpenLSH
• We’re looking for one or two interesting use cases
– Application areas:
• Near de-duplication (covered with Peerbelt’s data)
• Stocks that move independent of the herd
• Filtering “unique stories” from the News
• Contact us to discuss
What you can do
• For more information: http://openlsh.datathinks.org/
– Links to code and data set are included
• Run on App Engine
– Minimum setup required
• Adapt it to your environment and need
• If you need help, send email or create a GitHub issue.
• Send us a pull request for any improvements you make.
Thank you
• J Singh
– Principal, DataThinks
• Algorithms for big data
• @datathinks, @singh_j
• j . singh @ datathinks . org
– Adj. Prof, Computer Science, WPI
• Teresa Brooks
– Senior Software Engineer @ Xero
• teresa.brooks@xero.com
• @VaderGirl13