Going Beyond k-means
Developments in the ≈60 years since its publication
J Singh and Teresa Brooks
March 17, 2015
Hello Bulgaria
• A website with thousands of pages...
– Some pages identical to other pages
– Some pages nearly identical to other pages
• We want smart indexing of the collection
– Save just one copy of the duplicate pages
– Save one copy of the nearly duplicate pages
– Filter out similar documents when returning search results
• And we want to keep the index up to date
– Detect content changes quickly, possibly without reading old copies from slow storage

© DataThinks 2013-15
The Naïve Way to Address this Challenge
• Represent each document as a dot in d-dimensional space
• Run a k-means algorithm on the document set
– Resulting in k clusters
• When presented with a new document
– Find the “nearest cluster”
– Find the documents within the nearest cluster that are nearest to the document in question
• This step can be skipped if the cluster is small enough, i.e., k is large enough that everything in the cluster is close!
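The lookup the slides describe can be sketched in a few lines of NumPy. The function names are ours, and the centroids and labels are assumed to come from a prior k-means run; this is a toy illustration, not the authors' implementation:

```python
import numpy as np

def nearest_cluster(query, centroids):
    """Index of the centroid closest to the query vector."""
    return int(np.linalg.norm(centroids - query, axis=1).argmin())

def nearest_in_cluster(query, docs, labels, centroids):
    """Doc indices in the query's nearest cluster, sorted by distance."""
    c = nearest_cluster(query, centroids)
    members = np.where(labels == c)[0]
    order = np.linalg.norm(docs[members] - query, axis=1).argsort()
    return members[order].tolist()
```

With k large, each cluster is small and the final sort touches only a handful of documents, but the O(nk) clustering cost the deck notes remains.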
The Naïve Way has conceptual problems
• No good way to decide optimal k
• All documents have to be re-clustered if we want to
change k
• A document may “belong” to multiple clusters
• All clusters are roughly the same size
– In practice, this terrain is lumpy: some documents are one-of-a-kind and others are similar to many others.
The Naïve Way has technical problems
• End result is subject to initial choice of centroids
– Leads to results not being repeatable
• Performance is O(nk), or worse!
– Especially unfortunate because we want k to be large
• Algorithm is not easily adapted to map/reduce
– We need a pipeline of map/reduce jobs to compute it
Any Evolutionary Alternatives?
• Clustering has been picked over quite well due to its combination of interesting math and wide applicability
• Two dominant types have emerged:
– Hierarchical clustering
– Partitional clustering (e.g., k-means)
• k-Means Variations based on
– Choice of Initial Centroids
– Choice of k
– Parameters at each iteration
Another line of inquiry: Nearest Neighbor
• Based on partitioning the search space
– Quad Trees
– kd-Trees
– Locality-Sensitive Hashing
• Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q:
– Pr[h(p)=h(q)] is “high” if p is “close” to q
– Pr[h(p)=h(q)] is “low” if p is “far” from q
More on Nearest Neighbor…
• Locality-Sensitive Hashing†
– Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q:
• Pr[h(p)=h(q)] is “high” if p is “close” to q
• Pr[h(p)=h(q)] is “low” if p is “far” from q

† Indyk–Motwani ’98
The LSH Idea
• Treat items as vectors in d-dimensional space.
• Draw k random hyper-planes in that space.
• For each hyper-plane:
– Is each vector on the (0) side of the hyperplane or the (1) side?
• Hash(Item1) = 000
• Hash(Item3) = 101
• Hashes each item into a number
• The magic is in choosing h1, h2, …

[Figure: items 1–7 plotted in the plane, cut by three random hyperplanes h1, h2, h3]
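The bit-per-hyperplane hash above can be sketched as follows. This is our own toy code: the planes are fixed for illustration, where the slides draw them at random.

```python
import numpy as np

def lsh_hash(item, hyperplanes):
    """One bit per hyperplane: '1' if the vector lies on the
    non-negative side of the plane's normal, '0' otherwise."""
    return ''.join('1' if np.dot(h, item) >= 0 else '0' for h in hyperplanes)

# Three hyperplanes in d = 2, given by their normal vectors
# (in practice these would be drawn at random).
planes = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
```

Nearby vectors rarely straddle a plane, so they usually share a hash code; distant vectors tend to differ in at least one bit.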
The LSH Hash Code Idea…
• …Breaks d-dimensional space into proximity-polyhedra.
• Each purple block in the figure represents a document
– Each bucket represents a group of alike docs
• Docs within each bucket still need to be compared to see which ones are the “closest”
A Brief History of LSH
• Origins at Stanford (1998)
• Continuing research in universities
– Stanford, MIT, Rutgers, Cornell, …
• Continuing research in Industry
– Intel, Microsoft, Google, …
• Textbook:
– A. Rajaraman and J. Ullman, Mining of Massive Datasets (2010). (http://goo.gl/8AJDgI)
• Our contribution:
– An extensible implementation for large datasets
Choosing hash functions
• Introducing minhash
1. Sample each document to get its “shingles” – small fragments
• “Mary had a ” → “mary”, “ary ”, “ry h”, “y ha”, “ had”, …
• “CTAGTATAAA” → “CTAGTATA”, “TAGTATAA”, “AGTATAAA”
• “now is the time” → “now is”, “is the”, “the time”
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with 199 more hash algorithms to get a total of 200 minhash values.
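Steps 1–4 can be sketched as follows. This is a toy version under our own assumptions: character shingles, and the 200 “different hash algorithms” simulated by seeding MD5.

```python
import hashlib

def shingles(text, k=4):
    """Step 1: the set of all k-character fragments of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=200, k=4):
    """Steps 2-4: for each seeded hash function, keep the minimum
    hash value over all of the document's shingles."""
    shingle_set = shingles(text, k)
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    ]
```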
Interesting thing about minhashes
• The resulting minhashes are 200 integer values
representing a random selection of shingles.
– Property of minhashes: If the minhashes for two docs are the same, their shingles are likely to be the same
– If the shingles for two docs are the same, the docs themselves are likely to be the same
• Beware…
– Minhash is specific to a particular similarity measure – Jaccard similarity
– Other hash families exist for other similarity measures
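The “likely to be the same” claims rest on a standard property: the fraction of agreeing minhash positions estimates the Jaccard similarity |A∩B| / |A∪B| of the two shingle sets. A sketch, with our own helper names:

```python
def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

def estimated_jaccard(sig1, sig2):
    """Fraction of minhash positions that agree - an estimate of
    the Jaccard similarity of the underlying shingle sets."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```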
All 200 minhashes must match?
• If all minhashes match, it implies a strong similarity
between docs.
• To catch most cases with weaker similarity
– Don’t compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for ≥ 1 band.
– Sometimes one band will reject a pair and another band will consider it a candidate.
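The banding step can be sketched as below. The split of 200 minhashes into 50 bands of 4 rows is one common choice, and the bucket layout is our own simplification:

```python
from collections import defaultdict

def candidate_pairs(signatures, bands=50, rows=4):
    """Split each signature into bands; docs whose values collide
    in at least one band become candidate pairs."""
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(doc_id)   # same band values -> same bucket
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    pairs.add(tuple(sorted((docs[i], docs[j]))))
    return pairs
```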
LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and
the number of rows per band to balance false
positives/negatives.
– False positives ⇒ need to examine more pairs that are not really similar. More processing resources, more time.
– False negatives ⇒ failed to examine pairs that were similar, so we didn’t find all similar results. But got done faster!
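The tradeoff is quantified in the Rajaraman–Ullman text the deck cites: a pair with Jaccard similarity s becomes a candidate with probability 1 − (1 − sʳ)ᵇ for b bands of r rows.

```python
def candidate_probability(s, bands=50, rows=4):
    """Probability that a pair with Jaccard similarity s shares a
    bucket in at least one band: 1 - (1 - s**rows) ** bands."""
    return 1 - (1 - s ** rows) ** bands

# The curve is S-shaped: low-similarity pairs rarely become candidates
# (few false positives), high-similarity pairs almost always do
# (few false negatives). Tuning bands/rows slides the threshold.
```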
Summary
• Mine the data and place
members into hash buckets
• When you need to find a match, hash it and possible nearest neighbors will be in one of b buckets.
• Algorithm performance: O(n)
Going Beyond k-means
Demo
J Singh and Teresa Brooks
March 17, 2015
Peerbelt Results Example
[Screenshot: Peerbelt demo results – redacted page titles (#####) listed alongside their document-ID hashes]
Database Architecture Requirements
• Need a very large range of bucket numbers
– Bucket numbers in our implementation are -2³¹ to +2³¹-1
• Most buckets are empty
– Empty buckets must not take any space in the database
– Some buckets have a lot of documents in them; we need to be able to locate all of them
• To find documents similar to a given document,
– Bucketize the document, then find other documents in the same buckets
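An in-memory stand-in for these requirements (our own sketch, not the OpenLSH schema): a dictionary keyed by bucket number, so empty buckets cost nothing and non-empty ones are found directly.

```python
from collections import defaultdict

class BucketStore:
    """Sparse bucket index: only non-empty buckets occupy space."""
    def __init__(self):
        self.buckets = defaultdict(set)      # bucket number -> doc ids
        self.doc_buckets = defaultdict(set)  # doc id -> its bucket numbers

    def add(self, doc_id, bucket_nums):
        for b in bucket_nums:
            assert -2**31 <= b <= 2**31 - 1  # signed 32-bit bucket range
            self.buckets[b].add(doc_id)
            self.doc_buckets[doc_id].add(b)

    def similar_to(self, doc_id):
        """All docs sharing at least one bucket with doc_id."""
        result = set()
        for b in self.doc_buckets[doc_id]:
            result |= self.buckets[b]
        result.discard(doc_id)
        return result
```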
Implementation: OpenLSH
• We started OpenLSH to provide a framework for LSH
• Factor out the database
– Started on Google App Engine
– Virtualized interface to make it work on Cassandra
• Factor out the calculation engine
– Started on Google App Engine
– Can plug in Google MapReduce
– Ported to run in Batch mode on Cassandra
Using OpenLSH
• We’re looking for one or two interesting use cases
– Application areas:
• Near de-duplication (covered with Peerbelt’s data)
• Stocks that move independent of the herd
• Filtering “unique stories” from the News
• Contact us to discuss
What you can do
• For more information: http://openlsh.datathinks.org/
– Links to code and data set are included
• Run on App Engine
– Minimum setup required
• Adapt it to your environment and need
• If you need help, send email or create a GitHub issue.
• Send us a pull request for any improvements you make.
Thank you
• J Singh
– Principal, DataThinks
• Algorithms for big data
• @datathinks, @singh_j
• j . singh @ datathinks . org
– Adj. Prof, Computer Science, WPI
• Teresa Brooks
– Senior Software Engineer @ Xero
• teresa.brooks@xero.com
• @VaderGirl13