Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection Using LSH and Tensorflow
Andrey Gusev, June 6, 2018
Help you discover and do what you love.
200m+ people on Pinterest each month
100b+ Pins
2b+ Boards
10b+ Recommendations/Day
Agenda
1. Neardup, clustering and LSH
2. Candidate generation
3. Deep dive
4. Candidate selection
5. TF on Spark
Neardup
Not Neardup
Unrelated
Neardup
Duplicate
Clustering
Not An Equivalence Class
Formulation
For each image find a canonical image which represents an equivalence class.
Problem
Neardup is not an equivalence relation because it is not transitive: A can be a
neardup of B, and B of C, without A being a neardup of C.
This means we cannot find a perfect partition such that all images within a
cluster are closer to each other than to images in other clusters.
Incremental approximate K-Cut
Incrementally:
1. Generate candidates via batch LSH search
2. Select candidates via a TF model
3. Take a transitive closure over selected candidates
4. Pass over clusters and greedily select sub-clusters (K-Cut).
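Step 3 can be sketched with a union-find structure over the selected pairs. This is a minimal single-machine illustration, not the distributed job from the talk; the object name `TransitiveClosureSketch` and the use of `Long` ids are assumptions made for the example.

```scala
import scala.collection.mutable

object TransitiveClosureSketch {
  /** Groups ids into connected components given the selected neardup pairs. */
  def clusters(pairs: Seq[(Long, Long)]): Set[Set[Long]] = {
    val parent = mutable.Map.empty[Long, Long]

    // Find the representative of x, compressing the path on the way back.
    def find(x: Long): Long = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else {
        val root = find(p)
        parent(x) = root
        root
      }
    }

    // Merge the components of a and b; the smaller root id wins.
    def union(a: Long, b: Long): Unit = {
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) parent(math.max(ra, rb)) = math.min(ra, rb)
    }

    pairs.foreach { case (a, b) => union(a, b) }
    parent.keys.toSeq.groupBy(x => find(x)).values.map(_.toSet).toSet
  }
}
```

Note that the pairs (1,2) and (2,3) place 1 and 3 in the same cluster even though they were never selected as a pair, which is why a later pass that cuts clusters into sub-clusters (step 4) is still needed.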
LSH
Embeddings and LSH
- Visual Embeddings are high-dimensional vector representations of
entities (in our case images) which capture semantic similarity.
- Produced via Neural Networks like VGG16, Inception, etc.
- Locality-sensitive hashing (LSH) is a technique used to reduce the
dimensionality of high-dimensional data while approximately preserving
pairwise distances between individual points.
LSH: Locality Sensitive Hashing
- Pick random projection vectors (black)
- For each embedding vector, determine
which side of the hyperplane
it lands on
- On the same side: set bit to 1
- On a different side: set bit to 0
Result 1: <1 1 0>
Result 2: <1 0 1>
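The bit assignment above can be sketched as follows. This is an illustrative random-projection LSH under stated assumptions: Gaussian projection vectors, a bit of 1 when the dot product is non-negative, and hypothetical names (`RandomProjectionLsh`, `projections`, `signature`, `hamming`) that do not come from the talk.

```scala
import scala.util.Random

object RandomProjectionLsh {
  /** Draws numBits random Gaussian projection vectors of the given dimension. */
  def projections(numBits: Int, dim: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(numBits, dim)(rng.nextGaussian())
  }

  /** One bit per projection: 1 if the embedding has a non-negative dot product with it. */
  def signature(embedding: Array[Double], planes: Array[Array[Double]]): Array[Int] =
    planes.map { p =>
      val dot = p.zip(embedding).map { case (a, b) => a * b }.sum
      if (dot >= 0) 1 else 0
    }

  /** Number of positions at which two signatures differ. */
  def hamming(a: Array[Int], b: Array[Int]): Int =
    a.zip(b).count { case (x, y) => x != y }
}
```

Embeddings that point in similar directions fall on the same side of most hyperplanes, so their signatures differ in few bits.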
LSH terms
Pick optimal number of terms and bits per term
- 1001110001011000 -> [00]1001 - [01]1100 - [10]0101 - [11]1000
- [x] → a term index
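A small sketch of the term construction shown above, assuming the signature is held as a string of '0' and '1' characters and that each term is its chunk prefixed by the chunk index in binary. The names `LshTerms` and `terms` are hypothetical.

```scala
object LshTerms {
  /** Splits a bit string into fixed-width chunks and prefixes each with its chunk index in binary. */
  def terms(bits: String, bitsPerTerm: Int): List[String] = {
    require(bits.length % bitsPerTerm == 0, "bit string must divide evenly into terms")
    val numTerms = bits.length / bitsPerTerm
    // Number of binary digits needed to write the largest term index.
    val indexWidth = math.max(1, Integer.toBinaryString(numTerms - 1).length)
    bits.grouped(bitsPerTerm).zipWithIndex.map { case (chunk, i) =>
      val raw = Integer.toBinaryString(i)
      val index = ("0" * (indexWidth - raw.length)) + raw
      s"[$index]$chunk"
    }.toList
  }
}
```

The index prefix keeps identical bit patterns at different positions from colliding, so two images share a term only when they agree on the same chunk of the signature.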
Candidate Generation
Neardup Candidate Generation
- Input Data:
RDD[(ImgId, List[LSHTerm])] // billions
- Goal:
RDD[(ImgId, TopK[(ImgId, Overlap)])]
Nearest Neighbor (KNN) problem formulation
Neardup Candidate Generation
Given a set of documents each described by LSH terms, example:
A → (1,2,3)
B → (1,3,10)
C → (2,10)
And more generally:
D_i → [t_j]
Where each D_i is a document and [t_j] is a list of LSH terms (assume each is a 4-byte integer)
Results:
A → (B,2), (C,1)
B → (A,2), (C,1)
C → (A,1), (B,1)
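The A/B/C example can be reproduced on a single machine with an inverted index and overlap counting. This is a plain-collections sketch of the idea, not the Spark job; `OverlapSearch` and `topK` are hypothetical names, and ties are broken by document id only to make the output deterministic.

```scala
object OverlapSearch {
  /** For each document, counts shared terms with every other document and keeps the top k. */
  def topK(docs: Map[String, Seq[Int]], k: Int): Map[String, Seq[(String, Int)]] = {
    // Inverted index: term -> posting list of documents containing it.
    val index: Map[Int, Seq[String]] =
      docs.toSeq
        .flatMap { case (doc, terms) => terms.distinct.map(t => (t, doc)) }
        .groupBy(_._1)
        .map { case (t, pairs) => (t, pairs.map(_._2)) }

    docs.map { case (query, terms) =>
      // Every posting-list hit for a query term is one unit of overlap.
      val hits = terms.distinct.flatMap(t => index.getOrElse(t, Seq.empty)).filter(_ != query)
      val counts = hits.groupBy(identity).map { case (doc, hs) => (doc, hs.size) }.toSeq
      // Highest overlap first; ties broken by document id.
      (query, counts.sortBy { case (doc, n) => (-n, doc) }.take(k))
    }
  }
}
```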
Spark Candidate Generation
1. Input RDD[(ImgId, List[LSHTerm])] ← both index and query sets
2. flatMap, groupBy input into RDD[(LSHTerm, PostingList)] ← an inverted index
3. flatMap, groupBy into RDD[(LSHTerm, PostingList)] ← a query list
4. Join (2) and (3), flatMap over queries posting list, and groupBy query ImgId;
RDD[(ImgId, List[PostingList])] ← search results by query.
5. Merge List[List[ImgId]] into TopK(ImgId, Overlap), counting the number of times each ImgId is
seen → RDD[(ImgId, TopK[(ImgId, Overlap)])].
* PostingList = List[ImgId]
Orders of magnitude too slow.
Deep Dive
Dictionary encoding
def mapDocToInt(termIndexRaw: RDD[(String, List[TermId])]): RDD[(String, DocId)] = {
  // ensure that mapping between string and id is stable by sorting
  // this allows attempts to re-use partial stage completions
  termIndexRaw.keys.distinct().sortBy(x => x).zipWithIndex()
}
val stringArray = (for (ind <- 0 to 1000) yield randomString(32)).toArray // 108128 Bytes*
val intArray = (for (ind <- 0 to 1000) yield ind).toArray // 4024 Bytes*
Roughly 25x smaller.
* https://www.javamex.com/classmexer/
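The same dictionary-encoding idea on plain collections, as a rough sketch: distinct string keys are sorted so that id assignment is stable across runs, then replaced by dense integers. `DictionaryEncoding`, `buildDictionary` and `encode` are hypothetical names, and `Int` ids are an assumption.

```scala
object DictionaryEncoding {
  /** Assigns each distinct string key a dense integer id, in sorted order so ids are stable across runs. */
  def buildDictionary(keys: Seq[String]): Map[String, Int] =
    keys.distinct.sorted.zipWithIndex.toMap

  /** Rewrites string-keyed records to integer-keyed records. */
  def encode[T](records: Seq[(String, T)], dict: Map[String, Int]): Seq[(Int, T)] =
    records.map { case (key, value) => (dict(key), value) }
}
```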
Variable Byte Encoding
- One bit of each byte is a continuation bit; overhead
- int → byte (best case)
- 32 char string up to 25x4 = 100x memory reduction
https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html
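A minimal variable-byte codec, to make the continuation-bit idea concrete. It assumes one common layout (7 payload bits per byte, low-order group first, high bit set when more bytes follow); the variant described at the link above uses a different byte order and flags the last byte instead. The object name `VByte` is hypothetical.

```scala
object VByte {
  /** Encodes a non-negative Int using 7 payload bits per byte, low-order group first. */
  def encode(value: Int): Array[Byte] = {
    require(value >= 0, "sketch handles non-negative values only")
    val out = Array.newBuilder[Byte]
    var v = value
    while (v >= 0x80) {
      out += ((v & 0x7f) | 0x80).toByte // continuation bit set: more bytes follow
      v >>>= 7
    }
    out += v.toByte // last byte: continuation bit clear
    out.result()
  }

  /** Decodes a single value produced by encode. */
  def decode(bytes: Array[Byte]): Int =
    bytes.zipWithIndex.foldLeft(0) { case (acc, (b, i)) =>
      acc | ((b & 0x7f) << (7 * i))
    }
}
```

Values below 128 fit in a single byte, which is the "int → byte" best case; the largest `Int` needs five bytes, which is where the continuation-bit overhead shows up.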
Inverted Index Partitioning
Inverted index is skewed
/**
* Build partitioned inverted index by taking the modulo of docId as the partition.
*/
def buildPartitionedInvertedIndex(flatTermIndexAndFreq: RDD[(TermId, (DocId, TermFreq))]):
RDD[((TermId, TermPartition), Iterable[DocId])] = {
flatTermIndexAndFreq.map { case (termId, (docId, _)) =>
// partition documents within the same term to improve balance
// and reduce the posting list length
((termId, (Math.abs(docId) % TERM_PARTITIONING).toByte), docId)
}.groupByKey()
}
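The effect of the partitioned key can be seen on a small in-memory example. This sketch uses plain collections and `Math.floorMod` (so negative ids also map to a non-negative partition), which is a choice made here rather than what the slide code does; `PostingListPartitioning` is a hypothetical name.

```scala
object PostingListPartitioning {
  /** Splits each term's posting list into shorter lists keyed by (term, docId mod numPartitions). */
  def partition(postings: Seq[(Int, Int)], numPartitions: Int): Map[(Int, Int), Seq[Int]] =
    postings
      .groupBy { case (termId, docId) => (termId, Math.floorMod(docId, numPartitions)) }
      .map { case (key, pairs) => (key, pairs.map(_._2)) }
}
```

A term that appears in eight documents becomes four posting lists of two documents each when `numPartitions` is 4, so no single key carries the whole skewed list.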
Packing
(Int, Byte) => Long
Before:
Unsorted: 128.77 MB in 549ms
Sort+Limit: 4.41 KB in 7511ms
After:
Unsorted: 38.83 MB in 219ms
Sort+Limit: 4.41 KB in 467ms
def packDocIdAndByteIntoLong(docId: DocId, docFreq: DocFreq): Long = {
(docFreq.toLong << 32) | (docId & 0xffffffffL)
}
def unpackDocIdAndByteFromLong(packed: Long): (DocId, DocFreq) = {
(packed.toInt, (packed >> 32).toByte)
}
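A self-contained round trip of the packing functions, with a sort over the packed primitives. The type aliases `DocId = Int` and `DocFreq = Byte` are assumed from the `(Int, Byte) => Long` signature, and `topByFreq` is a hypothetical helper added only to show why a packed `Long` is convenient to sort.

```scala
object Packing {
  type DocId = Int    // assumed from the (Int, Byte) => Long signature
  type DocFreq = Byte

  /** Frequency goes in the high 32 bits, the doc id bits in the low 32. */
  def pack(docId: DocId, docFreq: DocFreq): Long =
    (docFreq.toLong << 32) | (docId & 0xffffffffL)

  def unpack(packed: Long): (DocId, DocFreq) =
    (packed.toInt, (packed >> 32).toByte)

  /** Sorting the primitive longs orders entries by frequency first, then by doc id bits. */
  def topByFreq(entries: Seq[(DocId, DocFreq)], k: Int): List[(DocId, DocFreq)] = {
    val packed = entries.map { case (d, f) => pack(d, f) }.toArray
    java.util.Arrays.sort(packed)
    packed.reverseIterator.take(k).map(p => unpack(p)).toList
  }
}
```

Masking with `0xffffffffL` matters for negative doc ids: without it, sign extension would overwrite the frequency bits.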
Slicing
Split query set into slices to reduce spill and size for
“widest” portion of the computation. Union at the end.
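Slicing can be sketched as running the same search once per slice of the query set and concatenating the results. The hash-based slice assignment and the names `Slicing` and `searchInSlices` are assumptions for illustration; the talk does not say how slices are chosen.

```scala
object Slicing {
  /** Runs `search` on one slice of the queries at a time and concatenates the results. */
  def searchInSlices[Q, R](queries: Seq[Q], numSlices: Int)(search: Seq[Q] => Seq[R]): Seq[R] =
    (0 until numSlices).flatMap { slice =>
      // Assign each query to exactly one slice by hashing it.
      val part = queries.filter(q => Math.floorMod(q.hashCode, numSlices) == slice)
      search(part)
    }
}
```

Each call to `search` sees only a fraction of the queries, which bounds the size of the intermediate data at the cost of scanning the index once per slice.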
Additional Optimizations
- Cost-based optimizer: significant runtime improvements can be realized by
analyzing input data sets and setting performance parameters automatically.
- Counting: Jaccard overlap counting is done via low-level, high-performance collections.
- Off-heap serialization when possible (spark.kryo.unsafe).
Generic Batch LSH Search
- Can be applied generically to KNN search and is embedding
agnostic.
- Can work on an arbitrarily large query set via slicing.
Candidate Selection
TF DNN Classifier
- Transfer learning over VGG16
- Visual embeddings
- XOR hamming bits
- Learning still happens at >1B pairs
- Batch size of 1024, Adam optimizer
Network diagram labels: 4096, 2048, 256, 128, 1
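The "XOR hamming bits" input can be illustrated as the element-wise XOR of two LSH signatures. How the talk's model actually lays out its inputs is not stated, so the concatenation in `features` below is a hypothetical layout, and `PairFeatures` and `xorBits` are names invented for this sketch.

```scala
object PairFeatures {
  /** XOR of two equal-length bit signatures: 1.0f where the bits differ, 0.0f where they agree. */
  def xorBits(a: Array[Int], b: Array[Int]): Array[Float] = {
    require(a.length == b.length, "signatures must have the same length")
    a.zip(b).map { case (x, y) => (x ^ y).toFloat }
  }

  /** Hypothetical layout: both embeddings followed by the XOR bits. */
  def features(embA: Array[Float], embB: Array[Float],
               bitsA: Array[Int], bitsB: Array[Int]): Array[Float] =
    embA ++ embB ++ xorBits(bitsA, bitsB)
}
```

For the two example signatures <1 1 0> and <1 0 1>, the XOR is <0 1 1>, so the Hamming distance between them is 2.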
Vectorization: mapPartitions + grouped
- During training and inference, vectorization reduces overhead.
- Spark mapPartitions + grouped allows for large batches while controlling
their size. Works well for inference.
- 2 ms/prediction on c3.8xl CPUs with a network of 10MM parameters.
input.mapPartitions { partition: Iterator[(ImgInfo, ImgInfo)] =>
// break down large partitions into groups and score per group
partition.grouped(BATCH_SIZE).flatMap { group: Seq[(ImgInfo, ImgInfo)] =>
// create tensors and score as features: Array[Array[Float]] --> Tensor.create(features)
}
}
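The batching pattern can be tried without Spark or TensorFlow, since `grouped` is a standard `Iterator` method. In this sketch `scoreBatch` stands in for a single vectorized model call, and `BatchedScoring` and `scoreAll` are hypothetical names.

```scala
object BatchedScoring {
  /** Scores items in batches; `scoreBatch` stands in for one vectorized model call. */
  def scoreAll[A](pairs: Iterator[A], batchSize: Int)(scoreBatch: Seq[A] => Seq[Float]): Iterator[(A, Float)] =
    pairs.grouped(batchSize).flatMap { group =>
      val scores = scoreBatch(group)
      // Pair each input with its score; order within the batch is preserved.
      group.zip(scores)
    }
}
```

Ten inputs with a batch size of 4 result in three model calls (4, 4 and 2 items) rather than ten, which is where the per-call overhead is saved.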
One TF Session per JVM
- Reduce model loading overhead, load once per JVM; thread-safe.
object TensorflowModel {
lazy val model: Session = {
SavedModelBundle.load(...).session()
}
}
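The load-once behaviour comes from Scala's `lazy val` inside an `object`, which is initialized at most once per JVM even under concurrent access. The sketch below demonstrates that with a stand-in model class; `LazyModelHolder`, `FakeModel` and the `loads` counter are hypothetical and exist only so the example runs without TensorFlow.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LazyModelHolder {
  // Counts how many times the expensive load actually ran.
  val loads = new AtomicInteger(0)

  // Stand-in for an expensive model; a real job would hold a TF session here.
  final class FakeModel { def score(x: Float): Float = x * 0.5f }

  // Initialized on first access and then shared by every task in this JVM.
  lazy val model: FakeModel = {
    loads.incrementAndGet()
    new FakeModel
  }
}
```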
Summary
- Candidate Generation uses Batch LSH Search over terms from visual
embeddings.
- Batch LSH scales to billions of objects in the index and is embedding
agnostic.
- Candidate Selection uses a TF classifier over raw visual embeddings.
- Two-pass transitive closure to cluster results.
Thanks!
