Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection Using LSH and Tensorflow
Andrey Gusev, June 6, 2018
Help you discover and do
what you love.
200M+ people on Pinterest each month
100B+ Pins
2B+ Boards
10B+ recommendations/day
Agenda
1. Neardup, clustering and LSH
2. Candidate generation
3. Deep dive
4. Candidate selection
5. TF on Spark
Neardup
[Figure labels: Not Neardup, Unrelated, Neardup, Duplicate]
Clustering
Not An Equivalence Class
Formulation
For each image find a canonical image which represents an equivalence class.
Problem
Neardup is not an equivalence relation because it is not transitive.
This means we cannot find a perfect partition in which all images within a
cluster are closer to each other than to images in other clusters.
Incremental approximate K-Cut
Incrementally:
1. Generate candidates via batch LSH search
2. Select candidates via a TF model
3. Take a transitive closure over selected candidates
4. Pass over clusters and greedily select sub-clusters (K-Cut).
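Step 3 above can be illustrated with a small union-find sketch. This is not the implementation from the talk: the object name, the `Long` ids and the input format (a list of pairs already selected as neardups) are assumptions made for the example.

```scala
import scala.collection.mutable

object TransitiveClosureSketch {
  // Groups ids into connected components, given pairs selected as neardups.
  def clusters(pairs: Seq[(Long, Long)]): Set[Set[Long]] = {
    val parent = mutable.Map.empty[Long, Long]

    // Find the representative of x, compressing the path on the way back.
    def find(x: Long): Long = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else {
        val root = find(p)
        parent(x) = root
        root
      }
    }

    // Merge the components that contain a and b.
    def union(a: Long, b: Long): Unit = {
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(ra) = rb
    }

    pairs.foreach { case (a, b) => union(a, b) }
    parent.keys.toList.groupBy(x => find(x)).values.map(_.toSet).toSet
  }
}
```

With pairs (1,2) and (2,3), images 1 and 3 land in the same cluster even if they were never selected as a pair, which is why a further pass over the clusters (step 4) is needed.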
LSH
Embeddings and LSH
- Visual Embeddings are high-dimensional vector representations of
entities (in our case images) which capture semantic similarity.
- Produced via Neural Networks like VGG16, Inception, etc.
- Locality-sensitive hashing (LSH) is a technique used to reduce the
dimensionality of high-dimensional data while approximately preserving
pairwise distances between individual points.
LSH: Locality Sensitive Hashing
- Pick random projection vectors (black in the figure); each defines a hyperplane
- For each embedding vector, determine on which side of each hyperplane
it lands
- On the same side as the projection vector: set bit to 1
- On the opposite side: set bit to 0
Result 1: <1 1 0>
Result 2: <1 0 1>
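A minimal sketch of the random-projection scheme described above, assuming hyperplanes through the origin and Gaussian-distributed projection vectors. The object name, function names and the seed parameter are hypothetical and do not come from the slides.

```scala
import scala.util.Random

object RandomProjectionLshSketch {
  // One random hyperplane (through the origin) per output bit.
  def randomPlanes(numBits: Int, dim: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(numBits, dim)(rng.nextGaussian())
  }

  // Bit i is 1 when the embedding has a non-negative dot product with plane i,
  // i.e. it lies on the same side of the hyperplane as the projection vector.
  def signature(embedding: Array[Double], planes: Array[Array[Double]]): Array[Int] =
    planes.map { plane =>
      val dot = plane.zip(embedding).map { case (p, e) => p * e }.sum
      if (dot >= 0.0) 1 else 0
    }
}
```

Two embeddings that point in similar directions fall on the same side of most hyperplanes, so their signatures agree on most bits.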
LSH terms
Pick optimal number of terms and bits per term
- 1001110001011000 -> [00]1001 - [01]1100 - [10]0101 - [11]1000
- [x] → a term index
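The split shown above can be reproduced with a short sketch. It assumes the signature is given as a string of bits and that the term index is written as a zero-padded binary prefix; the names are hypothetical.

```scala
object LshTermsSketch {
  // Splits a bit string into fixed-width chunks and prefixes each chunk with its
  // zero-padded binary index, so equal chunks at different positions stay distinct.
  def toTerms(bits: String, bitsPerTerm: Int): Seq[String] = {
    val chunks = bits.grouped(bitsPerTerm).toSeq
    val indexWidth = math.max(1, Integer.toBinaryString(chunks.size - 1).length)
    chunks.zipWithIndex.map { case (chunk, i) =>
      val index = Integer.toBinaryString(i).reverse.padTo(indexWidth, '0').reverse
      s"[$index]$chunk"
    }
  }
}
```

Calling `LshTermsSketch.toTerms("1001110001011000", 4)` yields the four terms from the slide. Fewer, longer terms make a match more selective; more, shorter terms make matches more frequent.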
Candidate Generation
Neardup Candidate Generation
- Input Data:
RDD[(ImgId, List[LSHTerm])] // billions
- Goal:
RDD[(ImgId, TopK[(ImgId, Overlap)])]
Nearest Neighbor (KNN) problem formulation
Neardup Candidate Generation
Given a set of documents each described by LSH terms, example:
A → (1,2,3)
B → (1,3,10)
C → (2,10)
And more generally:
D_i → [t_j]
where each D_i is a document and [t_j] is a list of LSH terms (assume each is a 4-byte integer).
Results:
A → (B,2), (C,1)
B → (A,2), (C,1)
C → (A,1), (B,1)
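The worked example can be checked with a brute-force sketch that compares every pair of documents. It only pins down the definition of overlap; it is quadratic and not the approach used at scale. The names are hypothetical.

```scala
object OverlapSketch {
  // For each document, count the LSH terms shared with every other document,
  // keep non-zero overlaps, and sort by overlap (descending), then by id.
  def overlaps(docs: Map[String, Set[Int]]): Map[String, Seq[(String, Int)]] =
    docs.map { case (id, terms) =>
      val scored = docs.toSeq.collect {
        case (other, otherTerms) if other != id && (terms & otherTerms).nonEmpty =>
          (other, (terms & otherTerms).size)
      }
      id -> scored.sortBy { case (other, n) => (-n, other) }
    }
}
```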
Spark Candidate Generation
1. Input RDD[(ImgId, List[LSHTerm])] ← both index and query sets
2. flatMap, groupBy input into RDD[(LSHTerm, PostingList)] ← an inverted index
3. flatMap, groupBy into RDD[(LSHTerm, PostingList)] ← a query list
4. Join (2) and (3), flatMap over queries posting list, and groupBy query ImgId;
RDD[(ImgId, List[PostingList])] ← search results by query.
5. Merge List[List[ImgId]] into TopK[(ImgId, Overlap)], counting the number of times each ImgId is
seen → RDD[(ImgId, TopK[(ImgId, Overlap)])].
* PostingList = List[ImgId]
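The five steps can be mirrored on plain Scala collections to make the data flow concrete. This is a single-machine sketch under assumed type aliases, not the Spark job from the talk: the join of step 4 becomes a map lookup, and ties are broken by id.

```scala
object InvertedIndexSearchSketch {
  type ImgId = String
  type LshTerm = Int

  def topK(index: Seq[(ImgId, List[LshTerm])],
           queries: Seq[(ImgId, List[LshTerm])],
           k: Int): Map[ImgId, Seq[(ImgId, Int)]] = {
    // Step 2: inverted index, term -> posting list of indexed images.
    val postings: Map[LshTerm, Seq[ImgId]] =
      index.flatMap { case (id, terms) => terms.distinct.map(t => (t, id)) }
        .groupBy(_._1)
        .map { case (t, pairs) => t -> pairs.map(_._2) }

    queries.map { case (queryId, terms) =>
      // Steps 3-4: look up every query term and collect the matching images.
      val hits = terms.distinct
        .flatMap(t => postings.getOrElse(t, Seq.empty))
        .filter(_ != queryId)
      // Step 5: overlap = number of times an image was seen; keep the top k.
      val counts = hits.groupBy(identity).map { case (id, seen) => (id, seen.size) }
      queryId -> counts.toSeq.sortBy { case (id, n) => (-n, id) }.take(k)
    }.toMap
  }
}
```

The intermediate `hits` list is the "widest" part of the computation: a frequent term drags its whole posting list into every query that contains it.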
Orders of magnitude too slow.
Deep Dive
Dictionary encoding
def mapDocToInt(termIndexRaw: RDD[(String, List[TermId])]): RDD[(String, DocId)] = {
  // ensure that mapping between string and id is stable by sorting
  // this allows attempts to re-use partial stage completions
  termIndexRaw.keys.distinct().sortBy(x => x).zipWithIndex()
}
val stringArray = (for (ind <- 0 to 1000) yield randomString(32)).toArray // 108128 bytes*
val intArray = (for (ind <- 0 to 1000) yield ind).toArray // 4024 bytes*
25x smaller
* https://www.javamex.com/classmexer/
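The same idea on a local collection, as a minimal sketch: every distinct string key gets a dense integer id, and sorting before numbering keeps the assignment deterministic. The names are hypothetical.

```scala
object DictionaryEncodingSketch {
  // Assigns each distinct string key a dense Int id; sorting first makes the
  // assignment reproducible across runs.
  def buildDictionary(keys: Seq[String]): Map[String, Int] =
    keys.distinct.sorted.zipWithIndex.toMap
}
```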
Variable Byte Encoding
- One bit of each byte is a continuation bit (overhead)
- int → byte (best case)
- 32-char string: up to 25 x 4 = 100x memory reduction
https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html
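A sketch of one common variable-byte layout, assuming non-negative values: 7 payload bits per byte, low-order group first, with the high bit set when more bytes follow. The linked IR-book chapter marks the last byte instead, so this is a variant rather than the exact code used in the talk.

```scala
import scala.collection.mutable.ArrayBuffer

object VarByteSketch {
  // Encodes a non-negative Int using 7 payload bits per byte, low-order group first.
  // The high bit of a byte is set when more bytes follow.
  def encode(value: Int): Array[Byte] = {
    require(value >= 0, "sketch handles non-negative values only")
    val out = ArrayBuffer.empty[Byte]
    var v = value
    while (v >= 0x80) {
      out += ((v & 0x7f) | 0x80).toByte
      v >>>= 7
    }
    out += v.toByte
    out.toArray
  }

  // Reassembles the value from its 7-bit groups.
  def decode(bytes: Array[Byte]): Int = {
    var result = 0
    var shift = 0
    for (b <- bytes) {
      result |= (b & 0x7f) << shift
      shift += 7
    }
    result
  }
}
```

Values below 128 take a single byte (the best case on the slide), while the largest Int takes five, one more than the fixed 4-byte form.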
Inverted Index Partitioning
Inverted index is skewed
/**
 * Build partitioned inverted index by taking modulo of docId into partition.
 */
def buildPartitionedInvertedIndex(flatTermIndexAndFreq: RDD[(TermId, (DocId, TermFreq))]):
    RDD[((TermId, TermPartition), Iterable[DocId])] = {
  flatTermIndexAndFreq.map { case (termId, (docId, _)) =>
    // partition documents within the same term to improve balance
    // and reduce the posting list length
    ((termId, (Math.abs(docId) % TERM_PARTITIONING).toByte), docId)
  }.groupByKey()
}
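A local sketch of the same keying scheme shows how one long posting list is spread over several (term, partition) keys. The partition count of 4 is a made-up value, since the slides do not state TERM_PARTITIONING. The sketch uses `Math.floorMod` rather than `Math.abs(...) %` because `Math.abs(Int.MinValue)` is still negative.

```scala
object TermPartitioningSketch {
  // Hypothetical value; the slides do not state TERM_PARTITIONING.
  val TermPartitions = 4

  // Spreads one term's posting list over several (term, partition) keys.
  def partitionPostings(pairs: Seq[(Int, Int)]): Map[(Int, Byte), Seq[Int]] =
    pairs
      .map { case (termId, docId) =>
        ((termId, Math.floorMod(docId, TermPartitions).toByte), docId)
      }
      .groupBy(_._1)
      .map { case (key, group) => key -> group.map(_._2) }
}
```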
Packing
(Int, Byte) => Long
Before:
Unsorted: 128.77 MB in 549ms
Sort+Limit: 4.41 KB in 7511ms
After:
Unsorted: 38.83 MB in 219ms
Sort+Limit: 4.41 KB in 467ms
def packDocIdAndByteIntoLong(docId: DocId, docFreq: DocFreq): Long = {
  (docFreq.toLong << 32) | (docId & 0xffffffffL)
}
def unpackDocIdAndByteFromLong(packed: Long): (DocId, DocFreq) = {
  (packed.toInt, (packed >> 32).toByte)
}
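A self-contained round trip of the packing above, assuming DocId is an Int and DocFreq is a Byte as the "(Int, Byte) => Long" line suggests. The `topByFreq` helper is hypothetical; it illustrates one plausible reason for the faster Sort+Limit, namely that primitive Longs sort without tuples or a custom comparator.

```scala
object PackingSketch {
  // Frequency goes into the upper 32 bits, the docId bits into the lower 32.
  def pack(docId: Int, docFreq: Byte): Long =
    (docFreq.toLong << 32) | (docId & 0xffffffffL)

  def unpack(packed: Long): (Int, Byte) =
    (packed.toInt, (packed >> 32).toByte)

  // Sorting the packed Longs orders entries by frequency first, then by docId bits.
  def topByFreq(entries: Seq[(Int, Byte)], k: Int): Seq[(Int, Byte)] =
    entries.map { case (d, f) => pack(d, f) }.toArray.sorted.reverse.take(k).map(unpack).toSeq
}
```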
Slicing
Split query set into slices to reduce spill and size for
“widest” portion of the computation. Union at the end.
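Slicing can be sketched generically: the query set is split by a hash of each query, the search runs once per slice, and the partial results are unioned. The function name and the hash-based split are assumptions made for the example; the slides do not say how slices are chosen.

```scala
object SlicingSketch {
  // Runs `search` on one slice of the query set at a time and unions the results,
  // so the intermediate data of only one slice has to be held at once.
  def searchInSlices[Q, R](queries: Seq[Q], numSlices: Int)(search: Seq[Q] => Seq[R]): Seq[R] =
    (0 until numSlices).flatMap { slice =>
      val part = queries.filter(q => Math.floorMod(q.hashCode, numSlices) == slice)
      search(part)
    }
}
```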
Additional Optimizations
- Cost-based optimizer: analyzing input data sets and setting performance
parameters automatically can significantly improve runtime.
- Counting: Jaccard overlap counting is done via low-level, high-performance collections.
- Off-heap serialization when possible (spark.kryo.unsafe).
Generic Batch LSH Search
- Can be applied generically to KNN; embedding agnostic.
- Can work on an arbitrarily large query set via slicing.
Candidate Selection
TF DNN Classifier
- Transfer learning over VGG16
- Visual embeddings
- XOR hamming bits
- Learning still happens at >1B pairs
- Batch size of 1024, Adam optimizer
4096 → 2048 → 256 → 128 → 1 (sizes shown in the network diagram)
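One reading of the "XOR hamming bits" feature is the bitwise XOR of the two images' LSH signatures, where a 1 marks a bit on which the pair disagrees. The sketch below assumes that reading and assumes signatures that fit in a Long; the names are hypothetical.

```scala
object XorBitsSketch {
  // XOR of two LSH signatures: a 1 marks a bit on which the images disagree.
  def xorBits(a: Long, b: Long): Long = a ^ b

  // Hamming distance = number of disagreeing bits.
  def hamming(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)

  // Expands the low `numBits` bits into a float feature vector, most significant first.
  def asFeatures(bits: Long, numBits: Int): Array[Float] =
    Array.tabulate(numBits)(i => ((bits >>> (numBits - 1 - i)) & 1L).toFloat)
}
```

For the earlier example signatures <1 1 0> and <1 0 1> (6 and 5 as integers), the XOR is <0 1 1> and the Hamming distance is 2.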
Vectorization: mapPartitions + grouped
- During training and inference, vectorization reduces overhead.
- Spark mapPartitions + grouped allows for large batches while controlling
their size. Works well for inference.
- 2 ms/prediction on c3.8xl CPUs with a network of 10M parameters.
input.mapPartitions { partition: Iterator[(ImgInfo, ImgInfo)] =>
  // break down large partitions into groups and score per group
  partition.grouped(BATCH_SIZE).flatMap { group: Seq[(ImgInfo, ImgInfo)] =>
    // create tensors and score as features: Array[Array[Float]] --> Tensor.create(features)
  }
}
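The batching pattern itself needs neither Spark nor TensorFlow to demonstrate. In the sketch below, `scoreBatch` is a hypothetical stand-in for building one tensor from a group and running the model once; the rest is the standard `Iterator.grouped` call that the slide relies on.

```scala
object BatchedScoringSketch {
  // Scores an iterator of items in fixed-size batches; `scoreBatch` is a
  // hypothetical stand-in for building a tensor and running the model once.
  def scoreAll[A](items: Iterator[A], batchSize: Int)(scoreBatch: Seq[A] => Seq[Float]): Iterator[(A, Float)] =
    items.grouped(batchSize).flatMap { group =>
      val scores = scoreBatch(group)
      require(scores.size == group.size, "one score per input expected")
      group.zip(scores)
    }
}
```

Because the result is still an iterator, a large partition is consumed one batch at a time rather than materialized in full.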
One TF Session per JVM
- Reduce model loading overhead, load once per JVM; thread-safe.
object TensorflowModel {
  lazy val model: Session = {
    SavedModelBundle.load(...).session()
  }
}
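The load-once behaviour comes from Scala itself: a `lazy val` inside an `object` is initialized on first access, at most once per JVM, with synchronized initialization. The sketch below swaps the TensorFlow session for a hypothetical `FakeModel` and counts loads to make that visible; it says nothing about the thread safety of the model's own methods.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LazyModelSketch {
  // Counts how many times the (hypothetical) expensive load ran.
  val loads = new AtomicInteger(0)

  final class FakeModel { def score(x: Float): Float = x * 2 }

  // Initialized on first access only; later accesses reuse the same instance.
  lazy val model: FakeModel = {
    loads.incrementAndGet()
    new FakeModel
  }
}
```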
Summary
- Candidate Generation uses Batch LSH Search over terms from visual
embeddings.
- Batch LSH scales to billions of objects in the index and is embedding
agnostic.
- Candidate Selection uses a TF classifier over raw visual embeddings.
- Two-pass transitive closure to cluster results.
Thanks!
