Community Over Code
07/10/2023
Alessandro Benedetti, Director @ Sease
Introducing Multi-valued Vector Fields in Apache Lucene
WHO AM I? ALESSANDRO BENEDETTI
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master's degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Passionate about semantic search, NLP and machine learning technologies
‣ Beach volleyball player and snowboarder
SEArch SErvices
www.sease.io
‣ Headquartered in London/distributed
‣ Open-source enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community contributors
‣ Active researchers
‣ Hot trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
AGENDA
Why Multi-valued?
HNSW and modifications
Index time internals
Challenges of a contribution
Query time internals
WHAT
What can you do now?
The text content of a field exceeds the maximum number of characters accepted by your inference model (to encode vectors).
1. Split the content into paragraphs across multiple documents.
2. Your unit of information becomes the paragraph.
3. When returning the results, you need to aggregate back to documents.
HOW
● Indexing time: nested documents (slow/expensive)
● Indexing time: flattened documents (redundant data)
● Query time: parent-child join queries? (slow/expensive)
● Query time: collapsing/grouping
● Aggregations: faceting becomes more complicated
● Stats: aggregating data and calculating stats is impacted
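The "split into paragraphs, then aggregate back to documents" workaround can be sketched in plain Java. This is only an illustration, not Lucene or Solr code: the `ParagraphHit` class, its `parentId` field and the choice of keeping the best paragraph score per parent are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParagraphCollapsing {
  // Hypothetical paragraph-level hit: each paragraph was indexed as its own
  // document and carries the id of the original (parent) document.
  static final class ParagraphHit {
    final String parentId;
    final float score;

    ParagraphHit(String parentId, float score) {
      this.parentId = parentId;
      this.score = score;
    }
  }

  // Collapses paragraph hits into one entry per parent document,
  // keeping the best paragraph score for each parent.
  static Map<String, Float> collapseToParents(List<ParagraphHit> hits) {
    Map<String, Float> bestByParent = new LinkedHashMap<>();
    for (ParagraphHit hit : hits) {
      bestByParent.merge(hit.parentId, hit.score, Float::max);
    }
    return bestByParent;
  }
}
```

Every query has to pay for this extra grouping step, which is part of the cost listed on the slide.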
WHY MULTI-VALUED
● This actually applies to all fields and field types
● you may be OK applying those strategies …
● … but for some users they may be quite annoying and expensive
What does it mean to bring multi-valued to vectors?
● K Nearest Neighbour algorithm?
● Indexing data structures and approach?
● Query time data structures and approach?
ANN - Approximate Nearest Neighbor
● Exact Nearest Neighbor is expensive! (the query is compared 1 vs 1 with every vector)
● it's fine to lose some accuracy to get a massive performance gain
● pre-process the dataset to build index data structures
● Generally vectors are modelled in:
○ Trees
○ Hashes
○ Graphs - HNSW
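To make the cost of exact search concrete, here is a minimal brute-force sketch. It is not taken from Lucene: the class name, the squared Euclidean distance and the in-memory `float[][]` layout are assumptions for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

final class ExactNearestNeighbours {
  // Squared Euclidean distance between two vectors of the same length.
  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  // Returns the indexes of the k vectors closest to the query.
  // Every vector is compared with the query, so the cost grows with
  // (number of vectors x number of dimensions) for each search.
  static int[] topK(float[][] vectors, float[] query, int k) {
    Integer[] order = new Integer[vectors.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order,
        Comparator.comparingDouble((Integer i) -> squaredDistance(vectors[i], query)));
    int[] result = new int[Math.min(k, order.length)];
    for (int i = 0; i < result.length; i++) {
      result[i] = order[i];
    }
    return result;
  }
}
```

ANN structures such as HNSW avoid this full scan by visiting only a small part of the dataset, at the price of possibly missing a true neighbour.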
HNSW - Hierarchical Navigable Small World graphs
Hierarchical Navigable Small World (HNSW) graphs are among the top-performing index-time data structures for approximate nearest neighbor (ANN) search.
References
https://doi.org/10.1016/j.is.2013.10.006
https://arxiv.org/abs/1603.09320
HNSW - How it works in a nutshell
● Proximity graph
● Vertices are vectors, closer vectors are linked
● Hierarchical layers based on skip lists
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
HNSW - Skip Lists
● the higher the layer, the more sparse
● descending in layers while searching
● fast to search and insert
HNSW - Small World Graphs
● start from the entry point
● greedy search (at each step the distance is calculated across friends)
● starting from zoomed out (low degree) to zoomed in (high degree)
● when building the graph, a higher average degree improves quality at a cost
image from https://www.pinecone.io/learn/hnsw/
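The greedy walk on a single layer can be sketched as follows. This is a simplified illustration, not Lucene's HnswGraphSearcher: the adjacency arrays, the method names and the squared Euclidean distance are assumptions, and a real search keeps a queue of candidates rather than a single current node.

```java
final class GreedyGraphSearch {
  // Squared Euclidean distance between two vectors of the same length.
  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  // Greedy walk on one layer: from the entry point, move to the friend closest
  // to the query, and repeat until no friend improves on the current node.
  // The result is a local minimum, not necessarily the true nearest vector.
  static int search(float[][] vectors, int[][] friends, int entryPoint, float[] query) {
    int current = entryPoint;
    double currentDistance = squaredDistance(vectors[current], query);
    boolean improved = true;
    while (improved) {
      improved = false;
      // The friends of the node we started this round from are all checked;
      // 'current' ends the round as the best of them, if any is closer.
      for (int friend : friends[current]) {
        double distance = squaredDistance(vectors[friend], query);
        if (distance < currentDistance) {
          current = friend;
          currentDistance = distance;
          improved = true;
        }
      }
    }
    return current;
  }
}
```

In the hierarchical version, the node returned for one layer becomes the entry point of the layer below.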
HNSW - Index time
● add one vector at a time
● probability to enter layer N -> identify the layer(s) of insertion
● when added to a layer, it also goes to all the layers below
● topk=1: the closest neighbour is identified
● we descend and repeat until the layer of insertion
● topk=ef_construction to identify neighbour candidates
● M neighbours are linked (the easiest way is to calculate the exact distance)
image from https://www.pinecone.io/learn/hnsw/
Multi-Valued
- each node is not a document
- multiple vectors per document Id
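The "probability to enter layer N" step can be illustrated with the level formula from the HNSW paper, level = floor(-ln(u) * mL), with the commonly suggested mL = 1 / ln(M). The sketch below assumes that formula; the class and method names are hypothetical and the code is not copied from Lucene.

```java
import java.util.Random;

final class HnswLevels {
  // Maps a uniform sample u in (0, 1] to a layer. Most samples map to layer 0;
  // each higher layer is exponentially less likely.
  static int levelFor(double u, int maxConn) {
    double ml = 1.0 / Math.log(maxConn);
    return (int) (-Math.log(u) * ml);
  }

  // Draws the insertion layer for a new vector.
  static int randomLevel(Random random, int maxConn) {
    // nextDouble() is in [0, 1), so 1 - nextDouble() is in (0, 1] and ln is defined.
    return levelFor(1.0 - random.nextDouble(), maxConn);
  }
}
```

A vector assigned to layer L is then linked in layer L and in every layer below it, down to layer 0.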
HNSW - Search time
● Start from layer N (top)
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
Multi-Valued
the same document Id may be added to the top-K results multiple times
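HNSW - MAX/SUM approach
MAX
when adding a vector from a document that is already in the results, the document score is updated to the maximum of the old and the new score
SUM
when adding a vector from a document that is already in the results, the new score is added to the document score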
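A minimal sketch of the two strategies, assuming scores where higher is better. The `DocumentScores` class and its `Strategy` enum are hypothetical names for this example, not part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;

final class DocumentScores {
  enum Strategy { MAX, SUM }

  private final Strategy strategy;
  private final Map<Integer, Float> scoreByDoc = new HashMap<>();

  DocumentScores(Strategy strategy) {
    this.strategy = strategy;
  }

  // Records the score of one vector that belongs to docId.
  // The first vector of a document sets its score; later ones update it.
  void add(int docId, float vectorScore) {
    if (strategy == Strategy.MAX) {
      scoreByDoc.merge(docId, vectorScore, Float::max);
    } else {
      scoreByDoc.merge(docId, vectorScore, Float::sum);
    }
  }

  // Current score of the document, or 0 if none of its vectors was seen.
  float score(int docId) {
    return scoreByDoc.getOrDefault(docId, 0f);
  }
}
```

With MAX a document is as good as its best paragraph; with SUM a document with many moderately relevant paragraphs can outrank one with a single very relevant paragraph.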
Apache Lucene
Nov 2020 - Apache Lucene 9.0
Dedicated File Format for Navigable Small World Graphs
https://issues.apache.org/jira/browse/LUCENE-9004
Jan 2022 - Apache Lucene 9.0
Handle Document Deletions
https://issues.apache.org/jira/browse/LUCENE-10040
Feb 2022 - Apache Lucene 9.1
Introduced Hierarchy in HNSW
https://issues.apache.org/jira/browse/LUCENE-10054
Mar 2022 - Apache Lucene 9.1
Re-use data structures across HNSW Graph
https://issues.apache.org/jira/browse/LUCENE-10391
Mar 2022 - Apache Lucene 9.1
Pre filters with KNN queries
https://issues.apache.org/jira/browse/LUCENE-10382
Aug 2022 - Apache Lucene 9.4
8 bits vector quantization
https://github.com/apache/lucene/issues/11613
JIRA ISSUES:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20LUCENE%20AND%20labels%20%3D%20vector-based-search
GITHUB ISSUES:
https://github.com/apache/lucene/labels/vector-based-search
9.8 - Parent Join for Vectors
Released 28/09/23
Merged in 9.8
● PROS:
- a child vector is a document, so it can have metadata
● CONS:
- performance?
● NodeIdCachingHeap
● Knn Collector (various implementations)
○ ToParentJoinKnnCollector
● ToParentBlockJoinKnnVectorQuery.java
● For more info: Benjamin Trent
Apache Lucene
GitHub
Pull Request
INDEXING - Auxiliary Data Structures
MAP VECTOR IDS TO DOCUMENT IDS
in a multi-valued scenario multiple vectors may belong to the
same document
● leverage the existing sparse support
● ordinal (vector Id) to document (document Id) map
● DocsWithVectorsSet to keep track of the vectors per document
● DirectMonotonicWriter to write the map
Lucene95HnswVectorsWriter
Write auxiliary data structures
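The ordinal-to-document map can be pictured as an array with one entry per vector. The sketch below builds it from per-document vector counts; the class name, method name and array-based layout are assumptions for illustration, not the structures Lucene writes to disk.

```java
final class OrdToDocMap {
  // Builds the ordinal (vector Id) -> document Id array.
  // docIds must be sorted ascending; vectorCounts[i] is the number of
  // vectors that belong to docIds[i].
  static int[] build(int[] docIds, int[] vectorCounts) {
    int total = 0;
    for (int count : vectorCounts) {
      total += count;
    }
    int[] ordToDoc = new int[total];
    int ord = 0;
    for (int i = 0; i < docIds.length; i++) {
      for (int v = 0; v < vectorCounts[i]; v++) {
        ordToDoc[ord++] = docIds[i]; // the same doc Id repeats once per vector
      }
    }
    return ordToDoc;
  }
}
```

Because documents are visited in increasing order, the resulting sequence never decreases, which is what makes a monotonic encoding applicable.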
INDEXING - DocsWithVectorsSet
DocsWithVectorsSet
accumulator of documents that have vectors, used by the HnswVectorsWriter when writing data
● compatible with single-valued dense/sparse scenarios
● keeps a stack of vectors per document
● able to return the count of vectors for each document
DocsWithVectorsSet
INDEXING - DirectMonotonicWriter
DirectMonotonicWriter
writes a sequence of monotonically increasing (never decreasing) integers, in blocks
● each integer is a document Id
● the same document Id is repeated for each vector in the document
● DirectMonotonicReader ordToDoc is then used at reading time in SparseOffHeapVectorValues
● public int ordToDoc(int ord) -> the ordinal (vector Id) is the index used to locate the block and the position within the block, to finally get the document Id
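The "block, then position within the block" lookup can be sketched with a deliberately simplified in-memory encoding. This is not the DirectMonotonicWriter format, which is more compact; the class name and the layout below (one minimum per block plus plain int deltas) are assumptions that only illustrate how an ordinal is split into a block index and a position.

```java
final class BlockedMonotonicInts {
  private final int blockShift;       // each block holds 2^blockShift values
  private final int[] blockMin;       // first (smallest) value of each block
  private final int[][] blockDeltas;  // value minus blockMin, per position

  BlockedMonotonicInts(int[] monotonicValues, int blockShift) {
    this.blockShift = blockShift;
    int blockSize = 1 << blockShift;
    int blockCount = (monotonicValues.length + blockSize - 1) / blockSize;
    blockMin = new int[blockCount];
    blockDeltas = new int[blockCount][];
    for (int b = 0; b < blockCount; b++) {
      int start = b * blockSize;
      int length = Math.min(blockSize, monotonicValues.length - start);
      blockMin[b] = monotonicValues[start];
      blockDeltas[b] = new int[length];
      for (int i = 0; i < length; i++) {
        blockDeltas[b][i] = monotonicValues[start + i] - blockMin[b];
      }
    }
  }

  // Looks up the value stored for an ordinal.
  int get(int ord) {
    int block = ord >>> blockShift;                 // high bits: which block
    int position = ord & ((1 << blockShift) - 1);   // low bits: where in the block
    return blockMin[block] + blockDeltas[block][position];
  }
}
```

Since the values never decrease, the deltas inside a block stay small, which is why a per-block encoding pays off.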
INDEXING - building HNSW Graph
NODE ID is the VECTOR ID
same as the sparse scenario, each node in the graph has an incremental ID
aligned with the vector ID
● the node count in the graph = the vector count
● no code changes
QUERY TIME - Exact Search
Vector Scorer
(naive solution) all vectors are iterated, only the ones corresponding
to an accepted doc are scored
● VectorScorer scores only the documents in BitSet acceptedDocs
● all vectors from ByteVectorValues/FloatVectorValues are iterated
● scores are updated with MAX/SUM
AbstractKnnVectorQuery
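The naive exact search can be sketched as a single pass over all vectors. The example is an assumption-laden illustration rather than the AbstractKnnVectorQuery code: it uses `java.util.BitSet` in place of Lucene's own BitSet, a plain `int[]` as the ordinal-to-document map, dot product as the similarity, and the MAX strategy.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

final class ExactMultiValuedSearch {
  static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Iterates all vectors; a vector is scored only if its document is accepted.
  // Each document keeps the MAX score across its vectors.
  static Map<Integer, Float> search(
      float[][] vectors, int[] ordToDoc, BitSet acceptedDocs, float[] query) {
    Map<Integer, Float> scoreByDoc = new HashMap<>();
    for (int ord = 0; ord < vectors.length; ord++) {
      int docId = ordToDoc[ord];
      if (!acceptedDocs.get(docId)) {
        continue; // filtered out: no distance computation for this vector
      }
      scoreByDoc.merge(docId, dotProduct(vectors[ord], query), Float::max);
    }
    return scoreByDoc;
  }
}
```

Swapping `Float::max` for `Float::sum` gives the SUM variant.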
QUERY TIME - Approximate Search
HNSW SEARCH
searching on vectors (graph nodes) and returning documents (max/sum score)
● searching on level != 0 -> vectors are added as candidates/results
● searching on level = 0 -> document ID is added to the results
● int docId = vectors.ordToDoc(vectorId);
● results are added to NeighborQueue
HnswGraphSearcher
QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue)
data structure used to collect the top-k results
as a long heap
● MIN HEAP
● each element is a long: [32 bits][32 bits] -> [score][~document Id]
NeighborQueue
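The [score][~document Id] layout can be reproduced in a few lines. The sketch below is self-contained and written for illustration: the class and method names are assumptions, and the float-to-sortable-int conversion is reimplemented here instead of calling a Lucene utility.

```java
final class NeighborEncoding {
  // Converts float bits to an int whose natural ordering matches the float
  // ordering (negative floats would otherwise sort in reverse). Self-inverse.
  static int sortableFloatBits(int bits) {
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  // High 32 bits: sortable score. Low 32 bits: bitwise complement of the id.
  static long encode(int docId, float score) {
    int sortableScore = sortableFloatBits(Float.floatToIntBits(score));
    return (((long) sortableScore) << 32) | (0xFFFFFFFFL & ~docId);
  }

  static float decodeScore(long encoded) {
    return Float.intBitsToFloat(sortableFloatBits((int) (encoded >> 32)));
  }

  static int decodeDocId(long encoded) {
    return (int) ~encoded; // complement again to recover the original id
  }
}
```

Comparing two encoded longs compares the scores first; for equal scores, the complement makes the smaller id the larger value. A heap of primitive longs therefore needs no custom comparator and no object per entry.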
QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue)
data structure used to collect the top-k results
as a long heap
● the nodeIdToHeapIndex cache is used to keep track of each node's position in the heap
● the score is updated for the node (MAX/SUM)
● DOWNHEAP (as the ranking may have improved)
NeighborQueue
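The update-in-place idea can be sketched with a small min heap that remembers where each id sits. This is an illustrative stand-in, not the NeighborQueue from the pull request: the class name, the separate id and score arrays, the `HashMap` used as position cache and the MAX-only update are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class UpdatableMinHeap {
  private final int[] ids;
  private final float[] scores;
  private final Map<Integer, Integer> idToHeapIndex = new HashMap<>();
  private int size = 0;

  UpdatableMinHeap(int capacity) {
    ids = new int[capacity];
    scores = new float[capacity];
  }

  // Adds a new id, or keeps the MAX of the old and new score if the id is
  // already in the heap. Assumes at most 'capacity' distinct ids are added.
  void addOrUpdateMax(int id, float score) {
    Integer index = idToHeapIndex.get(id);
    if (index == null) {
      ids[size] = id;
      scores[size] = score;
      idToHeapIndex.put(id, size);
      upHeap(size++);
    } else if (score > scores[index]) {
      scores[index] = score;
      downHeap(index); // a larger score can only move away from the root
    }
  }

  int topId() { return ids[0]; }          // id with the lowest score

  float topScore() { return scores[0]; }  // lowest score in the heap

  private void upHeap(int i) {
    while (i > 0) {
      int parent = (i - 1) / 2;
      if (scores[parent] <= scores[i]) {
        break;
      }
      swap(i, parent);
      i = parent;
    }
  }

  private void downHeap(int i) {
    while (true) {
      int left = 2 * i + 1;
      int right = left + 1;
      int smallest = i;
      if (left < size && scores[left] < scores[smallest]) smallest = left;
      if (right < size && scores[right] < scores[smallest]) smallest = right;
      if (smallest == i) {
        return;
      }
      swap(i, smallest);
      i = smallest;
    }
  }

  private void swap(int a, int b) {
    int id = ids[a]; ids[a] = ids[b]; ids[b] = id;
    float score = scores[a]; scores[a] = scores[b]; scores[b] = score;
    idToHeapIndex.put(ids[a], a); // keep the position cache in sync
    idToHeapIndex.put(ids[b], b);
  }
}
```

Without the position cache, finding the entry to update would require scanning the whole heap on every hit from an already collected document.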
Challenges - side project and merging
● to build the first prototype -> 1 year
● super active area -> constant merging
● Lucene codecs change names and the old codec is moved to the backward codecs
● 85 classes DIFF!
○ simplified by temporarily removing MAX/SUM
○ simplified by temporarily removing separate code branches for single/multi-valued
○ down to 25 classes!
○ thanks to Benjamin Trent, Mayya Sharipova, Jim Ferenczi, Josh Devins for the first reviews
WRAPPING UP
Why Multi-valued?
HNSW and modifications
Index time internals
Challenges of a contribution
Query time internals
DO YOU WANT TO MAKE IT HAPPEN?
HELP WITH CODE
Pull Request
HELP WITH FUNDING
info@sease.io
After Lucene 9.8 significant changes happened
THANK YOU!
@seaseltd @sease-ltd
@seaseltd @sease_ltd
