Multi Valued Vectors Lucene

Community Over Code
07/10/2023
Alessandro Benedetti, Director @ Sease
Introducing Multi-valued Vector Fields in Apache Lucene

WHO AM I?
ALESSANDRO BENEDETTI
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master's degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Passionate about Semantic Search, NLP and Machine Learning technologies
‣ Beach Volleyball player and Snowboarder

SEArch SErvices
www.sease.io
‣ Headquartered in London, distributed team
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
‣ Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning

AGENDA
Why Multi-valued?
HNSW and modifications
Index time internals
Challenges of a contribution
Query time internals

What can you do now?
The text content of a field exceeds the maximum number of characters accepted by your inference model (to encode vectors).
WHAT
1. Split the content into paragraphs across multiple documents.
2. Your unit of information becomes the paragraph.
3. When returning the results, you need to aggregate back to documents.
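A minimal sketch of this workaround, assuming paragraphs are separated by blank lines and that each hit comes back as a (parent id, score) pair sorted by descending score. The field names `id`, `parent_id` and `text` and both function names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def split_into_chunks(doc_id: str, text: str) -> List[Dict[str, str]]:
    """Create one indexable unit per paragraph, each pointing back to its parent."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {"id": f"{doc_id}_{i}", "parent_id": doc_id, "text": paragraph}
        for i, paragraph in enumerate(paragraphs)
    ]


def collapse_to_parents(hits: List[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Keep only the best-scoring paragraph per parent (hits sorted by descending score)."""
    best_per_parent: Dict[str, float] = {}
    for parent_id, score in hits:
        if parent_id not in best_per_parent:
            best_per_parent[parent_id] = score
    return list(best_per_parent.items())[:k]
```

The collapse step is the extra work that every client has to repeat at query time once the paragraph becomes the unit of information.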

HOW
1. Split the content into paragraphs across multiple documents.
● Indexing Time: nested documents (slow/expensive)
● Indexing Time: flattened documents (redundant data)
2. Your unit of information becomes the paragraph.
● Query Time: parent-child join queries? (slow/expensive)
● Query Time: collapsing/grouping
3. When returning the results, you need to aggregate back to documents.
● Aggregations: faceting becomes more complicated
● Stats: aggregating data and calculating stats is impacted

WHY MULTI-VALUED
● This actually applies to all fields and field types
● You may be OK applying those strategies …
● … but for some users they may be quite annoying and expensive

What does it mean to bring multi-valued to vectors?
● K Nearest Neighbour Algorithm?
● Indexing data structures and approach?
● Query time data structures and approach?

ANN - Approximate Nearest Neighbor
● Exact Nearest Neighbor is expensive! (1 vs 1 vector distance between the query and every indexed vector)
● It's fine to lose some accuracy to get a massive performance gain
● Pre-process the dataset to build index data structures
● Generally vectors are modelled in:
○ Trees
○ Hashes
○ Graphs - HNSW
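To make the cost of the exact approach concrete, here is a minimal brute-force sketch. It assumes Euclidean distance and plain Python lists as vectors; `exact_knn` is a hypothetical name, not a Lucene API.

```python
import heapq
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[int]:
    """Return the ids of the k closest vectors by comparing the query with every vector."""
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
```

Every query computes one distance per indexed vector, which is the cost ANN structures such as HNSW try to avoid.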

HNSW - Hierarchical Navigable Small World graphs
Hierarchical Navigable Small World (HNSW) graphs are among the top-performing index-time data structures for approximate nearest neighbor (ANN) search.
References
https://doi.org/10.1016/j.is.2013.10.006
https://arxiv.org/abs/1603.09320

HNSW - How it works in a nutshell
● Proximity graph
● Vertices are vectors; closer vectors are linked
● Hierarchical layers based on skip lists
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
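The greedy descent described above can be sketched as follows. This is a simplified illustration, assuming each layer is an adjacency dictionary (node id to list of friends), layers are ordered from top to bottom, and squared Euclidean distance is used; a real HNSW search keeps a candidate queue on the bottom layer instead of a single node.

```python
from typing import Dict, List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (enough for comparisons)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_search(graph: Dict[int, List[int]], vectors, query, entry_point: int) -> int:
    """Move to the closest friend until no friend is closer (a local minimum)."""
    current = entry_point
    while True:
        best = min(graph[current], key=lambda n: sq_dist(query, vectors[n]), default=current)
        if sq_dist(query, vectors[best]) >= sq_dist(query, vectors[current]):
            return current
        current = best


def search_layers(layers: List[Dict[int, List[int]]], vectors, query, entry_point: int) -> int:
    """Descend from the top layer, using each local minimum as the next entry point."""
    for graph in layers:
        entry_point = greedy_search(graph, vectors, query, entry_point)
    return entry_point
```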

HNSW - Skip Lists
● the higher the layer, the more sparse
● descending in layers while searching
● fast to search and insert

HNSW - Small World Graphs
● start from an entry point
● greedy search (each time, the distance is calculated across friends)
● go from zoomed out (low degree) to zoomed in (high degree)
● when building the graph, a higher average degree improves quality at a cost
image from https://www.pinecone.io/learn/hnsw/

HNSW - Index time
● add one vector at a time
● each vector has a probability of entering layer N
● once added to a layer, it is also added to all the layers below
-> identify the layer(s) of insertion
● above the layer of insertion, the topk=1 closest neighbour is identified
● we descend and repeat until the layer of insertion
● topk=ef_construction to identify neighbour candidates
● M neighbours are linked (the easiest approach is to calculate the exact distance)
image from https://www.pinecone.io/learn/hnsw/
Multi-Valued
- each node is not a document
- multiple vectors per document Id
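As an illustration of the "probability to enter layer N" step, here is a sketch of the level assignment from the HNSW paper referenced earlier in the deck, with the normalisation factor set to 1/ln(M) as the paper recommends. The function names are hypothetical and this is not Lucene's code.

```python
import math
import random


def assign_level(u: float, m: int) -> int:
    """Map a uniform draw u in (0, 1] to a layer: level = floor(-ln(u) / ln(m))."""
    return int(-math.log(u) * (1.0 / math.log(m)))


def random_level(m: int, rng: random.Random) -> int:
    """Draw the top layer for a new vector; it is then present in all layers below."""
    return assign_level(1.0 - rng.random(), m)  # 1 - random() avoids log(0)
```

With M = 16, only about one vector in sixteen reaches layer 1 or above, which is what keeps the higher layers sparse.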

HNSW - Search time
● Start from layer N (top)
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
Multi-Valued
the same document Id may be added to the top-K results multiple times

HNSW - MAX/SUM approach
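MAX
when adding a vector from a document that is already in the results, you update the document score with the max of the vector scores
SUM
when adding a vector from a document that is already in the results, you update the document score by summing the vector scores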
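The two strategies can be sketched as a small aggregation over (document Id, vector score) pairs. The function name `aggregate` and the strategy strings are hypothetical; the slides only name the MAX and SUM behaviours.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def aggregate(vector_hits: Iterable[Tuple[Hashable, float]], strategy: str = "max") -> List[Tuple[Hashable, float]]:
    """Merge vector-level hits into document-level scores, best document first."""
    scores: Dict[Hashable, float] = {}
    for doc_id, score in vector_hits:
        if doc_id not in scores:
            scores[doc_id] = score          # first vector seen for this document
        elif strategy == "max":
            scores[doc_id] = max(scores[doc_id], score)
        elif strategy == "sum":
            scores[doc_id] += score
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Note that the ranking can change with the strategy: a document with many moderately similar vectors can overtake a document with a single very similar vector under SUM.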

Apache Lucene
Nov 2020 - Apache Lucene 9.0
Dedicated File Format for Navigable Small World Graphs
https://issues.apache.org/jira/browse/LUCENE-9004
Jan 2022 - Apache Lucene 9.0
Handle Document Deletions
Feb 2022 - Apache Lucene 9.1
Introduced Hierarchy in HNSW
Mar 2022 - Apache Lucene 9.1
Re-use data structures across HNSW Graph
Mar 2022 - Apache Lucene 9.1
Pre filters with KNN queries
Aug 2022 - Apache Lucene 9.4
8 bits vector quantization
https://github.com/apache/lucene/issues/11613
JIRA ISSUES:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20LUCENE%20AND%20labels%20%3D%20vector-based-search
GITHUB ISSUES:
https://github.com/apache/lucene/labels/vector-based-search

9.8 - Parent Join for Vectors
Released 28/09/23
Merged in 9.8
● PROS:
- a child vector is a document, so it can have metadata
● CONS:
- performance?
● NodeIdCachingHeap
● Knn Collector (various implementations)
○ ToParentJoinKnnCollector
● ToParentBlockJoinKnnVectorQuery.java
● For more info: Benjamin Trent

Apache Lucene
GitHub Pull Request

INDEXING - Auxiliary Data Structures
MAP VECTOR IDS TO DOCUMENT IDS
in a multi-valued scenario multiple vectors may belong to the
same document
● leverage sparse support
● ordinal (vector Id) to document (document Id) map
● DocsWithVectorsSet to keep track of the vectors per document
● DirectMonotonicWriter to write the map
Lucene95HnswVectorsWriter
Write auxiliary data structures

INDEXING - DocsWithVectorsSet
DocsWithVectorsSet
accumulator of documents that have vectors, used by the HnswVectorsWriter when writing data
● compatible with single valued dense/sparse scenarios
● keep a stack of vectors per document
● able to return a count of vectors for each document
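A rough sketch of what such an accumulator could look like, assuming documents are added in increasing document Id order (one call per vector). The class below is hypothetical and only mirrors the behaviour listed on the slide; it is not the code from the pull request.

```python
from typing import List


class DocsWithVectors:
    """Accumulates the documents that have vectors and how many vectors each one has."""

    def __init__(self) -> None:
        self.doc_ids: List[int] = []
        self.vector_counts: List[int] = []

    def add(self, doc_id: int) -> None:
        """Register one more vector for doc_id (doc ids must never decrease)."""
        if self.doc_ids and doc_id < self.doc_ids[-1]:
            raise ValueError("document ids must be added in non-decreasing order")
        if self.doc_ids and self.doc_ids[-1] == doc_id:
            self.vector_counts[-1] += 1      # multi-valued: same document, new vector
        else:
            self.doc_ids.append(doc_id)
            self.vector_counts.append(1)

    def cardinality(self) -> int:
        """Number of distinct documents with at least one vector."""
        return len(self.doc_ids)

    def ord_to_doc(self) -> List[int]:
        """Vector ordinal -> document id: each doc id repeated once per vector."""
        return [d for d, count in zip(self.doc_ids, self.vector_counts) for _ in range(count)]
```

With one vector per document this degenerates to the single-valued case, where the ordinal-to-document map simply lists each document once.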

INDEXING - DirectMonotonicWriter
DirectMonotonicWriter
writes a sequence of monotonically increasing (never decreasing) integers, in blocks
● each integer is a document Id
● the same document Id is repeated for each vector in the document
● DirectMonotonicReader ordToDoc is then used at reading time in the SparseOffHeapVectorValues
● public int ordToDoc(int ord) -> the ordinal (vector Id) is the index used to locate the block and the position within the block, to finally get the document Id
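The block-based lookup can be illustrated with a simplified sketch. It assumes each block stores a minimum residual, an average increment and small non-negative residuals; this is only inspired by the description above and is not Lucene's actual on-disk format. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    min_residual: int      # smallest deviation from the expected value in the block
    avg_inc: float         # average increment between consecutive values
    residuals: List[int]   # small non-negative corrections, cheap to bit-pack


def write_monotonic(values: List[int], block_shift: int) -> List[Block]:
    """Encode a non-decreasing sequence in blocks of 2**block_shift values."""
    block_size = 1 << block_shift
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        avg_inc = (chunk[-1] - chunk[0]) / max(1, len(chunk) - 1)
        raw = [value - int(avg_inc * i) for i, value in enumerate(chunk)]
        lowest = min(raw)
        blocks.append(Block(lowest, avg_inc, [r - lowest for r in raw]))
    return blocks


def ord_to_doc(blocks: List[Block], block_shift: int, ordinal: int) -> int:
    """The ordinal selects the block (high bits) and the position in it (low bits)."""
    block = blocks[ordinal >> block_shift]
    position = ordinal & ((1 << block_shift) - 1)
    return block.min_residual + int(block.avg_inc * position) + block.residuals[position]
```

Because the sequence never decreases, the residuals stay small, which is what makes this kind of encoding compact even when the same document Id is repeated for several vectors.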

INDEXING - building HNSW Graph
NODE ID is the VECTOR ID
same as the sparse scenario, each node in the graph has an incremental ID
aligned with the vector ID
● the nodes count in the graph = vector count
● no code changes

QUERY TIME - Exact Search
Vector Scorer
(naive solution) all vectors are iterated, only the ones corresponding
to an accepted doc are scored
● VectorScorer scores only the documents in the BitSet acceptedDocs
● all vectors from ByteVectorValues/FloatVectorValues are iterated
● document scores are updated with MAX/SUM
AbstractKnnVectorQuery
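A minimal sketch of this naive exact search, assuming dot product as the similarity, a list `ord_to_doc` mapping each vector ordinal to its document Id, and a Python set standing in for the accepted-docs BitSet. The function is hypothetical and does not reproduce the Lucene classes named above.

```python
import heapq
from typing import Dict, List, Sequence, Set, Tuple


def exact_search(query: Sequence[float], vectors: List[Sequence[float]], ord_to_doc: List[int],
                 accepted_docs: Set[int], k: int, strategy: str = "max") -> List[Tuple[int, float]]:
    """Iterate all vectors, score only those of accepted documents, return the top-k documents."""
    doc_scores: Dict[int, float] = {}
    for ordinal, vector in enumerate(vectors):
        doc_id = ord_to_doc[ordinal]
        if doc_id not in accepted_docs:
            continue                                   # filtered out: never scored
        score = sum(q * v for q, v in zip(query, vector))
        if doc_id not in doc_scores:
            doc_scores[doc_id] = score
        elif strategy == "max":
            doc_scores[doc_id] = max(doc_scores[doc_id], score)
        else:                                          # "sum"
            doc_scores[doc_id] += score
    return heapq.nlargest(k, doc_scores.items(), key=lambda item: item[1])
```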

QUERY TIME - Approximate Search
HNSW SEARCH
searching on vectors(graph nodes) and returning
documents(max/sum score)
● searching on level != 0 -> vectors are added as candidates/results
● searching on level = 0 -> document ID is added to the results
● int docId = vectors.ordToDoc(vectorId);
● results are added to NeighborQueue
HnswGraphSearcher

QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue)
data structure used to collect the top-k results
as a long heap
● MIN HEAP
● each element is a long: [32 bits][32 bits] -> [score][~document Id]
NeighborQueue
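The [score][~document Id] layout can be sketched as follows, assuming the float score is first converted to a sortable 32-bit integer (a common bit trick for IEEE 754 floats) so that comparing the packed values compares scores first. The helper names are hypothetical; Python integers are unbounded, so the sign handling differs slightly from a Java long.

```python
import struct
from typing import Tuple

MASK_32 = 0xFFFFFFFF


def float_to_sortable_int(score: float) -> int:
    """Signed 32-bit integer whose ordering matches the ordering of the floats."""
    bits = struct.unpack(">i", struct.pack(">f", score))[0]
    return bits ^ ((bits >> 31) & 0x7FFFFFFF)


def encode(doc_id: int, score: float) -> int:
    """High 32 bits: sortable score. Low 32 bits: bitwise complement of the document Id."""
    return (float_to_sortable_int(score) << 32) | (~doc_id & MASK_32)


def decode(packed: int) -> Tuple[int, float]:
    """Recover (document Id, score) from a packed value."""
    sortable = packed >> 32
    bits = sortable ^ ((sortable >> 31) & 0x7FFFFFFF)
    score = struct.unpack(">f", struct.pack(">i", bits))[0]
    doc_id = ~(packed & MASK_32) & MASK_32
    return doc_id, score
```

Storing the complement of the document Id means that, for equal scores, the larger document Id produces the smaller packed value, so a min heap evicts it first.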

QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue)
data structure used to collect the top-k results
as a long heap
● nodeIdToHeapIndex cache is used to keep track of the nodes' positions
● score is updated for the node (MAX/SUM)
● DOWNHEAP (as the ranking may have improved)
NeighborQueue
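A sketch of a min heap with a position cache, to show why the update is followed by a downheap. It assumes non-negative scores, so an update can only increase a document score and move the entry away from the root. The class and method names are hypothetical; only the idea of the position cache comes from the slide.

```python
from typing import Dict, List


class DocScoreHeap:
    """Min heap of [score, doc_id] entries plus a doc_id -> heap index cache."""

    def __init__(self) -> None:
        self.heap: List[list] = []
        self.position: Dict[int, int] = {}

    def _swap(self, i: int, j: int) -> None:
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.position[self.heap[i][1]] = i
        self.position[self.heap[j][1]] = j

    def _up(self, i: int) -> None:
        while i > 0 and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _down(self, i: int) -> None:
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < len(self.heap) and self.heap[child][0] < self.heap[smallest][0]:
                    smallest = child
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def add(self, doc_id: int, score: float, strategy: str = "max") -> None:
        if doc_id in self.position:                  # document already collected
            i = self.position[doc_id]
            old = self.heap[i][0]
            self.heap[i][0] = max(old, score) if strategy == "max" else old + score
            self._down(i)                            # score grew: entry may sink from the root
        else:
            self.heap.append([score, doc_id])
            self.position[doc_id] = len(self.heap) - 1
            self._up(len(self.heap) - 1)

    def worst(self) -> int:
        """Document Id at the root, i.e. the first candidate for eviction."""
        return self.heap[0][1]
```

Without the position cache, finding the entry of an already collected document would require scanning the whole heap on every vector hit.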

Challenges - side project and merging
● building the first prototype took 1 year
● super active area -> frequent merging
● Lucene codecs change names, and the old codec is moved to the backward codecs
● 85 classes DIFF!
○ simplified by temporarily removing MAX/SUM
○ simplified by temporarily removing the separate code branches for single/multi-valued
○ down to 25 classes!
○ thanks to Benjamin Trent, Mayya Sharipova, Jim Ferenczi and Josh Devins for the first reviews

WRAPPING UP
Why Multi-valued?
HNSW and modifications
Index time internals
Challenges of a contribution
Query time internals

DO YOU WANT TO MAKE IT HAPPEN?
HELP WITH CODE
Pull Request
HELP WITH FUNDING
info@sease.io
After Lucene 9.8 significant changes happened

THANK YOU!
@seaseltd @sease-ltd
@seaseltd @sease_ltd
