Hybrid Search With Apache Solr

Speaker: Alessandro Benedetti, Director @ Sease
COMMUNITY OVER CODE EU 2024 - 03/06/2024

WHO AM I?
ALESSANDRO BENEDETTI
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master's degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Passionate about Semantic search, NLP and Machine Learning technologies
‣ Beach Volleyball player and Snowboarder

SEArch SErvices
www.sease.io
‣ Headquarters in London, distributed team
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
Hot Trends:
● Large Language Models Applications
● Vector-based (Neural) Search
● Natural Language Processing
● Learning To Rank
● Document Similarity
● Search Quality Evaluation
● Relevance Tuning

Overview
● Limitations of Vector-Based Search
● Hybrid Search
● Hybrid Retrieval in Apache Solr
● Hybrid Ranking in Apache Solr
● Future Works

Vector-based Search
● Training: Labeled Samples
● Indexing: Text to Vectors
● Searching: Query to Vector, Lookup in Index

Neural Search Workflow
Similarity between a Query and a Document is translated to distance in a vector space

Distance Between Vectors -> Similarity
● Specify the relationship metric between elements in the dataset
● Use-case dependent
○ experiment to see which one works better for you!
● In Information Retrieval, cosine similarity has proved to work quite well (it’s a normalised inner product)
https://www.pinecone.io/learn/what-is-similarity-search/
https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
https://www.baeldung.com/cs/euclidean-distance-vs-cosine-similarity
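As a small illustration of "cosine similarity is a normalised inner product", the sketch below uses plain Python and toy two-dimensional vectors (the vectors and function names are assumptions made for this example, not taken from the talk). It also computes the Euclidean distance, to show that the two metrics can disagree about how "close" two vectors are.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between the two vector end points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors (illustrative only).
query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # longer vector pointing the same way
orthogonal = [0.0, 2.0]      # vector at a right angle to the query

print(cosine_similarity(query, same_direction))   # direction only, length ignored
print(euclidean_distance(query, same_direction))  # length matters here
print(cosine_similarity(query, orthogonal))
```

Cosine similarity ignores vector length, while Euclidean distance does not, which is one reason the choice of metric is use-case dependent.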

Low Explainability
● High Dimensionality - vectors are long sequences of numerical values (768, 1536 …)
● Dimensions - in many cases each feature (element in the vector) has no clear semantics (slightly different from sparse vectors or explicit feature vectors)
● Values - it’s not obvious how a single value impacts relevance (higher is better?)
● Similarity - to explain why a search result is retrieved in the top-K, the vector distance is the only info you have
Research is happening but it’s still an open problem

Lexical matches?
● Search users still expect lexical matches to happen
● You can’t enforce that with vector-based search
“Why the document with the keyword in the title is not coming up?” cit.

Low Diversity
● By nature, vector-based search just returns the top-K ordered by vector similarity
● Unless you add more logic on top, you should expect low diversity by definition

Hybrid Search
● Mitigation of current vector-search problems - Is it here to stay?
● Combine traditional keyword-based (lexical) search with vector-based (neural) search
● Retrieval of two sets of candidates:
○ one set of results coming from lexical matches with the query keywords
○ one set of results coming from the K-Nearest Neighbours search with the query vector
● Ranking of the candidates

Apache Solr implementation
May 2022 - Apache Solr 9.0
Sease introduced support for KNN search (HNSW)
https://issues.apache.org/jira/browse/SOLR-15880
JIRA ISSUES
https://issues.apache.org/jira/browse/SOLR-15880?jql=labels%20%3D%20vector-based-search

Retrieval Stage
The hybrid candidate result set is the union of the results coming from the two models:
● the top-K results coming from the K-Nearest Neighbours search and the <numFound>
results coming from the lexical (keyword-based) search.
● The cardinality of the combined result set is <= (K + NumFound).
● The result set doesn’t include any duplicates.
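A union of the two candidate sets can be sketched with Solr's bool query parser, using one "should" clause per retrieval model. The Python helper below only assembles the request parameters. The helper name, the field names (text_field, vector), the edismax lexical query and the example vector are assumptions made for illustration, so adapt them to your own schema.

```python
def hybrid_union_params(lexical_terms, query_vector, top_k=10,
                        text_field="text_field", vector_field="vector"):
    """Build Solr request params for the union of lexical and KNN candidates.

    Field names and the edismax lexical query are illustrative assumptions.
    """
    vector = "[" + ", ".join(str(v) for v in query_vector) + "]"
    return {
        # Two optional ('should') clauses: a document matching either is returned.
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": "{!type=edismax qf=" + text_field + "}" + lexical_terms,
        "vectorQuery": "{!knn f=" + vector_field + " topK=" + str(top_k) + "}" + vector,
    }


params = hybrid_union_params("term1", [0.1, -0.2, 0.3], top_k=5)
for name, value in params.items():
    print(name, "=", value)
```

The resulting dictionary can be sent as query parameters to a collection's /select handler with any HTTP client.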

Retrieval Stage
The hybrid candidate result set is the intersection of the results coming from the two models:
● only the top-K results coming from the K-Nearest Neighbours search that also satisfy the lexical query are returned.
● The cardinality of the combined result set is <= K.
● This is effectively post-filtering the K-Nearest Neighbours results, with the difference that the lexical query also affects the score
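The intersection behaviour can be illustrated with a toy simulation in plain Python (this is not Solr code, and the document ids and scores are made up). It assumes both clauses are mandatory, as with "must" clauses of a boolean query, so a document has to appear in both candidate sets and its score is the sum of the two clause scores.

```python
def intersect_candidates(knn_top_k, lexical_matches):
    """Keep only KNN candidates that also match lexically; add up both scores."""
    return {
        doc_id: knn_score + lexical_matches[doc_id]
        for doc_id, knn_score in knn_top_k.items()
        if doc_id in lexical_matches
    }


# Toy data: doc id -> score from each retrieval model (illustrative values).
knn_top_k = {"doc1": 0.92, "doc2": 0.88, "doc3": 0.75}     # K = 3
lexical_matches = {"doc2": 4.1, "doc3": 1.3, "doc9": 6.0}  # numFound = 3

hybrid = intersect_candidates(knn_top_k, lexical_matches)
print(sorted(hybrid.items(), key=lambda item: item[1], reverse=True))
```

doc1 is dropped because it has no lexical match, and doc9 never enters the result because it is not in the top-K, so the result can never hold more than K documents.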

Bonus Point: PRE-FILTERING VS POST-FILTERING
● Before 9.1 -> fq were post-filters
● 9.1 onwards -> fq are pre-filters (to run a post-filter, see the cache local parameter:
https://solr.apache.org/guide/solr/latest/query-guide/common-query-parameters.html#cache-local-parameter)
● 9.6 onwards -> more flexibility with additional !knn query parser params

9.6 PRE-FILTERING VS POST-FILTERING
More flexibility with additional !knn query parser params:
○ preFilter -> Specifies an explicit list of Pre-Filter query strings to use.
N.B. if you specify this, all fq will become post-filters
○ includeTags -> Indicates that only fq filters with the specified tag should be considered for implicit Pre-Filtering. Must not be combined with preFilter.
○ excludeTags -> Indicates that fq filters with the specified tag should be excluded from consideration for implicit Pre-Filtering. Must not be combined with preFilter.
DEFAULT
● As the main q: same as 9.1 onwards, fq that are not post-filters become pre-filters automatically.
○ includeTags and excludeTags may be used to limit the set of fq filters used in the Pre-Filter.
● As an fq param, or as a subquery clause in a larger query: no implicit Pre-Filter is used.
○ includeTags and excludeTags must not be used in these situations.
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#explicit-knn-pre-filtering
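To make the three parameters concrete, here is a hedged Python sketch that assembles a knn query string and rejects the combination the slide forbids. The helper name, the single-tag arguments, the field name vector and the example filters are assumptions for illustration. It also assumes the preFilter value contains no whitespace, since no quoting is attempted.

```python
def knn_query(vector_field, top_k, query_vector,
              pre_filter=None, include_tag=None, exclude_tag=None):
    """Build a {!knn ...} query string with optional pre-filtering params."""
    if pre_filter is not None and (include_tag or exclude_tag):
        raise ValueError("preFilter must not be combined with includeTags/excludeTags")
    local_params = ["f=" + vector_field, "topK=" + str(top_k)]
    if pre_filter is not None:
        local_params.append("preFilter=" + pre_filter)
    if include_tag:
        local_params.append("includeTags=" + include_tag)
    if exclude_tag:
        local_params.append("excludeTags=" + exclude_tag)
    vector = "[" + ", ".join(str(v) for v in query_vector) + "]"
    return "{!knn " + " ".join(local_params) + "}" + vector


# Explicit pre-filter (hypothetical field inStock).
print(knn_query("vector", 10, [1.0, 2.0], pre_filter="inStock:true"))

# Only the fq tagged 'for_knn' is considered for implicit pre-filtering.
params = {
    "q": knn_query("vector", 10, [1.0, 2.0], include_tag="for_knn"),
    "fq": ["{!tag=for_knn}category:AAA", "inStock:true"],
}
print(params)
```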

9.6 PRE-FILTERING VS POST-FILTERING
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#explicit-knn-pre-filtering
Some use cases where includeTags and/or excludeTags may be more useful than explicit preFilter parameters:
● You have some fq parameters that are re-used on many requests (even when you don’t use the knn parser) that you wish to be used as KNN Pre-Filters when you do use the knn query parser.
● You typically want all fq params to be used as KNN Pre-Filters, but when users "drill down" on Facets, you want the fq parameters you add to be excluded from the KNN Pre-Filtering so that the result set gets smaller, instead of just computing a new topK set.

The Ranking Problem
● We need a score that reflects the best relevance ordering
● KNN candidates present a score [-1 … 1]
● Lexical candidates present an unbounded score (potentially on a completely different scale)
N.B. combining Lexical scores with vector-based similarity (and potentially other
features) is not a solved problem
What options do we have right now in Apache Solr?

Ranking Stage
● The filter component ignores any scoring and just builds the hybrid result set.
● The must clause is responsible for assigning the score, using the appropriate function query.
● The lexical score is min-max normalised to be scaled between 0 and 1 and it is added to the K-Nearest Neighbours score.
This simple linear combination of the scores could be a good starting point.
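The filter and must clauses described above can be sketched as Solr request parameters built from the bool query parser and the function queries sum, scale and query. The Python helper below only assembles the parameters. Its name, the parameter names (retrievalStage, rankingStage, normalisedLexicalQuery) and the example queries are assumptions made for illustration.

```python
def hybrid_sum_ranking_params(lexical_query, vector_query):
    """Request params: the filter clause retrieves, the must clause scores by a sum."""
    return {
        # filter builds the hybrid result set without scoring; must assigns the score.
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",
        # Normalised lexical score plus K-Nearest Neighbours score.
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        # Min-max normalisation of the lexical score into [0, 1].
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": lexical_query,
        "vectorQuery": vector_query,
    }


params = hybrid_sum_ranking_params(
    "{!type=edismax qf=text_field}term1",      # hypothetical lexical query
    "{!knn f=vector topK=10}[0.1, -0.2, 0.3]",  # hypothetical vector query
)
for name, value in params.items():
    print(name, "=", value)
```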

Ranking Stage
● The filter component ignores any scoring and just builds the hybrid result set.
● The must clause is responsible for assigning the score, using the appropriate function query.
● The lexical score is min-max normalised to be scaled between 0.1 and 1 and it is multiplied by the K-Nearest Neighbours score.
● Better? No evidence
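The arithmetic of the multiplicative variant can be worked through with a toy example in plain Python (not Solr code; the scores are made up, and only documents that have both scores are considered). The lower bound of 0.1 presumably keeps the lowest lexical score from zeroing the product.

```python
def min_max_scale(scores, low, high):
    """Linearly rescale the values of a dict into the interval [low, high]."""
    smallest, largest = min(scores.values()), max(scores.values())
    if largest == smallest:
        return {doc_id: high for doc_id in scores}
    span = largest - smallest
    return {doc_id: low + (score - smallest) * (high - low) / span
            for doc_id, score in scores.items()}


def combine_product(lexical_scores, knn_scores, low=0.1, high=1.0):
    """Multiply the normalised lexical score by the KNN score, per document."""
    scaled = min_max_scale(lexical_scores, low, high)
    return {doc_id: scaled[doc_id] * knn_scores[doc_id]
            for doc_id in scaled if doc_id in knn_scores}


# Toy scores (illustrative only).
lexical_scores = {"a": 2.0, "b": 6.0, "c": 10.0}
knn_scores = {"a": 0.9, "b": 0.5, "c": 0.2}

combined = combine_product(lexical_scores, knn_scores)
print(sorted(combined, key=combined.get, reverse=True))
```

In this toy data the document with a middling score in both models ends up first, which shows how strongly the choice of combination function shapes the final ordering.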

Learning To Rank
● Multiple factors (features) affect ranking
○ there is no quick answer on which mathematical function should combine them -> what do you want to optimise?
○ Sum? Normalised Sum? Multiplication? Linear or non-linear function?
○ Rather than manual trial and error, let’s use Machine Learning -> LTR
● Apache Solr has supported Learning To Rank since 6.4
○ from 9.3, Sease contributed the first support for vector similarity as a feature
○ First step -> sponsor us or contribute improvements!
● Build a training set <query, document> -> rating
○ <query, document> is described by a vector of numerical features: one of them can be your vector similarity, others may be lexical scores or business rules
https://sease.io/category/learning-to-rank
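A training set of this shape can be sketched in a few lines of Python. The class, the feature names (vector_similarity, lexical_title_score, is_in_stock) and the values are hypothetical, chosen only to show one vector similarity feature sitting next to a lexical score and a business rule.

```python
from dataclasses import dataclass


@dataclass
class TrainingSample:
    """One <query, document> pair with its features and relevance rating."""
    query_id: str
    document_id: str
    features: dict  # feature name -> numerical value
    rating: int     # relevance label, e.g. 0 (irrelevant) to 4 (perfect)


def to_feature_vector(sample, feature_order):
    """Flatten the named features into a fixed-order numerical vector."""
    return [float(sample.features.get(name, 0.0)) for name in feature_order]


# Hypothetical feature names, for illustration only.
FEATURE_ORDER = ["vector_similarity", "lexical_title_score", "is_in_stock"]

sample = TrainingSample(
    query_id="q1",
    document_id="doc42",
    features={"vector_similarity": 0.83, "lexical_title_score": 5.2, "is_in_stock": 1},
    rating=3,
)
print(to_feature_vector(sample, FEATURE_ORDER), "->", sample.rating)
```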

Future Works
● Reciprocal Rank Fusion (https://github.com/apache/solr/pull/2489)
● Better scaling (min-max on theoretical max)
● And More!
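For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the commonly used formulation: each document receives 1 / (k + rank) from every ranked list it appears in, and the contributions are added up. This is a generic illustration in plain Python with made-up rankings and the conventional constant k = 60. It is not the implementation proposed in the linked pull request.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids using summed reciprocal ranks."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Toy rankings (illustrative only), best result first.
lexical_ranking = ["d1", "d2", "d3"]
knn_ranking = ["d3", "d1", "d4"]

print(reciprocal_rank_fusion([lexical_ranking, knn_ranking]))
```

Because only ranks are used, the unbounded lexical scores and the bounded vector similarities never need to be put on the same scale.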

The AI Side Of Apache Lucene/Solr - Training

Additional Resources
● Blog: https://sease.io/2023/12/hybrid-search-with-apache-solr.html

THANK YOU!
@seaseltd @sease-ltd
@seaseltd @sease_ltd
