Hybrid Search with
Apache Solr
Speaker: Alessandro Benedetti, Director @ Sease
COMMUNITY OVER CODE EU 2024 - 03/06/2024
WHO AM I?
ALESSANDRO BENEDETTI
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master's degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Passionate about semantic search, NLP and Machine Learning technologies
‣ Beach volleyball player and snowboarder
SEArch SErvices
www.sease.io
‣ Headquartered in London, distributed team
‣ Open-source enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community contributors
‣ Active researchers
Hot Trends:
● Large Language Models Applications
● Vector-based (Neural) Search
● Natural Language Processing
● Learning To Rank
● Document Similarity
● Search Quality Evaluation
● Relevance Tuning
Overview
● Limitations of Vector-Based Search
● Hybrid Search
● Hybrid Retrieval in Apache Solr
● Hybrid Ranking in Apache Solr
● Future Works
Vector-based Search
● Training: labeled samples
● Indexing: text to vectors
● Searching: query to vector, lookup in index
Neural Search Workflow
Similarity between a Query and a Document is translated to distance in a vector space
Distance Between Vectors -> Similarity
● Specify the relationship metric between elements in the dataset
● Use-case dependent
○ experiment to find which one works better for you!
● In Information Retrieval, cosine similarity has proved to work quite well (it’s a normalised inner product)
https://www.pinecone.io/learn/what-is-similarity-search/
https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
https://www.baeldung.com/cs/euclidean-distance-vs-cosine-similarity
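The remark that cosine similarity is a normalised inner product can be made concrete with a small sketch. The function names and toy vectors are illustrative only and do not come from the talk.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector norms."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors (assumed values, for illustration only).
query = [1.0, 0.0]
doc_same_direction = [3.0, 0.0]
doc_orthogonal = [0.0, 2.0]

print(cosine_similarity(query, doc_same_direction))
print(cosine_similarity(query, doc_orthogonal))
print(euclidean_distance(query, doc_same_direction))
```

Cosine similarity ignores vector length, so a document pointing in the same direction as the query is maximally similar even when the Euclidean distance between the two is not zero. This is one reason the choice of metric is use-case dependent.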
Low Explainability
● High Dimensionality - vectors are long sequences of numerical values (768, 1536 …)
● Dimensions - in many cases each feature (element in the vector) has no clear semantics (slightly different from sparse vectors or explicit feature vectors)
● Values - it’s not obvious how a single value impacts relevance (higher is better?)
● Similarity - to explain why a search result is retrieved in the top-K, the vector distance is the only info you have
Research is happening but it’s still an open problem
Lexical matches?
● Search users still expect lexical matches to happen
● You can’t enforce that with vector-based search
“Why is the document with the keyword in the title not coming up?” cit.
Low Diversity
● By nature vector-based search just returns the top-k
ordered by vector similarity
● Unless you add more logic on top, you would expect low
diversity by definition
Hybrid Search
● Mitigation of current vector-search problems - Is it here to stay?
● Combine traditional keyword-based (lexical) search with vector-based (neural)
search
● Retrieval of two sets of candidates:
○ one set of results coming from lexical matches with the query keywords
○ a set of results coming from the K-Nearest Neighbours search with the query
vector
● Ranking of the candidates
Apache Solr implementation
May 2022 - Apache Solr 9.0
Sease introduced support for KNN search (HNSW)
https://issues.apache.org/jira/browse/SOLR-15880
JIRA issues: https://issues.apache.org/jira/browse/SOLR-15880?jql=labels%20%3D%20vector-based-search
Retrieval Stage
The hybrid candidate result set is the union of the results coming from the two models:
● The top-K results coming from the K-Nearest Neighbours search and the <numFound> results coming from the lexical (keyword-based) search.
● The cardinality of the combined result set is <= (K + numFound).
● The result set doesn’t include any duplicates.
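The union described above can be sketched as a set of Solr request parameters built in Python. This is a minimal sketch, not the exact query shown in the talk: it assumes the bool, edismax and knn query parsers, and the field names `title` and `vector`, the keywords and the vector values are hypothetical.

```python
def knn_query(field, vector, top_k):
    """Render a knn query parser string such as {!knn f=vector topK=10}[0.1, 0.2]."""
    values = ", ".join(str(v) for v in vector)
    return "{!knn f=%s topK=%d}[%s]" % (field, top_k, values)

def union_params(keywords, text_field, vector_field, vector, top_k):
    """Hybrid retrieval as a union: a document is a candidate if either clause matches."""
    return {
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": "{!type=edismax qf=%s}%s" % (text_field, keywords),
        "vectorQuery": knn_query(vector_field, vector, top_k),
    }

# Hypothetical field names and values, for illustration only.
params = union_params("apache solr", "title", "vector", [0.1, 0.2, 0.3], 10)
for name, value in params.items():
    print(name, "=", value)
```

Because both clauses are `should` clauses, a document matching both is returned once, which is consistent with the result set containing no duplicates.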
Retrieval Stage
The hybrid candidate result set is the intersection of the results coming from the two models:
● Only the top-K results coming from the K-Nearest Neighbours search that also satisfy the lexical query are returned.
● The cardinality of the combined result set is <= K.
● This is effectively post-filtering the K-Nearest Neighbours results, except that the lexical query also affects the score
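The intersection can be sketched in the same way by turning both clauses into `must` clauses. As before, this is an assumed parameter layout using the bool, edismax and knn query parsers, with hypothetical field names and values.

```python
def knn_query(field, vector, top_k):
    """Render a knn query parser string such as {!knn f=vector topK=10}[0.1, 0.2]."""
    values = ", ".join(str(v) for v in vector)
    return "{!knn f=%s topK=%d}[%s]" % (field, top_k, values)

def intersection_params(keywords, text_field, vector_field, vector, top_k):
    """Hybrid retrieval as an intersection: both clauses must match and both contribute to the score."""
    return {
        "q": "{!bool must=$lexicalQuery must=$vectorQuery}",
        "lexicalQuery": "{!type=edismax qf=%s}%s" % (text_field, keywords),
        "vectorQuery": knn_query(vector_field, vector, top_k),
    }

# Hypothetical field names and values, for illustration only.
params = intersection_params("apache solr", "title", "vector", [0.1, 0.2, 0.3], 10)
for name, value in params.items():
    print(name, "=", value)
```

At most `top_k` documents can come back, since a candidate first has to be among the nearest neighbours and then also match the keywords.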
Bonus Point: PRE-FILTERING VS POST-FILTERING
● Before 9.1 -> FQ were post-filters
● From 9.1 -> FQ are pre-filters (to run a post-filter, see the cache local parameter:
https://solr.apache.org/guide/solr/latest/query-guide/common-query-parameters.html#cache-local-parameter)
● From 9.6 -> More flexibility with additional !knn query parser params
9.6 PRE-FILTERING VS POST-FILTERING
More flexibility with additional !knn query parser params:
○ preFilter -> Specifies an explicit list of Pre-Filter query strings to use. N.B. if you specify this, all FQs become post-filters.
○ includeTags -> Indicates that only fq filters with the specified tag should be considered for implicit Pre-Filtering. Must not be combined with preFilter.
○ excludeTags -> Indicates that fq filters with the specified tag should be excluded from consideration for implicit Pre-Filtering. Must not be combined with preFilter.
DEFAULT (no preFilter specified)
● knn as the main Q -> same as from 9.1: FQs that are not post-filters become pre-filters automatically
○ includeTags and excludeTags may be used to limit the set of fq filters used in the Pre-Filter.
● knn as an fq param, or as a subquery clause in a larger query -> no implicit Pre-Filter is used.
○ includeTags and excludeTags must not be used in these situations.
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#explicit-knn-pre-filtering
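An explicit pre-filter can be sketched as follows. The `preFilter` local parameter is the one named above; the field names `vector`, `inStock` and `category` and the vector values are hypothetical.

```python
def knn_with_prefilter(field, vector, top_k, pre_filter=None):
    """Render a knn query string, optionally with an explicit preFilter local param."""
    local_params = "f=%s topK=%d" % (field, top_k)
    if pre_filter is not None:
        local_params += " preFilter=%s" % pre_filter
    values = ", ".join(str(v) for v in vector)
    return "{!knn %s}[%s]" % (local_params, values)

params = {
    # Only inStock:true restricts the documents considered by the KNN search.
    "q": knn_with_prefilter("vector", [1.0, 2.0, 3.0, 4.0], 10, pre_filter="inStock:true"),
    # With an explicit preFilter, this fq is applied after the top-K have been selected.
    "fq": ["category:AAA"],
}
print(params["q"])
```

In this layout the KNN search picks its top 10 among in-stock documents only, and the category filter then narrows those 10 further, so fewer than 10 results may come back.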
9.6 PRE-FILTERING VS POST-FILTERING
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#explicit-knn-pre-filtering
Some use cases where includeTags and/or excludeTags may be more useful than an explicit preFilter parameter:
● You have some fq parameters that are re-used on many requests (even when you don’t use the knn parser) that you wish to be used as KNN Pre-Filters when you do use the knn query parser.
● You typically want all fq params to be used as KNN Pre-Filters, but when users "drill down" on Facets, you want the fq parameters you add to be excluded from the KNN Pre-Filtering, so that the result set gets smaller instead of just computing a new topK set.
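The facet drill-down use case can be sketched with `excludeTags` and a tagged fq. The tag name `drilldown` and the field names are hypothetical; the `{!tag=...}` local parameter is the standard Solr way of tagging a filter.

```python
def knn_excluding_tags(field, vector, top_k, exclude_tag):
    """Render a knn query string that keeps fq filters with the given tag out of the pre-filter."""
    values = ", ".join(str(v) for v in vector)
    return "{!knn f=%s topK=%d excludeTags=%s}[%s]" % (field, top_k, exclude_tag, values)

def tagged_fq(tag, query):
    """Attach a tag to a filter query through the tag local param."""
    return "{!tag=%s}%s" % (tag, query)

params = {
    "q": knn_excluding_tags("vector", [1.0, 2.0, 3.0, 4.0], 10, "drilldown"),
    "fq": [
        "inStock:true",                          # untagged: still an implicit pre-filter
        tagged_fq("drilldown", "category:AAA"),  # tagged: excluded from pre-filtering
    ],
}
print(params["q"])
print(params["fq"])
```

With this layout the top-K set is computed among in-stock documents, and the facet filter then shrinks that same set rather than triggering a fresh top-K among category AAA documents.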
The Ranking Problem
● We need a score that reflects the best relevance ordering
● KNN candidates present a score [-1 … 1]
● Lexical candidates present an unbounded score (potentially on a completely different scale)
N.B. combining Lexical scores with vector-based similarity (and potentially other
features) is not a solved problem
What options do we have right now in Apache Solr?
Ranking Stage
● The filter component ignores any scoring and just builds the hybrid result set.
● The must clause is responsible for assigning the score, using the appropriate function query.
● The lexical score is min-max normalised to the range 0 to 1 and added to the K-Nearest Neighbours score.
This simple linear combination of the scores could be a good starting point.
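The filter/must split described above can be sketched as Solr request parameters. This is an assumed layout, not the exact query from the slide: it relies on the bool and func query parsers and on the `scale`, `sum` and `query` function queries, and the example lexical and vector queries are hypothetical.

```python
def sum_ranking_params(lexical_query, vector_query):
    """Retrieve the union in a filter clause, then score with scaled lexical score plus KNN score."""
    return {
        # filter builds the candidate set without scoring; must assigns the score.
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        # scale() min-max normalises the lexical score into [0, 1].
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": lexical_query,
        "vectorQuery": vector_query,
    }

# Hypothetical queries, for illustration only.
params = sum_ranking_params(
    "{!type=edismax qf=title}apache solr",
    "{!knn f=vector topK=10}[0.1, 0.2, 0.3]",
)
for name, value in params.items():
    print(name, "=", value)
```

Keeping retrieval and ranking in separate parameters makes it easy to swap the ranking function while leaving the candidate set untouched.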
Ranking Stage
● The filter component ignores any scoring and just builds the hybrid result set.
● The must clause is responsible for assigning the score, using the appropriate function query.
● The lexical score is min-max normalised to the range 0.1 to 1 and multiplied by the K-Nearest Neighbours score.
● Better? No evidence
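To see how the two combinations behave, here is a small numeric sketch with made-up scores. It mimics min-max scaling in plain Python and is not Solr code; the reading that the 0.1 lower bound keeps documents with the lowest lexical score from being zeroed out by the multiplication is an inference, not something stated in the talk.

```python
def min_max_scale(scores, low, high):
    """Linearly rescale scores so the minimum maps to `low` and the maximum to `high`."""
    smallest, largest = min(scores), max(scores)
    if largest == smallest:
        # Degenerate case: this sketch maps every score to `high` (an arbitrary choice).
        return [float(high)] * len(scores)
    span = largest - smallest
    return [low + (s - smallest) * (high - low) / span for s in scores]

# Toy scores for three documents; the third one matches only the KNN clause.
lexical_scores = [12.0, 3.0, 0.0]
knn_scores = [0.2, 0.9, 0.8]

summed = [l + v for l, v in zip(min_max_scale(lexical_scores, 0.0, 1.0), knn_scores)]
multiplied = [l * v for l, v in zip(min_max_scale(lexical_scores, 0.1, 1.0), knn_scores)]

# Document indexes ordered by descending combined score.
order_by_sum = sorted(range(3), key=lambda i: -summed[i])
order_by_product = sorted(range(3), key=lambda i: -multiplied[i])
print(order_by_sum, order_by_product)
```

On these toy numbers the sum puts the strongest lexical match first while the product favours the document that is reasonably good on both signals, which illustrates why neither combination can be called better without evaluation.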
Learning To Rank
● Multiple factors (features) affect ranking
○ there is no obvious mathematical function to combine them -> what do you want to optimise?
○ Sum? Normalised sum? Multiplication? Linear or non-linear function?
○ Rather than manual trial and error, let’s use Machine Learning -> LTR
● Apache Solr has supported Learning To Rank since 6.4
○ from 9.3, Sease contributed the first support for vector similarity as a feature
○ First step -> sponsor us or contribute improvements!
● Build a training set <query, document> -> rating
○ each <query, document> pair is described by a vector of numerical features: one of them can be your vector similarity, others may be lexical scores or business rules
https://sease.io/category/learning-to-rank
Features.json
model.json
LTR Query
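The contents of these three slides are not preserved in the text, so what follows is only a hedged sketch of what a feature store, a model and a re-ranking query could look like. The feature and model class names are those of the Solr LTR module; the names `lexicalScore`, `vectorSimilarity`, `hybridModel`, the field `vector`, the `query_vector` external feature and the weights are all assumptions, and whether a vector can be passed through `efi` exactly as written should be checked against the Solr reference guide.

```python
import json

# Assumed feature definitions: the original lexical score and a vector similarity.
features = [
    {
        "name": "lexicalScore",
        "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
        "params": {},
    },
    {
        "name": "vectorSimilarity",
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        # ${query_vector} is a placeholder filled in at query time via efi.query_vector.
        "params": {"q": "{!func}vectorSimilarity(FLOAT32, COSINE, vector, ${query_vector})"},
    },
]

# Assumed linear model; the weights are placeholders, not trained values.
model = {
    "class": "org.apache.solr.ltr.model.LinearModel",
    "name": "hybridModel",
    "features": [{"name": f["name"]} for f in features],
    "params": {"weights": {"lexicalScore": 0.5, "vectorSimilarity": 0.5}},
}

# Assumed re-ranking parameter: re-rank the top 100 candidates with the model.
rq = "{!ltr model=hybridModel reRankDocs=100 efi.query_vector='[0.1, 0.2, 0.3]'}"

print(json.dumps(features, indent=2))
print(json.dumps(model, indent=2))
print(rq)
```

In a real setup the weights would come from training on the rated <query, document> pairs rather than being set by hand.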
Future Works
● Reciprocal Rank Fusion (https://github.com/apache/solr/pull/2489)
● Better scaling (min-max on theoretical max)
● And More!
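Reciprocal Rank Fusion sidesteps the score-scale problem by combining ranks instead of scores. The sketch below shows the general technique, not the implementation in the linked pull request; the constant k=60 is a commonly used default and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical ranked lists from the lexical and the KNN search.
lexical_ranking = ["d1", "d2", "d3"]
knn_ranking = ["d2", "d4", "d1"]

fused = reciprocal_rank_fusion([lexical_ranking, knn_ranking])
print(fused)
```

Documents that appear high in both lists rise to the top, and no normalisation of the underlying lexical or vector scores is needed.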
The AI Side Of Apache Lucene/Solr - Training
Additional Resources
● Blog: https://sease.io/2023/12/hybrid-search-with-apache-solr.html
THANK YOU!
@seaseltd @sease-ltd
@seaseltd @sease_ltd
