What’s next for
deep learning
for Search?
Bhaskar Mitra
Principal Researcher
Microsoft Research
A (personal) reflection on the journey so far…
2014-16
2016
First wave of deep document ranking models
Trained on 200K English queries from Bing.com (proprietary dataset)
Trained on 95K Chinese queries from Sogou.com (public dataset)
Trained using BM25-based weak labels
2017
But are we making
real progress?
¯_(ツ)_/¯
Passage Ranking Leaderboard
MS MARCO passage ranking benchmark
launches with 0.5M+ English training queries
The myth of “no neural IR model worked before BERT”: first-generation deep
ranking models, e.g., Duet and KNRM, and their variants, outperform most
traditional IR methods by a reasonable margin on the MS MARCO benchmark
2018
Did neural IR really have a “weak baselines” problem?
I will argue NO: (i) pre-MS MARCO, most neural IR papers benchmarking on Robust04 were NOT trained on
large labeled datasets and represent a biased sample of neural IR papers, and (ii) even in those cases there is
little evidence that these papers employed any weaker baselines than non-neural IR papers
Why is this important?
1. Can’t expect every paper to beat SOTA. Focus is often on hypothesis testing. Check for appropriate
baselines, not SOTA baselines. Improvements over emerging methods (not yet SOTA) should be encouraged.
2. Early generation deep ranking models provided many useful
insights and created the demand for large training datasets. 👏🏽
But we had a BIGGER benchmarking problem
The lack of public IR benchmarks with large-scale training
data led to:
Comparisons under low-data regimes
  e.g., older TREC collections with a few hundred queries
Comparisons on (semi-)synthetic benchmarks
  e.g., TREC CAR
Comparisons under weak supervision training
Comparisons on corpora in a language other than the one
the models were designed for
Performance of deep models typically
improves with more training data
(image source: The Duet paper)
Non-standardized benchmarks also required reimplementation of baselines (especially
neural baselines), which meant that many of them were under-tuned, in turn
contributing to the “weak baselines” problem!
The year of BERT
Three months after the BERT paper hits arXiv the first BERT-based reranking model achieves
0.359 MRR compared to previous state-of-the-art of 0.281 on MS MARCO
2019
TREC Deep Learning Track
2019
Document Ranking Leaderboard
+
TREC 2020 Deep Learning Track
Also, this guy
😒👇
2020
Next year (2023) we will be
hosting the 5th edition of the
TREC Deep Learning Track 🙌🏽
Are we making
real progress?
Deep learning models have gone from
novelty to commodity in communities
like SIGIR, paralleling how learning-to-rank
models “took over” IR
Deep models have demonstrated large
gains over previous state-of-the-art,
and the gap continues to grow
But we must be careful of how we
interpret “progress”, and interrogate the
evidence when it is largely based on a
single benchmark
Internal validity. Would improvements on the
current dataset hold on a different sample from
the same dataset for the same task?
External validity. Would improvements on the
current dataset hold on a different dataset with
different distribution for the same task or on a
different (but closely related) task?
If MS MARCO’s training data is only useful for
achieving good results on MS MARCO’s test
set, then it’s less useful for the IR community
Important: transfer learning from MS MARCO
to other benchmarks
• TREC DL is transfer learning (MS MARCO
sparse binary labels → NIST’s 5-point
labels)
• Promising results: MS MARCO → Robust04,
TREC-COVID, TREC-CAsT
Under bootstrap analysis
we find the leaderboard
rankings are stable!
😊👍
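A bootstrap stability check of this kind can be sketched as follows. This is a minimal illustration, not the analysis behind the slide: the per-query scores, the function name `bootstrap_win_rate`, and the number of resamples are all assumptions made for the example.

```python
import random

def bootstrap_win_rate(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples of the query set on which system A's
    mean per-query score is strictly higher than system B's."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample query indices with replacement; both systems see the same sample.
        sample = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in sample) / n
        mean_b = sum(scores_b[i] for i in sample) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples

# Hypothetical per-query reciprocal ranks for two systems on the same queries.
system_a = [1.0, 0.5, 1.0, 0.33, 1.0, 0.5, 1.0, 0.25]
system_b = [0.5, 0.5, 0.33, 0.33, 1.0, 0.25, 0.5, 0.2]
win_rate = bootstrap_win_rate(system_a, system_b)
print(f"A beats B on {win_rate:.1%} of resamples")
```

If the ordering of two systems flips on a large share of resamples, their relative position on the leaderboard should not be read as meaningful.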
BERT-scale deep ranking models in
production search systems
Industry impact
“Give a small boy a hammer, and he will find
that everything he encounters needs pounding.”
- Abraham Kaplan
https://en.wikipedia.org/wiki/Law_of_the_instrument
What’s next?
Short term. The main focus is likely to remain on making these
models more practically useful in real production systems. The emphasis will be
on effectiveness, efficiency, and robustness.
Long term. In the longer term, our increased optimization capabilities should
support and encourage bolder visions for IR. The three directions I am personally
excited about are: (i) Interplay between IR data structures and deep ranking
models, (ii) stochastic ranking and direct optimization for target exposure, and
(iii) structured knowledge extraction / modeling / retrieval.
Case studies: Scaling BERT-based relevance models to
long documents and full retrieval settings
Nogueira and Cho (2019)
How do we scale to
longer documents?
How do we scale to
full retrieval from
large collections?
Bottleneck: Peak GPU
memory during training
Challenge: Collection size
vs. expected online query
response time
Challenges in scaling
BERT to longer inputs
At training time, the GPU memory requirement for
BERT’s self-attention layers grows quadratically w.r.t.
input length
The quadratic complexity is a direct result of storing all
the $n^2$-dimensional attention matrices in GPU memory
during training for easier backpropagation
Potential workarounds:
• Trade-off GPU memory and training time by
proactively releasing GPU memory during forward
pass at the cost of redundant re-computations
during backward pass
• Find cheaper approximation to self-attention layers
• Reduce the input space by running BERT on select
passages in the document
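A back-of-the-envelope calculation makes the quadratic growth described above concrete. The sketch below counts only the attention matrices for a single training example; the defaults (12 layers, 12 heads, 4-byte floats) correspond to a BERT-base-sized model in float32 and are assumptions for illustration, as is the function name.

```python
def attention_matrix_bytes(seq_len, num_layers=12, num_heads=12, bytes_per_value=4):
    """Bytes needed to keep one n x n attention matrix per head per layer."""
    return num_layers * num_heads * seq_len * seq_len * bytes_per_value

for n in (512, 2048, 8192):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>5}: {mib:,.0f} MiB per training example")
```

Under these assumptions, going from 512 to 2048 tokens multiplies the attention-matrix footprint by 16 (from 144 MiB to 2,304 MiB per example), before accounting for batch size or any other activations.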
Trade-off GPU memory and training time using
gradient checkpointing
At training time, during the forward-pass the model
caches all intermediate outputs in GPU memory so
that during backward-pass we can easily compute
the gradients of a layer’s outputs w.r.t. its inputs
Under gradient checkpointing (Chen et al., 2016), in
contrast, the model only saves intermediate outputs
at specific checkpoints; during the backward-pass,
missing intermediate outputs are recomputed based
on the closest preceding checkpoint(s)
For Transformers, this allows us to store only one
$n^2$-dimensional attention matrix in GPU memory at
any given time!
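The trade-off can be illustrated with a toy example. The sketch below is not a Transformer: it is a chain of scalar `tanh` layers, chosen so the whole thing runs with the standard library. The layer definition, function names, and the checkpoint interval are assumptions for illustration; deep learning frameworks typically provide checkpointing as a built-in utility.

```python
import math

def layer(w, x):
    """Toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)

def layer_grad(w, x):
    """dy/dx of the toy layer, which needs the layer's input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def backward_full(weights, x0):
    """Standard backprop: cache every layer input during the forward pass."""
    inputs = [x0]
    for w in weights:
        inputs.append(layer(w, inputs[-1]))
    grad = 1.0
    for w, x in zip(reversed(weights), reversed(inputs[:-1])):
        grad *= layer_grad(w, x)
    return grad, len(inputs) - 1  # gradient, number of cached layer inputs

def backward_checkpointed(weights, x0, every=4):
    """Cache inputs only at every `every`-th layer; recompute the rest."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        # Recompute this segment's layer inputs from its checkpoint.
        inputs = [checkpoints[start]]
        for w in segment[:-1]:
            inputs.append(layer(w, inputs[-1]))
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for w, xin in zip(reversed(segment), reversed(inputs)):
            grad *= layer_grad(w, xin)
    return grad, peak  # gradient, peak number of values held at once

weights = [0.9, 1.1, 0.8, 1.2, 0.7, 1.3, 0.95, 1.05, 0.85, 1.15, 0.75, 1.25]
grad_full, cached_full = backward_full(weights, 0.3)
grad_ckpt, cached_ckpt = backward_checkpointed(weights, 0.3, every=4)
print(cached_full, cached_ckpt)
```

Both functions return the same gradient, but the checkpointed version holds fewer values at its peak, at the cost of running part of the forward pass a second time.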
[Figure: values stored in GPU memory without gradient checkpointing vs. with gradient checkpointing, where only the checkpoints are stored]
https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9
Cheaper approximation: Transformer → Conformer
Conformer is an alternative to Transformer that employs a separable self-attention layer with linear GPU
memory complexity (as opposed to Transformer’s quadratic complexity) and is augmented with additional
convolutional layers to model short-distance attention
Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence for Document Retrieval. ArXiv preprint. (2020)
Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence at TREC 2020 Deep Learning Track. In Proc. TREC. (2020)
Mitra, Hofstätter, Zamani, and Craswell. Improving Transformer-Kernel Ranking Model Using Conformer and Query Term Independence. In Proc. SIGIR. (2021)
TREC 2020 Deep Learning Track
(Document Ranking Task)
Reduce the task: Passage-based document ranking
Hofstätter, Zamani, Mitra, Craswell, and Hanbury. Local Self-Attention over Long Text for Efficient Document Retrieval. In Proc. SIGIR. (2020)
Hofstätter, Mitra, Zamani, Craswell, and Hanbury. Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. In Proc. SIGIR. (2021)
Kazai, Mitra, Dong, Zamani, Craswell, and Yang. Less is Less: When Are Snippets Insufficient for Human vs Machine Relevance Estimation? In Proc. ECIR. (2022)
Strategy 1: Run BERT on first-k tokens from the document
Considering only the first-k tokens leads to underestimation of relevance and consequently
under-retrieval of longer documents (Hofstätter et al., 2020). Recent studies (Kazai et al., 2022)
have also analyzed when single snippets are insufficient for both human and machine learning
based relevance estimation.
Strategy 2: Run BERT on multiple windows of k-tokens each from the document
This is the approach proposed by Hofstätter et al. (2020). However, longer documents produce more
windows, and running BERT too many times per query-document pair can be
prohibitively costly.
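The difference between Strategy 1 and Strategy 2 can be sketched in a few lines. This is an illustration only: `overlap_scorer` is a stand-in for BERT, taking the maximum over windows is an assumed aggregation rather than the one used in the cited work, and the document is made up.

```python
def token_windows(tokens, k):
    """Split a token list into consecutive, non-overlapping windows of at most k tokens."""
    return [tokens[i:i + k] for i in range(0, len(tokens), k)]

def score_document(query, doc_tokens, k, score_window):
    """Score every window and aggregate with max; also report the number of model calls."""
    windows = token_windows(doc_tokens, k)
    return max(score_window(query, w) for w in windows), len(windows)

def overlap_scorer(query, window):
    """Stand-in for BERT: fraction of query terms present in the window."""
    terms = query.split()
    return sum(t in window for t in terms) / len(terms)

# The relevant text sits in the middle of the document, outside the first k tokens.
doc = ("intro " * 6 + "halloween costumes for kids " + "outro " * 6).split()
score, num_calls = score_document("halloween costumes", doc, k=4, score_window=overlap_scorer)
first_k_score = overlap_scorer("halloween costumes", doc[:4])
print(score, num_calls, first_k_score)
```

The windowed score finds the relevant passage that the first-k score misses, but the number of model calls grows linearly with document length.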
Strategy 3: Run BERT on windows of text pre-selected using cheaper models
This is the approach proposed by Hofstätter et al. (2021). The approach
(IDCM) was motivated by cascaded ranking pipelines, but in this case the
cascades are employed within-document for passage selection.
Intra-Document
Cascaded Model (IDCM)
We employ a cascaded architecture: a cheaper model
ranks and prunes candidate passages, and the costlier BERT
model inspects only the selected passages from the document
The cheaper model is trained via knowledge distillation
from the BERT model
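The control flow of such a within-document cascade might look like the sketch below. Both scoring functions are hypothetical stand-ins (neither is a trained model), the max aggregation is an assumption, and the distillation step that trains the cheap model is not shown.

```python
def cascade_score(query, passages, cheap_score, costly_score, keep=2):
    """Rank passages with the cheap model, keep the top `keep`, and run the
    costly model only on those. Returns (document score, costly model calls)."""
    ranked = sorted(passages, key=lambda p: cheap_score(query, p), reverse=True)
    selected = ranked[:keep]
    return max(costly_score(query, p) for p in selected), len(selected)

def cheap_score(query, passage):
    """Stand-in for the cheap selector: number of query terms in the passage."""
    return len(set(query.split()) & set(passage.split()))

def costly_score(query, passage):
    """Stand-in for BERT: term overlap plus a bonus for an exact phrase match."""
    return cheap_score(query, passage) + (2 if query in passage else 0)

passages = [
    "shipping and returns policy",
    "costumes for a school play",
    "best halloween costumes for kids",
    "halloween opening hours",
]
doc_score, costly_calls = cascade_score(
    "halloween costumes", passages, cheap_score, costly_score, keep=2
)
print(doc_score, costly_calls)
```

The costly model runs on two passages instead of four; `keep` is the knob that trades effectiveness against cost.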
Challenges in scaling
BERT to full retrieval
Broadly, two sets of approaches have emerged:
dense retrieval and Query Term Independent
(QTI) models; both precompute document
representations at indexing time and require
very little computation at query response time
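The query term independence idea can be sketched as follows: if the model scores each query term against the document independently, those scores can be precomputed and stored in an inverted-index-like structure, and query evaluation reduces to lookups and sums. In this sketch `toy_term_doc_score` is a stand-in for a deep model, and the vocabulary, documents, and function names are assumptions for illustration.

```python
from collections import defaultdict

def build_qti_index(docs, term_doc_score, vocabulary):
    """Indexing time: precompute a score for every (term, document) pair."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in vocabulary:
            s = term_doc_score(term, text)
            if s > 0:
                index[term][doc_id] = s
    return index

def qti_search(query, index):
    """Query time: sum precomputed per-term scores; no model inference needed."""
    totals = defaultdict(float)
    for term in query.split():
        for doc_id, s in index.get(term, {}).items():
            totals[doc_id] += s
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def toy_term_doc_score(term, text):
    """Stand-in for a deep model scoring one query term against a document."""
    tokens = text.split()
    return tokens.count(term) / len(tokens)

docs = {
    "d1": "halloween costumes and halloween masks",
    "d2": "facebook login page",
    "d3": "costumes for rent",
}
vocabulary = {"halloween", "costumes", "facebook", "login"}
index = build_qti_index(docs, toy_term_doc_score, vocabulary)
results = qti_search("halloween costumes", index)
print(results)
```

Dense retrieval follows the same precompute-then-lookup pattern, with one vector per document and a nearest neighbor search in place of the per-term sum.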
Mitra, Rosset, Hawking, Craswell, Diaz, and Yilmaz. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking Using Deep Neural Networks. ArXiv preprint. (2019)
Nogueira and Cho (2019)
Dense retrieval: Xiong et al. (2021), Qu et al.
(2021), Hofstätter et al. (2021), and others
QTI: Mitra et al. (2019), Nogueira et al.
(2019), Dai and Callan (2020), and others
A note about distillation
A popular recipe involves pretraining and finetuning large models, and
then distilling their knowledge into smaller models that can be deployed in
real-world retrieval systems
Pretrained model (e.g., 24-layer BERT)
→ Finetuned model (e.g., 24-layer BERT)
→ Distilled model (e.g., smaller model, dense retriever, QTI, or early-stage cascade model)
Benchmarking by jointly considering effectiveness,
efficiency, and robustness
Hofstätter, Craswell, Mitra, Zamani, and Hanbury. Are We There Yet? A Decision Framework for Replacing Term-Based Retrieval with Dense Retrieval Systems. ArXiv preprint. (2022)
A preliminary framework:
• Identify key measures of effectiveness, efficiency, and robustness
• Measures can either be traded off against each other, or act as
guardrails
• Apply value-laden and business-informed trade-offs between
cost and robustness measures to define aggregate measures, e.g.,
• Identify the set of acceptable solutions (again) based on value-laden
and business-informed trade-off decision boundaries
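One way to read the framework above is as a filter followed by a ranking: guardrails remove unacceptable systems, and an aggregate measure orders the rest. The sketch below is a hypothetical instance, not the framework from the cited paper; every system name, number, threshold, and weight is made up for illustration.

```python
def acceptable_systems(systems, guardrails, aggregate):
    """Drop systems that violate any guardrail, then rank the rest by the aggregate measure."""
    ok = [s for s in systems if all(check(s) for check in guardrails)]
    return sorted(ok, key=aggregate, reverse=True)

# Made-up measurements for three hypothetical systems.
systems = [
    {"name": "bm25", "ndcg": 0.50, "latency_ms": 20, "worst_domain_ndcg": 0.45},
    {"name": "dense-small", "ndcg": 0.62, "latency_ms": 45, "worst_domain_ndcg": 0.40},
    {"name": "dense-large", "ndcg": 0.66, "latency_ms": 180, "worst_domain_ndcg": 0.30},
]

def within_latency_budget(s):
    return s["latency_ms"] <= 100          # efficiency guardrail (assumed budget)

def robust_enough(s):
    return s["worst_domain_ndcg"] >= 0.35  # robustness guardrail (assumed floor)

def aggregate(s):
    # Assumed trade-off: each millisecond of latency costs 0.001 NDCG.
    return s["ndcg"] - 0.001 * s["latency_ms"]

ranked = [s["name"] for s in acceptable_systems(
    systems, [within_latency_budget, robust_enough], aggregate)]
print(ranked)
```

The most effective system is excluded here because it fails both guardrails, which is the kind of outcome a single effectiveness leaderboard cannot express.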
Robustness and calibration
In practical scenarios, it is important that these
deep models perform robustly across different
domains and distributional shifts over time
It is also important that we look beyond point
estimates of relevance and consider calibrated
uncertainties in model predictions
Cohen, Mitra, Rosset, Hofmann, and Croft. Cross Domain Regularization for Neural Ranking Models using Adversarial Learning. In Proc. SIGIR. (2018)
Cohen, Mitra, Lesota, Rekabsaz, and Eickhoff. Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models. In Proc. SIGIR. (2021)
Cohen, Du, Mitra, Mercurio, Rekabsaz, and Eickhoff. Inconsistent Ranking Assumptions in Medical Search and Their Downstream Consequences. In Proc. SIGIR. (2022)
Looking further down the road…
IR data structures and deep ranking models
Kraska et al. (2018) were among the earliest to propose learned index structures, where predictive
machine learning is employed to speed up search over classical data structures
I believe there’s a significant opportunity to design data-structure-aware deep ranking models
and to employ deep learning to directly optimize for efficiency in our search stacks
Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018)
Case study:
Optimizing first stage retrieval
using reinforcement learning
Large-scale IR systems trade off search result quality and query response time
In Bing, we have a candidate generation stage followed by multiple rank and prune stages
Typically, we apply machine learning in the re-ranking stages
In this work, we explore reinforcement learning for effective and efficient candidate generation
In Bing, the index is distributed over multiple machines
For candidate generation, on each machine the documents are linearly scanned using a match plan
When a query comes in, it is automatically
categorized, and a pre-defined match plan
is selected
A match rule defines the condition that a
document should satisfy to be selected as a
candidate
A match plan consists of a sequence of
match rules, and corresponding stopping
criteria
The stopping criteria decide when the
index scan using a particular match rule
should terminate, and whether the matching
process should continue with the next match
rule, conclude, or reset to the beginning
of the index
Match plans influence the
trade-off between
effectiveness and efficiency
E.g., long queries with rare
intents may require expensive
match plans that consider
body text and search deeper
into the index
In contrast, for popular
navigational queries a shallow
scan against URL and title
metastreams may be sufficient
E.g.,
Query: halloween costumes
Match rule: mrA → (halloween ∈ A|U|B|T ) ∧ (costumes ∈ A|U|B|T )
Query: facebook login
Match rule: mrB → (facebook ∈ U|T )
During execution, two accumulators are tracked
u: the number of blocks accessed from disk
v: the cumulative number of term matches in all inspected documents
A stopping criterion sets a threshold for each; when either threshold is met, the scan using
that particular match rule terminates
Matching may then continue with a new match rule, terminate, or restart from the beginning
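The execution model described above can be sketched as a loop over index blocks. This is a simplified illustration, not Bing's implementation: the field letters A, U, B, T are assumed to stand for anchor, URL, body, and title; thresholds are treated as absolute limits on the running accumulators; each rule continues from the current scan position; and the reset action is omitted. The index contents are made up.

```python
def make_rule(term_fields):
    """Build a match rule from {term: fields}, e.g. {"facebook": "UT"} for (facebook ∈ U|T).
    The rule returns (is_candidate, number_of_terms_matched) for a document."""
    def rule(doc):
        hits = sum(
            any(term in doc[field].split() for field in fields)
            for term, fields in term_fields.items()
        )
        return hits == len(term_fields), hits
    return rule

def run_match_plan(index_blocks, plan):
    """Run each (rule, max_u, max_v) step until either accumulator hits its threshold."""
    candidates, u, v, pos = [], 0, 0, 0
    for rule, max_u, max_v in plan:
        while pos < len(index_blocks) and u < max_u and v < max_v:
            for doc in index_blocks[pos]:
                is_match, hits = rule(doc)
                v += hits                  # cumulative term matches
                if is_match:
                    candidates.append(doc["id"])
            u += 1                         # one more block accessed
            pos += 1
    return candidates, u, v

def doc(doc_id, url, title, body, anchor=""):
    return {"id": doc_id, "U": url, "T": title, "B": body, "A": anchor}

index_blocks = [
    [doc("d1", "shop", "halloween costumes", "buy halloween costumes"),
     doc("d2", "news", "weather", "rain today")],
    [doc("d3", "blog", "party ideas", "halloween costumes ideas"),
     doc("d4", "store", "costumes", "halloween sale")],
    [doc("d5", "diy", "halloween costumes diy", "")],
]
plan = [
    (make_rule({"halloween": "UT", "costumes": "UT"}), 1, 10),      # shallow, cheap scan
    (make_rule({"halloween": "AUBT", "costumes": "AUBT"}), 3, 5),   # deeper, costlier scan
]
candidates, u, v = run_match_plan(index_blocks, plan)
print(candidates, u, v)
```

In this run the second rule stops after one block because the term-match threshold is reached, so the last block is never read.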
Optimizing query evaluations using reinforcement learning
Learn a policy $\pi_\theta : S \to A$ which maximizes the
cumulative discounted reward $R$, where $\gamma$ is the
discount rate
We employ table-based Q-learning
State space: index blocks accessed ($u_t$) and term
matches ($v_t$)
Action space:
Reward function:
$g(d_i)$ is the relevance of the $i$-th document estimated based on
the subsequent L1 ranker score, considering only the top $n$
documents
[Figure: the agent picks a match rule to run against the index, observes the accumulators (u, v), and is rewarded with relevance discounted by index blocks accessed]
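For readers unfamiliar with table-based Q-learning, the update rule can be sketched as below. The standard update is $Q(s,a) \leftarrow Q(s,a) + \alpha\,(r + \gamma \max_{a'} Q(s',a') - Q(s,a))$. Everything specific in the example is an assumption: the action names, the learning rate, the rewards, and the use of raw (u, v) tuples as states; the slide text does not spell out the action space or the reward function.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step on the table `q`, keyed by (state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]

# Hypothetical action set: two match rules plus a stop action.
actions = ["mrA", "mrB", "stop"]
q = defaultdict(float)

# States are accumulator values (u_t, v_t); the rewards are made up.
q_update(q, (0, 0), "mrB", 1.0, (1, 2), actions)
q_update(q, (1, 2), "mrA", 2.0, (3, 5), actions)
q_update(q, (0, 0), "mrB", 1.0, (1, 2), actions)
print(q[((0, 0), "mrB")])
```

The third update shows the value of the later state propagating back to the earlier one through the discounted `best_next` term.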
Search as a mediator
of exposure
Traditional IR is concerned with ranking items
according to relevance
These information access systems deployed at
web scale mediate which information and items
get exposure
In many search scenarios it may be more
appropriate to optimize for exposure rather
than rank-based metrics; it may also allow us to
move towards richer presentation schemes
beyond ranked lists
Also important in the context of fair ranking!
Stochastic ranking and expected exposure
In recommendation, Diaz et al. (2020) define a stochastic ranking policy $\pi_u$, conditioned on user $u \in U$, as a
probability distribution over all permutations of items in the collection
The expected exposure of an item $d$ for user $u$ can then be computed as follows:
Here, $p(\epsilon \mid d, \sigma)$ can be computed using a user browsing model like RBP as discussed previously
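The equation itself did not survive in the slide text. From the surrounding definitions it is presumably the expectation of per-ranking exposure under the policy, $p(\epsilon \mid d, \pi_u) = \sum_{\sigma} \pi_u(\sigma)\, p(\epsilon \mid d, \sigma)$. The sketch below computes that quantity by brute force for a tiny collection, assuming an RBP-style browsing model in which exposure at rank $r$ is `patience ** (r - 1)`; the patience value and the two example policies are made up.

```python
from itertools import permutations

def expected_exposure(policy, patience=0.5):
    """`policy` maps a ranking (tuple of items) to its probability.
    Returns each item's exposure averaged over the policy."""
    exposure = {}
    for ranking, prob in policy.items():
        for rank, item in enumerate(ranking, start=1):
            exposure[item] = exposure.get(item, 0.0) + prob * patience ** (rank - 1)
    return exposure

items = ("a", "b", "c")
# A policy that samples every permutation with equal probability.
uniform = {sigma: 1 / 6 for sigma in permutations(items)}
# A deterministic policy that always returns the same ranking.
deterministic = {("a", "b", "c"): 1.0}

exp_uniform = expected_exposure(uniform)
exp_det = expected_exposure(deterministic)
print(exp_uniform, exp_det)
```

The deterministic policy concentrates exposure on the top item, while the uniform policy gives every item the same expected exposure; enumerating permutations is only feasible for toy collections, which is why sampling is used in practice.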
Diaz, Mitra, Ekstrand, Biega, and Carterette. Evaluating stochastic rankings with expected exposure. In Proc. CIKM. Best full paper honorable mention. (2020)
Wu*, Mitra*, Ma, Diaz, and Liu (*equal contributions). Joint Multisided Exposure Fairness for Recommendation. In Proc. SIGIR. (2022)
A stochastic ranking model samples a ranking from a probability distribution over all possible permutations
of items in the collection, i.e., for the same intent it returns a slightly different ranking on each impression
[Figure: four impressions of the query “restaurants in montreal”, each with a slightly different ranking]
Gradient-based optimization for target exposure
Inputs: items, target exposure
Pipeline: neural scoring function → add independently sampled Gumbel noise → compute smooth rank value → compute exposure using user model → compute average exposure → compute loss with target exposure
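The forward pass of such a pipeline can be sketched with the standard library alone. This is an illustration under several assumptions, not the authors' implementation: fixed scores stand in for the neural scoring function, the smooth rank is a sigmoid-based approximation, the user model is RBP-style with a made-up patience value, the loss is squared error, and the target exposure is invented. A real implementation would use an autodiff framework so that the loss can be backpropagated to the scoring function.

```python
import math
import random

def gumbel_noise(rng):
    """Sample standard Gumbel noise via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def smooth_ranks(scores, temperature=0.1):
    """Smooth rank: rank_i ≈ 1 + sum over j != i of sigmoid((s_j - s_i) / temperature)."""
    ranks = []
    for i, si in enumerate(scores):
        r = 1.0
        for j, sj in enumerate(scores):
            if j != i:
                r += 1.0 / (1.0 + math.exp(-(sj - si) / temperature))
        ranks.append(r)
    return ranks

def average_exposure(scores, n_samples=200, patience=0.5, seed=0):
    """Average RBP-style exposure per item over rankings sampled with Gumbel noise."""
    rng = random.Random(seed)
    totals = [0.0] * len(scores)
    for _ in range(n_samples):
        noisy = [s + gumbel_noise(rng) for s in scores]
        for i, r in enumerate(smooth_ranks(noisy)):
            totals[i] += patience ** (r - 1.0)
    return [t / n_samples for t in totals]

def exposure_loss(scores, target):
    """Squared error between average exposure and target exposure."""
    exposure = average_exposure(scores)
    return sum((e - t) ** 2 for e, t in zip(exposure, target))

target = [1.0, 0.375, 0.375]  # made-up target: item 0 on top, the other two share the rest
loss_good = exposure_loss([3.0, 0.0, 0.0], target)
loss_bad = exposure_loss([0.0, 0.0, 3.0], target)
print(loss_good, loss_bad)
```

Sorting scores perturbed by independent Gumbel noise is equivalent to sampling a ranking from a Plackett-Luce distribution over those scores, which is what makes the policy stochastic.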
What query exposes me (or my document)?
Document retrieval
Given a user-specified query, the document retrieval
system retrieves a list of documents from a collection
ranked by their estimated relevance to the query
Exposing Query Identification (EQI)
Given a document and a specified document
retrieval system, the exposing query retrieval
system retrieves a list of queries from a log
ranked by how prominently the document is
exposed by the query when searched against
the document retrieval system
EQI for dense retrieval models
(a.k.a., “Reverse”-ing ANCE)
In our preliminary study, we try to learn new metric spaces (ANCE-append
and ANCE-residual) such that a nearest neighbor search in the new space
approximates reverse nearest neighbor search in the original dense retrieval
embedding space
We compare their performance with nearest neighbor search in the original
metric space (ANCE-reverse)
Given a query, the document
retrieval system performs a nearest
neighbor search over documents
Given a document, the EQI system
performs a reverse nearest neighbor
search over queries
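The task definition can be made concrete with a brute-force version: run every logged query through the retrieval system and keep those that surface the document, ordered by the rank at which it appears. This sketch only defines the target behavior; it is not ANCE-append, ANCE-residual, or any other method from the study, and the two-dimensional vectors, the cutoff `k`, and the function names are made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, doc_vecs, k):
    """Document retrieval: top-k document ids by inner product with the query."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

def exposing_queries(doc_id, query_vecs, doc_vecs, k=2):
    """Brute-force EQI: logged queries whose top-k contains doc_id,
    ordered by the rank at which the document is shown."""
    hits = []
    for q, vec in query_vecs.items():
        top = retrieve(vec, doc_vecs, k)
        if doc_id in top:
            hits.append((top.index(doc_id) + 1, q))
    return [q for _, q in sorted(hits)]

doc_vecs = {"d1": [1.0, 0.0], "d2": [0.7, 0.7], "d3": [0.0, 1.0]}
query_vecs = {"q1": [1.0, 0.0], "q2": [0.6, 0.8], "q3": [0.0, 1.0]}

eqi_d2 = exposing_queries("d2", query_vecs, doc_vecs)
eqi_d1 = exposing_queries("d1", query_vecs, doc_vecs)
print(eqi_d2, eqi_d1)
```

The brute-force loop costs one full retrieval per logged query, which is the motivation for approximating the reverse search with a single nearest neighbor search in a learned space.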
Li, Li, Mitra, Diaz, and Biega. Exposing Query Identification for Search Transparency. In Proc. TheWebConf, ACM. (2022)
Structured knowledge
An exciting challenge for deep learning in the context
of information access is how to handle multimodal and
structured information
Applications include automatic knowledge base
construction (e.g., Project Alexandria), structured item
retrieval (e.g., product search, document retrieval with
multiple fields), and KB-augmented machine learning
My personal research is moving towards this direction
and hopefully I will have more to report on this in the
coming year 🙂
Zamani, Mitra, Song, Craswell, and Tiwary. Neural Ranking Models with Multiple Document Fields. In Proc. WSDM. (2018)
Thank you!
@UnderdogGeek bmitra@microsoft.com
http://bit.ly/fntir-neural

What’s next for deep learning for Search?

  • 1. What’s next for deep learning for Search? Bhaskar Mitra Principal Researcher Microsoft Research
  • 2. A (personal) reflection on the journey so far…
  • 5. First wave of deep document ranking models Trained on 200K English queries from Bing.com (proprietary dataset) Trained on 95K Chinese queries from Sogou.com (public dataset) Trained using BM25-based weak labels 2017
  • 6. But are we making real progress? ¯_(ツ)_/¯ Passage Ranking Leaderboard MS MARCO passage ranking benchmark launches with 0.5M+ English training queries The myth of “no neural IR model worked before BERT”: first generation deep ranking models, e.g., Duet and KNRM, and their variants, outperform most traditional IR methods by reasonable margin on the MS MARCO benchmark 2018
  • 7. Did neural IR really have a “weak baselines” problem? I will argue NO: (i) pre-MS MARCO, most neural IR papers benchmarking on Robust04 were NOT trained on large labeled datasets and represent a biased sample of neural IR papers, and (ii) even in those cases there is little evidence that these papers employed any weaker baselines than non-neural IR papers Why is this important? 1. Can’t expect every paper to beat SOTA. Focus is often on hypothesis testing. Check for appropriate baselines, not SOTA baselines. Improvements over emerging methods (not yet SOTA) should be encouraged. 2. Early generation deep ranking models provided many useful insights and created the demand for large training dataset. 👏🏽
  • 8. But we had a BIGGER benchmarking problem The lack of public IR benchmarks with large scale training data led to: Comparisons under low-data regime  e.g., older TREC collections with few hundred queries Comparisons on (semi-)synthetic benchmarks  e.g., TREC CAR Comparisons under weak supervision training Comparisons on corpus of language different than what the models were designed for Performance of deep models typically improve with more training data (image source: The Duet paper) Non-standardized benchmarks also required reimplementation of baselines (specially, neural baselines) which in turn meant that many of them were under-tuned, in turn, contributing to the “weak baselines” problem!
  • 9. The year of BERT Three months after the BERT paper hits arXiv the first BERT-based reranking model achieves 0.359 MRR compared to previous state-of-the-art of 0.281 on MS MARCO 2019
  • 10. TREC Deep Learning Track 2019
  • 11. Document Ranking Leaderboard + TREC 2020 Deep Learning Track Also, this guy 😒👇 2020 Next year (2023) we will be hosting the 5th edition of the TREC Deep Learning Track 🙌🏽
  • 12. Are we making real progress? Deep learning models have gone from novelty to commodity in communities like SIGIR, with parallels to how learning-to-rank models “took over” IR. Deep models have demonstrated large gains over previous state-of-the-art, and the gap continues to grow. But we must be careful of how we interpret “progress”, and interrogate the evidence when it is largely based on a single benchmark
  • 13. Internal validity. Would improvements on the current dataset hold on a different sample from the same dataset for the same task? External validity. Would improvements on the current dataset hold on a different dataset with different distribution for the same task or on a different (but closely related) task? If MS MARCO’s training data is only useful for achieving good results on MS MARCO’s test set, then it’s less useful for the IR community. Important: transfer learning from MS MARCO to other benchmarks • TREC DL is transfer learning (MS MARCO sparse binary labels → NIST’s 5-point labels) • Promising results: MS MARCO → Robust04, TREC-COVID, TREC-CAsT Under bootstrap analysis we find the leaderboard rankings are stable! 😊👍
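The slide does not spell out how the bootstrap analysis was run. As a minimal sketch of the general idea, assuming per-query metric values (e.g., reciprocal rank) are available for each system, one can resample queries with replacement and check how often the system ordering is preserved. The system names and scores below are made up for illustration.

```python
import random

def bootstrap_rank_stability(per_query_scores, num_samples=1000, seed=0):
    """Fraction of bootstrap resamples (over queries) in which the system
    ordering matches the ordering computed on the full query set."""
    rng = random.Random(seed)
    systems = sorted(per_query_scores)
    num_queries = len(per_query_scores[systems[0]])

    def ordering(query_ids):
        means = {s: sum(per_query_scores[s][q] for q in query_ids) / len(query_ids)
                 for s in systems}
        return sorted(systems, key=lambda s: -means[s])

    full_order = ordering(range(num_queries))
    agree = 0
    for _ in range(num_samples):
        # Resample query ids with replacement (same size as the original set).
        sample = [rng.randrange(num_queries) for _ in range(num_queries)]
        if ordering(sample) == full_order:
            agree += 1
    return full_order, agree / num_samples

# Hypothetical per-query reciprocal ranks for three systems.
scores = {
    "system_a": [1.0, 0.5, 1.0, 1.0, 0.5, 1.0, 1.0, 0.5],
    "system_b": [0.5, 0.25, 0.5, 0.5, 0.25, 0.5, 0.5, 0.25],
    "system_c": [0.1, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1],
}
order, stability = bootstrap_rank_stability(scores, num_samples=200)
print(order, stability)
```

A stability value close to 1 suggests the ordering does not depend on which queries happened to be sampled; values well below 1 would call for caution when reading small leaderboard gaps.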
  • 14. BERT-scale deep ranking models in production search systems Industry impact
  • 15. “Give a small boy a hammer, and he will find that everything he encounters needs pounding.” - Abraham Kaplan https://en.wikipedia.org/wiki/Law_of_the_instrument
  • 16. What’s next? Short term. In the short term, the larger focus is likely to remain on making these models more practically useful in real production systems. The emphasis will be on effectiveness, efficiency, and robustness. Long term. In the longer term, our increased optimization capabilities should support and encourage bolder visions for IR. The three directions I am personally excited about are: (i) interplay between IR data structures and deep ranking models, (ii) stochastic ranking and direct optimization for target exposure, and (iii) structured knowledge extraction / modeling / retrieval.
  • 17. Case studies: Scaling BERT-based relevance models to long documents and full retrieval settings Nogueira and Cho (2019) How do we scale to longer documents? How do we scale to full retrieval from large collections? Bottleneck: Peak GPU memory during training Challenge: Collection size vs. expected online query response time
  • 18. Challenges in scaling BERT to longer inputs At training time, the GPU memory requirement for BERT’s self-attention layers grows quadratically w.r.t. input length. The quadratic complexity is a direct result of storing all the n²-dimensional attention matrices in GPU memory during training for easier backpropagation. Potential workarounds: • Trade off GPU memory against training time by proactively releasing GPU memory during the forward pass at the cost of redundant recomputation during the backward pass • Find cheaper approximations to self-attention layers • Reduce the input space by running BERT on select passages in the document
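To make the quadratic growth concrete, here is a back-of-the-envelope sketch. It assumes a BERT-base-sized model (12 layers, 12 heads), float32 values, and batch size 1, and counts only the attention matrices themselves; real training runs hold many other activations too, so these numbers are a lower bound for illustration, not a measurement.

```python
def attention_matrix_bytes(seq_len, num_layers=12, num_heads=12,
                           bytes_per_value=4, batch_size=1):
    """Bytes needed to keep every n x n attention matrix from the forward pass."""
    return batch_size * num_layers * num_heads * seq_len * seq_len * bytes_per_value

for n in (512, 1024, 2048, 4096):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:5d}: {mib:8.0f} MiB")
```

Doubling the input length quadruples the memory needed for the attention matrices, which is why simply feeding longer documents to the model quickly becomes impractical.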
  • 19. Trade-off GPU memory and training time using gradient checkpointing At training time, during the forward pass the model caches all intermediate outputs in GPU memory so that during the backward pass we can easily compute the gradients of a layer’s outputs w.r.t. its inputs. Under gradient checkpointing (Chen et al., 2016), in contrast, the model only saves intermediate outputs at specific checkpoints; during the backward pass, missing intermediate outputs are recomputed based on the closest preceding checkpoint(s). For Transformers, this allows us to store only one n²-dimensional attention matrix in GPU memory at any given time! (Figure: activations stored in GPU memory without vs. with gradient checkpointing.) https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9
  • 20. Cheaper approximation: Transformer → Conformer Conformer is an alternative to Transformer that employs a separable self-attention layer with linear GPU memory complexity (as opposed to Transformer’s quadratic complexity) and is augmented with additional convolutional layers to model short-distance attention Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence for Document Retrieval. ArXiv preprint. (2020) Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence at TREC 2020 Deep Learning Track. In Proc. TREC. (2020) Mitra, Hofstätter, Zamani, and Craswell. Improving Transformer-Kernel Ranking Model Using Conformer and Query Term Independence. In Proc. SIGIR. (2021) TREC 2020 Deep Learning Track (Document Ranking Task)
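The trade-off can be illustrated with a toy example. This is a sketch only: it uses a chain of scalar tanh layers in plain Python rather than a Transformer, and the function names and the `every` parameter are made up for illustration. In practice a deep learning framework's checkpointing utility would do this work.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    """d layer(w, x) / dx, which needs the layer *input* x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_full_cache(weights, x):
    """Standard backprop: cache the input of every layer."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, x_in)
    return grad, len(inputs)  # gradient, number of cached values

def grad_checkpointed(weights, x, every=4):
    """Cache inputs only at segment boundaries; recompute the rest on demand."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad, peak = 1.0, 0
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        # Recompute this segment's inputs from the closest preceding checkpoint.
        inputs, h = [], checkpoints[start]
        for w in segment:
            inputs.append(h)
            h = layer(w, h)
        # Values held at once: all checkpoints plus one recomputed segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, x_in in zip(reversed(segment), reversed(inputs)):
            grad *= layer_grad(w, x_in)
    return grad, peak

weights = [0.9, -1.1, 0.7, 1.3, -0.8, 1.2, 0.6, -1.0,
           1.1, 0.5, -0.9, 1.4, 0.8, -1.2, 1.0, 0.7]
g_full, mem_full = grad_full_cache(weights, 0.3)
g_ckpt, mem_ckpt = grad_checkpointed(weights, 0.3, every=4)
print(g_full, mem_full, g_ckpt, mem_ckpt)
```

Both functions return the same gradient; the checkpointed version holds fewer values at any one time but runs each layer's forward computation twice.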
  • 21. Reduce the task: Passage-based document ranking Hofstätter, Zamani, Mitra, Craswell, and Hanbury. Local Self-Attention over Long Text for Efficient Document Retrieval. In Proc. SIGIR. (2020) Hofstätter, Mitra, Zamani, Craswell, and Hanbury. Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. In Proc. SIGIR. (2021) Kazai, Mitra, Dong, Zamani, Craswell, and Yang. Less is Less: When Are Snippets Insufficient for Human vs Machine Relevance Estimation? In Proc. ECIR. (2022) Strategy 1: Run BERT on first-k tokens from the document Considering only the first-k tokens leads to underestimation of relevance and consequently under-retrieval of longer documents (Hofstätter et al., 2020). Recent studies (Kazai et al., 2022) have also analyzed when single snippets are insufficient for both human and machine learning based relevance estimation. Strategy 2: Run BERT on multiple windows of k-tokens each from the document This is the approach proposed by Hofstätter et al. (2020). However, the number of windows can be large corresponding to longer documents and running BERT too many times per query-document pair can also be prohibitively costly. Strategy 3: Run BERT on windows of text pre-selected using cheaper models This is the approach proposed by Hofstätter et al. (2021). The approach (IDCM) was motivated by cascaded ranking pipelines, but in this case the cascades are employed within-document for passage selection.
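A small sketch of Strategies 1 and 2 follows. The term-overlap scorer is a toy stand-in for a BERT relevance score, and taking the maximum over windows is an assumption made here for simplicity, not necessarily the aggregation used in the cited papers.

```python
def first_k(tokens, k):
    """Strategy 1: keep only the first k tokens of the document."""
    return tokens[:k]

def sliding_windows(tokens, k, stride):
    """Strategy 2: windows of up to k tokens, starting every `stride` tokens."""
    return [tokens[i:i + k] for i in range(0, len(tokens), stride)]

def overlap_score(query_terms, passage):
    """Toy stand-in for a BERT score: fraction of query terms present."""
    return sum(t in passage for t in query_terms) / len(query_terms)

doc = ("intro text about something else entirely then "
       "neural ranking models appear late").split()
query = ["neural", "ranking"]
k = 4

score_first_k = overlap_score(query, first_k(doc, k))
windows = sliding_windows(doc, k, stride=2)
score_windows = max(overlap_score(query, w) for w in windows)
print(score_first_k, score_windows, len(windows))
```

In this example the relevant text appears late in the document, so the first-k strategy misses it entirely, while the windowed strategy finds it at the cost of one scorer call per window.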
  • 22. Intra-Document Cascaded Model (IDCM) We employ a cascaded architecture: a cheaper model ranks-and-prunes candidate passages and a costlier BERT model inspects only selected passages from the document. The cheaper model is trained via knowledge distillation from the BERT model Hofstätter, Mitra, Zamani, Craswell, and Hanbury. Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. In Proc. SIGIR. (2021)
  • 23. Challenges in scaling BERT to full retrieval Broadly two sets of approaches have emerged: Dense retrieval and Query Term Independent (QTI) models; both precompute document representations at indexing time and require very little computation at query response time Mitra, Rosset, Hawking, Craswell, Diaz, and Yilmaz. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking Using Deep Neural Networks. ArXiv preprint. (2019) Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence for Document Retrieval. ArXiv preprint. (2020) Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence at TREC 2020 Deep Learning Track. In Proc. TREC. (2020) Mitra, Hofstätter, Zamani, and Craswell. Improving Transformer-Kernel Ranking Model Using Conformer and Query Term Independence. In Proc. SIGIR. (2021) Nogueira and Cho (2019) Dense retrieval: Xiong et al. (2021), Qu et al. (2021), Hofstätter et al. (2021), and others QTI: Mitra et al. (2019), Nogueira et al. (2019), Dai and Callan (2020), and others
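The cascade can be sketched as follows. Both scoring functions are hypothetical toy stand-ins (the real cheap model is a distilled neural model and the costly one is BERT), and the max aggregation over selected passages is an assumption for illustration.

```python
calls = {"costly": 0}

def cheap_score(query_terms, passage):
    """Hypothetical cheap selector: count of query-term occurrences."""
    return sum(passage.count(t) for t in query_terms)

def costly_score(query_terms, passage):
    """Hypothetical stand-in for BERT: fraction of distinct query terms covered."""
    calls["costly"] += 1
    return sum(t in passage for t in query_terms) / len(query_terms)

def cascade_score(query_terms, passages, keep=2):
    """Rank passages with the cheap model, then run the costly model
    only on the top `keep` passages."""
    ranked = sorted(passages, key=lambda p: cheap_score(query_terms, p), reverse=True)
    selected = ranked[:keep]
    return max(costly_score(query_terms, p) for p in selected)

passages = [
    ["site", "navigation", "and", "header"],
    ["neural", "models", "for", "search"],
    ["neural", "ranking", "with", "bert"],
    ["footer", "and", "contact", "details"],
]
doc_score = cascade_score(["neural", "ranking"], passages, keep=2)
print(doc_score, calls["costly"])
```

The costly scorer runs on two of the four passages here; the saving grows with document length as long as the cheap selector keeps the passages that matter.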
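Under the query term independence assumption, the score of a document is a sum of per-term contributions, so those contributions can be computed at indexing time and stored in an inverted-index-like structure. A minimal sketch, with made-up precomputed scores standing in for the output of a learned model:

```python
from collections import defaultdict

# Index time: a (hypothetical) learned model has already scored each
# (term, document) pair; only non-zero entries are stored.
precomputed = {
    "neural":  {"d1": 1.5, "d2": 0.5},
    "ranking": {"d1": 1.0, "d3": 1.25},
    "bert":    {"d2": 2.0},
}

def qti_retrieve(query_terms, index, top_k=10):
    """Query time: sum precomputed per-term scores; no model inference needed."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, s in index.get(term, {}).items():
            scores[doc_id] += s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

results = qti_retrieve(["neural", "ranking"], precomputed)
print(results)
```

Query-time work reduces to posting-list lookups and additions, which is what makes this family of models usable for full retrieval over large collections.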
  • 24. A note about distillation A popular recipe involves pretraining/finetuning large models and then knowledge distillation to smaller models that can be deployed in real-world retrieval systems Pretrained model e.g., 24-layer BERT Finetuned model e.g., 24-layer BERT Distilled model e.g., smaller model, or dense retriever, or QTI, or early-stage cascade model
  • 25. Benchmarking by jointly considering effectiveness, efficiency, and robustness Hofstätter, Craswell, Mitra, Zamani, and Hanbury. Are We There Yet? A Decision Framework for Replacing Term-Based Retrieval with Dense Retrieval Systems. ArXiv preprint. (2022) A preliminary framework: • Identify key measures of effectiveness, efficiency, and robustness • Measures can either be traded off against each other, or act as guardrails • Apply value-laden and business-informed trade-offs between cost and robustness measures to define aggregate measures, e.g., • Identify the set of acceptable solutions (again) based on value-laden and business-informed trade-off decision boundaries
  • 26. Robustness and calibration In practical scenarios, it is important that these deep models perform robustly across different domains and distributional shifts over time. It is also important that we look beyond point estimates of relevance and consider calibrated uncertainties in model predictions Cohen, Mitra, Rosset, Hofmann, and Croft. Cross Domain Regularization for Neural Ranking Models using Adversarial Learning. In Proc. SIGIR. (2018) Cohen, Mitra, Lesota, Rekabsaz, and Eickhoff. Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models. In Proc. SIGIR. (2021) Cohen, Du, Mitra, Mercurio, Rekabsaz, and Eickhoff. Inconsistent Ranking Assumptions in Medical Search and Their Downstream Consequences. In Proc. SIGIR. (2022)
  • 27. Looking further down the road…
  • 28. IR data structures and deep ranking models Kraska et al. (2018) were one of the earliest to propose learned index structures where predictive machine learning is employed to speed up search over classical data structures I believe there’s a significant opportunity to design data-structure-aware deep ranking models and to employ deep learning to directly optimize for efficiency in our search stacks
  • 29. Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018) Case study: Optimizing first stage retrieval using reinforcement learning
  • 30. Large scale IR systems trade-off search result quality and query response time In Bing, we have a candidate generation stage followed by multiple rank and prune stages Typically, we apply machine learning in the re-ranking stages In this work, we explore reinforcement learning for effective and efficient candidate generation Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018)
  • 31. In Bing, the index is distributed over multiple machines For candidate generation, on each machine the documents are linearly scanned using a match plan Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018)
  • 32. When a query comes in, it is automatically categorized, and a pre-defined match plan is selected. A match rule defines the condition that a document should satisfy to be selected as a candidate. A match plan consists of a sequence of match rules and corresponding stopping criteria. A stopping criterion decides when the index scan using a particular match rule should terminate, and whether the matching process should continue with the next match rule, or conclude, or reset to the beginning of the index Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018)
  • 33. Match plans influence the trade-off between effectiveness and efficiency E.g., long queries with rare intents may require expensive match plans that consider body text and search deeper into the index In contrast, for popular navigational queries a shallow scan against URL and title metastreams may be sufficient Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018)
  • 34. E.g., Query: halloween costumes Match rule: mrA → (halloween ∈ A|U|B|T ) ∧ (costumes ∈ A|U|B|T ) Query: facebook login Match rule: mrB → (facebook ∈ U|T ) Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018)
  • 35. During execution, two accumulators are tracked. u: the number of blocks accessed from disk. v: the cumulative number of term matches in all inspected documents. A stopping criterion sets a threshold for each; when either threshold is met, the scan using that particular match rule terminates. Matching may then continue with a new match rule, or terminate, or restart from the beginning Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018)
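The scan with its two accumulators can be sketched as below. The data layout is hypothetical: documents are dictionaries with fields named A, U, B, and T as in the match rule notation (presumably anchor, URL, body, and title), blocks are plain lists, and counting one match per query term per document is an assumption about how v is accumulated.

```python
def make_doc(doc_id, A=(), U=(), B=(), T=()):
    """Hypothetical document record with one set of words per field."""
    return {"id": doc_id, "A": set(A), "U": set(U), "B": set(B), "T": set(T)}

def term_hits(doc, terms, fields):
    """Number of query terms found in at least one of the allowed fields."""
    return sum(any(t in doc[f] for f in fields) for t in terms)

def run_match_rule(blocks, terms, fields, max_blocks, max_term_matches):
    """Scan index blocks in order while tracking the accumulators (u, v)."""
    u = v = 0
    candidates = []
    for block in blocks:
        if u >= max_blocks or v >= max_term_matches:
            break  # stopping criterion met
        u += 1     # one more block accessed
        for doc in block:
            hits = term_hits(doc, terms, fields)
            v += hits
            if hits == len(terms):  # every term matched in an allowed field
                candidates.append(doc["id"])
    return candidates, u, v

blocks = [
    [make_doc("d1", U=["facebook", "com"], T=["facebook", "login"]),
     make_doc("d2", U=["example", "org"], B=["facebook", "news"])],
    [make_doc("d3", U=["fb", "help"], T=["facebook", "help"]),
     make_doc("d4", B=["unrelated"])],
    [make_doc("d5", U=["facebook"])],
]

# A rule in the style of mrB: the term must occur in the URL or title.
shallow = run_match_rule(blocks, ["facebook"], ["U", "T"],
                         max_blocks=2, max_term_matches=10)
# A broader rule over all fields, stopped early by the term-match threshold.
broad = run_match_rule(blocks, ["facebook"], ["A", "U", "B", "T"],
                       max_blocks=3, max_term_matches=2)
print(shallow, broad)
```

The first call stops because the block budget is exhausted; the second stops because enough term matches have accumulated, before the remaining blocks are read.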
  • 36. Optimizing query evaluations using reinforcement learning Learn a policy π_θ : S → A which maximizes the cumulative discounted reward R, where γ is the discount rate. We employ table-based Q-learning. State space: index blocks accessed (u_t) and term matches (v_t). Action space: Reward function: g(d_i) is the relevance of the i-th document estimated based on the subsequent L1 ranker score, considering only the top n documents. (Figure: the agent picks a match rule to run against the index, observes the accumulators (u, v), and is rewarded with relevance discounted by index blocks accessed.) Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018)
  • 37. Search as a mediator of exposure Traditional IR is concerned with ranking items according to relevance. These information access systems deployed at web-scale mediate what information / items get exposure. In many search scenarios it may be more appropriate to optimize for exposure rather than rank-based metrics; it may also allow us to move towards richer presentation schemes beyond ranked lists. Also important in the context of fair ranking!
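For readers unfamiliar with table-based Q-learning, the textbook update rule is sketched below. The state is a (discretized) pair of accumulator values; the action names, reward value, and hyperparameters are hypothetical, and the reward function from the paper is not reproduced here.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.5):
    """Standard tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

q = defaultdict(float)              # Q-table, all entries start at 0
actions = ["mrA", "mrB", "stop"]    # hypothetical action names
q[((2, 5), "mrA")] = 2.0            # hypothetical existing estimate

# One transition: in state (u=1, v=2) the agent ran mrB, received reward 1.0,
# and ended up in state (u=2, v=5).
q_update(q, state=(1, 2), action="mrB", reward=1.0,
         next_state=(2, 5), actions=actions)
print(q[((1, 2), "mrB")])
```

With these numbers the target is 1.0 + 0.5 × 2.0 = 2.0, and the entry moves halfway from 0 towards it.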
  • 38. Stochastic ranking and expected exposure In recommendation, Diaz et al. (2020) define a stochastic ranking policy 𝜋𝑢, conditioned on user 𝑢 ∈ U, as a probability distribution over all permutations of items in the collection. The expected exposure of an item 𝑑 for user 𝑢 can then be computed as follows: Here, 𝑝(𝜖|𝑑,𝜎) can be computed using a user browsing model like RBP as discussed previously Diaz, Mitra, Ekstrand, Biega, and Carterette. Evaluating stochastic rankings with expected exposure. In Proc. CIKM. Best full paper honorable mention. (2020) Wu*, Mitra*, Ma, Diaz, and Liu (*equal contributions). Joint Multisided Exposure Fairness for Recommendation. In Proc. SIGIR. (2022) A stochastic ranking model samples a ranking from a probability distribution over all possible permutations of items in the collection, i.e., for the same intent (e.g., “restaurants in montreal”) it returns a slightly different ranking on each impression
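One way to read the expected exposure of an item is as the average of 𝑝(𝜖|𝑑,𝜎) over rankings 𝜎 drawn from the policy. The sketch below estimates that average by sampling. It assumes a Plackett-Luce sampling policy and an RBP-style user model in which the item at rank k (counting from 0) is exposed with probability patience^k; the item names, weights, and patience value are illustrative.

```python
import random

def sample_ranking(weights, rng):
    """Sample a permutation from a Plackett-Luce distribution over items."""
    remaining = dict(weights)
    ranking = []
    while remaining:
        items = list(remaining)
        pick = rng.choices(items, weights=[remaining[i] for i in items], k=1)[0]
        ranking.append(pick)
        del remaining[pick]
    return ranking

def expected_exposure(weights, patience=0.5, num_samples=5000, seed=0):
    """Monte Carlo estimate of each item's expected exposure."""
    rng = random.Random(seed)
    totals = {item: 0.0 for item in weights}
    for _ in range(num_samples):
        for rank, item in enumerate(sample_ranking(weights, rng)):
            totals[item] += patience ** rank  # RBP-style exposure at this rank
    return {item: t / num_samples for item, t in totals.items()}

# Three equally weighted items should share exposure roughly equally.
exposure = expected_exposure({"a": 1.0, "b": 1.0, "c": 1.0})
print(exposure)
```

A deterministic ranker would instead give all of the top-rank exposure to a single item on every impression, even when the items are equally relevant.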
  • 39. Gradient-based optimization for target exposure Pipeline (from the slide figure): items → neural scoring function → add independently sampled Gumbel noise → compute smooth rank value → compute exposure using user model → compute average exposure → compute loss with target exposure. Diaz, Mitra, Ekstrand, Biega, and Carterette. Evaluating stochastic rankings with expected exposure. In Proc. CIKM. Best full paper honorable mention. (2020) Wu*, Mitra*, Ma, Diaz, and Liu (*equal contributions). Joint Multisided Exposure Fairness for Recommendation. In Proc. SIGIR. (2022)
  • 40. What query exposes me (or my document)? Document retrieval: Given a user-specified query, the document retrieval system retrieves a list of documents from a collection ranked by their estimated relevance to the query. Exposing Query Identification (EQI): Given a document and a specified document retrieval system, the exposing query retrieval system retrieves a list of queries from a log ranked by how prominently the document is exposed by the query when searched against the document retrieval system
  • 41. EQI for dense retrieval models (a.k.a., “Reverse”-ing ANCE) In our preliminary study, we try to learn new metric spaces (ANCE-append and ANCE-residual) such that a nearest neighbor search in the new space approximates reverse nearest neighbor search in the original dense retrieval embedding space. We compare their performance with nearest neighbor search in the original metric space (ANCE-reverse). Given a query, the document retrieval system performs a nearest neighbor search over documents. Given a document, the EQI system performs a reverse nearest neighbor search over queries. Li, Li, Mitra, Diaz, and Biega. Exposing Query Identification for Search Transparency. In Proc. TheWebConf, ACM. (2022)
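The forward computation of such a pipeline might look like the sketch below. It is written in plain Python without automatic differentiation, so it only shows how the loss is assembled; an actual implementation would run these steps inside a framework that can backpropagate through them. The sigmoid-based smooth rank, the RBP-style user model, the squared-error loss, and all parameter values are assumptions for illustration.

```python
import math
import random

def gumbel(rng):
    """Sample standard Gumbel noise (clamped to avoid log(0))."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def smooth_ranks(scores, temperature=0.1):
    """Smooth rank approximation: 1 + sum of sigmoids of score gaps."""
    ranks = []
    for i, s_i in enumerate(scores):
        r = 1.0
        for j, s_j in enumerate(scores):
            if i != j:
                r += 1.0 / (1.0 + math.exp(-(s_j - s_i) / temperature))
        ranks.append(r)
    return ranks

def exposure_loss(scores, target, num_samples=50, patience=0.5, seed=0):
    """Average exposure over noisy rankings, then squared error to the target."""
    rng = random.Random(seed)
    avg = [0.0] * len(scores)
    for _ in range(num_samples):
        noisy = [s + gumbel(rng) for s in scores]
        for i, r in enumerate(smooth_ranks(noisy)):
            avg[i] += patience ** (r - 1.0) / num_samples  # RBP-style exposure
    loss = sum((a - t) ** 2 for a, t in zip(avg, target))
    return avg, loss

# Hypothetical scores for three items and a hypothetical target exposure.
avg_exposure, loss = exposure_loss([5.0, 0.0, 0.0], target=[1.0, 0.375, 0.375])
print(avg_exposure, loss)
```

Every step after the noise sampling is a smooth function of the scores, which is what would allow gradients of the loss to flow back into the scoring function.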
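As a point of reference for what the learned metric spaces approximate, exact reverse nearest neighbor search can be done by brute force: rank all documents for every logged query and keep the queries that place the target document near the top. The sketch below uses tiny made-up 2-d embeddings and dot-product similarity; it scales poorly, which is the motivation for learning an approximation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_of(doc_id, query_vec, doc_vecs):
    """1-based rank of doc_id when documents are sorted by dot product."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked.index(doc_id) + 1

def exposing_queries(doc_id, query_vecs, doc_vecs, top_k=1):
    """Queries that retrieve doc_id within the top_k, best rank first."""
    hits = [(rank_of(doc_id, qv, doc_vecs), q) for q, qv in query_vecs.items()]
    return [q for r, q in sorted(hits) if r <= top_k]

# Hypothetical embeddings.
doc_vecs = {"d1": (1.0, 0.0), "d2": (0.0, 1.0), "d3": (0.5, 0.5)}
query_vecs = {"q1": (1.0, 0.0), "q2": (0.0, 1.0), "q3": (1.0, 0.25)}

exposing_d1 = exposing_queries("d1", query_vecs, doc_vecs, top_k=1)
exposing_d3 = exposing_queries("d3", query_vecs, doc_vecs, top_k=1)
print(exposing_d1, exposing_d3)
```

Note the asymmetry: d3 is reasonably close to every query, yet no query ranks it first, so a plain nearest neighbor search from the document side would give a misleading answer.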
  • 42. Structured knowledge An exciting challenge for deep learning in the context of information access is how to handle multimodal and structured information Applications include automatic knowledge base construction (e.g., Project Alexandria), structured item retrieval (e.g., product search, document retrieval with multiple fields), and KB-augmented machine learning My personal research is moving towards this direction and hopefully I will have more to report on this in the coming year 🙂 Zamani, Mitra, Song, Craswell, and Tiwary. Neural Ranking Models with Multiple Document Fields. In Proc. WSDM. (2018)

Editor's Notes

  • #3: We are all familiar with the saying “if all you have is a hammer then everything looks like a nail”.