What’s next for deep learning for Search?

Bhaskar Mitra
Principal Researcher
Microsoft Research

A (personal) reflection on the journey so far…

First wave of deep document ranking models
Trained on 200K English queries from Bing.com (proprietary dataset)
Trained on 95K Chinese queries from Sogou.com (public dataset)
Trained using BM25-based weak labels
2017

But are we making
real progress?
¯\_(ツ)_/¯
Passage Ranking Leaderboard
MS MARCO passage ranking benchmark
launches with 0.5M+ English training queries
The myth of “no neural IR model worked before BERT”: first generation deep
ranking models, e.g., Duet and KNRM, and their variants, outperform most
traditional IR methods by a reasonable margin on the MS MARCO benchmark
2018

Did neural IR really have a “weak baselines” problem?
I will argue NO: (i) pre-MS MARCO, most neural IR papers benchmarking on Robust04 were NOT trained on
large labeled datasets and represent a biased sample of neural IR papers, and (ii) even in those cases there is
little evidence that these papers employed any weaker baselines than non-neural IR papers
Why is this important?
1. Can’t expect every paper to beat SOTA. Focus is often on hypothesis testing. Check for appropriate
baselines, not SOTA baselines. Improvements over emerging methods (not yet SOTA) should be encouraged.
2. Early generation deep ranking models provided many useful
insights and created the demand for large training datasets. 👏🏽

But we had a BIGGER benchmarking problem
The lack of public IR benchmarks with large scale training
data led to:
• Comparisons under low-data regimes
  e.g., older TREC collections with a few hundred queries
• Comparisons on (semi-)synthetic benchmarks
  e.g., TREC CAR
• Comparisons under weak supervision training
• Comparisons on corpora in a language different from the one
  the models were designed for
Performance of deep models typically
improves with more training data
(image source: The Duet paper)
Non-standardized benchmarks also required reimplementation of baselines (especially
neural baselines), which meant that many of them were under-tuned, in turn
contributing to the “weak baselines” problem!

The year of BERT
Three months after the BERT paper hit arXiv, the first BERT-based reranking model achieved
0.359 MRR on MS MARCO, compared to the previous state of the art of 0.281
2019

Document Ranking Leaderboard
+
TREC 2020 Deep Learning Track
Also, this guy
😒👇
2020
Next year (2023) we will be
hosting the 5th edition of the
TREC Deep Learning Track 🙌🏽

Are we making
real progress?
Deep learning models have gone from
novelty to commodity in communities
like SIGIR—parallels to how learning-to-
rank models “took over” IR
Deep models have demonstrated large
gains over previous state-of-the-art,
and the gap continues to grow
But we must be careful of how we
interpret “progress”, and interrogate the
evidence when it is largely based on a
single benchmark

Internal validity. Would improvements on the
current dataset hold on a different sample from
the same dataset for the same task?
External validity. Would improvements on the
current dataset hold on a different dataset with
different distribution for the same task or on a
different (but closely related) task?
If MS MARCO’s training data is only useful for
achieving good results on MS MARCO’s test
set, then it’s less useful for the IR community
Important: transfer learning from MS MARCO
to other benchmarks
• TREC DL is transfer learning (MS MARCO
sparse binary labels → NIST’s 5-point
labels)
• Promising results: MS MARCO → Robust04,
TREC-COVID, TREC-CAsT
Under bootstrap analysis
we find the leaderboard
rankings are stable!
😊👍
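A minimal sketch of what such a bootstrap check could look like, assuming per-query metric values (e.g., reciprocal rank) are available for every system on the same query set. The function name, the resampling scheme, and the toy scores are illustrative assumptions, not the analysis actually run on the leaderboard.

```python
import random

def bootstrap_rank_stability(per_query_scores, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples that reproduce the full-sample system ordering.

    per_query_scores: dict mapping system name -> list of per-query metric
    values, with queries aligned across systems (hypothetical input format).
    """
    rng = random.Random(seed)
    systems = list(per_query_scores)
    n_queries = len(per_query_scores[systems[0]])

    def ordering(query_ids):
        # Mean metric per system over the given (possibly repeated) query ids.
        means = {
            s: sum(per_query_scores[s][q] for q in query_ids) / len(query_ids)
            for s in systems
        }
        return sorted(systems, key=lambda s: means[s], reverse=True)

    reference = ordering(range(n_queries))
    same = 0
    for _ in range(n_resamples):
        # Resample queries with replacement.
        sample = [rng.randrange(n_queries) for _ in range(n_queries)]
        same += ordering(sample) == reference
    return same / n_resamples

# Toy data: three systems that are cleanly separated on every query.
toy_scores = {"sys_a": [1.0] * 10, "sys_b": [0.5] * 10, "sys_c": [0.0] * 10}
stability = bootstrap_rank_stability(toy_scores, n_resamples=200)
```

A value close to 1.0 suggests the ordering does not depend on which queries happened to be sampled, which is the internal-validity question raised above.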

BERT-scale deep ranking models in
production search systems
Industry impact

“Give a small boy a hammer, and he will find
that everything he encounters needs pounding.”
- Abraham Kaplan
https://en.wikipedia.org/wiki/Law_of_the_instrument

What’s next?
Short term. In the short term, the main focus is likely to remain on making these
models more practically useful in real production systems. The emphasis will be
on effectiveness, efficiency, and robustness.
Long term. In the longer term, our increased optimization capabilities should
support and encourage bolder visions for IR. The three directions I am personally
excited about are: (i) Interplay between IR data structures and deep ranking
models, (ii) stochastic ranking and direct optimization for target exposure, and
(iii) structured knowledge extraction / modeling / retrieval.

Case studies: Scaling BERT-based relevance models to
long documents and full retrieval settings
Nogueira and Cho (2019)
How do we scale to
longer documents?
How do we scale to
full retrieval from
large collections?
Bottleneck: Peak GPU
memory during training
Challenge: Collection size
vs. expected online query
response time

Challenges in scaling
BERT to longer inputs
At training time, the GPU memory requirement for
BERT’s self-attention layers grows quadratically w.r.t.
input length 𝑛
The quadratic complexity is a direct result of storing all
the 𝑛²-dimensional attention matrices in GPU memory
during training for easier backpropagation
Potential workarounds:
• Trade off GPU memory and training time by
proactively releasing GPU memory during forward
pass at the cost of redundant re-computations
during backward pass
• Find cheaper approximation to self-attention layers
• Reduce the input space by running BERT on select
passages in the document

Trade off GPU memory and training time using
gradient checkpointing
At training time, during the forward-pass the model
caches all intermediate outputs in GPU memory so
that during backward-pass we can easily compute
the gradients of a layer’s outputs w.r.t. its inputs
Under gradient checkpointing (Chen et al., 2016), in
contrast, the model only saves intermediate outputs
at specific checkpoints; during the backward-pass,
missing intermediate outputs are recomputed based
on the closest preceding checkpoint(s)
For Transformers, this allows us to store only one
𝑛²-dimensional attention matrix in GPU memory at
any given time!
[Figure: intermediate outputs stored in GPU memory without vs. with gradient checkpointing. Image source: https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9]
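To make the memory/compute trade-off concrete, here is a toy sketch in plain Python, assuming a chain of scalar tanh layers instead of a Transformer. All names (`layer`, `grad_full_cache`, `grad_checkpointed`, the `every` parameter) are hypothetical; real frameworks provide their own checkpointing utilities. The point is only that both routines return the same gradient while the checkpointed one keeps fewer intermediate values at once.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_full_cache(weights, x0):
    """Backprop that caches every layer input during the forward pass."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, x_in)
    return grad, len(inputs)  # gradient, number of cached values

def grad_checkpointed(weights, x0, every=3):
    """Backprop that caches only every `every`-th layer input and recomputes the rest."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        # Recompute this segment's inputs from its checkpoint.
        seg_inputs, x = [], checkpoints[start]
        for w in segment:
            seg_inputs.append(x)
            x = layer(w, x)
        # Upper bound on values held at once: checkpoints plus one segment.
        peak = max(peak, len(checkpoints) + len(seg_inputs))
        for w, x_in in zip(reversed(segment), reversed(seg_inputs)):
            grad *= layer_grad(w, x_in)
    return grad, peak

toy_weights = [0.5, -1.2, 0.8, 1.1, 0.3, -0.7, 0.9, 1.5, 0.6]
g_full, stored_full = grad_full_cache(toy_weights, 0.4)
g_ckpt, stored_ckpt = grad_checkpointed(toy_weights, 0.4, every=3)
```

The recomputation in the backward pass is the "redundant re-computation" cost; the smaller cache is the memory saving.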

Cheaper approximation: Transformer → Conformer
Conformer is an alternative to Transformer that employs a separable self-attention layer with linear GPU
memory complexity (as opposed to Transformer’s quadratic complexity) and is augmented with additional
convolutional layers to model short-distance attention
Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence for Document Retrieval. ArXiv preprint. (2020)
Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence at TREC 2020 Deep Learning Track. In Proc. TREC. (2020)
Mitra, Hofstätter, Zamani, and Craswell. Improving Transformer-Kernel Ranking Model Using Conformer and Query Term Independence. In Proc. SIGIR. (2021)
TREC 2020 Deep Learning Track
(Document Ranking Task)

Reduce the task: Passage-based document ranking
Hofstätter, Zamani, Mitra, Craswell, and Hanbury. Local Self-Attention over Long Text for Efficient Document Retrieval. In Proc. SIGIR. (2020)
Hofstätter, Mitra, Zamani, Craswell, and Hanbury. Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. In Proc. SIGIR. (2021)
Kazai, Mitra, Dong, Zamani, Craswell, and Yang. Less is Less: When Are Snippets Insufficient for Human vs Machine Relevance Estimation? In Proc. ECIR. (2022)
Strategy 1: Run BERT on first-k tokens from the document
Considering only the first-k tokens leads to underestimation of relevance and consequently
under-retrieval of longer documents (Hofstätter et al., 2020). Recent studies (Kazai et al., 2022)
have also analyzed when single snippets are insufficient for both human and machine learning
based relevance estimation.
Strategy 2: Run BERT on multiple windows of k-tokens each from the document
This is the approach proposed by Hofstätter et al. (2020). However, for longer documents the number of windows
can be large, and running BERT many times per query-document pair can be
prohibitively costly.
Strategy 3: Run BERT on windows of text pre-selected using cheaper models
This is the approach proposed by Hofstätter et al. (2021). The approach
(IDCM) was motivated by cascaded ranking pipelines, but in this case the
cascades are employed within-document for passage selection.
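Strategies 1 and 2 can be contrasted with a small sketch. It assumes a generic `score_fn(query, tokens)` standing in for the BERT scorer, and it assumes the document score under Strategy 2 is the maximum over window scores; that aggregation is an illustrative choice, not necessarily the one used in the cited papers.

```python
def first_k_score(score_fn, query, doc_tokens, k):
    """Strategy 1: score only the first k tokens of the document."""
    return score_fn(query, doc_tokens[:k])

def windowed_score(score_fn, query, doc_tokens, k, stride=None):
    """Strategy 2: score every window of k tokens and aggregate.

    Assumption: aggregation is max over windows.
    """
    stride = stride or k
    windows = [
        doc_tokens[i:i + k]
        for i in range(0, max(len(doc_tokens), 1), stride)
    ]
    return max(score_fn(query, w) for w in windows)

def toy_overlap(query, tokens):
    # Hypothetical stand-in for a neural scorer: count of matched query terms.
    return len(set(query) & set(tokens))

toy_doc = ["intro"] * 4 + ["deep", "ranking"]
toy_query = ["deep", "ranking"]
s_first = first_k_score(toy_overlap, toy_query, toy_doc, k=3)
s_windows = windowed_score(toy_overlap, toy_query, toy_doc, k=3)
```

In this toy document the relevant text appears after the first k tokens, so the first-k score misses it while the windowed score does not, at the price of one scorer call per window.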

Intra-Document
Cascaded Model (IDCM)
We employ a cascaded architecture: a cheaper model
ranks and prunes candidate passages, and a costlier BERT
model inspects only the selected passages from the document
The cheaper model is trained via knowledge distillation
from the BERT model
Hofstätter, Mitra, Zamani, Craswell, and Hanbury. Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. In Proc. SIGIR. (2021)
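The within-document cascade can be sketched as below. `cheap_fn` and `costly_fn` are hypothetical callables standing in for the distilled selector and the BERT model, `keep` is an assumed pruning budget, and the max aggregation over selected passages is an assumption made for brevity rather than the aggregation used in IDCM.

```python
def cascade_document_score(cheap_fn, costly_fn, query, passages, keep=2):
    """Score a document by running costly_fn only on passages chosen by cheap_fn."""
    # Stage 1: the cheap model ranks all passages and prunes to the top `keep`.
    selected = sorted(
        passages, key=lambda p: cheap_fn(query, p), reverse=True
    )[:keep]
    # Stage 2: the costly model inspects only the selected passages.
    # Assumption: the document score is the max passage score.
    return max(costly_fn(query, p) for p in selected)

def toy_cheap(query, passage):
    # Hypothetical cheap scorer: number of query terms present.
    return sum(term in passage for term in query)

def toy_costly(query, passage):
    # Hypothetical costly scorer (stand-in for BERT).
    return 2.0 * sum(term in passage for term in query)

toy_passages = [["a"], ["deep", "ranking"], ["deep"], ["b"]]
doc_score = cascade_document_score(
    toy_cheap, toy_costly, ["deep", "ranking"], toy_passages, keep=2
)
```

With four passages and `keep=2`, the costly scorer runs twice instead of four times.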

Challenges in scaling
BERT to full retrieval
Broadly, two sets of approaches have emerged:
dense retrieval and Query Term Independent
(QTI) models; both precompute document
representations at indexing time and require
very little computation at query response time
Mitra, Rosset, Hawking, Craswell, Diaz, and Yilmaz. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking Using Deep Neural Networks. ArXiv preprint. (2019)
Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence for Document Retrieval. ArXiv preprint. (2020)
Mitra, Hofstätter, Zamani, and Craswell. Conformer-Kernel with Query Term Independence at TREC 2020 Deep Learning Track. In Proc. TREC. (2020)
Mitra, Hofstätter, Zamani, and Craswell. Improving Transformer-Kernel Ranking Model Using Conformer and Query Term Independence. In Proc. SIGIR. (2021)
Nogueira and Cho (2019)
Dense retrieval: Xiong et al. (2021), Qu et al.
(2021), Hofstätter et al. (2021), and others
QTI: Mitra et al. (2019), Nogueira et al.
(2019), Dai and Callan (2020), and others
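A minimal sketch of the query term independence idea, assuming the document score is the sum of per-term scores so that every term-document score can be precomputed and stored in an inverted index. The `term_doc_score` callable is a hypothetical stand-in for a neural model (here it is just a term count), and the function names are illustrative.

```python
from collections import defaultdict

def build_qti_index(docs, term_doc_score, vocabulary):
    """Indexing time: precompute score(term, doc) for every vocabulary term."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in vocabulary:
            s = term_doc_score(term, text)
            if s > 0:
                index[term][doc_id] = s
    return index

def qti_search(index, query_terms, top_k=10):
    """Query time: look up and sum the precomputed per-term scores."""
    totals = defaultdict(float)
    for term in query_terms:
        for doc_id, s in index.get(term, {}).items():
            totals[doc_id] += s
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

def toy_term_score(term, text):
    # Hypothetical stand-in for a learned term-document scorer.
    return float(text.split().count(term))

toy_docs = {"d1": "deep ranking models for search", "d2": "cooking recipes"}
toy_index = build_qti_index(toy_docs, toy_term_score, {"deep", "search", "recipes"})
results = qti_search(toy_index, ["deep", "search"])
```

The query-time work is a handful of dictionary lookups and additions, which is why this family of models fits the response-time budget of full retrieval.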

A note about distillation
A popular recipe involves pretraining/finetuning large models and
then knowledge distillation to smaller models that can be deployed in
real-world retrieval systems
Pretrained model (e.g., 24-layer BERT)
→ Finetuned model (e.g., 24-layer BERT)
→ Distilled model (e.g., smaller model, or dense retriever, or QTI, or early-stage cascade model)

Benchmarking by jointly considering effectiveness,
efficiency, and robustness
Hofstätter, Craswell, Mitra, Zamani, and Hanbury. Are We There Yet? A Decision Framework for Replacing Term-Based Retrieval with Dense Retrieval Systems. ArXiv preprint. (2022)
A preliminary framework:
• Identify key measures of effectiveness, efficiency, and robustness
• Measures can either be traded off against each other, or act as
guardrails
• Apply value-laden and business-informed trade-offs between
cost and robustness measures to define aggregate measures, e.g.,
• Identify the set of acceptable solutions (again) based on value-
laden and business-informed trade-off decision boundaries
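One way to read the framework as code, purely as a sketch: guardrails filter out unacceptable systems, and a utility function encodes the trade-off among the remaining ones. The measure names, thresholds, weights, and system numbers below are all made-up placeholders, not values from the cited paper.

```python
def acceptable_systems(systems, guardrails, utility):
    """Return the names of systems that pass every guardrail, best utility first.

    systems: name -> dict of measures (hypothetical layout).
    guardrails: measure name -> predicate that must hold.
    utility: function mapping a dict of measures to a single aggregate number.
    """
    passing = {
        name: measures
        for name, measures in systems.items()
        if all(check(measures[key]) for key, check in guardrails.items())
    }
    return sorted(passing, key=lambda name: utility(passing[name]), reverse=True)

# Placeholder numbers for illustration only.
toy_systems = {
    "system_a": {"effectiveness": 0.50, "latency_ms": 20},
    "system_b": {"effectiveness": 0.62, "latency_ms": 60},
    "system_c": {"effectiveness": 0.70, "latency_ms": 900},
}
toy_guardrails = {"latency_ms": lambda v: v <= 100}  # assumed latency budget

def toy_utility(m):
    # Assumed trade-off: each millisecond costs 0.001 effectiveness points.
    return m["effectiveness"] - 0.001 * m["latency_ms"]

ranked = acceptable_systems(toy_systems, toy_guardrails, toy_utility)
```

Here `system_c` is excluded by the guardrail despite being the most effective, which is the kind of decision the framework is meant to make explicit.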

Robustness and calibration
In practical scenarios, it is important that these
deep models perform robustly across different
domains and distributional shifts over time
It is also important that we look beyond point
estimates of relevance and consider calibrated
uncertainties in model predictions
Cohen, Mitra, Rosset, Hofmann, and Croft. Cross Domain Regularization for Neural Ranking Models using Adversarial Learning. In Proc. SIGIR. (2018)
Cohen, Mitra, Lesota, Rekabsaz, and Eickhoff. Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models. In Proc. SIGIR. (2021)
Cohen, Du, Mitra, Mercurio, Rekabsaz, and Eickhoff. Inconsistent Ranking Assumptions in Medical Search and Their Downstream Consequences. In Proc. SIGIR. (2022)

Looking further down the road…

IR data structures and deep ranking models
Kraska et al. (2018) were among the earliest to propose learned index structures where predictive
machine learning is employed to speed up search over classical data structures
I believe there’s a significant opportunity to design data-structure-aware deep ranking models
and to employ deep learning to directly optimize for efficiency in our search stacks

Rosset, Jose, Ghosh, Mitra, and Tiwary. Optimizing Query Evaluations Using Reinforcement Learning for Web Search. In Proc. SIGIR. (2018)
Case study:
Optimizing first stage retrieval
using reinforcement learning

Large-scale IR systems trade off search result quality and query response time
In Bing, we have a candidate generation stage followed by multiple rank and prune stages
Typically, we apply machine learning in the re-ranking stages
In this work, we explore reinforcement learning for effective and efficient candidate generation

In Bing, the index is distributed over multiple machines
For candidate generation, on each machine the documents are linearly scanned using a match plan

When a query comes in, it is automatically
categorized, and a pre-defined match plan
is selected
A match rule defines the condition that a
document should satisfy to be selected as a
candidate
A match plan consists of a sequence of
match rules, and corresponding stopping
criteria
The stopping criterion decides when the
index scan using a particular match rule
should terminate, and whether the matching
process should continue with the next match
rule, conclude, or reset to the beginning
of the index

Match plans influence the
trade-off between
effectiveness and efficiency
E.g., long queries with rare
intents may require expensive
match plans that consider
body text and search deeper
into the index
In contrast, for popular
navigational queries a shallow
scan against URL and title
metastreams may be sufficient

E.g.,
Query: halloween costumes
Match rule: mrA → (halloween ∈ A|U|B|T ) ∧ (costumes ∈ A|U|B|T )
Query: facebook login
Match rule: mrB → (facebook ∈ U|T )

During execution, two accumulators are tracked
u: the number of blocks accessed from disk
v: the cumulative number of term matches in all inspected documents
A stopping criterion sets a threshold for each; when either threshold is met, the scan using
that particular match rule terminates
Matching may then continue with a new match rule, terminate, or restart from the beginning of the index
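The scan with accumulators and stopping thresholds can be sketched as follows. This is a simplified illustration, not Bing's implementation: it assumes one document per index block, represents each document as a mapping from field code (A, U, B, T) to a set of terms, and resets the accumulators for each rule. The names `run_match_rule`, `u_max`, and `v_max` are hypothetical.

```python
def run_match_rule(index, rule, u_max, v_max, start=0):
    """Scan the index with one match rule until a stopping threshold is met.

    index: list of documents, each a dict field -> set of terms.
    rule: list of (term, fields) pairs; a document is a candidate if every
          term occurs in at least one of its listed fields (a conjunction).
    Returns (candidates, u, v, next_position).
    """
    candidates, u, v = [], 0, 0
    pos = start
    while pos < len(index) and u < u_max and v < v_max:
        doc = index[pos]
        u += 1  # assumption: one document per index block
        matches = sum(
            any(term in doc.get(field, ()) for field in fields)
            for term, fields in rule
        )
        v += matches  # cumulative term matches over inspected documents
        if matches == len(rule):
            candidates.append(pos)
        pos += 1
    return candidates, u, v, pos

# mrB from the example above: facebook must occur in URL or title.
mr_b = [("facebook", ("U", "T"))]
toy_index = [
    {"U": {"facebook"}, "T": {"facebook", "login"}},
    {"B": {"facebook"}},
    {"T": {"facebook"}},
]
scan_result = run_match_rule(toy_index, mr_b, u_max=2, v_max=10)
```

With `u_max=2` the scan stops after two blocks, so the third document is never inspected even though it would match; a match plan would then decide whether to continue with another rule.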

Optimizing query evaluations using reinforcement learning
Learn a policy π_θ : S → A which maximizes the
cumulative discounted reward R, where γ is the
discount rate
We employ table-based Q-learning
State space: index blocks accessed (u_t) and term
matches (v_t)
Action space:
Reward function:
g(di) is the relevance of the ith document estimated based on
the subsequent L1 ranker score—considering only top n
documents
[Figure: agent-environment loop between the agent and the index. State: accumulators (u, v); action: match rule; reward: relevance discounted by index blocks accessed]
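For readers unfamiliar with table-based Q-learning, a generic sketch of the update rule follows. It assumes the textbook definition of the cumulative discounted reward, R = Σ_t γ^t r_t, since the formula on the slide is not recoverable here. The state encoding (binned accumulators), the action names, and the hyperparameters are hypothetical and are not taken from the cited paper.

```python
from collections import defaultdict

def discounted_return(rewards, gamma):
    """Cumulative discounted reward: sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; q maps (state, action) -> value."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

# Hypothetical setup: states are binned (u, v) accumulators,
# actions are match rules plus a stop action.
toy_actions = ["mrA", "mrB", "stop"]
q_table = defaultdict(float)
new_value = q_learning_update(
    q_table, state=(0, 0), action="mrB", reward=1.0,
    next_state=(1, 1), actions=toy_actions,
)
```

Because the table starts at zero, the first update moves the entry by `alpha * reward`; repeated updates over logged queries would fill in the table.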

Search as a mediator
of exposure
Traditional IR is concerned with ranking items
according to relevance
These information access systems deployed at
web-scale mediate what information / items
get exposure
In many search scenarios it may be more
appropriate to optimize for exposure rather
than rank-based metrics; it may also allow us to
move towards richer presentation schemes
beyond ranked lists
Also important in the context of fair ranking!

Stochastic ranking and expected exposure
In recommendation, Diaz et al. (2020) define a stochastic ranking policy 𝜋𝑢, conditioned on user 𝑢 ∈ U, as a
probability distribution over all permutations of items in the collection
The expected exposure of an item 𝑑 for user 𝑢 can then be computed as follows:
Here, 𝑝(𝜖|𝑑,𝜎) can be computed using a user browsing model like RBP as discussed previously
Diaz, Mitra, Ekstrand, Biega, and Carterette. Evaluating stochastic rankings with expected exposure. In Proc. CIKM. Best full paper honorable mention. (2020)
Wu*, Mitra*, Ma, Diaz, and Liu (*equal contributions). Joint Multisided Exposure Fairness for Recommendation. In Proc. SIGIR. (2022)
A stochastic ranking model samples a ranking from a probability distribution over all possible permutations
of items in the collection—i.e., for the same intent it returns a slightly different ranking on each impression
[Figure: four impressions of the query “restaurants in montreal”, each returning a slightly different ranking]
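A small sketch of how expected exposure could be computed, reading it as the expectation of 𝑝(𝜖|𝑑,𝜎) over rankings 𝜎 drawn from the policy. It assumes an RBP-style browsing model in which the item at rank k (1-indexed) receives exposure patience^(k-1); the explicit policy table, the `patience` value, and the item names are illustrative assumptions.

```python
def expected_exposure(policy, patience=0.5):
    """Expected exposure of each item under a stochastic ranking policy.

    policy: dict mapping a ranking (tuple of items) -> probability.
    Assumption: RBP-style exposure, patience ** position for 0-indexed position.
    """
    exposure = {}
    for ranking, prob in policy.items():
        for position, item in enumerate(ranking):
            exposure[item] = exposure.get(item, 0.0) + prob * patience ** position
    return exposure

# A policy that shows the two possible orderings equally often.
toy_policy = {("a", "b"): 0.5, ("b", "a"): 0.5}
toy_exposure = expected_exposure(toy_policy)

# A deterministic policy concentrates exposure on the top item.
fixed_exposure = expected_exposure({("a", "b"): 1.0})
```

Under the randomized policy both items receive the same expected exposure, whereas the deterministic ranking always favors the first item; that difference is what stochastic ranking exploits.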

Gradient-based optimization for target exposure
Pipeline (inputs: items, target exposure):
1. neural scoring function
2. add independently sampled Gumbel noise
3. compute smooth rank value
4. compute exposure using user model
5. compute average exposure
6. compute loss with target exposure
Diaz, Mitra, Ekstrand, Biega, and Carterette. Evaluating stochastic rankings with expected exposure. In Proc. CIKM. Best full paper honorable mention. (2020)
Wu*, Mitra*, Ma, Diaz, and Liu (*equal contributions). Joint Multisided Exposure Fairness for Recommendation. In Proc. SIGIR. (2022)
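The forward pass of such a pipeline can be sketched in plain Python. This is an illustration under several assumptions: smooth ranks are computed with pairwise sigmoids, exposure follows an RBP-style patience^(rank-1) model, and the loss is a squared error against the target exposure. None of these choices or names is confirmed by the slide, and an autodiff framework would be needed to actually backpropagate through it.

```python
import math
import random

def gumbel_noise(rng):
    # Standard Gumbel sample; the clamp avoids log(0).
    return -math.log(-math.log(max(rng.random(), 1e-12)))

def smooth_ranks(scores, temperature=0.1):
    """Differentiable stand-in for rank: 1 + sum of sigmoids of score gaps."""
    ranks = []
    for i, s_i in enumerate(scores):
        r = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                z = max(-50.0, min(50.0, (s_j - s_i) / temperature))
                r += 1.0 / (1.0 + math.exp(-z))
        ranks.append(r)
    return ranks

def exposure_loss(scores, target_exposure, n_samples=50, patience=0.5,
                  temperature=0.1, seed=0):
    """Average exposure over sampled rankings and its squared error to the target."""
    rng = random.Random(seed)
    avg = [0.0] * len(scores)
    for _ in range(n_samples):
        noisy = [s + gumbel_noise(rng) for s in scores]
        for i, r in enumerate(smooth_ranks(noisy, temperature)):
            avg[i] += patience ** (r - 1.0) / n_samples
    loss = sum((a - t) ** 2 for a, t in zip(avg, target_exposure))
    return loss, avg

# Hypothetical scores and an equal-exposure target for two items.
toy_loss, toy_avg = exposure_loss([10.0, 0.0], target_exposure=[0.75, 0.75])
```

Sorting scores perturbed by independent Gumbel noise is what makes the ranking stochastic; the smooth rank is what would let gradients flow back to the scoring function.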

What query exposes me (or my document)?
Document retrieval
Given a user-specified query, the document retrieval
system retrieves a list of documents from a collection
ranked by their estimated relevance to the query
Exposing Query Identification (EQI)
Given a document and a specified document
retrieval system, the exposing query retrieval
system retrieves a list of queries from a log
ranked by how prominently the document is
exposed by the query when searched against
the document retrieval system

EQI for dense retrieval models
(a.k.a., “Reverse”-ing ANCE)
In our preliminary study, we try to learn new metric spaces (ANCE-append
and ANCE-residual) such that a nearest neighbor search in the new space
approximates reverse nearest neighbor search in the original dense retrieval
embedding space
We compare their performance with nearest neighbor search in the original
metric space (ANCE-reverse)
Given a query, the document
retrieval system performs a nearest
neighbor search over documents
Given a document, the EQI system
performs a reverse nearest neighbor
search over queries
Li, Li, Mitra, Diaz, and Biega. Exposing Query Identification for Search Transparency. In Proc. TheWebConf, ACM. (2022)
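As a reference point, the exhaustive version of this reverse search can be sketched as below: for every logged query, find the rank at which the document would be retrieved, then order queries by that rank. This brute-force baseline assumes dot-product similarity over toy embeddings; it is not ANCE-append, ANCE-residual, or ANCE-reverse, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_of_document(query_vec, doc_vecs, doc_id):
    """1-indexed rank of doc_id when documents are sorted by dot product."""
    target = dot(query_vec, doc_vecs[doc_id])
    return 1 + sum(
        dot(query_vec, vec) > target
        for other_id, vec in doc_vecs.items()
        if other_id != doc_id
    )

def exposing_queries(doc_id, query_vecs, doc_vecs, top_k=10):
    """Queries from the log, ordered by how prominently they expose doc_id."""
    ranked = sorted(
        query_vecs,
        key=lambda q: rank_of_document(query_vecs[q], doc_vecs, doc_id),
    )
    return ranked[:top_k]

# Toy two-dimensional embeddings (hypothetical).
toy_doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
toy_query_vecs = {"q1": [1.0, 0.1], "q2": [0.1, 1.0]}
exposing = exposing_queries("d1", toy_query_vecs, toy_doc_vecs)
```

The cost is one full retrieval per logged query, which is why learning a metric space where a single nearest neighbor search approximates this result is attractive.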

Structured knowledge
An exciting challenge for deep learning in the context
of information access is how to handle multimodal and
structured information
Applications include automatic knowledge base
construction (e.g., Project Alexandria), structured item
retrieval (e.g., product search, document retrieval with
multiple fields), and KB-augmented machine learning
My personal research is moving in this direction
and hopefully I will have more to report on this in the
coming year 🙂
Zamani, Mitra, Song, Craswell, and Tiwary. Neural Ranking Models with Multiple Document Fields. In Proc. WSDM. (2018)

Thank you!
@UnderdogGeek bmitra@microsoft.com
http://bit.ly/fntir-neural
