Made to Measure: Ranking Evaluation using Elasticsearch
Daniel Schneiter
Elastic{Meetup} #41, Zürich, April 9, 2019
Original author: Christoph Büscher
!2
“If you cannot measure it, you cannot improve it!”
AlmostAnActualQuoteTM by Lord Kelvin
https://commons.wikimedia.org/wiki/File:Portrait_of_William_Thomson,_Baron_Kelvin.jpg
!3
How good is your search?
Image by Kecko
https://www.flickr.com/photos/kecko/18146364972 (CC BY 2.0)
!4
Image by Muff Wiggler
https://www.flickr.com/photos/muffwiggler/5605240619 (CC BY 2.0)
!5
Ranking Evaluation
A repeatable way to quickly measure the quality of search results
over a wide range of user needs
!6
REPEATABILITY
• Automate - don’t make people look at screens
• no gut-feeling / “management-driven” ad-hoc search ranking
!7
SPEED
• fast iterations instead of long waits (e.g. in A/B testing)
!8
MEASURE QUALITY
• numeric output
• support of different metrics
• define “quality“ in your domain
!9
USER NEEDS
• optimize across a wider range of use cases (aka “information needs”)
• think about what the majority of your users want
• collect data to discover what is important for your use case
!10
Prerequisites for Ranking Evaluation
1. Define a set of typical information needs
2. For each search case, rate your documents for those information needs
   (either binary relevant/non-relevant or on some graded scale)
3. If full labelling is not feasible, choose a small subset instead
   (often the case because the document set is too large)
4. Choose a metric to calculate.
   Some good metrics are already defined in Information Retrieval research:
   • Precision@K, (N)DCG, ERR, Reciprocal Rank etc.
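To make steps 1–3 concrete, here is a tiny, purely illustrative Python sketch of what such a judgment set could look like; the queries, document IDs, and the 0–3 grading scale are invented placeholders, not part of the original talk.

```python
# Hypothetical judgment set: one entry per information need, mapping
# document IDs to relevance grades (0 = irrelevant ... 3 = highly relevant).
judgments = {
    "hotel amsterdam": {"doc_17": 3, "doc_42": 1, "doc_99": 0},
    "cheap flights zurich": {"doc_03": 2, "doc_51": 3},
}
```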
!11
Search Evaluation Continuum
(diagram: evaluation approaches plotted by speed, slow to fast, and by preparation time, little to lots)
• people looking at screens
• some sort of unit test
• QA assisted by scripts
• user studies
• A/B testing
• Ranking Evaluation
!12
Where Ranking Evaluation can help
• Development: guiding design decisions; enabling quick iteration
• Production: monitoring changes; spotting degradations
• Communication tool: helps define “search quality” more clearly; forces stakeholders to “get real” about their expectations
!13
Elasticsearch ‘rank_eval’ API
!14
Ranking Evaluation API
GET /my_index/_rank_eval
{
  "metric": {
    "mean_reciprocal_rank": { [...] }
  },
  "templates": [{ [...] }],
  "requests": [{
    "template_id": "my_query_template",
    "ratings": [...],
    "params": {
      "query_string": "hotel amsterdam",
      "field": "text"
    }
    [...]
  }]
}
• introduced in 6.2 (still an experimental API)
• joint work between
  • Christoph Büscher (@dalatangi)
  • Isabel Drost-Fromm (@MaineC)
• Inputs:
  • a set of search requests (“information needs”)
  • document ratings for each request
  • a metric definition; currently available:
    • Precision@K
    • Discounted Cumulative Gain / (N)DCG
    • Expected Reciprocal Rank / ERR
    • MRR, …
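The request above elides most details and uses a template. As a rough sketch of what a complete call can look like, the following Python snippet (using the third-party requests library) sends one non-templated request with inline ratings. The index name, field, document IDs, ratings, and the unsecured localhost cluster are all assumptions made for illustration, not part of the original talk.

```python
# Minimal sketch of a complete _rank_eval call (assumptions: unsecured
# cluster on localhost:9200, an index "my_index" with a "text" field,
# made-up document IDs and ratings).
import requests

body = {
    "metric": {
        "precision": {"relevant_rating_threshold": 1, "k": 5}
    },
    "requests": [
        {
            "id": "hotel_amsterdam",  # one information need
            "request": {"query": {"match": {"text": "hotel amsterdam"}}},
            "ratings": [
                {"_index": "my_index", "_id": "doc_17", "rating": 1},
                {"_index": "my_index", "_id": "doc_42", "rating": 0},
            ],
        }
    ],
}

resp = requests.get("http://localhost:9200/my_index/_rank_eval", json=body)
resp.raise_for_status()
print(resp.json())
```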

!15
Ranking Evaluation API Details
metric:
"metric": {
  "precision": {
    "relevant_rating_threshold": "2",
    "k": 5
  }
}
requests:
"requests": [{
  "id": "JFK_query",
  "request": {
    "query": { [...] }
  },
  "ratings": [...]
},
… other use cases …]
ratings:
"ratings": [{
  "_id": "3054546",
  "rating": 3
}, {
  "_id": "5119376",
  "rating": 1
}, [...]]
!16
_rank_eval response
{
  "rank_eval": {
    "metric_score": 0.431,                 ← overall score
    "details": {
      "my_query_id1": {                    ← details per query
        "metric_score": 0.6,
        "unrated_docs": [                  ← maybe rate those?
          { "_index": "idx", "_id": "1960795" }, [...]
        ],
        "hits": [...],
        "metric_details": {                ← details about metric
          "precision": {
            "relevant_docs_retrieved": 6,
            "docs_retrieved": 10
          }
        }
      },
      "my_query_id2": { [...] }
    }
  }
}
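As a rough sketch of how such a response could be post-processed (for example to spot degradations), the function below prints the overall score, flags queries below an arbitrary example threshold, and lists unrated documents. Field names follow the response shown above; the function name and threshold are my own.

```python
# Sketch: summarize a _rank_eval response (e.g. resp.json() from the
# earlier snippet). Depending on the version, the payload may or may not
# be wrapped in a "rank_eval" object, so both shapes are accepted.
def report(response, threshold=0.5):
    body = response.get("rank_eval", response)
    print(f"overall score: {body['metric_score']:.3f}")
    for query_id, detail in body["details"].items():
        flag = "  <- below threshold" if detail["metric_score"] < threshold else ""
        print(f"  {query_id}: {detail['metric_score']:.3f}{flag}")
        unrated = [doc["_id"] for doc in detail.get("unrated_docs", [])]
        if unrated:
            print(f"    unrated docs (maybe rate those?): {unrated}")
```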
!17
How to get document ratings?
1. Define a set of typical information needs of your users
   (e.g. analyze logs, ask product management / customers etc.)
2. For each case, get a small set of candidate documents
   (e.g. via a very broad query)
3. Rate those documents with respect to the underlying information need
   • can initially be done by you or other stakeholders;
     later maybe outsourced, e.g. via Mechanical Turk
4. Iterate!
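For step 2, a deliberately broad query is often enough to surface candidates. Here is a hedged sketch (same assumed local cluster, index, and field names as before) that prints document IDs and a short text preview so a human can rate them.

```python
# Sketch: collect candidate documents for manual rating via a broad match
# query. Index name, field, and query string are placeholders.
import requests

def candidates(query_string, index="my_index", field="text", size=20):
    body = {
        "size": size,
        "query": {"match": {field: query_string}},  # broad, recall-oriented
        "_source": [field],
    }
    url = f"http://localhost:9200/{index}/_search"
    hits = requests.get(url, json=body).json()["hits"]["hits"]
    for hit in hits:
        # Print ID plus a short preview so a human can assign a rating.
        print(hit["_id"], str(hit["_source"].get(field, ""))[:80])

candidates("hotel amsterdam")
```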
!18
Metrics currently available
• Precision at K: set-based metric; ratio of relevant docs in the top K results (binary ratings)
• Reciprocal Rank (RR): positional metric; inverse of the rank of the first relevant document (binary ratings)
• Discounted Cumulative Gain (DCG): takes order into account; highly relevant docs score more if they appear earlier in the result list (graded ratings)
• Expected Reciprocal Rank (ERR): motivated by the “cascade model” of search; models the dependency of results on their predecessors (graded ratings)
!19
Precision at K
• In short: “How many good results appear in the first K results?”
  (e.g. the first few pages in the UI)
• supports only boolean relevance judgements
• PROS: easy to understand & communicate
• CONS: least stable across different user needs, e.g. the total number of
  relevant documents for a query influences precision at k

$\mathrm{prec@k} = \frac{|\{\text{relevant docs in top } k\}|}{|\{\text{all results in top } k\}|}$
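Elasticsearch computes this for you, but as a reference, here is a small illustrative Python implementation of the formula above (not the code Elasticsearch uses; function name and example values are made up).

```python
# Sketch: precision@k over a ranked list of document IDs, given the set
# of IDs judged relevant (binary judgements).
def precision_at_k(ranked_ids, relevant_ids, k):
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)

# Example: 3 of the first 5 results are relevant -> 0.6
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e"}, k=5))
```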
!20
Reciprocal Rank
• supports only boolean relevance judgements
• PROS: easy to understand & communicate
• CONS: limited to cases where the number of good results doesn’t matter
• if averaged over a sample of queries Q, it is often called MRR
  (mean reciprocal rank):

$RR = \frac{1}{\text{position of first relevant document}}$

$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$
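Again purely for illustration (not Elasticsearch's implementation), a short sketch of both formulas:

```python
# Sketch: reciprocal rank for one ranked list, and MRR over several queries.
def reciprocal_rank(ranked_ids, relevant_ids):
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0  # no relevant document retrieved

def mean_reciprocal_rank(runs):
    # runs: list of (ranked_ids, relevant_ids) pairs, one per query
    return sum(reciprocal_rank(ids, rel) for ids, rel in runs) / len(runs)
```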
!21
Discounted Cumulative Gain (DCG)
• Predecessor: Cumulative Gain (CG)
  • sums the relevance judgements over the top k results

  $CG = \sum_{i=1}^{k} rel_i$

• DCG takes position into account
  • divides by $\log_2$ at each position

  $DCG = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$

• NDCG (Normalized DCG)
  • divides by the “ideal” DCG for a query (IDCG)

  $NDCG = \frac{DCG}{IDCG}$
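An illustrative sketch of DCG and NDCG with graded judgements (again not the code Elasticsearch uses; example grades are made up):

```python
import math

# Sketch: DCG and NDCG for a list of graded relevance judgements, given
# in the order the documents were returned.
def dcg(relevances, k):
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    ideal = dcg(sorted(relevances, reverse=True), k)  # best possible ordering
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2], k=6))
```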
!22
Expected Reciprocal Rank (ERR)
• cascade-based metric
• supports graded relevance judgements
• the model assumes the user goes through the result list in order
  and is satisfied with the first relevant document
• $R_i$ is the probability that the user stops at position i
• ERR is high when relevant documents appear early

$ERR = \sum_{r=1}^{k} \frac{1}{r} \left( \prod_{i=1}^{r-1} (1 - R_i) \right) R_r$

$R_i = \frac{2^{rel_i} - 1}{2^{rel_{max}}}$

where $rel_i$ is the relevance grade at position i and $rel_{max}$ is the maximal relevance grade.
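A short illustrative sketch of the cascade computation above (not Elasticsearch's implementation; example grades are made up):

```python
# Sketch: Expected Reciprocal Rank for graded judgements, following the
# cascade model above with R_i = (2**rel_i - 1) / 2**rel_max.
def err(relevances, rel_max, k=None):
    k = k or len(relevances)
    score, prob_not_stopped = 0.0, 1.0
    for rank, rel in enumerate(relevances[:k], start=1):
        stop_prob = (2 ** rel - 1) / (2 ** rel_max)
        score += prob_not_stopped * stop_prob / rank
        prob_not_stopped *= (1 - stop_prob)
    return score

print(err([3, 2, 0, 1], rel_max=3))
```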
!23
DEMO TIME
!24
Demo project and data
• Demo uses approx. 1800 documents from the English Wikipedia
• Wikipedia’s Discovery department collects and publishes relevance
  judgements with their Discernatron project
• Bulk data and all query examples are available at
  https://github.com/cbuescher/rankEvalDemo
!25
Q&A
!26
Some questions I have for you…
• How do you measure search relevance currently?
• Did you find anything useful about the ranking evaluation approach?
• Feedback about usability of the API
  (ping me on GitHub or our Discuss forum, @cbuescher)
!27
Further reading
• Manning, Raghavan & Schütze: Introduction to Information Retrieval.
  Cambridge University Press, 2008.
• Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected
  reciprocal rank for graded relevance. Proceedings of the 18th ACM
  Conference on Information and Knowledge Management (CIKM ’09), 621.
• Blog: https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
• Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html
• Discuss: https://discuss.elastic.co/c/elasticsearch (cbuescher)
• GitHub: issues with the :Search/Ranking label (cbuescher)
