Made to Measure: Ranking Evaluation using Elasticsearch
Daniel Schneiter
Elastic{Meetup} #41, Zürich, April 9, 2019
Original author: Christoph Büscher
!2
“If you cannot measure it, you cannot improve it!”
AlmostAnActualQuoteTM by Lord Kelvin
https://commons.wikimedia.org/wiki/File:Portrait_of_William_Thomson,_Baron_Kelvin.jpg
!3
How good is your search?
Image by Kecko
https://www.flickr.com/photos/kecko/18146364972 (CC BY 2.0)
!4
Image by Muff Wiggler
https://www.flickr.com/photos/muffwiggler/5605240619 (CC BY 2.0)
!5
Ranking Evaluation
A repeatable way to quickly measure the quality of search results
over a wide range of user needs
!6
REPEATABILITY
• Automate - don’t make people look at screens
• no gut-feeling / “management-driven” ad-hoc search ranking
!7
SPEED
• fast iterations instead of long waits (e.g. in A/B testing)
!8
MEASURE QUALITY
• numeric output
• support of different metrics
• define “quality“ in your domain
!9
USER NEEDS
• optimize across a wider range of use cases (aka “information needs”)
• think about what the majority of your users want
• collect data to discover what is important for your use case
!10
Prerequisites for Ranking Evaluation
1. Define a set of typical information needs
2. For each search case, rate your documents for those information needs
   (either binary relevant/non-relevant or on some graded scale)
3. If full labelling is not feasible, choose a small subset instead
   (often the case because the document set is too large)
4. Choose a metric to calculate.
   Some good metrics are already defined in Information Retrieval research:
   • Precision@K, (N)DCG, ERR, Reciprocal Rank etc.
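To make steps 1–3 concrete, here is a tiny, purely illustrative Python sketch of what such a judgment set could look like; the queries, document IDs, and the 0–3 grading scale are invented placeholders, not part of the original talk.

```python
# Hypothetical judgment set: one entry per information need, mapping
# document IDs to relevance grades (0 = irrelevant ... 3 = highly relevant).
judgments = {
    "hotel amsterdam": {"doc_17": 3, "doc_42": 1, "doc_99": 0},
    "cheap flights zurich": {"doc_03": 2, "doc_51": 3},
}
```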
!11
Search Evaluation Continuum
(diagram: evaluation approaches plotted by speed, slow to fast, and by preparation time, little to lots)
• people looking at screens
• some sort of unit test
• QA assisted by scripts
• user studies
• A/B testing
• Ranking Evaluation
!12
Where Ranking Evaluation can help
• Development: guiding design decisions; enabling quick iteration
• Production: monitoring changes; spotting degradations
• Communication tool: helps define “search quality” more clearly; forces stakeholders to “get real” about their expectations
!13
Elasticsearch ‘rank_eval’ API
!14
Ranking Evaluation API
GET /my_index/_rank_eval
{
  "metric": {
    "mean_reciprocal_rank": { [...] }
  },
  "templates": [{ [...] }],
  "requests": [{
    "template_id": "my_query_template",
    "ratings": [...],
    "params": {
      "query_string": "hotel amsterdam",
      "field": "text"
    }
    [...]
  }]
}
• introduced in 6.2 (still an experimental API)
• joint work between
  • Christoph Büscher (@dalatangi)
  • Isabel Drost-Fromm (@MaineC)
• Inputs:
  • a set of search requests (“information needs”)
  • document ratings for each request
  • a metric definition; currently available:
    • Precision@K
    • Discounted Cumulative Gain / (N)DCG
    • Expected Reciprocal Rank / ERR
    • MRR, …
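The request above elides most details and uses a template. As a rough sketch of what a complete call can look like, the following Python snippet (using the third-party requests library) sends one non-templated request with inline ratings. The index name, field, document IDs, ratings, and the unsecured localhost cluster are all assumptions made for illustration, not part of the original talk.

```python
# Minimal sketch of a complete _rank_eval call (assumptions: unsecured
# cluster on localhost:9200, an index "my_index" with a "text" field,
# made-up document IDs and ratings).
import requests

body = {
    "metric": {
        "precision": {"relevant_rating_threshold": 1, "k": 5}
    },
    "requests": [
        {
            "id": "hotel_amsterdam",  # one information need
            "request": {"query": {"match": {"text": "hotel amsterdam"}}},
            "ratings": [
                {"_index": "my_index", "_id": "doc_17", "rating": 1},
                {"_index": "my_index", "_id": "doc_42", "rating": 0},
            ],
        }
    ],
}

resp = requests.get("http://localhost:9200/my_index/_rank_eval", json=body)
resp.raise_for_status()
print(resp.json())
```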

!15
Ranking Evaluation API Details
metric:
"metric": {
  "precision": {
    "relevant_rating_threshold": "2",
    "k": 5
  }
}
requests:
"requests": [{
  "id": "JFK_query",
  "request": {
    "query": { [...] }
  },
  "ratings": [...]
},
… other use cases …]
ratings:
"ratings": [{
  "_id": "3054546",
  "rating": 3
}, {
  "_id": "5119376",
  "rating": 1
}, [...]]
!16
_rank_eval response
{
  "rank_eval": {
    "metric_score": 0.431,                 ← overall score
    "details": {
      "my_query_id1": {                    ← details per query
        "metric_score": 0.6,
        "unrated_docs": [                  ← maybe rate those?
          { "_index": "idx", "_id": "1960795" }, [...]
        ],
        "hits": [...],
        "metric_details": {                ← details about metric
          "precision": {
            "relevant_docs_retrieved": 6,
            "docs_retrieved": 10
          }
        }
      },
      "my_query_id2": { [...] }
    }
  }
}
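As a rough sketch of how such a response could be post-processed (for example to spot degradations), the function below prints the overall score, flags queries below an arbitrary example threshold, and lists unrated documents. Field names follow the response shown above; the function name and threshold are my own.

```python
# Sketch: summarize a _rank_eval response (e.g. resp.json() from the
# earlier snippet). Depending on the version, the payload may or may not
# be wrapped in a "rank_eval" object, so both shapes are accepted.
def report(response, threshold=0.5):
    body = response.get("rank_eval", response)
    print(f"overall score: {body['metric_score']:.3f}")
    for query_id, detail in body["details"].items():
        flag = "  <- below threshold" if detail["metric_score"] < threshold else ""
        print(f"  {query_id}: {detail['metric_score']:.3f}{flag}")
        unrated = [doc["_id"] for doc in detail.get("unrated_docs", [])]
        if unrated:
            print(f"    unrated docs (maybe rate those?): {unrated}")
```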
!17
How to get document ratings?
1. Define a set of typical information needs of your users
   (e.g. analyze logs, ask product management / customers etc.)
2. For each case, get a small set of candidate documents
   (e.g. via a very broad query)
3. Rate those documents with respect to the underlying information need
   • can initially be done by you or other stakeholders;
     later maybe outsourced, e.g. via Mechanical Turk
4. Iterate!
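For step 2, a deliberately broad query is often enough to surface candidates. Here is a hedged sketch (same assumed local cluster, index, and field names as before) that prints document IDs and a short text preview so a human can rate them.

```python
# Sketch: collect candidate documents for manual rating via a broad match
# query. Index name, field, and query string are placeholders.
import requests

def candidates(query_string, index="my_index", field="text", size=20):
    body = {
        "size": size,
        "query": {"match": {field: query_string}},  # broad, recall-oriented
        "_source": [field],
    }
    url = f"http://localhost:9200/{index}/_search"
    hits = requests.get(url, json=body).json()["hits"]["hits"]
    for hit in hits:
        # Print ID plus a short preview so a human can assign a rating.
        print(hit["_id"], str(hit["_source"].get(field, ""))[:80])

candidates("hotel amsterdam")
```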
!18
Metrics currently available
• Precision at K: set-based metric; ratio of relevant docs in the top K results (binary ratings)
• Reciprocal Rank (RR): positional metric; inverse of the rank of the first relevant document (binary ratings)
• Discounted Cumulative Gain (DCG): takes order into account; highly relevant docs score more if they appear earlier in the result list (graded ratings)
• Expected Reciprocal Rank (ERR): motivated by the “cascade model” of search; models the dependency of results on their predecessors (graded ratings)
!19
Precision at K
• In short: “How many good results appear in the first K results?”
  (e.g. the first few pages in the UI)
• supports only boolean relevance judgements
• PROS: easy to understand & communicate
• CONS: least stable across different user needs, e.g. the total number of
  relevant documents for a query influences precision at k

$\mathrm{prec@k} = \frac{|\{\text{relevant docs in top } k\}|}{|\{\text{all results in top } k\}|}$
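Elasticsearch computes this for you, but as a reference, here is a small illustrative Python implementation of the formula above (not the code Elasticsearch uses; function name and example values are made up).

```python
# Sketch: precision@k over a ranked list of document IDs, given the set
# of IDs judged relevant (binary judgements).
def precision_at_k(ranked_ids, relevant_ids, k):
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)

# Example: 3 of the first 5 results are relevant -> 0.6
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e"}, k=5))
```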
!20
Reciprocal Rank
• supports only boolean relevance judgements
• PROS: easy to understand & communicate
• CONS: limited to cases where the number of good results doesn’t matter
• if averaged over a sample of queries Q, it is often called MRR
  (mean reciprocal rank):

$RR = \frac{1}{\text{position of first relevant document}}$

$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$
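Again purely for illustration (not Elasticsearch's implementation), a short sketch of both formulas:

```python
# Sketch: reciprocal rank for one ranked list, and MRR over several queries.
def reciprocal_rank(ranked_ids, relevant_ids):
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0  # no relevant document retrieved

def mean_reciprocal_rank(runs):
    # runs: list of (ranked_ids, relevant_ids) pairs, one per query
    return sum(reciprocal_rank(ids, rel) for ids, rel in runs) / len(runs)
```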
!21
Discounted Cumulative Gain (DCG)
• Predecessor: Cumulative Gain (CG)
  • sums the relevance judgements over the top k results

  $CG = \sum_{i=1}^{k} rel_i$

• DCG takes position into account
  • divides by $\log_2$ at each position

  $DCG = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$

• NDCG (Normalized DCG)
  • divides by the “ideal” DCG for a query (IDCG)

  $NDCG = \frac{DCG}{IDCG}$
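An illustrative sketch of DCG and NDCG with graded judgements (again not the code Elasticsearch uses; example grades are made up):

```python
import math

# Sketch: DCG and NDCG for a list of graded relevance judgements, given
# in the order the documents were returned.
def dcg(relevances, k):
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    ideal = dcg(sorted(relevances, reverse=True), k)  # best possible ordering
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2], k=6))
```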
!22
Expected Reciprocal Rank (ERR)
• cascade-based metric
• supports graded relevance judgements
• the model assumes the user goes through the result list in order
  and is satisfied with the first relevant document
• $R_i$ is the probability that the user stops at position i
• ERR is high when relevant documents appear early

$ERR = \sum_{r=1}^{k} \frac{1}{r} \left( \prod_{i=1}^{r-1} (1 - R_i) \right) R_r$

$R_i = \frac{2^{rel_i} - 1}{2^{rel_{max}}}$

where $rel_i$ is the relevance grade at position i and $rel_{max}$ is the maximal relevance grade.
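A short illustrative sketch of the cascade computation above (not Elasticsearch's implementation; example grades are made up):

```python
# Sketch: Expected Reciprocal Rank for graded judgements, following the
# cascade model above with R_i = (2**rel_i - 1) / 2**rel_max.
def err(relevances, rel_max, k=None):
    k = k or len(relevances)
    score, prob_not_stopped = 0.0, 1.0
    for rank, rel in enumerate(relevances[:k], start=1):
        stop_prob = (2 ** rel - 1) / (2 ** rel_max)
        score += prob_not_stopped * stop_prob / rank
        prob_not_stopped *= (1 - stop_prob)
    return score

print(err([3, 2, 0, 1], rel_max=3))
```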
!23
DEMO TIME
!24
Demo project and data
• Demo uses approx. 1800 documents from the English Wikipedia
• Wikipedia’s Discovery department collects and publishes relevance
  judgements with their Discernatron project
• Bulk data and all query examples are available at
  https://github.com/cbuescher/rankEvalDemo
!25
Q&A
!26
Some questions I have for you…
• How do you measure search relevance currently?
• Did you find anything useful about the ranking evaluation approach?
• Feedback about usability of the API
  (ping me on GitHub or our Discuss forum, @cbuescher)
!27
Further reading
• Manning, Raghavan & Schütze: Introduction to Information Retrieval.
  Cambridge University Press, 2008.
• Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected
  reciprocal rank for graded relevance. Proceedings of the 18th ACM
  Conference on Information and Knowledge Management (CIKM ’09), 621.
• Blog: https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
• Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html
• Discuss: https://discuss.elastic.co/c/elasticsearch (cbuescher)
• GitHub: issues with the :Search/Ranking label (cbuescher)
