Learning to Rank Datasets for Search with Oscar Castaneda

Oscar Castañeda, Xoom, a PayPal Service
Learning to Rank Datasets for Search
#SAISDS8

About
• Data Scientist at Xoom, a PayPal service.
• Interests:
• Data Management,
• Dataset Search,
• Learning to Rank.

Spark cluster with Elasticsearch
http://bit.ly/2em6RUK
http://bit.ly/2ebM9HO

And Indexed RDatasets
Spark cluster with Elasticsearch Inside

Agenda
• Problem Statement and Motivation
• Elasticsearch Learning to Rank
• Data Pipeline: metadata extraction, judgement list extraction
• Demo: Beginnings of a Dataset Search Engine with Machine-learned relevance ranking.
• Q&A

Problem Statement (1)
• Despite being a key corporate asset, datasets are generally not given the importance they deserve and, as a result, are hard to find.

Problem Statement (2)
• Specifically, teams within organizations have a hard time finding datasets relevant to their function.

Topics
• Indexing (Spark Summit East 2017).
• Ranking => today’s topic!

Questions
• How are datasets ranked?
• Can judgement lists (useful for ranking) be generated at dataset production time?

Overview
Rdatasets
Take ES snapshot
Restore ES snapshot
http://bit.ly/2e5H1jL

Overview
Rdatasets
Data Pipelines:
• Extract Filename, Format, Field description.
• Extract Type information.
• Index CSV files.
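The extraction steps listed on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the talk runs its pipelines in Spark, whereas this sketch uses the standard library; the function names (`infer_type`, `extract_metadata`), the record layout, and the type-inference rule are hypothetical; and the sample rows are a tiny excerpt in the style of the `mtcars` dataset from Rdatasets. Field names stand in for field descriptions here.

```python
import csv
import io
from pathlib import PurePath


def _parses_as(cast, values):
    """Return True if every value can be converted with `cast`."""
    try:
        for v in values:
            cast(v)
    except ValueError:
        return False
    return True


def infer_type(values):
    """Guess a column type from raw CSV strings (empty cells are ignored)."""
    present = [v for v in values if v != ""]
    if not present:
        return "unknown"
    if _parses_as(int, present):
        return "integer"
    if _parses_as(float, present):
        return "float"
    return "string"


def extract_metadata(path, text):
    """Build an indexable metadata record for one CSV dataset (hypothetical layout)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Transpose rows into columns; with no data rows every column is empty.
    columns = list(zip(*body)) if body else [()] * len(header)
    return {
        "filename": PurePath(path).name,
        "format": PurePath(path).suffix.lstrip(".").lower(),
        "fields": [
            {"name": name, "type": infer_type(col)}
            for name, col in zip(header, columns)
        ],
    }


# Illustrative sample only; the path is a made-up example.
sample = "model,mpg,cyl\nMazda RX4,21.0,6\nDatsun 710,22.8,4\n"
record = extract_metadata("datasets/mtcars.csv", sample)
print(record)
```

A record of this shape could then be sent to an Elasticsearch index as one document per dataset, which corresponds to the "Index CSV files" step.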

Overview
Data Lake

Motivation (1)
• Organizing, indexing and ranking Datasets:
• Produced by individual data pipelines
• On Data Lake(s)

Motivation (2)
• Produce a ranking function for datasets that are generated as part of running data pipelines.
• Extract “relevance judgements” and use them to bootstrap a dataset rank model (a posteriori vs. post hoc (Halevy et al., 2016)).
• In a feedback loop leveraging click-through data on dataset profile pages.

Organizing, Indexing and Ranking Datasets
[Architecture diagram, built up step by step over several slides. Components: Input data; Data Pipeline; Index dataset features; ES Cluster; Search; Create dataset profile pages; Using Tableau JavaScript API; Dataset profile; Team Dashboard; Feedback logging; Relevance judgements; Fetch dataset from Datalake and run Data Pipeline on demand.]

How do you rank datasets?

Ranking datasets
• Extraction of “relevance judgements” can be built into the data pipeline for specific datasets, immediately after they are generated.
• These judgements are used to produce a ranking function for datasets.
• They are leveraged for training and to bootstrap a dataset rank model (a posteriori vs. post hoc (Halevy et al., 2016)).
• Click-through data provides implicit feedback that is useful for adjusting the initial relevance judgements.
• This feedback leverages click-through data on dataset profile pages.
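One way to picture how click-through data might adjust an initial relevance judgement is the small sketch below. It is not the method from the talk: the smoothing rule, the 0 to 4 grade scale, and the names `adjust_grade` and `prior_weight` are all assumptions made for illustration. The initial grade is treated as a prior worth `prior_weight` pseudo-impressions, and the observed clicks on a dataset profile page pull the grade up or down as evidence accumulates.

```python
def adjust_grade(initial_grade, clicks, impressions, max_grade=4, prior_weight=10):
    """Blend an initial relevance grade with an observed click-through rate.

    Hypothetical rule: the initial grade counts as `prior_weight` impressions
    with a click rate of initial_grade / max_grade.
    """
    if not 0 <= clicks <= impressions:
        raise ValueError("clicks must be between 0 and impressions")
    if not 0 <= initial_grade <= max_grade:
        raise ValueError("initial_grade must be between 0 and max_grade")
    prior_rate = initial_grade / max_grade
    blended_rate = (prior_weight * prior_rate + clicks) / (prior_weight + impressions)
    return round(blended_rate * max_grade)


# With no feedback yet, the bootstrap judgement is kept as-is.
print(adjust_grade(3, clicks=0, impressions=0))
# Heavy clicking raises the grade; no clicks over many impressions lowers it.
print(adjust_grade(3, clicks=90, impressions=100))
print(adjust_grade(3, clicks=0, impressions=100))
```

Raw click-through rate is a noisy signal (for example, results shown higher tend to get more clicks), so a real feedback loop would likely need to correct for such effects before rewriting judgements.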

A posteriori vs. Post-hoc
• Halevy et al. (2016) advocate finding data in a post-hoc manner by collecting and aggregating metadata after datasets are created or updated.
• We propose a so-called “a posteriori” approach where metadata is generated as part of running pipelines using Spark.

A posteriori vs. Post-hoc
• Halevy et al. (2016) advocate indexing after the fact.
• We prefer indexing immediately after the fact.

Pros
• “Relevance judgements” can be extracted and leveraged to bootstrap a ranked dataset index.
• These judgements can then be adjusted in a feedback loop that leverages click-through data on dataset profile pages.
• More granular metrics are available to evaluate metadata regeneration.

Cons
• Offline model development is disconnected and only indirectly part of the feedback loop that uses click-through data.
• Looking at trees instead of the forest.
• Need to replay the indexing pipeline when things change (per data pipeline).

Demo Scenario
• Movies represent datasets.
• TMDB movie pages represent dataset profile pages.
• The Marketing team (also called the WAR team) is interested in War movies (datasets).

Movie 1
https://www.themoviedb.org/movie/7555-rambo

Movie 2
https://www.themoviedb.org/movie/1370-rambo-iii

Movie 3
https://www.themoviedb.org/movie/1369-rambo-first-blood-part-ii

Judgement file
Rambo
Rambo III
Rambo: First Blood Part II
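A judgement file for the three movies above could look like the output of the sketch below. The line layout (`grade`, `qid:N`, then a `#` comment carrying the document id and title) is a RankLib-style convention and is assumed here, not taken from the slides. The document ids come from the TMDB URLs on the previous slides; the grades, the query id, and the names `Judgement`, `format_judgement` and `parse_judgement` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    grade: int   # relevance grade; higher is more relevant (0 to 4 scale assumed)
    qid: int     # id of the query the judgement applies to
    doc_id: str  # id of the judged document (here: TMDB movie id)
    title: str   # human-readable label, kept only as a comment


def format_judgement(j):
    """Render one judgement as a RankLib-style line."""
    return f"{j.grade}\tqid:{j.qid} #\t{j.doc_id}\t{j.title}"


def parse_judgement(line):
    """Parse a line produced by format_judgement back into a Judgement."""
    left, comment = line.split("#", 1)
    grade, qid = left.split()
    doc_id, title = comment.strip().split("\t", 1)
    return Judgement(int(grade), int(qid.split(":", 1)[1]), doc_id, title)


# Grades are made up for illustration; qid 1 stands for the query "rambo".
judgements = [
    Judgement(4, 1, "7555", "Rambo"),
    Judgement(3, 1, "1370", "Rambo III"),
    Judgement(3, 1, "1369", "Rambo: First Blood Part II"),
]
lines = [format_judgement(j) for j in judgements]
print("\n".join(lines))
```

In the a posteriori setting described in the talk, lines like these would be emitted by the data pipeline right after a dataset is generated, rather than assembled by hand.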

What have we seen?
• How to rank datasets on Elasticsearch using LTR.
• How to extract relevance judgements immediately after datasets are generated in Spark.
• Demo: Dataset Search with Spark and Elasticsearch LTR.
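For readers who have not used Elasticsearch LTR, the sketch below builds the kind of search request that applies a trained model at query time. It assumes the Elasticsearch Learning to Rank plugin, whose `sltr` query is typically placed inside a `rescore` block so that only the top results of a cheap base query are re-ranked. The model name `dataset_rank_model`, the `title` field, and the `keywords` parameter are hypothetical and depend on how the features and model were defined. The code only constructs the JSON body; it does not contact a cluster.

```python
import json


def ltr_search_body(keywords, model_name, window_size=100):
    """Build a search body that re-ranks base-query hits with an LTR model."""
    return {
        # Cheap first-pass retrieval (field name is an assumption).
        "query": {"match": {"title": keywords}},
        # Re-rank only the top `window_size` hits with the stored model.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        "params": {"keywords": keywords},
                        "model": model_name,
                    }
                }
            },
        },
    }


body = ltr_search_body("rambo", "dataset_rank_model")
print(json.dumps(body, indent=2))
```

Rescoring keeps model evaluation bounded: the model scores at most `window_size` documents per shard instead of every match.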

Next Steps (1)
• Describe datasets in a structured, schema.org way using the Data Catalog Vocabulary [2].
• Build a knowledge graph and use GraphX to extract insights (useful, e.g., for column concept determination (Deng et al., 2013)).
• Build topic models based on structured datasets, using Glint to perform scalable topic model extraction in Spark (Jagerman and Eickhoff, 2016) [1].
[1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
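As a rough idea of what a structured, schema.org description of a dataset might look like, the sketch below emits a JSON-LD document using the schema.org `Dataset` and `DataDownload` types. The helper name `dataset_jsonld`, the sample description, and the `example.com` URL are made up for illustration; which properties a real catalog would need is an open design question not covered in the slides.

```python
import json


def dataset_jsonld(name, description, content_url, encoding_format, fields):
    """Describe one dataset as a schema.org Dataset in JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # Column names exposed as the variables the dataset measures.
        "variableMeasured": list(fields),
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": content_url,
            }
        ],
    }


doc = dataset_jsonld(
    name="mtcars",
    description="Illustrative description of a small CSV dataset.",
    content_url="https://example.com/datasets/mtcars.csv",  # placeholder URL
    encoding_format="text/csv",
    fields=["model", "mpg", "cyl"],
)
print(json.dumps(doc, indent=2))
```

Because the output is plain JSON, the same document could be indexed in Elasticsearch next to the extracted metadata.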

References
• Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. Goods: Organizing Google’s datasets. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730.
• Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute, University of Amsterdam, May 2013.
• Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in Spark: An asynchronous parameter server. CoRR, abs/1605.07422, 2016. URL http://arxiv.org/abs/1605.07422.
• Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept determination for web tables using large knowledge bases. PVLDB, 6(13):1606–1617, 2013. URL http://www.vldb.org/pvldb/vol6/p1606-li.pdf.
• Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457–466. ACM, February 2016.
• Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu, and Cong Yu. Applying WebTables in practice. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.

Thank You.
Email: ocastaneda@paypal.com
Twitter: @oscar_castaneda