Oscar Castañeda, Xoom a PayPal Service
Learning to Rank Datasets
for Search
#SAISDS8
About
• Data Scientist at Xoom a PayPal service.
• Interests:
• Data Management,
• Dataset Search,
• Learning to Rank.
Spark cluster with Elasticsearch
http://bit.ly/2em6RUK
http://bit.ly/2ebM9HO
And Indexed RDatasets
Spark cluster with Elasticsearch Inside
Learning to Rank Datasets
for Search!
Agenda
• Problem Statement and Motivation
• Elasticsearch Learning to Rank
• Data Pipeline: metadata extraction, judgement list extraction
• Demo: Beginnings of a Dataset Search Engine with Machine-learned relevance ranking.
• Q&A
Problem Statement (1)
• Although datasets are a key corporate
asset, they are generally not given the
importance they deserve and, as a result,
they are hard to find.
Problem Statement (2)
• Specifically, teams within organizations
have a hard time finding datasets
relevant to their function.
Topics
• Indexing (Spark Summit East 2017).
• Ranking => today’s topic!
Questions
• How are datasets ranked?
• Can judgement lists (useful for ranking)
be generated at dataset production
time?
Overview
Rdatasets
Take ES snapshot
Restore ES snapshot
http://bit.ly/2e5H1jL
Overview
Rdatasets
Data Pipelines:
• Extract Filename, Format, Field description.
• Extract Type information.
• Index CSV files.
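The pipeline steps listed here (extracting filename, format, field and type information before a CSV file is indexed) could be sketched roughly as follows. This is a minimal illustration using only the Python standard library rather than Spark; the function names `infer_type` and `extract_metadata`, the output field names and the sample file `cars.csv` are hypothetical and not taken from the talk.

```python
import csv
import io
import os


def infer_type(values):
    """Guess a column type from its string values: int, float, or string."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "int"
    if values and all_parse(float):
        return "float"
    return "string"


def extract_metadata(path, text):
    """Build a metadata document for one CSV file (hypothetical schema)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Transpose the data rows so each column can be typed on its own.
    columns = list(zip(*body)) if body else [() for _ in header]
    return {
        "filename": os.path.basename(path),
        "format": os.path.splitext(path)[1].lstrip(".").lower(),
        "fields": [
            {"name": name, "type": infer_type(list(col))}
            for name, col in zip(header, columns)
        ],
    }


# Made-up sample content standing in for a file fetched from the data lake.
sample = "speed,dist,label\n4,2.5,a\n7,4.0,b\n"
doc = extract_metadata("/datalake/rdatasets/cars.csv", sample)
print(doc)
```

A document like `doc` is the kind of record that could then be sent to the search index as the dataset's features.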
Overview
Data Lake
Motivation (1)
• Organizing, indexing and ranking Datasets:
• Produced by individual data pipelines
• On Data Lake(s)
Motivation (2)
• Produce a ranking function for datasets that are generated as
part of running data pipelines.
• Extract “relevance judgements” and use them to bootstrap a
dataset rank model (a posteriori vs. post hoc (Halevy et al.,
2016)).
• In a feedback loop leveraging click-through data on dataset
profile pages.
Organizing, Indexing and Ranking Datasets
[Architecture diagram. Labels: Input data; Data Pipeline; Index dataset features; ES Cluster; Search; Feedback logging; Relevance judgements; Create dataset profile pages; Using Tableau JavaScript API; Dataset profile; Team Dashboard; Fetch dataset from Datalake; Fetch dataset from Datalake and run Data Pipeline on demand.]
How do you rank datasets?
Ranking datasets
• Extraction of “relevance judgements” can be built into the data pipeline
for specific datasets, immediately after they are generated.
• The judgements are used to produce a ranking function for datasets.
• They are leveraged for training
• and to bootstrap a dataset rank model
• (a posteriori vs. post hoc (Halevy et al., 2016)).
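One way to picture judgement extraction inside a pipeline: right after a dataset is produced, the pipeline compares the dataset's metadata with the keywords a team is expected to search for and emits a graded judgement. The grading rule below (share of matching query terms mapped to a 0 to 4 grade), the `Judgement` record and all sample values are assumptions for illustration; the talk does not specify how grades are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    query: str       # search keywords a team is expected to use
    dataset_id: str  # identifier of the dataset just produced
    grade: int       # 0 (irrelevant) .. 4 (highly relevant)


def grade_dataset(query, tags, max_grade=4):
    """Hypothetical rule: the grade grows with the share of query terms found in the tags."""
    terms = set(query.lower().split())
    if not terms:
        return 0
    overlap = len(terms & {t.lower() for t in tags})
    return round(max_grade * overlap / len(terms))


def extract_judgements(dataset_id, tags, team_queries):
    """Emit one judgement per known team query for a freshly generated dataset."""
    return [Judgement(q, dataset_id, grade_dataset(q, tags)) for q in team_queries]


# Made-up dataset tags and team queries.
judgements = extract_judgements(
    dataset_id="ds-001",
    tags=["war", "action", "1980s"],
    team_queries=["war", "war drama", "romance"],
)
for j in judgements:
    print(j)
```

Judgements collected this way form the initial training list for a rank model, before any user feedback exists.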
Ranking datasets
• Click-through data provides implicit feedback useful to adjust initial
relevance judgements.
• Leveraging click-through data on dataset profile pages.
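A simple way to fold click-through data into the initial judgements is to nudge each grade toward what the observed click-through rate suggests. The thresholds, the one-step adjustment rule and the sample numbers below are illustrative assumptions, not the method used in the talk.

```python
def adjust_grade(initial_grade, impressions, clicks,
                 min_impressions=20, high_ctr=0.30, low_ctr=0.05, max_grade=4):
    """Move a grade one step up or down based on click-through rate (CTR).

    Hypothetical rule: too few impressions leave the grade untouched;
    a high CTR raises it by one, a low CTR lowers it by one.
    """
    if impressions < min_impressions:
        return initial_grade
    ctr = clicks / impressions
    if ctr >= high_ctr:
        return min(max_grade, initial_grade + 1)
    if ctr <= low_ctr:
        return max(0, initial_grade - 1)
    return initial_grade


# (query, dataset_id) -> (initial grade, impressions, clicks); made-up feedback log
feedback = {
    ("war", "ds-001"): (3, 100, 40),  # CTR 0.40, grade is raised
    ("war", "ds-002"): (3, 100, 2),   # CTR 0.02, grade is lowered
    ("war", "ds-003"): (2, 5, 5),     # too few impressions, grade unchanged
}
adjusted = {key: adjust_grade(*vals) for key, vals in feedback.items()}
print(adjusted)
```

The adjusted grades can replace the initial ones the next time the rank model is retrained, which closes the feedback loop.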
A posteriori vs. Post-hoc
• Halevy et al. (2016) advocate finding data in a post-hoc manner by collecting and aggregating metadata after datasets are created or updated.
• We propose a so-called “a posteriori” approach where metadata is generated as part of running pipelines using Spark.
A posteriori vs. Post-hoc
• Halevy et al. (2016) advocate indexing after the fact.
• We prefer indexing immediately after the fact.
Pros (indexing immediately after the fact)
• “Relevance judgements” can be extracted and leveraged to bootstrap a ranked dataset index,
• in a feedback loop leveraging click-through data on dataset profile pages.
• More granular metrics are available to evaluate metadata regeneration.
Cons (indexing immediately after the fact)
• Offline model development is disconnected and only indirectly part of the feedback loop that uses click-through data.
• Looking at the trees instead of the forest.
• The indexing pipeline needs to be replayed when things change (per data pipeline).
Demo!
Demo Scenario
• Movies represent datasets
• TMDB movie pages represent dataset profile
pages.
• The marketing team (also called the WAR team) is interested
in War movies (datasets).
Movie 1
https://www.themoviedb.org/movie/7555-rambo
Movie 2
https://www.themoviedb.org/movie/1370-rambo-iii
Movie 3
https://www.themoviedb.org/movie/1369-rambo-first-blood-part-ii
judgement file
Rambo
Rambo III
Rambo: First Blood Part II
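A judgement file for the three demo movies might look like the output of the sketch below. The layout (grade, `qid:` query id, then a comment carrying the document id and title) follows the RankLib-style convention commonly used with the Elasticsearch Learning to Rank plugin; the grades, the query id and the helper names `format_judgements` and `parse_judgement` are hypothetical. The movie ids are the ones in the TMDB URLs of the demo movies.

```python
def format_judgements(query_id, keywords, graded_docs):
    """Render judgements as RankLib-style lines: grade, qid, then a comment."""
    lines = [f"# qid:{query_id}: {keywords}"]
    for grade, doc_id, title in graded_docs:
        lines.append(f"{grade}\tqid:{query_id} #\t{doc_id}\t{title}")
    return "\n".join(lines)


def parse_judgement(line):
    """Parse one judgement line back into (grade, query_id, doc_id, title)."""
    left, comment = line.split("#", 1)
    grade, qid = left.split()
    doc_id, title = comment.strip().split("\t", 1)
    return int(grade), int(qid.split(":")[1]), doc_id, title


# Grades are made up for illustration; ids come from the TMDB movie URLs.
rambo = [
    (4, "7555", "Rambo"),
    (3, "1370", "Rambo III"),
    (3, "1369", "Rambo: First Blood Part II"),
]
judgement_file = format_judgements(1, "rambo", rambo)
print(judgement_file)
```

Keeping the title in the trailing comment makes the file readable by people while training tools only consume the grade and query id.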
What have we seen?
• How to rank datasets on Elasticsearch using LTR.
• Extract relevance judgements immediately after
datasets are generated in Spark.
• Demo: Dataset Search with Spark and
Elasticsearch LTR.
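Ranking with a trained model at query time could look like the request body built below. It assumes the Elasticsearch Learning to Rank plugin, whose `sltr` query is typically run inside a rescore phase; the model name `dataset_rank_model`, the `title` field, the `keywords` parameter name and the window size are placeholders, not values from the talk.

```python
import json


def build_ltr_search(keywords, model_name, window_size=100):
    """Build a search body: a baseline match query rescored by an LTR model."""
    return {
        # First pass: an ordinary full-text match retrieves candidates.
        "query": {"match": {"title": keywords}},
        # Second pass: the top `window_size` hits are re-scored by the model.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        "params": {"keywords": keywords},
                        "model": model_name,
                    }
                }
            },
        },
    }


body = build_ltr_search("rambo", "dataset_rank_model")
print(json.dumps(body, indent=2))
```

Rescoring only a window of top hits keeps model evaluation cheap, since the model runs on a bounded number of candidates per query.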
Next Steps (1)
• Describe Datasets in a structured schema.org way
using Data Catalog Vocabulary [2].
• Build a knowledge graph and use GraphX to extract insights.
(Useful e.g. for column concept determination (Deng et al. 2013)).
• Build topic models based on structured Datasets using
Glint to perform scalable topic model extraction in Spark
(Jagerman and Eickhoff, 2016) [1].
[1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
References
• Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. Goods: Organizing Google’s datasets. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730.
• Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute, University of Amsterdam, May 2013.
• Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in Spark: An asynchronous parameter server. CoRR, abs/1605.07422, 2016. URL http://arxiv.org/abs/1605.07422.
• Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept determination for web tables using large knowledge bases. PVLDB, 6(13):1606–1617, 2013. URL http://www.vldb.org/pvldb/vol6/p1606-li.pdf.
• Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457–466. ACM, February 2016.
• Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu, and Cong Yu. Applying webtables in practice. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
Q&A
Thank You.
Email: ocastaneda@paypal.com
Twitter: @oscar_castaneda