Industrializing DataScience Workflows
… and associated Idiosyncratic Operating Principles.
Sean Downes
Sr DataScientist @ Expedia, Inc.
The Problem
So you’ve been asked to bring the infrastructure into the cloud.
So your Data Lake is actually a Data Swamp.
Rental Cars and Industrialized Learning to Rank with Sean Downes
Login
Impressions
Clicks
Purchases
The Problem
Every Line of Business has its own Structure
Every MicroService has a log
And you want to A|B Test
The Problem
Can you please turn … into … using … [slide images not preserved]
Preview
Context / Disclaimer
Lightning Review of Data Platforms
(idiosyncratic) Organizing Principles
Context / Disclaimer
Academic
Theoretical Physicist
We’ve got some work to do.
So…
I’m implicitly assuming this talk will be one in an ensemble of opinions
supercomputers…
Lightning Review of Data Platforms
… the Commercial Data Center…
Lightning Review of Data Platforms
… Virtualized Everything
Lightning Review of Data Platforms
1. Assign Tasks their own virtual hardware
2. Expand / Contract Resources on demand
3. Real-time HotSwapping
4. Software Updates Built In
5. Etc Etc Etc.
idiosyncratic Organizing Principles
iOP1) Clarity
iOP2) Engineers are not Data Scientists
iOP3) PMs are not Data Scientists
iOP4) Data Scientists are not Engineers
iOP5) Close the Data Loop
iOP1: Data Clarity
PUBLISH THIS INTERNALLY!
“big data, big noise”
Where is what data?
Who owns what field?
What is this field?
Where did this field go?
Why is this field NULL?
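One way to start answering "Why is this field NULL?" is to measure how often each field is actually filled. The sketch below is an illustration, not the tooling used in the talk: the record layout, the field names (`search_id`, `user.country`, `user.loyalty_tier`) and the helper names are all hypothetical. It walks nested JSON-style records and reports the share of records in which each field is missing or null.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, Tuple


def flatten(record: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted.path, value) pairs for every leaf of a nested dict."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def null_rates(records: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Fraction of records in which each field is missing or None."""
    total = 0
    present: Counter = Counter()
    for record in records:
        total += 1
        for path, value in flatten(record):
            # Adding 0 still registers a field that is always null.
            present[path] += 0 if value is None else 1
    return {path: 1 - count / total for path, count in present.items()}


# Hypothetical sample records, for illustration only.
SAMPLE = [
    {"search_id": 1, "user": {"country": "US", "loyalty_tier": None}},
    {"search_id": 2, "user": {"country": None}},
    {"search_id": 3, "user": {"country": "DE", "loyalty_tier": "gold"}},
    {"search_id": 4},
]

if __name__ == "__main__":
    for field, rate in sorted(null_rates(SAMPLE).items()):
        print(f"{field}: {rate:.0%} null or missing")
```

A table like this, published internally next to the owner of each field, covers at least the "where" and "why NULL" questions on the slide.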
iOP1: Data Clarity
Minibatch streaming into nested JSON?
O(10kB)?
GZip?
O(50-500 MB)
Parquet.
Snappy.
“Expect Data Science”
Spark
Big Thanks to Jason Pohl @ DB!
And Charles Pritchard!
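The file sizes on this slide suggest compacting many O(10 kB) minibatch files into a few O(50-500 MB) files before analysis. The sketch below only plans that compaction: it greedily groups small files into batches near a target size. The 128 MiB target, the file names and the function name are assumptions chosen for illustration; rewriting each batch as one Snappy-compressed Parquet file (for instance with Spark) is left out.

```python
from typing import List, Tuple

# Assumed target size, chosen to sit inside the 50-500 MB range on the slide.
TARGET_BYTES = 128 * 1024 * 1024


def plan_compaction(
    files: List[Tuple[str, int]], target_bytes: int = TARGET_BYTES
) -> List[List[str]]:
    """Greedily group (name, size_in_bytes) pairs into batches of about target_bytes."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch before it would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical input: 30,000 gzipped JSON minibatches of 10 kB each.
    small_files = [(f"part-{i:05d}.json.gz", 10 * 1024) for i in range(30_000)]
    for number, batch in enumerate(plan_compaction(small_files)):
        print(f"batch {number}: {len(batch)} files")
```

Grouping by arrival order keeps each output file contiguous in time, which tends to suit date-partitioned reads; a different grouping key would need a sort first.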
iOP2: Engineers are not Data Scientists
“why would you need to do that?”
Scratch Space
Cluster Bootstrap Permissions
Access S3 Buckets
Sandbox Clusters
Share Notebooks Across Accounts
We DO NOT SPEAK IAM Role/Anything
iOP2: Engineers are not Data Scientists
“why would you need to do that?”
if possible:
Write your own Pipelines.
else:
Explain Data Science.
iOP3: PMs are not Data Scientists
“you don’t need that!”
Once upon a time in the Flight DataLake…
Only 10% of a Search Impression was recorded
Worse: It was only the cheapest 10%
Many of the bookings were not included in this list!
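The logging problem in this story can be checked with a simple coverage metric: of all bookings, how many point at an offer that was actually recorded? The sketch below uses made-up prices and bookings, and the function names and the `keep_percent` parameter are hypothetical. It is not a reconstruction of the Flight DataLake.

```python
from typing import List, Set, Tuple

# One search: the prices shown, and the index of the offer that was booked.
Search = Tuple[List[float], int]


def logged_indices(prices: List[float], keep_percent: int) -> Set[int]:
    """Indices of the cheapest keep_percent of offers (at least one)."""
    keep = max(1, len(prices) * keep_percent // 100)
    ranked = sorted(range(len(prices)), key=lambda i: prices[i])
    return set(ranked[:keep])


def booking_coverage(searches: List[Search], keep_percent: int) -> float:
    """Share of bookings whose offer appears among the logged impressions."""
    covered = sum(
        1
        for prices, booked in searches
        if booked in logged_indices(prices, keep_percent)
    )
    return covered / len(searches)


# Hypothetical data: 20 offers per search, four searches with different bookings.
PRICES = [100.0 + 10 * i for i in range(20)]
SEARCHES = [(PRICES, 0), (PRICES, 1), (PRICES, 5), (PRICES, 12)]

if __name__ == "__main__":
    print("coverage when logging cheapest 10%:", booking_coverage(SEARCHES, 10))
    print("coverage when logging everything:", booking_coverage(SEARCHES, 100))
```

Any booking outside the logged slice is a positive label with no matching impression, so a ranking model trained on that log never sees what the customer actually chose.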
iOP4: Data Scientists are Not Engineers
“we need to support models in @#&%? format”
Pick a Robust Standard and Stick to It.
If you’re big enough to worry about this, you can commit code
jPMML
Everybody Use Git. Now. Yes You.
Production Code Matters. Format. Document.
Pipelines Count as Production Code.
iOP5: Close that Data Loop
What is your data doing?
New Data? Consider Bandits!
Big Data? Set up a learning problem.
Empower by Design.
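For "New Data? Consider Bandits!", a minimal epsilon-greedy sketch is given below. It is one simple bandit strategy among several and is not claimed to be what the talk used. The function name, the parameters and the fixed per-arm rewards in the demo are assumptions for illustration; in a real loop `reward_fn` would return observed feedback such as a click or a purchase.

```python
import random
from typing import Callable, List, Tuple


def epsilon_greedy(
    reward_fn: Callable[[int], float],
    n_arms: int,
    steps: int,
    epsilon: float = 0.1,
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Run epsilon-greedy and return (pull counts, estimated mean reward) per arm."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    for step in range(steps):
        if step < n_arms:
            arm = step  # try every arm once before exploiting
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore a random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit the best estimate
        reward = reward_fn(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    return counts, values


if __name__ == "__main__":
    # Hypothetical, noise-free rewards per arm, to keep the demo deterministic.
    fixed_rewards = [0.1, 0.5, 0.9]
    counts, values = epsilon_greedy(lambda arm: fixed_rewards[arm], n_arms=3, steps=1000)
    print("pulls per arm:", counts)
    print("estimated reward per arm:", values)
```

The point for the data loop is that every pull is logged together with its reward, so the data collected while serving is already shaped as a learning problem.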
Thank You.

