Data Agility - A Journey to
Advanced Analytics and
Machine Learning at Scale
Spark Summit, April 2019
Hari Subramanian, Engineering Manager
About me
Engineering Manager (also Engineer, Product Manager, Entrepreneur)
Previously: Amazon Web Services, VMware, Startups
Until Recently: Led Big Data Analytics & Data Science Workbench at Uber
Currently: Customer Obsession Engineering at Uber
Agenda
00 The Uber Scale
01 Uber’s Data Platform
02 Data Science Workbench
03 DS & ML - Uber Toolset
04 Customer Obsession - a case study
05 Lessons learned
06 Wrap-up
NYC
Uber’s mission is to
ignite opportunity by
setting the world in
motion.
Impacts
Millions of
Riders
Global
footprint
Livelihood
for Millions
of drivers
Data informs every decision in the company
How Big is our Big Data?
Daily Uber trips powered by ML: Millions
Messages processed by Kafka: 2T
Queries across Hive, Vertica and Presto: 1M
Data ingested into HDFS: 150TB
Overview of Uber’s Data Platform
DATA SOURCES
RAW DATA
MODELED TABLES
MINING BUSINESS
INSIGHTS
CONSUMING BUSINESS INSIGHTS
EXPERIMENTATION
DATA SCIENCE
MACHINE
LEARNING
CUSTOM DATA SETS
Dashboarding
Alerting
Monitoring
Data Exploration
Query Engines
Knowledge Bases
ETL Frameworks
Data Integrity
Storage
Infrastructure
What is DSW?
Rapid growth, growing pains
Accessing data
& services was
complicated
Getting started
was hard
Collaboration was
difficult
Many stakeholders, many needs
Cost and compliance
requirements
Varied
users
Different
infrastructure needs
Single window
access
Democratize data science by enabling access
to reliable infrastructure and advanced tooling
in a community-driven learning environment
Data Science Workbench
Our world today
Getting Started
Data Access
Shared Standards
Collaboration
Scalability
Available Features
Fully hosted 1-click Jupyter Notebook & RStudio IDE
All internal data sources / Multi-DC / Secure / Log/Audit capabilities
Pre-baked Environments
Sharing options on notebooks; 1-click Shiny dashboard publication
Various session sizes, types (CPU, GPU) / access to compute engines
Documentation Support
Key features
● Data exploration
● Data preparation
● Ad-hoc analyses
● Model exploration
Interactive
workspaces
● Visualizing rich insights
derived from complex
analytics
● Displaying business metrics
Advanced
dashboards
● Automating complex
processes
● Small model training
● Scheduling data pulls
Business process
automations
What problem does
DSW solve?
Advanced data
science &
complex analytics
Data Scientists, Ops Analysts, Contractors
Business process
automation
S&P Analysts, Ops Managers, Contractors
Exploratory ML,
model-training, &
production
Data Scientists, ML Researchers, Engineers
Support: NLP model for support tickets
Safety: Trip classification
Uber Eats: Restaurant recommendations
Risk: Driver account check, Referral risk scoring
Operations: Lifetime value (LTV) model
HDFS (Hadoop Data Lake)
YARN
Mesos
Peloton
Piper | Metron | WatchTower |
Marmaray | Kirby | Databook
Security Summary | Query | Dash |
Map | Chart Builder
Hive as a Service
Spark as a Service
DSW
Query Gateway Services
Observability
All Active and Hive Sync
Efficiency and Capacity
Presto
XP | Mentana Michelangelo
Experimentation BI Tools DS Platform ML Platform Data processing Platform
Unique fit in a mature Data Platform
DS & ML - Uber Toolkit
Ingestion & Dispersal (Hoover, Marmaray - uses Spark, Hive)
Data preparation (Databook, QB/QR - uses Spark, Presto, Hive)
Data Analytics (BI tools, DSW - NumPy, scikit-learn, pandas)
ML and DL (Spark MLlib, XGBoost, TF, Keras, PyTorch, Horovod)
Model serving (PyML, Michelangelo, Peloton)
Workflows, Exploration (Airflow/Piper, Data Science Workbench)
Case study
COTA - Customer Obsession Ticketing Assistant
A Deep Learning Model developed and deployed using
Uber’s Data Platform
What is the challenge?
As Uber grows, so does our volume of support tickets
Millions of tickets
from riders / drivers /
eaters per week
Thousands of
different types of
issues users may
encounter
This slide was adapted from a talk by Huaixiu Zheng, Uber
User
CSR
Contact
Ticket
Response
Select Flow Node
Write Message
Select
Contact Type
Lookup info &
Policies
Select Action
Write response using
a Reply Template
Bliss - Uber’s Customer Support Platform
The Problem
Resolving a ticket is not easy (or cheap)
1000+ types
in a hierarchy
depth: 3~6
10+ actions (adjust fare, add appeasement, …)
1000+ reply templates
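To make the size of this decision space concrete, the sketch below models a contact-type hierarchy as a nested dictionary and reports its leaf types and depth. The type names are invented for illustration and are not Uber's actual taxonomy; the real hierarchy has 1000+ types at depths of 3 to 6.

```python
# Hypothetical contact-type hierarchy; the names are illustrative only.
HIERARCHY = {
    "Trip issues": {
        "Fare": {
            "Charged a cancellation fee": {},
            "Fare higher than expected": {},
        },
        "Lost item": {},
    },
    "Account": {
        "Payment": {
            "Update payment method": {},
        },
    },
}


def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path as a tuple of node names."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


def max_depth(tree):
    """Return the number of levels on the longest root-to-leaf path."""
    if not tree:
        return 0
    return 1 + max(max_depth(children) for children in tree.values())


if __name__ == "__main__":
    paths = list(leaf_paths(HIERARCHY))
    print(len(paths), "leaf types, depth", max_depth(HIERARCHY))
    for path in paths:
        print(" > ".join(path))
```

An agent resolving a ticket has to pick one such path, then an action, then a reply template, which is why each choice is a candidate for model assistance.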
Portuguese
Spanish
English
ML Layer
User Info
Trip Info
Ticket Text
COTA: The Solution
A collaborative effort from Uber Risk, CO Eng, and Data Platform teams
TYPE
REPLY
Ticket Metadata
COTA v2.1
(wordCNN)
ACTION
Recommend
+
Default
+
Auto-resolution
ROUTING
Risk Features
Fraud DS
embedment
CO Eng routing
engagement
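One way to read the Recommend, Default, and Auto-resolution modes listed above is as increasing levels of automation, chosen by how confident the model is. The sketch below illustrates that reading only. The thresholds, the names, and the idea of switching on a single confidence score are all assumptions, not details from the talk.

```python
from enum import Enum


class Mode(Enum):
    RECOMMEND = "recommend"        # show ranked suggestions to the agent
    DEFAULT = "default"            # pre-select the top suggestion
    AUTO_RESOLVE = "auto_resolve"  # act without agent involvement


# Hypothetical thresholds chosen purely for illustration.
DEFAULT_THRESHOLD = 0.80
AUTO_RESOLVE_THRESHOLD = 0.95


def choose_mode(confidence):
    """Map a model confidence in [0, 1] to an automation mode."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= AUTO_RESOLVE_THRESHOLD:
        return Mode.AUTO_RESOLVE
    if confidence >= DEFAULT_THRESHOLD:
        return Mode.DEFAULT
    return Mode.RECOMMEND


if __name__ == "__main__":
    for score in (0.50, 0.85, 0.99):
        print(score, choose_mode(score).value)
```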
Typical Machine Learning Workflow
1. define
2. prototype
3. productionize
4. measure
Launch and Iterate
Exploration and prototyping
1. define
2. prototype
GET DATA: SQL, Spark
DATA PREPARATION: Data cleansing and pre-processing, R / Python
TRAIN MODELS: CPU or GPU
EVALUATE MODELS: Validation, Computational cost, Interpretability
3. productionize
4. measure
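The prototype stage (get data, prepare it, train, evaluate) can be sketched end to end in a few lines. This is a toy illustration in plain Python, not the COTA model: the tickets, labels, and word-count classifier are made up for the example, and in DSW the data would come from SQL or Spark rather than an in-memory list.

```python
from collections import Counter, defaultdict

# GET DATA: hypothetical labelled tickets (text, contact type).
TICKETS = [
    ("i was charged twice for my trip", "fare"),
    ("the fare was higher than the estimate", "fare"),
    ("please refund the extra charge", "fare"),
    ("i left my phone in the car", "lost_item"),
    ("my bag is still in the car", "lost_item"),
    ("i lost my keys during the trip", "lost_item"),
]
HELD_OUT = [
    ("why was i charged a higher fare", "fare"),
    ("i left my wallet in the car", "lost_item"),
]


def prepare(text):
    """DATA PREPARATION: lowercase and split into word tokens."""
    return text.lower().split()


def train(examples):
    """TRAIN MODELS: count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(prepare(text))
    return counts


def predict(model, text):
    """Score each label by summed word counts and return the best one."""
    tokens = prepare(text)
    scores = {
        label: sum(words[token] for token in tokens)
        for label, words in model.items()
    }
    # Sorting first makes ties resolve deterministically (alphabetically).
    return max(sorted(scores), key=scores.get)


def evaluate(model, examples):
    """EVALUATE MODELS: fraction of held-out tickets labelled correctly."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


if __name__ == "__main__":
    model = train(TICKETS)
    print("held-out accuracy:", evaluate(model, HELD_OUT))
```

Swapping `train` and `predict` for a real model leaves the loop unchanged, which is what makes quick iteration on parameters and configuration practical at this stage.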
● Iterate on model quickly with tweaks to parameters and configuration
● Flexible development - custom code + leverage existing modules for
data prep, ETLs, train, predict, and visualize
● Jupyter notebook running on a GPU or CPU session
● Pre-packaged Spark, TensorFlow, Keras, pandas, NumPy, SciPy, etc.
● Interactive Spark exec through Uber’s Spark as a Service - Drogon
● API integrations to production ML platform - Michelangelo
● API integrations to data workflow management - Piper
● Develop and test locally, deploy in the cluster when ready
Vision: Build in DSW, run in prod platforms
Easy ML experimentation, quick production
Deep Learning
Spark Pipeline
Architecture
Training: Deep Learning Spark Pipeline
Spark
+
TensorFlow
Serving: Spark for batch & real-time predictions
Java Virtual Machine Hosted by a
Docker Container
Model Lifecycle
Managed in DSW
Model Lifecycle
Step 1: Data ETL
● Ingredients
○ Query to do ETL
○ Scheduled notebook as a Piper job to rerun the data ETL daily
Step 2: Spark Transformations
● Ingredients
○ Set up Spark job in Drogon via Michelangelo
○ Scheduled job to trigger the job at a particular retraining
frequency
Step 3: Data Transfer
● Ingredients
○ Upstream dependency on Spark job
○ Scheduled job to trigger data copy to a GPU-only cluster using
a cross-datacenter replication service
Step 4: Deep Learning Training
● Ingredients
○ Upstream dependency on data transfer
○ Prepare a Docker image containing the training code
○ A scheduled job to trigger the DL training in a GPU cluster
Step 5: Model Merging
● Ingredients
○ This happens within the DL training job.
○ Right after DL training is done, Spark and DL models are
merged and uploaded to a model store.
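A plausible reading of this step is that serving loads a single artifact which runs the Spark-side preprocessing followed by the DL model. The sketch below shows that composition with plain Python callables standing in for both halves. The class and function names are hypothetical, and a real pipeline would combine a Spark pipeline model with a TensorFlow model rather than these stand-ins.

```python
class MergedModel:
    """Chain preprocessing and a model behind a single predict() call."""

    def __init__(self, preprocess, model):
        self.preprocess = preprocess  # stand-in for the Spark transformations
        self.model = model            # stand-in for the trained DL model

    def predict(self, raw_ticket):
        features = self.preprocess(raw_ticket)
        return self.model(features)


def tokenize(raw_ticket):
    """Toy preprocessing: lowercase the ticket text and split on whitespace."""
    return raw_ticket["text"].lower().split()


def keyword_model(tokens):
    """Toy model: a keyword rule standing in for a neural network."""
    return "lost_item" if "left" in tokens or "lost" in tokens else "other"


if __name__ == "__main__":
    merged = MergedModel(tokenize, keyword_model)
    print(merged.predict({"text": "I left my phone in the car"}))
```

Packaging both halves together means the caller never has to reproduce the preprocessing separately at serving time.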
Step 6: Model Deployment
● Ingredients
○ Upstream dependency on DL training and Model Merging
○ A scheduled job from notebook triggering Model Deployment
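The six steps form a dependency chain, with each scheduled job waiting on its upstream step. The sketch below expresses that chain with Python's standard `graphlib` module (Python 3.9+) and prints one valid execution order. The step identifiers are hypothetical labels, not Piper's actual job API. The edge from data ETL to the Spark job is an assumption, since the slides state only the later dependencies. Model merging is folded into the training step because it happens within the DL training job.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on (its upstream jobs).
# Step names are illustrative labels, not real Piper task identifiers.
PIPELINE = {
    "data_etl": set(),
    # Assumed edge: the slides do not state this dependency explicitly.
    "spark_transformations": {"data_etl"},
    "data_transfer": {"spark_transformations"},
    # Model merging runs inside the DL training job, so it is not a
    # separate node here.
    "dl_training_and_model_merging": {"data_transfer"},
    "model_deployment": {"dl_training_and_model_merging"},
}


def execution_order(pipeline):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for position, step in enumerate(execution_order(PIPELINE), start=1):
        print(position, step)
```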
Leverage Monitoring and Ops tools built for
production scenarios
Lessons learned
Build for the experts, design for the less
technical people
Create communities with both data
scientists and non-data scientists
Don’t stop at building what’s known,
empower people to look for the unknown
Thank you!
