Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale

Spark Summit, April 2019
Hari Subramanian, Engineering Manager

About me
Engineering Manager (also Engineer, Product Manager, Entrepreneur)
Previously: Amazon Web Services, VMware, Startups
Until Recently: Led Big Data Analytics & Data Science Workbench at Uber
Currently: Customer Obsession Engineering at Uber

Agenda
00 The Uber Scale
01 Uber’s Data Platform
02 Data Science Workbench
03 DS & ML - Uber Toolset
04 Customer Obsession - a case study
05 Lessons learned
06 Wrap-up

Uber’s mission is to ignite opportunity by setting the world in motion.
Impacts
● Millions of riders
● Global footprint
● Livelihood for millions of drivers

Data informs every decision in the company

How Big is our Big Data?
● Daily Uber trips powered by ML: Millions
● Messages processed by Kafka: 2T
● Queries across Hive, Vertica and Presto: 1M
● Data ingested into HDFS: 150TB

Overview of Uber’s Data Platform
DATA SOURCES
RAW DATA
MODELED TABLES
MINING BUSINESS INSIGHTS
CONSUMING BUSINESS INSIGHTS
EXPERIMENTATION
DATA SCIENCE
MACHINE LEARNING
CUSTOM DATA SETS
Dashboarding
Alerting
Monitoring
Data Exploration
Query Engines
Knowledge Bases
ETL Frameworks
Data Integrity
Storage
Infrastructure

Rapid growth, growing pains
● Accessing data & services was complicated
● Getting started was hard
● Collaboration was difficult

Many stakeholders, many needs
● Cost and compliance requirements
● Varied users
● Different infrastructure needs
● Single window access

Data Science Workbench
Democratize data science by enabling access to reliable infrastructure and advanced tooling in a community-driven learning environment

Our world today
Available Features
● Getting Started: Fully hosted 1-click Jupyter Notebook & RStudio IDE
● Data Access: All internal data sources / Multi-DC / Secure / Log/Audit capabilities
● Shared Standards: Pre-baked Environments
● Collaboration: Sharing options on notebooks; 1-click Shiny dashboard publication
● Scalability: Various session sizes, types (CPU, GPU) / access to compute engines
● Documentation Support

Key features
Interactive workspaces
● Data exploration
● Data preparation
● Ad-hoc analyses
● Model exploration
Advanced dashboards
● Visualizing rich insights derived from complex analytics
● Displaying business metrics
Business process automations
● Automating complex processes
● Small model training
● Scheduling data pulls

Advanced data science & complex analytics
Data Scientists, Ops Analysts, Contractors

Business process automation
S&P Analysts, Ops Managers, Contractors

Exploratory ML, model training, & production
Data Scientists, ML Researchers, Engineers
● Support: NLP model for support tickets
● Safety: Trip classification
● Uber Eats: Restaurant recommendations
● Risk: Driver account check, Referral risk scoring
● Operations: Lifetime value (LTV) model

DS & ML - Uber Toolkit
Ingestion & Dispersal (Hoover, Marmaray - uses Spark, Hive)
Data preparation (Databook, QB/QR - uses Spark, Presto, Hive)
Data Analytics (BI tools, DSW - NumPy, scikit-learn, pandas)
ML and DL (Spark MLlib, XGBoost, TensorFlow, Keras, PyTorch, Horovod)
Model serving (PyML, Michelangelo, Peloton)
Workflows, Exploration (Airflow/Piper, Data Science Workbench)

Case study
COTA - Customer Obsession Ticketing Assistant
A Deep Learning Model developed and deployed using
Uber’s Data Platform

What is the challenge?
As Uber grows, so does our volume of support tickets
● Millions of tickets from riders / drivers / eaters per week
● Thousands of different types of issues users may encounter
This slide was adapted from a talk by Huaixiu Zheng, Uber

Bliss - Uber’s Customer Support Platform
User → Contact Ticket → CSR → Response
● User: Select Flow Node, Write Message
● CSR: Select Contact Type, Lookup info & Policies, Select Action, Write response using a Reply Template

The Problem
Resolving a ticket is not easy (or cheap)
● 1000+ types in a hierarchy (depth: 3 to 6)
● 10+ actions (adjust fare, add appeasement, …)
● 1000+ reply templates

COTA: The Solution
A collaborative effort from Uber Risk, CO Eng, and Data Platform teams
● Languages: Portuguese, Spanish, English
● Inputs to the ML Layer: User Info, Trip Info, Ticket Text, Ticket Metadata, Risk Features
● Model: COTA v2.1 (wordCNN)
● Predictions: TYPE, REPLY, ACTION, ROUTING
● Usage: Recommend + Default + Auto-resolution
● Fraud DS embedment
● CO Eng routing engagement
engagement

Typical Machine Learning Workflow
1. define
2. prototype
3. productionize
4. measure
Launch and Iterate

Exploration and prototyping
1. define
2. prototype
● GET DATA: SQL, Spark
● DATA PREPARATION: Data cleansing and pre-processing, R / Python
● TRAIN MODELS: CPU or GPU
● EVALUATE MODELS: Validation, Computational cost, Interpretability
3. productionize
4. measure
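The prototyping loop (get data, prepare it, train, evaluate on held-out examples) can be sketched in a few lines of plain Python. This is a toy stand-in, not the COTA model: the tickets, labels, and function names are invented for illustration, and a small Naive Bayes classifier with add-one smoothing replaces the Spark and deep learning stack named in the slides.

```python
import math
from collections import Counter, defaultdict

# GET DATA: hypothetical toy tickets (text, contact type), not Uber data.
TRAINING = [
    ("driver took a longer route fare too high", "fare"),
    ("charged twice for the same trip", "fare"),
    ("fare was higher than the estimate", "fare"),
    ("left my phone in the car", "lost_item"),
    ("forgot my bag in the back seat", "lost_item"),
    ("lost my keys during the trip", "lost_item"),
]
HELD_OUT = [
    ("i was charged a higher fare", "fare"),
    ("i left my bag in the car", "lost_item"),
]


def prepare(text):
    """DATA PREPARATION: lowercase and split into word tokens."""
    return text.lower().split()


def train(examples):
    """TRAIN MODELS: count words per class for a Naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(prepare(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in prepare(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(model, examples):
    """EVALUATE MODELS: accuracy on held-out examples."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


model = train(TRAINING)
accuracy = evaluate(model, HELD_OUT)
```

Validation accuracy is only one of the evaluation criteria listed; computational cost and interpretability would need their own checks.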

Vision: Build in DSW, run in prod platforms
Easy ML experimentation, quick production
● Iterate on the model quickly with tweaks to parameters and configuration
● Flexible development - custom code + leverage existing modules for data prep, ETLs, train, predict, and visualize
● Jupyter notebook running on a GPU or CPU session
● Pre-packaged Spark, TensorFlow, Keras, pandas, NumPy, SciPy, etc.
● Interactive Spark execution through Uber’s Spark as a Service - Drogon
● API integrations to production ML platform - Michelangelo
● API integrations to data workflow management - Piper
● Develop and test locally, deploy in the cluster when ready

Architecture

Training: Deep Learning Spark Pipeline
Spark + TensorFlow

Serving: Spark for batch & real-time predictions
Java Virtual Machine hosted by a Docker container

Model Lifecycle
Managed in DSW

Model Lifecycle

Step 1: Data ETL
● Ingredients
○ Query to do ETL
○ Notebook scheduled as a Piper job to rerun the data ETL daily

Step 2: Spark Transformations
● Ingredients
○ Set up the Spark job in Drogon via Michelangelo
○ Scheduled job to trigger the Spark job at a particular retraining frequency

Step 3: Data Transfer
● Ingredients
○ Upstream dependency on Spark job
○ Scheduled job to trigger data copy to a GPU-only cluster using a cross-datacenter replication service

Step 4: Deep Learning Training
● Ingredients
○ Upstream dependency on data transfer
○ Prepare a Docker image containing the training code
○ A scheduled job to trigger the DL training in a GPU cluster

Step 5: Model Merging
● Ingredients
○ This happens within the DL training job.
○ Right after DL training is done, Spark and DL models are merged and uploaded to a model store.
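The slides do not say how the Spark and DL models are merged. One way to picture the result is a single object that chains a feature-transformation stage and a scoring stage, so the same artifact can answer one request at a time or a whole batch. The sketch below uses that assumption; `MergedModel`, `toy_transform`, and `toy_score` are hypothetical names, and plain Python functions stand in for the Spark transformations and the trained network.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class MergedModel:
    """Hypothetical merged artifact: feature stage followed by scoring stage."""
    transform: Callable[[str], List[float]]  # stand-in for Spark transformations
    score: Callable[[List[float]], float]    # stand-in for the trained DL model

    def predict(self, ticket_text: str) -> float:
        """Real-time path: score a single ticket."""
        return self.score(self.transform(ticket_text))

    def predict_batch(self, ticket_texts: Sequence[str]) -> List[float]:
        """Batch path: reuse the same two stages for many tickets."""
        return [self.predict(text) for text in ticket_texts]


def toy_transform(text: str) -> List[float]:
    """Toy features: word count and number of times 'fare' appears."""
    words = text.lower().split()
    return [float(len(words)), float(words.count("fare"))]


def toy_score(features: List[float]) -> float:
    """Toy score: share of words that are 'fare' (0.0 for empty text)."""
    word_count, fare_hits = features
    return fare_hits / word_count if word_count else 0.0


merged = MergedModel(transform=toy_transform, score=toy_score)
single = merged.predict("fare fare too high")
batch = merged.predict_batch(["", "fare"])
```

Keeping both stages in one artifact means training-time and serving-time feature logic cannot drift apart, which is a common motivation for this kind of merge.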

Step 6: Model Deployment
● Ingredients
○ Upstream dependency on DL training and Model Merging
○ A scheduled job from a notebook triggering Model Deployment
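Steps 1 to 6 form a chain of scheduled jobs linked by upstream dependencies. The sketch below expresses that chain as a dependency graph and orders it with Python's standard `graphlib` module (Python 3.9+). It is not Piper's API: the step names and the `run_pipeline` helper are hypothetical, and the edge from data ETL to the Spark transformations is an assumption, since the slides list no explicit upstream for Step 2.

```python
from graphlib import TopologicalSorter

# step -> set of upstream steps that must finish first
PIPELINE = {
    "data_etl": set(),
    "spark_transformations": {"data_etl"},      # assumed dependency
    "data_transfer": {"spark_transformations"},
    "dl_training": {"data_transfer"},
    "model_merging": {"dl_training"},           # runs within the training job
    "model_deployment": {"dl_training", "model_merging"},
}


def run_pipeline(pipeline, tasks):
    """Run each step after its upstream steps; return the execution order."""
    executed = []
    for step in TopologicalSorter(pipeline).static_order():
        tasks[step]()  # a real scheduler would submit a job here
        executed.append(step)
    return executed


# Placeholder tasks that do nothing; each would trigger a notebook or job.
NOOP_TASKS = {step: (lambda: None) for step in PIPELINE}
order = run_pipeline(PIPELINE, NOOP_TASKS)
```

Because every step has exactly one path to it, the order is fully determined; a pipeline with independent branches would allow several valid orders.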

Leverage monitoring and ops tools built for production scenarios

Lessons learned
● Build for the experts, design for the less technical people
● Create communities with both data scientists and non-data scientists
● Don’t stop at building what’s known; empower people to look for the unknown
