Scaling Ride-Hailing with Machine Learning on MLflow

Scaling ride-hailing with MLflow
Md Jawad
Data Scientist, GOJEK

Our Scale
Operating in 4 countries and more than 70 cities
● 80m app downloads
● 250k+ merchants
● 4 countries
● 1m+ drivers
● 100m+ monthly bookings
Countries: Indonesia, Singapore, Thailand, Vietnam

Mobility Data Science Team
■ Matchmaking
■ Surge pricing

Agenda
1. Matchmaking model
   a. Background
   b. Challenges
   c. Desired state
2. MLflow
3. Solution

Choosing best driver for the job
Customer
Selected driver:
● High rating
● Heading to home area
● Lowest ETA
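
The three signals on this slide (rating, heading to home area, ETA) could be combined into one score per candidate driver. The following is only a minimal sketch: the `Driver` fields, the weights, and the linear scoring rule are hypothetical and are not taken from GOJEK's matchmaking model.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Driver:
    driver_id: str
    rating: float        # average rating on a 0-5 scale (assumed)
    eta_minutes: float   # estimated time to reach the customer
    heading_home: bool   # True if the trip goes toward the driver's home area


def score(driver: Driver, w_rating: float = 1.0, w_eta: float = 0.5, w_home: float = 1.0) -> float:
    """Higher is better. The weights are illustrative assumptions."""
    home_bonus = 1.0 if driver.heading_home else 0.0
    return w_rating * driver.rating - w_eta * driver.eta_minutes + w_home * home_bonus


def select_driver(candidates: Sequence[Driver]) -> Driver:
    """Pick the candidate with the highest score."""
    if not candidates:
        raise ValueError("no candidate drivers")
    return max(candidates, key=score)


candidates = [
    Driver("d1", rating=4.9, eta_minutes=6.0, heading_home=False),
    Driver("d2", rating=4.7, eta_minutes=3.0, heading_home=True),
    Driver("d3", rating=4.2, eta_minutes=2.0, heading_home=False),
]
best = select_driver(candidates)
```

With these made-up weights, a slightly lower-rated driver who is close by and heading home can beat the top-rated driver who is far away.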

Matchmaking: First Cut
Raw Data
Prod
Serving
How can we get models into production asap?

Raw Data
Process
Data
Airflow
Airflow DAG

Prod
Serving
Deploy
GitLab for CI/CD

Raw Data
Prod
Serving
How are we going
to train models?
Deploy
Process
Data
Airflow

Raw Data
Prod
Serving
Build, Test, Deploy
Application
Process Data, Train Model
Airflow
Trigger: API Call
Trigger: Daily Schedule
Helm deploy to Kubernetes

Matchmaking: The Monolith
Airflow
Raw Data
Prod
Serving
Process data + Train models + Deploy

Challenges with this approach
● Inefficient
○ Need to wait hours for pipeline to run before deploying models
○ Can’t deploy serving without trigger from Airflow

● Inefficient
● Hard to experiment
○ Do we fork the codebase for each small change?
○ Do we fan-in and fan-out a single pipeline?
○ Tracking model performance over time

● Inefficient
● Versioning is broken
○ Model tracking by timestamp?

● Inefficient
● Low reproducibility
○ Pipelines have non-deterministic side inputs (API calls,
fetching data, reading configuration)
○ No standardized way to track artifacts or processes

● Inefficient
● No visibility
Features? Models? Parameters? Metrics?

● Inefficient
● Low visibility
● Hard to scale
○ How do we scale to 1000s of models and new markets?
○ Airflow trains model, triggers new deploy through GitLab
○ Hardcoded deployment targets

● Inefficient
● Low visibility
● Hard to scale
● No separation of roles
Raw Data
Prod
Serving
Responsibility of Data Engineers, Software Engineers, Data Scientists

Desired state
● Easy to experiment
● Easy to reproduce results
● Easy to deploy models
● Easy to evaluate performance of features and models
● Capable of scaling to 1000s of models in many regions

MLflow
An open source platform for the machine learning lifecycle
[Diagram: Raw Data, Data Prep, Training, Deploy; surrounding labels: Tuning, Scale, Model Exchange, Governance, Delta]

MLflow Components
● Tracking: record and query experiments (code, data, config, results)
● Projects: packaging format for reproducible runs on any platform
● Models: general model format that supports diverse deployment tools

Key Concepts in Tracking
• Parameters: key-value inputs to your code
• Metrics: numeric values (can update over time)
• Artifacts: arbitrary files, including models
• Source: which version of code ran?
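
To make the four concepts concrete, here is a small in-memory stand-in for one tracked run. It is a sketch only: `RunRecord` and its methods are hypothetical names chosen for illustration and are not MLflow's API.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Toy stand-in for one tracked run (hypothetical, not MLflow's API)."""
    source_version: str                             # Source: which code version ran
    params: dict = field(default_factory=dict)      # Parameters: key-value inputs
    metrics: dict = field(default_factory=dict)     # Metrics: name -> list of values over time
    artifacts: dict = field(default_factory=dict)   # Artifacts: name -> file contents

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        # Keep the full history so a metric can be updated over time.
        self.metrics.setdefault(key, []).append(float(value))

    def log_artifact(self, name, content):
        self.artifacts[name] = content


run = RunRecord(source_version="abc1234")  # placeholder commit id
run.log_param("alpha", 0.5)
for epoch_rmse in (0.92, 0.81, 0.78):
    run.log_metric("rmse", epoch_rmse)
run.log_artifact("model.pkl", b"serialized-model-bytes")

latest_rmse = run.metrics["rmse"][-1]
```

The point of the sketch is the shape of the record: parameters are written once, metrics accumulate a history, and artifacts are opaque files tied to a specific code version.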

Legacy ML workflow
Airflow
Raw Data
Prod
Serving

Approach
1. Decouple based on concerns
Raw Data
Prod
Serving
Deploy
Airflow
Process
Data
???
Train
Models
???

Approach
2. Implement ML pipeline solution
Raw Data
Prod Serving
Deploy
Airflow
Process Data
???
Train Models
???

Approach
2. Implement ML pipeline solution and Continuous Delivery solution
Raw Data
Prod Serving
Deploy
Airflow
Process Data
???
Train Models
???

Approach
3. Add an artifact store between stages for features (Feast)
Feature Store
Raw Data
Prod Serving
Deploy
Airflow
Process Data
GOJEK/Feast*
Train Models
???
*http://github.com/gojek/feast

Approach
3. Add an artifact store between stages for features (Feast) and models (MLflow)
Model Store
Feature Store
Raw Data
Prod Serving
Airflow
Process Data
GOJEK/Feast
Train Models
Deploy
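
One way to picture this decoupling: each stage reads from and writes to a store instead of calling the next stage directly. The sketch below uses plain in-memory classes. `FeatureStore`, `ModelStore`, and the stage functions are hypothetical stand-ins and do not use the Feast or MLflow APIs.

```python
class FeatureStore:
    """Hypothetical stand-in for a feature store: named feature tables."""

    def __init__(self):
        self._tables = {}

    def write(self, name, rows):
        self._tables[name] = list(rows)

    def read(self, name):
        return list(self._tables[name])


class ModelStore:
    """Hypothetical stand-in for a model store: append-only versions per model name."""

    def __init__(self):
        self._versions = {}

    def publish(self, name, model):
        versions = self._versions.setdefault(name, [])
        versions.append(model)
        return len(versions)  # 1-based version number

    def latest(self, name):
        versions = self._versions[name]
        return len(versions), versions[-1]


def process_data(raw_rows, features):
    # Stage 1: turn raw data into features and write them to the feature store.
    features.write("driver_eta", [{"eta": r["eta"], "accepted": r["accepted"]} for r in raw_rows])


def train_model(features, models):
    # Stage 2: read features, "train" a trivial model (mean ETA of accepted trips), publish it.
    rows = [r for r in features.read("driver_eta") if r["accepted"]]
    mean_eta = sum(r["eta"] for r in rows) / len(rows)
    return models.publish("matchmaking", {"eta_threshold": mean_eta})


def deploy(models):
    # Stage 3: pick up whatever version was published last.
    return models.latest("matchmaking")


features, models = FeatureStore(), ModelStore()
raw = [
    {"eta": 2.0, "accepted": True},
    {"eta": 4.0, "accepted": True},
    {"eta": 9.0, "accepted": False},
]
process_data(raw, features)
v1 = train_model(features, models)
v2 = train_model(features, models)  # retraining does not require re-running stage 1
deployed_version, deployed_model = deploy(models)
```

Because training only depends on the feature store and deployment only depends on the model store, each stage can run on its own schedule.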

Advantages: Asynchronous Experimentation
Raw Data
Process
Data
Prod
Serving
Feature
Store
Train
Models
Deploy
1. Time based
2. Instance based
3. Artifact based
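import mlflow
import mlflow.sklearn

# alpha, l1_ratio, rmse, r2, and the fitted model lr come from the training step
with mlflow.start_run():
    # train model...
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(lr, "model")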

Advantages: Reproducible & Traceable
Raw Data
Process
Data
Prod
Serving
Feature
Store
Train
Models
Deploy
Track artifacts used to train models
● features
● pipeline version (git+SHA)
● other pipeline variables
Track artifacts used to deploy ML systems
● docker image
● configuration
● model version
● feature data
Track artifacts used to produce features
● data sources
● jobs
● parameters
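
A simple way to make a run traceable is to derive one identifier from everything that went into it. The sketch below hashes a manifest of training inputs. The manifest fields and the `fingerprint` helper are assumptions for illustration and are not part of Feast or MLflow.

```python
import hashlib
import json


def fingerprint(manifest: dict) -> str:
    """Deterministic SHA-256 over a JSON manifest (keys sorted so ordering does not matter)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


training_manifest = {
    "features": ["driver_rating", "eta_minutes"],   # hypothetical feature names
    "pipeline_version": "git+<sha>",                # placeholder for the pipeline's git SHA
    "variables": {"market": "ID", "alpha": 0.5},    # other pipeline variables (made up)
}

# Same inputs, written in a different key order.
same_inputs_reordered = {
    "variables": {"alpha": 0.5, "market": "ID"},
    "pipeline_version": "git+<sha>",
    "features": ["driver_rating", "eta_minutes"],
}

# One pipeline variable changed.
changed_inputs = dict(training_manifest, variables={"market": "ID", "alpha": 0.7})

fp_a = fingerprint(training_manifest)
fp_b = fingerprint(same_inputs_reordered)
fp_c = fingerprint(changed_inputs)
```

Identical inputs give the same fingerprint regardless of key order, while any change to a tracked input gives a different one, which is what lets a model be traced back to exactly what produced it.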

Advantages: Governance & Evaluation
Prod
Serving
Feature
Store
Train
Models
Deploy
● training run parameters
● deployment configuration
● model performance
● feature performance

Advantages: Role Separation
Raw Data
Process
Data
Prod
Serving
Feature
Store
Train
Models
Deploy
Data Scientist
Software Engineer
Data Engineer

Advantages: Scalability
Driver Allocation System: (3 environments) x (4 markets) x (5 model types) x (10+ live experiments) = 600+ simultaneous deployments
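
The deployment count is a Cartesian product of the four dimensions. The sketch below enumerates it using the lower bound of 10 live experiments. The environment names, model-type labels, and experiment ids are placeholders, since the slide gives only the counts.

```python
from itertools import product

environments = ["dev", "staging", "prod"]       # hypothetical names; the slide only says 3
markets = ["ID", "SG", "TH", "VN"]              # 4 markets, using country codes as labels
model_types = [f"T{i}" for i in range(1, 6)]    # 5 model types, labels are placeholders
experiments = [f"EXP{i}" for i in range(10)]    # 10 live experiments (the slide says 10+)

# One deployment target per combination of the four dimensions.
deployments = [
    f"{env}-{market}-{model}-{exp}"
    for env, market, model, exp in product(environments, markets, model_types, experiments)
]
total = len(deployments)  # 3 * 4 * 5 * 10
```

Adding one more market or model type multiplies the total, which is why hardcoded deployment targets stop being workable at this scale.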
gke-PROD-SG-T1-EXP2323
CD Pipeline
(pull based)
Configuration
Helm Charts
Docker Images
gke-PROD-TH-T2-EXP1006
gke-PROD-ID-T3-EXP3423
gke-PROD-VN-T4-EXP1800
1. New model is published
2. Monitors all artifacts for new versions
3. Test and deploy changes to relevant clusters
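
These three steps can be read as a reconcile loop: compare what is published with what is running, then roll out only the differences. The sketch below is an illustration under stated assumptions. `reconcile`, `apply_rollouts`, the model names, and the versions are hypothetical, and a real rollout would go through Helm and Kubernetes rather than a Python dict.

```python
def reconcile(published, deployed, cluster_models):
    """Return the (cluster, model, version) rollouts needed to catch up.

    published:      model name -> latest published version
    deployed:       cluster name -> version currently running there
    cluster_models: cluster name -> model name that cluster serves
    """
    rollouts = []
    for cluster, model in sorted(cluster_models.items()):
        latest = published.get(model)
        if latest is not None and deployed.get(cluster) != latest:
            rollouts.append((cluster, model, latest))
    return rollouts


def apply_rollouts(rollouts, deployed):
    # Stand-in for "test and deploy": here we only record the new version.
    for cluster, _model, version in rollouts:
        deployed[cluster] = version


# Cluster names are taken from the slide; the models and versions are made up.
cluster_models = {
    "gke-PROD-SG-T1-EXP2323": "model-a",
    "gke-PROD-TH-T2-EXP1006": "model-b",
}
deployed = {"gke-PROD-SG-T1-EXP2323": 1, "gke-PROD-TH-T2-EXP1006": 3}
published = {"model-a": 2, "model-b": 3}  # step 1: a new version of model-a is published

pending = reconcile(published, deployed, cluster_models)  # step 2: detect new versions
apply_rollouts(pending, deployed)                         # step 3: roll out to relevant clusters
after = reconcile(published, deployed, cluster_models)    # nothing left to do
```

Only the cluster whose model changed is touched, and running the loop again is a no-op, which is the property that makes a pull-based pipeline safe to run continuously.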
