Scaling Ride-Hailing with Machine Learning on MLflow
Md Jawad
Data Scientist
GOJEK
Our Scale
Operating in 4 countries and more than 70 cities
80m app downloads
250k+ merchants
4 countries
1m+ drivers
100m+ monthly bookings
Indonesia
Singapore
Thailand
Vietnam
#JUSTGOJEKIT
Mobility Data Science Team
■ Matchmaking
■ Surge pricing
Industry challenge
Agenda
1. Matchmaking model
   a. Background
   b. Challenges
   c. Desired state
2. MLflow
3. Solution
Choosing the best driver for the job
High rating
Heading to home area
Lowest ETA
Customer
Selected driver
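As a rough illustration of what "choosing the best driver" could look like, the sketch below ranks candidate drivers with a weighted score over the three signals named on this slide (rating, heading to home area, ETA). The `Driver` fields, the weights, and the linear scoring formula are all assumptions made for illustration; the talk does not describe GOJEK's actual matchmaking model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Driver:
    driver_id: str
    rating: float        # 0.0 to 5.0
    eta_minutes: float   # estimated time to reach the customer
    heading_home: bool   # True if the trip takes the driver toward their home area


# Hypothetical weights, chosen only to make the example concrete.
WEIGHTS = {"rating": 1.0, "eta": 0.5, "heading_home": 0.8}


def score(driver: Driver) -> float:
    """Higher is better: reward rating and home direction, penalise ETA."""
    return (
        WEIGHTS["rating"] * driver.rating / 5.0
        - WEIGHTS["eta"] * driver.eta_minutes / 10.0
        + WEIGHTS["heading_home"] * (1.0 if driver.heading_home else 0.0)
    )


def select_driver(candidates: list[Driver]) -> Driver:
    """Return the candidate with the highest score."""
    if not candidates:
        raise ValueError("no candidate drivers")
    return max(candidates, key=score)


candidates = [
    Driver("a", rating=4.9, eta_minutes=8.0, heading_home=False),
    Driver("b", rating=4.5, eta_minutes=3.0, heading_home=True),
    Driver("c", rating=4.0, eta_minutes=2.0, heading_home=False),
]
selected = select_driver(candidates)
```

In practice the trade-off between these signals would presumably be learned rather than hand-weighted, which is where the modelling pipeline in the rest of the talk comes in.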
Matchmaking: First Cut
Raw Data
Prod
Serving
How can we get models into production ASAP?
Matchmaking: First Cut
Raw Data
Process
Data
Airflow
Airflow DAG
Matchmaking: First Cut
Prod
Serving
Deploy
Gitlab for CI/CD
Matchmaking: First Cut
Raw Data
Prod
Serving
How are we going
to train models?
Deploy
Process
Data
Airflow
Matchmaking: First Cut
Raw Data
Prod
Serving
Build, Test, Deploy
Application
Process Data, Train Model
Airflow
Trigger: API Call
Trigger: Daily Schedule
Helm deploy to Kubernetes
Matchmaking: The Monolith
Airflow
Raw Data
Prod
Serving
Process data + Train models + Deploy
Challenges with this approach
● Inefficient
○ Need to wait hours for pipeline to run before deploying models
○ Can’t deploy serving without trigger from Airflow
Challenges with this approach
● Inefficient
● Hard to experiment
○ Do we fork the codebase for each small change?
○ Do we fan-in and fan-out a single pipeline?
○ Tracking model performance over time
Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
Model tracking by timestamp?
Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
● Low reproducibility
○ Pipelines have non-deterministic side inputs (API calls, fetching data, reading configuration)
○ No standardized way to track artifacts or processes
Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
● Low reproducibility
● No visibility
Features? Models? Parameters? Metrics?
Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
● Low reproducibility
● Low visibility
● Hard to scale
How do we scale to 1000s of models and new markets?
Airflow trains model, triggers new deploy through GitLab
Hardcoded deployment targets
Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
● Low reproducibility
● Low visibility
● Hard to scale
● No separation of roles
Raw Data
Prod
Serving
Process data + Train models + Deploy
Responsibility of Data Engineers, Software Engineers, Data Scientists
Desired state
● Easy to experiment
● Easy to reproduce results
● Easy to deploy models
● Easy to evaluate performance of features and models
● Capable of scaling to 1000s of models in many regions
MLflow: An open source platform for the machine learning lifecycle
[Figure: ML lifecycle diagram. Stages: Raw Data, Data Prep, Training, Deploy, each labelled "Scale"; parameter tuning (μ, λ, θ); Model Exchange; Governance; Delta]
MLflow Components
Tracking: Record and query experiments: code, data, config, results
Projects: Packaging format for reproducible runs on any platform
Models: General model format that supports diverse deployment tools
Key Concepts in Tracking
• Parameters: key-value inputs to your code
• Metrics: numeric values (can update over time)
• Artifacts: arbitrary files, including models
• Source: which version of code ran?
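To make the four tracking concepts concrete, here is a toy, standard-library-only record of a single run. It is a sketch for illustration, not MLflow's implementation or API: the `RunRecord` class, its method names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Toy stand-in for one tracked run (not MLflow's data model)."""
    source_version: str                            # Source: e.g. a git commit SHA
    params: dict = field(default_factory=dict)     # Parameters: key-value inputs
    metrics: dict = field(default_factory=dict)    # Metrics: name -> [(step, value), ...]
    artifacts: dict = field(default_factory=dict)  # Artifacts: name -> file path

    def log_param(self, key: str, value) -> None:
        self.params[key] = value

    def log_metric(self, key: str, value: float, step: int = 0) -> None:
        # Metrics keep their history, since they can update over time.
        self.metrics.setdefault(key, []).append((step, float(value)))

    def log_artifact(self, name: str, path: str) -> None:
        self.artifacts[name] = path

    def latest_metric(self, key: str) -> float:
        return self.metrics[key][-1][1]


run = RunRecord(source_version="abc1234")  # hypothetical commit SHA
run.log_param("alpha", 0.5)
run.log_metric("rmse", 0.82, step=0)
run.log_metric("rmse", 0.79, step=1)
run.log_artifact("model", "artifacts/model.pkl")
```

The point of the sketch is the shape of the record: parameters are set once, metrics accumulate a history, and the source version ties everything back to the code that produced it.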
Legacy ML workflow
Airflow
Raw Data
Prod
Serving
Process data + Train models + Deploy
Approach
1. Decouple based on concerns
Raw Data
Prod
Serving
Deploy
Airflow
Process
Data
???
Train
Models
???
1. Decouple based on concerns
2. Implement ML pipeline solution
Raw Data
Prod
Serving
Deploy
Airflow
Process
Data
???
Train
Models
???
Approach
1. Decouple based on concerns
2. Implement ML pipeline solution and Continuous Delivery solution
Raw Data
Prod
Serving
Deploy
Airflow
Process
Data
???
Train
Models
???
Approach
1. Decouple based on concerns
2. Implement ML pipeline solution and Continuous Delivery solution
3. Add an artifact store between stages for features (Feast)
Feature
Store
Raw Data
Prod
Serving
Deploy
Airflow
Process
Data
*GOJEK/
Feast
Train
Models
???
*http://github.com/gojek/feast
Approach
1. Decouple based on concerns
2. Implement ML pipeline solution and Continuous Delivery solution
3. Add an artifact store between stages for features (Feast) and models (MLflow)
Model
Store
Feature
Store
Raw Data
Prod
Serving
Airflow
Process
Data
GOJEK/
Feast
Train
Models
Deploy
Approach
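The decoupling described in this approach can be sketched with two in-memory stores sitting between three independent stages. This is a minimal illustration under stated assumptions: `ArtifactStore` is a hypothetical stand-in for Feast and MLflow (it uses neither library's API), and the "model" is just a mean so the example stays self-contained.

```python
class ArtifactStore:
    """Tiny in-memory versioned store standing in for a feature or model store."""

    def __init__(self):
        self._versions = []

    def publish(self, artifact) -> int:
        """Store a new artifact and return its 1-based version number."""
        self._versions.append(artifact)
        return len(self._versions)

    def latest(self):
        """Return (version, artifact) for the most recent publication."""
        if not self._versions:
            raise LookupError("nothing published yet")
        return len(self._versions), self._versions[-1]


def process_data(raw_rows, feature_store: ArtifactStore) -> int:
    """Stage 1: turn raw rows into features and publish them."""
    features = [
        {"driver_id": row["driver_id"], "eta_minutes": row["eta_seconds"] / 60.0}
        for row in raw_rows
    ]
    return feature_store.publish(features)


def train_model(feature_store: ArtifactStore, model_store: ArtifactStore) -> int:
    """Stage 2: read the latest features, 'train', and publish a model."""
    feature_version, features = feature_store.latest()
    mean_eta = sum(f["eta_minutes"] for f in features) / len(features)
    model = {"mean_eta": mean_eta, "feature_version": feature_version}
    return model_store.publish(model)


def deploy(model_store: ArtifactStore) -> dict:
    """Stage 3: pick up the latest published model for serving."""
    model_version, model = model_store.latest()
    return {"serving_model_version": model_version, "model": model}


feature_store, model_store = ArtifactStore(), ArtifactStore()
raw = [{"driver_id": "d1", "eta_seconds": 120}, {"driver_id": "d2", "eta_seconds": 240}]
process_data(raw, feature_store)
train_model(feature_store, model_store)
serving = deploy(model_store)
```

Because each stage only reads from and writes to a store, any stage can be rerun or replaced without triggering the others, which is the property the monolithic Airflow pipeline lacked.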
Advantages: Asynchronous Experimentation
Raw Data
Process
Data
Prod
Serving
Feature
Store
Train
Models
Deploy
1. Time based
2. Instance based
3. Artifact based
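import mlflow
import mlflow.sklearn

# alpha, l1_ratio, rmse, r2 and the fitted model lr come from the training step
with mlflow.start_run():
    # train model...
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(lr, "model")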
Advantages: Reproducible & Traceable
Raw Data
Process
Data
Prod
Serving
Feature
Store
Train
Models
Deploy
Track artifacts used to train models
● features
● pipeline version (git+SHA)
● other pipeline variables
Track artifacts used to deploy ML systems
● docker image
● configuration
● model version
● feature data
Track artifacts used to produce features
● data sources
● jobs
● parameters
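One way to make a training run traceable is to derive a deterministic identifier from the inputs listed above, so that the same features, code version, and pipeline variables always map to the same ID. The sketch below assumes this hashing approach purely for illustration; the function name, field names, and example values are hypothetical and are not taken from the talk.

```python
import hashlib
import json


def run_fingerprint(feature_version: str, git_sha: str, pipeline_vars: dict) -> str:
    """Return a short, deterministic ID for a training run's inputs."""
    payload = json.dumps(
        {
            "feature_version": feature_version,
            "git_sha": git_sha,
            "pipeline_vars": pipeline_vars,
        },
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Same inputs in a different key order give the same fingerprint.
fp_a = run_fingerprint("features-v7", "abc1234", {"market": "SG", "lookback_days": 30})
fp_b = run_fingerprint("features-v7", "abc1234", {"lookback_days": 30, "market": "SG"})
# A different code version gives a different fingerprint.
fp_c = run_fingerprint("features-v7", "def5678", {"market": "SG", "lookback_days": 30})
```

Unlike a timestamp, an identifier of this kind says what went into the model, so two runs with identical inputs can be recognised as the same experiment.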
Advantages: Governance & Evaluation
Prod
Serving
Feature
Store
Train
Models
Deploy
● training run parameters
● deployment configuration
● model performance
● feature performance
Advantages: Role Separation
Raw Data
Process
Data
Prod
Serving
Feature
Store
Train
Models
Deploy
Data Engineer
Data Scientist
Software Engineer
Advantages: Scalability
Driver Allocation System: (3 environments) x (4 markets) x (5 model types) x (10+ live experiments) = 600+ simultaneous deployments
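The deployment count is a plain Cartesian product, which the following sketch enumerates. The market codes come from the cluster names on this slide; the environment names other than PROD, the `T1` to `T5` model types, the experiment IDs, and the reading of each name segment are assumptions for illustration.

```python
from itertools import product

ENVIRONMENTS = ["DEV", "STAGING", "PROD"]              # only PROD appears on the slide
MARKETS = ["ID", "SG", "TH", "VN"]                     # Indonesia, Singapore, Thailand, Vietnam
MODEL_TYPES = [f"T{i}" for i in range(1, 6)]           # 5 hypothetical model types
EXPERIMENTS = [f"EXP{i:04d}" for i in range(1, 11)]    # 10 hypothetical live experiments


def deployment_targets() -> list[str]:
    """Enumerate one target name per (environment, market, model type, experiment)."""
    return [
        f"gke-{env}-{market}-{model}-{exp}"
        for env, market, model, exp in product(
            ENVIRONMENTS, MARKETS, MODEL_TYPES, EXPERIMENTS
        )
    ]


targets = deployment_targets()
print(len(targets))  # 3 * 4 * 5 * 10 = 600
```

With more than 10 live experiments the count only grows, which is why hardcoded deployment targets stop being workable.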
gke-PROD-SG-T1-EXP2323
CD Pipeline
(pull based)
Configuration
Helm Charts
Docker Images
gke-PROD-TH-T2-EXP1006
gke-PROD-ID-T3-EXP3423
gke-PROD-VN-T4-EXP1800
1. New model is published
2. Monitors all artifacts for new versions
3. Test and deploy changes to relevant clusters
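The three steps of the pull-based CD pipeline can be sketched as a reconcile pass: compare what has been published with what each cluster is running, then test and roll out only the differences. This is a minimal sketch assuming simple in-memory dictionaries; the function names, the artifact names, and the version numbers are hypothetical, and the real pipeline's tooling is not described beyond the slide.

```python
def plan_deployments(published: dict, deployed: dict, targets: dict) -> dict:
    """Step 2: find clusters whose artifacts lag behind the published versions.

    published: artifact name -> latest published version
    deployed:  cluster name -> {artifact name: running version}
    targets:   cluster name -> artifact names that cluster depends on
    """
    plan = {}
    for cluster, artifact_names in targets.items():
        current = deployed.get(cluster, {})
        stale = {
            name: published[name]
            for name in artifact_names
            if name in published and current.get(name) != published[name]
        }
        if stale:
            plan[cluster] = stale
    return plan


def apply_plan(plan: dict, deployed: dict, run_tests=lambda cluster, changes: True) -> dict:
    """Step 3: test each change set and deploy it only to the relevant cluster."""
    for cluster, changes in plan.items():
        if run_tests(cluster, changes):
            deployed.setdefault(cluster, {}).update(changes)
    return deployed


# Step 1: a new model version (3) has been published for the SG market.
published = {"model-sg": 3, "helm-chart": 7}
deployed = {
    "gke-PROD-SG-T1-EXP2323": {"model-sg": 2, "helm-chart": 7},
    "gke-PROD-TH-T2-EXP1006": {"helm-chart": 7},
}
targets = {
    "gke-PROD-SG-T1-EXP2323": ["model-sg", "helm-chart"],
    "gke-PROD-TH-T2-EXP1006": ["helm-chart"],
}
plan = plan_deployments(published, deployed, targets)
apply_plan(plan, deployed)
```

Because the pipeline pulls from the artifact stores instead of being pushed by the training job, adding a new cluster is a matter of adding an entry to `targets`, with no change to the training code.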
Thank you
