Start with version control and experiments management in ML: reproducible experiments
Data Fest3, Minsk, 2019
Mikhail Rozhkov
Workflow of an ML project and its artifacts
[Diagram] Project stages: Problem Statement → MVP Design → Get Data → Prepare Data → Train Model → Evaluate Model → Test & Integrate → Serve / Predict → Monitor, grouped into four phases: 1. Analyze & Plan, 2. Prototype (solution development), 3. Productionize, 4. Monitor & Maintain.
Inspired by Uber’s workflow-of-a-machine-learning-project diagram: Scaling Machine Learning at Uber with Michelangelo, https://eng.uber.com/scaling-michelangelo/
Experiment: pipelines, configs and artifacts
[Diagram] The anatomy of an experiment: an Algorithm, Data, and Hyperparameters feed a pipeline in which ETL tasks produce the train and test datasets, a train step produces the Model, and an evaluate step produces the Evaluation Measure. An Experiment config ties it all together. Legend: artifacts, pipelines, code, configs.
ML reproducibility is a dimension of quality

What is reproducibility? Using the original methods applied to the original data to produce the original results [Gardner].

Why should you care?
● Trust
● Consistent results
● Versioned history
● Team performance
● Painless production

Josh Gardner, Yuming Yang, Ryan S. Baker, Christopher Brooks. Enabling End-To-End Machine Learning Replicability: A Case Study in Educational Data Mining.
Is there a “magic button”?
ML Reproducibility
1. Automated pipelines
2. Control run params
3. Control execution DAG
4. Code version control
5. Artifacts version control (models, datasets, etc.)
6. Use shared/cloud storage for artifacts
7. Environment dependencies control
How to start?
[Chart] Manual vs. automated work across steps 1-4: manual work falls from 100% to about 10% while automated work grows from 0% to about 90%, freeing up time for the actual data-science task.
Start with artifacts versioning!
[Diagram] The same experiment diagram as above: ETL tasks, train and test datasets, train and evaluate steps, the Model and the Evaluation Measure, all driven by the Experiment config. The artifacts in it (datasets, model, configs) are what to version first.
Use case: a dogs-and-cats classifier
● Project
○ Classify dogs and cats by photo
○ Data
■ Objects: cats, dogs
■ Dogs: 12,500 images
■ Cats: 12,500 images
○ Metrics: accuracy, ROC-AUC
● Team
○ > 2 members
○ Different machines/servers
○ Different OSes
○ git-flow dev process
○ Runs on one machine
Step 1: Jupyter Notebook
● Code in a Jupyter Notebook
● Everything in Docker (a sketch follows this list)
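As a rough illustration of the “everything in Docker” setup, the environment is baked into an image and the notebook runs inside a container. A minimal sketch; the image name, tag, and mount paths are hypothetical:

```bash
# Build an image with pinned dependencies (the Dockerfile lists Python, Jupyter, ML libs)
docker build -t catsdogs-env:0.1 .

# Run Jupyter inside the container, mounting the project directory
docker run --rm -p 8888:8888 -v "$PWD":/workspace -w /workspace \
  catsdogs-env:0.1 jupyter notebook --ip=0.0.0.0 --allow-root
```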
ML reproducibility checklist
1. Automated pipelines
2. Control run params
3. Control execution DAG
4. Code version control
5. Artifacts version control (models, datasets, etc.)
6. Use shared/cloud storage for artifacts
7. Environment dependencies control
8. Experiments results tracking
Step 2: build pipelines
● Move common code into .py modules
● Build pipelines (a sketch follows this list)
● Everything in Docker
● Run experiments in the terminal or a Jupyter Notebook
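With the shared code in modules, each pipeline stage becomes a plain script invoked from the terminal, so a whole experiment is just a sequence of commands. A minimal sketch; the stage script and config names are hypothetical:

```bash
# Run the pipeline stage by stage, all driven by one config file
python src/split.py    --config=config/pipeline_config.yml
python src/train.py    --config=config/pipeline_config.yml
python src/evaluate.py --config=config/pipeline_config.yml
```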
Set up pipelines
[Diagram] Three stages, each reading data and a config: split produces an index; train consumes the index and produces the Model and a train report; evaluate consumes the index and the Model and produces a test report.
ML reproducibility checklist
1. Automated pipelines
2. Control run params
3. Control execution DAG
4. Code version control
5. Artifacts version control (models, datasets, etc.)
6. Use shared/cloud storage for artifacts
7. Environment dependencies control
8. Experiments results tracking
Step 3: add version control for artifacts
● Put models/data/configs under DVC control (a minimal sketch follows this list)
● Same code in .py modules
● Same pipelines
● Everything in Docker
● Run experiments in the terminal or a Jupyter Notebook
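DVC keeps large artifacts out of Git: `dvc add` writes a small .dvc metafile (which Git tracks) and caches the data, and a remote covers the shared/cloud-storage item on the checklist. A minimal sketch; the remote name and bucket URL are hypothetical:

```bash
dvc init                                       # set up DVC inside the Git repo
dvc add data/train data/test models/model.h5   # track artifacts via .dvc metafiles
git add data/train.dvc data/test.dvc models/model.h5.dvc .gitignore
git commit -m "Track datasets and model with DVC"

dvc remote add -d storage s3://my-bucket/dvc-storage   # shared/cloud storage
dvc push                                       # upload the artifacts to the remote
```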
ML reproducibility checklist
1. Automated pipelines
2. Control run params
3. Control execution DAG
4. Code version control
5. Artifacts version control (models, datasets, etc.)
6. Use shared/cloud storage for artifacts
7. Environment dependencies control
8. Experiments results tracking
Step 4: add execution-DAG control
● Put pipeline dependencies under DVC control (a minimal sketch follows this list)
● Models/data/configs under DVC control
● Same code in .py modules
● Same pipelines
● Everything in Docker
● Run experiments in the terminal or a Jupyter Notebook
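In 2019-era DVC, a stage is declared with `dvc run`, which records its dependencies (-d) and outputs (-o) so that `dvc repro` re-executes only the stages whose inputs changed (newer DVC versions declare stages in dvc.yaml instead). File names here are illustrative:

```bash
dvc run -f train.dvc \
  -d src/train.py -d data/train -d config/train_config.yml \
  -o models/model.h5 \
  python src/train.py --config=config/train_config.yml

dvc repro train.dvc   # re-run the DAG; unchanged stages are skipped
```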
Set up pipelines
[Diagram] The same three-stage pipeline (split → train → evaluate, each stage reading data and a config and passing an index along), now with the experiment config split into per-stage files: prepare config, split config, train config, and eval config. A hypothetical layout is sketched below.
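Splitting the experiment config into per-stage files keeps every stage's inputs explicit and diffable. A hypothetical layout and config, viewed from the terminal (file names and values are illustrative):

```bash
ls config/
# prepare_config.yml  split_config.yml  train_config.yml  eval_config.yml

cat config/train_config.yml
# batch_size: 64
# epochs: 10
# model: models/model.h5
```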
ML reproducibility checklist
1. Automated pipelines
2. Control run params
3. Control execution DAG
4. Code version control
5. Artifacts version control (models, datasets, etc.)
6. Use shared/cloud storage for artifacts
7. Environment dependencies control
8. Experiments results tracking
Step 5: add experiments control
● Add experiment benchmarking (DVC, mlflow; a sketch follows this list)
● Pipeline dependencies under DVC control
● Models/data/configs under DVC control
● Same code in .py modules
● Same pipelines
● Everything in Docker
● Run experiments in the terminal or a Jupyter Notebook
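On the DVC side, a metrics file registered with -M can be compared across runs and branches. A minimal sketch using 2019-era DVC commands; file names are illustrative:

```bash
dvc run -f evaluate.dvc \
  -d src/evaluate.py -d models/model.h5 \
  -M reports/metrics.json \
  python src/evaluate.py --config=config/eval_config.yml

dvc metrics show -a   # compare the metrics file across all branches
```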
Metrics tracking in the mlflow UI

from mlflow import log_metric, log_param, log_artifact

log_artifact(args.config)                      # attach the run's config file as an artifact
log_param('batch_size', config['batch_size'])  # record a hyperparameter
log_metric('f1', f1)                           # record evaluation metrics
log_metric('roc_auc', roc_auc)
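By default these calls log to a local mlruns/ directory; the UI shown on the next slide is then served with a standard mlflow command:

```bash
mlflow ui --port 5000   # browse runs, params, and metrics at http://localhost:5000
```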
Experiments benchmarking
[Screenshot: the mlflow UI showing a table of runs with their params and metrics]
ML reproducibility checklist
1. Automated pipelines
2. Control run params
3. Control execution DAG
4. Code version control
5. Artifacts version control (models, datasets, etc.)
6. Use shared/cloud storage for artifacts
7. Environment dependencies control
8. Experiments results tracking
Conclusions
1. Pipelines are not difficult to build.
2. Start wherever you spot a copy-paste pattern.
3. Artifacts version control is a must.
4. Discipline in the team matters.
5. The benefits grow with project complexity and team size.
Contact me
Mikhail Rozhkov
mail: mnrozhkov@gmail.com
ods: @Mikhail Rozhkov