Introduction to Polyaxon
Yu Ishikawa
Agenda
- Why do we need Polyaxon?
- What is Polyaxon?
- How does Polyaxon work?
- Demo
- Summary
Objectives for introducing Polyaxon
- Make the lead time of experiments as short as possible.
- Make the financial cost to train models as cheap as possible.
- Make the experiments reproducible.
[ML workflow diagram — Experiment Phase: Problem Setting → Collecting Data → Experiments → Off-line Evaluation; Productionize Phase: Productionize ML System; Operating Phase: Serving Models → On-line Evaluation → Retrain Model → Off-line Evaluation. Polyaxon's role sits in the Experiment Phase.]
Why do we need Polyaxon?
- We are not able to manage experiments as a team today.
- Experiments are expensive in terms of both money and time; there is room to improve efficiency and productivity.
- Setting up experiment environments can be tough for ML engineers. Moreover, the environments tend not to be reproducible, so taking over another member's tasks can be expensive, and we cannot manage the training process as a team.
- Hyperparameter search with Python ML libraries like sklearn takes a long time, since Python fundamentally does not scale well beyond a single machine.
Agenda
- Why do we need Polyaxon?
- What is Polyaxon?
- How does Polyaxon work?
- Demo
- Summary
What is Polyaxon?
- An open source platform for reproducible machine learning at scale.
- https://polyaxon.com/
- Features
- Notebook
- Hyperparameter search
- Powerful workspace
- User management
- Dynamic resource allocation
- Dashboard
- Versioning
Notebook environment with Jupyter
---
version: 1
kind: notebook
build:
  image: tensorflow/tensorflow:1.4.1-py3
  build_steps:
    - pip3 install jupyter

$ polyaxon notebook start -f polyaxon_notebook.yml
New version of CLI (0.3.5) is now available. To upgrade run:
  pip install -U polyaxon-cli
Notebook is being deployed for project `quick-start`
It may take some time before you can access the notebook.
Your notebook will be available on:
http://35.184.217.84:80/notebook/root/quick-start/

- Polyaxon lets us launch a Jupyter environment with a single command, with the environment defined by a Docker image and build steps in a YAML file.
- We can reproduce notebook experiments easily.
Hyperparameter tuning with Polyaxon
- Polyaxon supports some hyperparameter tuning methods:
- Grid search
- Random search
- Bayesian optimization
- Early stopping
- Hyperband
- We can control the concurrency of hyperparameter tuning in the YAML file.
- We can reproduce hyperparameter tuning jobs as well.
Hyperparameter tuning with high concurrency
---
version: 1
kind: group
hptuning:
  concurrency: 5
  matrix:
    learning_rate:
      linspace: 0.001:0.1:5
    dropout:
      values: [0.25, 0.3]
    activation:
      values: [relu, sigmoid]
declarations:
  batch_size: 128
  num_steps: 500
  num_epochs: 1
build:
  image: tensorflow/tensorflow:1.4.1-py3
  build_steps:
    - pip3 install --no-cache-dir -U polyaxon-helper
run:
  cmd: python3 model.py --batch_size={{ batch_size }} \
                        --num_steps={{ num_steps }} \
                        --learning_rate={{ learning_rate }} \
                        --dropout={{ dropout }} \
                        --num_epochs={{ num_epochs }} \
                        --activation={{ activation }}

$ polyaxon run -f polyaxon_gridsearch.yml
New version of CLI (0.3.5) is now available. To upgrade run:
  pip install -U polyaxon-cli
Creating an experiment group with the following definition:
----------------  -----------------
Search algorithm  grid
Concurrency       5 concurrent runs
Early stopping    deactivated
----------------  -----------------
Experiment group 1 was created

(polyaxon_gridsearch.yml)
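As a sanity check on the matrix above: linspace: 0.001:0.1:5 expands to 5 evenly spaced learning rates, so the grid is 5 × 2 × 2 = 20 experiments, run 5 at a time. A minimal Python sketch of the expansion (illustrative only, not Polyaxon's internal code):

import itertools
import numpy as np

# linspace: 0.001:0.1:5 -> 5 evenly spaced learning rates
learning_rates = np.linspace(0.001, 0.1, 5)
dropouts = [0.25, 0.3]
activations = ["relu", "sigmoid"]

# Grid search enumerates the Cartesian product: 5 * 2 * 2 = 20 runs,
# which Polyaxon schedules 5 at a time (concurrency: 5).
grid = list(itertools.product(learning_rates, dropouts, activations))
print(len(grid))  # 20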
Dashboard ~ Experiments
Dashboard ~ Metrics Visualization
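For metrics to show up in the metrics visualization, the training code reports them back to Polyaxon. A minimal sketch of the reporting step in model.py, assuming the polyaxon-helper package installed in the build step above (send_metrics; treat the exact API as an assumption):

import argparse

# polyaxon-helper reports named metric values back to the Polyaxon API
from polyaxon_helper import send_metrics

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--num_steps", type=int, default=500)
parser.add_argument("--learning_rate", type=float, default=0.001)
parser.add_argument("--dropout", type=float, default=0.25)
parser.add_argument("--num_epochs", type=int, default=1)
parser.add_argument("--activation", default="relu")
args = parser.parse_args()

# ... build and train the model with args (omitted) ...
loss, accuracy = 0.25, 0.91  # placeholder values for illustration

# These values are what the dashboard plots for each experiment
send_metrics(loss=loss, accuracy=accuracy)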
Agenda
- Why do we need Polyaxon?
- What is Polyaxon?
- How does Polyaxon work?
- Demo
- Summary
Hyperparameter tuning with a single machine
[Diagram: a machine with many CPU cores; the data sits in memory and each core trains the model with one parameter set (A, B, C, ..., X).]
For instance, scikit-learn’s GridSearchCV can run experiments in parallel, but the number of worker processes is bounded by the number of CPU cores: a machine with 64 CPU cores can run at most 64 fits concurrently.
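A minimal scikit-learn sketch (illustrative model and data, not from the talk):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, 16],
}

# n_jobs=-1 uses every CPU core, but the parallelism is still capped
# by the core count of this one machine.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, n_jobs=-1, cv=3)
search.fit(X, y)
print(search.best_params_)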
Hyperparameter tuning with Polyaxon
[Diagram: training code is uploaded to the Polyaxon core on k8s, built into an image, and scheduled as one pod per parameter set (A–F) across nodes A, B, and C.]
The more nodes in the k8s cluster, the more training processes can run at once; there is effectively no constraint on parallelism.
Even a single experiment can be shorter
[Timeline: the experiments + evaluation bar for Polyaxon is shorter than the one for a single machine — reduced training time.]
By leveraging multiple nodes on k8s, we can shorten the time of a single experiment through high concurrency.
Auto-scalable & preemptible node pool with Polyaxon
[Diagram sequence:
1. Polyaxon on k8s has two node pools — a fixed pool of nodes for the Polyaxon core and a separate pool for experiments.
2. Training code for experiment X is uploaded and run with concurrency 100 and per-pod requests of 1 CPU / 2 GB memory.
3. The experiment pool automatically launches new preemptible instances to satisfy those requests.
4. Training code for experiment Y is then uploaded and run with concurrency 50 and per-pod requests of 1 CPU / 1 GB memory.
5. The preemptible pool scales out further so that experiment Y's pods run alongside experiment X's.]
Preemptible instance/GPU/TPU pricing
[Pricing tables for preemptible instances, preemptible GPUs, and preemptible TPUs.]
Regular instance cost vs preemptible instance cost
[Chart: running cost over time for a regular instance vs a preemptible instance; the gap between the two is the reduced cost.]
- We can reduce the cost of training models by leveraging preemptible instances with Polyaxon.
- Polyaxon lets us use a preemptible node pool for experiments.
- Since Polyaxon automatically scales the node pool on GKE, we don’t need to keep static instances for experiments.
Multiple experiments with a single machine
[Timeline: three experiment + evaluation cycles run back to back.]
Running experiments sequentially on a single machine takes a long time, because Python ML libraries like sklearn basically do not scale beyond one machine.
Multiple experiments with Polyaxon
[Timeline: three experiment + evaluation cycles run in parallel.]
Polyaxon makes it easy to run multiple experiments in parallel on k8s; we don’t need to wait for one experiment to finish before moving on to the next. We can shorten the total experiment time through Polyaxon’s parallelism.
Sequential experiments cost vs parallel experiments cost
[Charts: running cost and labor cost over time for sequential vs parallel experiments; the labor-cost gap is the reduced cost.]
- Strictly speaking, the instance costs should be about the same, since the cost of CPU usage is linear in running time.
- However, we should not overlook the labor cost during experiments: waiting for experiments wastes both time and money. Time is money!
- We can reduce the total cost by shortening the total experiment time.
Power of multiple preemptible nodes
- 10 preemptible n1-standard-64 nodes for 2 hours give 64 cores × 10 nodes = 640 concurrent runs, and the cost is cheap:
  - $12.80 = $0.64/hour × 10 nodes × 2 hours
- Even if one parameter set takes about 20 minutes to train, 640 concurrent runs get through about 3,840 parameter sets (640 × 6 twenty-minute slots) in just 2 hours at that cost — see the sketch below.
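A quick check of the arithmetic in Python (assuming the roughly $0.64/hour preemptible n1-standard-64 price quoted above):

# Preemptible n1-standard-64: 64 vCPUs, ~$0.64/hour (price from the slide)
nodes, hours, price_per_hour = 10, 2, 0.64
cores_per_node = 64

cost = price_per_hour * nodes * hours                  # $12.80 for the pool
concurrency = cores_per_node * nodes                   # 640 parallel runs
minutes_per_run = 20
runs = concurrency * (hours * 60 // minutes_per_run)   # 640 * 6 = 3840 runs

print(cost, concurrency, runs)  # 12.8 640 3840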
We can achieve the objectives:
- Make the lead time of experiments as short as possible.
- Make the financial cost to train models as cheap as possible.
Agenda
- Why do we need Polyaxon?
- What is Polyaxon?
- How does Polyaxon work?
- Demo
- Summary
Demo
- Notebook
- Job
- Experiment
- Hyperparameter tuning at scale
Summary
- We can definitely achieve the objectives with Polyaxon on GKE:
  - Make the lead time of experiments as short as possible.
  - Make the financial cost to train models as cheap as possible.
  - Make the experiments reproducible.
- All we ML engineers have to do is:
  - Write the training code in Python as usual, and
  - Define the YAML files for the experiments.
- What’s next?
  - Supporting preemptible GPUs / TPUs.
Appendix A: Links
- Polyaxon
- https://polyaxon.com/
- Documentation
- https://docs.polyaxon.com/
- Examples
- https://github.com/polyaxon/polyaxon-quick-start
- https://github.com/polyaxon/deep-learning-with-python-notebooks-on-polyaxon
- https://github.com/polyaxon/polyaxon-examples
Appendix B: Architecture of Polyaxon
[Architecture diagram not reproduced here.]