Democratizing Machine
Learning on Kubernetes
Joy Qiao
Principal Solution Architect
Cloud and AI, Microsoft
@joyqiao2016
Lachlan Evenson
Principal Program Manager
Azure Containers, Microsoft
@LachlanEvenson
Who are we?
• The Data Scientist
• Building and training models
• Experience in Machine Learning frameworks/libraries
• Basic understanding of computer hardware
• Lucky to have Kubernetes experience
Who are we? (continued)
• The Infra Engineer/SRE
• Build and maintain bare-metal/cloud infra
• Kubernetes experience
• Little to no Machine Learning library experience
ML on Kubernetes
Infra Engineer
SRE
Data Scientist
Why this matters
We have the RIGHT tools and libraries to build and train models
We have the RIGHT platform in Kubernetes to run and train these models
What we’ve experienced
• Two discrete worlds are coming together
• The knowledge is not widely accessible to the right audience
• Nomenclature
• Documentation and use-cases are lacking
• APIs are evolving very fast, and sample code gets out of date quickly
Let’s get started
Distributed Deep Learning Architectures
Distributed Training Architecture
Data Parallelism
1. Parallel training on different machines
2. Update the parameters synchronously/asynchronously
3. Refresh the local model with new parameters, go to 1 and repeat
Credits: Taifeng Wang, DMTK team
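The three data-parallel steps can be sketched in a few lines of plain Python. This is a toy, single-process illustration of the synchronous variant only: the one-parameter model, the data, the learning rate, and the worker count are all made up, and the "workers" are simply loop iterations rather than separate machines.

```python
# Toy synchronous data-parallel SGD for a one-parameter model y = w * x.
# Data, learning rate, and worker count are hypothetical.

def local_gradient(w, shard):
    """Gradient of the mean squared error over one worker's data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def train_data_parallel(data, num_workers=2, lr=0.05, steps=200):
    # Give each worker a disjoint shard of the training data.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w = 0.0  # the shared (global) parameter
    for _ in range(steps):
        # Step 1: every worker computes a gradient on its own shard.
        grads = [local_gradient(w, shard) for shard in shards]
        # Step 2: synchronous update with the averaged gradient.
        w -= lr * sum(grads) / num_workers
        # Step 3: workers see the refreshed w on the next iteration.
    return w

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # generated with w = 3
w = train_data_parallel(data)
print(w)
```

An asynchronous variant would apply each worker's gradient as soon as it arrives instead of waiting to average them.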
Distributed Training Architecture
Model Parallelism
The global model is partitioned into K sub-models.
The sub-models are distributed over K local workers and serve as their local models.
In each mini-batch, the local workers compute the gradients of the local weights by back propagation.
Credits: Taifeng Wang, DMTK team
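A minimal sketch of the partitioning idea, assuming a single linear layer whose output units are split across K workers. The weights, input, and targets are hypothetical. In this toy case the sub-models are independent, so no activations have to be exchanged; a multi-layer model would need that extra communication.

```python
# Toy model parallelism: one linear layer y = W x, with the rows of W
# (one row per output unit) split across K workers. All values are made up.

def forward(rows, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def backward(rows, x, targets):
    """Gradients of 0.5 * squared error with respect to the given rows."""
    outputs = forward(rows, x)
    return [[(o - t) * xi for xi in x] for o, t in zip(outputs, targets)]

def partition(items, k):
    """Split a list into k contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
x = [1.0, 2.0]
targets = [0.0, 1.0, 0.0, 1.0]
K = 2

sub_models = partition(W, K)          # one sub-model per worker
sub_targets = partition(targets, K)
# Each worker back-propagates through its own weights only.
local_grads = [backward(m, x, t) for m, t in zip(sub_models, sub_targets)]
stitched = [row for grads in local_grads for row in grads]
full = backward(W, x, targets)        # reference: the unpartitioned model
print(stitched == full)
```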
Distributed TensorFlow Architecture
For Variable Distribution & Gradient Aggregation
• Parameter_server
Source: https://www.tensorflow.org/performance/performance_models
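The two roles of a parameter server, variable distribution and gradient aggregation, can be illustrated with a small in-process sketch. The class and function names below are hypothetical and are not TensorFlow APIs. The workers take turns inside one process, whereas a real deployment runs them concurrently on separate hosts.

```python
# In-process sketch of the parameter-server pattern (not a TensorFlow API).

class ParameterServer:
    """Holds the shared variables and applies gradients pushed by workers."""

    def __init__(self, params, lr=0.1):
        self.params = dict(params)
        self.lr = lr

    def pull(self):
        # Variable distribution: hand out a copy of the current values.
        return dict(self.params)

    def push(self, grads):
        # Gradient aggregation: apply one worker's gradients on arrival.
        for name, grad in grads.items():
            self.params[name] -= self.lr * grad

def worker_step(ps, x, y):
    """One worker step for the toy model y_hat = w * x + b."""
    p = ps.pull()
    err = p["w"] * x + p["b"] - y
    ps.push({"w": 2.0 * err * x, "b": 2.0 * err})

ps = ParameterServer({"w": 0.0, "b": 0.0}, lr=0.05)
samples = [(1.0, 3.0), (2.0, 5.0)]  # consistent with w = 2, b = 1
for step in range(2000):
    # Two workers alternate; neither waits for an averaged gradient.
    x, y = samples[step % 2]
    worker_step(ps, x, y)
print(ps.params)
```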
Distributed Deep Learning Tooling
ML on Kubernetes open source tooling
• Kubeflow - https://github.com/kubeflow/kubeflow
• JupyterHub, TensorFlow training/serving
• Seldon, Pachyderm, Pytorch-job
• Hands on Labs – https://aka.ms/kubeflow-labs
• Deep Learning Workspace
• https://github.com/microsoft/DLWorkspace/
• https://microsoft.github.io/DLWorkspace/
Distributed TensorFlow using Kubeflow
• MetaML - https://aka.ms/kubeflow-labs
Running Distributed TensorFlow on Kubernetes with Kubeflow
In just 3 simple steps
1. Create a Kubernetes cluster with GPU enabled nodes (AKS, GKE, EKS) - https://docs.microsoft.com/azure/aks/gpu-cluster
2. Install Kubeflow
3. Run Distributed TensorFlow training job
Detailed instructions at https://aka.ms/kubeflow-labs
Demo
Distributed Training Performance
on Kubernetes
Training Environment on Azure
Node VMs
o NC24r for workers
  § 4x NVIDIA® Tesla® K80 GPU
  § 24 CPU cores, 224 GB RAM
o D14_v2 for parameter server
  § 16 CPU cores, 112 GB RAM
Kubernetes: AKS (Azure Kubernetes Service) v1.9.6
GPU: NVIDIA® Tesla® K80
Benchmark scripts: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
• OS: Ubuntu 16.04 LTS
• TensorFlow: 1.8
• CUDA / cuDNN: 9.0 / 7.0
• Disk: Local SSD
• DataSet: ImageNet (real data, not synthetic)
Training on Single Pod, Multi-GPU
• Linear scalability
• GPUs are fully saturated
Distributed Training
Settings:
• Topology: 1 ps and 2 workers
• Async variables update
• Using cpu as the local_parameter_device
• Each ps/worker pod has its own dedicated host
• variable_update mode: parameter_server
• Network protocol: gRPC
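To make the topology concrete, the sketch below writes out a cluster of 1 ps and 2 workers, both as a TF_CONFIG-style JSON document and as a per-task flag list. The host names and port are hypothetical. The flag names are the ones the tf_cnn_benchmarks scripts are understood to accept and should be checked against the script version in use.

```python
import json

# Hypothetical pod host names and port; a job operator would normally supply these.
PS_HOSTS = ["tf-ps-0:2222"]
WORKER_HOSTS = ["tf-worker-0:2222", "tf-worker-1:2222"]

def tf_config(job_name, task_index):
    """Return a TF_CONFIG-style JSON string for one task in the cluster."""
    return json.dumps({
        "cluster": {"ps": PS_HOSTS, "worker": WORKER_HOSTS},
        "task": {"type": job_name, "index": task_index},
    })

def benchmark_flags(job_name, task_index):
    """Flags mirroring the settings above (names assumed, verify before use)."""
    return [
        "--job_name=" + job_name,
        "--task_index=" + str(task_index),
        "--ps_hosts=" + ",".join(PS_HOSTS),
        "--worker_hosts=" + ",".join(WORKER_HOSTS),
        "--variable_update=parameter_server",
        "--local_parameter_device=cpu",
    ]

# Print one flag line per task: 1 ps followed by 2 workers.
for job_name, count in (("ps", len(PS_HOSTS)), ("worker", len(WORKER_HOSTS))):
    for index in range(count):
        print(job_name, index, " ".join(benchmark_flags(job_name, index)))
```

Every task receives the same cluster description and differs only in its job name and index.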
Single Pod Training with 4 GPUs vs Distributed Training with 2 workers (8 GPUs in total)
Training Speedup on 2 nodes vs single-node
Distributed training scalability depends on the compute/communication ratio of the model
The model with a higher ratio scales better.
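A back-of-the-envelope model shows why the ratio matters. It assumes that each step costs a fixed compute time plus a fixed communication time that is not overlapped with compute, and that N nodes process N batches per step. The timings are hypothetical, not measurements from the benchmark.

```python
def speedup(num_nodes, compute_s, comm_s):
    """Throughput on num_nodes relative to one node, with no comm overlap."""
    single = 1.0 / compute_s                  # batches/sec on one node
    multi = num_nodes / (compute_s + comm_s)  # batches/sec on the cluster
    return multi / single

# Hypothetical per-step timings (seconds) for two kinds of model.
models = {
    "high compute/communication ratio": (0.9, 0.1),
    "low compute/communication ratio": (0.3, 0.7),
}
for name, (compute_s, comm_s) in models.items():
    ratio = compute_s / comm_s
    print(name, "ratio:", round(ratio, 2),
          "speedup on 2 nodes:", round(speedup(2, compute_s, comm_s), 2))
```

Under these assumptions the speedup is N * r / (r + 1) for ratio r, so it approaches N only when compute dominates, and it can drop below 1 when communication dominates.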
Observations during test:
• Linear scalability largely depends on the model and network bandwidth.
• GPUs were not fully saturated on the worker nodes, likely due to a network bottleneck.
• VGG16 does not scale across multiple nodes. GPUs were “starved” most of the time.
• Running parameter servers on the same pods as the workers seems to give worse performance.
• It is tricky to decide the right ratio of workers to parameter servers.
• Sync vs Async variable updates
How can we do better?
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow, Keras, and PyTorch
• A stand-alone Python package
• Seamless install on top of TensorFlow
• Uses NCCL for ring-allreduce across servers instead of parameter server
• Uses MPI for worker discovery and reduction coordination
• Tensor Fusion
Benchmark on 32 servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s network (source: https://github.com/uber/horovod)
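The ring-allreduce communication pattern can be simulated in one process to show how every worker ends up with the summed gradients without a parameter server. This is only an illustration of the pattern: Horovod delegates the real work to NCCL and MPI, and the function name and sample vectors here are hypothetical.

```python
def ring_allreduce(worker_data):
    """Sum equal-length vectors across workers using a simulated ring."""
    n = len(worker_data)
    length = len(worker_data[0])
    assert length % n == 0, "this sketch needs the length to divide evenly"
    size = length // n
    # chunks[i][c] is worker i's copy of chunk c.
    chunks = [[list(vec[c * size:(c + 1) * size]) for c in range(n)]
              for vec in worker_data]

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all messages first so the sends behave as simultaneous.
        sends = [((i - step) % n, list(chunks[i][(i - step) % n]))
                 for i in range(n)]
        for i, (c, payload) in enumerate(sends):
            dst = (i + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2 (allgather): pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, list(chunks[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for i, (c, payload) in enumerate(sends):
            chunks[(i + 1) % n][c] = payload

    return [[v for chunk in worker for v in chunk] for worker in chunks]

grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # one vector per worker
reduced = ring_allreduce(grads)
print(reduced)
```

Each worker only ever talks to its ring neighbour and sends one chunk per step, so no single host has to receive every worker's full gradient the way a parameter server does.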
Horovod: Uber’s Open Source Distributed Deep Learning Framework
Benchmark on AKS – Traditional Distributed TensorFlow vs Horovod
Training Speedup on 2 nodes vs single-node
Ecosystem open source Tooling for Distributed training on Kubernetes
“FreeFlow” CNI plugin from Microsoft Research (Credits: Yibo Zhu from MS Research)
https://github.com/Microsoft/Freeflow
• Speeds up container overlay network communication to match that of the host machine
• Supports both RDMA and TCP
• Higher throughput, lower latency, and less CPU overhead
• Pure user space implementation
• Does not rely on anything specific to Kubernetes
• Detailed steps on how to set up FreeFlow in K8s at https://github.com/joyq-github/TensorFlowonK8s/blob/master/FreeFlow.md
Custom CRI & Scheduler: GPU-related resource scheduling on K8s
(Credits: Sanjeev Mehrotra from MS Research)
• Pods requesting a number of GPUs with a given amount of memory
• Pods requesting a number of GPUs interconnected via NVLink, etc.
• https://github.com/Microsoft/KubeGPU
Fast.ai: Training ImageNet in 3 hours for $25 and CIFAR10 in 3 minutes for $0.26
http://www.fast.ai/2018/04/30/dawnbench-fastai/
Speed to accuracy is very important, in addition to images/sec
Algorithmic creativity is more important than bare-metal performance
• Super convergence - slowly increase the learning rate whilst decreasing momentum during training https://arxiv.org/abs/1708.07120
• Progressive resizing - train on smaller images at the start of training, and gradually increase image size as you train further
o Progressive Growing of GANs https://arxiv.org/abs/1710.10196
o Enhanced Deep Residual Networks https://arxiv.org/abs/1707.02921
• Half precision arithmetic
• Wide Residual Networks https://arxiv.org/abs/1605.07146
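The first two techniques are essentially schedules, which a short sketch can make concrete. The linear ramp, the learning-rate and momentum ranges, and the image sizes below are assumptions chosen for illustration. The referenced work may use different shapes and values.

```python
def lerp(start, end, fraction):
    """Linear interpolation between start and end."""
    return start + (end - start) * fraction

def lr_and_momentum(step, total_steps,
                    lr_range=(0.01, 0.1), momentum_range=(0.95, 0.85)):
    """Raise the learning rate while lowering momentum (assumed linear)."""
    fraction = step / max(total_steps - 1, 1)
    return lerp(*lr_range, fraction), lerp(*momentum_range, fraction)

def image_size(epoch, total_epochs, sizes=(128, 224, 288)):
    """Progressive resizing: small images early, larger ones later."""
    stage = min(epoch * len(sizes) // total_epochs, len(sizes) - 1)
    return sizes[stage]

# Walk through a hypothetical 9-epoch run, one schedule step per epoch.
total_epochs = 9
for epoch in range(total_epochs):
    lr, momentum = lr_and_momentum(epoch, total_epochs)
    print(epoch, image_size(epoch, total_epochs),
          round(lr, 4), round(momentum, 4))
```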
In Summary
• Distributed Training Architectures
• Open Source tooling on Kubernetes
• Performance benchmarks
• Ecosystem libraries and tooling
Resources
Kubeflow – https://github.com/kubeflow/kubeflow
Kubeflow Labs – https://aka.ms/kubeflow-labs
Deep Learning Workspace – https://github.com/microsoft/DLWorkspace/
AKS GPU Cluster create - https://docs.microsoft.com/azure/aks/gpu-cluster
Horovod - https://github.com/uber/horovod
FreeFlow - https://github.com/Microsoft/Freeflow
FreeFlow setup docs - https://github.com/joyq-github/TensorFlowonK8s/blob/master/FreeFlow.md
Kubernetes GPU Scheduler - https://github.com/Microsoft/KubeGPU
Fast.ai - http://www.fast.ai/2018/04/30/dawnbench-fastai/
Thank you
You can reach us on Twitter via:
@joyqiao2016
@LachlanEvenson
