Running Distributed TensorFlow on DC/OS
Kevin Klues
klueska@mesosphere.com
Kevin Klues is an Engineering Manager at Mesosphere where he leads the DC/OS Cluster Operations team.
Since joining Mesosphere, Kevin has been involved in the design and implementation of a number of Mesos’s
core subsystems, including GPU isolation, Pods, and Attach/Exec support. Prior to joining Mesosphere, Kevin
worked at Google on an experimental operating system for data centers called Akaros. He and a few others
founded the Akaros project while working on their Ph.D.s at UC Berkeley. In a past life, Kevin was a lead
developer of the TinyOS project, working at Stanford, the Technical University of Berlin, and the CSIRO in
Australia. When not working, you can usually find Kevin on a snowboard or up in the mountains in some
capacity or another.
What is DC/OS?
● DC/OS (Data Center Operating System) is an open-source, distributed operating system
● It takes Mesos and builds upon it with additional services and functionality:
○ Built-in support for service discovery, load balancing, security, and ease of installation
○ Extra tooling (e.g. a comprehensive CLI and a GUI)
○ Built-in frameworks for launching long-running services (Marathon) and batch jobs (Metronome)
○ A repository (app store) for installing other common packages and frameworks (e.g. Spark, Kafka, Cassandra, TensorFlow)
Overview of Talk
● Demo Setup (Preview)
● Typical developer workflow for TensorFlow
● Challenges running distributed TensorFlow
● Running distributed TensorFlow on DC/OS
● Demo
● Next Steps
Demo Setup - Train an Image Classifier
Step 1: Training (in the data center; over hours/days/weeks): lots of labeled data + a deep neural network model → output: a trained model
Step 2: Inference (at an endpoint or in the data center; near-instantaneous): new input from a camera or sensor + the trained model → output: a classification (e.g. 97% Dog, 3% Panda)
Demo Setup - Model and Training Data
● Train the Inception-V3 Image Classification model on the CIFAR-10 dataset
○ Inception-V3: an open-source image recognition model
○ CIFAR-10: a well-known dataset with 60,000 low-resolution images of 10 classes of objects (trucks, planes, ships, birds, cats, etc.)
Demo Setup - Training Deployment Strategy
● Run two separate TensorFlow jobs:
○ A non-distributed job with a single worker
○ A distributed job with several workers
(Both jobs read the same input data set, and each produces a trained model.)
Demo Setup
● Spin up a DC/OS cluster on GCE to run the jobs
○ 1 master, 8 agents
○ Each agent has:
■ 4 Tesla K80 GPUs
■ 8 CPUs
■ 32 GB of memory
○ HDFS pre-installed for serving training data
Demo Setup
● Log data from both jobs into HDFS
○ Use TensorBoard to monitor and compare their progress (a minimal logging sketch follows the note below)
Note: This is a serious model that would take over a week to fully train, even on a cluster of expensive machines. Our goal here is simply to demonstrate how easy it is to deploy and monitor large TensorFlow jobs on DC/OS.
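As a rough illustration of that logging setup (not the demo's actual code; the HDFS path is made up, and writing to hdfs:// assumes a TensorFlow build with HDFS support), the summary writer can point directly at a shared HDFS directory so TensorBoard can overlay both jobs:

import tensorflow as tf

# Hypothetical path: each job logs under its own subdirectory of a shared root.
logdir = "hdfs://namenode/tensorflow/logs/cifar-single"

# Stand-in scalar; in the real job this would be the model's loss tensor.
loss = tf.Variable(1.0, name="loss")
tf.summary.scalar("loss", loss)
summary_op = tf.summary.merge_all()

writer = tf.summary.FileWriter(logdir, graph=tf.get_default_graph())
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        writer.add_summary(sess.run(summary_op), global_step=step)
writer.close()

Pointing TensorBoard at the shared root (e.g. tensorboard --logdir=hdfs://namenode/tensorflow/logs) then shows both jobs side by side.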
Typical Developer Workflow for TensorFlow (Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow's basic machine learning primitives
● Write your code, optimized for single-node performance
● Train your model on the input data set on a single node → output a trained model (a minimal sketch follows)
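A minimal sketch of that single-node loop (TensorFlow 1.x style, matching the era of this talk; the one-layer model and random input batches are illustrative stand-ins, not the demo's Inception-V3 code):

import numpy as np
import tensorflow as tf

# Placeholders for a batch of flattened 32x32x3 images and their labels.
x = tf.placeholder(tf.float32, shape=[None, 32 * 32 * 3])
y = tf.placeholder(tf.int64, shape=[None])

# Stand-in model: a single dense layer instead of a deep network.
logits = tf.layers.dense(x, units=10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        # Random batches stand in for a real input pipeline.
        images = np.random.rand(64, 32 * 32 * 3).astype(np.float32)
        labels = np.random.randint(0, 10, size=64)
        sess.run(train_op, feed_dict={x: images, y: labels})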
Typical Developer Workflow for TensorFlow (Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow's basic machine learning primitives
● Write your code, optimized for distributed computation
● …
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed
● Deploy your code on every machine
● Train your model on the input data set on the cluster → output a trained model
Challenges running distributed TensorFlow
● Hard-coding a "ClusterSpec" is incredibly tedious
○ Users need to rewrite code for every job they want to run in a distributed setting
○ True even for code they "inherit" from standard models (an example hard-coded spec, and a sketch of how it is consumed, follow below)
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
Challenges running distributed TensorFlow
● Dealing with failures is not graceful
○ Users need to stop training, change their hard-coded ClusterSpec, and
manually restart their jobs
Challenges running distributed TensorFlow
● Manually configuring each node in a cluster takes a long time and is error-prone
○ Setting up access to a shared file system (for checkpoint and summary files)
requires authenticating on each node
○ Tweaking hyper-parameters requires re-uploading code to every node
Running distributed TensorFlow on DC/OS
● We use the dcos-commons SDK to dynamically create the ClusterSpec
○ The deployer supplies only the short service config below; the scheduler expands it into the full ClusterSpec for every task (a sketch of the idea follows the code)
{
  "service": {
    "name": "mnist",
    "job_url": "...",
    "job_context": "..."
  },
  "gpu_worker": { ... },
  "worker": { ... },
  "ps": { ... }
}
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
Running distributed TensorFlow on DC/OS
● A wrapper script abstracts away the distributed TensorFlow configuration
○ It separates "deployer" responsibilities (the service config shown above) from "developer" responsibilities (the user code referenced by "job_url"); a sketch of this boundary follows
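A highly simplified sketch of that boundary (illustrative only: the module name, entry-point signature, and argument names are assumptions, not the real interface of the DC/OS TensorFlow package). The wrapper owns all cluster plumbing; the developer's code, fetched from "job_url", never sees IPs or ports.

import importlib

def run_task(server, cluster, job_name, task_index, job_context):
    """Deployer side: called after the wrapper has built the ClusterSpec and
    started the tf.train.Server for this task."""
    if job_name == "ps":
        server.join()  # parameter servers just host variables and block here
        return
    # Developer side: unmodified user code, downloaded from "job_url".
    user_module = importlib.import_module("user_model")  # hypothetical module
    user_module.main(server=server, cluster=cluster,
                     task_index=task_index, context=job_context)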
Running distributed TensorFlow on DC/OS
● The dcos-commons SDK cleanly restarts failed tasks and reconnects
them to the cluster
Running distributed TensorFlow on DC/OS
● We use DC/OS Secrets (or, alternatively, environment variables) to pass credentials to every node in the cluster (a brief sketch follows)
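From a task's perspective, a secret surfaces as an environment variable or a file in its sandbox. The sketch below is illustrative only: the variable names, keytab path, and the Kerberos step are assumptions about one common HDFS setup, not the package's actual interface.

import os
import subprocess

# Hypothetical names: DC/OS Secrets (or plain env vars) could expose the
# principal and a keytab file to every task in the service.
hdfs_principal = os.environ["HDFS_PRINCIPAL"]
keytab_path = os.environ["HDFS_KEYTAB_PATH"]

# One common pattern for a Kerberized HDFS: obtain a ticket before the
# TensorFlow job touches the shared checkpoint/summary directories.
subprocess.check_call(["kinit", "-kt", keytab_path, hdfs_principal])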
Running distributed TensorFlow on DC/OS
● We use a runtime configuration dictionary (the "job_context" field below) to quickly tweak hyper-parameters between different runs of the same model (a sketch of how user code might read it follows)
{
  "service": {
    "name": "mnist",
    "job_url": "...",
    "job_context": "{...}"
  },
  "gpu_worker": { ... },
  "worker": { ... },
  "ps": { ... }
}
$ dcos beta-tensorflow update start \
>   --name=/cifar-multiple \
>   --options=cifar-multiple.json
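On the developer side, the job_context dictionary might be consumed like this (a sketch only; the environment-variable name and the specific hyper-parameter keys are illustrative assumptions):

import json
import os

# The "job_context" value from the service config, passed through to each task.
context = json.loads(os.environ.get("JOB_CONTEXT", "{}"))

learning_rate = context.get("learning_rate", 0.001)
batch_size = context.get("batch_size", 128)
num_steps = context.get("num_steps", 10000)

Tweaking a hyper-parameter is then just an edit to cifar-multiple.json followed by the dcos ... update start command above; no code is re-uploaded to any node.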
Demo Setup Recap
● Train the Inception-V3 Image Classification model on the CIFAR-10 dataset
● A non-distributed job with a single worker
● A distributed job with several workers
(Both jobs read the same input data set, and each produces a trained model.)
DEMO
Next Steps
● What we have today
○ Single Framework
○ Installed via standard DC/OS package management tools
○ Need to manually start/stop and remove the framework from the cluster when completed
(Diagram: one framework instance with a Chief Worker, Workers, parameter servers (PS), HDFS/GCS/etc., and TensorBoard)
Next Steps
● Where we are going
○ Meta Framework
○ Able to install / run instances of the original single framework
○ Launch and monitor via `tensorflow` CLI extensions
○ Automatically start/stop and remove frameworks from the cluster when completed
(Diagram: a Meta Framework driven by the CLI, managing multiple instances of the single framework, each with a Chief Worker, Workers, parameter servers (PS), HDFS/GCS/etc., and TensorBoard)
Next Steps
$ dcos tensorflow run train.py \
>   --workers=3 \
>   --ps=2
Running "train.py" on DC/OS with 3 workers and 2 parameter servers.
Special Thanks to All Collaborators
Sam Pringle (springle@mesosphere.com) - Primary Developer of the DC/OS TensorFlow Package
Jörg Schad, Ben Wood, Evan Lezar, Art Rand, Gabriel Hartmann, Chris Lambert
Questions and Links
● DC/OS TensorFlow Package (currently closed source)
○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
○ Slack: chat.dcos.io #tensorflow
Editor's Notes
● Slide 29 (Demo) - TensorBoard tabs to click through:
○ Scalars: mark arbitrary variables in your code to visualize how they change over time in the Scalars tab
○ Images: each step processes a batch of images; more images are processed per step in the distributed model. Takeaway: over time, the bounding box gets tighter and tighter around the subject of the image alone
○ Graphs
○ Histograms
○ Both jobs are represented because they log to the same bucket