Deep learning beyond the learning
@joerg_schad @dcos
Jörg Schad
Technical Community Lead / Developer Deep Learning
● Core Mesos developer at Mesosphere
● Twitter: @joerg_schad
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Promise
Deep Learning: The Process
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data (e.g. "Dog") → Deep neural network model → Output: Trained Model
Step 2: Inference (Endpoint or Data Center - Instantaneous)
Input: New Input from Camera or Sensor → Trained Model → Output: Classification (97% Dog, 3% Panda)
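The "97% Dog / 3% Panda" output of the inference step is typically a probability distribution over classes. A minimal sketch of how raw model scores could be turned into such percentages with a softmax; the scores below are made-up values chosen for illustration, not taken from the talk:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Dog", "Panda"]
logits = [4.0, 0.5]  # hypothetical scores from a trained model's last layer

probs = softmax(logits)
prediction = labels[probs.index(max(probs))]

for label, p in zip(labels, probs):
    print(f"{label}: {p:.0%}")
print("Prediction:", prediction)
```

The class with the highest probability is reported as the classification; the remaining probability mass shows how confident the model is.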
Deep Learning: Some insight
Deep Learning: The Challenges
1. Explore data using Jupyter notebook
2. Train the model using TensorFlow
3. Monitor training progress using TensorBoard
4. Debug Model using tfdbg
5. Serve Model using TensorFlow Serving
Cloud Pipeline
1. Data Preparation using Spark
2. Explore data using Jupyter notebook
3. Train the model using TensorFlow
4. Monitor training progress using TensorBoard
5. Debug Model using tfdbg
6. Serve Model using TensorFlow Serving
7. Streaming of requests
...
Open Source Pipeline
1. Data Preparation using Spark
2. Explore data using Jupyter notebook
3. Train the model using TensorFlow
4. Monitor training progress using TensorBoard
5. Debug Model using tfdbg
6. Serve Model using TensorFlow Serving
7. Kafka stream of requests
Kubeflow
Deep Learning Pipeline
● Data & Streaming: Distributed Data Storage and Streaming
● Users: Data Preparation and Analysis
● Frameworks & Cluster: Deep Learning Tools and Distributed Hosting
● Models: Building Machine Learning Model
● Model Serving: Sending Model to Clients
● Monitoring & Operations
Training Challenges
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data (e.g. "Dog") → Deep neural network model → Output: Trained Model
● Compute Intensive
○ (Hopefully) Large Datasets
■ Train
■ Dev
■ Test
○ Hyperparameters
■ #Layers
■ #Units per Layer
■ Learning Rate
■ …
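To make the cost of this concrete, here is a rough sketch of splitting a dataset into train/dev/test parts and enumerating a hyperparameter grid. The split fractions, the grid values, and the helper names are illustrative assumptions, not values from the talk:

```python
import itertools
import random

def split_dataset(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle once, then cut into train / dev / test partitions."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_dev = int(len(items) * dev_frac)
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])

# Hypothetical search space covering the hyperparameters listed above.
search_space = {
    "num_layers": [2, 4],
    "units_per_layer": [64, 128],
    "learning_rate": [0.1, 0.01, 0.001],
}

def enumerate_configs(space):
    """Yield every combination in the grid as a dict."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

train, dev, test = split_dataset(range(1000))
configs = list(enumerate_configs(search_space))
print(len(train), len(dev), len(test), "examples;", len(configs), "training runs")
```

Every configuration in the grid is a separate training run, so even this small grid multiplies the compute needed for one run by twelve.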
Data Management
Input Data Management
Input: Lots of Labeled Data
Challenges
● Training/Dev/Test + New Data
● Large amounts
● Quality
● Availability (for cluster)
● Velocity
● Streaming
Solutions
● GFS
● Apache Kafka
● Apache Cassandra
Data Preparation
Challenges
● Data is typically not ready to be consumed by ML job*
● Data Cleaning
○ Missing/incorrect labels
● Data Preparation
○ Same Format
○ Same Distribution
Solutions
* Demo datasets are a fortunate exception :)
Users
Challenges
● Different Users/Use cases
● Data Analyst/Exploring
● Production Workloads
● Highly Optimized
● How to spawn Environments?
Solutions
Frameworks
What is TensorFlow?
“An open-source software library for Machine Intelligence” - tensorflow.org
● Machine Intelligence is the broad term used to describe techniques allowing computers to “learn” by analyzing very large data sets using artificial neural networks
● TensorFlow is a software library that makes it easy for developers to construct artificial neural networks to analyze their data of interest
TensorFlow Library
● Python
● Dataflow Executor, Compute Kernel Implementations, Networking, etc.
● CPUs, GPUs
Alternatives
tf.enable_eager_execution()
https://www.tensorflow.org/get_started/eager
Data Analytics Ecosystem
APIs
Deep Learning Frameworks
Challenges
● Different Frameworks
● No one rules them all
Solutions
● Pick the right tool
● PMML if needed
Cluster
Typical Developer Workflow for TensorFlow (Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for single-node performance
● Train your model on a single node: Input Data Set → Output Trained Model
Typical Developer Workflow for TensorFlow (Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for distributed computation
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed
● Deploy your code on every machine
● Train your model on the cluster: Input Data Set → Output Trained Model
Resource Isolation and Allocation
TPUs
Datacenter
● Typical Datacenter: siloed, over-provisioned servers, low utilization
● Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
Workloads shown: TensorFlow, Jenkins, Kafka, Spark
Datacenter and Cloud as a Single Computing Resource, Powered by Apache Mesos
● Workloads: Microservices, Containers, & Dev Tools; Data Services, Machine Learning, & AI
● Platform: Security & Compliance, Application-Aware Automation, Multitenancy, Hybrid Cloud Management
● Infrastructure: Datacenter, Edge, Physical Infrastructure, Virtual Machines, Public Clouds
Challenges running distributed TensorFlow*
● Dealing with failures is not graceful
○ Users need to stop training, change their hard-coded ClusterSpec, and manually restart their jobs
* Any Distributed System
Operations shown: Deploy, Scale, Configure, Recover, 3 AM, ...
Typical Datacenter: siloed, over-provisioned servers, low utilization
Workloads shown: HDFS, Kafka, Kubernetes, Flink, TensorFlow
Two-level Scheduling
1. Agents advertise resources to Master
2. Master offers resources to Framework
3. Framework rejects / uses resources
4. Agent reports task status to Master
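The four steps above can be sketched as a toy simulation. This is not the Mesos API: the class names, method names, and resource fields are hypothetical and only mirror the order of the steps (advertise, offer, accept or reject, status report).

```python
from dataclasses import dataclass, field

@dataclass
class Offer:
    agent: str
    cpus: float
    gpus: int

@dataclass
class Framework:
    """Hypothetical framework scheduler that wants GPU tasks."""
    name: str
    cpus_per_task: float
    gpus_per_task: int
    tasks_wanted: int
    launched: list = field(default_factory=list)

    def on_offer(self, offer):
        """Step 3: use the offer if it fits a pending task, otherwise reject it."""
        fits = (offer.cpus >= self.cpus_per_task
                and offer.gpus >= self.gpus_per_task)
        if self.tasks_wanted > 0 and fits:
            self.tasks_wanted -= 1
            self.launched.append(offer.agent)
            return True
        return False

class Master:
    def __init__(self):
        self.available = {}    # agent name -> Offer
        self.task_status = {}  # agent name -> last reported status

    def advertise(self, agent, cpus, gpus):
        """Step 1: an agent advertises its free resources."""
        self.available[agent] = Offer(agent, cpus, gpus)

    def offer_round(self, framework):
        """Step 2: offer every free resource bundle to the framework."""
        for agent, offer in list(self.available.items()):
            if framework.on_offer(offer):
                # Simplification: an accepted offer is consumed entirely.
                del self.available[agent]
                self.report(agent, "TASK_RUNNING")

    def report(self, agent, status):
        """Step 4: the agent reports task status back to the master."""
        self.task_status[agent] = status

master = Master()
master.advertise("agent-1", cpus=4, gpus=0)
master.advertise("agent-2", cpus=8, gpus=2)

trainer = Framework("tensorflow", cpus_per_task=2, gpus_per_task=1, tasks_wanted=1)
master.offer_round(trainer)
print(trainer.launched, master.task_status)
```

The point of the two levels is that the master only decides who gets offered what, while each framework decides which offers suit its own tasks; here the framework rejects the CPU-only agent and uses the one with GPUs.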
MESOS ARCHITECTURE
Diagram: three Mesos Masters; framework schedulers (Flink, Spark, Kafka); Mesos Agents running executors and tasks (Cassandra, Spark, Docker, CDB)
Challenges running distributed TensorFlow
● Hard-coding a “ClusterSpec” is incredibly tedious
○ Users need to rewrite code for every job they want to run in a distributed setting
○ True even for code they “inherit” from standard models
tf.train.ClusterSpec({
"worker": [
"worker0.example.com:2222",
"worker1.example.com:2222",
"worker2.example.com:2222",
"worker3.example.com:2222",
"worker4.example.com:2222",
"worker5.example.com:2222",
...
],
"ps": [
"ps0.example.com:2222",
"ps1.example.com:2222",
"ps2.example.com:2222",
"ps3.example.com:2222",
...
]})
● Manually configuring each node in a cluster takes a long time and is error-prone
○ Setting up access to a shared file system (for checkpoint and summary files)
requires authenticating on each node
○ Tweaking hyper-parameters requires re-uploading code to every node
Running distributed TensorFlow on DC/OS
● We use the dcos-commons SDK to dynamically create the ClusterSpec
{
"service": {
"name": "mnist",
"job_url": "...",
"job_context": "..."
},
"gpu_worker": {... },
"worker": {... },
"ps": {... }
}
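As a rough illustration of what generating the ClusterSpec from a service configuration could look like, the sketch below derives the address lists from per-job task counts. It is not the dcos-commons implementation; the function name, the count-based input, and the reuse of the `<job><index>.example.com:2222` naming from the earlier slide are assumptions.

```python
def build_cluster_spec(task_counts, domain="example.com", port=2222):
    """Return a dict of job name -> list of "host:port" addresses.

    The result has the shape that tf.train.ClusterSpec accepts in
    TensorFlow 1.x; TensorFlow itself is not imported here.
    """
    return {
        job: [f"{job}{i}.{domain}:{port}" for i in range(count)]
        for job, count in task_counts.items()
    }

# Hypothetical deployment: three workers and two parameter servers.
spec = build_cluster_spec({"worker": 3, "ps": 2})
for job, addresses in spec.items():
    print(job, addresses)

# With TensorFlow 1.x installed, the dict could be passed on unchanged:
# cluster = tf.train.ClusterSpec(spec)
```

Scaling the job then means changing a count in the configuration rather than editing address lists in user code.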
● Wrapper script to abstract away distributed TensorFlow configuration
○ Separates “deployer” responsibilities from “developer” responsibilities
Components shown: User Code, Wrapper Script
● The dcos-commons SDK cleanly restarts failed tasks and reconnects
them to the cluster
Model Management
Many Models
Challenges
● Many Models
○ Different Hyperparameters
○ Different Models
○ New Training Data
○ ...
Solutions
● Persistent Storage + Metadata
● GFS
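One way to read "Persistent Storage + Metadata" is to store a small metadata record next to every trained model. The sketch below writes JSON records to a directory; the record fields, file naming, and function names are illustrative assumptions, and a temporary directory stands in for the persistent (distributed) storage.

```python
import json
import tempfile
import time
from pathlib import Path

def register_model(registry_dir, name, hyperparameters, training_data, metrics):
    """Write one metadata record per trained model version."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    version = len(list(registry.glob(f"{name}-v*.json"))) + 1
    record = {
        "name": name,
        "version": version,
        "hyperparameters": hyperparameters,
        "training_data": training_data,
        "metrics": metrics,
        "registered_at": time.time(),
    }
    (registry / f"{name}-v{version}.json").write_text(json.dumps(record, indent=2))
    return record

def best_model(registry_dir, name, metric):
    """Return the record with the highest value for the given metric."""
    records = [json.loads(p.read_text())
               for p in Path(registry_dir).glob(f"{name}-v*.json")]
    return max(records, key=lambda r: r["metrics"][metric])

registry_dir = tempfile.mkdtemp()  # stand-in for persistent storage
register_model(registry_dir, "mnist", {"learning_rate": 0.1},
               "train-2018-04", {"accuracy": 0.91})
register_model(registry_dir, "mnist", {"learning_rate": 0.01},
               "train-2018-04", {"accuracy": 0.97})

best = best_model(registry_dir, "mnist", "accuracy")
print(best["version"], best["hyperparameters"])
```

With the hyperparameters and training data recorded per version, models trained under different settings stay comparable and reproducible.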
TensorFlow Hub
https://www.tensorflow.org/hub/
Serving
Model Serving
Challenges
● How to Deploy Models?
○ Zero Downtime
○ Canary
Solutions
● TensorFlow Serving
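A canary rollout sends a small share of requests to the new model version while the rest stay on the stable one. The sketch below shows generic hash-based traffic splitting in front of two model versions; it is not TensorFlow Serving's API, and the version labels, request IDs, and 10% share are assumptions.

```python
import hashlib

def pick_model_version(request_id, stable="v1", canary="v2", canary_percent=10):
    """Route a fixed share of requests to the canary, deterministically per ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return canary if bucket < canary_percent else stable

counts = {"v1": 0, "v2": 0}
for i in range(10_000):
    counts[pick_model_version(f"request-{i}")] += 1
print(counts)
```

Because the bucket depends only on the request ID, the same caller keeps hitting the same version, and raising `canary_percent` step by step up to 100 moves all traffic to the new model without a restart.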
TensorFlow Lite
https://www.tensorflow.org/mobile/tflite/
Challenges
● Small/Fast model without losing too much performance
● 500 KB models…
Rendezvous Architecture
https://mapr.com/ebooks/machine-learning-logistics/
Monitoring
Challenges
● Understand {...}
● Debug
● Model Quality
○ Accuracy
○ Training Time
○ …
● Overall Architecture
○ Availability
○ Latencies
○ ...
Solutions
● TensorBoard
● Traditional Cluster Monitoring Tool
Debugging
● tfdbg
https://www.tensorflow.org/programmers_guide/debugger
● tfdbg GUI: currently alpha
https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md
Profiling
● Performance optimization for different devices
○ Keep device occupied
● Profiling! + Experience!
https://www.tensorflow.org/performance/performance_guide
Platforms
● AWS SageMaker
+ Spark, MXNet, TF
+ Serving/AB
- Cloud Only
● Google Datalab/ML-Engine
+ TF, Keras, Scikit, XGBoost
+ Serving/AB
- Cloud Only
- No control of Docker images
● Kubeflow
+ TF Everywhere
- TF only
● DC/OS
+ Flexibility (all of the above)
+ GPU support
- More manual setup
Demo
1. Explore data using Jupyter notebook
2. Train the model using TensorFlow
3. Monitor training progress using TensorBoard
4. Debug Model using tfdbg
5. Serve Model using TensorFlow Serving
Related Work
● DC/OS TensorFlow
https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
● DC/OS PyTorch
https://mesosphere.com/blog/deep-learning-pytorch-gpus/
● Ted Dunning’s Machine Learning Logistics
https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/
● Kubeflow
https://github.com/kubeflow/kubeflow
● TensorFlow (+ TensorBoard and Serving)
https://www.tensorflow.org/
Special Thanks to All Collaborators
Ben Wood, Robin Oh, Evan Lezar, Art Rand, Gabriel Hartmann, Chris Lambert, Bo Hu, Sam Pringle, Kevin Klues
Questions and Links
● DC/OS TensorFlow Package (currently closed source)
○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
○ Slack: chat.dcos.io #tensorflow

More Related Content

PDF
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Codemotion
 
PPTX
Webinar: Deep Learning Pipelines Beyond the Learning
Mesosphere Inc.
 
PDF
Simple, Modular and Extensible Big Data Platform Concept
Satish Mohan
 
PPT
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
DataWorks Summit
 
PPTX
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
Yahoo Developer Network
 
PDF
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
VMware Tanzu
 
PDF
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
VMware Tanzu
 
PPTX
Keep your Hadoop Cluster at its Best
DataWorks Summit/Hadoop Summit
 
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Codemotion
 
Webinar: Deep Learning Pipelines Beyond the Learning
Mesosphere Inc.
 
Simple, Modular and Extensible Big Data Platform Concept
Satish Mohan
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
DataWorks Summit
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
Yahoo Developer Network
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
VMware Tanzu
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
VMware Tanzu
 
Keep your Hadoop Cluster at its Best
DataWorks Summit/Hadoop Summit
 

What's hot (20)

PDF
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Databricks
 
PPTX
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
MLconf
 
PDF
Separating Hype from Reality in Deep Learning with Sameer Farooqui
Databricks
 
PPTX
Webinar: Déployez facilement Kubernetes & vos containers
Mesosphere Inc.
 
PPTX
Big Data Benchmarking
Venkata Naga Ravi
 
PPTX
High Performance Data Analytics with Java on Large Multicore HPC Clusters
Saliya Ekanayake
 
PDF
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Masayuki Matsushita
 
PDF
Distributed deep learning
Mehdi Shibahara
 
PDF
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Maurice Nsabimana
 
PPTX
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
 
PDF
Hadoop Fundamentals I
Romeo Kienzler
 
PDF
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
Databricks
 
PDF
Operationalizing Machine Learning at Scale with Sameer Nori
Databricks
 
PDF
Jfokus 2019-dowling-logical-clocks
Jim Dowling
 
PDF
Lessons Learned on Benchmarking Big Data Platforms
t_ivanov
 
PDF
Benchmarking Hadoop and Big Data
Nicolas Poggi
 
PPT
Hadoop tutorial
Aamir Ameen
 
PPTX
Apache Hadoop
Ajit Koti
 
PPTX
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Stefan Kupstaitis-Dunkler
 
PPTX
Lessons learned from running Spark on Docker
DataWorks Summit
 
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Databricks
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
MLconf
 
Separating Hype from Reality in Deep Learning with Sameer Farooqui
Databricks
 
Webinar: Déployez facilement Kubernetes & vos containers
Mesosphere Inc.
 
Big Data Benchmarking
Venkata Naga Ravi
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
Saliya Ekanayake
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Masayuki Matsushita
 
Distributed deep learning
Mehdi Shibahara
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Maurice Nsabimana
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
 
Hadoop Fundamentals I
Romeo Kienzler
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
Databricks
 
Operationalizing Machine Learning at Scale with Sameer Nori
Databricks
 
Jfokus 2019-dowling-logical-clocks
Jim Dowling
 
Lessons Learned on Benchmarking Big Data Platforms
t_ivanov
 
Benchmarking Hadoop and Big Data
Nicolas Poggi
 
Hadoop tutorial
Aamir Ameen
 
Apache Hadoop
Ajit Koti
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Stefan Kupstaitis-Dunkler
 
Lessons learned from running Spark on Docker
DataWorks Summit
 
Ad

Similar to Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018 (20)

PDF
TensorFlow 16: Building a Data Science Platform
Seldon
 
PDF
Building ML Pipelines with DCOS
QAware GmbH
 
PPTX
Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Mesosphere Inc.
 
PDF
Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad
Data Con LA
 
PDF
Distributed Deep Learning with Hadoop and TensorFlow
Jan Wiegelmann
 
PPTX
OS for AI: Elastic Microservices & the Next Gen of ML
Nordic APIs
 
PDF
Scaling Deep Learning Algorithms on Extreme Scale Architectures
inside-BigData.com
 
PDF
TensorFlow on Spark: A Deep Dive into Distributed Deep Learning
Evans Ye
 
PDF
Tensorflow 2.0 and Coral Edge TPU
Andrés Leonardo Martinez Ortiz
 
PDF
Austin,TX Meetup presentation tensorflow final oct 26 2017
Clarisse Hedglin
 
PPTX
Demystifying-AI-Frameworks-TensorFlow-PyTorch-JAX-and-More (1).pptx
Anant Garg
 
PPTX
Build a Neural Network for ITSM with TensorFlow
Entrepreneur / Startup
 
PPTX
Machines Can Learn - a Practical Take on Machine Intelligence Using Spring Cl...
Christian Tzolov
 
PDF
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Databricks
 
PDF
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
PyData
 
PDF
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
HostedbyConfluent
 
PDF
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
Databricks
 
PDF
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA
 
PDF
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
Databricks
 
PDF
Machine Learning on the Cloud with Apache MXNet
delagoya
 
TensorFlow 16: Building a Data Science Platform
Seldon
 
Building ML Pipelines with DCOS
QAware GmbH
 
Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Mesosphere Inc.
 
Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad
Data Con LA
 
Distributed Deep Learning with Hadoop and TensorFlow
Jan Wiegelmann
 
OS for AI: Elastic Microservices & the Next Gen of ML
Nordic APIs
 
Scaling Deep Learning Algorithms on Extreme Scale Architectures
inside-BigData.com
 
TensorFlow on Spark: A Deep Dive into Distributed Deep Learning
Evans Ye
 
Tensorflow 2.0 and Coral Edge TPU
Andrés Leonardo Martinez Ortiz
 
Austin,TX Meetup presentation tensorflow final oct 26 2017
Clarisse Hedglin
 
Demystifying-AI-Frameworks-TensorFlow-PyTorch-JAX-and-More (1).pptx
Anant Garg
 
Build a Neural Network for ITSM with TensorFlow
Entrepreneur / Startup
 
Machines Can Learn - a Practical Take on Machine Intelligence Using Spring Cl...
Christian Tzolov
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Databricks
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
PyData
 
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
HostedbyConfluent
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
Databricks
 
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
Databricks
 
Machine Learning on the Cloud with Apache MXNet
delagoya
 
Ad

More from Codemotion (20)

PDF
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Codemotion
 
PDF
Pompili - From hero to_zero: The FatalNoise neverending story
Codemotion
 
PPTX
Pastore - Commodore 65 - La storia
Codemotion
 
PPTX
Pennisi - Essere Richard Altwasser
Codemotion
 
PPTX
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Codemotion
 
PPTX
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Codemotion
 
PPTX
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Codemotion
 
PPTX
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Codemotion
 
PDF
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Codemotion
 
PDF
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Codemotion
 
PDF
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
Codemotion
 
PDF
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
Codemotion
 
PDF
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
Codemotion
 
PDF
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
Codemotion
 
PPTX
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Codemotion
 
PPTX
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
Codemotion
 
PDF
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
Codemotion
 
PDF
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
Codemotion
 
PDF
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
Codemotion
 
PDF
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
Codemotion
 
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Codemotion
 
Pompili - From hero to_zero: The FatalNoise neverending story
Codemotion
 
Pastore - Commodore 65 - La storia
Codemotion
 
Pennisi - Essere Richard Altwasser
Codemotion
 
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Codemotion
 
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Codemotion
 
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Codemotion
 
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Codemotion
 
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Codemotion
 
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Codemotion
 
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
Codemotion
 
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
Codemotion
 
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
Codemotion
 
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
Codemotion
 
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Codemotion
 
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
Codemotion
 
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
Codemotion
 
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
Codemotion
 
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
Codemotion
 
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
Codemotion
 

Recently uploaded (20)

PPTX
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
PPTX
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
PDF
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
PDF
Peak of Data & AI Encore - Real-Time Insights & Scalable Editing with ArcGIS
Safe Software
 
PPTX
The Future of AI & Machine Learning.pptx
pritsen4700
 
PDF
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
PDF
Brief History of Internet - Early Days of Internet
sutharharshit158
 
PDF
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
PDF
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
PDF
The Future of Artificial Intelligence (AI)
Mukul
 
PPTX
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
PPTX
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PDF
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 
PDF
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
PDF
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
PPTX
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
PDF
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
PPTX
Simple and concise overview about Quantum computing..pptx
mughal641
 
PPTX
Agile Chennai 18-19 July 2025 Ideathon | AI Powered Microfinance Literacy Gui...
AgileNetwork
 
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
Peak of Data & AI Encore - Real-Time Insights & Scalable Editing with ArcGIS
Safe Software
 
The Future of AI & Machine Learning.pptx
pritsen4700
 
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
Brief History of Internet - Early Days of Internet
sutharharshit158
 
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
The Future of Artificial Intelligence (AI)
Mukul
 
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
Simple and concise overview about Quantum computing..pptx
mughal641
 
Agile Chennai 18-19 July 2025 Ideathon | AI Powered Microfinance Literacy Gui...
AgileNetwork
 

Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018

  • 1. Deep learning beyond the learning @joerg_schad @dcos
  • 2. Jörg Schad Technical Community Lead / Developer Deep Learning ● Core Mesos developer at Mesosphere ● Twitter: @joerg_schad
  • 3. © 2018 Mesosphere, Inc. All Rights Reserved. Deep Learning: The Promise 3
  • 4. Deep Learning: The Process. Step 1: Training (in data center, over hours/days/weeks): Input: lots of labeled data → deep neural network model → Output: trained model. Step 2: Inference (endpoint or data center, instantaneous): new input from camera or sensor → trained model → Output: classification (97% Dog, 3% Panda)
  • 5. Deep Learning: Some insight
  • 6. Deep Learning: The Challenges
  • 7. 1. Explore data using Jupyter notebook 2. Train the model using TensorFlow 3. Monitor training progress using TensorBoard 4. Debug Model using tfdbg 5. Serve Model using TensorFlow Serving
  • 8. Cloud Pipeline 1. Data Preparation using Spark 2. Explore data using Jupyter notebook 3. Train the model using TensorFlow 4. Monitor training progress using TensorBoard 5. Debug Model using tfdbg 6. Serve Model using TensorFlow Serving 7. Streaming of requests ...
  • 9. Open Source Pipeline 1. Data Preparation using Spark 2. Explore data using Jupyter notebook 3. Train the model using TensorFlow 4. Monitor training progress using TensorBoard 5. Debug Model using tfdbg 6. Serve Model using TensorFlow Serving 7. Kafka stream of requests. Kubeflow
  • 10. Deep Learning Pipeline: Data & Streaming (Distributed Data Storage and Streaming); Users (Data Preparation and Analysis); Frameworks & Cluster (Deep Learning Tools and Distributed Hosting); Models (Building Machine Learning Model); Model Serving (Sending Model to Clients); Monitoring & Operations
  • 11. Training Challenges. Step 1: Training (in data center, over hours/days/weeks): Input: lots of labeled data → deep neural network model → Output: trained model ● Compute Intensive ○ (Hopefully) Large Datasets ■ Train ■ Dev ■ Test ○ Hyperparameters ■ Number of layers ■ Number of units per layer ■ Learning rate ■ …
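The two cost drivers on this slide (three data partitions and a multiplying set of hyperparameters) can be illustrated with a small standard-library sketch. It is not taken from the talk: the function names, the 80/10/10 split, and the values in `SEARCH_SPACE` are assumptions chosen for illustration.

```python
import itertools
import random


def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut the data into train/dev/test partitions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


def hyperparameter_grid(search_space):
    """Yield one dict per combination of the listed hyperparameter values."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical search space; the values are illustrative only.
SEARCH_SPACE = {
    "num_layers": [2, 4],
    "units_per_layer": [64, 128, 256],
    "learning_rate": [0.1, 0.01],
}

train, dev, test = split_dataset(range(1000))
configs = list(hyperparameter_grid(SEARCH_SPACE))
print(len(train), len(dev), len(test), len(configs))
```

Even this tiny grid produces 12 training runs, which is one reason training is compute intensive: every added hyperparameter multiplies the number of jobs a cluster has to schedule.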
  • 12. Data Management
  • 13. Input Data Management (Input: Lots of Labeled Data). Challenges ● Training/Dev/Test + New Data ● Large amounts ● Quality ● Availability (for cluster) ● Velocity ● Streaming. Solutions ● GFS ● Apache Kafka ● Apache Cassandra
  • 14. Data Preparation. Challenges ● Data is typically not ready to be consumed by an ML job* ● Data Cleaning ● Missing/incorrect labels ● Data Preparation ● Same Format ● Same Distribution. Solutions. * Demo datasets are a fortunate exception :)
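As a rough illustration of the cleaning and preparation steps listed above, the following sketch drops records with missing or unknown labels, brings the remaining labels into one format, and reports the label distribution so that partitions can be compared. The record layout, the `ALLOWED_LABELS` set, and the function names are assumptions, not part of the talk.

```python
from collections import Counter

# Assumption: the set of valid labels is known up front.
ALLOWED_LABELS = {"dog", "panda"}


def clean_examples(raw_examples, allowed_labels=ALLOWED_LABELS):
    """Normalise labels and drop records whose label is missing or unknown."""
    kept, dropped = [], []
    for record in raw_examples:
        label = record.get("label")
        normalised = label.strip().lower() if isinstance(label, str) else None
        if normalised in allowed_labels:
            kept.append({**record, "label": normalised})
        else:
            dropped.append(record)
    return kept, dropped


def label_distribution(examples):
    """Return the share of each label, to compare train/dev/test partitions."""
    counts = Counter(record["label"] for record in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


raw = [
    {"id": 1, "label": "Dog"},
    {"id": 2, "label": " panda "},
    {"id": 3, "label": None},   # missing label
    {"id": 4, "label": "cat"},  # label outside the allowed set
    {"id": 5},                  # no label field at all
    {"id": 6, "label": "dog"},
]
kept, dropped = clean_examples(raw)
distribution = label_distribution(kept)
print(len(kept), len(dropped), distribution)
```

Dropped records are returned rather than discarded so they can be inspected or relabeled later.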
  • 15. Users
  • 16. Users. Challenges ● Different Users/Use cases ● Data Analyst/Exploring ● Production Workloads ● Highly Optimized ● How to spawn Environments? Solutions
  • 17. Users. Challenges ● Different Users/Use cases ● Data Analyst/Exploring ● Production Workloads ● Highly Optimized ● How to spawn Environments? Solutions
  • 18. Frameworks
  • 19. 19
  • 20. What is TensorFlow? “An open-source software library for Machine Intelligence” - tensorflow.org ● Machine Intelligence is the broad term used to describe techniques allowing computers to “learn” by analyzing very large data sets using artificial neural networks
  • 21. What is TensorFlow? “An open-source software library for Machine Intelligence” - tensorflow.org ● TensorFlow is a software library that makes it easy for developers to construct artificial neural networks to analyze their data of interest. Diagram labels: TensorFlow Library; Python; Dataflow Executor, Compute Kernel Implementations, Networking, etc.; GPUs; CPUs
  • 22. © 2017 Mesosphere, Inc. All Rights Reserved. 22
  • 23. Alternatives
  • 24. Alternatives: tf.enable_eager_execution() https://www.tensorflow.org/get_started/eager
  • 25. Data Analytics Ecosystem
  • 26. APIs
  • 27. Deep Learning Frameworks. Challenges ● Different Frameworks ● No one rules them all. Solutions ● Pick the right tool ● PMML if needed
  • 28. Cluster
  • 29. Typical Developer Workflow for TensorFlow (Single-Node) ● Download and install the Python TensorFlow library ● Design your model in terms of TensorFlow’s basic machine learning primitives ● Write your code, optimized for single-node performance ● Train on your data on a single node → Output: Trained Model (Input Data Set → Trained Model)
  • 30. Typical Developer Workflow for TensorFlow (Distributed) ● … ● Provision a set of machines to run your computation ● Install TensorFlow on them ● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed ● Deploy your code on every machine ● Train on your data on the cluster → Output: Trained Model (Input Data Set → Trained Model)
  • 31. Typical Developer Workflow for TensorFlow (Distributed) ● Download and install the Python TensorFlow library ● Design your model in terms of TensorFlow’s basic machine learning primitives ● Write your code, optimized for distributed computation ● …
  • 32. Resource Isolation and Allocation
  • 33. TPU
  • 34. TPUs
  • 35. Datacenter. Typical Datacenter: siloed, over-provisioned servers, low utilization. Mesos/DC/OS: automated schedulers, workload multiplexing onto the same machines. Workloads: TensorFlow, Jenkins, Kafka, Spark
  • 36. Datacenter and Cloud as a Single Computing Resource, Powered by Apache Mesos. Diagram labels: Physical Infrastructure; Virtual Machines; Public Clouds; Datacenter; Edge; Microservices, Containers, & Dev Tools; Data Services, Machine Learning, & AI; Security & Compliance; Application-Aware Automation; Multitenancy; Hybrid Cloud Management; 100+ more; 20+ more
  • 37. Challenges running distributed TensorFlow* ● Dealing with failures is not graceful ○ Users need to stop training, change their hard-coded ClusterSpec, and manually restart their jobs. * Any Distributed System
  • 38. Deploy, Scale, Configure, Recover, 3 AM ... Typical Datacenter: siloed, over-provisioned servers, low utilization. Services: HDFS, Kafka, Kubernetes, Flink, TensorFlow
  • 39. Mesos Architecture: Two-level Scheduling 1. Agents advertise resources to Master 2. Master offers resources to Framework 3. Framework rejects / uses resources 4. Agent reports task status to Master. Diagram labels: Mesos Master (×3); Mesos Agent, Mesos Agent Service; Cassandra Executor, Cassandra Task; Spark Executor, Spark Task; Docker Executor, Docker Task; CDB Executor, Spark Task; Flink Scheduler; Spark Scheduler; Kafka Scheduler
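The four steps of two-level scheduling can be traced in a toy, in-memory model. This is a sketch for intuition only and does not use the Mesos API: `ToyMaster`, `ToyFramework`, `Offer`, and `resource_offer` are hypothetical names, and real offers carry more resource types and go through a networked protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    agent: str
    cpus: float
    gpus: int


@dataclass
class ToyFramework:
    """Framework scheduler: decides per offer whether to use or reject it."""
    name: str
    cpus_per_task: float
    gpus_per_task: int
    tasks_wanted: int
    launched: list = field(default_factory=list)

    def resource_offer(self, offer):
        fits = (offer.cpus >= self.cpus_per_task
                and offer.gpus >= self.gpus_per_task)
        if self.tasks_wanted > 0 and fits:
            self.tasks_wanted -= 1
            self.launched.append(offer.agent)
            return True   # step 3: use the offer and launch one task
        return False      # step 3: reject, resources stay with the master


class ToyMaster:
    def __init__(self):
        self.available = {}  # agent name -> Offer with remaining resources
        self.status = []     # (agent, framework, state) status updates

    def advertise(self, agent, cpus, gpus):
        """Step 1: an agent advertises its resources to the master."""
        self.available[agent] = Offer(agent, cpus, gpus)

    def offer_round(self, frameworks):
        for framework in frameworks:
            for offer in self.available.values():    # step 2: master offers
                if framework.resource_offer(offer):  # step 3: framework decides
                    offer.cpus -= framework.cpus_per_task
                    offer.gpus -= framework.gpus_per_task
                    # Step 4: task status is reported back to the master.
                    self.status.append(
                        (offer.agent, framework.name, "TASK_RUNNING"))


master = ToyMaster()
master.advertise("agent-1", cpus=4, gpus=0)
master.advertise("agent-2", cpus=8, gpus=2)
tf_job = ToyFramework("tensorflow", cpus_per_task=2, gpus_per_task=1, tasks_wanted=2)
spark_job = ToyFramework("spark", cpus_per_task=4, gpus_per_task=0, tasks_wanted=2)
master.offer_round([tf_job, spark_job])
print(master.status)
```

The design point the sketch tries to capture is that the master only hands out resources, while each framework keeps its own placement logic (here, the TensorFlow job refuses the agent without a GPU).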
  • 40. Challenges running distributed TensorFlow ● Hard-coding a “ClusterSpec” is incredibly tedious ○ Users need to rewrite code for every job they want to run in a distributed setting ○ True even for code they “inherit” from standard models. Example: tf.train.ClusterSpec({ "worker": [ "worker0.example.com:2222", "worker1.example.com:2222", "worker2.example.com:2222", "worker3.example.com:2222", "worker4.example.com:2222", "worker5.example.com:2222", ... ], "ps": [ "ps0.example.com:2222", "ps1.example.com:2222", "ps2.example.com:2222", "ps3.example.com:2222", ... ]})
  • 41. Challenges running distributed TensorFlow ● Manually configuring each node in a cluster takes a long time and is error-prone ○ Setting up access to a shared file system (for checkpoint and summary files) requires authenticating on each node ○ Tweaking hyper-parameters requires re-uploading code to every node
  • 42. Typical Developer Workflow for TensorFlow (Distributed) ● … ● Provision a set of machines to run your computation ● Install TensorFlow on them ● Write code to map distributed computations to the exact IP of the machine where those computations will be performed ● Deploy your code on every machine ● Train on your data on the cluster → Output: Trained Model (Input Data Set → Trained Model)
  • 43. Running distributed TensorFlow on DC/OS ● We use the dcos-commons SDK to dynamically create the ClusterSpec. Service configuration: { "service": { "name": "mnist", "job_url": "...", "job_context": "..." }, "gpu_worker": {... }, "worker": {... }, "ps": {... } } Generated: tf.train.ClusterSpec({ "worker": [ "worker0.example.com:2222", "worker1.example.com:2222", "worker2.example.com:2222", "worker3.example.com:2222", "worker4.example.com:2222", "worker5.example.com:2222", ... ], "ps": [ "ps0.example.com:2222", "ps1.example.com:2222", "ps2.example.com:2222", "ps3.example.com:2222", ... ]})
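The slides do not show how the dcos-commons SDK builds the ClusterSpec, so the following is only a sketch of the general idea: derive the host lists from task counts instead of typing them by hand. The function name `build_cluster_spec`, its parameters, and the `<job><index>.<domain>` naming scheme are assumptions modeled on the addresses shown on the slide.

```python
def build_cluster_spec(task_counts, domain="example.com", port=2222):
    """Return a dict shaped like the argument passed to tf.train.ClusterSpec.

    task_counts maps a job name (e.g. "worker", "ps") to the number of tasks.
    """
    return {
        job: [f"{job}{index}.{domain}:{port}" for index in range(count)]
        for job, count in task_counts.items()
        if count > 0
    }


# Hypothetical task counts; a scheduler would supply the real ones.
spec = build_cluster_spec({"worker": 3, "ps": 2})
print(spec)
```

In TensorFlow 1.x the resulting dictionary could be passed to `tf.train.ClusterSpec(spec)`. Scaling the job then means changing a count rather than editing a hard-coded address list.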
  • 44. Running distributed TensorFlow on DC/OS ● Wrapper script to abstract away distributed TensorFlow configuration ○ Separates “deployer” responsibilities from “developer” responsibilities. Service configuration: { "service": { "name": "mnist", "job_url": "...", "job_context": "..." }, "gpu_worker": {... }, "worker": {... }, "ps": {... } } Diagram labels: User Code; Wrapper Script
  • 45. Running distributed TensorFlow on DC/OS ● The dcos-commons SDK cleanly restarts failed tasks and reconnects them to the cluster
  • 46. Model Management
  • 47. Recall. Step 1: Training (in data center, over hours/days/weeks): Input: lots of labeled data → deep neural network model → Output: trained model. Step 2: Inference (endpoint or data center, instantaneous): new input from camera or sensor → trained model → Output: classification (97% Dog, 3% Panda)
  • 48. Many Models. Step 1: Training (in data center, over hours/days/weeks): Input: lots of labeled data → deep neural network model → Output: trained model
  • 49. Model Management. Challenges ● Many Models ● Different Hyperparameters ● Different Models ● New Training Data ● ... Solutions ● Persistent Storage + Metadata. GFS
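One way to read "Persistent Storage + Metadata" is that every stored model gets a record describing how it was produced. The sketch below assumes a JSON-serializable record; the field names, the `dev_accuracy` metric, and the sample values are hypothetical and not taken from the talk.

```python
import hashlib
import json
import time


def model_record(model_bytes, hyperparameters, training_data_id, metrics):
    """Describe one trained model so it can be found and compared later."""
    return {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "hyperparameters": hyperparameters,
        "training_data_id": training_data_id,
        "metrics": metrics,
        "created_at": time.time(),
    }


def best_model(records, metric="dev_accuracy"):
    """Pick the record with the highest value for the given metric."""
    return max(records, key=lambda record: record["metrics"][metric])


# Illustrative records; the bytes stand in for serialized model files.
records = [
    model_record(b"model-a", {"learning_rate": 0.1}, "dataset-v1", {"dev_accuracy": 0.91}),
    model_record(b"model-b", {"learning_rate": 0.01}, "dataset-v1", {"dev_accuracy": 0.94}),
    model_record(b"model-c", {"learning_rate": 0.01}, "dataset-v2", {"dev_accuracy": 0.93}),
]
winner = best_model(records)
print(json.dumps(winner["hyperparameters"]), winner["training_data_id"])
```

Keeping the content hash, hyperparameters, and training data identifier together is what makes it possible to tell apart models that differ only in one of the dimensions listed under Challenges.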
  • 50. TensorFlow Hub https://www.tensorflow.org/hub/
  • 51. Serving
  • 52. Model Serving. Challenges ● How to Deploy Models? ● Zero Downtime ● Canary. Solutions ● TensorFlow Serving
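A canary rollout sends a small share of requests to the new model version while the stable version keeps serving the rest. The sketch below shows only the traffic-splitting idea in plain Python; it is not a TensorFlow Serving feature or API, and the function names and the 10% fraction are assumptions.

```python
import random


def choose_version(canary_fraction, rng=random):
    """Return 'canary' for roughly canary_fraction of calls, else 'stable'."""
    return "canary" if rng.random() < canary_fraction else "stable"


def route_requests(n_requests, canary_fraction, seed=0):
    """Simulate routing and count how many requests each version receives."""
    rng = random.Random(seed)
    counts = {"stable": 0, "canary": 0}
    for _ in range(n_requests):
        counts[choose_version(canary_fraction, rng)] += 1
    return counts


counts = route_requests(10_000, canary_fraction=0.1)
print(counts)
```

Raising `canary_fraction` step by step to 1.0 moves all traffic to the new version without taking the stable one offline first, which is how a canary also supports the zero-downtime goal.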
  • 53. TensorFlow Lite https://www.tensorflow.org/mobile/tflite/ Challenges ● Small/Fast model without losing too much performance ● 500 KB models….
  • 54. Rendezvous Architecture https://mapr.com/ebooks/machine-learning-logistics/
  • 55. Monitoring
  • 56. Monitoring. Challenges ● Understand {...} ● Debug ● Model Quality ● Accuracy ● Training Time ● … ● Overall Architecture ● Availability ● Latencies ● ... Solutions ● TensorBoard ● Traditional Cluster Monitoring Tool
  • 57. Debugging: tfdbg https://www.tensorflow.org/programmers_guide/debugger
  • 58. Debugging: tfdbg GUI (currently alpha) https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md
  • 59. Profiling. Performance optimization for different devices - Keep device occupied. Profiling! + Experience! https://www.tensorflow.org/performance/performance_guide
  • 60. Platforms ● AWS SageMaker: (+) Spark, MXNet, TF; (+) Serving/AB; (-) Cloud only ● Google Datalab/ML-Engine: (+) TF, Keras, Scikit, XGBoost; (+) Serving/AB; (-) Cloud only; (-) No control of Docker images ● KubeFlow: (+) TF everywhere; (-) TF only ● DC/OS: (+) Flexibility (all of the above); (+) GPU support; (-) More manual setup
  • 61. Demo 1. Explore data using Jupyter notebook 2. Train the model using TensorFlow 3. Monitor training progress using TensorBoard 4. Debug Model using tfdbg 5. Serve Model using TensorFlow Serving
  • 62. Related Work ● DC/OS TensorFlow https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/ ● DC/OS PyTorch https://mesosphere.com/blog/deep-learning-pytorch-gpus/ ● Ted Dunning’s Machine Learning Logistics https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/ ● KubeFlow https://github.com/kubeflow/kubeflow ● TensorFlow (+ TensorBoard and Serving) https://www.tensorflow.org/
  • 63. Special Thanks to All Collaborators: Ben Wood, Robin Oh, Evan Lezar, Art Rand, Gabriel Hartmann, Chris Lambert, Bo Hu, Sam Pringle, Kevin Klues
  • 64. Questions and Links ● DC/OS TensorFlow Package (currently closed source) ○ https://github.com/mesosphere/dcos-tensorflow ● DC/OS TensorFlow Tools ○ https://github.com/dcos-labs/dcos-tensorflow-tools/ ● Tutorial for deploying TensorFlow on DC/OS ○ https://github.com/dcos/examples/tree/master/tensorflow ● Contact: ○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos ○ Slack: chat.dcos.io #tensorflow