Deep Learning in the Cloud at Scale:
A Data Orchestration Story
Chao Wang, Mickey Zhang, Qianjun Xu
Overview
Machine Learning Requirements
End-to-End lifecycle and processes
Data Scientist Workflow
Deep Learning on Azure Machine Learning
Deep Learning: Additional Requirements
Distributed Training with Azure ML Compute
Kubernetes + Alluxio
Requirements of an advanced ML Platform
Machine Learning
Typical E2E Process
Prepare Data
Build Model (your favorite IDE)
Train & Test Model
Register and Manage Model
Build Image
…
Deploy Service
Monitor Model
Prepare, Experiment, Deploy
Orchestrate
DevOps loop for data science
Prepare
Prepare Data
Build Model (your favorite IDE)
Train & Test Model
Register and Manage Model
Build Image
…
Deploy Service
Monitor Model
Deep Learning on Azure Machine Learning
Characteristics of Deep Learning
Massive amounts of training data
Excels with raw, unstructured data
Automatic feature extraction
Computationally expensive
Distributed training mode: Data parallelism
[Diagram: Job manager; Dataset; CNN model; Worker 1 with Subset 1 and a CNN model; Worker 2 with Subset 2 and a CNN model]
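A minimal sketch of the data-parallel mode, assuming a toy one-parameter linear model in place of the CNN and sequential execution in place of real workers. The function names (`gradient`, `split`, `train`) and the toy data are illustrative assumptions, not part of the presented system. Each worker holds the full model, computes a gradient on its own subset, and the gradients are averaged before every update.

```python
# Data-parallel training sketch: every worker holds the full model and
# sees only its own subset of the data. All names here are hypothetical.

def gradient(w, subset):
    """Mean-squared-error gradient of the model y = w * x over one subset."""
    return sum(2 * (w * x - y) * x for x, y in subset) / len(subset)


def split(dataset, num_workers):
    """Job manager step: deal examples round-robin into per-worker subsets."""
    return [dataset[i::num_workers] for i in range(num_workers)]


def train(dataset, num_workers=2, lr=0.01, steps=200):
    w = 0.0  # single parameter, replicated on every worker
    subsets = split(dataset, num_workers)
    for _ in range(steps):
        # Each worker computes a gradient on its own subset
        # (in parallel on a real cluster; sequentially in this sketch).
        grads = [gradient(w, s) for s in subsets]
        # Averaging the gradients keeps all model replicas identical.
        w -= lr * sum(grads) / len(grads)
    return w


dataset = [(x, 3.0 * x) for x in range(1, 9)]  # toy data generated by y = 3x
w = train(dataset)
```

With equal-sized subsets, the averaged gradient equals the gradient over the whole dataset, which is why the replicas behave like one model trained on all the data.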
Distributed training mode: Model parallelism
[Diagram: Job manager; Dataset; CNN model; Worker 1 and Worker 2, labeled with CNN model and Subset 1 / Subset 2]
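For contrast, a minimal sketch of the model-parallel mode, assuming a tiny two-layer network split so that each worker owns one layer. The layer functions and weights are hypothetical, and the "workers" are plain function calls; on a real cluster the intermediate activations would travel over the network between devices.

```python
# Model-parallel sketch: the model is split across workers, and each input
# flows through every worker in turn. All names and weights are hypothetical.

def stage_one(x, w1):
    """Worker 1 owns the first layer: a linear map followed by ReLU."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w1]


def stage_two(h, w2):
    """Worker 2 owns the second layer: a linear map to a single output."""
    return sum(wi * hi for wi, hi in zip(w2, h))


def forward(x, w1, w2):
    h = stage_one(x, w1)     # activations computed on worker 1
    return stage_two(h, w2)  # activations handed to worker 2


w1 = [[1.0, -1.0], [0.5, 0.5]]  # held only by worker 1
w2 = [2.0, 1.0]                 # held only by worker 2
y = forward([3.0, 1.0], w1, w2)
```

No worker ever needs the full set of weights, which is the property that makes this mode useful when a model does not fit on one device.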
Challenges of distributed training
Dependencies and Containers
Provision clusters of VMs
Schedule jobs
Distribute data
Gather results
Handle failures
Scale resources
Secure Access
Kubernetes and Alluxio
Deep Learning Scenarios
1. Standard ImageNet
2. BERT-Large
3. Checkpoint Save/Load
Typical Data Consumption Model
[Diagram: Storage/NFS serving a Deep Learning Training Platform on Azure Kubernetes Service; three nodes, each with RAM, SSD, GPU, and CPU]
Why Alluxio?
Scalable
Performance scales with the cluster size
Lower Cost
Leverage idle resources in the cluster
Performance
Improve data access throughput by distributing data across nodes
Flexibility
Manage multiple data sources in a unified namespace
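The unified-namespace idea can be illustrated with a small mount table that maps one logical path tree onto several backing stores. This is a sketch of the concept only: the `Namespace` class, its methods, and the storage URIs below are hypothetical and are not Alluxio's API.

```python
# Unified-namespace sketch: one logical path tree, several backing stores.
# The class, method names, and URIs are hypothetical illustrations.

class Namespace:
    def __init__(self):
        self.mounts = {}  # logical prefix -> backing store URI

    def mount(self, prefix, backend_uri):
        """Attach a backing store under a logical path prefix."""
        self.mounts[prefix.rstrip("/")] = backend_uri.rstrip("/")

    def resolve(self, path):
        """Translate a logical path using the longest matching mount prefix."""
        best = None
        for prefix in self.mounts:
            if path == prefix or path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no mount covers {path}")
        return self.mounts[best] + path[len(best):]


ns = Namespace()
ns.mount("/imagenet", "wasbs://images@example.blob.core.windows.net/train")
ns.mount("/bert", "nfs://filer.example/bert")
resolved = ns.resolve("/imagenet/n01/a.jpg")
```

A training job written against `/imagenet/...` and `/bert/...` would not need to change if either dataset moved to a different store; only the mount table would.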
Side-Car Model With Alluxio
[Diagram: Storage; Deep Learning Training Platform on Azure Kubernetes Service; three nodes, each with RAM, SSD, GPU, and CPU]
Data Preloaded in Cluster
ImageNet
PyTorch resnet50
1.3M images, 50 to 200 KB each
BERT-Large Training Job
100 partitions, 1.8 GB each
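As a rough sizing aid, the figures above imply the following total data volumes. This is back-of-the-envelope arithmetic under the assumption of decimal units (1 GB = 1000 MB = 1,000,000 KB); the variable names are illustrative.

```python
# Rough dataset sizes implied by the per-file figures, in decimal units.
KB_PER_GB = 1_000_000
MB_PER_GB = 1_000

imagenet_images = 1_300_000
imagenet_low_gb = imagenet_images * 50 / KB_PER_GB    # every image at 50 KB
imagenet_high_gb = imagenet_images * 200 / KB_PER_GB  # every image at 200 KB

bert_partitions = 100
bert_partition_mb = 1_800  # 1.8 GB per partition
bert_total_gb = bert_partitions * bert_partition_mb / MB_PER_GB

print(f"ImageNet: {imagenet_low_gb:.0f} to {imagenet_high_gb:.0f} GB")
print(f"BERT-Large: {bert_total_gb:.0f} GB")
```

The two workloads stress storage differently: ImageNet is over a million small reads, while the BERT-Large job is a hundred large sequential ones.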
Checkpoint Save/Load
Single file, 4.7 GB
Load: 24 processes
Save: rank0 process
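The save/load pattern above (one writer, many readers) can be sketched as follows, assuming a pickled dictionary stands in for the 4.7 GB checkpoint and a loop stands in for the 24 processes. The function names and the temporary path are hypothetical; a real job would use its framework's checkpoint utilities.

```python
import os
import pickle
import tempfile

WORLD_SIZE = 24  # number of processes that load the checkpoint


def save_checkpoint(state, path, rank):
    """Only rank 0 writes, so concurrent ranks never clobber the file."""
    if rank != 0:
        return False
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    # Rename after the write completes so readers never see a partial file.
    os.replace(tmp_path, path)
    return True


def load_checkpoint(path):
    """Every rank reads the same single file."""
    with open(path, "rb") as f:
        return pickle.load(f)


ckpt_path = os.path.join(tempfile.mkdtemp(), "model.ckpt")
state = {"step": 100, "weights": [0.1, 0.2, 0.3]}

# Simulate all ranks calling save; only rank 0 actually writes.
wrote = [save_checkpoint(state, ckpt_path, rank) for rank in range(WORLD_SIZE)]
# Simulate all ranks loading the same file.
loaded = [load_checkpoint(ckpt_path) for _ in range(WORLD_SIZE)]
```

The load side is the read-heavy half of this scenario: the same file is fetched once per process, so a shared cache close to the workers can serve repeated reads.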
Looking Forward
• Build pilot experience for customers
• Continue improving performance
© Copyright Microsoft Corporation. All rights reserved.
Thank you!



