Spark summit 2019 infrastructure for deep learning in apache spark 0425
Kaarthik Sivashanmugam, Wee Hyong Tok
Microsoft
Infrastructure for Deep Learning
in Apache Spark
#UnifiedAnalytics #SparkAISummit
Agenda
• Evolution of data infrastructure
• ML workflow: Data prep & DNN training
• Intro to deep learning and computing needs
• Distributed deep learning and challenges
• Unified platform using Spark
– Infra considerations, challenges
• ML Pipelines
Organization’s Data
• Video, Feeds, Call Logs, Data, Web logs, Products, Images, ……
• Database / Data Warehouse
Machine Learning
Typical E2E Process
…
Prepare Experiment Deploy
Orchestrate
+ Machine Learning and Deep Learning workloads
How long does it take to train ResNet-50 on ImageNet?
• Before 2017: 14 days on an NVIDIA M40 GPU
Training ResNet-50 on ImageNet
• Apr 2017: 1 hour on Tesla P100 x 256 (Facebook, Caffe2)
• Sept 2017: 31 mins on 1,600 CPUs (UC Berkeley, TACC, UC Davis; TensorFlow)
• Nov 2017: 15 mins on Tesla P100 x 1,024 (Preferred Networks, ChainerMN)
• July 2018: 6.6 mins on Tesla P40 x 2,048 (Tencent, TensorFlow)
• Nov 2018: 2.0 mins on Tesla V100 x 3,456 (Sony, Neural Network Library (NNL))
• Apr 2019: 1.2 mins on Tesla V100 x 2,048 (Fujitsu, MXNet)
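One way to read these numbers is as speedups over the 14-day single-GPU baseline quoted on the previous slide. The sketch below only redoes that arithmetic. The pairing of dates, times and device counts is an assumption read off the flattened slide, and the function names are hypothetical. Device-minutes are not comparable across rows because the hardware differs.

```python
# Wall-clock times for training ResNet-50 on ImageNet, as quoted on the slides.
# The (label, minutes, devices) pairing is an assumption reconstructed from the slide layout.
BASELINE_MINUTES = 14 * 24 * 60  # "14 days" on one NVIDIA M40 GPU

RESULTS = [
    ("Apr 2017, Facebook", 60.0, 256),
    ("Sept 2017, UC Berkeley/TACC/UC Davis", 31.0, 1600),
    ("Nov 2017, Preferred Networks", 15.0, 1024),
    ("July 2018, Tencent", 6.6, 2048),
    ("Nov 2018, Sony", 2.0, 3456),
    ("Apr 2019, Fujitsu", 1.2, 2048),
]


def speedup(minutes):
    """Speedup in wall-clock time relative to the 14-day baseline."""
    return BASELINE_MINUTES / minutes


def device_minutes(minutes, devices):
    """Total device time spent: wall-clock minutes multiplied by device count."""
    return minutes * devices


for label, minutes, devices in RESULTS:
    print(f"{label}: {speedup(minutes):,.0f}x faster, "
          f"{device_minutes(minutes, devices):,.0f} device-minutes")
```

The wall-clock time drops by several orders of magnitude, but only by adding thousands of devices, which is why the interconnect and scheduling questions later in the deck matter.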
Considerations for Deep Learning @ Scale
• CPU vs. GPU
• Single vs. multi-GPU
• MPI vs. non-MPI
• Infiniband vs. Ethernet
Credits: Mathew Salvaris
https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-deployment-of-deep-learning-models/
“Things” you need to deal with when training
machine learning/deep learning models
Gather results
Secure Access
Scale resources
Schedule jobs
Dependencies and Containers
Provision VM clusters
Distribute data
Handling failures
Machine Learning
Typical E2E Process
…
Prepare Experiment Deploy
Orchestrate
Machine Learning and Deep Learning
Top figure source;
Bottom figure from NVIDIA
ML
DL
Lots of ML Frameworks ….
TensorFlow PyTorch
Scikit-Learn
MXNet Chainer
Keras
Design Choices for Big Data and Machine Learning/Deep Learning
• Laptop
• Spark + separate infrastructure for ML/DL training/inference
• Cloud
• Spark
Execution Models for Spark and Deep Learning
Spark (Task 1, Task 2, Task 3)
• Independent Tasks
• Embarrassingly Parallel and Massively Scalable
Distributed Learning (Data Parallelism, Model Parallelism)
• Non-Independent Tasks
• Some parallel processing
• Optimizing communication between nodes
Credits – Reynold Xin, Project Hydrogen – State of Art Deep Learning on Apache Spark
Execution Models for Spark and Deep Learning
Spark (Task 1, Task 2, Task 3)
• Independent Tasks
• Embarrassingly Parallel and Massively Scalable
• Re-run crashed task
Distributed Learning (Task 1, Task 2, Task 3)
• Non-Independent Tasks
• Some parallel processing
• Optimizing communication between nodes
• Re-run all tasks
Credits – Reynold Xin, Project Hydrogen – State of Art Deep Learning on Apache Spark
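The difference in failure handling can be sketched in a few lines. This is a toy model of the two recovery policies named on the slide, not Spark's scheduler code, and `tasks_to_rerun` is a hypothetical helper.

```python
def tasks_to_rerun(num_tasks, failed_task, independent):
    """Return the task indices that must run again after one task crashes.

    independent=True  models Spark's usual stage: tasks share no state,
                      so only the crashed task is retried.
    independent=False models distributed training: tasks exchange data with
                      each other, so the whole group restarts together.
    """
    if not 0 <= failed_task < num_tasks:
        raise ValueError("failed_task must be a valid task index")
    if independent:
        return [failed_task]
    return list(range(num_tasks))


# Three tasks, task 1 crashes.
print(tasks_to_rerun(3, 1, independent=True))
print(tasks_to_rerun(3, 1, independent=False))
```

With independent tasks, the cost of a failure stays constant as the job grows. With coupled tasks, it grows with the number of workers.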
Spark + ML/DL
www.aka.ms/spark
Sparkflow
TensorFlowOnSpark
Project Hydrogen
HorovodRunner
Microsoft Machine Learning for Apache Spark v0.16
Microsoft’s Open Source Contributions to Apache Spark
www.aka.ms/spark Azure/mmlspark
• Cognitive Services
• Spark Serving
• Model Interpretability
• LightGBM Gradient Boosting
• Deep Networks with CNTK
• HTTP on Spark
Demo – Azure Databricks and Deep Learning
Demo – Distributed Deep Learning using TensorFlow with HorovodRunner
Physics of Machine Learning and Deep Learning
What do you need for training / distributed training?
• CPU
• GPU
• Network
• Storage
• Deep Learning Framework
• Memory
GPU Device Interconnect
• NVLink
• GPUDirect P2P
• GPUDirect RDMA
Interconnect topology sample
Credits: CUDA-MPI Blog (https://bit.ly/2KnmN58)
From CUDA to NCCL1 to NCCL2
• Multi-Core CPU
• GPU: CUDA
• Multi-GPU: NCCL 1
• Multi-GPU, Multi-Node: NCCL 2
Multi-GPU Communication Library
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
NCCL 2.x (multi-node)
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
NCCL 2.x (multi-node)
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
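Libraries such as NCCL provide collective operations like all-reduce, which data-parallel training uses to sum gradients across workers. The sketch below simulates a ring all-reduce, one common algorithm for that collective, in plain Python. It illustrates the communication pattern only, not NCCL's implementation, and it assumes the buffer length divides evenly by the number of workers.

```python
def ring_allreduce(buffers):
    """Sum equal-length buffers across workers arranged in a ring.

    Returns one list per worker; every list holds the element-wise sum.
    Simplifying assumption: buffer length is a multiple of the worker count.
    """
    n = len(buffers)
    if n == 0:
        raise ValueError("need at least one worker")
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("buffers must share a length divisible by worker count")

    data = [list(b) for b in buffers]  # each worker's private copy
    chunk = length // n

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(pick_chunk, combine):
        # Snapshot what every worker sends, then deliver to the right-hand
        # neighbour, so all sends within one step happen "at the same time".
        outgoing = []
        for rank in range(n):
            c = pick_chunk(rank)
            outgoing.append((c, [data[rank][i] for i in span(c)]))
        for rank, (c, values) in enumerate(outgoing):
            dst = (rank + 1) % n
            for offset, i in enumerate(span(c)):
                data[dst][i] = combine(data[dst][i], values[offset])

    # Phase 1, reduce-scatter: after n-1 steps worker r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda old, new: old + new)

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda old, new: new)

    return data


print(ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
```

Each worker sends and receives only one chunk per step, so the traffic per link stays roughly constant as workers are added. This is why the topology of the interconnect matters for this pattern.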
Spark & GPU
• Options for using GPUs with Spark:
1. Native support (cluster manager, GPU tasks): SPARK-24615
2. Use cores/memory as a proxy for GPU resources and allow GPU-enabled code execution
3. Code implementation/generation for GPU offload
• Considerations
– Flexibility
– Data management
– Multi-GPU execution
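Options 1 and 2 can be made concrete with a small configuration sketch. The property names under option 1 are the ones Spark 3.0 and later document for accelerator-aware scheduling (the work tracked in SPARK-24615). The values and the discovery script path are hypothetical, and both helper functions are illustrative rather than part of Spark.

```python
# Option 1: native GPU scheduling (Spark 3.0+ property names).
# Values and the script path are hypothetical examples.
NATIVE_GPU_CONF = {
    "spark.executor.resource.gpu.amount": "2",
    "spark.task.resource.gpu.amount": "1",
    "spark.executor.resource.gpu.discoveryScript": "/opt/spark/getGpusResources.sh",
}


def task_cpus_as_gpu_proxy(executor_cores, gpus_per_executor):
    """Option 2: use CPU cores as a stand-in for GPUs.

    An executor runs executor_cores // spark.task.cpus tasks at once, so this
    returns the smallest spark.task.cpus that keeps that count at or below
    the number of GPUs on the executor.
    """
    if executor_cores < 1 or gpus_per_executor < 1:
        raise ValueError("cores and GPUs per executor must be positive")
    return executor_cores // (gpus_per_executor + 1) + 1


def to_submit_args(conf):
    """Render a config dict as spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(to_submit_args(NATIVE_GPU_CONF))
print(task_cpus_as_gpu_proxy(executor_cores=8, gpus_per_executor=2))
```

With the proxy approach, Spark only limits how many tasks run concurrently. The task code still has to pick a GPU itself, which is the flexibility and multi-GPU trade-off listed under Considerations.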
Infrastructure Considerations
• Data format, storage and reuse
– Co-locate Data Engineering storage infrastructure (cluster-local)
– DL Framework support for HDFS (reading from HDFS does not mean data-locality-aware computation)
– Sharing data between Spark and Deep Learning (HDFS, Spark-TF connector, Parquet/Petastorm)
• Job execution
– Gang scheduling – Refer to SPARK-24374
– Support for GPU (and other accelerators) – Refer to SPARK-24615
– Cluster sharing with other types of jobs (CPU-only cluster vs. CPU+GPU cluster)
– Quota management
– Support for Docker containers
– MPI vs. non-MPI
– Different GPU generations
• Node, GPU connectivity
– Infiniband, RDMA
– GPU Interconnect options
– Interconnect-aware scheduling, minimize distribution, repacking
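Gang scheduling and "minimize distribution" can be illustrated together. The sketch below is a simple greedy heuristic, not the scheduler proposed in SPARK-24374: it either places every GPU a job asks for or places none, and it prefers the nodes with the most free GPUs so the job spans as few nodes as possible. The function and node names are hypothetical.

```python
def gang_place(free_gpus, gpus_needed):
    """Place a job all-or-nothing on as few nodes as possible.

    free_gpus:   dict mapping node name to number of free GPUs.
    gpus_needed: total GPUs the job requires to start.
    Returns a dict of node name to GPUs taken, or None if the job cannot
    start as a whole (no partial placement is ever returned).
    """
    if gpus_needed <= 0:
        raise ValueError("gpus_needed must be positive")
    if sum(free_gpus.values()) < gpus_needed:
        return None  # gang semantics: start every task or none

    placement = {}
    remaining = gpus_needed
    # Fill the emptiest-of-work (most free GPUs) nodes first; ties break by name.
    for node, free in sorted(free_gpus.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        if free == 0:
            continue
        take = min(free, remaining)
        placement[node] = take
        remaining -= take
    return placement


print(gang_place({"node-a": 4, "node-b": 2, "node-c": 4}, 6))
print(gang_place({"node-a": 1, "node-b": 1}, 3))
```

Keeping a job on fewer nodes means more of its traffic stays on the intra-node GPU interconnect instead of crossing the network.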
ML Pipelines
• Using machine learning pipelines, data scientists, data engineers, and IT professionals can collaborate on different steps/phases
• Enable use of the best tech for different phases in the ML/DL workflow
Demo – Azure ML Pipelines & Databricks
Physics of Machine Learning and Deep Learning
What do you need for training / distributed training?
• CPU
• GPU
• Network
• Storage
• Deep Learning Framework
• Memory