Bridging the Gap Between Big Data and Deep Learning with Apache Spark 2.4
Robert Hryniewicz
@robhryniewicz
Two of the most significant communities: Spark & Machine Learning (ML)
Spark:
• DataFrame-based APIs
• More than 50 Data Sources
• Data/ML Pipeline APIs
• Structured Streaming and Continuous Processing
• Pandas UDF
• Python/Java/R interfaces
• Spark ML
Machine Learning (ML):
• TensorFlow/PyTorch
• tf.data, tf.transform
• Horovod
• NumPy/SciPy/Pandas/scikit-learn/XGBoost
• torchvision/torchtext
What do we need?
• Build a data/ML pipeline that fetches training samples from HDFS/Hive and trains a Deep Learning (DL) model in parallel
• Apply a trained DL model to batch or streaming datasets and get the predicted results
Different execution models
• Spark
• Tasks are independent of each other
• A subset of tasks can start when there are not enough resources to run all tasks at once
• Distributed Deep Learning (DL) model training
• Tasks are coordinated by a master role
• All tasks in the same job must be started simultaneously
• All tasks communicate and synchronize with each other
Execution Models
[Diagram: three independent Spark tasks vs. three coordinated DL training tasks]
Spark
• tasks are independent of each other
• massively parallel and scalable
Distributed DL model training
• complete coordination among tasks
• optimized for communication
Incompatible Execution Models
[Diagram: three independent Spark tasks vs. three coordinated DL training tasks]
Spark
• tasks are independent of each other
• massively parallel and scalable
• if one task crashes, rerun only that task
Distributed DL model training
• complete coordination among tasks
• optimized for communication
• if one task crashes, all tasks must be rerun
Apache Spark 2.4: Barrier Execution Mode
• Barrier scheduling: gang scheduling on top of the existing MapReduce
execution mode
• A distributed DL job can run as a Spark job inside data/ML pipelines
• It starts all tasks together
• It provides sufficient info and tooling to run a hybrid distributed job
• It cancels and restarts all tasks in case of failures
Barrier Execution Mode
• RDD.barrier() tells Spark to launch the tasks together
• context.barrier() places a global barrier and waits until all tasks in this stage
hit this barrier
rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()  // get the barrier task context
  …                                       // do something in each task
  context.barrier()                       // wait until all tasks in this stage reach this barrier
  iter
}
Barrier Execution Mode
• RDD.barrier() tells Spark to launch the tasks together
• context.barrier() places a global barrier and waits until all tasks in this stage
hit this barrier
• context.getTaskInfos() returns info about all tasks in this stage
rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()  // get the barrier task context
  if (context.partitionId() == 0) {
    val addresses = context.getTaskInfos().map(_.address)
    …  // run the distributed DL training script, which accepts "addresses" as an argument
  }
  context.barrier()  // wait until all tasks in this stage reach this barrier
  iter
}
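For PySpark users, the same pattern looks roughly as follows. This is a minimal sketch, assuming a local session with enough slots for both partitions; the barrier APIs shown (BarrierTaskContext, getTaskInfos, barrier) are the PySpark 2.4 ones, while the training launcher itself is left as a hypothetical helper.

from pyspark.sql import SparkSession
from pyspark import BarrierTaskContext

def train_fn(iterator):
    context = BarrierTaskContext.get()                              # barrier task context for this task
    addresses = [info.address for info in context.getTaskInfos()]   # host:port of every task in the stage
    if context.partitionId() == 0:
        pass  # launch_training(addresses) -- hypothetical helper that starts the distributed DL script
    context.barrier()                                               # wait until every task reaches this barrier
    return iter([addresses])

spark = SparkSession.builder.master("local[2]").appName("barrier-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), 2)
print(rdd.barrier().mapPartitions(train_fn).collect())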
The data/DL pipeline - Load dataset
dataset = (spark.read
    .format("image")
    .option("dropInvalid", True)
    .load("/data/dl/images"))
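As a quick sanity check, the built-in image data source yields a single struct column; a hedged usage example (field names as documented for Spark 2.4's image source, the exact printout may differ slightly):

dataset.printSchema()
# image: struct<origin: string, height: int, width: int, nChannels: int, mode: int, data: binary>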
The data/DL pipeline - running in barrier execution mode
from pyspark import BarrierTaskContext

dataset = dataset.rdd.barrier().mapPartitions(runDistTrain)

def runDistTrain(batches):
    context = BarrierTaskContext.get()
    partitionId = context.partitionId()
    if partitionId == 0:
        ...  # run the distributed DL training script at the master task and save the model to HDFS
    else:
        ...  # usually do nothing on the other workers
    context.barrier()
    return iter([])  # mapPartitions expects an iterator back
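Note that mapPartitions is lazy; the gang-scheduled barrier stage only actually runs once an action is triggered, for example:

dataset.count()  # or collect(); forces the barrier training stage to execute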
Unifying execution models (see the sketch below)
• Stage 1: data prep (massively parallel)
• Stage 2: distributed DL training (gang scheduled)
• Stage 3: data sink (massively parallel)
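A hedged end-to-end sketch of such a three-stage pipeline, reusing runDistTrain from the previous slide; scoreImage and the output path are hypothetical placeholders, not part of any Spark API.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("data-dl-pipeline").getOrCreate()

# Stage 1: data prep (massively parallel)
images = (spark.read.format("image")
          .option("dropInvalid", True)
          .load("/data/dl/images"))

# Stage 2: distributed DL training (gang scheduled via barrier mode)
images.rdd.barrier().mapPartitions(runDistTrain).collect()   # runDistTrain as defined on the previous slide

# Stage 3: data sink (massively parallel): score with the trained model and write results
scoreImage = udf(lambda img: "placeholder label", StringType())   # hypothetical scoring UDF
predictions = images.withColumn("label", scoreImage(col("image")))
predictions.write.mode("overwrite").parquet("/data/dl/predictions")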
In the future (Spark 3.0+):
• Optimized Data Exchange (SPARK-24579)
• Accelerator Aware Scheduling (SPARK-24615)
Optimized Data Exchange
• Data exchange between Spark Datasets/DataFrames and DL frameworks (TensorFlow/PyTorch)
• The integration should be simple and efficient
Vectorized computation
DataFrame.toArrowRDD(maxRecordsPerBatch=4096)  # proposed API (SPARK-24579)

def runDistTrain(batches):
    context = BarrierTaskContext.get()
    ...  # start distributed training in the same (or a separate) process,
         # fetching record batches from the Spark DataFrame directly
    context.barrier()
    return [model]

dataset.toArrowRDD.barrier().mapPartitions(runDistTrain).collect()
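Until toArrowRDD lands, Spark 2.4 already exchanges columnar data with Python over Arrow through pandas UDFs. A minimal inference sketch, assuming an active SparkSession, a DataFrame df with a numeric features column, and a hypothetical loadModel helper:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

# Arrow batch size, analogous to maxRecordsPerBatch above
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "4096")

@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def predict(features):
    model = loadModel("/models/dl/latest")            # hypothetical: load the saved DL model
    return pd.Series(model.predict(features.tolist()))

scored = df.withColumn("prediction", predict(df["features"]))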
Accelerator Aware Scheduling
• Goals:
• To utilize accelerators (GPUs, FPGAs) in a heterogeneous cluster
• To utilize multiple accelerators in a multi-task environment
• User workflow (see the sketch below):
• Submit a Spark application and request GPU resources per executor
• Request the number of GPUs to use per task (RDD stage, Pandas UDF)
• In the custom task logic, retrieve the logical indices of the assigned GPUs and use them
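A hedged sketch of that workflow, using the resource configuration names and TaskContext.resources() API that eventually shipped with SPARK-24615 in Spark 3.0 (not available in 2.4); rdd stands for any RDD carrying training data.

# Submit with GPU resources requested per executor and per task, e.g.:
#   spark-submit \
#     --conf spark.executor.resource.gpu.amount=4 \
#     --conf spark.task.resource.gpu.amount=2 \
#     --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpus.sh \
#     train.py

from pyspark import TaskContext

def train_partition(iterator):
    # logical addresses (e.g. ["0", "1"]) of the GPUs assigned to this task
    gpus = TaskContext.get().resources()["gpu"].addresses
    # ... pin the DL framework to these devices and run training ...
    return iter([gpus])

rdd.mapPartitions(train_partition).collect()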
Example: request accelerators
With accelerator awareness, users could specify accelerator constraints or hints (proposed API):
rdd.accelerated
  .by("/gpu/p100")
  .numPerTask(2)
  .required
THANK YOU
Robert Hryniewicz
@robhryniewicz