Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark
Xingbo Jiang, Databricks
#UnifiedDataAnalytics #SparkAISummit
About Me
Xingbo Jiang (GitHub: jiangxb1987)
• Software Engineer at Databricks
• Committer of Apache Spark
About Project Hydrogen
Announced last June, Project Hydrogen is a major Spark initiative to unify state-of-the-art AI and big data workloads. Its three components:
• Barrier Execution Mode
• Optimized Data Exchange
• Accelerator-Aware Scheduling
Why Spark + AI?
Apache Spark: The First Unified Analytics Engine
• Runtime: Delta, Spark Core Engine
• Big Data Processing: ETL + SQL + Streaming
• Machine Learning: MLlib + SparkR
AI is re-shaping the world
Huge disruptive innovations are affecting most enterprises on the planet:
• Healthcare and Genomics
• Fraud Prevention
• Internet of Things
• Digital Personalization
• and many more...
Better AI needs more data
The cross...
[Figure: labels recovered from the diagram, grouped by side]
Spark: Map/Reduce, RDD, Project Tungsten, DataFrame-based APIs, 50+ Data Sources, Python/Java/R interfaces, Structured Streaming, Continuous Processing, ML Pipelines API, Pandas UDF, CaffeOnSpark, TensorFlowOnSpark, TensorFrames
AI/ML: scikit-learn, pandas/numpy/scipy, LIBLINEAR, R, glmnet, xgboost, GraphLab, Caffe/PyTorch/MXNet, TensorFlow, Keras, Distributed TensorFlow, Horovod, tf.data, tf.transform, TF XLA
Between the two: ??
Why Project Hydrogen?
Two simple stories
• data warehouse -> load -> fit -> model
• data stream -> load -> predict (with model)
Distributed training
data warehouse -> load -> fit -> model
• Required: be able to read from Delta Lake, Parquet, MySQL, Hive, etc. Answer: Apache Spark.
• Required: a distributed GPU cluster for fast training. Answer: Horovod, Distributed TensorFlow, etc.
Two separate data and AI clusters?
load using a Spark cluster -> save data -> fit on a GPU cluster -> model
Required: glue code
Streaming model inference
Kafka -> load -> predict (with model)
Required:
● save to stream sink
● GPU for fast inference
A hybrid Spark and AI cluster?
• Training: load using a Spark cluster w/ GPUs -> fit a model distributedly on the same cluster -> model
• Inference: load using a Spark cluster w/ GPUs -> predict w/ GPUs as a Spark task (with model)
Unfortunately, it doesn’t work out of the box.
See a previous demo.
Project Hydrogen to fill the major gaps
• Barrier Execution Mode
• Optimized Data Exchange
• Accelerator-Aware Scheduling
Updates from Project Hydrogen
● Available features
● Future improvements
● How to use them
Story #1: Distributed training
load using a Spark cluster w/ GPUs -> fit a model distributedly on the same cluster -> model
Project Hydrogen: Barrier execution mode
Different execution models
Spark (MapReduce): Task 1, Task 2, Task 3
• Tasks are independent of each other
• Embarrassingly parallel & massively scalable
Distributed training: Task 1, Task 2, Task 3
• Complete coordination among tasks
• Optimized for communication
Barrier execution mode
• All tasks start together
• Sufficient info to run a hybrid distributed job
• Cancel and restart all tasks on failure
JIRA: SPARK-24374 (Spark 2.4)
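The "cancel and restart all tasks on failure" rule can be pictured with a small, Spark-free simulation. This is only a sketch of the retry semantics, not Spark's scheduler: the function names, the sequential execution, and the use of RuntimeError to signal a task failure are all assumptions made for illustration.

```python
def run_with_gang_restart(tasks, max_attempts=3):
    """Run every task; if any task fails, discard all results and rerun them all.

    Each task is a callable that receives the attempt number (hypothetical API).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # A single failure aborts the whole attempt, including finished tasks.
            return attempt, [task(attempt) for task in tasks]
        except RuntimeError:
            continue
    raise RuntimeError(f"stage failed after {max_attempts} attempts")


def demo():
    """One healthy task and one task that fails on its first attempt."""
    calls = []

    def healthy(attempt):
        calls.append(("healthy", attempt))
        return "done"

    def flaky(attempt):
        calls.append(("flaky", attempt))
        if attempt == 1:
            raise RuntimeError("simulated executor loss")
        return "done"

    attempt, results = run_with_gang_restart([healthy, flaky])
    return attempt, results, calls
```

In the demo the healthy task runs twice even though it never failed, which is the cost a barrier stage accepts in exchange for keeping all tasks in lockstep.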
API: RDD.barrier()
RDD.barrier() tells Spark to launch the tasks together.
rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()
  ...
}
API: context.barrier()
context.barrier() places a global barrier and waits until all tasks in
this stage hit this barrier.
val context = BarrierTaskContext.get()
… // preparation
context.barrier()
API: context.getTaskInfos()
context.getTaskInfos() returns info about all tasks in this stage.
if (context.partitionId == 0) {
  val addrs = context.getTaskInfos().map(_.address)
  ... // start a hybrid training job, e.g., via MPI
}
context.barrier() // wait until training finishes
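The coordination pattern in the last three slides can be tried without a cluster. The sketch below is a thread-based stand-in, not the Spark API: `threading.Barrier` plays the role of `context.barrier()`, a shared list plays the role of `context.getTaskInfos()`, and the host names and ports are made up. Unlike Spark, the simulation needs an extra barrier before task 0 reads the addresses, because here the tasks publish their own addresses.

```python
import threading


def simulate_barrier_stage(num_tasks):
    """Toy barrier stage: all tasks start together and task 0 sees every address."""
    barrier = threading.Barrier(num_tasks)
    addresses = [None] * num_tasks      # stand-in for getTaskInfos()
    seen_by_task0 = []

    def task(partition_id):
        # Hypothetical address; in Spark this comes from the task info.
        addresses[partition_id] = f"host-{partition_id}:{50000 + partition_id}"
        barrier.wait()                  # like context.barrier(): wait for all tasks
        if partition_id == 0:
            # Task 0 would launch the hybrid job (e.g., via MPI) with these addresses.
            seen_by_task0.extend(addresses)
        barrier.wait()                  # wait until "training" finishes

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_by_task0
```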
Barrier mode integration
Horovod (an LF AI hosted project)
● Little modification to single-node code
● High-performance I/O via MPI and NCCL
● Same convergence theory
● Limitations
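Hydrogen integration with Horovod
● HorovodRunner has been released with Databricks Runtime 5.0 ML
● Runs Horovod under barrier execution mode
● Hides details from users
import horovod.tensorflow as hvd    # or horovod.torch
from sparkdl import HorovodRunner   # available on Databricks Runtime ML

def train_hvd():
    hvd.init()
    ...  # train using Horovod

HorovodRunner(np=2).run(train_hvd)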
Implementation of HorovodRunner
Integrating Horovod with barrier mode is straightforward:
● Pickle and broadcast the train function.
  ○ Inspect code and warn users about potential issues.
● Launch a Spark job in barrier execution mode.
● In the first executor, use worker addresses to launch the Horovod MPI job.
● Terminate Horovod if the Spark job is cancelled.
  ○ Hint: PR_SET_PDEATHSIG
Limitation:
● Tailored for Databricks Runtime ML
  ○ Horovod built with TensorFlow/PyTorch, SSH, OpenMPI, NCCL, etc.
  ○ Spark 2.4, GPU cluster configuration, etc.
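The PR_SET_PDEATHSIG hint refers to the Linux prctl(2) option that asks the kernel to signal a child when its parent dies. The following is a minimal sketch of that idea, not HorovodRunner's actual code: the function names are hypothetical, it assumes Linux with glibc available as `libc.so.6`, and on other platforms it simply launches the child without the guarantee.

```python
import ctypes
import signal
import subprocess
import sys

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>


def _die_with_parent():
    # Runs in the child between fork() and exec(); Linux only.
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM)) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def launch_bound_to_parent(cmd):
    """Start cmd so that it receives SIGTERM if the launching process dies."""
    preexec = _die_with_parent if sys.platform.startswith("linux") else None
    return subprocess.Popen(cmd, preexec_fn=preexec)
```

A launcher built this way does not need its own cleanup path for the case where the Spark task is killed: the kernel delivers the signal to the child.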
Project Hydrogen: Accelerator-aware scheduling
Accelerator-aware scheduling
JIRA: SPARK-24615 (ETA: Spark 3.0)
[Diagram: a Driver and a Cluster Manager, with Executor 0 (GPU:0, GPU:1; Task 0, Task 1) and Executor 1 (GPU:0, GPU:1; Task 2, Task 3), plus Task 4]
Why does Spark need accelerator awareness?
● Some cluster managers already support accelerators (GPU/FPGA/etc.)
● Spark still needs to be aware of accelerators. Example:
[Diagram: Executor 0 (GPU:0, GPU:1) runs Task 0 and Task 1; Executor 1 (GPU:0, GPU:1) runs Task 2 and Task 3; Task 4: ?]
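The example in the diagram can be worked through with a toy scheduler. This is an illustrative sketch of address bookkeeping only, not Spark's scheduler: the `Executor` class, the `schedule` function, and the first-fit policy are assumptions. It shows why a fifth task must wait when four GPUs are already assigned.

```python
from collections import deque


class Executor:
    """Toy executor that tracks which GPU addresses are still free."""

    def __init__(self, name, gpu_addresses):
        self.name = name
        self.free = deque(gpu_addresses)


def schedule(tasks, executors, gpus_per_task=1):
    """Give each task distinct GPU addresses; tasks that do not fit stay pending."""
    assignments, pending = {}, []
    for task in tasks:
        # First executor with enough free GPUs wins (hypothetical policy).
        target = next((e for e in executors if len(e.free) >= gpus_per_task), None)
        if target is None:
            pending.append(task)
            continue
        addrs = [target.free.popleft() for _ in range(gpus_per_task)]
        assignments[task] = (target.name, addrs)
    return assignments, pending
```

Without this bookkeeping, a scheduler that only counts CPU cores would happily start the fifth task and let it contend for a GPU that is already in use.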
Workarounds (a.k.a. hacks)
● Only allow one Spark task on each node
  ○ Pros: avoids accelerator resource contention
  ○ Cons: wastes resources, poor performance
● Running tasks choose resources collaboratively (e.g. shared locks)
Proposed workflow
Actors: User, Spark, Cluster Manager
0. Auto-discover resources.
1. Submit an application with resource requests.
2. Pass resource requests to cluster manager.
3. Allocate executors with resource isolation.
4. Register executors.
5. Submit a Spark job.
6. Schedule tasks on available executors.
7. Dynamic allocation.
8. Retrieve assigned resources and use them in tasks.
9. Monitor and recover failed executors.
Discover and request accelerators
Admin can specify a script to auto-discover accelerators (SPARK-27024)
● spark.driver.resource.${resourceName}.discoveryScript
● spark.executor.resource.${resourceName}.discoveryScript
● e.g., `nvidia-smi --query-gpu=index ...`
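As a rough illustration of what a discovery script produces, the sketch below converts a list of GPU indices into a JSON object with a resource name and an array of addresses. The helper name is hypothetical, and the input is assumed to be one index per line (for example from `nvidia-smi --query-gpu=index --format=csv,noheader`); check the Spark documentation for the exact output contract of your version.

```python
import json


def format_discovery_output(resource_name, index_lines):
    """Build the JSON a discovery script would print to stdout.

    index_lines: text with one accelerator index per line (assumed format).
    """
    addresses = [line.strip() for line in index_lines.splitlines() if line.strip()]
    return json.dumps({"name": resource_name, "addresses": addresses})
```

For two GPUs the result parses back to a name of "gpu" and addresses "0" and "1", which are the values tasks later see as their assigned addresses.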
User can request accelerators at application level (SPARK-27366)
● spark.executor.resource.${resourceName}.amount
● spark.driver.resource.${resourceName}.amount
● spark.task.resource.${resourceName}.amount
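To see how the amounts interact, here is a small worked example with `${resourceName}` set to `gpu`. The values and the script path are made up, and the helper assumes whole-number amounts; it only captures the GPU limit, so the real concurrency is also bounded by CPU cores.

```python
# Hypothetical application-level settings (values chosen for illustration).
conf = {
    "spark.executor.resource.gpu.amount": "2",
    "spark.task.resource.gpu.amount": "1",
    "spark.executor.resource.gpu.discoveryScript": "/opt/spark/getGpus.sh",  # hypothetical path
}


def max_gpu_tasks_per_executor(conf, resource="gpu"):
    """How many tasks can hold the resource at once on one executor."""
    per_executor = int(conf[f"spark.executor.resource.{resource}.amount"])
    per_task = int(conf[f"spark.task.resource.{resource}.amount"])
    return per_executor // per_task
```

With two GPUs per executor and one GPU per task, at most two GPU tasks run concurrently on each executor.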
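Retrieve assigned accelerators
Users can retrieve assigned accelerators from the task context (SPARK-27366)
from pyspark import TaskContext
import tensorflow as tf

context = TaskContext.get()
# addresses are strings such as "0"; the device-string format below is an assumption
assigned_gpu = context.resources()["gpu"].addresses[0]
with tf.device("/gpu:" + assigned_gpu):
    ...  # training code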
Cluster manager support
● YARN: SPARK-27361
● Kubernetes: SPARK-27362
● Mesos (not started): SPARK-27363
● Standalone: SPARK-27360
Web UI for accelerators
Support general accelerator types
We keep the interfaces general to support accelerator types other than GPU in the future, e.g. FPGA.
● “GPU” is not a hard-coded resource type.
● spark.executor.resource.${resourceName}.discoveryScript
● context.resources() returns a map from resourceName to ResourceInformation (resource name and addresses).
Features beyond Project Hydrogen
● Resource request at task level.
● Fine-grained scheduling within one GPU.
● Affinity and anti-affinity.
● ...
Story #2: Streaming model inference
load using a Spark cluster w/ GPUs -> predict w/ GPUs as a Spark task (with model)
Project Hydrogen: Optimized data exchange
Optimized data exchange
None of the integrations is possible without exchanging data between Spark and AI frameworks, and performance matters.
JIRA: SPARK-24579
Pandas UDF
Pandas UDF, introduced in Spark 2.3, uses Arrow for data exchange and Pandas for vectorized computation.
Pandas UDF for distributed inference
Pandas UDF makes it simple to apply a model to a data stream.
@pandas_udf(...)
def predict(features):
    ...

spark.readStream.format(...).load() \
    .withColumn('prediction', predict(col('features')))
Return StructType from Pandas UDF
We extended scalar Pandas UDFs to support complex (struct) return types, so users can return predicted labels and raw scores together.
JIRA: SPARK-23836 (Spark 3.0)
@pandas_udf(...)
def predict(features):
    # ...
    return pd.DataFrame({'labels': labels, 'scores': scores})
Data pipelining
Without pipelining:
     CPU              GPU
t1   fetch batch #1
t2                    process batch #1
t3   fetch batch #2
t4                    process batch #2
t5   fetch batch #3
t6                    process batch #3
With pipelining:
     CPU              GPU
t1   fetch batch #1
t2   fetch batch #2   process batch #1
t3   fetch batch #3   process batch #2
t4                    process batch #3
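The two timelines can be checked with a little arithmetic. The sketch below assumes every fetch takes the same time and every process step takes the same time, which is a simplification; the function names are made up for this example.

```python
def sequential_time(n_batches, fetch, process):
    """Fetch and process strictly alternate, so the costs add up."""
    return n_batches * (fetch + process)


def pipelined_time(n_batches, fetch, process):
    """Fetch of batch k+1 overlaps with processing of batch k.

    Only the first fetch and the last process step cannot overlap;
    in between, the slower of the two stages sets the pace.
    """
    return fetch + (n_batches - 1) * max(fetch, process) + process
```

With three batches and unit costs this gives 6 time steps without pipelining and 4 with it, matching the tables. When fetch and process cost the same, the ratio 2n / (n + 1) approaches 2 as the number of batches n grows.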
Pandas UDF prefetch
To improve throughput, we prefetch Arrow record batches into a queue while executing the Pandas UDF on the current batch.
● Enabled by default since Databricks Runtime 5.2.
● Up to 2x speedup for workloads that balance I/O and compute.
● Observed a 1.5x speedup in a real workload.
JIRA: SPARK-27569 (ETA: Spark 3.0)
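The prefetching idea can be sketched with a background thread and a bounded queue. This is a generic illustration in plain Python, not the Spark implementation: the `prefetch` name and the `depth` parameter are assumptions, and error propagation from the producer thread is left out for brevity.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def prefetch(batches, depth=1):
    """Yield items from `batches` while a background thread fetches ahead.

    Up to `depth` items are buffered, so the consumer can work on the
    current batch while the next one is being fetched.
    """
    buffer = queue.Queue(maxsize=depth)

    def producer():
        for batch in batches:
            buffer.put(batch)   # blocks when the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item
```

Wrapping an input iterator as `prefetch(read_batches())` keeps the order of batches unchanged; only the timing of the fetches moves earlier. Here `read_batches` stands for whatever produces the batches.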
Per-batch initialization overhead
A new Pandas UDF interface that loads the model only once and reuses it on an iterator of batches.
JIRA: SPARK-26412 (Spark 3.0)
@pandas_udf(...)
def predict(batches):
    model = ...  # load model once
    for batch in batches:
        yield model.predict(batch)
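The saving from the iterator interface can be shown without Spark by counting how often a model is loaded. In this sketch the "model" is a stand-in function that doubles its inputs, and `CountingLoader`, `predict_per_batch`, and `predict_iterator` are hypothetical names used only for the comparison.

```python
class CountingLoader:
    """Pretend model loader that records how many times it was called."""

    def __init__(self):
        self.loads = 0

    def __call__(self):
        self.loads += 1
        return lambda batch: [2 * x for x in batch]  # stand-in "model"


def predict_per_batch(load_model, batches):
    """Scalar-style UDF: the model is loaded again for every batch."""
    return [load_model()(batch) for batch in batches]


def predict_iterator(load_model, batches):
    """Iterator-style UDF: the model is loaded once and reused."""
    model = load_model()
    for batch in batches:
        yield model(batch)
```

Both variants return the same predictions; for three batches the first loads the model three times and the second loads it once.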
Acknowledgement
● Many ideas in Project Hydrogen are based on previous community work: TensorFrames, BigDL, Apache Arrow, Pandas UDF, Spark GPU support, MPI, etc.
● We would like to thank the many Spark committers and contributors who helped with the project proposal, design, and implementation.
Acknowledgement
● Alex Sergeev
● Andy Feng
● Bryan Cutler
● Felix Cheung
● Hyukjin Kwon
● Imran Rashid
● Jason Lowe
● Jerry Shao
● Li Jin
● Madhukar Korupolu
● Mark Hamstra
● Robert Evans
● Sean Owen
● Shane Knapp
● Takuya Ueshin
● Thomas Graves
● Wenchen Fan
● Xiangrui Meng
● Xiao Li
● Yi Wu
● Yinan Li
● Yu Jiang
● … and many more!
