Bridging the Gap Between Big Data and Deep Learning with Apache Spark 2.4
Robert Hryniewicz
@robhryniewicz
Two of the most significant communities: Spark & Machine Learning (ML)
Spark:
• DataFrame-based APIs
• More than 50 Data Sources
• Data/ML Pipeline APIs
• Structured Streaming and Continuous Processing
• Pandas UDF
• Python/Java/R interfaces
• Spark ML
Machine Learning (ML):
• TensorFlow/PyTorch
• tf.data, tf.transform
• Horovod
• NumPy/SciPy/Pandas/scikit-learn/XGBoost
• torchvision/torchtext
What do we need?
• Build a data/ML pipeline that fetches training samples from HDFS/Hive and trains a Deep Learning (DL) model in parallel
• Apply a trained DL model to batch or streaming datasets and get the predicted results
Different execution models
• Spark
• Tasks are independent of each other
• A subset of tasks can start when there are not enough resources to run all tasks at once
• Distributed Deep Learning (DL) model training
• Tasks are coordinated by a master role
• All tasks in the same job must be started simultaneously
• All tasks communicate and synchronize with each other
Execution Models
[Diagram: three independent Spark tasks vs. three coordinated DL training tasks]
Spark
• tasks are independent of each other
• massively parallel and scalable
Distributed DL model training
• complete coordination among tasks
• optimized for communication
Incompatible Execution Models
[Diagram: three independent Spark tasks vs. three coordinated DL training tasks]
Spark
• tasks are independent of each other
• massively parallel and scalable
• if one task crashes, rerun only that task
Distributed DL model training
• complete coordination among tasks
• optimized for communication
• if one task crashes, all tasks must be rerun
Apache Spark 2.4: Barrier Execution Mode
• Barrier scheduling: gang scheduling on top of the existing MapReduce
execution mode
• A distributed DL job can run as a Spark job inside data/ML pipelines
• It starts all tasks together
• It provides sufficient info and tooling to run a hybrid distributed job
• It cancels and restarts all tasks in case of failures
Barrier Execution Mode
• RDD.barrier() tells Spark to launch the tasks together
• context.barrier() places a global barrier and waits until all tasks in this stage
hit this barrier
rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()  // get the barrier task context
  …                                       // do something in each task
  context.barrier()                       // wait until all tasks in this stage reach this barrier
  iter
}
Barrier Execution Mode
• RDD.barrier() tells Spark to launch the tasks together
• context.barrier() places a global barrier and waits until all tasks in this stage
hit this barrier
• context.getTaskInfos() returns info about all tasks in this stage
rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()  // get the barrier task context
  if (context.partitionId() == 0) {
    val addresses = context.getTaskInfos().map(_.address)
    …  // run the distributed DL training script, which accepts "addresses" as an argument
  }
  context.barrier()  // wait until all tasks in this stage reach this barrier
  iter
}
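For PySpark users, the same pattern looks roughly as follows. This is a minimal sketch, assuming a local session with enough slots for both partitions; the barrier APIs shown (BarrierTaskContext, getTaskInfos, barrier) are the PySpark 2.4 ones, while the training launcher itself is left as a hypothetical helper.

from pyspark.sql import SparkSession
from pyspark import BarrierTaskContext

def train_fn(iterator):
    context = BarrierTaskContext.get()                              # barrier task context for this task
    addresses = [info.address for info in context.getTaskInfos()]   # host:port of every task in the stage
    if context.partitionId() == 0:
        pass  # launch_training(addresses) -- hypothetical helper that starts the distributed DL script
    context.barrier()                                               # wait until every task reaches this barrier
    return iter([addresses])

spark = SparkSession.builder.master("local[2]").appName("barrier-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), 2)
print(rdd.barrier().mapPartitions(train_fn).collect())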
The data/DL pipeline - Load dataset
dataset = (spark.read
    .format("image")
    .option("dropInvalid", True)
    .load("/data/dl/images"))
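As a quick sanity check, the built-in image data source yields a single struct column; a hedged usage example (field names as documented for Spark 2.4's image source, the exact printout may differ slightly):

dataset.printSchema()
# image: struct<origin: string, height: int, width: int, nChannels: int, mode: int, data: binary>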
The data/DL pipeline - running in barrier execution mode
from pyspark import BarrierTaskContext

dataset = dataset.rdd.barrier().mapPartitions(runDistTrain)

def runDistTrain(batches):
    context = BarrierTaskContext.get()
    partitionId = context.partitionId()
    if partitionId == 0:
        ...  # run the distributed DL training script at the master task and save the model to HDFS
    else:
        ...  # usually do nothing on the other workers
    context.barrier()
    return iter([])  # mapPartitions expects an iterator back
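Note that mapPartitions is lazy; the gang-scheduled barrier stage only actually runs once an action is triggered, for example:

dataset.count()  # or collect(); forces the barrier training stage to execute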
Unifying execution models (see the sketch below)
• Stage 1: data prep (massively parallel)
• Stage 2: distributed DL training (gang scheduled)
• Stage 3: data sink (massively parallel)
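A hedged end-to-end sketch of such a three-stage pipeline, reusing runDistTrain from the previous slide; scoreImage and the output path are hypothetical placeholders, not part of any Spark API.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("data-dl-pipeline").getOrCreate()

# Stage 1: data prep (massively parallel)
images = (spark.read.format("image")
          .option("dropInvalid", True)
          .load("/data/dl/images"))

# Stage 2: distributed DL training (gang scheduled via barrier mode)
images.rdd.barrier().mapPartitions(runDistTrain).collect()   # runDistTrain as defined on the previous slide

# Stage 3: data sink (massively parallel): score with the trained model and write results
scoreImage = udf(lambda img: "placeholder label", StringType())   # hypothetical scoring UDF
predictions = images.withColumn("label", scoreImage(col("image")))
predictions.write.mode("overwrite").parquet("/data/dl/predictions")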
In the future (Spark 3.0+):
• Optimized Data Exchange (SPARK-24579)
• Accelerator Aware Scheduling (SPARK-24615)
Optimized Data Exchange
• Data exchange between Spark Datasets/DataFrames and DL frameworks (TensorFlow/PyTorch)
• The integration should be simple and efficient
Vectorized computation
DataFrame.toArrowRDD(maxRecordsPerBatch=4096)  # proposed API (SPARK-24579)

def runDistTrain(batches):
    context = BarrierTaskContext.get()
    ...  # start distributed training in the same (or a separate) process,
         # fetching record batches from the Spark DataFrame directly
    context.barrier()
    return [model]

dataset.toArrowRDD.barrier().mapPartitions(runDistTrain).collect()
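Until toArrowRDD lands, Spark 2.4 already exchanges columnar data with Python over Arrow through pandas UDFs. A minimal inference sketch, assuming an active SparkSession, a DataFrame df with a numeric features column, and a hypothetical loadModel helper:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

# Arrow batch size, analogous to maxRecordsPerBatch above
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "4096")

@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def predict(features):
    model = loadModel("/models/dl/latest")            # hypothetical: load the saved DL model
    return pd.Series(model.predict(features.tolist()))

scored = df.withColumn("prediction", predict(df["features"]))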
Accelerator Aware Scheduling
• Goals:
• To utilize accelerators (GPUs, FPGAs) in a heterogeneous cluster
• To utilize multiple accelerators in a multi-task environment
• User workflow (see the sketch below):
• Submit a Spark application and request GPU resources per executor
• Request the number of GPUs to use per task (RDD stage, Pandas UDF)
• In the custom task logic, retrieve the logical indices of the assigned GPUs and use them
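A hedged sketch of that workflow, using the resource configuration names and TaskContext.resources() API that eventually shipped with SPARK-24615 in Spark 3.0 (not available in 2.4); rdd stands for any RDD carrying training data.

# Submit with GPU resources requested per executor and per task, e.g.:
#   spark-submit \
#     --conf spark.executor.resource.gpu.amount=4 \
#     --conf spark.task.resource.gpu.amount=2 \
#     --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpus.sh \
#     train.py

from pyspark import TaskContext

def train_partition(iterator):
    # logical addresses (e.g. ["0", "1"]) of the GPUs assigned to this task
    gpus = TaskContext.get().resources()["gpu"].addresses
    # ... pin the DL framework to these devices and run training ...
    return iter([gpus])

rdd.mapPartitions(train_partition).collect()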
Example: request accelerators
With accelerator awareness, users could specify accelerator constraints or hints (proposed API):
rdd.accelerated
  .by("/gpu/p100")
  .numPerTask(2)
  .required
THANK YOU
Robert Hryniewicz
@robhryniewicz