© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
A Scalable Implementation of Deep Learning on Spark
Alexander Ulanov¹
Joint work with Xiangrui Meng², Bert Greevenbosch³
With help from Guoqiang Li⁴, Andrey Simanovsky¹
¹Hewlett-Packard Labs ²Databricks ³Huawei & Jules Energy ⁴Spark community
Outline
• Artificial neural network basics
• Implementation of Multilayer Perceptron (MLP) in Spark
• Optimization & parallelization
• Experiments
• Future work
Artificial neural network
Basics
• Statistical model that approximates a function of multiple inputs
• Consists of interconnected “neurons” which exchange messages
– A “neuron” produces an output by applying a transformation function to its inputs
• A network with more than 3 layers of neurons is called “deep” and is an instance of deep learning
Layer types & learning
• A layer type is defined by a transformation function
– Affine: $y_j = \sum_i w_{ij} x_i + b_j$; Sigmoid: $y_i = (1 + e^{-x_i})^{-1}$; Convolution, Softmax, etc.
• Multilayer perceptron (MLP) – a network with several pairs of Affine & Sigmoid layers
• Model parameters – weights that “neurons” use for transformations
• Parameters are iteratively estimated with the backpropagation algorithm
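The affine and sigmoid transformations above can be sketched in a few lines of plain Python. This is only an illustration of the formulas, not Spark's implementation; the function names and the toy weights are made up for the example.

```python
import math

def affine(x, w, b):
    """y_j = sum_i w[i][j] * x[i] + b[j]; w has len(x) rows and len(b) columns."""
    return [sum(w[i][j] * x[i] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def sigmoid(x):
    """Element-wise logistic function y_i = 1 / (1 + exp(-x_i))."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def mlp_forward(x, layers):
    """Apply (affine, sigmoid) pairs in order; layers is a list of (w, b)."""
    for w, b in layers:
        x = sigmoid(affine(x, w, b))
    return x

# Toy 2-3-1 network with made-up weights (illustrative values only).
layers = [
    ([[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]], [0.0, 0.1, -0.1]),
    ([[0.7], [-0.8], [0.9]], [0.05]),
]
output = mlp_forward([1.0, 2.0], layers)
```

Each (affine, sigmoid) pair is one MLP layer in the sense used above; training would adjust `w` and `b` with backpropagation, which this sketch does not cover.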
Multilayer perceptron
• Speech recognition (phoneme classification), computer vision
[Figure: MLP diagram with input $x$, a hidden layer, and output $y$]
Example of MLP in Spark
Handwritten digits recognition
• Dataset MNIST [LeCun et al. 1998]
• 28x28 greyscale images of handwritten digits 0-9
• MLP with 784 inputs, 10 outputs and two hidden layers of 300 and 100 neurons
Scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.DataFrame
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)
val model = mlp.fit(digits)
Python
from pyspark.ml.classification import MultilayerPerceptronClassifier
digits = sqlContext.read.format("libsvm").load("/data/mnist")
mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
model = mlp.fit(digits)
[Figure: network diagram with 784 inputs, a 1st hidden layer of 300 neurons, a 2nd hidden layer of 100 neurons, and an output layer of 10 neurons]
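To get a feel for the size of the model defined by `layers = [784, 300, 100, 10]`, the number of trainable parameters can be counted directly. The sketch below assumes fully connected layers with one bias per neuron, as in the affine formula; the helper name is hypothetical.

```python
def mlp_parameter_count(layers):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [784, 300, 100, 10]
total = mlp_parameter_count(layers)
print(total)  # 266610
```

Most of the parameters (235,500) sit between the 784 inputs and the first hidden layer.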
Pipeline with PCA+MLP in Spark
Scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.DataFrame
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val pca = new PCA()
  .setInputCol("features")
  .setK(20)
  .setOutputCol("features20")
val mlp = new MultilayerPerceptronClassifier()
  .setFeaturesCol("features20")
  .setLayers(Array(20, 50, 10))
  .setBlockSize(128)
val pipeline = new Pipeline()
  .setStages(Array(pca, mlp))
val model = pipeline.fit(digits)
Python
from pyspark.ml import Pipeline
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.feature import PCA
digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
pca = PCA(inputCol="features", k=20, outputCol="features20")
mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10],
                                     blockSize=128)
pipeline = Pipeline(stages=[pca, mlp])
model = pipeline.fit(digits)
MLP implementation in Spark
Requirements
• Conform to Spark APIs
• Extensible interface (deep learning API)
• Efficient and scalable (single node & cluster)
Why conform to Spark APIs?
• Spark can call any Java, Python or Scala library, not necessarily designed for Spark
– Results in expensive data movement from Spark RDDs to the library
– Prevents the library from being used in Spark ML Pipelines
Extensible interface
• Our implementation processes each layer as a black box, with backpropagation in general form
– Allows further introduction of new layers and features
• CNN, Autoencoder and RBM are currently under development by the community
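The "layer as a black box" idea can be sketched as follows. This is a minimal illustration in plain Python, not Spark's internal API: the names `Layer`, `Affine`, `Sigmoid` and `backprop` are hypothetical, and the squared-error loss is an assumption made for the example.

```python
import math

class Layer:
    """Hypothetical interface: a layer only exposes forward and backward."""
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, grad_out):
        """Return (gradient w.r.t. input x, list of parameter gradients)."""
        raise NotImplementedError

class Affine(Layer):
    def __init__(self, w, b):
        self.w, self.b = w, b  # w[i][j]: weight from input i to output j
    def forward(self, x):
        return [sum(self.w[i][j] * x[i] for i in range(len(x))) + self.b[j]
                for j in range(len(self.b))]
    def backward(self, x, grad_out):
        grad_in = [sum(self.w[i][j] * grad_out[j] for j in range(len(grad_out)))
                   for i in range(len(x))]
        grad_w = [[x[i] * g for g in grad_out] for i in range(len(x))]
        return grad_in, [grad_w, list(grad_out)]

class Sigmoid(Layer):
    def forward(self, x):
        return [1.0 / (1.0 + math.exp(-v)) for v in x]
    def backward(self, x, grad_out):
        y = self.forward(x)
        return [g * v * (1.0 - v) for g, v in zip(grad_out, y)], []

def backprop(layers, x, grad_loss):
    """Generic backpropagation: never inspects what kind of layer it handles."""
    inputs = []
    for layer in layers:            # forward pass, remembering each layer's input
        inputs.append(x)
        x = layer.forward(x)
    grad = grad_loss(x)             # gradient of the loss w.r.t. the network output
    param_grads = []
    for layer, layer_in in zip(reversed(layers), reversed(inputs)):
        grad, pg = layer.backward(layer_in, grad)
        param_grads.append(pg)
    return x, list(reversed(param_grads))

# Assumed loss for the example: 0.5 * sum((y - target)^2), whose gradient is y - target.
net = [Affine([[0.5, -0.3], [0.2, 0.8]], [0.1, -0.1]), Sigmoid()]
target = [1.0, 0.0]
out, grads = backprop(net, [1.0, 2.0],
                      lambda y: [yi - ti for yi, ti in zip(y, target)])
```

Because `backprop` relies only on `forward` and `backward`, a new layer type can be added without touching the training loop, which is the extensibility property described above.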
Efficiency
Batch processing
• A layer’s affine transformation can be represented in vector form: $\mathbf{y} = W^T \mathbf{x} + \mathbf{b}$
– $\mathbf{y}$ – output from the layer, vector of size $n$
– $W$ – the matrix of layer weights, $m \times n$; $\mathbf{b}$ – bias, vector of size $n$
– $\mathbf{x}$ – input to the layer, vector of size $m$
• Matrix-vector multiplications are not as efficient as matrix-matrix multiplications
– Stack $s$ input vectors into a batch to perform a matrix-matrix multiplication: $Y = W^T X + B$
– $X$ is $m \times s$, $Y$ is $n \times s$
– $B$ is $n \times s$; each column contains a copy of $\mathbf{b}$
• We implemented batch processing in matrix form
– Enabled the use of optimized native BLAS libraries
– Memory is reused to limit GC overhead
[Figure: block diagrams of the vector form and the batched matrix form of the affine transformation]
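The batching step can be checked on a toy example: stacking input vectors as columns of $X$ and computing $W^T X + B$ gives the same columns as applying $W^T \mathbf{x} + \mathbf{b}$ to each vector separately. The plain-Python sketch below only demonstrates the algebra; it says nothing about speed, which in the real implementation comes from handing the matrix product to a native BLAS routine. Function names and values are made up for the example.

```python
def matmul_t(w, x):
    """Compute W^T X for W (m x n) and X (m x s), giving an n x s matrix."""
    m, n, s = len(w), len(w[0]), len(x[0])
    return [[sum(w[i][j] * x[i][c] for i in range(m)) for c in range(s)]
            for j in range(n)]

def affine_batch(w, b, x):
    """Y = W^T X + B, where every column of B is a copy of the bias b."""
    y = matmul_t(w, x)
    return [[y[j][c] + b[j] for c in range(len(y[0]))] for j in range(len(b))]

# m = 3 inputs, n = 2 outputs, batch of s = 2 input vectors (toy values).
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # m x n
b = [0.5, -0.5]                                 # size n
x1, x2 = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]

# Stack the inputs as columns of X (m x s).
x_batch = [[x1[i], x2[i]] for i in range(3)]
y_batch = affine_batch(w, b, x_batch)           # n x s

# Each column of Y equals the single-vector result W^T x + b.
y1 = [row[0] for row in affine_batch(w, b, [[v] for v in x1])]
y2 = [row[0] for row in affine_batch(w, b, [[v] for v in x2])]
```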
Single node BLAS
[Figure: dgemm performance on a log scale from 1.00E-04 to 1.00E+04, for matrix products from (1x1)*(1x1) up to (10000x10000)*(10000x10000); series: netlib-NVBLAS, netlib-MKL, netlib-OpenBLAS, netlib-f2jblas]
BLAS in Spark
• BLAS – Basic Linear Algebra Subprograms
• Hardware-optimized native libraries in C & Fortran
– CPU: MKL, OpenBLAS etc.
– GPU: NVBLAS (F-BLAS interface to CUDA)
• Used in Spark through netlib-java
Experiments
• Huge benefit from native BLAS vs pure Java f2jblas
• GPU is faster (2x) only for large matrices
– When compute time is larger than the time to copy to/from the GPU
• More details:
– https://github.com/avulanov/scala-blas
– “linalg: Matrix Computations in Apache Spark”, Reza et al., 2015
CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM
GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores
[Figure axes: seconds (y) vs. matrices size (x)]
Scalability
Parallelization
• In each iteration $k$, each node $i$:
– 1. Gets parameters $w^k$ from the master
– 2. Computes a gradient $\nabla_i^k F(data_i)$
– 3. Sends the gradient to the master
– 4. The master computes $w^{k+1}$ based on the gradients
• Gradient type
– Batch – process all data on each iteration
– Stochastic – random point
– Mini-batch – random batch
• How many workers to use?
– Fewer workers – less compute
– More workers – more communication
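[Figure: parallelization loop between the master and executors 1 to N, with data in partitions 1 to P: (1) the master sends $w^k$ to the executors; (2) executor $i$ computes $\nabla_i^k F(data_i)$; (3) the executors send their gradients to the master; (4) the master computes $w^{k+1} := Y(\nabla_i^k F)$; then go to step 1]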
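The four steps can be simulated on a single machine to make the data flow concrete. The sketch below is not Spark code: the "executors" are just Python lists, the model is an assumed one-parameter least-squares fit, and the function names, step size and data are made up for the example.

```python
def partition_gradient(w, partition):
    """Step 2: gradient of the squared error over one partition of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in partition)

def train(partitions, w=0.0, step=0.01, iterations=200):
    n_points = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # Step 1: every executor receives the current parameters w.
        # Steps 2-3: each executor computes its gradient and sends it back.
        gradients = [partition_gradient(w, p) for p in partitions]
        # Step 4: the master aggregates the gradients and updates w.
        w = w - step * sum(gradients) / n_points
    return w

# Toy data following y = 3x, split across three "executors".
data = [(float(x), 3.0 * x) for x in range(1, 7)]
partitions = [data[0:2], data[2:4], data[4:6]]
w_final = train(partitions)
```

Since the batch gradient is a sum over data points, the sum of the per-partition gradients equals the gradient over the whole dataset, so the result does not depend on how the data is partitioned. Only the time per iteration does.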
Communication and computation trade-off
Parallelization of batch gradient
• There are $d$ data points, $f$ features and $k$ classes
– Assume we want to train logistic regression; it has $fk$ parameters
• Communication: $n$ workers send/receive $fk$ 64-bit parameters through a network with bandwidth $b$ and software overhead $c$. Using all-reduce:
– $t_{cm} = 2 \left( \frac{64 f k}{b} + c \right) \log_2 n$
• Computation: each worker has $p$ FLOPS and processes $\frac{d}{n}$ data points, each of which needs $fk$ operations
– $t_{cp} \sim \frac{d}{n} \cdot \frac{f k}{p}$
• What is the optimal number of workers?
– $\min_n \left( t_{cm} + t_{cp} \right) \Rightarrow n = \max\left( \frac{d f k \ln 2}{p \left( 128 f k / b + 2c \right)}, 1 \right)$
– $n = \max\left( \frac{d \cdot w \cdot \ln 2}{p \left( 128 w / b + 2c \right)}, 1 \right)$, if $w$ is both the number of model parameters and the number of floating point operations per data point
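The expression for the optimal number of workers follows from setting the derivative of the total time to zero. A short derivation, writing $w = fk$ and treating $n$ as a continuous variable (an assumption made only for the calculus step):

```latex
\begin{align*}
t(n) &= t_{cm} + t_{cp} = 2\left(\frac{64w}{b} + c\right)\log_2 n + \frac{d\,w}{n\,p} \\
\frac{dt}{dn} &= \frac{2\left(\frac{64w}{b} + c\right)}{n \ln 2} - \frac{d\,w}{n^2 p} = 0 \\
n &= \frac{d\,w \ln 2}{p\left(\frac{128w}{b} + 2c\right)}
\end{align*}
```

The derivative is negative below this value and positive above it, so the stationary point is a minimum; taking the maximum with 1 covers the case where the formula gives less than one worker.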
Analysis of the trade-off
Optimal number of workers for batch gradient
• Parallelism in a cluster
– $n = \max\left( \frac{d \cdot w \cdot \ln 2}{p \left( 128 w / b + 2c \right)}, 1 \right)$
• Analysis
– More FLOPS $p$ means a lower degree of batch gradient parallelism in a cluster
– More operations, i.e. more features and classes $w = fk$ (or a deep network), means a higher degree
– A small overhead $c$ for sending/receiving a message means a higher degree
• Example: MNIST8M handwritten digit recognition dataset
– 8.1M data points, 784 features, 10 classes, logistic regression
– 32 GFLOPS double-precision CPU, 1 Gbit network, overhead ~0.1 s
– $n = \max\left( \frac{8.1\text{M} \cdot 784 \cdot 10 \cdot 0.69}{32\text{G} \left( 128 \cdot 784 \cdot 10 / 1\text{G} + 2 \cdot 0.1 \right)}, 1 \right) = 6$
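The MNIST8M example can be reproduced with a few lines of Python. The function name is hypothetical; the inputs are the values listed in the example, with bandwidth in bits per second and overhead in seconds.

```python
import math

def optimal_workers(d, w, p, b, c):
    """n = max(d*w*ln2 / (p*(128*w/b + 2*c)), 1) from the formula above."""
    n = d * w * math.log(2) / (p * (128.0 * w / b + 2.0 * c))
    return max(n, 1.0)

# MNIST8M example: 8.1M points, w = 784 * 10 parameters,
# 32 GFLOPS, 1 Gbit/s network, 0.1 s overhead.
n = optimal_workers(d=8.1e6, w=784 * 10, p=32e9, b=1e9, c=0.1)
print(round(n, 2), math.floor(n))
```

The formula gives a value between 6 and 7, which truncates to the 6 workers reported in the example.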
Scalability testing
[Figure: Spark MLP vs Caffe MLP; series: MLP (total), MLP (compute), Caffe CPU, Caffe GPU; x-axis ticks 0 to 6, y-axis ticks 0 to 100]
Setup
• MNIST character recognition, 60K samples
• 6-layer MLP (784,2500,2000,1500,1000,500,10)
• 12M parameters
• CPU: Xeon X5650 @ 2.67GHz
• GPU: Tesla M2050 3GB, 575MHz
• Caffe (Deep Learning from Berkeley): 1 node
• Spark: 1 master + 5 workers
Results per iteration
• Single node (both tools double precision)
– 1.6x slower than Caffe CPU (Scala vs C++)
• Scalability
– 5 nodes give a 4.7x speedup, which beats Caffe CPU and is close to the GPU
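[Figure axes: seconds (y) vs. workers (x); annotation: communication cost]
$n = \max\left( \frac{60\text{K} \cdot 12\text{M} \cdot 0.69}{64\text{G} \left( 128 \cdot 12\text{M} / 950\text{M} + 2 \cdot 0.1 \right)}, 1 \right) = 4$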
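As a cross-check of the value above, the modelled iteration time $t_{cm} + t_{cp}$ can be evaluated for each whole number of workers, picking the smallest. The sketch uses the numbers from the formula (60K samples, 12M parameters, 64 GFLOPS, 950 Mbit/s, 0.1 s overhead). The function name is hypothetical, and the model counts only $w$ operations per sample, so its absolute times are not expected to match the measured seconds in the chart.

```python
import math

def iteration_time(n, d, w, p, b, c):
    """Modelled time per iteration: communication plus computation."""
    t_cm = 2.0 * (64.0 * w / b + c) * math.log2(n)
    t_cp = (d / n) * (w / p)
    return t_cm + t_cp

# Values from the formula above: 60K samples, 12M parameters,
# 64 GFLOPS, 950 Mbit/s, 0.1 s overhead.
setup = dict(d=60e3, w=12e6, p=64e9, b=950e6, c=0.1)
times = {n: iteration_time(n, **setup) for n in range(1, 7)}
best = min(times, key=times.get)
print(best)  # 4
```

In this model the times for 4 and 5 workers differ only slightly, so the optimum is shallow: adding a worker beyond 4 buys little because the extra communication nearly cancels the saved computation.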
Conclusions & future work
Conclusions
• Scalable multilayer perceptron is available in Spark 1.5.0
• Extensible internal API for Artificial Neural Networks
– Further contributions are welcome!
• Native BLAS (and GPU) speeds up Spark
• Heuristics for parallelization of batch gradient
Work in progress [SPARK-5575]
• Autoencoder(s)
• Restricted Boltzmann Machines
• Drop-out
• Convolutional neural networks
Future work
• SGD & parameter server
Thank you
