A Scalable Implementation of Deep Learning on Spark - Alexander Ulanov

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
A Scalable Implementation of Deep Learning on Spark
Alexander Ulanov (1)
Joint work with Xiangrui Meng (2), Bert Greevenbosch (3)
With help from Guoqiang Li (4), Andrey Simanovsky (1)
(1) Hewlett-Packard Labs, (2) Databricks, (3) Huawei & Jules Energy, (4) Spark community

Outline
• Artificial neural network basics
• Implementation of Multilayer Perceptron (MLP) in Spark
• Optimization & parallelization
• Experiments
• Future work

Artificial neural network
Basics
• Statistical model that approximates a function of multiple inputs
• Consists of interconnected “neurons” which exchange messages
– A “neuron” produces an output by applying a transformation function to its inputs
• A network with more than 3 layers of neurons is called “deep”; it is an instance of deep learning
Layer types & learning
• A layer type is defined by a transformation function
– Affine: $y_j = \sum_i w_{ij} x_i + b_j$, Sigmoid: $y_i = (1 + e^{-x_i})^{-1}$, Convolution, Softmax, etc.
• Multilayer perceptron (MLP) – a network with several pairs of Affine & Sigmoid
layers
• Model parameters – weights that “neurons” use for transformations
• Parameters are iteratively estimated with the backpropagation algorithm
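The two layer types above can be sketched in a few lines of plain Python. This is only an illustration of the Affine and Sigmoid formulas, not the Spark implementation; the function names and the toy weights are assumptions made for the example.

```python
import math

def affine(x, W, b):
    """y_j = sum_i W[i][j] * x[i] + b[j]; W has len(x) rows and len(b) columns."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def sigmoid(x):
    """Element-wise y_i = 1 / (1 + exp(-x_i))."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def mlp_forward(x, layers):
    """Apply each (W, b) pair as an Affine layer followed by a Sigmoid layer."""
    for W, b in layers:
        x = sigmoid(affine(x, W, b))
    return x

# Toy network: 2 inputs -> 2 hidden neurons -> 1 output (weights are arbitrary).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0], [-1.0]], [0.0]),
]
output = mlp_forward([0.0, 0.0], layers)
print(output)
```

Training would adjust the entries of each `W` and `b`; only the forward pass is shown here.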
Multilayer perceptron
• Speech recognition (phoneme classification), computer vision
[Figure: multilayer perceptron with input $x$, a hidden layer, and output $y$]

Example of MLP in Spark
Handwritten digits recognition
• Dataset MNIST [LeCun et al. 1998]
• 28x28 greyscale images of handwritten digits 0-9
• MLP with 784 inputs, 10 outputs and two hidden layers of
300 and 100 neurons
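Scala:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.DataFrame

// sqlContext is assumed to be in scope, as in spark-shell
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10)) // inputs, two hidden layers, outputs
  .setBlockSize(128)
val model = mlp.fit(digits)

[Figure: network with 784 inputs, a 1st hidden layer of 300 neurons, a 2nd hidden layer of 100 neurons, and an output layer of 10 neurons]

Python:
from pyspark.ml.classification import MultilayerPerceptronClassifier

# sqlContext is assumed to be in scope, as in the pyspark shell
digits = sqlContext.read.format("libsvm").load("/data/mnist")
mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
model = mlp.fit(digits)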

Pipeline with PCA+MLP in Spark
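Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.DataFrame

val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
// Reduce the 784 input features to 20 principal components
val pca = new PCA()
  .setInputCol("features")
  .setK(20)
  .setOutputCol("features20")
// Train the MLP on the reduced features
val mlp = new MultilayerPerceptronClassifier()
  .setFeaturesCol("features20")
  .setLayers(Array(20, 50, 10))
  .setBlockSize(128)
val pipeline = new Pipeline()
  .setStages(Array(pca, mlp))
val model = pipeline.fit(digits)

Python:
from pyspark.ml import Pipeline
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.feature import PCA

digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
pca = PCA(inputCol="features", k=20, outputCol="features20")
mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10],
                                     blockSize=128)
pipeline = Pipeline(stages=[pca, mlp])
model = pipeline.fit(digits)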

MLP implementation in Spark
Requirements
• Conform to Spark APIs
• Extensible interface (deep learning API)
• Efficient and scalable (single node & cluster)
Why conform to Spark APIs?
• Spark can call any Java, Python or Scala library, not necessarily designed for Spark
– This results in expensive data movement from Spark RDDs to the library
– It prevents the library from being used in Spark ML Pipelines
Extensible interface
• Our implementation treats each layer as a black box, with backpropagation in general form
– Allows new layers and features to be added later
• CNN, Autoencoder and RBM are currently under development by the community
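One way to picture "each layer as a black box" is the sketch below: the backpropagation loop only calls `forward` and `backward`, so a new layer type can be added without touching the loop. The `Layer` interface, the class names and the squared-error loss are all assumptions made for illustration; they are not the actual Spark internal API.

```python
import math

class Layer:
    """Hypothetical black-box layer: backprop only needs these two methods."""
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, grad_out):
        """Return (gradient w.r.t. input x, list of gradients w.r.t. parameters)."""
        raise NotImplementedError

class Affine(Layer):
    def __init__(self, W, b):
        self.W, self.b = W, b          # W[i][j]: weight from input i to output j
    def forward(self, x):
        return [sum(self.W[i][j] * x[i] for i in range(len(x))) + self.b[j]
                for j in range(len(self.b))]
    def backward(self, x, grad_out):
        grad_x = [sum(self.W[i][j] * grad_out[j] for j in range(len(grad_out)))
                  for i in range(len(x))]
        grad_W = [[x[i] * g for g in grad_out] for i in range(len(x))]
        return grad_x, [grad_W, list(grad_out)]

class Sigmoid(Layer):
    def forward(self, x):
        return [1.0 / (1.0 + math.exp(-v)) for v in x]
    def backward(self, x, grad_out):
        y = self.forward(x)
        return [g * v * (1.0 - v) for g, v in zip(grad_out, y)], []

def backprop(layers, x, target):
    """Generic backpropagation for the loss 0.5 * ||output - target||^2."""
    inputs = []
    for layer in layers:               # forward pass, remembering each layer's input
        inputs.append(x)
        x = layer.forward(x)
    grad = [o - t for o, t in zip(x, target)]
    param_grads = []
    for layer, layer_in in zip(reversed(layers), reversed(inputs)):
        grad, pg = layer.backward(layer_in, grad)   # backward pass, last layer first
        param_grads.append(pg)
    return x, list(reversed(param_grads))

net = [Affine([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.1]), Sigmoid()]
out, grads = backprop(net, [1.0, 2.0], [1.0, 0.0])
```

Here `grads[0]` holds the weight and bias gradients of the Affine layer, and `grads[1]` is empty because Sigmoid has no parameters.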

Efficiency
Batch processing
• A layer's affine transformation can be written in vector form: $\mathbf{y} = W^T \mathbf{x} + \mathbf{b}$
– $\mathbf{y}$: output from the layer, vector of size $n$
– $W$: the matrix of layer weights, $m \times n$; $\mathbf{b}$: bias, vector of size $n$
– $\mathbf{x}$: input to the layer, vector of size $m$
• Vector-matrix multiplications are not as efficient as matrix-matrix multiplications
– Stack $s$ input vectors into a batch to perform matrix multiplication: $Y = W^T X + B$
– $X$ is $m \times s$, $Y$ is $n \times s$
– $B$ is $n \times s$; each column contains a copy of $\mathbf{b}$
• We implemented batch processing in matrix form
– Enabled the use of optimized native BLAS libraries
– Memory is reused to limit GC overhead
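The equivalence between per-vector and batched processing can be checked with a small pure-Python sketch. It illustrates only the shapes and the arithmetic; it uses none of the native BLAS calls that give the real speed-up, and all function names are made up for the example.

```python
def matmul(A, B):
    """Plain triple-loop product of an (r x k) and a (k x c) nested-list matrix."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def affine_single(W, x, b):
    """y = W^T x + b for one input vector x of size m; returns a vector of size n."""
    Wt = transpose(W)
    return [sum(Wt[j][i] * x[i] for i in range(len(x))) + b[j] for j in range(len(b))]

def affine_batch(W, X, b):
    """Y = W^T X + B, where X is m x s and B repeats b in each of the s columns."""
    Y = matmul(transpose(W), X)
    return [[Y[j][c] + b[j] for c in range(len(Y[0]))] for j in range(len(Y))]

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # m = 2 inputs, n = 3 outputs
b = [0.5, -0.5, 1.0]                       # one bias per output, size n
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]   # s = 4 input vectors
X = transpose(inputs)                      # stack the vectors as columns: m x s
Y = affine_batch(W, X, b)                  # n x s

# Each column of Y equals the single-vector result for the matching input.
print(transpose(Y) == [affine_single(W, x, b) for x in inputs])
```

With a native BLAS library, the single `matmul` over the whole batch would map to one `dgemm` call instead of $s$ separate matrix-vector calls.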

[Figure: dgemm performance for netlib-NVBLAS, netlib-MKL, netlib OpenBLAS and netlib-f2jblas; matrix products from (1x1)*(1x1) to (10000x10000)*(10000x10000), log-scale axis from 1.00E-04 to 1.00E+04]
Single node BLAS
BLAS in Spark
• BLAS – Basic Linear Algebra Subprograms
• Hardware-optimized native implementations in C & Fortran
– CPU: MKL, OpenBLAS etc.
– GPU: NVBLAS (F-BLAS interface to CUDA)
• Use in Spark through Netlib-java
Experiments
• Huge benefit from native BLAS vs pure Java f2jblas
• GPU is faster (2x) only for large matrices
– When compute time exceeds the cost of copying data to and from the GPU
• More details:
– https://github.com/avulanov/scala-blas
– “linalg: Matrix Computations in Apache Spark”, Reza et al., 2015
CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM
GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores
[Figure axes: seconds versus matrix size]

Scalability
Parallelization
• At each iteration $k$, each node $i$
– 1. Gets parameters $w^k$ from the master
– 2. Computes a gradient $\nabla_i^k F(data_i)$
– 3. Sends the gradient to the master
– 4. Master computes $w^{k+1}$ based on the gradients
• Gradient type
– Batch – process all data on each iteration
– Stochastic – random point
– Mini-batch – random batch
• How many workers to use?
– Fewer workers: less compute capacity
– More workers: more communication
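The four steps of one iteration can be simulated on a single machine. The sketch below is a toy stand-in for the master/executor loop, assuming a least-squares objective and plain gradient descent as the master's update rule; neither choice comes from the slides, and the "workers" are just list slices rather than Spark executors.

```python
def local_gradient(w, partition):
    """Sum of gradients of 0.5 * (w.x - y)^2 over one worker's data partition."""
    grad = [0.0] * len(w)
    for x, y in partition:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad

def train(partitions, w, step, iterations):
    n_points = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # Steps 1-2: every worker gets the current w and computes its local gradient.
        grads = [local_gradient(w, p) for p in partitions]
        # Step 3: gradients are sent to the master, which sums them.
        total = [sum(g[j] for g in grads) for j in range(len(w))]
        # Step 4: the master computes the next parameters (plain gradient descent here).
        w = [wj - step * tj / n_points for wj, tj in zip(w, total)]
    return w

# Toy data generated from y = 2*x1 - 1*x2, split across three "workers".
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0),
        ([2.0, 1.0], 3.0), ([1.0, 2.0], 0.0), ([2.0, 2.0], 2.0)]
partitions = [data[0:2], data[2:4], data[4:6]]
w = train(partitions, [0.0, 0.0], step=0.1, iterations=500)
print(w)  # approaches [2.0, -1.0]
```

Because every partition is processed on each iteration, this corresponds to the batch gradient type; sampling one point or a random subset per iteration would give the stochastic and mini-batch variants.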
[Figure: one iteration of the loop. 1. The master sends $w^k$ to executors 1..N, which hold partitions 1..P. 2. Each executor $i$ computes $\nabla_i^k F(data_i)$. 3. The executors send their gradients to the master. 4. The master computes $w^{k+1} := Y(\nabla_i^k F)$. Then go to step 1.]

Communication and computation trade-off
Parallelization of batch gradient
• There are $d$ data points, $f$ features and $k$ classes
– Assume we want to train logistic regression, which has $fk$ parameters
• Communication: $n$ workers send/receive $fk$ 64-bit parameters through a network with bandwidth $b$ and software overhead $c$. Using all-reduce:
– $t_{cm} = 2\left(\frac{64fk}{b} + c\right)\log_2 n$
• Computation: each worker has $p$ FLOPS and processes $\frac{d}{n}$ data points, each needing $fk$ operations
– $t_{cp} \sim \frac{d}{n} \cdot \frac{fk}{p}$
• What is the optimal number of workers?
– $\min_n (t_{cm} + t_{cp}) \Rightarrow n = \max\left(\frac{dfk \ln 2}{p\,(128fk/b + 2c)}, 1\right)$
– $n = \max\left(\frac{d \cdot w \cdot \ln 2}{p\,(128w/b + 2c)}, 1\right)$, if $w$ is both the number of model parameters and the number of floating point operations per data point
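The step from the two cost terms to the optimal $n$ is not spelled out on the slide. A short derivation, treating $n$ as a continuous variable (an assumption made only for the calculus), is:

```latex
\begin{align*}
t(n) &= t_{cm} + t_{cp}
      = 2\left(\frac{64fk}{b} + c\right)\log_2 n + \frac{d}{n}\cdot\frac{fk}{p} \\
t'(n) &= \frac{2}{n \ln 2}\left(\frac{64fk}{b} + c\right) - \frac{dfk}{p\,n^2} = 0 \\
\Rightarrow\quad n &= \frac{dfk \ln 2}{p\left(\frac{128fk}{b} + 2c\right)}
\end{align*}
```

The $\max(\cdot, 1)$ in the slide's formula covers the case where this value falls below one worker.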

Analysis of the trade-off
Optimal number of workers for batch gradient
• Parallelism in a cluster
– $n = \max\left(\frac{d \cdot w \cdot \ln 2}{p\,(128w/b + 2c)}, 1\right)$
• Analysis
– More FLOPS $p$ means a lower degree of batch gradient parallelism in a cluster
– More operations, i.e. more features and classes $w = fk$ (or a deep network), means a higher degree
– A small overhead $c$ for sending/receiving a message means a higher degree
• Example: MNIST8M handwritten digit recognition dataset
– 8.1M documents, 784 features, 10 classes, logistic regression
– 32GFlops double precision CPU, 1Gbit network, overhead ~ 0.1s
– $n = \max\left(\frac{8.1\text{M} \cdot 784 \cdot 10 \cdot 0.69}{32\text{G} \cdot (128 \cdot 784 \cdot 10 / 1\text{G} + 2 \cdot 0.1)}, 1\right) = 6$
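The MNIST8M example can be reproduced with a small helper that evaluates the formula above. The function name and argument names are made up for this sketch; the slide appears to round the result down to a whole number of workers, which is an interpretation rather than something the slide states.

```python
import math

def optimal_workers(d, w, p, b, c):
    """n = max(d*w*ln2 / (p*(128*w/b + 2*c)), 1) for batch gradient.

    d: data points, w: model parameters (= operations per point),
    p: FLOPS per worker, b: network bandwidth in bit/s, c: overhead in seconds.
    """
    return max(d * w * math.log(2) / (p * (128.0 * w / b + 2.0 * c)), 1.0)

# MNIST8M with logistic regression: 8.1M points, 784 * 10 parameters,
# 32 GFLOPS per worker, 1 Gbit/s network, 0.1 s overhead.
n_mnist8m = optimal_workers(d=8.1e6, w=784 * 10, p=32e9, b=1e9, c=0.1)
print(n_mnist8m, int(n_mnist8m))
```

The real-valued result lies between 6 and 7, so truncating it gives the 6 workers quoted on the slide.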

[Figure: Spark MLP vs Caffe MLP. Series: MLP (total), MLP (compute), Caffe CPU, Caffe GPU; horizontal ticks 0 to 6, vertical ticks 0 to 100]
Scalability testing
Setup
• MNIST character recognition 60K samples
• 6-layer MLP (784,2500,2000,1500,1000,500,10)
• 12M parameters
• CPU: Xeon X5650 @ 2.67GHz
• GPU: Tesla M2050 3GB, 575MHz
• Caffe (Deep Learning from Berkeley): 1 node
• Spark: 1 master + 5 workers
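The 12M parameter figure in the setup can be checked by counting the weights between consecutive layers. The helper below is a hypothetical convenience function; whether the quoted figure includes bias terms is not stated, so both counts are shown.

```python
def count_parameters(layer_sizes, with_bias=True):
    """Number of weights (and optionally biases) in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out
        if with_bias:
            total += n_out
    return total

layers = [784, 2500, 2000, 1500, 1000, 500, 10]
print(count_parameters(layers, with_bias=False))  # weights only
print(count_parameters(layers))                   # weights plus biases
```

Both counts come to just under 12 million, consistent with the setup.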
Results per iteration
• Single node (both tools double precision)
– 1.6x slower than Caffe CPU (Scala vs C++)
• Scalability
– 5 nodes give a 4.7x speedup, beating Caffe and coming close to GPU
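[Figure axes: seconds versus number of workers; the chart is annotated with the communication cost]
Optimal number of workers from the trade-off formula:
$n = \max\left(\frac{60\text{K} \cdot 12\text{M} \cdot 0.69}{64\text{G} \cdot (128 \cdot 12\text{M} / 950\text{M} + 2 \cdot 0.1)}, 1\right) = 4$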

Conclusions & future work
Conclusions
• Scalable multilayer perceptron is available in Spark 1.5.0
• Extensible internal API for Artificial Neural Networks
– Further contributions are welcome!
• Native BLAS (and GPU) speeds up Spark
• Heuristics for parallelization of batch gradient
Work in progress [SPARK-5575]
• Autoencoder(s)
• Restricted Boltzmann Machines
• Drop-out
• Convolutional neural networks
Future work
• SGD & parameter server
