Innovation and Reinvention Driving Transformation
OCTOBER 9, 2018 | 2018 HPCC Systems® Community Day
Robert K.L. Kennedy
Parallel Distributed Deep Learning on HPCC Systems
Background
• Expand HPCC Systems' complex machine learning capabilities
• The current HPCC Systems ML libraries provide common ML algorithms
• They lack more advanced algorithms, such as Deep Learning
• Goal: extend the HPCC Systems libraries to include Deep Learning
Presentation Outline
• Project Goals
• Problem Statement
• Brief Neural Network Background
• Introduction to Parallel Deep Learning Methods and Techniques
• Overview of the Technologies Used in this Implementation
• Details of the Implementation
• Implementation Validation and Statistical Analysis
• Future Work
Project Goals
• Extend the HPCC Systems libraries to include Deep Learning
  • Specifically, distributed Deep Learning
• Combine HPCC Systems and TensorFlow
  • A widely used open-source DL library
• HPCC Systems provides:
  • The cluster environment
  • Distribution of data and code
  • Communication between nodes
• TensorFlow provides:
  • Deep Learning training algorithms for localized execution
Problem Statement
• Deep Learning models are large and complex
• DL needs large amounts of training data
• Training process:
  • Time requirements increase with data size and model complexity
  • Computation requirements increase as well
• Large multi-node computers are needed to effectively train large, cutting-edge Deep Learning models
Neural Network Background
• Neural network visualization (figure): 2 hidden layers, fully connected, 3-class classification output
• Forward propagation and backpropagation
• Optimize the model with respect to a loss function
  • Gradient Descent, SGD, Batch SGD, Mini-batch SGD (see the sketch after this list)
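The difference between these optimizer variants is easiest to see in code. Below is a minimal mini-batch SGD sketch (an illustration, not the project's code; `grad_fn` and all parameter names are assumptions): a batch size of 1 recovers plain SGD, and a batch size of the full dataset recovers batch gradient descent.

```python
import numpy as np

def minibatch_sgd(grad_fn, w, X, y, lr=0.01, batch_size=32, epochs=10):
    """grad_fn(w, X_batch, y_batch) returns the loss gradient w.r.t. w.

    X and y are NumPy arrays; w is the parameter vector being optimized.
    """
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # One update per mini-batch; batch_size=1 gives plain SGD,
            # batch_size=n gives full-batch gradient descent.
            w = w - lr * grad_fn(w, X[batch], y[batch])
    return w
```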
Parallel Deep Learning
• Data Parallelism
• Model Parallelism
• Synchronous and Asynchronous
• Parallel SGD
[Figure: side-by-side diagrams of Data Parallelism and Model Parallelism]
Implementation – Overview
• ECL/HPCC Systems handles the 'data parallelism' part of the parallelization
• Pyembed handles the localized neural network training
  • Using Python, TensorFlow, Keras, and other libraries
• The implementation is a synchronous data-parallel stochastic gradient descent (sketched after this list)
  • However, it is not limited to using SGD at the localized level
• The implementation is not limited to TensorFlow
  • Using Keras, other Deep Learning 'backends' can be used with no change in code
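As a hedged sketch of what "synchronous data parallel" means here (assuming simple layer-wise weight averaging as the combining rule; the library's actual rule may differ), every node trains a replica on its own data partition, and each round the driver merges the returned weights:

```python
import numpy as np

def average_weights(per_node_weights):
    """per_node_weights: one list of layer-weight arrays per node.

    Returns a single list of layer-weight arrays: the element-wise mean
    of each layer's weights across all nodes (the synchronous step).
    """
    # zip groups the i-th layer's weights across all nodes;
    # each group is stacked and averaged element-wise.
    return [np.mean(np.stack(layers), axis=0)
            for layers in zip(*per_node_weights)]
```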
TensorFlow | Keras
• TensorFlow
  • Google's popular Deep Learning library
• Keras
  • Deep Learning library API that uses TensorFlow or another 'backend'
  • Much less code to produce the same model (see the example below)
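To illustrate the "much less code" point, here is a hedged Keras sketch of a network like the one visualized earlier (two fully connected hidden layers, 3-class softmax output). The layer widths and input size are assumptions for illustration; the equivalent raw-TensorFlow graph code would be considerably longer.

```python
from keras.models import Sequential
from keras.layers import Dense

# Two fully connected hidden layers and a 3-class classification output.
# Widths (16) and input size (8 features) are assumed, not from the slides.
model = Sequential([
    Dense(16, activation='relu', input_shape=(8,)),
    Dense(16, activation='relu'),
    Dense(3, activation='softmax'),   # 3-class output
])
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```

Because Keras abstracts the backend, this same definition runs unchanged on TensorFlow or another supported backend, which is what keeps the implementation backend-agnostic.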
Implementation – HPCC and ECL
• ECL partitions the training data into N partitions
  • Where N is the number of slave nodes (a rough analogue is sketched after this list)
• Pyembed: the plugin that allows ECL to execute Python code
• ECL distributes the Pyembed code along with the data to each node
  • Passes the data, NN model, and metadata into Pyembed as parameters
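For intuition only, a rough Python analogue of the record-to-node routing that ECL performs on the cluster (the real partitioning is done in ECL, and ECL lets you choose the distribution expression; round-robin here is just one simple choice):

```python
def partition(records, n_nodes):
    """Route each record to one of n_nodes buckets (round-robin here)."""
    parts = [[] for _ in range(n_nodes)]
    for i, rec in enumerate(records):
        parts[i % n_nodes].append(rec)   # record i goes to node i mod N
    return parts
```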
Implementation – Pyembed
• Receives parameters at execution time, passed in from ECL
  • Then converts them to types usable by the Python libraries
• Builds a localized NN model from the inputs
  • Recall this is iterative, so the input model changes on each epoch
• Trains the input model on its partition of the data (sketched after this list)
• Returns the updated model weights once completed
  • Does not return any training data
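A hedged sketch of the kind of function the Pyembed side might run on each node (the name, signature, and serialization format are assumptions, not the project's actual code): rebuild the model from the architecture and weights passed in from ECL, train on the local partition only, and hand back weights, never data.

```python
import numpy as np
from keras.models import model_from_json

def train_partition(model_json, weights, X_part, y_part, batch_size=32):
    """Run one local training pass on this node's data partition.

    model_json : Keras architecture, serialized by the driver (assumed).
    weights    : list of layer-weight arrays from the previous round.
    y_part is assumed one-hot encoded for categorical_crossentropy.
    """
    model = model_from_json(model_json)                # rebuild architecture
    model.set_weights([np.array(w) for w in weights])  # restore global state
    model.compile(optimizer='sgd', loss='categorical_crossentropy')
    model.fit(X_part, y_part, epochs=1,                # one local pass per round
              batch_size=batch_size, verbose=0)
    return model.get_weights()                         # weights only, no data
```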
Code Example – Parallel SGD
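The code example on this slide is ECL and is not reproduced in this export. As a stand-in, here is a hedged Python mock of the control flow it implements, synchronous data-parallel SGD, reusing the `train_partition` sketch above. On HPCC Systems the per-node calls run in parallel via Pyembed rather than in a Python loop; this mock only shows the round structure.

```python
import numpy as np

def parallel_sgd(model_json, init_weights, partitions, epochs=10):
    """partitions: list of (X_part, y_part) pairs, one per node (assumed)."""
    weights = init_weights
    for _ in range(epochs):
        # "Dispatch" one local training job per node; on the cluster this
        # is the step ECL distributes and runs concurrently.
        local = [train_partition(model_json, weights, X_p, y_p)
                 for X_p, y_p in partitions]
        # Synchronous step: average the per-node weights layer by layer.
        weights = [np.mean(np.stack(layers), axis=0)
                   for layers in zip(*local)]
    return weights
```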
Case Study – Training Time – Design
• Uses a 'big-data' dataset of 3,692,556 records
  • Each record is 850 bytes
• 80/20 split into training and testing datasets
• We use 10 dataset sizes (drawn from the 80% training split), each with a different class-imbalance ratio
  • 1:1, 1:5, 1:10, 1:25, 1:50, 1:100, 1:500, 1:1000, 1:1500, 1:2000
  • Ranging from 2,240 records to 2,241,120 records
  • 1.9 MB to 1.9 GB in size (record count × 850 bytes)
• Each dataset is run on 5 different cluster sizes
  • 1 node, 2 nodes, 4 nodes, 8 nodes, and 16 nodes
• The cluster is cloud based, and each node has 1 CPU and 3.5 GB of memory
Case Study – Training Time – Results
• Note: the Y scale of the left graph is logarithmic
[Figure: two 'Training Time Comparison' plots of Training Time (seconds) vs. Training Dataset Size (thousands), one on linear axes and one on logarithmic axes, each with one line per cluster size: 1, 2, 4, 8, and 16 nodes]
Case Study – Model Performance
• Uses the same experimental design as the previous case study
• Model performance is slightly improved by the number of nodes
  • See the slope of the red line
• Model-performance effects for the other dataset sizes are out of scope
  • Due to the severe class imbalance
[Figure: 'Model Performance vs. # Nodes' plot of Performance (AUC) vs. number of nodes (1, 2, 4, 8, 16), with one line per data-size imbalance ratio: 1, 5, 10, 25, 50, 100, 500, 1000, 1500, 2000]
Conclusion
• Successful implementation of a synchronous data-parallel deep learning algorithm
• Case studies show the runtime behaves as expected across a wide spectrum of cluster sizes and dataset sizes
• Leveraged HPCC Systems and TensorFlow to bring Deep Learning to HPCC Systems
• Started a new open-source HPCC Systems library for distributed DL
  • With accompanying documentation, test cases, and performance tests
• Provided possible research avenues for future work
Future Work
• Improved data parallelism
  • For HPCC Systems with multiple slave Thor nodes on a single logical computer
• Model parallelism implementation
• Hybrid approach: model and data parallelism
• Asynchronous parallelism
  • This paradigm has additional challenges on a cluster system