Florin Manaila
HPC/Deep Learning Architect and Inventor
IBM Cognitive Systems Europe
florin.manaila@de.ibm.com
August 31, 2018
IBM PowerAI
(Large Model Support and Distributed Deep Learning)
Problem
 Datasets are large and growing
 The size of a batch of samples is large and growing
 Sample sizes are large and growing
 More and more sophisticated models are being designed, some with
hundreds of layers
 GPU memory capacity is growing as well (but slower)
 Limited by cost, technology, physical space
Problem
 So stay within the bounds then?
Well…
We don’t like constraints!
We’ve already paid for memory in this system! Why can’t we use
that?
I’m using a batch size of 1 and am already pushing the limits, I can’t
compromise any more!
PowerAI with Large Model Support (LMS)
[Diagram: an IBM POWER system where the CPU (with DDR4 memory) is connected to the GPU and its graphics memory over NVLink, contrasted with a system where the CPU (with DDR4 memory) reaches the GPU and its graphics memory over PCIe.]
Neural networks are growing deeper and wider; in the near future, the memory needed to keep the network parameters may exceed GPU memory (16 GB/32 GB).
Large Model Support is therefore required in deep learning frameworks (i.e., swapping unused parameters out to the much larger CPU memory, on the order of terabytes).
LMS Usage in IBM-Caffe
LMS enables processing of high-definition images, large models, and higher batch sizes that don't fit in GPU memory today (the maximum GPU memory available on NVIDIA P100 GPUs is 16 GB).
LMS options:
• -lms <size in KB>
• -lms_frac <x>, where 0 < x < 1.0
Example of running IBM Caffe with LMS for a deep residual network (ResNet-152):
/opt/DL/caffe-ibm/bin/caffe train -gpu 0,1,2,3 -solver=solver.prototxt -lms 10000 -lms_frac=0.5
Note that suitable "lms" and "lms_frac" values depend on the following factors (an annotated version of the command appears below):
• Batch size used
• Model used
• Number of GPUs used
• System memory available
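As a sketch, here is the same invocation annotated line by line. The flag interpretations in the comments are our reading of the options above, stated as assumptions rather than definitive semantics:

# Activate the IBM Caffe environment shipped with PowerAI
source /opt/DL/caffe-ibm/bin/caffe-activate

# Train ResNet-152 on 4 GPUs with LMS enabled:
#   -gpu 0,1,2,3     use four GPUs
#   -solver=...      the usual Caffe solver definition
#   -lms 10000       assumption: allocations above 10000 KB become eligible
#                    for swapping out to CPU memory
#   -lms_frac=0.5    assumption: swapping starts once roughly half of GPU
#                    memory is in use
/opt/DL/caffe-ibm/bin/caffe train -gpu 0,1,2,3 -solver=solver.prototxt -lms 10000 -lms_frac=0.5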
LMS in TensorFlow 1.8
Enabling large models and datasets
 TFLMS modifies the TensorFlow graph prior to training to inject swap nodes that swap tensors out of GPU memory to system memory and back in.
 Contributed to the community (not yet accepted): https://github.com/tensorflow/tensorflow/pull/19845
 The large bandwidth of NVLink 2.0 makes this perform well while enabling the graph to train against larger datasets, higher resolutions, and/or larger models.
 Relies on an existing contrib module, tf.contrib.graph_editor: https://www.tensorflow.org/api_docs/python/tf/contrib/graph_editor
[Diagram: normal backpropagation vs. backpropagation in LMS (swap). Normal backpropagation keeps unused parameters in GPU memory across all layers from layer 1 through the loss function; with LMS, unused parameters are swapped out to CPU memory during the forward pass and swapped back into GPU memory when the backward pass needs them.]
TFLMS is designated as
Tech Preview in the
PowerAI 1.5.2 release
LMS Usage in TensorFlow 1.8
An Adam optimizer is inserted before a session starts in order to modify the graph: TFLMS finds the links between forward and backward operations and inserts swap-out/swap-in nodes. TFLMS recognizes backward nodes by checking for the "adam_optimizer" scope (a sketch follows).
[Diagram: the forward and backward parts of the graph joined through the loss, with the adam_optimizer scope marking the backward nodes.]
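A minimal sketch of wiring TFLMS into a script, assuming the LMS class and module path from the pull request linked on the previous slide (tensorflow.contrib.lms); the constructor argument (the set of optimizer scope names) and the toy model are assumptions for illustration:

import tensorflow as tf
from tensorflow.contrib.lms import LMS  # assumed module path from the linked PR

# A toy model: one dense layer over flattened 28x28 inputs.
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

# Build the optimizer inside a named scope so TFLMS can identify the
# backward (gradient) nodes, as described above.
with tf.name_scope('adam_optimizer'):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# Modify the graph BEFORE the session starts: TFLMS injects
# swap-out/swap-in nodes between forward and backward operations.
lms = LMS({'adam_optimizer'})            # assumed signature: optimizer scope names
lms.run(graph=tf.get_default_graph())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ...training loop as usual; large tensors now move between GPU and CPU memory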
Large AI Models Train
~4 Times Faster
POWER9 Servers with NVLink to GPUs
vs
x86 Servers with PCIe to GPUs
[Chart: Caffe with LMS (Large Model Support), runtime of 1000 iterations for a GoogLeNet model on an enlarged ImageNet dataset (2240x2240). Xeon x86 2640v4 with 4x V100 GPUs: 3.1 hours; Power AC922 with 4x V100 GPUs: 49 minutes, i.e. 3.8x faster.]
Deep Learning at work
Available options
[Diagram: spectrum of available options, ranging from longer training time to shorter training time.]
Distributed Deep Learning
Goals
The overall goal of ddlrun is to improve the user experience of DDL users. To this end, the primary features of ddlrun are (see the example after this list):
 Error checking / configuration verification
 Automatic topology detection and rankfile generation
 Automatic mpirun option handling
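For example, a minimal launch exercising those features (host names are hypothetical; --verbose and -H are among the ddlrun arguments listed later in this deck):

$ ddlrun --verbose -H host1,host2 python MY_SCRIPT.py

ddlrun verifies the configuration, infers the topology from the host list, generates a rankfile, and then constructs and displays the mpirun command it executes.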
PowerAI Distributed Deep Learning Library (DDL)
Communication Library for Distributed Deep Learning Training
• Enables deep learning software to scale to 100s of servers with GPUs
• Works across a variety of system sizes
• Works with a variety of network types and switch topologies
Released results @ 256 P100 GPUs:
• Better scaling efficiency than Facebook AI Research: 95% (IBM) vs. <90% (FB)
• Higher image recognition accuracy than Microsoft: 33.8% (IBM) vs. 29.8% (MS)
TECHNICAL DETAILS: https://arxiv.org/abs/1708.02188
What does DDL do?
DDL for TensorFlow
1. Places the job on the GPU local to the CPU (negotiating to use the NVLink interface)
2. Places the job on its nearest neighbor, to leverage NVLink GPU-to-GPU communication
3. Places the job on the same system, on the other socket
4. Sends the job, integrating RDMA over InfiniBand (not present in the frameworks themselves), to a remote system and its first GPU
The same kind of intelligence you see in good HPC job schedulers, but created with specific tuning for our architecture.
PowerAI DDL Dimensions
Communication paths
DDL splits reductions into different dimensions, using different algorithms for each dimension.
[Diagram, shown twice with different communication paths highlighted: two systems, each with two POWER CPUs (DDR4 memory) and four GPUs with their own GPU memory, plus storage, connected to each other through a network switch over InfiniBand/Ethernet, with PCIe links inside each node.]
PowerAI DDL
IBM Distributed Deep Learning Library provides:
 A C library with the functions needed to perform distributed deep learning operations (such as allreduce)
 The library utilizes the MPI and NCCL libraries
 A tool for launching jobs across a cluster, called ddlrun
 Framework integrations:
• A custom operator for TensorFlow, plus Python wrappers around the DDL library
• DDL integration built into caffe-ibm
• PyTorch support will follow soon
PowerAI DDL
allReduce
 allReduce performs an element-wise reduction on arrays of data spread across the nodes of a cluster
 At the end of the allReduce calculation, every node has a copy of the result
 DDL provides support for the sum and average reduction operations (a toy illustration follows)
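A toy illustration of the semantics in plain Python (not the DDL API): two nodes each hold an array of gradients, and after the reduction both hold the same element-wise result:

import numpy as np

node_a = np.array([1.0, 2.0, 3.0])   # gradients computed on node A
node_b = np.array([4.0, 5.0, 6.0])   # gradients computed on node B

total = node_a + node_b              # element-wise sum reduction
average = total / 2                  # DDL also supports the average operation

# After allreduce, every node holds the same copy of the result:
print(total)    # [5. 7. 9.]
print(average)  # [2.5 3.5 4.5]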
What is Special about DDL's allReduce?
DDL's allReduce uses knowledge of the cluster layout to perform reductions between nodes in a specific order.
DDL attempts to perform reductions between nodes in the order that causes the lowest communication overhead. It takes into account the fact that not all nodes are connected with the same interface.
DDL performs best compared to other allreduce libraries when used in a cluster with a non-flat topology.
PowerAI DDL
Run
Common ddlrun arguments:
» --m : Select the DDL mode
» --accelerators : Specify the number of GPUs per node to use
» --tcp : Use TCP communication between nodes instead of InfiniBand
» --mpiarg : Pass along extra MPI arguments
» --verbose : Provide extra output describing the checks that are performed
» --skipchecks : Don't perform network checks

Activate caffe-ibm:
source /opt/DL/caffe-ibm/bin/caffe-activate
Launch a program using ddlrun:
ddlrun -H host1,host2 caffe train --solver=SOLVER.prototxt

Activate ddl-tensorflow:
source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
Launch a program using ddlrun:
ddlrun -H host1,host2 python MY_SCRIPT.py
PowerAI DDL Modes
There are several different reduction algorithms that DDL implements, called modes. The user can choose which mode to use for each dimension of the calculation (an assumed invocation is sketched after this list).
Some of the available modes are:
• b : Uses lower-level NCCL functions. This generally gets the best performance between GPUs in the same node.
• n : Uses higher-level NCCL functions.
• r : Performs a ring-based reduction using MPI commands.
• m : Uses higher-level MPI functions. This can be used on clusters without GPUs.
• p : Determines the best mode to use for each dimension. There is a small startup cost and larger upfront GPU memory usage when using p mode.
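As an illustration only (the deck does not show the mode-string format, so the argument value here is an assumption), selecting p mode through the --m flag listed earlier:

$ ddlrun --m p -H host1,host2 python MY_SCRIPT.py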
PowerAI DDL
Automatic Topology Detection and Rankfile generation
Another common source of frustration when getting started with DDL is the generation of the rankfile. With the ddlrun version in PowerAI 1.5.2, the topology is inferred from the host list: a rankfile is generated automatically by discovering the configuration of the first host in the host list and verifying that all other hosts have the same configuration.

$ ddlrun -H host1,host2,host3,host4 python …..

This command will automatically generate and use the following rankfile:
#host = host1,host2,host3,host4
#aisles = 1
#racks = 1
#nodes = 4
#accelerators = 4
#sockets = 2
#cores = 16
rank 0=host1 slot=0:0-7
rank 4=host1 slot=0:8-15
rank 8=host1 slot=1:0-7
rank 12=host1 slot=1:8-15
rank 1=host2 slot=0:0-7
rank 5=host2 slot=0:8-15
rank 9=host2 slot=1:0-7
rank 13=host2 slot=1:8-15
rank 2=host3 slot=0:0-7
rank 6=host3 slot=0:8-15
rank 10=host3 slot=1:0-7
rank 14=host3 slot=1:8-15
rank 3=host4 slot=0:0-7
rank 7=host4 slot=0:8-15
rank 11=host4 slot=1:0-7
rank 15=host4 slot=1:8-15
PowerAI DDL
Automatic mpirun option handling
There are quite a few options that have to be passed to mpirun every time a job is launched, and some that only need to be passed depending on which version of MPI is being used or how the environment is set up. ddlrun now handles these options automatically, displaying the fully constructed mpirun command it used. E.g.:

$ ddlrun -H host1,host2,host3,host4 python /mnist/mnist-env.py ...
+ mpirun -x PATH -x LD_LIBRARY_PATH -x DDL_OPTIONS -gpu --rankfile /tmp/ddlrun.BxI9Ufpz1Ycz/RANKFILE -n 16 python /mnist/mnist-env.py

If there's ever a need to pass additional options to mpirun, the --mpiarg option can be used. E.g.:

$ ddlrun --mpiarg "-pami_noib" -H host1,host2,host3,host4 python /mnist/mnist-env.py
PowerAI DDL
PowerAI DDL Performance
Training ResNet-50, ImageNet-1K, Caffe

#GPUs:               4     8     16    32    64    128   256
#Nodes:              1     2     4     8     16    32    64
Speedup:             1.0   2.0   3.9   7.9   15.5  30.5  60.6
Scaling efficiency:  1.00  1.00  0.98  0.99  0.97  0.95  0.95

[Plot: speedup vs. number of nodes (1 to 64, both axes doubling per step); the actual results track the ideal linear-scaling line closely.]
PowerAI DDL
PowerAI DDL Performance
Training ResNet-101, ImageNet-22K, Caffe

#GPUs:               4     8     64    256
#Nodes:              1     2     16    64
Speedup:             1.0   1.8   3.9   13.8
Scaling efficiency:  1.00  0.92  0.86  0.85

[Plot: speedup vs. number of nodes (1 to 64), actual results against the ideal scaling line.]

7 hours to 33.8% top-1 accuracy using 256 GPUs
PowerAI DDL
Modifying TensorFlow Scripts
TensorFlow scripts must be modified to use DDL. A DDL-enabled TensorFlow script must do the following (importing ddl.py provides some of this functionality automatically):
• Each process must use the same initial values for trainable model parameters
– The values used in process 0 should be broadcast to every other process. This is done automatically when ddl.py is imported.
• The effective batch size becomes (NUMBER_PROCESSES * BATCH_SIZE). Sections of code that calculate the number of batches processed per step may need to be modified.
– The DDL operator provides the number of processes through the function ddl.size()
PowerAI DDL
Modifying TensorFlow Scripts
• Each process should operate on different input data
– The DDL operator provides a unique ID for each process through the function ddl.rank()
• An allreduce should be performed to sync up the gradients in each process
– Importing ddl.py automatically overwrites Optimizer.apply_gradients and tf.keras.gradients to call DDL's allreduce function
• Some tasks, such as printing and reading and writing files, should only be performed by a single process
– Generally, these sorts of actions are only performed when ddl.rank() == 0. A combined sketch follows.
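Putting these rules together, a minimal sketch of a DDL-enabled script (assuming the ddl Python module described above; the toy model, the load_shard input function, and the hyperparameters are placeholders for illustration):

import tensorflow as tf
import ddl  # the import alone broadcasts rank-0 weights and patches apply_gradients

batch_size = 32                       # per-process batch size
num_samples = 60000

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

# apply_gradients is overwritten on import, so minimize() now allreduces
# the gradients across all processes.
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# The effective batch size is ddl.size() * batch_size, so the number of
# steps per epoch shrinks accordingly.
steps_per_epoch = num_samples // (ddl.size() * batch_size)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(steps_per_epoch):
        # load_shard is a placeholder: each process reads different data,
        # selected by its unique ID ddl.rank().
        xs, ys = load_shard(step, ddl.rank(), ddl.size(), batch_size)
        _, cur_loss = sess.run([train_op, loss], feed_dict={x: xs, y: ys})
        if ddl.rank() == 0 and step % 100 == 0:  # print on one process only
            print('step', step, 'loss', cur_loss)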
PowerAI DDL
TF operator functions/semantics
1. Init function: This must be called before any real TF operators. Typically, this op can be executed on the CPU using an additional session. The input is the DDL configuration, which informs the targeted network topology and the mapping of learners onto it. The output consists of MPI information (rank, size) and the GPU assignment for TF.
PowerAI DDL
TF operator functions/semantics
2. Bcast function: Broadcast synchronizes all the trainable parameters (i.e., weights and biases) before training. Broadcast can be called once init has been called and has completed on the assigned GPU device. Each and every trainable parameter must be broadcast to ensure good convergence.
PowerAI DDL
TF operator functions/semantics
3. AllReduceN function: This is an aggregated version of AllReduce. Essentially, it takes an array of N tensors, performs allreduce in a single shot, and returns an array of N reduced tensors. The benefits of using AllReduceN are better performance and simpler integration. A toy illustration follows.
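Again in plain Python (not the DDL API): each process contributes the same number of tensors, and AllReduceN returns the element-wise reductions of all of them in one shot:

import numpy as np

def allreduce_n(per_process_tensors):
    # per_process_tensors: one list of N arrays per process.
    # Returns the N element-wise sums that every process would hold afterwards.
    return [np.sum(group, axis=0) for group in zip(*per_process_tensors)]

proc0 = [np.array([1.0, 2.0]), np.array([10.0])]   # process 0's N=2 tensors
proc1 = [np.array([3.0, 4.0]), np.array([20.0])]   # process 1's N=2 tensors

print(allreduce_n([proc0, proc1]))  # [array([4., 6.]), array([30.])]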
Thank you
Florin Manaila
Cognitive Systems Europe
HPC/Deep Learning Senior IT Architect
—
florin.manaila@de.ibm.com
+49-7034-274-5294
ibm.com
Editor's Notes
• #5: PowerAI Release 4 Addresses Memory Constraints within Deep Learning

Large Model Support: PowerAI Release 4 addresses a second deep learning scaling challenge: the size of memory available within GPUs. When data scientists develop a deep learning workload, the structure of matrices in the neural model, and the data elements which train the model (in a batch), must sit within the memory on the GPU. As models grow in complexity and datasets increase in size, data scientists are forced to make tradeoffs to stay within the constrained 16 GB memory limits. Instead of training on web-scale images, users can train on high-definition video. Instead of being forced into less complex, shallower deep learning models, customers can develop more accurate models for better inference capability.

With Large Model Support, enabled by PowerAI's unique NVLink connection between CPU (memory) and GPU, the entire model and dataset can be loaded into system memory and cached down to the GPU for action. IBM's testing has enabled increased model sizes (more layers, larger matrices), increased data element sizes (higher-definition images), and larger batch sizes (for faster time to convergence). With Large Model Support, data scientists can load models which span nearly an entire terabyte of system memory across 4 GPUs. The final impact? Customers can now address bigger challenges and get much more work done within a cluster of PowerAI servers, increasing organizational efficiency.

Available Now: PowerAI Release 4 with Large Model Support is available today for download at https://ibm.biz/powerai, and it supports the IBM S822LC for HPC.