Florin Manaila
HPC/Deep Learning Architect and Inventor
IBM Cognitive Systems Europe
florin.manaila@de.ibm.com
August 31, 2018
IBM PowerAI
(Large Model Support and Distributed Deep Learning)
Problem
 Datasets are large and growing
 The size of a batch of samples is large and growing
 Sample sizes are large and growing
 More and more sophisticated models are being designed, some with
hundreds of layers
 GPU memory capacity is growing as well (but slower)
 Limited by cost, technology, physical space
Problem
 So stay within the bounds then?
Well…
We don’t like constraints!
We’ve already paid for memory in this system! Why can’t we use
that?
I’m using a batch size of 1 and am already pushing the limits, I can’t
compromise any more!
PowerAI with Large Model Support (LMS)
[Diagram: an IBM POWER system where the CPU (with DDR4 memory) is connected to the GPU and its graphics memory over NVLink, contrasted with a system where the CPU (with DDR4 memory) reaches the GPU and its graphics memory over PCIe.]
Neural networks are growing deeper and wider; in the near future, the memory needed to keep the network parameters may exceed GPU memory (16 GB/32 GB).
Large Model Support is therefore required in deep learning frameworks (i.e., swapping unused parameters out to the much larger CPU memory, on the order of terabytes).
LMS Usage in IBM-Caffe
LMS enables processing of high-definition images, large models, and higher batch sizes that don't fit in GPU memory today (the maximum GPU memory available on NVIDIA P100 GPUs is 16 GB).
LMS options:
• -lms <size in KB>
• -lms_frac <x>, where 0 < x < 1.0
Example of running IBM Caffe with LMS for a deep residual network (ResNet-152):
/opt/DL/caffe-ibm/bin/caffe train -gpu 0,1,2,3 -solver=solver.prototxt -lms 10000 -lms_frac=0.5
Note that suitable "lms" and "lms_frac" values depend on the following factors (an annotated version of the command appears below):
• Batch size used
• Model used
• Number of GPUs used
• System memory available
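As a sketch, here is the same invocation annotated line by line. The flag interpretations in the comments are our reading of the options above, stated as assumptions rather than definitive semantics:

# Activate the IBM Caffe environment shipped with PowerAI
source /opt/DL/caffe-ibm/bin/caffe-activate

# Train ResNet-152 on 4 GPUs with LMS enabled:
#   -gpu 0,1,2,3     use four GPUs
#   -solver=...      the usual Caffe solver definition
#   -lms 10000       assumption: allocations above 10000 KB become eligible
#                    for swapping out to CPU memory
#   -lms_frac=0.5    assumption: swapping starts once roughly half of GPU
#                    memory is in use
/opt/DL/caffe-ibm/bin/caffe train -gpu 0,1,2,3 -solver=solver.prototxt -lms 10000 -lms_frac=0.5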
LMS in TensorFlow 1.8
Enabling large models and datasets
 TFLMS modifies the TensorFlow graph prior to training to inject swap nodes that swap tensors out of GPU memory to system memory and back in.
 Contributed to the community (not yet accepted): https://github.com/tensorflow/tensorflow/pull/19845
 The large bandwidth of NVLink 2.0 makes this perform well while enabling the graph to train against larger datasets, higher resolutions, and/or larger models.
 Relies on an existing contrib module, tf.contrib.graph_editor: https://www.tensorflow.org/api_docs/python/tf/contrib/graph_editor
[Diagram: normal backpropagation vs. backpropagation in LMS (swap). Normal backpropagation keeps unused parameters in GPU memory across all layers from layer 1 through the loss function; with LMS, unused parameters are swapped out to CPU memory during the forward pass and swapped back into GPU memory when the backward pass needs them.]
TFLMS is designated as
Tech Preview in the
PowerAI 1.5.2 release
LMS Usage in TensorFlow 1.8
An Adam optimizer is inserted before a session starts in order to modify the graph: TFLMS finds the links between forward and backward operations and inserts swap-out/swap-in nodes. TFLMS recognizes backward nodes by checking for the "adam_optimizer" scope (a sketch follows).
[Diagram: the forward and backward parts of the graph joined through the loss, with the adam_optimizer scope marking the backward nodes.]
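A minimal sketch of wiring TFLMS into a script, assuming the LMS class and module path from the pull request linked on the previous slide (tensorflow.contrib.lms); the constructor argument (the set of optimizer scope names) and the toy model are assumptions for illustration:

import tensorflow as tf
from tensorflow.contrib.lms import LMS  # assumed module path from the linked PR

# A toy model: one dense layer over flattened 28x28 inputs.
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

# Build the optimizer inside a named scope so TFLMS can identify the
# backward (gradient) nodes, as described above.
with tf.name_scope('adam_optimizer'):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# Modify the graph BEFORE the session starts: TFLMS injects
# swap-out/swap-in nodes between forward and backward operations.
lms = LMS({'adam_optimizer'})            # assumed signature: optimizer scope names
lms.run(graph=tf.get_default_graph())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ...training loop as usual; large tensors now move between GPU and CPU memory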
Large AI Models Train
~4 Times Faster
POWER9 Servers with NVLink to GPUs
vs
x86 Servers with PCIe to GPUs
[Chart: Caffe with LMS (Large Model Support), runtime of 1000 iterations for a GoogLeNet model on an enlarged ImageNet dataset (2240x2240). Xeon x86 2640v4 with 4x V100 GPUs: 3.1 hours; Power AC922 with 4x V100 GPUs: 49 minutes, i.e. 3.8x faster.]
Deep Learning at work
Available options
[Diagram: spectrum of available options, ranging from longer training time to shorter training time.]
Distributed Deep Learning
Goals
The overall goal of ddlrun is to improve the user experience of DDL users. To this end, the primary features of ddlrun are (see the example after this list):
 Error checking / configuration verification
 Automatic topology detection and rankfile generation
 Automatic mpirun option handling
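For example, a minimal launch exercising those features (host names are hypothetical; --verbose and -H are among the ddlrun arguments listed later in this deck):

$ ddlrun --verbose -H host1,host2 python MY_SCRIPT.py

ddlrun verifies the configuration, infers the topology from the host list, generates a rankfile, and then constructs and displays the mpirun command it executes.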
PowerAI Distributed Deep Learning Library (DDL)
Communication Library for Distributed Deep Learning Training
• Enables deep learning software to scale to 100s of servers with GPUs
• Works across a variety of system sizes
• Works with a variety of network types and switch topologies
Released results @ 256 P100 GPUs:
• Better scaling efficiency than Facebook AI Research: 95% (IBM) vs. <90% (FB)
• Higher image recognition accuracy than Microsoft: 33.8% (IBM) vs. 29.8% (MS)
TECHNICAL DETAILS: https://arxiv.org/abs/1708.02188
What does DDL do?
DDL for TensorFlow
1. Places the job on the GPU local to the CPU (negotiating to use the NVLink interface)
2. Places the job on its nearest neighbor, to leverage NVLink GPU-to-GPU communication
3. Places the job on the same system, on the other socket
4. Sends the job, integrating RDMA over InfiniBand (not present in the frameworks themselves), to a remote system and its first GPU
The same kind of intelligence you see in good HPC job schedulers, but created with specific tuning for our architecture.
PowerAI DDL Dimensions
Communication paths
DDL splits reductions into different dimensions, using different algorithms for each dimension.
[Diagram, shown twice with different communication paths highlighted: two systems, each with two POWER CPUs (DDR4 memory) and four GPUs with their own GPU memory, plus storage, connected to each other through a network switch over InfiniBand/Ethernet, with PCIe links inside each node.]
PowerAI DDL
IBM Distributed Deep Learning Library provides:
 A C library with the functions needed to perform distributed deep learning operations (such as allreduce)
 The library utilizes the MPI and NCCL libraries
 A tool for launching jobs across a cluster, called ddlrun
 Framework integrations:
• A custom operator for TensorFlow, plus Python wrappers around the DDL library
• DDL integration built into caffe-ibm
• PyTorch support will follow soon
PowerAI DDL
allReduce
 allReduce performs an element-wise reduction on arrays of data spread across the nodes of a cluster
 At the end of the allReduce calculation, every node has a copy of the result
 DDL provides support for the sum and average reduction operations (a toy illustration follows)
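A toy illustration of the semantics in plain Python (not the DDL API): two nodes each hold an array of gradients, and after the reduction both hold the same element-wise result:

import numpy as np

node_a = np.array([1.0, 2.0, 3.0])   # gradients computed on node A
node_b = np.array([4.0, 5.0, 6.0])   # gradients computed on node B

total = node_a + node_b              # element-wise sum reduction
average = total / 2                  # DDL also supports the average operation

# After allreduce, every node holds the same copy of the result:
print(total)    # [5. 7. 9.]
print(average)  # [2.5 3.5 4.5]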
What is Special about DDL's allReduce?
DDL's allReduce uses knowledge of the cluster layout to perform reductions between nodes in a specific order.
DDL attempts to perform reductions between nodes in the order that causes the lowest communication overhead. It takes into account the fact that not all nodes are connected with the same interface.
DDL performs best compared to other allreduce libraries when used in a cluster with a non-flat topology.
PowerAI DDL
Run
Common ddlrun arguments:
» --m : Select the DDL mode
» --accelerators : Specify the number of GPUs per node to use
» --tcp : Use TCP communication between nodes instead of InfiniBand
» --mpiarg : Pass along extra MPI arguments
» --verbose : Provide extra output describing the checks that are performed
» --skipchecks : Don't perform network checks

Activate caffe-ibm:
source /opt/DL/caffe-ibm/bin/caffe-activate
Launch a program using ddlrun:
ddlrun -H host1,host2 caffe train --solver=SOLVER.prototxt

Activate ddl-tensorflow:
source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
Launch a program using ddlrun:
ddlrun -H host1,host2 python MY_SCRIPT.py
PowerAI DDL Modes
There are several different reduction algorithms that DDL implements, called modes. The user can choose which mode to use for each dimension of the calculation (an assumed invocation is sketched after this list).
Some of the available modes are:
• b : Uses lower-level NCCL functions. This generally gets the best performance between GPUs in the same node.
• n : Uses higher-level NCCL functions.
• r : Performs a ring-based reduction using MPI commands.
• m : Uses higher-level MPI functions. This can be used on clusters without GPUs.
• p : Determines the best mode to use for each dimension. There is a small startup cost and larger upfront GPU memory usage when using p mode.
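As an illustration only (the deck does not show the mode-string format, so the argument value here is an assumption), selecting p mode through the --m flag listed earlier:

$ ddlrun --m p -H host1,host2 python MY_SCRIPT.py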
PowerAI DDL
Automatic Topology Detection and Rankfile generation
Another common source of frustration when getting started with DDL is the generation of the rankfile. With the ddlrun version in PowerAI 1.5.2, the topology is inferred from the host list: a rankfile is generated automatically by discovering the configuration of the first host in the host list and verifying that all other hosts have the same configuration.

$ ddlrun -H host1,host2,host3,host4 python …..

This command will automatically generate and use the following rankfile:
#host = host1,host2,host3,host4
#aisles = 1
#racks = 1
#nodes = 4
#accelerators = 4
#sockets = 2
#cores = 16
rank 0=host1 slot=0:0-7
rank 4=host1 slot=0:8-15
rank 8=host1 slot=1:0-7
rank 12=host1 slot=1:8-15
rank 1=host2 slot=0:0-7
rank 5=host2 slot=0:8-15
rank 9=host2 slot=1:0-7
rank 13=host2 slot=1:8-15
rank 2=host3 slot=0:0-7
rank 6=host3 slot=0:8-15
rank 10=host3 slot=1:0-7
rank 14=host3 slot=1:8-15
rank 3=host4 slot=0:0-7
rank 7=host4 slot=0:8-15
rank 11=host4 slot=1:0-7
rank 15=host4 slot=1:8-15
PowerAI DDL
Automatic mpirun option handling
There are quite a few options that have to be passed to mpirun every time a job is launched, and some that only need to be passed depending on which version of MPI is being used or how the environment is set up. ddlrun now handles these options automatically, displaying the fully constructed mpirun command it used. E.g.:

$ ddlrun -H host1,host2,host3,host4 python /mnist/mnist-env.py ...
+ mpirun -x PATH -x LD_LIBRARY_PATH -x DDL_OPTIONS -gpu --rankfile /tmp/ddlrun.BxI9Ufpz1Ycz/RANKFILE -n 16 python /mnist/mnist-env.py

If there's ever a need to pass additional options to mpirun, the --mpiarg option can be used. E.g.:

$ ddlrun --mpiarg "-pami_noib" -H host1,host2,host3,host4 python /mnist/mnist-env.py
PowerAI DDL
PowerAI DDL Performance
Training ResNet-50, ImageNet-1K, Caffe

#GPUs:               4     8     16    32    64    128   256
#Nodes:              1     2     4     8     16    32    64
Speedup:             1.0   2.0   3.9   7.9   15.5  30.5  60.6
Scaling efficiency:  1.00  1.00  0.98  0.99  0.97  0.95  0.95

[Plot: speedup vs. number of nodes (1 to 64, both axes doubling per step); the actual results track the ideal linear-scaling line closely.]
PowerAI DDL
PowerAI DDL Performance
Training ResNet-101, ImageNet-22K, Caffe

#GPUs:               4     8     64    256
#Nodes:              1     2     16    64
Speedup:             1.0   1.8   3.9   13.8
Scaling efficiency:  1.00  0.92  0.86  0.85

[Plot: speedup vs. number of nodes (1 to 64), actual results against the ideal scaling line.]

7 hours to 33.8% top-1 accuracy using 256 GPUs
PowerAI DDL
Modifying TensorFlow Scripts
TensorFlow scripts must be modified to use DDL. A DDL-enabled TensorFlow script must do the following (importing ddl.py provides some of this functionality automatically):
• Each process must use the same initial values for trainable model parameters
– The values used in process 0 should be broadcast to every other process. This is done automatically when ddl.py is imported.
• The effective batch size becomes (NUMBER_PROCESSES * BATCH_SIZE). Sections of code that calculate the number of batches processed per step may need to be modified.
– The DDL operator provides the number of processes through the function ddl.size()
PowerAI DDL
Modifying TensorFlow Scripts
• Each process should operate on different input data
– The DDL operator provides a unique ID for each process through the function ddl.rank()
• An allreduce should be performed to sync up the gradients in each process
– Importing ddl.py automatically overwrites Optimizer.apply_gradients and tf.keras.gradients to call DDL's allreduce function
• Some tasks, such as printing and reading and writing files, should only be performed by a single process
– Generally, these sorts of actions are only performed when ddl.rank() == 0. A combined sketch follows.
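Putting these rules together, a minimal sketch of a DDL-enabled script (assuming the ddl Python module described above; the toy model, the load_shard input function, and the hyperparameters are placeholders for illustration):

import tensorflow as tf
import ddl  # the import alone broadcasts rank-0 weights and patches apply_gradients

batch_size = 32                       # per-process batch size
num_samples = 60000

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

# apply_gradients is overwritten on import, so minimize() now allreduces
# the gradients across all processes.
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# The effective batch size is ddl.size() * batch_size, so the number of
# steps per epoch shrinks accordingly.
steps_per_epoch = num_samples // (ddl.size() * batch_size)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(steps_per_epoch):
        # load_shard is a placeholder: each process reads different data,
        # selected by its unique ID ddl.rank().
        xs, ys = load_shard(step, ddl.rank(), ddl.size(), batch_size)
        _, cur_loss = sess.run([train_op, loss], feed_dict={x: xs, y: ys})
        if ddl.rank() == 0 and step % 100 == 0:  # print on one process only
            print('step', step, 'loss', cur_loss)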
PowerAI DDL
TF operator functions/semantics
1. Init function: This must be called before any real TF operators. Typically, this op can be executed on the CPU using an additional session. The input is the DDL configuration, which informs the targeted network topology and the mapping of learners onto it. The output consists of MPI information (rank, size) and the GPU assignment for TF.
PowerAI DDL
TF operator functions/semantics
2. Bcast function: Broadcast synchronizes all the trainable parameters (i.e., weights and biases) before training. Broadcast can be called once init has been called and has completed on the assigned GPU device. Each and every trainable parameter must be broadcast to ensure good convergence.
PowerAI DDL
TF operator functions/semantics
3. AllReduceN function: This is an aggregated version of AllReduce. Essentially, it takes an array of N tensors, performs allreduce in a single shot, and returns an array of N reduced tensors. The benefits of using AllReduceN are better performance and simpler integration. A toy illustration follows.
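Again in plain Python (not the DDL API): each process contributes the same number of tensors, and AllReduceN returns the element-wise reductions of all of them in one shot:

import numpy as np

def allreduce_n(per_process_tensors):
    # per_process_tensors: one list of N arrays per process.
    # Returns the N element-wise sums that every process would hold afterwards.
    return [np.sum(group, axis=0) for group in zip(*per_process_tensors)]

proc0 = [np.array([1.0, 2.0]), np.array([10.0])]   # process 0's N=2 tensors
proc1 = [np.array([3.0, 4.0]), np.array([20.0])]   # process 1's N=2 tensors

print(allreduce_n([proc0, proc1]))  # [array([4., 6.]), array([30.])]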
Thank you
Florin Manaila
Cognitive Systems Europe
HPC/Deep Learning Senior IT Architect
—
florin.manaila@de.ibm.com
+49-7034-274-5294
ibm.com
Editor's Notes
• #5: PowerAI Release 4 Addresses Memory Constraints within Deep Learning

Large Model Support: PowerAI Release 4 addresses a second deep learning scaling challenge: the size of memory available within GPUs. When data scientists develop a deep learning workload, the structure of matrices in the neural model, and the data elements which train the model (in a batch), must sit within the memory on the GPU. As models grow in complexity and datasets increase in size, data scientists are forced to make tradeoffs to stay within the constrained 16 GB memory limits. Instead of training on web-scale images, users can train on high-definition video. Instead of being forced into less complex, shallower deep learning models, customers can develop more accurate models for better inference capability.

With Large Model Support, enabled by PowerAI's unique NVLink connection between CPU (memory) and GPU, the entire model and dataset can be loaded into system memory and cached down to the GPU for action. IBM's testing has enabled increased model sizes (more layers, larger matrices), increased data element sizes (higher-definition images), and larger batch sizes (for faster time to convergence). With Large Model Support, data scientists can load models which span nearly an entire terabyte of system memory across 4 GPUs. The final impact? Customers can now address bigger challenges and get much more work done within a cluster of PowerAI servers, increasing organizational efficiency.

Available Now: PowerAI Release 4 with Large Model Support is available today for download at https://ibm.biz/powerai, and it supports the IBM S822LC for HPC.