Increasing cluster performance by combining rCUDA with Slurm
Federico Silla
Technical University of Valencia, Spain
HPC Advisory Council Switzerland Conference 2016 2/56
Outline
rCUDA … what’s that?
HPC Advisory Council Switzerland Conference 2016 3/56
Basics of CUDA
GPU
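The "Basics of CUDA" figure depicts the usual offload model: the application runs on the CPU and moves data and kernels to a local GPU across PCIe. As a reminder, below is a minimal sketch of that pattern (the kernel, names and sizes are illustrative, not taken from the talk); rCUDA intercepts exactly these runtime calls, so the same source also runs when the GPU is remote.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: scale a vector in place on the GPU.
__global__ void scale(float *v, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // allocate on the (possibly remote) GPU
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // H2D transfer
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // kernel launch
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // D2H transfer
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);   // expect 2.0
    free(h);
    return 0;
}
```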
HPC Advisory Council Switzerland Conference 2016 4/56
rCUDA … remote CUDA
No GPU
HPC Advisory Council Switzerland Conference 2016 5/56
rCUDA … remote CUDA
A software technology that enables a more flexible use of GPUs in computing facilities.
HPC Advisory Council Switzerland Conference 2016 6/56
Basics of rCUDA
HPC Advisory Council Switzerland Conference 2016 7/56
Basics of rCUDA
HPC Advisory Council Switzerland Conference 2016 8/56
Basics of rCUDA
HPC Advisory Council Switzerland Conference 2016 9/56
Cluster vision with rCUDA
• rCUDA allows a new vision of a GPU deployment, moving from the usual cluster configuration, in which every node has its own CPUs, main memory, network adapter and PCIe-attached GPUs, to a logical configuration in which all the GPUs of the cluster form a shared pool that any node can reach over the interconnection network through logical connections.
(Figure: physical configuration versus logical configuration of the cluster.)
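Since rCUDA works at the CUDA API level, an unmodified application simply enumerates whatever GPUs the rCUDA client has been configured to expose; the client is typically configured on each node through environment variables (for instance RCUDA_DEVICE_COUNT and RCUDA_DEVICE_0=server:0; check the exact names and syntax in the rCUDA user guide for your release). A minimal sketch of the application side:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Enumerate the GPUs visible to the CUDA runtime. Under plain CUDA these are
// the PCIe-attached local GPUs; under rCUDA they are whatever remote GPUs the
// client was configured to expose on this node.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) visible\n", count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("  device %d: %s, %.1f GB\n",
               d, prop.name, prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```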
HPC Advisory Council Switzerland Conference 2016 10/56
Outline
Two questions:
• Why should we need rCUDA?
• rCUDA … slower CUDA?
HPC Advisory Council Switzerland Conference 2016 11/56
Outline
Two questions:
• Why should we need rCUDA?
• rCUDA … slower CUDA?
HPC Advisory Council Switzerland Conference 2016 12/56
Concern with rCUDA
The main concern with rCUDA is the reduced bandwidth to the remote GPU.
HPC Advisory Council Switzerland Conference 2016 13/56
Using InfiniBand networks
HPC Advisory Council Switzerland Conference 2016 14/56
Initial transfers within rCUDA
(Figure: H2D and D2H bandwidth for pageable and pinned host memory, comparing the original and optimized rCUDA transport over FDR and EDR InfiniBand.)
HPC Advisory Council Switzerland Conference 2016 15/56
Performance depending on network
• CUDASW++: bioinformatics software for Smith-Waterman protein database searches
(Figure: execution time (s) and rCUDA overhead (%) versus sequence length, for CUDA and for rCUDA over FDR InfiniBand, QDR InfiniBand and GbE.)
HPC Advisory Council Switzerland Conference 2016 16/56
Optimized transfers within rCUDA
(Figure: same H2D/D2H bandwidth benchmarks; with the optimized transfers, pinned-memory copies in both directions reach almost 100% of the available bandwidth.)
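The pageable/pinned distinction in these plots is the standard CUDA one: page-locked (pinned) host buffers let the transport stream data at the full network rate, whereas pageable buffers force an extra staging copy. The sketch below shows how the two cases are exercised; the buffer size and iteration count are arbitrary and are not the settings behind the slide's measurements.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Measure host-to-device bandwidth with a given host buffer.
static double h2d_bandwidth_gbs(void *host, void *dev, size_t bytes, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes * (double)iters / 1e9) / (ms / 1e3);
}

int main() {
    const size_t bytes = 64ull << 20;   // 64 MB per transfer (arbitrary choice)
    const int iters = 20;

    void *dev;       cudaMalloc(&dev, bytes);
    void *pageable = malloc(bytes);                    // ordinary pageable host memory
    void *pinned;    cudaMallocHost(&pinned, bytes);   // page-locked host memory

    printf("pageable H2D: %.2f GB/s\n", h2d_bandwidth_gbs(pageable, dev, bytes, iters));
    printf("pinned   H2D: %.2f GB/s\n", h2d_bandwidth_gbs(pinned,   dev, bytes, iters));

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}
```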
HPC Advisory Council Switzerland Conference 2016 17/56
rCUDA optimizations on applications
• Several applications executed with CUDA and rCUDA
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand
(Figure: execution-time comparison; lower is better.)
HPC Advisory Council Switzerland Conference 2016 18/56
Outline
Two questions:
• Why should we need rCUDA?
• rCUDA … slower CUDA?
HPC Advisory Council Switzerland Conference 2016 19/56
Outline
rCUDA improves cluster performance
HPC Advisory Council Switzerland Conference 2016 20/56
Test bench for studying rCUDA+Slurm
• Dual socket E5-2620v2 Intel Xeon + 32 GB RAM + K20 GPU
• FDR InfiniBand based cluster
• Three cluster sizes: 4+1, 8+1 and 16+1 GPU nodes (in each case the extra node runs the Slurm scheduler)
HPC Advisory Council Switzerland Conference 2016 21/56
Applications for studying rCUDA+Slurm
• Applications used for the tests (execution time; GPUs used; GPU memory footprint):
  • GPU-Blast (21 seconds; 1 GPU; 1599 MB)
  • LAMMPS (15 seconds; 4 GPUs; 876 MB)
  • MCUDA-MEME (165 seconds; 4 GPUs; 151 MB)
  • GROMACS (2 nodes) (167 seconds) (non-GPU)
  • NAMD (4 nodes) (11 minutes) (non-GPU)
  • BarraCUDA (10 minutes; 1 GPU; 3319 MB)
  • GPU-LIBSVM (5 minutes; 1 GPU; 145 MB)
  • MUMmerGPU (5 minutes; 1 GPU; 2804 MB)
• The applications are grouped into Set 1 (short execution time) and Set 2 (long execution time), plus the non-GPU codes
• Three workloads: Set 1, Set 2, and Set 1 + Set 2
HPC Advisory Council Switzerland Conference 2016 22/56
Workloads for studying rCUDA+Slurm (I)
HPC Advisory Council Switzerland Conference 2016 23/56
Performance of rCUDA+Slurm (I)
HPC Advisory Council Switzerland Conference 2016 24/56
Workloads for studying rCUDA+Slurm (II)
HPC Advisory Council Switzerland Conference 2016 25/56
Performance of rCUDA+Slurm (II)
HPC Advisory Council Switzerland Conference 2016 26/56
Outline
Why does rCUDA improve cluster performance?
HPC Advisory Council Switzerland Conference 2016 27/56
1st reason for improved performance
• Non-accelerated applications keep GPUs idle in the nodes where they use all the cores
Hybrid MPI + shared-memory non-accelerated applications usually span all the cores of a node (across n nodes). A CPU-only application spreading over these nodes makes their GPUs unavailable for accelerated applications.
(Figure: nodes 1..n, each with CPUs, RAM, a PCIe-attached GPU and a network adapter, joined by the interconnection network.)
HPC Advisory Council Switzerland Conference 2016 28/56
2nd reason for improved performance (I)
• Accelerated applications keep CPUs idle in the nodes where they execute
Hybrid MPI + shared-memory non-accelerated applications usually span all the cores of a node (across n nodes); an accelerated application using just one CPU core may prevent other jobs from being dispatched to that node.
(Figure: the same cluster diagram as before.)
HPC Advisory Council Switzerland Conference 2016 29/56
2nd reason for improved performance (II)
• Accelerated applications keep CPUs idle in the nodes where they execute
An accelerated MPI application using just one CPU core per node may keep part of the cluster busy, even though hybrid MPI + shared-memory non-accelerated applications usually need all the cores of a node.
(Figure: the same cluster diagram as before.)
HPC Advisory Council Switzerland Conference 2016 30/56
3rd reason for improved performance
• Do applications completely squeeze the GPUs available in the cluster?
• When a GPU is assigned to an application, the computational resources inside the GPU may not be fully used:
  • the application presents a low level of parallelism
  • CPU code is being executed (GPU assigned ≠ GPU working)
  • GPU cores stall due to lack of data
  • etc.
(Figure: the same cluster diagram as before.)
HPC Advisory Council Switzerland Conference 2016 31/56
GPU usage of GPU-Blast
(Figure: GPU utilization trace; during parts of the run the GPU is assigned but not used.)
HPC Advisory Council Switzerland Conference 2016 32/56
GPU usage of CUDA-MEME
(Figure: GPU utilization trace; utilization stays far from the maximum.)
HPC Advisory Council Switzerland Conference 2016 33/56
GPU usage of LAMMPS
(Figure: GPU utilization trace; the GPU is assigned but not used for part of the run.)
HPC Advisory Council Switzerland Conference 2016 34/56
GPU allocation vs GPU utilization
(Figure: GPUs are assigned but not used for a significant fraction of the time.)
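The allocation-versus-utilization gap shown here can be observed on any node with a small sampling loop. The sketch below uses NVML (the library behind nvidia-smi) to print GPU and memory-controller utilization once per second; it only illustrates how such traces are obtained and is not the tool used for these slides. Compile with nvcc or g++ and link with -lnvml.

```cuda
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

// Sample the utilization of GPU 0 once per second for one minute.
int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML initialization failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int s = 0; s < 60; ++s) {
        nvmlUtilization_t util;
        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
            printf("t=%2ds  gpu=%3u%%  mem=%3u%%\n", s, util.gpu, util.memory);
        sleep(1);   // one sample per second
    }

    nvmlShutdown();
    return 0;
}
```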
HPC Advisory Council Switzerland Conference 2016 35/56
Sharing a GPU among jobs: GPU-Blast
(Figure: two concurrent instances of GPU-Blast sharing a GPU; one instance required about 51 seconds.)
HPC Advisory Council Switzerland Conference 2016 36/56
Sharing a GPU among jobs: GPU-Blast
(Figure: GPU utilization trace of the first of the two concurrent GPU-Blast instances.)
HPC Advisory Council Switzerland Conference 2016 37/56
Sharing a GPU among jobs: GPU-Blast
(Figure: GPU utilization traces of the first and second of the two concurrent GPU-Blast instances.)
HPC Advisory Council Switzerland Conference 2016 38/56
Sharing a GPU among jobs
GPU memory footprints (K20 GPU):
• LAMMPS: 876 MB
• mCUDA-MEME: 151 MB
• BarraCUDA: 3319 MB
• MUMmerGPU: 2104 MB
• GPU-LIBSVM: 145 MB
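These footprints are small compared with the memory of a K20, which is why several of these jobs can share a single (remote) GPU. Jobs sharing a GPU still allocate from the same physical device memory, so a cautious job can check the free memory before allocating its working set; below is a minimal sketch, where the "required" figure is just a placeholder based on the LAMMPS footprint above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Before allocating its working set, a job sharing a GPU with others can
// verify that enough device memory is still free.
int main() {
    const size_t required = 876ull << 20;   // placeholder: ~876 MB, like LAMMPS above

    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("GPU memory: %.0f MB free of %.0f MB\n",
           free_bytes / 1048576.0, total_bytes / 1048576.0);

    if (free_bytes < required) {
        fprintf(stderr, "not enough free GPU memory for this job, aborting\n");
        return 1;
    }

    void *workspace;
    cudaMalloc(&workspace, required);   // safe to allocate now (modulo races with other jobs)
    // ... run the job ...
    cudaFree(workspace);
    return 0;
}
```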
HPC Advisory Council Switzerland Conference 2016 39/56
Outline
Other reasons for using rCUDA?
HPC Advisory Council Switzerland Conference 2016 40/56
Cheaper cluster upgrade
• Let's suppose that a cluster without GPUs needs to be upgraded to use GPUs
• GPUs require large power supplies: are the power supplies already installed in the nodes large enough?
• GPUs require large amounts of space: does the current form factor of the nodes allow GPUs to be installed?
The answer to both questions is usually "NO"
HPC Advisory Council Switzerland Conference 2016 41/56
Cheaper cluster upgrade
Approach 1: augment the cluster with some CUDA GPU-enabled nodes → only those GPU-enabled nodes can execute accelerated applications
HPC Advisory Council Switzerland Conference 2016 42/56
Cheaper cluster upgrade
Approach 2: augment the cluster with some rCUDA servers → all nodes can execute accelerated applications
HPC Advisory Council Switzerland Conference 2016 43/56
Cheaper cluster upgrade
• Dual socket E5-2620v2 Intel Xeon + 32 GB RAM + K20 GPU
• FDR InfiniBand based cluster
• 16 nodes without GPU + 1 node with 4 GPUs
HPC Advisory Council Switzerland Conference 2016 44/56
More workloads for studying rCUDA+Slurm
HPC Advisory Council Switzerland Conference 2016 45/56
Performance
(Figure: performance results; annotated changes of -68%/-60%, -63%/-56% and +131%/+119% with respect to the baseline.)
HPC Advisory Council Switzerland Conference 2016 46/56
Outline
Additional reasons for using rCUDA?
HPC Advisory Council Switzerland Conference 2016 47/56
#1: More GPUs for a single application
64 GPUs!
HPC Advisory Council Switzerland Conference 2016 48/56
#1: More GPUs for a single application
• MonteCarlo Multi-GPU (from the NVIDIA samples), run with FDR InfiniBand and NVIDIA Tesla K20 GPUs
(Figure: execution time, lower is better, and throughput, higher is better.)
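With rCUDA the remote GPUs appear to the application as ordinary local devices, so standard multi-GPU code such as the MonteCarlo sample scales just by iterating over cudaSetDevice. Below is a hedged sketch of that pattern; the per-device kernel is a stand-in, not the MonteCarlo computation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void busy_work(float *out) {
    // Stand-in for the per-GPU portion of a real multi-GPU computation.
    out[threadIdx.x] = threadIdx.x * 0.5f;
}

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);        // with rCUDA this may report many remote GPUs
    printf("distributing work across %d GPU(s)\n", ngpus);

    std::vector<float*> buf(ngpus);

    // Launch one piece of work on each device; kernel launches are asynchronous,
    // so all devices (local or remote) compute in parallel.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&buf[d], 256 * sizeof(float));
        busy_work<<<1, 256>>>(buf[d]);
    }

    // Wait for every device to finish, then release its buffer.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(buf[d]);
    }
    return 0;
}
```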
HPC Advisory Council Switzerland Conference 2016 49/56
#2: Virtual machines can share GPUs
• With PCI passthrough, the GPU is assigned exclusively to a single virtual machine
• Concurrent usage of the GPU is not possible
HPC Advisory Council Switzerland Conference 2016 50/56
#2: Virtual machines can share GPUs
(Figure: GPU sharing among virtual machines both when a high-performance network is available and when only a low-performance network is available.)
HPC Advisory Council Switzerland Conference 2016 51/56
#3: GPU task migration
• Box A has 4 GPUs but only one is busy
• Box B has 8 GPUs but only two are busy
1. Move the jobs from Box B to Box A and switch off Box B
2. Migration should be transparent to applications (decided by the global scheduler)
Migration is performed at GPU granularity.
HPC Advisory Council Switzerland Conference 2016 52/56
#3: GPU task migration
Job granularity instead of GPU granularity.
(Figure: example timeline of migrating jobs at job granularity.)
HPC Advisory Council Switzerland Conference 2016 53/56
Outline
… in summary …
HPC Advisory Council Switzerland Conference 2016 54/56
Pros and cons of rCUDA
• Pros:
1. Many GPUs for a single application
2. Concurrent GPU access for virtual machines
3. Increased cluster throughput
4. Similar performance with a smaller investment
5. Easier (cheaper) cluster upgrade
6. Migration of GPU jobs
7. Reduced energy consumption
8. Increased GPU utilization
• Cons:
1. Reduced bandwidth to the remote GPU (really a concern??)
HPC Advisory Council Switzerland Conference 2016 55/56
Get a free copy of rCUDA at
http://www.rcuda.net
@rcuda_
More than 650 requests worldwide
rCUDA is developed by the Technical University of Valencia
HPC Advisory Council Switzerland Conference 2016 56/56
Thanks!
Questions?
rCUDA is developed by the Technical University of Valencia