Increasing cluster performance by combining rCUDA with Slurm
Federico Silla
Technical University of Valencia, Spain
HPC Advisory Council Switzerland Conference 2016 2/56
Outline
rCUDA … what’s that?
HPC Advisory Council Switzerland Conference 2016 3/56
Basics of CUDA
GPU
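The "Basics of CUDA" figure depicts the usual offload model: the application runs on the CPU and moves data and kernels to a local GPU across PCIe. As a reminder, below is a minimal sketch of that pattern (the kernel, names and sizes are illustrative, not taken from the talk); rCUDA intercepts exactly these runtime calls, so the same source also runs when the GPU is remote.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: scale a vector in place on the GPU.
__global__ void scale(float *v, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // allocate on the (possibly remote) GPU
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // H2D transfer
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // kernel launch
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // D2H transfer
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);   // expect 2.0
    free(h);
    return 0;
}
```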
HPC Advisory Council Switzerland Conference 2016 4/56
rCUDA … remote CUDA
No GPU
HPC Advisory Council Switzerland Conference 2016 5/56
rCUDA … remote CUDA
A software technology that enables a more flexible use of GPUs in computing facilities.
HPC Advisory Council Switzerland Conference 2016 6/56
Basics of rCUDA
HPC Advisory Council Switzerland Conference 2016 7/56
Basics of rCUDA
HPC Advisory Council Switzerland Conference 2016 8/56
Basics of rCUDA
HPC Advisory Council Switzerland Conference 2016 9/56
Cluster vision with rCUDA
• rCUDA allows a new vision of a GPU deployment, moving from the usual cluster configuration, in which every node has its own CPUs, main memory, network adapter and PCIe-attached GPUs, to a logical configuration in which all the GPUs of the cluster form a shared pool that any node can reach over the interconnection network through logical connections.
(Figure: physical configuration versus logical configuration of the cluster.)
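Since rCUDA works at the CUDA API level, an unmodified application simply enumerates whatever GPUs the rCUDA client has been configured to expose; the client is typically configured on each node through environment variables (for instance RCUDA_DEVICE_COUNT and RCUDA_DEVICE_0=server:0; check the exact names and syntax in the rCUDA user guide for your release). A minimal sketch of the application side:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Enumerate the GPUs visible to the CUDA runtime. Under plain CUDA these are
// the PCIe-attached local GPUs; under rCUDA they are whatever remote GPUs the
// client was configured to expose on this node.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) visible\n", count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("  device %d: %s, %.1f GB\n",
               d, prop.name, prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```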
HPC Advisory Council Switzerland Conference 2016 10/56
Outline
Two questions:
• Why should we need rCUDA?
• rCUDA … slower CUDA?
HPC Advisory Council Switzerland Conference 2016 11/56
Outline
Two questions:
• Why should we need rCUDA?
• rCUDA … slower CUDA?
HPC Advisory Council Switzerland Conference 2016 12/56
Concern with rCUDA
The main concern with rCUDA is the reduced bandwidth to the remote GPU.
HPC Advisory Council Switzerland Conference 2016 13/56
Using InfiniBand networks
HPC Advisory Council Switzerland Conference 2016 14/56
Initial transfers within rCUDA
(Figure: H2D and D2H bandwidth for pageable and pinned host memory, comparing the original and optimized rCUDA transport over FDR and EDR InfiniBand.)
HPC Advisory Council Switzerland Conference 2016 15/56
Performance depending on network
• CUDASW++: bioinformatics software for Smith-Waterman protein database searches
(Figure: execution time (s) and rCUDA overhead (%) versus sequence length, for CUDA and for rCUDA over FDR InfiniBand, QDR InfiniBand and GbE.)
HPC Advisory Council Switzerland Conference 2016 16/56
Optimized transfers within rCUDA
(Figure: same H2D/D2H bandwidth benchmarks; with the optimized transfers, pinned-memory copies in both directions reach almost 100% of the available bandwidth.)
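The pageable/pinned distinction in these plots is the standard CUDA one: page-locked (pinned) host buffers let the transport stream data at the full network rate, whereas pageable buffers force an extra staging copy. The sketch below shows how the two cases are exercised; the buffer size and iteration count are arbitrary and are not the settings behind the slide's measurements.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Measure host-to-device bandwidth with a given host buffer.
static double h2d_bandwidth_gbs(void *host, void *dev, size_t bytes, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes * (double)iters / 1e9) / (ms / 1e3);
}

int main() {
    const size_t bytes = 64ull << 20;   // 64 MB per transfer (arbitrary choice)
    const int iters = 20;

    void *dev;       cudaMalloc(&dev, bytes);
    void *pageable = malloc(bytes);                    // ordinary pageable host memory
    void *pinned;    cudaMallocHost(&pinned, bytes);   // page-locked host memory

    printf("pageable H2D: %.2f GB/s\n", h2d_bandwidth_gbs(pageable, dev, bytes, iters));
    printf("pinned   H2D: %.2f GB/s\n", h2d_bandwidth_gbs(pinned,   dev, bytes, iters));

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}
```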
HPC Advisory Council Switzerland Conference 2016 17/56
rCUDA optimizations on applications
• Several applications executed with CUDA and rCUDA
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand
(Figure: execution-time comparison; lower is better.)
HPC Advisory Council Switzerland Conference 2016 18/56
Outline
Two questions:
• Why should we need rCUDA?
• rCUDA … slower CUDA?
HPC Advisory Council Switzerland Conference 2016 19/56
Outline
rCUDA improves cluster performance
HPC Advisory Council Switzerland Conference 2016 20/56
Test bench for studying rCUDA+Slurm
• Dual socket E5-2620v2 Intel Xeon + 32 GB RAM + K20 GPU
• FDR InfiniBand based cluster
• Three cluster sizes: 4+1, 8+1 and 16+1 GPU nodes (in each case the extra node runs the Slurm scheduler)
HPC Advisory Council Switzerland Conference 2016 21/56
Applications for studying rCUDA+Slurm
• Applications used for the tests (execution time; GPUs used; GPU memory footprint):
  • GPU-Blast (21 seconds; 1 GPU; 1599 MB)
  • LAMMPS (15 seconds; 4 GPUs; 876 MB)
  • MCUDA-MEME (165 seconds; 4 GPUs; 151 MB)
  • GROMACS (2 nodes) (167 seconds) (non-GPU)
  • NAMD (4 nodes) (11 minutes) (non-GPU)
  • BarraCUDA (10 minutes; 1 GPU; 3319 MB)
  • GPU-LIBSVM (5 minutes; 1 GPU; 145 MB)
  • MUMmerGPU (5 minutes; 1 GPU; 2804 MB)
• The applications are grouped into Set 1 (short execution time) and Set 2 (long execution time), plus the non-GPU codes
• Three workloads: Set 1, Set 2, and Set 1 + Set 2
HPC Advisory Council Switzerland Conference 2016 22/56
Workloads for studying rCUDA+Slurm (I)
HPC Advisory Council Switzerland Conference 2016 23/56
Performance of rCUDA+Slurm (I)
HPC Advisory Council Switzerland Conference 2016 24/56
Workloads for studying rCUDA+Slurm (II)
HPC Advisory Council Switzerland Conference 2016 25/56
Performance of rCUDA+Slurm (II)
HPC Advisory Council Switzerland Conference 2016 26/56
Outline
Why does rCUDA improve cluster performance?
HPC Advisory Council Switzerland Conference 2016 27/56
1st reason for improved performance
• Non-accelerated applications keep GPUs idle in the nodes where they use all the cores
Hybrid MPI + shared-memory non-accelerated applications usually span all the cores of a node (across n nodes). A CPU-only application spreading over these nodes makes their GPUs unavailable for accelerated applications.
(Figure: nodes 1..n, each with CPUs, RAM, a PCIe-attached GPU and a network adapter, joined by the interconnection network.)
HPC Advisory Council Switzerland Conference 2016 28/56
2nd reason for improved performance (I)
• Accelerated applications keep CPUs idle in the nodes where they execute
Hybrid MPI + shared-memory non-accelerated applications usually span all the cores of a node (across n nodes); an accelerated application using just one CPU core may prevent other jobs from being dispatched to that node.
(Figure: the same cluster diagram as before.)
HPC Advisory Council Switzerland Conference 2016 29/56
2nd reason for improved performance (II)
• Accelerated applications keep CPUs idle in the nodes where they execute
An accelerated MPI application using just one CPU core per node may keep part of the cluster busy, even though hybrid MPI + shared-memory non-accelerated applications usually need all the cores of a node.
(Figure: the same cluster diagram as before.)
HPC Advisory Council Switzerland Conference 2016 30/56
3rd reason for improved performance
• Do applications completely squeeze the GPUs available in the cluster?
• When a GPU is assigned to an application, the computational resources inside the GPU may not be fully used:
  • the application presents a low level of parallelism
  • CPU code is being executed (GPU assigned ≠ GPU working)
  • GPU cores stall due to lack of data
  • etc.
(Figure: the same cluster diagram as before.)
HPC Advisory Council Switzerland Conference 2016 31/56
GPU usage of GPU-Blast
(Figure: GPU utilization trace; during parts of the run the GPU is assigned but not used.)
HPC Advisory Council Switzerland Conference 2016 32/56
GPU usage of CUDA-MEME
(Figure: GPU utilization trace; utilization stays far from the maximum.)
HPC Advisory Council Switzerland Conference 2016 33/56
GPU usage of LAMMPS
(Figure: GPU utilization trace; the GPU is assigned but not used for part of the run.)
HPC Advisory Council Switzerland Conference 2016 34/56
GPU allocation vs GPU utilization
(Figure: GPUs are assigned but not used for a significant fraction of the time.)
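The allocation-versus-utilization gap shown here can be observed on any node with a small sampling loop. The sketch below uses NVML (the library behind nvidia-smi) to print GPU and memory-controller utilization once per second; it only illustrates how such traces are obtained and is not the tool used for these slides. Compile with nvcc or g++ and link with -lnvml.

```cuda
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

// Sample the utilization of GPU 0 once per second for one minute.
int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML initialization failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int s = 0; s < 60; ++s) {
        nvmlUtilization_t util;
        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
            printf("t=%2ds  gpu=%3u%%  mem=%3u%%\n", s, util.gpu, util.memory);
        sleep(1);   // one sample per second
    }

    nvmlShutdown();
    return 0;
}
```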
HPC Advisory Council Switzerland Conference 2016 35/56
Sharing a GPU among jobs: GPU-Blast
(Figure: two concurrent instances of GPU-Blast sharing a GPU; one instance required about 51 seconds.)
HPC Advisory Council Switzerland Conference 2016 36/56
Sharing a GPU among jobs: GPU-Blast
(Figure: GPU utilization trace of the first of the two concurrent GPU-Blast instances.)
HPC Advisory Council Switzerland Conference 2016 37/56
Sharing a GPU among jobs: GPU-Blast
(Figure: GPU utilization traces of the first and second of the two concurrent GPU-Blast instances.)
HPC Advisory Council Switzerland Conference 2016 38/56
Sharing a GPU among jobs
GPU memory footprints (K20 GPU):
• LAMMPS: 876 MB
• mCUDA-MEME: 151 MB
• BarraCUDA: 3319 MB
• MUMmerGPU: 2104 MB
• GPU-LIBSVM: 145 MB
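These footprints are small compared with the memory of a K20, which is why several of these jobs can share a single (remote) GPU. Jobs sharing a GPU still allocate from the same physical device memory, so a cautious job can check the free memory before allocating its working set; below is a minimal sketch, where the "required" figure is just a placeholder based on the LAMMPS footprint above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Before allocating its working set, a job sharing a GPU with others can
// verify that enough device memory is still free.
int main() {
    const size_t required = 876ull << 20;   // placeholder: ~876 MB, like LAMMPS above

    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("GPU memory: %.0f MB free of %.0f MB\n",
           free_bytes / 1048576.0, total_bytes / 1048576.0);

    if (free_bytes < required) {
        fprintf(stderr, "not enough free GPU memory for this job, aborting\n");
        return 1;
    }

    void *workspace;
    cudaMalloc(&workspace, required);   // safe to allocate now (modulo races with other jobs)
    // ... run the job ...
    cudaFree(workspace);
    return 0;
}
```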
HPC Advisory Council Switzerland Conference 2016 39/56
Outline
Other reasons for using rCUDA?
HPC Advisory Council Switzerland Conference 2016 40/56
Cheaper cluster upgrade
• Let's suppose that a cluster without GPUs needs to be upgraded to use GPUs
• GPUs require large power supplies: are the power supplies already installed in the nodes large enough?
• GPUs require large amounts of space: does the current form factor of the nodes allow GPUs to be installed?
The answer to both questions is usually "NO"
HPC Advisory Council Switzerland Conference 2016 41/56
Cheaper cluster upgrade
Approach 1: augment the cluster with some CUDA GPU-enabled nodes → only those GPU-enabled nodes can execute accelerated applications
HPC Advisory Council Switzerland Conference 2016 42/56
Cheaper cluster upgrade
Approach 2: augment the cluster with some rCUDA servers → all nodes can execute accelerated applications
HPC Advisory Council Switzerland Conference 2016 43/56
Cheaper cluster upgrade
• Dual socket E5-2620v2 Intel Xeon + 32 GB RAM + K20 GPU
• FDR InfiniBand based cluster
• 16 nodes without GPU + 1 node with 4 GPUs
HPC Advisory Council Switzerland Conference 2016 44/56
More workloads for studying rCUDA+Slurm
HPC Advisory Council Switzerland Conference 2016 45/56
Performance
(Figure: performance results; annotated changes of -68%/-60%, -63%/-56% and +131%/+119% with respect to the baseline.)
HPC Advisory Council Switzerland Conference 2016 46/56
Outline
Additional reasons for using rCUDA?
HPC Advisory Council Switzerland Conference 2016 47/56
#1: More GPUs for a single application
64 GPUs!
HPC Advisory Council Switzerland Conference 2016 48/56
#1: More GPUs for a single application
• MonteCarlo Multi-GPU (from the NVIDIA samples), run with FDR InfiniBand and NVIDIA Tesla K20 GPUs
(Figure: execution time, lower is better, and throughput, higher is better.)
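With rCUDA the remote GPUs appear to the application as ordinary local devices, so standard multi-GPU code such as the MonteCarlo sample scales just by iterating over cudaSetDevice. Below is a hedged sketch of that pattern; the per-device kernel is a stand-in, not the MonteCarlo computation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void busy_work(float *out) {
    // Stand-in for the per-GPU portion of a real multi-GPU computation.
    out[threadIdx.x] = threadIdx.x * 0.5f;
}

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);        // with rCUDA this may report many remote GPUs
    printf("distributing work across %d GPU(s)\n", ngpus);

    std::vector<float*> buf(ngpus);

    // Launch one piece of work on each device; kernel launches are asynchronous,
    // so all devices (local or remote) compute in parallel.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&buf[d], 256 * sizeof(float));
        busy_work<<<1, 256>>>(buf[d]);
    }

    // Wait for every device to finish, then release its buffer.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(buf[d]);
    }
    return 0;
}
```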
HPC Advisory Council Switzerland Conference 2016 49/56
#2: Virtual machines can share GPUs
• With PCI passthrough, the GPU is assigned exclusively to a single virtual machine
• Concurrent usage of the GPU is not possible
HPC Advisory Council Switzerland Conference 2016 50/56
#2: Virtual machines can share GPUs
(Figure: GPU sharing among virtual machines both when a high-performance network is available and when only a low-performance network is available.)
HPC Advisory Council Switzerland Conference 2016 51/56
#3: GPU task migration
• Box A has 4 GPUs but only one is busy
• Box B has 8 GPUs but only two are busy
1. Move the jobs from Box B to Box A and switch off Box B
2. Migration should be transparent to applications (decided by the global scheduler)
Migration is performed at GPU granularity.
HPC Advisory Council Switzerland Conference 2016 52/56
#3: GPU task migration
Job granularity instead of GPU granularity.
(Figure: example timeline of migrating jobs at job granularity.)
HPC Advisory Council Switzerland Conference 2016 53/56
Outline
… in summary …
HPC Advisory Council Switzerland Conference 2016 54/56
Pros and cons of rCUDA
• Pros:
1. Many GPUs for a single application
2. Concurrent GPU access for virtual machines
3. Increased cluster throughput
4. Similar performance with a smaller investment
5. Easier (cheaper) cluster upgrade
6. Migration of GPU jobs
7. Reduced energy consumption
8. Increased GPU utilization
• Cons:
1. Reduced bandwidth to the remote GPU (really a concern??)
HPC Advisory Council Switzerland Conference 2016 55/56
Get a free copy of rCUDA at
http://www.rcuda.net
@rcuda_
More than 650 requests worldwide
rCUDA is developed by the Technical University of Valencia
HPC Advisory Council Switzerland Conference 2016 56/56
Thanks!
Questions?
rCUDA is developed by the Technical University of Valencia