Latest HPC News from NVIDIA

2
NVIDIA POWERS WORLD'S FASTEST
SUPERCOMPUTER
27,648
Volta Tensor Core GPUs
Summit Becomes First System To Scale The 100 Petaflops Milestone
122 PF 3 EF
HPC AI

3
NVIDIA POWERS FASTEST SUPERCOMPUTERS
IN US, EUROPE, JAPAN, INDUSTRY
17 of World’s 20 Most Energy-efficient Supercomputers
Piz Daint
Europe’s Fastest
5,320 GPUs| 20 PF
ORNL Summit
World’s Fastest
27,648 GPUs| 122 PF
ABCI
Japan’s Fastest
4,352 GPUs| 20 PF
ENI HPC4
Fastest Industrial
3,200 GPUs| 12 PF
LLNL Sierra
US 2nd Fastest
17,280 GPUs| 72 PF

4
ALL TOP 15 APPLICATIONS
ACCELERATED
550+ Applications Accelerated
8X CUDA DOWNLOADS
1M (2012) to 8M (2018)
DEFINING THE NEXT GIANT WAVE IN
HPC
OAK RIDGE SUMMIT
World’s fastest supercomputer
120+ Petaflop HPC; 3+ Exaflop of AI
ABCI Supercomputer (AIST)
Japan’s fastest AI supercomputer
Piz Daint
Europe’s fastest supercomputer
MOST ADOPTED PLATFORM FOR ACCELERATING HPC
# of GPU-Accelerated Apps: 259 (2014), 319 (2015), 400 (2016), 470 (2017), 554 (2018)

5
NVIDIA SDK & LIBRARIES
INDUSTRY FRAMEWORKS
& APPLICATIONS
CUSTOMER USE CASES
SUPERCOMPUTING
+550
Applications
CUDA
NCCL cuDNN TensorRT cuBLAS DeepStream cuSPARSE cuFFT
Amber
NAMD LAMMPS
CHROMA
ENTERPRISE APPLICATIONS CONSUMER INTERNET
Manufacturing Healthcare Engineering Speech Translate Recommender
Molecular
Simulations
Weather
Forecasting
Seismic
Mapping
cuRAND
NVIDIA TESLA PLATFORM
World’s Leading Data Center Platform for Accelerating HPC and AI
TESLA GPUs & SYSTEMS
SYSTEM OEM CLOUD TESLA GPU NVIDIA HGX NVIDIA DGX FAMILY

6
END-TO-END PRODUCT FAMILY
HPC/TRAINING INFERENCE
EMBEDDED
Jetson TX1
DATA CENTER
Tesla P4
AUTOMOTIVE
Drive PX2
Tesla P100 Tesla V100 Titan V
DATA CENTER DESKTOP
FULLY INTEGRATED DL SUPERCOMPUTER
Tesla V100 TITAN V
DESKTOP WORKSTATION DATA
CENTER
Tesla V100
DGX Station TITAN Quadro
DGX-1 DGX-2
V100 PCIE
FULLY INTEGRATED AI SYSTEMS

7
GPU-Accelerated
Server Platform
Dell EMC Fujitsu HPE IBM Lenovo Supermicro
SCX-E4
› 4x V100 NVLINK
• PowerEdge C4140 • Primergy CX400 M4
• Power Systems AC922*
• SYS-1028GQ-TVRT
SCX-E3
› 8x V100 PCIE
• Apollo 6500
(XL270d Gen10)
SCX-E2
› 4x V100 PCIE
• PowerEdge C4140
• PowerEdge T640
• PowerEdge R940xa
• Primergy CX400 M4 • SD530/D2 • SYS-1029GQ
SCX-E1
› 2x V100 PCIE
• PowerEdge R840
• PowerEdge R740xd*
• PowerEdge R740*
• Primergy RX2540 M4 • SD650
HGX-T1
› 8x V100 NVLINK
• Apollo 6500
(XL270d Gen10)
• SYS-4029GP-TVRT
V100 32GB SERVERS AVAILABLE FROM OEMS
Server Catalog
*reduced GPU configuration

8
TESLA V100
Form Factor: SXM2 | PCIe
Performance: 7.8TF DP, 15.7TF SP, 125TF FP16 (SXM2) | 7TF DP, 14TF SP, 112TF FP16 (PCIe)
Memory Size: 16GB/32GB HBM2
Memory Bandwidth: 900GB/s
GPU Peer to Peer: NVLink (up to 300 GB/s) + PCIe Gen3 (up to 32 GB/s) (SXM2) | PCIe Gen3 (up to 32 GB/s) (PCIe)
Power: 300W (SXM2) | 250W (PCIe)
Available From All Major OEMs
SXM2 32GB P/N = 900-2G503-0010-000, PCIE 32GB P/N = 900-2G500-0010-000, SXM2 16GB P/N = 900-2G503-0000-000, PCIE 16GB P/N = 900-2G500-0000-000

10
NVSWITCH
WORLD’S HIGHEST BANDWIDTH ON-NODE SWITCH
7.2 Terabits/sec or 900 GB/sec
18 NVLINK ports | 50GB/s per port bi-directional
Fully-connected crossbar
2 billion transistors | 47.5mm x 47.5mm package
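The two headline bandwidth figures on this slide are the same number in different units. A minimal sketch of the arithmetic, using only the port count and per-port rate quoted above (the function and variable names are illustrative, not from the slides):

```python
# Figures quoted on the NVSwitch slide.
NVLINK_PORTS = 18
GB_PER_SEC_PER_PORT = 50  # bi-directional, per port


def switch_bandwidth(ports, gb_per_sec_per_port):
    """Return aggregate switch bandwidth as (GB/s, Tb/s)."""
    gb_per_sec = ports * gb_per_sec_per_port
    # 8 bits per byte, 1000 gigabits per terabit (decimal units assumed).
    tbit_per_sec = gb_per_sec * 8 / 1000
    return gb_per_sec, tbit_per_sec


nvswitch_gb_s, nvswitch_tbit_s = switch_bandwidth(NVLINK_PORTS, GB_PER_SEC_PER_PORT)
print(f"{nvswitch_gb_s} GB/s = {nvswitch_tbit_s} Tb/s")
```

With 18 ports at 50 GB/s each, this gives the 900 GB/sec and 7.2 Terabits/sec stated on the slide.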

11
NVSWITCH
ENABLES THE WORLD’S LARGEST GPU
16 Tesla V100 32GB Connected by New NVSwitch
2 petaFLOPS of DL Compute
Unified 512GB HBM2 GPU Memory Space
300GB/sec Every GPU-to-GPU
2.4TB/sec of Total Cross-section Bandwidth
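The totals on this slide can be cross-checked against the per-GPU figures quoted elsewhere in the deck (125TF FP16 and 32GB per Tesla V100 SXM2, 300GB/sec per GPU over NVLink). The sketch below is only a sanity check. It assumes the cross-section figure counts the links of one half of the GPUs (8 of 16), which is the usual reading of bisection bandwidth but is not spelled out on the slide.

```python
# Per-GPU figures taken from the Tesla V100 (SXM2) and NVSwitch slides.
GPUS = 16
FP16_TFLOPS_PER_GPU = 125
HBM2_GB_PER_GPU = 32
NVLINK_GB_PER_SEC_PER_GPU = 300

# Aggregate deep learning compute, in petaFLOPS.
dl_petaflops = GPUS * FP16_TFLOPS_PER_GPU / 1000

# Unified HBM2 memory space, in GB.
unified_memory_gb = GPUS * HBM2_GB_PER_GPU

# Assumption: cross-section bandwidth = traffic between the two halves
# of the system, i.e. 8 GPUs each driving 300 GB/s.
cross_section_tb_per_sec = (GPUS // 2) * NVLINK_GB_PER_SEC_PER_GPU / 1000

print(dl_petaflops, unified_memory_gb, cross_section_tb_per_sec)
```

Under these assumptions the results line up with the slide: 2 petaFLOPS, 512GB, and 2.4TB/sec.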

12
THE LARGEST, FASTEST SHARED
MEMORY SUPERNODE FOR THE MOST
DIFFICULT HPC CHALLENGES
• 125 TFLOPS DP, 250 TF SP
• 512 GB shared memory
• 14.4 TB/s aggregate HBM BW
• 2.4 TB/s bisection BW
• 8x EDR network
• 30 TB SSD
INTRODUCING
NVIDIA DGX-2
THE WORLD’S MOST
POWERFUL HPC
SUPERNODE
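The DGX-2 compute and memory-bandwidth bullets appear to be 16 times the per-GPU Tesla V100 SXM2 figures from the earlier slide, rounded. A small sketch of that check (the dictionary layout and names are illustrative):

```python
# Per-GPU Tesla V100 SXM2 figures from the TESLA V100 slide.
V100_SXM2 = {
    "dp_tflops": 7.8,
    "sp_tflops": 15.7,
    "hbm2_gb_per_sec": 900,
}
GPUS_IN_DGX2 = 16

dgx2_dp_tflops = GPUS_IN_DGX2 * V100_SXM2["dp_tflops"]
dgx2_sp_tflops = GPUS_IN_DGX2 * V100_SXM2["sp_tflops"]
dgx2_hbm_tb_per_sec = GPUS_IN_DGX2 * V100_SXM2["hbm2_gb_per_sec"] / 1000

print(f"DP: {dgx2_dp_tflops:.1f} TFLOPS")      # slide rounds this to 125
print(f"SP: {dgx2_sp_tflops:.1f} TFLOPS")      # slide rounds this to 250
print(f"HBM: {dgx2_hbm_tb_per_sec:.1f} TB/s")  # slide states 14.4
```

The unrounded products are 124.8 and 251.2 TFLOPS, so the slide's 125 and 250 are rounded marketing figures, while the 14.4 TB/s aggregate HBM bandwidth is exact.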

13
INSIDE DGX-2: “WORLD’S LARGEST GPU”
• NVIDIA Tesla V100 32GB
• Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
• Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
• Eight EDR InfiniBand/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
• PCIe Switch Complex
• Two Intel Xeon Platinum CPUs
• 1.5 TB System Memory
• 30 TB NVMe SSDs Internal Storage
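The 1600 Gb/sec network figure follows from the port count if each EDR InfiniBand/100 GigE port is taken as 100 Gb/s in each direction. That per-port rate is an assumption based on the nominal EDR and 100 GigE line rates; the slide gives only the total.

```python
# Assumption: each port carries 100 Gb/s per direction.
PORTS = 8
GBIT_PER_SEC_PER_DIRECTION = 100
DIRECTIONS = 2

network_gbit_per_sec = PORTS * GBIT_PER_SEC_PER_DIRECTION * DIRECTIONS

# Same figure in gigabytes per second (8 bits per byte).
network_gbyte_per_sec = network_gbit_per_sec / 8

print(network_gbit_per_sec, "Gb/s =", network_gbyte_per_sec, "GB/s")
```

At roughly 200 GB/s, the total external network bandwidth is well below the 2.4 TB/sec available inside the node, which is why workloads that fit on the 16 GPUs benefit from staying on the NVSwitch fabric.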

14
OVER 2X HIGHER PERFORMANCE WITH NVSWITCH
Two DGX-1 Compared to DGX-2
2 HGX-1V servers have dual socket Xeon E5 2698v4 Processor. 8x V100 GPUs. Servers connected via 4X 100Gb IB ports (run on DGX-1) | HGX-2 server has dual-socket Xeon Platinum 8168 Processor. 16 V100 GPUs (run on DGX-2)
DGX-2 with NVSwitch vs. Two DGX-1 (Volta)
HPC
• Physics (MILC benchmark), 4D Grid: 2X FASTER
• Weather (ECMWF benchmark), All-to-all: 2.4X FASTER
AI Training
• Recommender (Sparse Embedding), Reduce & Broadcast: 2X FASTER
• Language Model (Transformer with MoE), All-to-all: 2.7X FASTER

15
TESLA HGX-2
FUSING HPC AND AI INTO
ONE UNIFIED COMPUTING ARCHITECTURE
Multi-precision Computing
2 PFLOPS AI | 250 TFLOPS FP32 | 125 TFLOPS FP64
16 Tesla V100 GPUs | 0.5TB Memory | 2.4 TB/s
Building Block for Partner Systems & DGX-2

16
NVIDIA SOFTWARE PLATFORM UPDATES

17
NVIDIA GPU CLOUD (NGC)
Simple Access to GPU-Accelerated Software
Cloud Servers
Workstations
Deploy Applications In
Minutes, Not Days
Discover 35 Optimized
Containers
Run Anywhere with Maximum
Performance
GPU-Powered
Accelerate
Time to Market

18
CONTAINERS SIMPLIFY APPLICATION DEPLOYMENTS
DRIVERS + OPERATING SYSTEM
CONTAINER RUNTIME
NAMD 2.12
CUDA
libraries
VMD
CUDA
libraries
GROMACS
CUDA
libraries
NAMD 2.13
CUDA
libraries
Environment modules simplified/eliminated
Performance equivalent to bare metal
Deploy applications in minutes
Higher productivity for sys admins & users
SHARED CLUSTER
Portable on various systems Reproducible results

19
NGC CONTAINER REGISTRY
10 Containers at Launch, 35 Containers Today
HPC: bigdft, candle, chroma, gamess, gromacs, lammps, lattice-microbes, MILC, namd, pgi, picongpu, relion, vmd
Deep Learning: caffe, caffe2, cntk, cuda, digits, inferenceserver, mxnet, pytorch, tensorflow, tensorrt, theano, torch
HPC Visualization: index, paraview-holodeck, paraview-index, paraview-optix
Partners: chainer, h2oai-driverless, kinetica, mapd, paddlepaddle
NVIDIA/K8s: Kubernetes on NVIDIA GPUs
* New containers since SC17

20
CUDA TOOLKIT 9.2
Optimized for Volta:
• Tensor Cores
• Second-Generation NVLink
• HBM2 Stacked Memory
UNLEASHES POWER OF VOLTA
COOPERATIVE THREAD GROUPS
Flexible Thread Groups
Efficient Parallel Algorithms
• Synchronize Across
Thread Blocks in a Single
GPU or Multi-GPUs
• RNN and CNN Optimizations (cuBLAS)
• >20x Faster Image Processing (NPP)
• Speed up FFT of prime size matrices
(cuFFT)
FASTER LIBRARIES
DEVELOPER TOOLS & PLATFORM UPDATES
• CUTLASS 1.0 accelerates custom
linear algebra algorithms
• 2x faster CUDA kernel launch
• New OS & Compiler Support
• Unified Memory Profiling
• NVLink Visualization

21
CUDA 9.2 PLATFORM SUPPORT
New OS and Host Compilers
PLATFORM | OS VERSION | COMPILERS
Windows | Windows Server 2016, 2012 R2 | Microsoft Visual Studio 2017 (15.6)
Linux | 16.04.4 LTS, 17.10 non, 7.5, 7.5 POWER LE, SLES 12 SP3, 27, Leap 42.3 | GCC 7.x, PGI 18.x, Clang 5.0.x, ICC 17, XLC 13.1.6 (POWER)
Mac | macOS 10.13.4 | Xcode 9.2

22
silica IFPEN, RMM-DIIS on P100
OPENACC GROWING MOMENTUM
Wide Adoption Across Key HPC Codes
ANSYS Fluent
Gaussian
VASP
LSDalton
MPAS
GAMERA
GTC
XGC
ACME
FLASH
COSMO
Numeca
Over 100 Apps* Using OpenACC
Prof. Georg Kresse
Computational Materials Physics
University of Vienna
For VASP, OpenACC is the way forward for GPU
acceleration. Performance is similar to CUDA, and
OpenACC dramatically decreases GPU
development and maintenance efforts. We’re
excited to collaborate with NVIDIA and PGI as an
early adopter of Unified Memory.
VASP
Top Quantum Chemistry and Material Science Code
* Applications in production and development

23
DCGM
Active Health Monitoring
• Run-time health checks
• Prologue check: Quick health check of
the GPU
• Epilogue check: online GPU diagnostic
tests to determine root cause issues
NVIDIA Data Center GPU
Manager
developer.nvidia.com/cuda-toolkit
Diagnostics & System Validation
• GPU Compute Performance
• Interconnect BW & Latency
• Power & Thermals
Policy Framework
• Assists in Recovery Action Automation
• Group Control over Power & Clock Policy
• Dynamic Page Retirement Policy
