21. 21
HPC application performance gains
All results are measured.
Except for BerkeleyGW, each V100 result uses a single V100 SXM2 and each A100 result uses a single A100 SXM4.
Application details: AMBER based on PME-Cellulose, GROMACS with STMV (h-bond), LAMMPS with Atomic Fluid LJ-2.5, NAMD v3.0a1 with STMV_NVE,
Chroma with szscl21_24_128, FUN3D with dpw, RTM with Isotropic Radius 4 1024^3, SPECFEM3D with the Cartesian four material model.
BerkeleyGW is based on Chi Sum and uses 8x V100 in a DGX-1 vs. 8x A100 in a DGX A100.
[Bar chart: A100 speedup over V100 per application (axis: Speedup, 0.0x to 2.0x). Bar labels in source order: 1.5X, 1.5X, 1.6X, 1.9X, 2.0X, 2.1X, 1.7X, 1.8X, 1.9X]
Molecular dynamics, Physics, Engineering, Earth science
22. 22
[Bar chart, dataset PME-Cellulose_NVE; vertical axis: number of CPU-only nodes, 0 to 320]
AMBER Performance Equivalency
Single GPU Node vs Multiple Cascade Lake CPU-Only Nodes
CPU Server: Dual Xeon Gold; GPU Servers: Xeon Platinum with NVIDIA A100 80GB SXM4
CUDA Version: CUDA 11.1; Dataset: PME-Cellulose_NVE
CPU node equivalence is derived from benchmarks measured on up to 8 CPU nodes. Beyond 8 nodes, the result is extrapolated with linear scaling.
Number of CPU-only nodes matched by one GPU node:
1 node, 2x A100 GPUs: 70 CPU nodes
1 node, 4x A100 GPUs: 140 CPU nodes
1 node, 8x A100 GPUs: 280 CPU nodes
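The linear-scaling step in the equivalence note above can be sketched in shell. This is a minimal illustration, not the method used to produce the chart: the function name `cpu_node_equivalent` and the throughput figures (ns/day) in the usage line are hypothetical, and the sketch assumes CPU throughput grows in proportion to node count beyond the last measured point.

```shell
# Estimate how many CPU-only nodes match one GPU node by extrapolating
# linearly from the largest measured CPU run.
# Arguments (all caller-supplied; none of these values come from the slide):
#   $1  GPU node throughput, e.g. ns/day
#   $2  CPU throughput measured at $3 nodes, same unit
#   $3  number of CPU nodes in that measurement (8 in the note above)
cpu_node_equivalent() {
  awk -v gpu="$1" -v cpu="$2" -v nodes="$3" 'BEGIN {
    per_node = cpu / nodes           # assumed constant beyond the measured range
    printf "%.0f\n", gpu / per_node  # nodes needed to reach the GPU throughput
  }'
}

# Hypothetical figures: 8 CPU nodes reach 16 ns/day, one GPU node reaches 140 ns/day.
cpu_node_equivalent 140 16 8
```

Real CPU clusters rarely scale perfectly linearly, so an estimate of this kind tends to understate the number of nodes actually required.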
AMBER
Molecular Dynamics
Suite of programs to simulate molecular
dynamics on biomolecules
VERSION
20.6-AT_20.10
ACCELERATED FEATURES
PMEMD Explicit Solvent and GB Implicit Solvent
SCALABILITY
Multi-GPU and Single Node
MORE INFORMATION
http://ambermd.org/gpus
26. 26
MIG terminology: instances
A MIG device consists of a GPU Instance and a Compute Instance
● Multi-Instance GPU (MIG) - The MIG feature allows one or more GPU Instances to be allocated within a GPU, making a single GPU appear as if it were many.
● GPU Instance (GI) - A fully isolated collection of all physical GPU resources. Contains one or more Compute Instances.
● Compute Instance (CI) - An isolated collection of GPU SMs (CUDA cores) that belongs to a single GPU Instance. Shares GPU memory with other CIs in that GPU Instance.
● Instance Profile - A GPU Instance Profile (GIP) or Compute Instance Profile (CIP) defines the configuration and available resources in an Instance.
● MIG Device - A MIG Device is the actual “GPU” an application sees. It is the combination of a GPU, a GPU Instance, and a Compute Instance.
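To relate the terms above to the command line, the sketch below prints (but does not run) one plausible sequence of `nvidia-smi mig` commands that goes from profiles to a visible MIG device. The function name `mig_device_plan` is hypothetical, and the profile ID in the usage line is only a placeholder: valid IDs depend on the GPU and driver, so check the `-lgip` listing on the target system first.

```shell
# Print the commands that would create one MIG device on a GPU.
# Nothing is executed, so this is safe to run on a machine without a GPU.
#   $1  GPU index
#   $2  GPU Instance Profile ID or name (placeholder; take it from the -lgip listing)
mig_device_plan() {
  gpu_index="$1"
  gi_profile="$2"
  echo "nvidia-smi mig -i $gpu_index -lgip"                     # list GPU Instance Profiles (GIP)
  echo "sudo nvidia-smi mig -i $gpu_index -cgi $gi_profile -C"  # create a GI; -C also creates its default CI
  echo "nvidia-smi mig -i $gpu_index -lgi"                      # list GPU Instances (GI)
  echo "nvidia-smi mig -i $gpu_index -lci"                      # list Compute Instances (CI)
  echo "nvidia-smi -L"                                          # list GPUs and the resulting MIG devices
}

# Placeholder profile ID 9 on GPU 0.
mig_device_plan 0 9
```

Printing the plan instead of executing it keeps the example reviewable before any GPU state changes; piping the output to `sh` would run it, assuming MIG mode is already enabled on that GPU.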
28. 29
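Enabling and disabling MIG
● MIG is enabled or disabled per GPU
● GPUs with MIG enabled and GPUs with MIG disabled can coexist in one system
● If the GPU is in use, the change is left in a "pending" state and takes effect after a restart (GPU reset or reboot)
● Enabling or disabling MIG requires root privileges
● Once set, the enabled or disabled state persists across server reboots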
# All MIG configuration requires sudo
dgxuser@DGXA100:~$ nvidia-smi -i 0 -mig 1
Unable to enable MIG Mode for GPU 00000000:07:00.0: Insufficient Permissions
Terminating early due to previous errors.
dgxuser@DGXA100:~$ sudo nvidia-smi -i 0 -mig 1
00000000:07:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi).
Please first kill all processes using the device and retry the command or reboot the system to make MIG mode effective.
All done.
dgxuser@DGXA100:~$ sudo nvidia-smi -i 0 -mig 1
Warning: MIG mode is in pending enable state for GPU 00000000:07:00.0:In use by another client
00000000:07:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi).
Please first kill all processes using the device and retry the command or reboot the system to make MIG mode effective.
All done.
dgxuser@DGXA100:~$ nvidia-smi -q -i 0 | grep -i MIG -A 2
MIG Mode
Current : Disabled
Pending : Enabled
dgxuser@DGXA100:~$ sudo nvidia-smi -i 0 -r
The following GPUs could not be reset:
GPU 00000000:07:00.0: In use by another client
1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring
application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.
dgxuser@DGXA100:~$ sudo systemctl stop nvsm
dgxuser@DGXA100:~$ sudo systemctl stop dcgm
dgxuser@DGXA100:~$ sudo nvidia-smi -i 0 -r
GPU 00000000:07:00.0 was successfully reset.
All done.
dgxuser@DGXA100:~$ nvidia-smi -i 0 --query-gpu=mig.mode.pending,mig.mode.current --format=csv
mig.mode.pending, mig.mode.current
Enabled, Enabled
dgxuser@DGXA100:~$ sudo nvidia-smi -i 0,1,2,3 -mig 1
Enabled MIG Mode for GPU 00000000:07:00.0
Enabled MIG Mode for GPU 00000000:0F:00.0
Enabled MIG Mode for GPU 00000000:47:00.0
Enabled MIG Mode for GPU 00000000:4E:00.0
All done.
dgxuser@DGXA100:~$ sudo systemctl start nvsm
dgxuser@DGXA100:~$ sudo systemctl start dcgm
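The pending/current check shown in the session above can be wrapped in a small helper that decides whether a GPU reset is still needed. This is a sketch under assumptions: the function name `mig_mode_status` is hypothetical, and it expects a single CSV data line in the same field order as the query in the transcript (pending first, then current), for example as produced with `--format=csv,noheader`.

```shell
# Classify one CSV data line "pending, current" from:
#   nvidia-smi -i 0 --query-gpu=mig.mode.pending,mig.mode.current --format=csv,noheader
mig_mode_status() {
  pending=$(printf '%s' "$1" | cut -d, -f1 | tr -d ' ')
  current=$(printf '%s' "$1" | cut -d, -f2 | tr -d ' ')
  if [ "$pending" = "$current" ]; then
    echo "applied: $current"                    # nothing left to do
  else
    echo "reset-needed: $current -> $pending"   # change is pending until the GPU is reset
  fi
}

# Sample input matching the state right after the first "sudo nvidia-smi -i 0 -mig 1".
mig_mode_status "Enabled, Disabled"
```

On a real system the argument would come from the query itself, e.g. `mig_mode_status "$(nvidia-smi -i 0 --query-gpu=mig.mode.pending,mig.mode.current --format=csv,noheader)"`. A "reset-needed" result corresponds to the point in the session where the monitoring services are stopped and `sudo nvidia-smi -i 0 -r` is run.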
30. 32
Using MIG in Docker Containers
Passing through specific MIG devices
With MIG disabled, the GPU is still specified by the GPU index or GPU UUID.
However, when MIG is enabled, the device is specified by index in the format <gpu-device-index>:<mig-device-index> or by UUID in the format MIG-<gpu-id>/<gpu-instance-id>/<gpu-compute-instance-id>.
docker run --gpus '"device=0,1"' nvidia/cuda:9.0-base nvidia-smi # MIG DISABLED
docker run --gpus '"device=0:0,0:1"' nvidia/cuda:9.0-base nvidia-smi # MIG ENABLED
Configure MIG (as root)
docker run --cap-add SYS_ADMIN --gpus '"device=0"' -e NVIDIA_MIG_CONFIG_DEVICES="all" nvidia/cuda:9.0-base nvidia-smi
Configure MIG (non-root)
chown nobody /proc/driver/nvidia/capabilities/mig/config
docker run --cap-add SYS_ADMIN --user nobody --gpus '"device=0"' \
-e NVIDIA_MIG_CONFIG_DEVICES="all" nvidia/cuda:9.0-base nvidia-smi
MIG monitoring
docker run --cap-add SYS_ADMIN --gpus '"device=0"' -e NVIDIA_MIG_MONITOR_DEVICES="all" nvidia/cuda:9.0-base nvidia-smi
Note: Commands are slightly different on Docker version 18.x (shown here is 19.03+)
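The `<gpu-device-index>:<mig-device-index>` format described on this slide is easy to mistype when several MIG devices are passed through, so a small helper can build the `--gpus` value. This is a sketch only: the function name `mig_gpus_arg` is hypothetical, and it covers the index form, not the UUID form.

```shell
# Build the value for docker's --gpus option from one GPU index and
# one or more MIG device indices on that GPU.
#   $1      GPU index
#   $2...   MIG device indices
mig_gpus_arg() {
  gpu="$1"
  shift
  devices=""
  for mig in "$@"; do
    # Join entries as <gpu>:<mig>, separated by commas.
    devices="${devices:+$devices,}${gpu}:${mig}"
  done
  # Docker expects the device list wrapped in literal double quotes.
  printf '"device=%s"\n' "$devices"
}

# MIG devices 0 and 1 on GPU 0.
mig_gpus_arg 0 0 1
```

The result can be substituted into the command shown on the slide, for example `docker run --gpus "$(mig_gpus_arg 0 0 1)" nvidia/cuda:9.0-base nvidia-smi`.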