IBM Power System AC922: The Brain Behind the Supercomputer
—
Pidad D’Souza (pidsouza@in.ibm.com)
Aditya Nitsure (anitsure@in.ibm.com)
Power System Performance, ISDL, IBM, Bengaluru
Agenda
● AC922 System Components
● AC922 Characteristics
● System Features
IBM Systems at Supercomputing 2019 / © 2019 IBM Corporation
The most powerful supercomputer on the planet
4 of the top 11 systems on the Top500 list are IBM POWER9 systems
▪ 4,608 IBM AC922 nodes
▪ 200 petaFLOPS
▪ 27,648 NVIDIA Tesla GPUs
▪ 25 gigabytes per second between nodes
▪ 13 MW power consumption
Exascale energy budget: 20-40 megawatts (MW)
Innovations in Hardware and Software
▪ Processor/Accelerators
▪ Memory
▪ Interconnect
▪ Spectrum MPI, Math Libraries
4 of the top 10 Green500 systems are IBM POWER9 systems
AiMOS – Green500 No. 3 with 15.72 GFlops/Watt
Heterogeneous Systems
Heterogeneous Computing
[Figure: GPU acceleration. The application code is split: compute-intensive code runs on the GPU, and the rest of the sequential code runs on the CPU.]
CPU
– Large and broad instruction set to perform complex operations
GPU
– High throughput
– Massive parallelization through a large number of cores
– Specialized for SIMD/SIMT
Maximize performance and energy efficiency
Summit and Sierra Supercomputer configurations
– NVLink 2.0: high-bandwidth interconnect
o 150 GB/s bidirectional bandwidth (100 GB/s for the 6-GPU configuration) between CPU-GPU and GPU-GPU
– Coherent access to CPU memory
• CPU and GPU cooperate in the execution of work
• GPU has coherent access to CPU memory
[Figure: half-node block diagrams. Sierra (4 GPU Half Node): NVIDIA V100 GPUs attached over NVLink at 150 GB/s. Summit (6 GPU Half Node): NVIDIA V100 GPUs attached over NVLink at 100 GB/s. In both, the POWER9 has DDR4 memory at 170 GB/s, PCIe 4.0 / CAPI 2.0 I/O, and an InfiniBand (IB) adapter, and the GPUs have coherent access to system memory.]
IBM POWER9 AC922 Server
– Delivers unprecedented performance for modern HPC, analytics, and artificial intelligence (AI)
– Designed to fully exploit the capabilities of CPU and GPU accelerators
– Eliminates I/O bottlenecks and allows sharing memory across GPUs and CPUs
– Extraordinary POWER9 CPUs
– 2-6 NVIDIA® Tesla® V100 GPUs with NVLink
– Co-optimized hardware and software for deep learning and AI
– Supports up to 5.6x more I/O bandwidth than competitive servers
– Combines the cutting-edge AI innovation data scientists desire with the dependability IT requires
– Next-generation PCIe: PCIe Gen4 is 2x faster than PCIe Gen3
Nvidia Tesla V100 GPU
– Designed for AI computing and HPC
– Second-generation NVLink™
– HBM2 memory: faster, higher efficiency
– Enhanced Unified Memory and Address Translation Services
– Maximum Performance and Maximum Efficiency modes
– Number of SMs/cores: 80/5120
– Double-precision performance: 7.5 TFLOPS
– Single-precision performance: 15 TFLOPS
– 125 Tensor TFLOPS
– GPU memory: 16 or 32 GB
– Memory bandwidth: 900 GB/s
https://devblogs.nvidia.com/inside-volta
AC922 SYSTEM CHARACTERISTICS
CPU STREAM Bandwidth
Boost application performance with sustained memory bandwidth of ~280 GB/s
STREAM benchmark (https://www.cs.virginia.edu/stream/) *not submitted
NVIDIA V100 Compute – Single and Double Precision
– Applications get more compute power
– Shorter time to completion
– Accomplish more simulations/experiments
– 1.5x higher compute than NVIDIA P100 GPUs
NVIDIA V100 SGEMM and DGEMM (compute, TFLOPS)
System           DGEMM   SGEMM
S822LC + P100    4.8     9.8
AC922 + V100     7.45    15.3    (1.5x higher)
NVIDIA V100 GPU memory bandwidth (GPU STREAM)
– 1.6x higher bandwidth than NVIDIA P100
– Speeds up memory-intensive applications
GPU STREAM bandwidth (GB/s)
S822LC + P100    512
AC922 + V100     840    (1.6x higher)
[Chart: y-axis 0 to 900 GB/s; the chart also carries a "Theoretical" label.]
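The GPU STREAM figure is a sustained-bandwidth measurement. The following is a minimal sketch of how a triad-style measurement can be made; it is not the benchmark used for the slide. The array size, repetition count, and all names (`triad`, `triad_bytes`, `CUDA_CHECK`) are assumptions for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// STREAM triad: a[i] = b[i] + scalar * c[i]
__global__ void triad(double *a, const double *b, const double *c,
                      double scalar, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + scalar * c[i];
}

// Bytes moved by one triad pass over n elements: read b and c, write a.
static double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

int main()
{
    const size_t n = (size_t)1 << 25;   // 256 MiB per array (assumed size)
    const int reps = 20;                // assumed repetition count
    const size_t bytes = n * sizeof(double);

    double *a, *b, *c;
    CUDA_CHECK(cudaMalloc(&a, bytes));
    CUDA_CHECK(cudaMalloc(&b, bytes));
    CUDA_CHECK(cudaMalloc(&c, bytes));
    CUDA_CHECK(cudaMemset(b, 0, bytes));   // values do not affect bandwidth
    CUDA_CHECK(cudaMemset(c, 0, bytes));

    const int threads = 256;
    const unsigned int blocks = (unsigned int)((n + threads - 1) / threads);

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    triad<<<blocks, threads>>>(a, b, c, 3.0, n);   // warm-up launch
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaEventRecord(start));
    for (int r = 0; r < reps; ++r)
        triad<<<blocks, threads>>>(a, b, c, 3.0, n);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    CUDA_CHECK(cudaGetLastError());

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("Triad bandwidth: %.1f GB/s\n",
           triad_bytes(n) * reps / (ms * 1e-3) / 1e9);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(a));
    CUDA_CHECK(cudaFree(b));
    CUDA_CHECK(cudaFree(c));
    return 0;
}
```

The reported figure counts two reads and one write per element, which is the usual STREAM convention; results depend on the GPU, clocks, and array size.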
CPU to GPU NVLink vs PCIe3 bandwidth
– NVLink 2.0 is 5.6x better than PCIe3
– Removes CPU-GPU data transfer bottlenecks
CPU-GPU bandwidth (GB/s)
Xeon E5-2640 v4 + P100 (PCIe3)    12
S822LC + P100                     34.16   (2.8x better than PCIe3)
AC922 + 6 V100                    45.9    (3.8x better than PCIe3, 1.34x S822LC)
AC922 + 4 V100                    68      (5.6x better than PCIe3, 2x S822LC)
Note: NVIDIA bandwidth test used for measurement
NVLink bandwidth with varied data sizes
– Minimize communication latencies
– Unlock PCIe bottlenecks
– Transfer larger data at high speed
– Ideal for data sizes larger than GPU memory
[Figure: NVLink 2.0 vs PCIe3 host-to-device bandwidth. X-axis: data size, 1 KB to 1,048,576 KB. Y-axis: bandwidth, 0 to 80,000 MB/s. Series: 2 NVLinks per GPU, 3 NVLinks per GPU, PCIe3.]
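A host-to-device bandwidth sweep of this kind can be sketched as follows. This is an illustrative sketch, not the NVIDIA bandwidth test used for the slides; the size range mirrors the chart axis (1 KB to 1,048,576 KB), and the repetition count and helper names (`mb_per_s`, `CUDA_CHECK`) are assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Convert "bytes moved in ms milliseconds" to MB/s (1 MB = 1e6 bytes).
static double mb_per_s(size_t bytes, float ms)
{
    return ((double)bytes / 1e6) / (ms / 1e3);
}

int main()
{
    const size_t max_bytes = (size_t)1 << 30;   // 1,048,576 KB, largest size on the chart
    const int reps = 10;                        // assumed repetition count

    void *h_buf = nullptr, *d_buf = nullptr;
    CUDA_CHECK(cudaMallocHost(&h_buf, max_bytes));   // pinned host memory
    CUDA_CHECK(cudaMalloc(&d_buf, max_bytes));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    printf("%12s  %12s\n", "Size (KB)", "MB/s");
    for (size_t bytes = 1024; bytes <= max_bytes; bytes *= 2) {
        CUDA_CHECK(cudaEventRecord(start));
        for (int r = 0; r < reps; ++r)
            CUDA_CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
        CUDA_CHECK(cudaEventRecord(stop));
        CUDA_CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("%12zu  %12.1f\n", bytes / 1024, mb_per_s(bytes * reps, ms));
    }

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_buf));
    CUDA_CHECK(cudaFreeHost(h_buf));
    return 0;
}
```

Small transfers are dominated by per-call latency, so the measured bandwidth typically rises with transfer size before it levels off.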
Workload Optimized Frequency (WOF)
– Boost performance of less active workloads through higher frequency
– Lower the frequency to save power or to boost other cores
– Maximize performance by dynamically adjusting processor frequency
– Governing factors
• Processor utilization, number of active cores, and environmental conditions
– Power saver modes
• Dynamic Performance Mode (DPM)
• Maximum Performance Mode (MPM)
HPC Interconnect
– Multi-host adapter (Mellanox ConnectX-5 EDR)
• Latency: sub-600 nanoseconds
• Bandwidth: 2 ports of 100 Gb/s
• Message rate: 200M messages/second
– Adapter features
• Switch-based collectives (SHARP)
• Hardware tag matching
• User-mode memory registration (UMR)
• GPUDirect RDMA
• Tunneled atomics
[Figure: two POWER9 (P9) sockets joined by the X-Bus, each with an x8 link to the InfiniBand EDR adapter.]
Bi-section Bandwidth & All Reduce scaling on Summit
– Good scaling at large scale, with ~74% of bisection bandwidth achieved when adaptive routing is enabled
– Spectrum MPI (SMPI) supports HCOLL (FCA) and SHARP, which lets applications run with the best collective performance
Source: The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, SC18.
Burst buffer performance on Summit
– Improved application performance through the burst buffer
– Applications are not bottlenecked on I/O operations to parallel file systems
Source: The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, SC18.
Parameters impacting HPC Application Performance
[Figure: HPC application performance depends on compute, memory capacity, memory bandwidth, interconnect, and I/O.]
HPC APPLICATION ACCELERATION METHODOLOGIES
o Unified Memory
o Coherency
o ATS
o OpenMP
CUDA Programming
cudaMallocHost(&h_data, size);  // Allocate pinned memory on the host
cudaMalloc(&d_data, size);      // Allocate memory on the GPU
init_dataCPU(h_data);
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // Move data to GPU (destination is the first argument)
gpu_kernel<<<…>>>(d_data);      // GPU compute
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // Move results back to CPU
cpu_processing(h_data);
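The explicit-copy flow can be written out as a complete program. This is a sketch under stated assumptions: the kernel body (doubling each element), the problem size, and the helper names `grid_size` and `CUDA_CHECK` are hypothetical and stand in for the slide's `init_dataCPU`, `gpu_kernel`, and `cpu_processing`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical kernel: doubles every element in place.
__global__ void gpu_kernel(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Number of thread blocks needed to cover n elements.
static unsigned int grid_size(size_t n, unsigned int threads)
{
    return (unsigned int)((n + threads - 1) / threads);
}

int main()
{
    const size_t n = (size_t)1 << 20;        // assumed problem size
    const size_t size = n * sizeof(float);
    float *h_data = nullptr, *d_data = nullptr;

    CUDA_CHECK(cudaMallocHost(&h_data, size));   // pinned host memory
    CUDA_CHECK(cudaMalloc(&d_data, size));       // GPU memory

    for (size_t i = 0; i < n; ++i) h_data[i] = (float)i;   // init on the CPU

    // Destination pointer comes first in cudaMemcpy.
    CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));

    gpu_kernel<<<grid_size(n, 256), 256>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost));

    // CPU post-processing: verify the result.
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (h_data[i] != 2.0f * (float)i) ++mismatches;
    printf("%zu mismatches\n", mismatches);

    CUDA_CHECK(cudaFree(d_data));
    CUDA_CHECK(cudaFreeHost(h_data));
    return mismatches == 0 ? 0 : 1;
}
```

Every buffer exists twice (host and device) and every transfer is spelled out, which is the bookkeeping that Unified Memory and ATS aim to remove.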
Unified Memory Programming
– Single memory address space accessible to both CPU and GPU
– Enables oversubscribing GPU memory
• Computation on data sets larger than GPU memory
– System-wide atomic memory operations
– Transparent memory migration between CPU and GPU, depending on which processor accesses the data
• Explicit migration through cudaMemPrefetchAsync()
– Allocating unified memory
• Replace “malloc” and “new” with “cudaMallocManaged”
[Figure: GPU and CPU sharing a single Unified Memory space.]
Unified Memory Advice
– ReadMostly
• Data is mostly read and only occasionally written
• Pages are duplicated; writes are possible but expensive
– PreferredLocation
• Specifies the preferred location for the data
• “Resists” migrations away from the preferred location
– AccessedBy
• Establishes mappings so the data can be accessed directly, avoiding migrations
char *data;
cudaMallocManaged(&data, size);
init_dataCPU(data, size);
cudaMemPrefetchAsync(data, size, gpuID);
cudaMemAdvise(data, size, …ReadMostly, gpuID);
gpuKernel<<<…>>>(data, size);  // Transparent data migration to GPU
cudaDeviceSynchronize();
use_dataCPU(data, size);       // Data migrates back to CPU
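A runnable variant of the managed-memory flow might look like the sketch below. It assumes the CUDA 12.x-and-earlier signatures of `cudaMemAdvise` and `cudaMemPrefetchAsync`, which take an integer device ID (newer toolkits changed these to take a memory-location argument). The kernel, sizes, and helper names are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical kernel: reads `in` (read-mostly data) and writes `out`.
__global__ void add_one(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Number of thread blocks needed to cover n elements.
static unsigned int grid_size(size_t n, unsigned int threads)
{
    return (unsigned int)((n + threads - 1) / threads);
}

int main()
{
    const int gpuID = 0;                       // assumed device
    const size_t n = (size_t)1 << 20;          // assumed problem size
    const size_t bytes = n * sizeof(float);
    CUDA_CHECK(cudaSetDevice(gpuID));

    float *in = nullptr, *out = nullptr;
    CUDA_CHECK(cudaMallocManaged(&in, bytes));   // one pointer for CPU and GPU
    CUDA_CHECK(cudaMallocManaged(&out, bytes));

    for (size_t i = 0; i < n; ++i) in[i] = (float)i;   // first touch on the CPU

    // Hints and prefetches need concurrent managed access (Pascal or newer on Linux).
    int concurrent = 0;
    CUDA_CHECK(cudaDeviceGetAttribute(&concurrent,
                                      cudaDevAttrConcurrentManagedAccess, gpuID));
    if (concurrent) {
        // CUDA <= 12.x signatures (assumption, see lead-in).
        CUDA_CHECK(cudaMemAdvise(in, bytes, cudaMemAdviseSetReadMostly, gpuID));
        CUDA_CHECK(cudaMemPrefetchAsync(in, bytes, gpuID, 0));
        CUDA_CHECK(cudaMemPrefetchAsync(out, bytes, gpuID, 0));
    }

    add_one<<<grid_size(n, 256), 256>>>(in, out, n);
    CUDA_CHECK(cudaGetLastError());

    if (concurrent)   // bring the results back before the CPU reads them
        CUDA_CHECK(cudaMemPrefetchAsync(out, bytes, cudaCpuDeviceId, 0));
    CUDA_CHECK(cudaDeviceSynchronize());

    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (out[i] != (float)i + 1.0f) ++mismatches;
    printf("%zu mismatches\n", mismatches);

    CUDA_CHECK(cudaFree(in));
    CUDA_CHECK(cudaFree(out));
    return mismatches == 0 ? 0 : 1;
}
```

Without the hints the program still produces the same result; pages then migrate on demand when the GPU or CPU first touches them.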
Hardware Coherency & ATS
– Hardware coherency
• CPU can directly access and cache GPU memory
• Native atomics support
– Address Translation Services (ATS)
• Allows the GPU to access the CPU’s page tables directly
• System allocator support: malloc, stack, global, file system
Simplified programming with ATS
// Heap memory from the system allocator
float *data = (float *)malloc(size);
gpu_kernel<<<…>>>(data);
// Stack memory
float data[1024];
gpu_kernel<<<…>>>(data);
// Global memory
extern float *data;
gpu_kernel<<<…>>>(data);
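Whether a GPU can dereference plain `malloc` memory can be queried at run time. The sketch below checks the `cudaDevAttrPageableMemoryAccess` attribute and falls back to managed memory when it is not set; the kernel, sizes, and helper names are hypothetical, and the fallback path is an illustrative choice rather than something the slides prescribe.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical kernel: doubles every element in place.
__global__ void gpu_kernel(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Number of thread blocks needed to cover n elements.
static unsigned int grid_size(size_t n, unsigned int threads)
{
    return (unsigned int)((n + threads - 1) / threads);
}

int main()
{
    const int gpuID = 0;                     // assumed device
    const size_t n = (size_t)1 << 20;        // assumed problem size
    const size_t size = n * sizeof(float);

    // Non-zero when the GPU can access pageable (system-allocated) memory.
    int pageable = 0;
    CUDA_CHECK(cudaDeviceGetAttribute(&pageable,
                                      cudaDevAttrPageableMemoryAccess, gpuID));

    float *data = nullptr;
    if (pageable) {
        data = (float *)malloc(size);               // plain system allocator
        if (data == nullptr) return EXIT_FAILURE;
    } else {
        CUDA_CHECK(cudaMallocManaged(&data, size)); // fallback: managed memory
    }

    for (size_t i = 0; i < n; ++i) data[i] = (float)i;

    gpu_kernel<<<grid_size(n, 256), 256>>>(data, n);   // same pointer on the GPU
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (data[i] != 2.0f * (float)i) ++mismatches;
    printf("pageable access: %d, %zu mismatches\n", pageable, mismatches);

    if (pageable) free(data);
    else CUDA_CHECK(cudaFree(data));
    return mismatches == 0 ? 0 : 1;
}
```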
CUDA Aware MPI
– Avoids staging GPU buffers in host memory
– Runs applications efficiently
– IBM Spectrum MPI is CUDA-aware
Code without CUDA-Aware MPI (GPU buffers staged through host memory)
//MPI Rank 0
cudaMemcpy(…, DeviceToHost)
MPI_Send()
//MPI Rank 1
MPI_Recv()
cudaMemcpy(…, HostToDevice)
Code with CUDA-Aware MPI (GPU buffers passed directly)
//MPI Rank 0
MPI_Send()  // send buffer is a GPU pointer
//MPI Rank 1
MPI_Recv()  // receive buffer is a GPU pointer
https://devblogs.nvidia.com/introduction-cuda-aware-mpi/
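With a CUDA-aware MPI library, device pointers go straight into the MPI calls. The sketch below assumes an MPI build with CUDA support and at least two ranks; how GPU support is enabled at launch depends on the MPI library and is not shown. The kernel, message size, and helper names are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernel: fills a device buffer with one value.
__global__ void fill(float *buf, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = value;
}

// Number of thread blocks needed to cover n elements.
static int grid_size(int n, int threads)
{
    return (n + threads - 1) / threads;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // Simple rank-to-GPU mapping (assumption: ranks share one node).
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0)
        MPI_Abort(MPI_COMM_WORLD, 1);
    cudaSetDevice(rank % ndev);

    const int n = 1 << 20;                    // assumed message size
    float *d_buf = nullptr;
    if (cudaMalloc(&d_buf, n * sizeof(float)) != cudaSuccess)
        MPI_Abort(MPI_COMM_WORLD, 1);

    if (rank == 0) {
        fill<<<grid_size(n, 256), 256>>>(d_buf, n, 42.0f);
        cudaDeviceSynchronize();              // data must be ready before sending
        // Device pointer passed directly; no staging copy to host memory.
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        float first = 0.0f;                   // copy one element back only to print it
        cudaMemcpy(&first, d_buf, sizeof(float), cudaMemcpyDeviceToHost);
        printf("rank 1 received, first element = %.1f\n", first);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

With an MPI library that is not CUDA-aware, passing `d_buf` to `MPI_Send` or `MPI_Recv` is invalid, and the staged version with explicit `cudaMemcpy` calls is required instead.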
GPUDirect RDMA
– Data exchange between the GPU and other peer devices using standard PCIe features
– Network devices access GPU memory directly, bypassing the host
Monitoring and Profiling tools
Monitoring
➢ mpstat, vmstat – CPU and memory utilization
➢ numastat – NUMA memory statistics
➢ top/htop – real-time view of system usage
Profiling
➢ perf record/report – CPU profiling
➢ nvprof – GPU profiling
[Figure: numastat output with CPU memory and GPU memory labeled, and nvidia-smi output.]
Also check “nvidia-smi --query-gpu” for more monitoring options
nvprof
• nvprof is a command-line profiling tool that lets you collect and view profiling data
• With nvprof you can collect:
• kernel execution time
• memory transfers
• memory sets and CUDA API calls
• events or metrics for CUDA kernels
NVVP (NVIDIA Visual Profiler)
• The Visual Profiler displays a timeline of your application's activity on both the CPU and GPU so that you can identify opportunities for performance improvement
• Visualizes profile data collected by nvprof
• More documentation: https://docs.nvidia.com/cuda/profiler-users-guide/index.html
[Figure: NVIDIA Visual Profiler timeline showing data transfer and compute phases.]
Conclusion
➢ AC922 is designed for supercomputers
➢ Better performance for HPC applications
➢ High-speed NVLink interconnect between CPU and GPU
➢ Simplified programming using Unified Memory, ATS, and OpenMP
Thank you
Pidad D’Souza
Power System Performance Architect
—
pidsouza@in.ibm.com
+91-80-4177 6526
ibm.com

Ac922 cdac webinar

  • 1. IBM Power System AC922 : The Brain Behind the Supercomputer — Pidad D’Souza([email protected]) Aditya Nitsure([email protected]) Power System Performance, ISDL, IBM, Bengaluru
  • 2. Agenda ● AC922 System Components ● AC922 Characteristics ● System Features
  • 3. IBM Systems at Supercomputing 2019 / © 2019 IBM Corporation The most powerful supercomputer on the planet 4 out of 11 Top500 are IBM Power9 Systems 3 ▪ 4,608 IBM AC922 nodes ▪ 200 peta FLOPS ▪ 27,648 NVIDIA Tesla GPUs ▪ 25 gigabytes per second between nodes ▪ 13 MW Energy Exascale Energy Budget: 20-40 megawatts (MW) Innovations in Hardware and Software ▪ Processor/Accelerators ▪ Memory ▪ Interconnect ▪ Spectrum MPI, Math Libraries 4 out of Top 10 Green500 systems are IBM Power9 systems AiMOS – Green500 No. 3 with 15.72 GFlops/Watt
  • 5. + Rest of Sequential CPU Code Compute-Intensive Code Application Code GPU Acceleration 5 CPU – Large and broad instruction set to perform complex operations GPU – High throughput – Massive parallelization through large number of cores – Specialized for SIMD/SIMT Heterogenous Computing Maximize performance and energy efficiency
  • 6. – NVLink 2.0 : High-Bandwidth Interconnect o 150 bi-directional bandwidth (or 100 GB/s for 6 GPU config) between CPU-GPU and GPU-GPU – Coherent access to CPU memory Summit and Sierra Supercomputer configurations 6 Nvidia V100 NVLink 150GB/s DDR4 170GB/s POWER9 PCIe4.0 CAPI 2.0 NVLink 150GB/s NVLink 100GB/s DDR4 170GB/s POWER9 NVLink 100GB/s Sierra (4 GPU Half Node) Summit (6 GPU Half Node) IB PCIe4.0 CAPI 2.0 Coherent access to system memory Nvidia V100 • CPU and GPU co-operate in execution of work • GPU coherently access to CPU memory Coherent access to system memory IB
  • 7. 7 – Delivers unprecedented performance for modern HPC, analytics, and artificial intelligence (AI) – Designed to fully exploit the capabilities of CPU and GPU accelerators – Eliminates I/O bottlenecks and allows sharing memory across GPUs and CPUs – Extraordinary POWER9 CPUs – 2-6 NVIDIA® Tesla® V100 GPUs with NVLink – Co-optimized hardware and software for deep learning and AI – Supports up to 5.6x more I/O bandwidth than competitive servers – Combines the cutting edge AI innovation Data Scientists desire with the dependability IT – Next Gen PCIe - PCIe Gen4 2x faster IBM POWER9 AC922 Server 7
  • 8. 8 – Designed for AI Computing and HPC – Second-Generation NVLink™ – HBM2 Memory: Faster, Higher Efficiency – Enhanced Unified Memory and Address Translation Services – Maximum Performance and Maximum Efficiency Modes – Number of SM/cores : 80/5120 – Double Precision Performance : 7.5 TFLOPS – Single Precision Performance : 15 TFLOPS – 125 Tensor TFLOPS – GPU Memory : 16 or 32 GB – Memory bandwidth : 900 GB/s https://devblogs.nvidia.com/inside-volta Nvidia Tesla V100 GPU
  • 10. Boost application performance with sustained peak memory bandwidth of ~280GB/s CPU STREAM Bandwidth STREAM benchmark ( https://www.cs.virginia.edu/stream/) *not submitted 10
  • 11. NVIDIA Volta 100 Compute – Single and Double Precision –Applications to have more compute power –Shorten time to completion –Accomplish more simulation/experiment –1.5x higher compute than NVIDIA P100 GPUs 0 2 4 6 8 10 12 14 16 S822LC + P100 AC922 + V100 4.8 7.45 9.8 15.3 Compute-TFLOPS NVIDIA V100 SGEMM and DGEMM DGEMM SGEMM 1.5x higher 11
  • 12. NVIDIA V100 GPU memory bandwidth (GPU STREAM) 0 100 200 300 400 500 600 700 800 900 S822LC + P100 AC922 + V100 512 840 Bandwidth-GB/s 840 GB/s 1.6x Higher –1.6x Higher Bandwidth than NVIDIA P100 –Speed up of memory intensive applications Theoretical 12
  • 13. 0 10 20 30 40 50 60 70 Xeon E5- 2640 V4 + P100 S822LC + P100 AC922 + 6 V100 AC922 + 4 V100 12 34.16 45.9 68 Bandwidth–GB/s CPU to GPU NVLink Vs PCIe3 bandwidth 5.6x better 3.8x better 2x –NVLink 2.0 is 5.6x better than PCIe3 –Remove CPU-GPU Data transfer bottlenecks 2.8x better 1.34x Note: NVIDIA bandwidth test used for measurement 13
  • 14. NVLINK Bandwidth with varied data sizes –Minimize communication latencies –Unlock PCIe bottlenecks –Transfer larger data at high speed –Ideal for data size larger than GPU memory 0 10000 20000 30000 40000 50000 60000 70000 80000 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 Bandwidth-MB/s Data Size - KBytes NVLink2.0 vs PCIe3 Host to Device Bandwidth 2NVLinkPerGPU 3NVLinkPerGPU PCIe3 14
  • 15. Workload Optimized Frequency (WOF) – Boost performance of less active workload through higher frequency – Lower the frequency to save power or boost other cores – Maximize performance through dynamically adjusting processor frequency – Governing factors • Processor utilization, Number of active cores & Environment condition – Power Saver Modes • Dynamic Performance Mode(DPM) • Maximum Performance Mode(MPM) 15
  • 16. IBM Systems at Supercomputing 2019 / © 2019 IBM Corporation HPC Interconnect –Multi-Host Adapter (Mellanox ConnectX-5 EDR) • Latency : sub-600 nanoseconds • Bandwidth : 2 ports of 100Gb/s • Message Rate : 200M messages/second –Adapter Features • Switch based collectives - SHARP • Hardware Tag Matching • User mode memory registration(UMR) • GPU Direct RDMA • Tunneled Atomics P9 X-Bus x8x8 IB - EDR P9 16
  • 17. Bi-section Bandwidth & All Reduce scaling on Summit
    – Good scaling at large scale due to ~74% of bisection bandwidth with adaptive routing enabled
    – Spectrum MPI (SMPI) supports HCOLL (FCA) & SHARP, which lets applications run with the best collective performance
    Source: The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, SC18.
  • 18. Burst buffer performance on Summit
    – Improved application performance through the burst buffer
    – Applications are not bottlenecked on I/O operations to parallel file systems
    Source: The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, SC18.
  • 20. HPC Application Acceleration Methodologies
    o Unified Memory
    o Coherency
    o ATS
    o OpenMP
  • 21. CUDA Programming
      cudaMallocHost(&h_data, size);                               // Allocate memory on the host
      cudaMalloc(&d_data, size);                                   // Allocate memory on the GPU
      init_dataCPU(h_data);
      cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);    // Move data to GPU (destination first)
      gpu_kernel<<<…>>>(d_data);                                   // GPU compute
      cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);    // Move results back to CPU
      cpu_processing(h_data);
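The explicit-copy flow on this slide can be written out as a small complete program. This is a minimal sketch, not the presenters' code: the kernel `scale_kernel`, the `CHECK` macro, and the buffer size are hypothetical names and values chosen for illustration. It follows the same steps (pinned host allocation, device allocation, copy in, kernel, copy out) and relies on `cudaMemcpy` taking the destination pointer first.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s (line %d)\n",          \
                    cudaGetErrorString(err_), __LINE__);           \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical kernel for illustration: doubles every element in place.
__global__ void scale_kernel(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 20;               // illustrative problem size
    const size_t size = n * sizeof(float);
    float *h_data = nullptr, *d_data = nullptr;

    CHECK(cudaMallocHost((void **)&h_data, size));   // pinned host memory
    CHECK(cudaMalloc((void **)&d_data, size));       // GPU memory
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f; // initialise on the CPU

    // cudaMemcpy(destination, source, bytes, direction)
    CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));

    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    scale_kernel<<<blocks, threads>>>(d_data, n);    // GPU compute
    CHECK(cudaGetLastError());

    // Blocking copy back; it also waits for the kernel to finish.
    CHECK(cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost));
    printf("h_data[0] = %.1f\n", h_data[0]);         // CPU-side processing

    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    return 0;
}
```

Every buffer here exists twice, once on each side, and the program has to keep the two copies in step by hand.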
  • 22. Unified Memory Programming
    – Single memory address space accessible to both CPU & GPU
    – Enables oversubscribing memory
      • Computation on data sizes larger than GPU memory
    – System-wide atomic memory operations
    – Transparent memory migration between CPU and GPU, depending on which one accesses the data
      • Explicit migration through cudaMemPrefetchAsync()
    – Allocating unified memory
      • Replace “malloc” & “new” with “cudaMallocManaged”
    Diagram labels: GPU, CPU, Unified Memory
  • 23. Unified Memory Advice
    – ReadMostly
      • Data is mostly read, occasionally written
      • Pages are duplicated; writes are possible but expensive
    – PreferredLocation
      • Specifies the preferred location for data
      • “Resists” migrations away from the preferred location
    – AccessedBy
      • Establishes mappings so the data can be accessed directly, avoiding migrations
      char *data;
      cudaMallocManaged(&data, size);
      init_dataCPU(data, size);
      cudaMemPrefetchAsync(data, size, gpuID);
      cudaMemAdvise(data, size, …ReadMostly, gpuID);
      gpuKernel<<<…>>>(data, size);    // Transparent data migration to GPU
      cudaDeviceSynchronize();
      use_dataCPU(data, size);         // Data migrates back to CPU
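The managed-memory flow on this slide can be sketched as a complete program. The kernel `add_one`, the `CHECK` macro, and the sizes are hypothetical. The sketch assumes the CUDA 9 to 12 runtime signatures, in which `cudaMemAdvise` and `cudaMemPrefetchAsync` take an integer device ID; newer toolkits changed these signatures, so treat the two calls as version-dependent. Prefetch and advice are issued only when the device reports concurrent managed access.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s (line %d)\n",          \
                    cudaGetErrorString(err_), __LINE__);           \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical kernel: reads the read-mostly input, writes input + 1.
__global__ void add_one(const char *in, char *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = (char)(in[i] + 1);
}

int main() {
    const size_t n = 1 << 20;   // illustrative size in bytes
    const int gpuID = 0;        // assumption: use the first GPU
    char *in = nullptr, *out = nullptr;

    CHECK(cudaSetDevice(gpuID));
    // One pointer per buffer, valid on both CPU and GPU.
    CHECK(cudaMallocManaged((void **)&in, n));
    CHECK(cudaMallocManaged((void **)&out, n));
    for (size_t i = 0; i < n; ++i) in[i] = 1;   // first touch on the CPU

    int concurrent = 0;
    CHECK(cudaDeviceGetAttribute(&concurrent,
                                 cudaDevAttrConcurrentManagedAccess, gpuID));
    if (concurrent) {
        // Hint: input is mostly read, so read-only copies may be duplicated.
        CHECK(cudaMemAdvise(in, n, cudaMemAdviseSetReadMostly, gpuID));
        // Explicit migration to the GPU ahead of the kernel launch.
        CHECK(cudaMemPrefetchAsync(in, n, gpuID, 0));
    }

    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    add_one<<<blocks, threads>>>(in, out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());   // required before the CPU reads results

    printf("out[0] = %d\n", out[0]);  // pages migrate back on CPU access

    CHECK(cudaFree(in));
    CHECK(cudaFree(out));
    return 0;
}
```

Without the prefetch the program still runs; pages then migrate on demand at the first GPU access, which usually costs more page faults.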
  • 24. Hardware Coherency & ATS
    – Hardware coherency
      • CPU can directly access and cache GPU memory
      • Native atomics support
    – Address Translation Services (ATS)
      • Allows the GPU to access the CPU’s page tables directly
      • System allocator support: malloc, stack, global, file system
    Simplified programming with ATS
      *data = malloc(size);    gpu_kernel<<<…>>>(data);    // heap (malloc)
      data[1024];              gpu_kernel<<<…>>>(data);    // stack
      extern float *data;      gpu_kernel<<<…>>>(data);    // global
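The malloc case from this slide can be sketched as a complete program. The kernel `increment` and the `CHECK` macro are hypothetical names. Passing a plain `malloc` pointer to a kernel works only where the GPU can access pageable system memory, so the sketch first queries `cudaDevAttrPageableMemoryAccess` and exits cleanly on systems that do not report support.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s (line %d)\n",          \
                    cudaGetErrorString(err_), __LINE__);           \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical kernel: adds 1 to every element of a system-allocated array.
__global__ void increment(int *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const size_t n = 1 << 20;   // illustrative element count
    const int gpuID = 0;        // assumption: use the first GPU

    // Can this GPU dereference ordinary pageable host memory?
    int pageable = 0;
    CHECK(cudaDeviceGetAttribute(&pageable,
                                 cudaDevAttrPageableMemoryAccess, gpuID));
    if (!pageable) {
        printf("GPU cannot access pageable memory; skipping example\n");
        return 0;
    }

    // Ordinary system allocation: no cudaMalloc, no cudaMallocManaged.
    int *data = (int *)malloc(n * sizeof(int));
    if (data == nullptr) return EXIT_FAILURE;
    for (size_t i = 0; i < n; ++i) data[i] = (int)i;

    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    increment<<<blocks, threads>>>(data, n);   // same pointer as on the CPU
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    printf("data[10] = %d\n", data[10]);       // CPU reads the GPU's update
    free(data);
    return 0;
}
```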
  • 25. CUDA Aware MPI
    – Avoids staging of GPU buffers in host memory
    – Runs applications efficiently
    – IBM Spectrum MPI is CUDA-aware
    Code without CUDA-Aware MPI (GPU buffers staged through host memory)
      // MPI Rank 0
      cudaMemcpy(…, DeviceToHost)
      MPI_Send()
      // MPI Rank 1
      MPI_Recv()
      cudaMemcpy(…, HostToDevice)
    Code with CUDA-Aware MPI (GPU buffers passed directly to MPI)
      // MPI Rank 0
      MPI_Send()
      // MPI Rank 1
      MPI_Recv()
    Source: https://devblogs.nvidia.com/introduction-cuda-aware-mpi/
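A CUDA-aware exchange between two ranks can be sketched as follows. This is an illustrative program, not taken from the slide or the linked article: buffer names, message size, and the tag are hypothetical. It assumes an MPI library built with CUDA support and launched in whatever way that library requires to enable it. With a non-CUDA-aware MPI, passing device pointers like this is invalid.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return EXIT_FAILURE;
    }

    const int n = 1 << 20;   // illustrative message length (floats)
    const int tag = 0;       // hypothetical message tag
    float *d_buf = nullptr;
    if (cudaMalloc((void **)&d_buf, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaMalloc failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    if (rank == 0) {
        // Setup only: fill the device buffer with a known value.
        std::vector<float> init(n, 3.0f);
        cudaMemcpy(d_buf, init.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);
        // The device pointer goes straight to MPI; no staging copy.
        MPI_Send(d_buf, n, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Receive directly into GPU memory.
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        // Copy one element back only to check the transfer.
        float first = 0.0f;
        cudaMemcpy(&first, d_buf, sizeof(float), cudaMemcpyDeviceToHost);
        printf("rank 1 received first element %.1f\n", first);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

On multi-GPU nodes each rank would normally select its own device with `cudaSetDevice` before allocating; that step is left out here to keep the sketch short.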
  • 26. GPU Direct RDMA
    – Data exchange between the GPU and other peer devices using PCIe standards
    – Network devices directly access GPU memory, bypassing the host
  • 28. Monitoring and Profiling tools
    Monitoring
      ➢ mpstat, vmstat: CPU and memory utilization
      ➢ numastat: NUMA memory statistics
      ➢ top/htop: real-time view of system usage
    Profiling
      ➢ perf record/report: CPU profiling
      ➢ nvprof: GPU profiling
    Figure labels: numastat, CPU memory, GPU memory
  • 29. Monitoring and Profiling tools: nvidia-smi
    – Also check “nvidia-smi --query-gpu” for more monitoring options
  • 30. Monitoring and Profiling tools
    nvprof
      • nvprof is a command-line profiling tool that lets you collect and view profiling data
      • With nvprof, one can collect:
        – kernel execution time
        – memory transfers
        – memory set and CUDA API calls
        – events or metrics for CUDA kernels
    NVVP (NVIDIA Visual Profiler)
      • The Visual Profiler displays a timeline of your application's activity on both the CPU and GPU, so that you can identify opportunities for performance improvement
      • Visualizes profile data collected by nvprof
      • More documentation: https://docs.nvidia.com/cuda/profiler-users-guide/index.html
  • 32. Conclusion
    ➢ AC922 is designed for supercomputers
    ➢ Better performance for HPC applications
    ➢ High-speed NVLink interconnect between CPU & GPU
    ➢ Simplified programming using Unified Memory, ATS, and OpenMP
  • 33. References
    – IBM Power System AC922 Introduction and Technical Overview
    – NVIDIA Volta GPU
    – IBM Power Systems Proof Points
    – Unified Memory on P9+V100
    – Summit Supercomputer
    – Sierra Supercomputer
  • 36. Thank you
    Pidad D’Souza, Power System Performance Architect
    pidsouza@in.ibm.com, +91-80-4177 6526, ibm.com