SUMMIT
SUPERCOMPUTER
Supervisor: Dr. R. Venkatesan
Presentation by: Vigneshwar Ramaswamy
M.A.Sc. in Computer Engineering
MUN ID: 201990029
Memorial University of Newfoundland, Canada
Summit Supercomputer Architecture 1
Outline
• Introduction
• Summit Overview
• Specifications of Summit
• IBM Power9 Architecture
• NVIDIA Tesla V100 Architecture
• Interconnect
• Application
Summit Supercomputer Architecture 2
Introduction
• Summit was the fastest computer in the world from November 2018 to June 2020.
• Ranked 2nd on the TOP500 list, with 148.6 PFLOPS on the High Performance Linpack (HPL) benchmark.
• Ranked 8th on the Green500 list, with a power efficiency of 14.719 GFLOPS/watt.
• From June 2018 to 2020, Summit topped the HPCG benchmark and was used by 5 of the 6 Gordon Bell Prize finalist teams.
• Summit was the first supercomputer to reach the exaop scale (exa operations per second), achieving 1.88
exaops during a genomic analysis, and it is expected to reach 3.3 exaops using mixed-precision
calculations.
Summit Supercomputer Architecture 3
Summit Overview and Specifications
• Processor: IBM POWER9™ (2/node)
• GPUs: 27,648 NVIDIA Volta V100s (6/node)
• Theoretical peak (Rpeak) performance: 200 PFLOPS
• Linpack (Rmax) performance: 148.6 PFLOPS
• Total cores: 2,414,592 (see the breakdown after this slide)
• Storage capacity: 250 petabytes
• Nodes: 4,608
• Memory per node: 512 GB DDR4 + 96 GB HBM2 (over 0.5 TB of coherent memory addressable by both CPUs and GPUs)
• NV memory per node: 1,600 GB
• Total system memory: >10 PB (DDR4 + HBM2 + non-volatile)
Summit Supercomputer Architecture 4
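As a quick check on these totals, the per-node figures multiply out as follows (a sketch; it assumes the TOP500 convention of counting each V100 streaming multiprocessor as one GPU "core", together with the 22-core POWER9 parts listed in the comparison table on slide 19):

\[
4608 \ \text{nodes} \times 6 \ \text{GPUs/node} = 27{,}648 \ \text{GPUs}
\]
\[
4608 \times \bigl(2 \times 22 \ \text{CPU cores} + 6 \times 80 \ \text{GPU SMs}\bigr) = 4608 \times 524 = 2{,}414{,}592 \ \text{cores}
\]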
Summit Overview and Specifications
• Interconnect topology: Mellanox EDR 100G InfiniBand, non-blocking fat tree
• 25 gigabytes per second of bandwidth between nodes
• In-network computing acceleration for communication frameworks such as MPI (Message Passing
Interface); a minimal MPI sketch follows this slide.
• Peak power consumption: 13 MW
• Operating system: Red Hat Enterprise Linux (RHEL) version 7.4
Summit Supercomputer Architecture 5
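To make the MPI bullet concrete, here is a minimal sketch of passing a GPU-resident buffer between two ranks. It assumes a CUDA-aware MPI implementation (Summit's IBM Spectrum MPI is one) so that device pointers can be handed directly to MPI calls; the file name, buffer size, and message tag are illustrative.

```cuda
// pingpong.cu -- minimal sketch: send a GPU-resident buffer from rank 0 to rank 1.
// Build with the system's MPI compiler wrapper and link the CUDA runtime (details vary by system).
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                  // ~1M floats (~4 MB), illustrative size
    float *buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));    // device buffer handed directly to MPI

    if (rank == 0) {
        MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats over the interconnect\n", n);
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```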
Summit Nodes
Summit Supercomputer Architecture 6
FIGURE 1: SUMMIT NODE BLOCK DIAGRAM
SOURCE: Summit, Oak Ridge National Laboratory (official web page), https://www.olcf.ornl.gov/summit/
IBM POWER9 Processor
• Summit’s POWER9 processor contains 24 SMT4 cores
(4 hardware threads per core); the TOP500 entry on
slide 19 lists 22 active cores per chip as configured.
The resulting thread counts are worked out after this slide.
• Peripheral Component Interconnect Express
(PCI Express) Gen4
• NVLink 2.0
• 14 nm FinFET semiconductor process with 8.0
billion transistors
• High-bandwidth signaling technology
• 16 Gb/s interface – local SMP
• 25 Gb/s interface – 25G Link – accelerators,
remote SMP
Summit Supercomputer Architecture 7
FIGURE 2: POWER9 ARCHITECTURE
SOURCE: S. K. Sadasivam, B. W. Thompto, R. Kalla and W. J. Starke, "IBM Power9
Processor Architecture," in IEEE Micro, vol. 37, no. 2, pp. 40-51, Mar.-Apr.
2017.doi: 10.1109/MM.2017.40
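For scale, the SMT4 figure above translates into the following hardware-thread counts (a quick sketch; the per-node number uses the two sockets per node from slide 4 and the 22-core parts listed on slide 19):

\[
24 \ \text{cores} \times 4 \ \text{threads/core} = 96 \ \text{hardware threads per chip}
\]
\[
2 \ \text{sockets} \times 22 \ \text{cores} \times 4 \ \text{threads/core} = 176 \ \text{hardware threads per Summit node}
\]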
Core pipeline
• The microarchitecture has a reduced pipeline
length compared to POWER8.
• It removes the instruction-grouping technique
used in POWER8.
• It introduces new features that proactively
avoid hazards in the load/store unit (LSU)
and improve the LSU’s execution efficiency.
• It can complete up to 128 instructions per
cycle (SMT4 core).
• New lock-management control improves
performance.
Summit Supercomputer Architecture 8
FIGURE 3: POWER9 VS POWER8 PIPELINE STAGES
SOURCE: S. K. Sadasivam, B. W. Thompto, R. Kalla and W. J. Starke, "IBM Power9
Processor Architecture," in IEEE Micro, vol. 37, no. 2, pp. 40-51, Mar.-Apr.
2017.doi: 10.1109/MM.2017.40
Key components of Power9 core
Summit Supercomputer Architecture 9
Figure 4: SMT4 Core Figure 5: SMT8 Core
Figure 6: Power9 SMT4 core. The detailed core block diagram
shows all the key components of the Power9 core.
Cache Capacity of
POWER9 Processor
• L1I: 32 KiB (per core, 8-way set associative)
• L1D: 32 KiB (per core, 8-way)
• L2: 512 KiB (per pair of cores)
• L3: 120 MiB eDRAM, 20-way (chip-level cache totals are sketched after this slide)
Summit Supercomputer Architecture 10
FIGURE 7: SMT8 Cache
SOURCE: S. K. Sadasivam, B. W. Thompto, R. Kalla and W. J. Starke, "IBM Power9 Processor Architecture," in IEEE
Micro, vol. 37, no. 2, pp. 40-51, Mar.-Apr. 2017.doi: 10.1109/MM.2017.40
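The per-core and per-pair figures above imply the following chip-level totals (a sketch assuming the 24-core chip from slide 7, i.e. 12 core pairs):

\[
24 \times 32 \ \text{KiB} = 768 \ \text{KiB of L1I (and likewise L1D)},\qquad 12 \times 512 \ \text{KiB} = 6 \ \text{MiB of L2}
\]

and the 120 MiB of L3 works out to 10 MiB of eDRAM per core pair.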
NVIDIA Tesla V100
GPU Architecture
• The GV100 GPU is built with 21 billion transistors.
• Peak double-precision (FP64) performance of
7.8 TFLOP/s.
• Peak single-precision (FP32) performance of
15.7 TFLOP/s.
• The full GV100 chip has 5,376 FP32 cores, 5,376 INT32
cores, 2,688 FP64 cores, 672 Tensor Cores, and 336
texture units.
• Eight 512-bit memory controllers manage access to the
16 GB of HBM2 memory (these properties can be
confirmed with the device-query sketch after this slide).
• 6 MB of L2 cache shared by the SMs.
• NVIDIA’s NVLink interconnect passes data
between GPUs as well as between CPU and GPU.
Summit Supercomputer Architecture 11
FIGURE 8: NVIDIA TESLA V100 GPU ARCHITECTURE
SOURCE: NVIDIA TESLA V100 GPU Architecture, White paper,
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-
whitepaper.pdf
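A minimal sketch of inspecting such figures at runtime with the CUDA runtime API; the fields below are standard members of cudaDeviceProp, and on a Tesla V100 the printed values reflect the 80-SM production part rather than the full GV100 die quoted above.

```cuda
// query_v100.cu -- print a few of the device properties discussed above.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // properties of GPU 0

    printf("Name:               %s\n", prop.name);
    printf("SM count:           %d\n", prop.multiProcessorCount);
    printf("Global memory:      %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("L2 cache:           %.1f MB\n", prop.l2CacheSize / 1e6);
    printf("Memory bus width:   %d bits\n", prop.memoryBusWidth);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```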
Volta Streaming Multiprocessor
• The Volta Streaming Multiprocessor (SM) architecture delivers major improvements in
performance and energy efficiency.
• New mixed-precision Tensor Cores.
• 50% higher energy efficiency on general compute workloads than the previous (Pascal) generation.
• High-performance L1 data cache.
• Each V100 SM has 64 FP32 cores and 32 FP64 cores.
• Supports more threads, warps, and thread blocks than prior GPU generations.
• A 128 KB block combining shared memory and L1 cache can be configured to allow up to 96 KB of
shared memory (a configuration sketch follows this slide).
• Each SM has four texture units, which also use the combined L1/texture cache.
Summit Supercomputer Architecture 12
FIGURE 9: VOLTA GV100 Streaming
Multiprocessor (SM)
SOURCE: NVIDIA TESLA V100 GPU Architecture,
White paper, https://images.nvidia.com/content/volta-
architecture/pdf/volta-architecture-whitepaper.pdf
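A minimal sketch of opting a kernel into the larger shared-memory carveout on Volta. The kernel, block size, and buffer here are illustrative; cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize is the runtime call that raises the per-kernel limit beyond the 48 KB default.

```cuda
// smem_opt_in.cu -- request up to 96 KB of dynamic shared memory on Volta (compile for sm_70+).
#include <cuda_runtime.h>

__global__ void my_kernel(float *out) {
    extern __shared__ float tile[];       // dynamically sized shared memory
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

int main() {
    const int smem_bytes = 96 * 1024;     // 96 KB, the Volta per-block maximum

    // Kernels must opt in before launching with more than 48 KB of dynamic shared memory.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smem_bytes);

    float *out;
    cudaMalloc(&out, 256 * sizeof(float));
    my_kernel<<<1, 256, smem_bytes>>>(out);   // third launch parameter = dynamic shared memory
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```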
Tensor Cores
• The V100 GPU contains 640 Tensor Cores: eight
(8) per SM, two (2) per processing block
(partition) within an SM.
• Each Tensor Core performs 64 floating-point
FMA (fused multiply-add) operations per clock;
the resulting peak throughput is worked out
after this slide.
• For deep learning training, Tensor Cores
provide up to 12x higher peak TFLOPS on
Tesla V100 compared to Pascal.
• For deep learning inference, Tensor Cores
provide up to 6x higher peak TFLOPS on
Tesla V100 compared to Pascal.
Summit Supercomputer Architecture 13
FIGURE 10: Pascal and Volta 4 x 4 matrix multiplication
SOURCE: NVIDIA TESLA V100 GPU Architecture, White paper,
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-
whitepaper.pdf
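Putting those counts together gives the headline Tensor Core throughput (a sketch; it assumes the V100 boost clock of roughly 1,530 MHz and counts each FMA as two floating-point operations):

\[
640 \ \text{Tensor Cores} \times 64 \ \tfrac{\text{FMA}}{\text{clock}} \times 2 \ \tfrac{\text{FLOP}}{\text{FMA}} \times 1.53 \ \text{GHz} \approx 125 \ \text{TFLOPS (mixed precision)}
\]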
Tensor cores
• Each Tensor Core operates on 4x4 matrices and performs
the operation D = A×B + C, where A, B, C, and D are
4x4 matrices.
• Each FP16 multiply produces a full-precision product, which is
accumulated in FP32 to form the result (a CUDA code sketch follows this slide).
Summit Supercomputer Architecture 14
FIGURE 11: Tensor Core 4 x 4 Matrix Multiply and
accumulate
FIGURE 12: Mixed Precision Multiply and Accumulate in
Tensor core
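In CUDA, Tensor Cores are reached through the warp-level matrix multiply-accumulate (WMMA) API, which exposes 16x16x16 tiles that the hardware decomposes into the 4x4 operations above. A minimal sketch follows; the pointers and the leading dimension of 16 are illustrative, and a real kernel would tile over a larger matrix.

```cuda
// wmma_sketch.cu -- one warp computes D = A*B + C on a 16x16 tile using Tensor Cores.
// Compile for compute capability 7.0 or newer (e.g., nvcc -arch=sm_70).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16(const half *A, const half *B, const float *C, float *D) {
    // Fragments live in registers, distributed across the 32 threads of the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // FP16 inputs
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major); // FP32 accumulator

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // D = A*B + C

    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp for this one-tile sketch:
//   wmma_16x16<<<1, 32>>>(A, B, C, D);
```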
Performance of Tensor Cores on Matrix
Multiplications
Summit Supercomputer Architecture 15
FIGURE 13: Single precision (FP32) FIGURE 14: Mixed precision
NVIDIA NVLink
• In the Summit supercomputer, the
Tesla V100 accelerators and the
POWER9 CPUs are connected with
NVLink.
• It delivers higher bandwidth than
PCIe interconnects.
• Each link provides 25 gigabytes per
second in each direction (an
aggregate-bandwidth sketch follows this slide).
Summit Supercomputer Architecture 16
FIGURE 15: NVIDIA NVLink
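A sketch of the resulting CPU–GPU bandwidth, assuming the six-GPU Summit node layout in which each V100 is wired to its host POWER9 by two NVLink bricks:

\[
2 \ \text{links} \times 25 \ \tfrac{\text{GB}}{\text{s}} = 50 \ \tfrac{\text{GB}}{\text{s}} \ \text{per direction, i.e. } 100 \ \tfrac{\text{GB}}{\text{s}} \ \text{bidirectional between each GPU and its host CPU.}
\]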
Interconnect
• Nodes are connected by a Mellanox dual-rail EDR InfiniBand network.
• The dual-rail EDR (Enhanced Data Rate) 100 Gb/s links give each node 2 × 100 Gb/s = 200 Gb/s, about
25 GB/s of injection bandwidth.
• The same InfiniBand fabric carries both storage and inter-process communication traffic.
• All nodes are interconnected in a non-blocking fat-tree topology.
• The fat tree is implemented as a three-level tree.
Summit Supercomputer Architecture 17
FIGURE 16: ConnectX-5 adapter and interface
with POWER9 chips
FIGURE 17: Fat Tree Topology
Application: Finding Drug Compounds to Fight the
Coronavirus
• Summit was used to screen a library of 8,000 known FDA-approved drug compounds against the
coronavirus.
• The candidate set was narrowed down to 77 compounds in just 2 days.
• Summit used the virus genome to search for a very specific type of drug compound.
• For comparison, Fugaku, the world’s fastest computer, was used for molecule-level
simulations:
• it narrowed 2,128 existing drugs down to 12 that bind easily to the virus’s proteins, in 10 days.
• Fugaku can perform more than 415 quadrillion computations per second, about 2.8 times faster than Summit.
Summit Supercomputer Architecture 18
Comparison with other Supercomputers
Summit Supercomputer Architecture 19
| Rank | Rmax (PFLOPS) | Name | Model | Processor | Cores | Interconnect | Memory | Manufacturer | Operating system | Rpeak (PFLOPS) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 415.530 | Fugaku | Supercomputer Fugaku | A64FX 48C 2.2 GHz | 7,299,072 | Tofu Interconnect D | 4,866,048 GB | Fujitsu | Red Hat Enterprise Linux | 513.855 |
| 2 | 148.600 | Summit | IBM Power System AC922 | IBM POWER9 22C 3.07 GHz | 2,414,592 | Dual-rail Mellanox EDR InfiniBand | 2,801,664 GB | IBM | RHEL 7.4 | 200.795 |
| 3 | 94.640 | Sierra | IBM Power System AC922 | IBM POWER9 22C 3.07 GHz | 1,572,480 | Dual-rail Mellanox EDR InfiniBand | 1,382,400 GB | IBM | RHEL 7.4 | 125.712 |
| 4 | 93.014 | Sunway TaihuLight | Sunway MPP | Sunway SW26010 260C 1.45 GHz | 10,649,600 | Sunway | 1,310,720 GB | NRCPC | Sunway RaiseOS 2.0.5 | 125.436 |
Supercomputer development
over the past 27 years
Summit Supercomputer Architecture 20
(Images: CM-5 Supercomputer, Fugaku Supercomputer, Sunway TaihuLight, Summit Supercomputer)
• Thank you
Summit Supercomputer Architecture 21