© 2018 Arm Limited
Dr. Oliver Perks (olly.perks@arm.com)
16th September 2019
Arm as a Viable Architecture
For HPC and AI
HPC and AI Advisory Council 2019
Leicester

70% of the world’s population uses Arm technology

Not just mobile phones!

History of Arm in HPC: A Busy Decade
2011: Calxeda
• 32-bit Armv7-A – Cortex-A9
2011-2015: Mont-Blanc 1
• 32-bit Armv7-A
• Cortex-A15
• First Arm HPC system
2014: AMD Opteron A1100
• 64-bit Armv8-A
• Cortex-A57
• 4-8 cores
2015: Cavium ThunderX
• 64-bit Armv8-A
• 48 cores
2017: (Cavium) Marvell ThunderX2
• 64-bit Armv8-A
• 32 cores
2019: Fujitsu A64FX
• First Arm chip with SVE vectorisation
• 48 cores

Marvell ThunderX2 CN99XX
• Marvell’s next generation 64-bit Arm processor
• Taken from Broadcom Vulcan
• 32 cores @ 2.2 GHz (other SKUs available)
• 4-way SMT (up to 256 threads / node)
• Fully out-of-order execution
• 8 DDR4 memory channels (~250 GB/s dual socket)
– vs. 6 on Skylake
• Available in dual SoC configurations
• CCPI2 interconnect
• 180-200 W / socket
• No SVE vectorisation
• 128-bit NEON vectorisation
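
The thread count and memory figures on this slide follow from simple arithmetic. A minimal sketch, assuming DDR4-2666 DIMMs and an 8-byte-wide channel (the slide gives only the channel count, so the memory speed is an assumption); the function names are illustrative:

```python
def hardware_threads(sockets: int, cores_per_socket: int, smt_ways: int) -> int:
    """Hardware threads visible to the OS on one node."""
    return sockets * cores_per_socket * smt_ways


def peak_bandwidth_gb_s(sockets: int, channels_per_socket: int,
                        mega_transfers_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth of one node in GB/s."""
    return sockets * channels_per_socket * mega_transfers_per_s * bytes_per_transfer / 1000


# Figures from the slide: dual socket, 32 cores, 4-way SMT, 8 channels per socket.
threads = hardware_threads(sockets=2, cores_per_socket=32, smt_ways=4)

# ASSUMPTION: DDR4-2666 (2666 MT/s); the slide does not state the memory speed.
peak = peak_bandwidth_gb_s(sockets=2, channels_per_socket=8, mega_transfers_per_s=2666)

print(f"threads per node: {threads}")
print(f"theoretical peak bandwidth: {peak:.1f} GB/s")
print(f"quoted ~250 GB/s as a fraction of that peak: {250 / peak:.0%}")
```

Under that assumption, the quoted ~250 GB/s would be a sustained figure rather than the theoretical peak.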

Fujitsu A64FX
• Chip designed for RIKEN Fugaku (Post-K)
• Based on Arm ISA technology
• 48-core 64-bit Armv8 processor
• + 4 dedicated OS cores
• With SVE vectorisation
• 512-bit vector length
• High performance
• >2.7 TFLOPS
• Low power: 15 GF/W (DGEMM)
• 32 GB HBM2
• No DDR
• 1 TB/s bandwidth
• TOFU 3 interconnect
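
Two derived quantities help when reading these numbers: how many elements fit in one 512-bit SVE vector, and how many bytes of memory bandwidth are available per floating-point operation. A rough sketch using only the slide's figures; the function names are illustrative, and because the slide quotes peak performance as a lower bound (">2.7 TFLOPS"), the balance computed here is an upper bound:

```python
def sve_lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one SVE vector register."""
    return vector_bits // element_bits


def bytes_per_flop(bandwidth_bytes_per_s: float, flops_per_s: float) -> float:
    """Machine balance: bytes of memory traffic available per floating-point operation."""
    return bandwidth_bytes_per_s / flops_per_s


lanes_fp64 = sve_lanes(512, 64)  # double-precision elements per vector
lanes_fp32 = sve_lanes(512, 32)  # single-precision elements per vector

# 1 TB/s of HBM2 bandwidth against the quoted 2.7 TFLOPS lower bound.
balance = bytes_per_flop(1e12, 2.7e12)

print(f"FP64 lanes per vector: {lanes_fp64}")
print(f"FP32 lanes per vector: {lanes_fp32}")
print(f"bytes per flop (upper bound): {balance:.2f}")
```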

Deployments: HPE Astra at Sandia
Mapping performance to real-world mission applications
• HPE Apollo 70
• #156 on Top500
– 1.76 PFLOPS Rmax (2.2 PFLOPS Rpeak)
• Marvell ThunderX2 processors
– 28-core @ 2.0 GHz
– 332 TB aggregate memory capacity
– 885 TB/s of aggregate memory bandwidth
• 2592 HPE Apollo 70 nodes
– 145,152 cores
• Mellanox EDR InfiniBand
• OS: Red Hat
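
The aggregate figures for Astra can be cross-checked against each other. A small sketch using the slide's numbers; the two-sockets-per-node value is an assumption (the slide does not state it, although the quoted core count is consistent with it), and decimal units (1 TB = 1000 GB) are assumed as well:

```python
NODES = 2592
CORES_PER_SOCKET = 28
SOCKETS_PER_NODE = 2  # ASSUMPTION: dual-socket nodes, not stated on the slide

RMAX_PFLOPS = 1.76
RPEAK_PFLOPS = 2.2
AGGREGATE_MEMORY_TB = 332
AGGREGATE_BANDWIDTH_TB_S = 885

total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
hpl_efficiency = RMAX_PFLOPS / RPEAK_PFLOPS

# ASSUMPTION: decimal units, 1 TB = 1000 GB.
memory_per_node_gb = AGGREGATE_MEMORY_TB * 1000 / NODES
bandwidth_per_node_gb_s = AGGREGATE_BANDWIDTH_TB_S * 1000 / NODES

print(f"total cores: {total_cores:,}")
print(f"HPL efficiency (Rmax / Rpeak): {hpl_efficiency:.0%}")
print(f"memory per node: {memory_per_node_gb:.0f} GB")
print(f"memory bandwidth per node: {bandwidth_per_node_gb_s:.0f} GB/s")
```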

Deployments: Isambard @ GW4
• 10,752 Armv8 cores (168 x 2 x 32)
• Marvell ThunderX2 32-core
• 2.2 GHz (Turbo 2.5 GHz)
• Cray XC50 ‘Scout’ form factor
• High-speed Aries interconnect
• Cray HPC optimised software stack
• CCE, Cray MPI, Cray LibSci, CrayPat, …
• OS: CLE (Cray Linux Environment)
• Phase 2 (the Arm part):
– Being used for production science
– Available for industry and academia

Deployments: Catalyst
• Deployments to accelerate the growth of the Arm HPC ecosystem
• Each machine will have:
– 64 HPE Apollo 70 nodes
– Dual 32-core Marvell ThunderX2 processors
– 4096 cores per system
– 256 GB of memory / node
– Mellanox InfiniBand interconnects
– OS: SUSE Linux Enterprise Server for HPC
Bristol: VASP, CASTEP, Gromacs, CP2K, Unified Model, NAMD, Oasis, NEMO, OpenIFS, CASINO, LAMMPS
EPCC: WRF, OpenFOAM, two PhD candidates
Leicester: Data-intensive apps, genomics, MOAB Torque, DiRAC collaboration
Fulhame Catalyst system at EPCC

Deployment: CEA

The Cloud
Open access to server-class Arm
16 cores, A72 @ 2.3 GHz

Software Ecosystem

OpenHPC: Typical HPC packages available for Arm
OpenHPC is a community effort to provide a common, verified set of open source packages for HPC deployments
Arm and partners actively involved:
• Arm is a silver member of OpenHPC
• Linaro is on Technical Steering Committee
• Arm-based machines in the OpenHPC build infrastructure
Status: 1.3.6 release out now
• Packages built on Armv8-A for CentOS and SUSE
Functional Areas: Components include
Base OS: CentOS 7.5, SLES 12 SP3
Administrative Tools: Conman, Ganglia, Lmod, LosF, Nagios, pdsh, pdsh-mod-slurm, prun, EasyBuild, ClusterShell, mrsh, Genders, Shine, test-suite
Provisioning: Warewulf
Resource Mgmt.: SLURM, Munge
I/O Services: Lustre client (community version)
Numerical/Scientific Libraries: Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, SuperLU_Dist, Mumps, OpenBLAS, Scalapack, SLEPc, PLASMA, ptScotch
I/O Libraries: HDF5 (pHDF5), NetCDF (including C++ and Fortran interfaces), Adios
Compiler Families: GNU (gcc, g++, gfortran), LLVM
MPI Families: OpenMPI, MPICH
Development Tools: Autotools (autoconf, automake, libtool), CMake, Valgrind, R, SciPy/NumPy, hwloc
Performance Tools: PAPI, IMB, pdtoolkit, TAU, Scalasca, Score-P, SIONLib

Commercial C/C++/Fortran compiler with best-in-class performance
Tuned for Scientific Computing, HPC and Enterprise workloads
• Processor-specific optimizations for various server-class Arm-based platforms
• Optimal shared-memory parallelism using latest Arm-optimized OpenMP runtime
Linux user-space compiler with latest features
• C++14 and Fortran 2003 language support with OpenMP 4.5*
• Support for Armv8-A and SVE architecture extension
• Based on LLVM and Flang, leading open-source compiler projects
Commercially supported by Arm
• Available for a wide range of Arm-based platforms running leading Linux distributions – Red Hat, SUSE and Ubuntu
Compilers tuned for Scientific Computing and HPC
Latest features and performance optimizations
Commercially supported by Arm
* Without offloading

Commercial 64-bit Armv8-A maths libraries
• Commonly used low-level maths routines – BLAS, LAPACK and FFT
• Optimised maths intrinsics
• Validated with NAG’s test suite, a de facto standard
Best-in-class performance with commercial support
• Tuned by Arm for specific cores, such as TX2 and Cortex-A72
• Maintained and supported by Arm for a wide range of Arm-based SoCs
Silicon partners can provide tuned micro kernels for their SoCs
• Partners can contribute directly through open source routes
• Parallel tuning within our library increases overall application performance
Commercially supported by Arm
Validated with NAG test suite
Best-in-class performance

Application Performance
Traditional HPC

Early Results from Astra

EM (EMPIRE) Code on Astra

Single node performance results
https://github.com/UoB-HPC/benchmarks

ArmPL: DGEMM performance
Excellent serial and parallel performance
[Figure: Arm PL 19.0 DGEMM parallel performance improvement for Cavium ThunderX2 using 56 threads. x-axis: problem size (0 to 8000); y-axis: percentage of peak performance (0 to 100); series: ArmPL 19.0]
Achieving very high performance at the node level, leveraging high core counts and large memory bandwidth
Single-core performance at 95% of peak for DGEMM (not shown)
Parallel performance at 90% of peak
Improved parallel performance for small problem sizes
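
"Percentage of peak" for DGEMM is the measured floating-point rate divided by the theoretical peak of the cores in use. A minimal sketch of that calculation, assuming the usual 2·m·n·k operation count for a matrix multiply. The clock speed, the flops-per-cycle figure and the timing below are hypothetical placeholders, not measurements from the slide:

```python
def dgemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations in C = A @ B with A (m x k) and B (k x n)."""
    return 2 * m * n * k


def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak rate of the cores in use, in GFLOPS."""
    return cores * ghz * flops_per_cycle


def percent_of_peak(m: int, n: int, k: int, seconds: float, peak: float) -> float:
    """Achieved DGEMM rate as a percentage of the given peak (in GFLOPS)."""
    achieved_gflops = dgemm_flops(m, n, k) / seconds / 1e9
    return 100.0 * achieved_gflops / peak


# HYPOTHETICAL machine: 56 cores at 2.0 GHz, 8 double-precision flops per cycle per core.
peak = peak_gflops(cores=56, ghz=2.0, flops_per_cycle=8)

# HYPOTHETICAL timing for an 8000 x 8000 x 8000 DGEMM.
pct = percent_of_peak(8000, 8000, 8000, seconds=1.27, peak=peak)

print(f"peak: {peak:.0f} GFLOPS")
print(f"achieved: {pct:.1f}% of peak")
```

Small problem sizes tend to score lower on this metric because the fixed cost of distributing work across threads is a larger share of the runtime.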

ArmPL 19.3 FFT 3-d complex-to-complex vs FFTW 3.3.8
• FFT timings on TX2
• 3-d complex-to-complex double precision
• Higher means the new Arm Performance Libraries FFT implementation is better than FFTW 3.3.8
• Results show:
– Average about 2.5x faster than FFTW
– Lots of this comes from better usage of vectors on TX2
[Figure: Arm Performance Libraries 19.3 vs FFTW 3.3.8, complex-to-complex double precision 3-d transforms. x-axis: transform dimensions (NxNxN), 0 to 500; y-axis: Arm Performance Libraries speedup over FFTW, 0 to 7. Regions: Arm PL faster than FFTW (speedup > 1); performance parity (speedup = 1); FFTW faster than Arm PL (speedup < 1)]
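
The speedup plotted on this slide is the baseline time divided by the candidate time, so values above 1 favour Arm PL. A short sketch of how such a chart could be summarised; the timings are invented for illustration, and the arithmetic mean is an assumption because the slide does not say which average it uses:

```python
from statistics import mean


def speedup(baseline_time: float, candidate_time: float) -> float:
    """Speedup of the candidate over the baseline; above 1 means the candidate is faster."""
    return baseline_time / candidate_time


def classify(value: float, tolerance: float = 0.05) -> str:
    """Label a speedup the same way the chart's regions do."""
    if value > 1 + tolerance:
        return "Arm PL faster"
    if value < 1 - tolerance:
        return "FFTW faster"
    return "parity"


# HYPOTHETICAL timings in milliseconds: N -> (FFTW time, Arm PL time) for an NxNxN transform.
timings_ms = {
    64: (2.0, 1.0),
    128: (18.0, 6.0),
    256: (200.0, 80.0),
}

speedups = {n: speedup(fftw, armpl) for n, (fftw, armpl) in timings_ms.items()}
average = mean(speedups.values())

for n, value in speedups.items():
    print(f"N={n}: speedup {value:.2f} ({classify(value)})")
print(f"average speedup: {average:.2f}")
```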

Architecture Adoption: Community Engagement
Training Events and Hackathons
• Arm as a viable alternative to x86
• Needs to be easy to port to
• Working codes and performant codes
• Team of field application engineers
• Work with code teams
• Educate, port and optimize
• Successful previous events
• Next event
– PRACE AArch64 Training, Sep 30th - Oct 1st
– Cambridge, free to attend
– https://events.prace-ri.eu/event/900/
Galaxy simulation in SWIFTsim computed on Arm Catalyst during DiRAC Hackathon last week

Machine Learning and Artificial Intelligence

ML Frameworks on server-class AArch64 platforms
• Recent effort to enable server-scale on-CPU ML workloads on AArch64
• Build guides for key frameworks available:
– TensorFlow - https://gitlab.com/arm-hpc/packages/wikis/packages/tensorflow
– PyTorch - https://gitlab.com/arm-hpc/packages/wikis/packages/pytorch
– MXNet - https://gitlab.com/arm-hpc/packages/wikis/packages/mxnet
– And guides for key dependencies: CPython, NumPy, etc.
• Currently focusing on inference problems
• MLPerf (https://mlperf.org) for realistic workloads

TensorFlow and maths libraries on-CPU: as it comes
• Extensive use of x86-specific optimizations
• Provided by MKL-DNN (now Deep Neural Network Library (DNNL))
• Eigen-only builds remove dependence on MKL-derived contraction kernels
• Portable
[Diagram: software stack. Layers: Data & Models, Framework, DL lib, Maths Kernels, Hardware. Components shown: TensorFlow, DNNL, Eigen, x86; data and models: ResNet50, mobilenet, Imagenet, coco. Legend: portable vs. impl. / x86 specific (not portable)]

TensorFlow and maths libraries: on AArch64
• Arm Performance Libraries
• Micro-architecture optimized
• Targeting server-class cores
• High release cadence
• GEMMs at the core of matmul and convolutions
• Leveraging ArmPL has the potential to deliver optimal performance in key kernels for on-CPU, server-scale ML workloads.
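[Diagram: software stack on AArch64. Layers: Data & Models, Framework, DL lib, Maths Kernels, Hardware. Components shown: TensorFlow, Eigen, DNNL, ArmPL, AArch64; data and models: ResNet50, mobilenet, Imagenet, coco. Legend: portable / ported vs. impl. / x86 specific (not portable)]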
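
The point that GEMMs sit at the core of convolutions can be made concrete with the im2col technique: every input patch is unrolled into a column, after which the convolution is a single matrix multiply. A minimal pure-Python sketch for a single-channel, stride-1, unpadded case; this illustrates the general idea and is not how TensorFlow, Eigen or ArmPL implement it, and all function names are illustrative:

```python
def matmul(a, b):
    """Naive GEMM on nested lists: (m x k) @ (k x n) -> (m x n)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def im2col(image, kh, kw):
    """Unroll each kh x kw patch into one column (one row per kernel element)."""
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    patches = [[image[i + di][j + dj] for i in range(out_h) for j in range(out_w)]
               for di in range(kh) for dj in range(kw)]
    return patches, out_h, out_w


def conv2d_via_gemm(image, kernel):
    """2-D cross-correlation (the 'convolution' used in DL) expressed as one GEMM."""
    kh, kw = len(kernel), len(kernel[0])
    patches, out_h, out_w = im2col(image, kh, kw)
    flat_kernel = [[value for row in kernel for value in row]]  # 1 x (kh * kw)
    flat_out = matmul(flat_kernel, patches)[0]                  # 1 x (out_h * out_w)
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


def conv2d_direct(image, kernel):
    """Reference implementation with explicit sliding-window loops."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

result = conv2d_via_gemm(image, kernel)
print(result)
print(result == conv2d_direct(image, kernel))
```

With many filters the flattened kernels form a matrix with one row per filter, so the work becomes a larger GEMM of the kind a tuned maths library is designed to accelerate.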

ML for HPC
A collaborative approach
• Early-career STFC Innovation Placement project
• Six-month project based at the Manchester Design Office
• The wider ML community is developing new reference HPC benchmarks to expose issues at scale.
• This project will engage with the community driving this work and utilise the benchmark problems to highlight and diagnose performance bottlenecks.
• Complements work underway at Arm towards enabling server-scale on-CPU ML
• Demonstration of using platforms like Catalyst for HPC ML.
Going Forward

The Future of Arm in HPC
What’s next?
Processors
• By more vendors
– Marvell, Ampere, Amazon, HiSilicon, Fujitsu
• Targeting different market segments
• All built on the Arm ecosystem
• Supported by the tools
Deployments
• Large scale and small scale deployments
• Increased exposure to the architecture
• More applications and libraries ported to Arm
– Including ISVs
• Increased community
Commitment from Arm
• Neoverse – IP roadmap for silicon vendors
• Investment in software ecosystem
– E.g. F18
• Support for customers
– Applications
– Software
– Performance
Arm HPC Ecosystem website: www.arm.com/hpc
Get involved
• News, events, blogs, webinars, etc.
• Guides for porting HPC applications
• Quick-start guides for tools
• Links to community collaboration sites
• Arm HPC Users Group (AHUG)
The Arm trademarks featured in this presentation are registered
trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks
featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks
