Arm as a Viable Architecture for HPC and AI

© 2018 Arm Limited
Dr. Oliver Perks (olly.perks@arm.com)
16th September 2019
Arm as a Viable Architecture
For HPC and AI
HPC and AI Advisory Council 2019
Leicester

70% of the world’s population
uses Arm technology

Not just mobile
phones!

History of Arm in HPC: A Busy Decade
2011: Calxeda
• 32-bit Armv7-A, Cortex-A9
2011-2015: Mont-Blanc 1
• 32-bit Armv7-A
• Cortex-A15
• First Arm HPC system
2014: AMD Opteron A1100
• 64-bit Armv8-A
• Cortex-A57
• 4-8 cores
2015: Cavium ThunderX
• 64-bit Armv8-A
• 48 cores
2017: (Cavium) Marvell ThunderX2
• 64-bit Armv8-A
• 32 cores
2019: Fujitsu A64FX
• First Arm chip with SVE vectorisation
• 48 cores

Marvell ThunderX2 CN99XX
• Marvell’s next generation 64-bit Arm processor
• Taken from Broadcom Vulcan
• 32 cores @ 2.2 GHz (other SKUs available)
• 4 Way SMT (up to 256 threads / node)
• Fully out of order execution
• 8 DDR4 Memory channels (~250 GB/s Dual socket)
– Vs 6 on Skylake
• Available in dual SoC configurations
• CCPI2 interconnect
• 180-200 W / socket
• No SVE vectorisation
• 128-bit NEON vectorisation
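
The "up to 256 threads / node" figure above follows from the core count, the SMT width and the socket count. A minimal sketch of that arithmetic, assuming a dual-socket node built from the 32-core SKU (the function name is illustrative and not part of any Arm or Marvell tool):

```python
def hardware_threads_per_node(cores_per_socket: int, smt_ways: int, sockets: int) -> int:
    """Number of hardware threads the OS sees on one node."""
    return cores_per_socket * smt_ways * sockets


# Values taken from the slide: 32 cores, 4-way SMT, dual-socket configuration.
threads = hardware_threads_per_node(cores_per_socket=32, smt_ways=4, sockets=2)
print(threads)  # 256
```

Other SKUs (for example the 28-core part) would give a proportionally lower count.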

Fujitsu A64FX
• Chip designed for RIKEN Fugaku (POST-K)
• Based on Arm ISA technology
• 48 core 64-bit Armv8 processor
• + 4 dedicated OS cores
• With SVE vectorisation
• 512 bit vector length
• High performance
• >2.7 TFLOPs
• Low power : 15GF/W (dgemm)
• 32 GB HBM2
• No DDR
• 1 TB/s bandwidth
• TOFU 3 interconnect
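
One way to read the bandwidth and FLOP figures together is as a machine balance, the bytes of memory traffic available per floating-point operation. The sketch below uses only the two headline numbers from the slide; because the slide quotes ">2.7 TFLOPs", the result should be read as an upper bound, not a measured value.

```python
def machine_balance(bandwidth_bytes_per_s: float, peak_flops_per_s: float) -> float:
    """Bytes of memory bandwidth available per floating-point operation."""
    return bandwidth_bytes_per_s / peak_flops_per_s


# Slide figures: 1 TB/s HBM2 bandwidth, >2.7 TFLOPs peak.
a64fx_balance = machine_balance(bandwidth_bytes_per_s=1.0e12, peak_flops_per_s=2.7e12)
print(f"{a64fx_balance:.2f} bytes/flop")  # 0.37 bytes/flop
```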

Deployments: HPE Astra at Sandia
Mapping performance to real-world mission applications
• HPE Apollo 70
• #156 on Top500
– 1.76 PFLOPs Rmax (2.2 PFLOPs Rpeak)
• Marvell ThunderX2 processors
– 28-core @ 2.0 GHz
– 332 TB aggregate memory capacity
– 885 TB/s of aggregate memory bandwidth
• 2592 HPE Apollo 70 nodes
– 145,152 cores
• Mellanox EDR InfiniBand
• OS: Red Hat
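
The aggregate figures on this slide can be cross-checked against each other by dividing through by the node count. A short sketch, assuming decimal units (1 TB = 1000 GB); every input is copied from the slide and the variable names are illustrative:

```python
nodes = 2592
total_cores = 145_152
cores_per_socket = 28
total_memory_tb = 332
total_bandwidth_tb_per_s = 885
rmax_pflops, rpeak_pflops = 1.76, 2.2

cores_per_node = total_cores // nodes                      # 56
sockets_per_node = cores_per_node // cores_per_socket      # 2, i.e. dual-socket nodes
memory_per_node_gb = total_memory_tb * 1000 / nodes        # about 128 GB
bandwidth_per_node_gb_per_s = total_bandwidth_tb_per_s * 1000 / nodes  # about 341 GB/s
hpl_efficiency = rmax_pflops / rpeak_pflops                # Rmax as a fraction of Rpeak

print(cores_per_node, sockets_per_node)
print(round(memory_per_node_gb), round(bandwidth_per_node_gb_per_s))
print(f"{hpl_efficiency:.0%}")
```

The per-node bandwidth obtained this way is a theoretical aggregate, so it is not directly comparable with the approximate 250 GB/s dual-socket figure quoted on the ThunderX2 slide.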

Deployments: Isambard @ GW4
• 10,752 Armv8 cores (168 x 2 x 32)
• Marvell ThunderX2 32-core
• 2.2 GHz (Turbo 2.5 GHz)
• Cray XC50 ‘Scout’ form factor
• High-speed Aries interconnect
• Cray HPC optimised software stack
• CCE, Cray MPI, Cray LibSci, CrayPat, …
• OS: CLE (Cray Linux Environment)
• Phase 2 (the Arm part):
• Being used for production science
• Available for industry and academia

Deployments: Catalyst
• Deployments to accelerate the growth of the Arm HPC ecosystem
• Each machine will have:
• 64 HPE Apollo 70 nodes
• Dual 32-core Marvell ThunderX2 processors
• 4096 cores per system
• 256 GB of memory / node
• Mellanox InfiniBand interconnects
• OS: SUSE Linux Enterprise Server for HPC
Sites and workloads:
• Bristol: VASP, CASTEP, Gromacs, CP2K, Unified Model, NAMD, Oasis, NEMO, OpenIFS, CASINO, LAMMPS
• EPCC: WRF, OpenFOAM, Two PhD candidates
• Leicester: Data-intensive apps, genomics, MOAB Torque, DiRAC collaboration
[Image: Fulhame Catalyst system at EPCC]

Deployment: CEA

The Cloud
Open access to server class Arm
16 Cores A72 @ 2.3 GHz

Software Ecosystem

OpenHPC: Typical HPC packages available for Arm
OpenHPC is a community effort to provide a common,
verified set of open source packages for HPC
deployments
Arm and partners actively involved:
• Arm is a silver member of OpenHPC
• Linaro is on Technical Steering Committee
• Arm-based machines in the OpenHPC build
infrastructure
Status: 1.3.6 release out now
• Packages built on Armv8-A for CentOS and SUSE
Functional areas and the components they include:
• Base OS: CentOS 7.5, SLES 12 SP3
• Administrative Tools: Conman, Ganglia, Lmod, LosF, Nagios, pdsh, pdsh-mod-slurm, prun, EasyBuild, ClusterShell, mrsh, Genders, Shine, test-suite
• Provisioning: Warewulf
• Resource Mgmt.: SLURM, Munge
• I/O Services: Lustre client (community version)
• Numerical/Scientific Libraries: Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, SuperLU_Dist, Mumps, OpenBLAS, Scalapack, SLEPc, PLASMA, ptScotch
• I/O Libraries: HDF5 (pHDF5), NetCDF (including C++ and Fortran interfaces), Adios
• Compiler Families: GNU (gcc, g++, gfortran), LLVM
• MPI Families: OpenMPI, MPICH
• Development Tools: Autotools (autoconf, automake, libtool), CMake, Valgrind, R, SciPy/NumPy, hwloc
• Performance Tools: PAPI, IMB, pdtoolkit, TAU, Scalasca, Score-P, SIONLib

Commercial C/C++/Fortran compiler with best-in-class performance
Tuned for Scientific Computing, HPC and Enterprise workloads
• Processor-specific optimizations for various server-class Arm-based platforms
• Optimal shared-memory parallelism using latest Arm-optimized OpenMP runtime
Linux user-space compiler with latest features
• C++14 and Fortran 2003 language support with OpenMP 4.5*
• Support for Armv8-A and SVE architecture extension
• Based on LLVM and Flang, leading open-source compiler projects
Commercially supported by Arm
• Available for a wide range of Arm-based platforms running leading Linux distributions – Red Hat, SUSE and Ubuntu
* Without offloading

Commercial 64-bit Armv8-A maths libraries
• Commonly used low-level maths routines – BLAS, LAPACK and FFT
• Optimised maths intrinsics
• Validated with NAG’s test suite, a de facto standard
Best-in-class performance with commercial support
• Tuned by Arm for specific cores, such as ThunderX2 and Cortex-A72
• Maintained and supported by Arm for wide range of Arm-based SoCs
Silicon partners can provide tuned micro kernels for their SoCs
• Partners can contribute directly through open source routes
• Parallel tuning within our library increases overall application performance

Application Performance
Traditional HPC

Early Results from Astra

EM (EMPIRE) Code on Astra

Single node performance results
https://github.com/UoB-HPC/benchmarks

ArmPL: DGEMM performance
Excellent serial and parallel performance
[Figure: Arm PL 19.0 DGEMM parallel performance improvement for Cavium ThunderX2 using 56 threads. x-axis: problem size (0 to 8000); y-axis: percentage of peak performance (0 to 100); series: ArmPL 19.0.]
Achieving very high performance at the node
level leveraging high core counts and large
memory bandwidth
Single core performance at
95% of peak for DGEMM (not shown)
Parallel performance at 90% of peak
Improved parallel performance for small
problem sizes
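
For reference, a "percentage of peak" figure for DGEMM is usually obtained by counting 2·m·n·k floating-point operations for the multiply and dividing the achieved rate by the theoretical peak. The sketch below shows that calculation only; the timing and the peak rate in the usage example are hypothetical placeholders, not measurements from the slide, and the helper names are illustrative.

```python
def dgemm_flops(m: int, n: int, k: int) -> int:
    """Conventional operation count for C = alpha*A*B + beta*C (A is m x k, B is k x n)."""
    return 2 * m * n * k


def percent_of_peak(flops: float, seconds: float, peak_flops_per_s: float) -> float:
    """Achieved floating-point rate as a percentage of the theoretical peak."""
    return 100.0 * (flops / seconds) / peak_flops_per_s


# Hypothetical example: a square problem of size 4000, a made-up run time,
# and a made-up peak rate for the node. Substitute measured values in practice.
n = 4000
elapsed_seconds = 0.25      # hypothetical
node_peak = 8.0e11          # hypothetical peak in flop/s
print(f"{percent_of_peak(dgemm_flops(n, n, n), elapsed_seconds, node_peak):.1f}% of peak")
```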

ArmPL 19.3 FFT 3-d complex-to-complex vs FFTW 3.3.8
• FFT timings on TX2
• 3-d complex-complex double
precision
• Higher means the new Arm Performance Libraries FFT implementation is faster than FFTW 3.3.8
• Results show:
• On average about 2.5x faster than FFTW
• Much of this comes from better use of vector instructions on TX2
[Figure: Arm Performance Libraries 19.3 vs FFTW 3.3.8, complex-to-complex double precision 3-d transforms. x-axis: transform dimensions (NxNxN, 0 to 500); y-axis: Arm Performance Libraries speedup over FFTW (0 to 7). Speedup > 1: Arm PL faster than FFTW; speedup = 1: performance parity; speedup < 1: FFTW faster than Arm PL.]
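
The speedup plotted for each transform size is the ratio of the FFTW time to the Arm PL time. A minimal sketch of that ratio and of the three regions marked on the chart; the timings in the usage example are hypothetical, and the function names are illustrative.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def classify(ratio: float) -> str:
    """Map a speedup ratio onto the three regions marked on the chart."""
    if ratio > 1.0:
        return "Arm PL faster than FFTW"
    if ratio < 1.0:
        return "FFTW faster than Arm PL"
    return "Performance parity"


# Hypothetical timings for a single transform size (not taken from the slide).
fftw_seconds, armpl_seconds = 0.50, 0.20
ratio = speedup(fftw_seconds, armpl_seconds)
print(ratio, classify(ratio))
```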

Architecture Adoption: Community Engagement
Training Events and Hackathons
• Arm as a viable alternative to x86
• Needs to be easy to port to
• Working codes and performant codes
• Team of field application engineers
• Work with code teams
• Educate, port and optimize
• Successful previous events
• Next event
• PRACE AArch64 Training Sep 30th - Oct 1st
• Cambridge - Free to attend
• https://events.prace-ri.eu/event/900/
Galaxy simulation in SWIFTsim computed on Arm
Catalyst during DiRAC Hackathon last week

Machine Learning and Artificial Intelligence

ML Frameworks on server-class AArch64 platforms
• Recent effort to enable server-scale on-CPU ML workloads on AArch64
• Build guides for key frameworks available:
• TensorFlow - https://gitlab.com/arm-hpc/packages/wikis/packages/tensorflow
• PyTorch - https://gitlab.com/arm-hpc/packages/wikis/packages/pytorch
• MXNet - https://gitlab.com/arm-hpc/packages/wikis/packages/mxnet
• And guides for key dependencies: CPython, NumPy, etc.
• Currently focusing on inference problems
• MLPerf (https://mlperf.org) for realistic workloads.
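
Before following any of the build guides, it can be useful to confirm that the Python interpreter is actually running on a 64-bit Arm host. The sketch below uses only the standard library; the helper name is illustrative and is not taken from the guides. Linux reports the architecture as `aarch64`, while macOS reports `arm64`.

```python
import platform


def is_aarch64(machine: str = "") -> bool:
    """True if the given (or current) machine string names a 64-bit Arm host."""
    name = (machine or platform.machine()).lower()
    return name in ("aarch64", "arm64")


print(platform.machine(), is_aarch64())
```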

TensorFlow and maths libraries on-CPU: as it comes
• Extensive use of x86 specific
optimizations
• Provided by MKL-DNN (now Deep
Neural Network Library (DNNL))
• Eigen-only builds remove
dependence on MKL-derived
contraction kernels
• Portable
[Diagram: on-CPU TensorFlow software stack on x86. Layers: Data & Models (ResNet50, MobileNet, ImageNet, COCO), Framework (TensorFlow), DL lib (DNNL), Maths Kernels (Eigen), Hardware (x86). Legend: portable vs. implementation / x86 specific (not portable).]

TensorFlow and maths libraries: on AArch64
• Arm Performance Libraries
• Micro-architecture optimized
• Targeting server class cores
• High release cadence
• GEMMs at the core of matmul and
convolutions
• Leveraging ArmPL has the potential to deliver optimal performance in key kernels for on-CPU, server-scale ML workloads.
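[Diagram: on-CPU TensorFlow software stack on AArch64. Layers: Data & Models (ResNet50, MobileNet, ImageNet, COCO), Framework, DL lib, Maths Kernels, Hardware. Components shown: TensorFlow, Eigen, DNNL, ArmPL, AArch64. Legend: portable / ported vs. implementation / x86 specific (not portable).]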
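
To illustrate why GEMM sits at the core of convolutions, the sketch below lowers a small 2-D convolution to a single matrix multiply using the common im2col technique. It is written in plain Python for clarity; whether a particular TensorFlow, Eigen or DNNL build uses this exact lowering is not stated in the slides, and all function names here are illustrative.

```python
def matmul(a, b):
    """Naive GEMM on lists of lists: (m x k) times (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one row."""
    h, w = len(image), len(image[0])
    return [[image[r + i][c + j] for i in range(kh) for j in range(kw)]
            for r in range(h - kh + 1)
            for c in range(w - kw + 1)]


def conv2d_via_gemm(image, kernel):
    """Valid-mode 2-D cross-correlation (the ML 'convolution') as one GEMM."""
    kh, kw = len(kernel), len(kernel[0])
    patches = im2col(image, kh, kw)                  # (out_h * out_w) x (kh * kw)
    weights = [[v] for row in kernel for v in row]   # (kh * kw) x 1
    flat = [row[0] for row in matmul(patches, weights)]
    out_w = len(image[0]) - kw + 1
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d_via_gemm(image, kernel))  # [[-4, -4], [-4, -4]]
```

In a real framework the `matmul` step is where a tuned BLAS library would be called, which is why GEMM performance matters for these workloads.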

ML for HPC
A collaborative approach
• Early-career STFC Innovation Placement project
• Six-month project based at the Manchester Design Office
• The wider ML community is developing new reference HPC benchmarks to expose
issues at scale.
• This project will engage with the community driving this work and utilise the benchmark
problems to highlight and diagnose performance bottlenecks.
• Complements work underway at Arm towards enabling server-scale on-CPU ML
• Demonstration of using platforms like Catalyst for HPC ML.

The Future of Arm in HPC
What’s next?
Processors
• By more vendors
• Marvell, Ampere, Amazon, HiSilicon, Fujitsu
• Targeting different market segments
• All built on the Arm ecosystem
• Supported by the tools
Deployments
• Large scale and small scale deployments
• Increased exposure to the architecture
• More applications and libraries ported to Arm
• Including ISVs
• Increased community
Commitment from Arm
• Neoverse – IP roadmap for silicon vendors
• Investment in software ecosystem
• E.g. F18
• Support for customers
• Applications
• Software
• Performance

Arm HPC Ecosystem website: www.arm.com/hpc
Get involved
• News, events, blogs, webinars, etc.
• Guides for porting HPC applications
• Quick-start guides for tools
• Links to community collaboration sites
• Arm HPC Users Group (AHUG)

The Arm trademarks featured in this presentation are registered
trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks
featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks