Programming Languages & Tools for Higher Performance & Productivity
Hitoshi Murai (RIKEN)
Shun Kamatsuka (Fujitsu)
Tomotake Nakamura (Fujitsu)
Dec. 13, 2017 ARM HPC Workshop 1
Introduction of this Session
■ Programming environments play a crucial role in achieving higher performance & productivity on HPC systems:
⦁ languages
⦁ compilers
⦁ tools
⦁ libraries
■ RIKEN AICS and Fujitsu are collaborating to design the programming environment of the upcoming post-K computer.
Agenda of this Session
1. XcalableMP PGAS Language
⦁ by Hitoshi Murai
2. Advantages of the Compiler for Post-K Computer
⦁ by Shun Kamatsuka
3. Overview of Programming Assistance Tools for Post-K Computer
⦁ by Tomotake Nakamura
XcalableMP PGAS Language
Hitoshi Murai (RIKEN)
Introduction
■ Message Passing Interface (MPI) is a de facto standard for programming distributed-memory HPC systems.
■ Programming with MPI is very hard work.
We are developing the XcalableMP (XMP) PGAS language for post-K, which could provide both high performance and productivity.
What's PGAS?
■ Partitioned Global Address Space
■ "Global"
⦁ All processes or threads share one address space and can access any data in it.
■ "Partitioned"
⦁ Remote and local data are distinguished and may differ in the manner and cost of access.
[Figure: processes p0, p1, p2, p3, each with a private address space, sharing one partitioned global address space (PGAS)]
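The "global" and "partitioned" properties can be illustrated with a small toy model. This is only a sketch in Python, not XMP code: the class name `PartitionedGlobalArray`, its methods, and the contiguous split of the index space are assumptions made for illustration. Any process can name any global index, but each access reports whether it was local or remote, which is where the different access costs come from.

```python
class PartitionedGlobalArray:
    """Toy model of a PGAS array: one global index space, split into per-process parts."""

    def __init__(self, size, nprocs):
        self.size = size
        self.nprocs = nprocs
        self.chunk = -(-size // nprocs)  # ceiling division: elements per process
        # Each process owns one contiguous part of the global index space.
        self.parts = [
            [0.0] * max(0, min(self.chunk, size - p * self.chunk))
            for p in range(nprocs)
        ]

    def locate(self, gidx):
        """Map a global index to (owning process, offset inside that process's part)."""
        if not 0 <= gidx < self.size:
            raise IndexError(gidx)
        return gidx // self.chunk, gidx % self.chunk

    def write(self, me, gidx, value):
        """Process `me` writes a global element; returns the kind of access."""
        owner, off = self.locate(gidx)
        self.parts[owner][off] = value
        return "local" if owner == me else "remote"

    def read(self, me, gidx):
        """Process `me` reads a global element; returns (value, kind of access)."""
        owner, off = self.locate(gidx)
        return self.parts[owner][off], ("local" if owner == me else "remote")


pga = PartitionedGlobalArray(size=16, nprocs=4)  # p0..p3 own 4 elements each
print(pga.write(me=0, gidx=2, value=1.5))  # local: index 2 belongs to p0
print(pga.write(me=0, gidx=9, value=2.5))  # remote: index 9 belongs to p2
print(pga.read(me=2, gidx=9))              # (2.5, 'local')
```

In a real PGAS runtime the "remote" branch would turn into communication (for example a one-sided put or get), while the "local" branch is an ordinary memory access.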
What's XcalableMP?
■ A directive-based PGAS language
⦁ Extension of C/Fortran.
⦁ Latest ver. 1.3 is available at: www.xcalablemp.org
⦁ Defined by the XMP WG of the PC Cluster Consortium.
■ Two models of PGAS for distributed-memory parallel programming:
⦁ Global view (data/work mapping directives)
⦁ Local view (coarray)
■ Interoperable with other languages and models (e.g. Python, MPI, OpenMP, OpenACC)
Two Parallelization Models in XMP
■ Global view
⦁ Users specify how a set of nodes cooperates to solve the whole problem.
⦁ Rich directives for data/work mapping and communication.
⦁ Highly productive, but suited mainly to data parallelism.
■ Local view
⦁ Users specify how each node works to solve its part of the problem.
⦁ Coarray of Fortran 2008.
⦁ Less productive, but more flexible.
Example of a Global-view XMP Program
real, dimension(lx,ly,lz) :: sr, se, ...
...
do iz = 1, lz-1
  do iy = 1, ly
    do ix = 1, lx
      wu0 = sm(ix,iy,iz ) / sr(ix,iy,iz )
      wu1 = sm(ix,iy,iz+1) / sr(ix,iy,iz+1)
      wv0 = sn(ix,iy,iz ) / sr(ix,iy,iz )
      ...
Example of a Global-view XMP Program
! data mapping
!$xmp nodes p(npx,npy,npz)
!$xmp template (lx,ly,lz) :: t
!$xmp distribute (block,block,block) onto p :: t
real, dimension(lx,ly,lz) :: sr, se, ...
!$xmp align (ix,iy,iz) with t(ix,iy,iz) ::
!$xmp& sr, se, sm, sp, sn, sl, ...
!$xmp shadow (1,1,1) ::
!$xmp& sr, se, sm, sp, sn, sl, ...
...
! stencil communication
!$xmp reflect (sr, sm, sp, se, sn, sl)
! work mapping (parallel loops)
!$xmp loop (ix,iy,iz) on t(ix,iy,iz)
do iz = 1, lz-1
  do iy = 1, ly
    do ix = 1, lx
      wu0 = sm(ix,iy,iz ) / sr(ix,iy,iz )
      wu1 = sm(ix,iy,iz+1) / sr(ix,iy,iz+1)
      wv0 = sn(ix,iy,iz ) / sr(ix,iy,iz )
      ...
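To see why the example needs both the shadow and the reflect directive, it can help to work out which indices each node owns in one dimension. The Python sketch below is a model, not XMP code. It assumes a block distribution with chunk size ceil(n / nprocs); the function names and the sizes `lz = 16`, `npz = 4` are hypothetical.

```python
def block_range(n, nprocs, rank):
    """1-based inclusive (lo, hi) owned by `rank` under a block distribution.

    Assumes a chunk size of ceil(n / nprocs); returns None if the rank owns nothing.
    """
    chunk = -(-n // nprocs)  # ceiling division
    lo = rank * chunk + 1
    hi = min(n, (rank + 1) * chunk)
    return (lo, hi) if lo <= hi else None


def with_shadow(n, owned, width=1):
    """Extend an owned range by `width` halo elements on each side, clipped to 1..n."""
    lo, hi = owned
    return max(1, lo - width), min(n, hi + width)


lz, npz = 16, 4  # hypothetical extent and node count along the z dimension
for rank in range(npz):
    owned = block_range(lz, npz, rank)
    print(rank, owned, with_shadow(lz, owned))
# 0 (1, 4) (1, 5)
# 1 (5, 8) (4, 9)
# 2 (9, 12) (8, 13)
# 3 (13, 16) (12, 16)
```

The loop body reads `sr(ix,iy,iz+1)`. For the last `iz` a node owns, that element belongs to the next node, so the node keeps a one-element-wide copy of it in the shadow area, and reflect refreshes those copies before the loop runs.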
Local-view Programming
■ Coarray, a PGAS feature of Fortran 2008, is available in XMP/C as well as in XMP/Fortran.
■ Basic idea: data declared as a coarray can be accessed by remote nodes.
XMP/Fortran:
real a(1024)[*], b(1024)      ! (1)
a(513:1024)[1] = b(1:512)     ! (2)
sync all                      ! (3)
XMP/C:
float a[1024]:[*], b[1024];   // (1)
a[512:512]:[0] = b[0:512];    // (2)
xmp_sync_all(NULL);           // (3)
1. An array a is declared as a coarray.
2. A local array section b(1:512) is put to a remote array section a(513:1024) on image 1.
3. A memory fence and barrier synchronization are performed.
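The two notations describe array sections differently, which is easy to misread. A Fortran section is written lower:upper with both ends included and 1-based indexing, whereas an XMP/C section is written base:length with 0-based indexing. The Python helper below is a hypothetical sketch (its names are not part of XMP) that converts the XMP/C sections from the example into Fortran-style bounds and checks that source and destination have the same number of elements, as an assignment requires.

```python
def c_section_to_fortran(base, length):
    """Convert an XMP/C section [base:length] (0-based start, element count)
    to Fortran-style inclusive 1-based bounds (lower, upper)."""
    return base + 1, base + length


def fortran_section_size(lower, upper):
    """Number of elements in a Fortran section lower:upper (both ends inclusive)."""
    return upper - lower + 1


dst = c_section_to_fortran(512, 512)  # a[512:512] in XMP/C
src = c_section_to_fortran(0, 512)    # b[0:512] in XMP/C
print(dst, src)  # (513, 1024) (1, 512)

# Both sides of the put hold 512 elements.
assert fortran_section_size(*dst) == fortran_section_size(*src) == 512
```

So the XMP/C section a[512:512] covers 0-based indices 512 through 1023, which are Fortran elements 513 through 1024.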
Omni XcalableMP Compiler
■ An open-source reference implementation being developed by RIKEN & U. Tsukuba.
■ Latest ver. 1.2.2 is available at: omni-compiler.org
■ Supported platforms include: K, Fujitsu FX100, NEC SX, IBM BlueGene, Hitachi SR, Cray, Linux clusters, etc.
■ Proven applications include:
⦁ Plasma (3D fluid)
⦁ Seismic Imaging (3D stencil)
⦁ Fusion (Particle-in-Cell)
⦁ etc.
[Figure: compilation flow. An XMP program passes through Omni XMP (Frontend, Translator, Backend), which produces a C/Fortran+MPI program; a C/Fortran compiler then builds the Executable together with the XMP runtime and comm. libraries.]
HPL (of HPC Challenge Benchmarks)
■ Written in the global view of XMP/C.
■ Data is distributed in a block-cyclic manner, and DGEMM is invoked for each block.
■ Communication and computation are overlapped using asynchronous gmove.
double A_L[N][NB];
#pragma xmp align A_L[i][*] with t(*,i)
:
#pragma xmp gmove async(1)
A_L[k:len][0:NB] = A[k:len][j:NB];
:
for (m = j + NB; m < N; m += NB) {
  for (n = j + NB; n < N; n += NB) {
    cblas_dgemm(&A[m][n], ..);
    if (xmp_test_async(1)) {
      // receive A[k:len][j:NB];
      :
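The block-cyclic distribution named on this slide can be sketched in one dimension as follows. This is a Python model, not the XMP runtime's implementation: the function names and the sizes `N`, `NB`, `P` are assumptions chosen for illustration, and a full HPL setup would typically apply the same mapping over a two-dimensional process grid.

```python
def block_cyclic_owner(i, nb, nprocs):
    """Process that owns global index i (0-based) for block size nb."""
    return (i // nb) % nprocs


def block_cyclic_local(i, nb, nprocs):
    """Local index of global index i on its owning process."""
    return (i // (nb * nprocs)) * nb + i % nb


N, NB, P = 16, 2, 4  # hypothetical problem size, block size, process count
owners = [block_cyclic_owner(i, NB, P) for i in range(N)]
print(owners)  # [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]

# Process 0 holds global indices 0, 1, 8, 9 at local positions 0, 1, 2, 3.
print([block_cyclic_local(i, NB, P) for i in (0, 1, 8, 9)])  # [0, 1, 2, 3]
```

Blocks are dealt out round-robin, so as the factorization advances and the active part of the matrix shrinks, every process still owns some of the remaining blocks. A plain block distribution would leave the processes owning the leading blocks idle.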
[Figure: HPL performance (TFlops, log scale from 10 to 1000) versus number of nodes (256 to 16,384). 423 TFlops (80.7%) on 4,096 nodes; 971 TFlops (46.3%) on 16,384 nodes.]
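The two data points can be cross-checked with a little arithmetic. The slide does not say what the percentages are relative to; the Python sketch below assumes they are fractions of theoretical peak performance, and the function name is hypothetical.

```python
def implied_peak_tflops(achieved_tflops, efficiency_percent):
    """Peak performance implied by an achieved rate and an efficiency in percent."""
    return achieved_tflops / (efficiency_percent / 100.0)


# (nodes, achieved TFlops, efficiency %) as reported on the slide
runs = [(4096, 423.0, 80.7), (16384, 971.0, 46.3)]
for nodes, tflops, pct in runs:
    peak = implied_peak_tflops(tflops, pct)
    per_node_gflops = peak * 1000.0 / nodes
    print(f"{nodes:6d} nodes: peak ~{peak:7.1f} TFlops, ~{per_node_gflops:5.1f} GFlops/node")

# Scaling from 4,096 to 16,384 nodes: 4x the nodes, about 2.3x the performance.
print(f"speedup: {971.0 / 423.0:.2f}x on {16384 // 4096}x nodes")
```

Under that assumption both runs imply roughly 128 GFlops of peak per node, which matches the published per-node peak of the K computer, one of the supported platforms listed earlier. The slide itself does not name the machine, so this is an inference rather than a stated fact.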
NICAM-DC (of Fiber Miniapps)
[Figure: Speedup (MPI/10 = 10) versus Number of MPI Processes (10 to 40), for XMP and MPI]
■ Written in the local view of XMP/Fortran with coarrays.
■ The coarray-based implementation performs almost comparably to the original MPI-based one.
XcalableMP 2.0
■ Dynamic multitasking for manycore processors
⦁ Breakaway from the Bulk Synchronous Parallel (BSP) model.
⦁ More chances to overlap communication and computation.
■ Enhancements of loop parallelization
■ Support for newer versions of the base languages (Fortran 2008, C99, and C++11)
Summary
■ PGAS languages are promising alternatives to MPI.
■ XMP is a directive-based PGAS extension for Fortran and C.
■ XMP supports both global-view and local-view programming to achieve both high performance and productivity.
■ XMP will be available on post-K.
More information is available at: omni-compiler.org and www.xcalablemp.org
