CFD on Power

© 2018 IBM Corporation
Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms.
1
Task-based GPU acceleration in Computational
Fluid Dynamics with OpenMP 4.5 and CUDA in
OpenPOWER platforms.
OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain
June 2018
Samuel Antao
IBM Research, Daresbury, UK

2
IBM Research @ Daresbury, UK

3
IBM Research @ Daresbury, UK – STFC Partnership Mission
• 2015: £313 million investment over next 5 years
• Agreement for IBM Collaborative Research and Development (R&D) that established IBM Research presence
in the UK
• Product and Services Agreement with IBM UK and Ireland
• Access to the latest data-centric and cognitive computing technologies, including IBMs world-class Watson
cognitive computing platform
• Joint commercialization of intellectual property assets produced in the partnership
Help the UK industries and institutions bringing cutting-edge computational science,
engineering and applicable technologies, such as data-centric cognitive computing, to
boost growth and development of the UK economy

4
IBM Research @ Daresbury, UK – People
7
Over 26 computational
scientists and engineers

5
IBM Research @ Daresbury, UK – Research areas
• Case studies:
– Smart Crop Protection -
Precision Agriculture
• Data science + Life sciences
– Improving disease diagnostics
and personalised treatments
• Life sciences + Machine learning
– Cognitive treatment plant
• Engineering + Machine learning
– Parameterisation of engineering
models
• Engineering + Machine learning

6
Task-based GPU acceleration in Computational
Fluid Dynamics with OpenMP 4.5 and CUDA in
OpenPOWER platforms.
OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain
June 2018
Samuel Antao
IBM Research, Daresbury, UK

7
CFD and Algebraic-Multigrid
• Solve set of partial-differential equations over several
time steps
– Discretization:
• Unstructured vs Structured
– Equations:
• Velocity
• Pressure
• Turbulence
• Iterative solvers
– Jacobi
– Gauss-Seidel
– Conjugate Gradient
• Multigrid approaches
– Solve the problem at different resolutions
• Coarse and fine grids/meshes
– Less Iterations for fine grids
• Algebraic multigrid (AMG) – encode mesh
information in algebraic format
– Sparse matrices.
source: http://web.utk.edu/~wfeng1/research.html

8
CFD and Algebraic-Multigrid
source: http://web.utk.edu/~wfeng1/research.html
NVLINKTM
NVLINKTM
NVLINKTM
NVLINKTM
InfiniBandTM
MPI rank
• Grid partitioned by
MPI ranks
• Ranks distributed by
nodes
• More than one rank
executing in one
node
• Challenges:
– Different grids have
different compute
needs
– Access strides vary,
unstructured data
accesses.
– CPU-GPU data
movements
– Regular
communication
between ranks
• Halo elements
• Residuals
• Synchronizations

9
CFD and Algebraic-Multigrid – Code Saturne
• Open-source – developed and maintained by EDF
• 350K lines of code:
– 50% C – 37% Fortran – 13% Python
• Rich ecosystem to configure/parameterise simulations, generate meshes
• History of good scalability
Cores Time in
Solver
Efficiency
262,144 789.79 s -
524,288 403.18 s 97%
MPI Tasks Time in
Solver
Efficiency
524,288 70.114 s -
1,048,576 52.574 s 66%
1,572,864 45.731 s 76%
105B Cell Mesh (MIRA, BGQ)
13B Cell Mesh (MIRA, BGQ)

10
CFD and Algebraic-Multigrid – Execution time distribution
• Many components (kernels)
contribute to total execution
time
• There are data dependencies
between consecutive kernels
• There are opportunities to
keep data in the device
between kernels
• Some kernels may have
lower compute intensity, it
could still be worthwhile
computing them in the GPU if
the data is already there
Gauss-Seidel solver
(Velocity)
Other
Matrix-vector mult.
MSR
Matrix-vector mult.
CSR
Dot products
Multigrid setup
Compute coarse cells
from fine cells
Other AMG-related
Pressure
(AMG)
Single thread profiling - Code Saturne 5.0+

11
Directive-based programming models
• Porting existing code to accelerators is time consuming…
• The sooner we have code running in the GPU the sooner you can start …
– … learning where overheads are
– … identifying what data patterns are being used
– … spotting kernels performing poorly
– … making decisions on what strategies can be used to improve performance
• Directive-based programming models can get you started much quicker
– Don’t need to bother about device memory allocation and data pointers
– Implementation defaults already exploiting device features
– Easily create data environments where data resides in the GPU
– Improve your code portability
• Clang C/C++ and IBM XL C/C++/Fortran compiler provide OpenMP 4.5 support
• PGI C/C++/Fortran compiler provide OpenACC support
• Can be complemented with existing GPU accelerated libraries
– cuSparse
– AMGx
XL
clang

12
• OpenMP 4.5 data environments – Code Saturne 5.0+ snippet – Conjugate Gradient
IBM Conﬁdential GPU acceleration in Code Saturne
1 static cs_sles_convergence_state_t _conjugate_gradient (/* ... */)
2 {
3 # pragma omp target data if (n_rows > GPU_THRESHOLD )
4 /* Move result vector to device and copied it back at the ned of the scope */
5 map(tofrom:vx[: vx_size ])
6 /* Move right -hand side vector to the device */
7 map(to:rhs [: n_rows ])
8 /* Allocate all auxiliary vectors in the device */
9 map(alloc: _aux_vectors [: tmp_size ])
10 {
11
12 /* Solver code */
13
14 }
15 }
Listing 2: OpenMP 4.5 data environment for a level of the AMG solver.
during the computation of a level so it can be copied to the device at the beginning of the level. The result
vector can also be kept in the device for a signiﬁcant part of the execution, and only has to be copied to
the host during halo exchange. OpenMP 4.5 makes managing the data according to the aforementioned
observations almost trivial: a single directive su ces to set the scope - see Listing 2. Each time halos
All arrays reside in the device in this scope!
The programming model manages host/device pointers mapping for you!

13
• OpenMP 4.5 target regions – Code Saturne 5.0+ snippet – Dot products
6 vx[ii] += (alpha * dk[ii]);
7 rk[ii] += (alpha * zk[ii]);
8 }
9
10 /* ... */
11 }
12
13 /* ... */
14
15 static void _cs_dot_xx_xy_superblock (cs_lnum_t n,
16 const cs_real_t *restrict x,
17 const cs_real_t *restrict y,
18 double *xx ,
19 double *xy)
20 {
21 double dot_xx = 0.0, dot_xy = 0.0;
22
23 # pragma omp target teams distribute parallel for reduction (+: dot_xx , dot_xy)
24 if ( n > GPU_THRESHOLD )
25 map(to:x[:n],y[:n])
26 map(tofrom:dot_xx , dot_xy)
27 for (cs_lnum_t i = 0; i < n; ++i) {
28 const double tx = x[i];
29 const double ty = y[i];
30 dot_xx += tx*tx;
31 dot_xy += tx*ty;
32 }
33
34 /* ... */
35
36 *xx = dot_xx;
37 *xy = dot_xy;
38 }
Listing 3: Example of GPU port for two stream kernels: vector multiply-and-add and dot product .
…
Host
… … CUDA blocks
OpenMP team
Allocate data
in the the
device.
Host
Release data
in the the
device.
OpenMP
runtime library
OpenMP
runtime library

14
• OpenMP 4.5 – Code Saturne 5.0+ – AMG NVPROF timeline
AMG
cycle
AMG
coarse
grid
detail
Allocations of
small
variables
High kernel
launch latency
Back-to-back
kernels

15
CUDA-based tuning
• Avoid expensive GPU memory allocation/deallocation:
– Allocate a memory chunk once and reuse it
• Use pinned memory for data copied frequently to the
GPU
– Avoid pageable-pinned memory copies by the CUDA
implementation
• Explore asynchronous execution of CUDA API calls
– Start copying data to/from the device while the host is preparing
the next set of data or the next kernel
• Use CUDA constant memory to copy arguments for
multiple kernels at once.
– The latency of copying tens of KB to the GPU is similar to
copy 1B
– Dual-buffering enable copies to happen asynchronously
• Produce specialized kernels instead of relying on runtime
checks.
– CUDA is a C++ extension and therefore kernels and device
functions can be templated.
– Leverage compile-time optimizations for the relevant
sequences of kernels.
– NVCC toolchain does very aggressive inlining.
– Lower register pressure = more occupancy.
IBM Conﬁdential GPU acceleration in Cod
1 template < KernelKinds Kind >
2 __device__ int any_kernel ( KernelArgsBase &Arg , unsigned n_rows_per_block ) {
3 switch(Kind) {
4 /* ... */
5 // Dot product:
6 //
7 case DP_xx:
8 dot_product <Kind >(
9 /* version */ Arg.getArg <cs_lnum_t >(0),
10 /* n_rows */ Arg.getArg <cs_lnum_t >(1),
11 /* x */ Arg.getArg <cs_real_t * >(2),
12 /* y */ nullptr ,
13 /* z */ nullptr ,
14 /* res */ Arg.getArg <cs_real_t * >(3),
15 /* n_rows_per_block */ n_rows_per_block );
16 break;
17 /* ... */
18 }
19 __syncthreads ();
20 return 0;
21 }
22
23 template < KernelKinds ... Kinds >
24 __global__ void any_kernels (void) {
25
26 auto *KA = reinterpret_cast < KernelArgsSeries *>(& KernelArgsSeriesGPU [0]);
27 const unsigned n_rows_per_block = KA -> RowsPerBlock ;
28 unsigned idx = 0;
29
30 int dummy [] = { any_kernel <Kinds >(KA ->Args[idx ++], n_rows_per_block )... };
31 (void) dummy;
32 }
Listing 10: Device entry-point function for kernel execution.

16
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – Results for a single rank
– IBM Minsky server - Lid driven cavity flow – 1.5M-cell grid
57.21
43.83
34.77
30.28
49.86
37.37
29.67
25.63
11.87 10.83 9.55 9.32
4.41 4.34 4.40 4.63
0
10
20
30
40
50
60
70
1 2 4 8
Time (seconds)
OpenMP threads
Wall time CPU Solver time CPU Wall time CPU+GPU Solver time CPU+GPU
4.82
4.05
3.64
3.25
11.29
8.60
6.74
5.53
0
2
4
6
8
10
12
1 2 4 8
GPU speedup (1x)
OpenMP threads
Wall time Solvers time
Execution time Speed up

17
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank
Gauss-Seidel
AMG
fine grid

18
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank (cont.)
AMG
coarse grid

19
MPI and GPU acceleration
• Different processes (MPI ranks) will use different CUDA contexts.
• CUDA implementation serializes CUDA contexts by default.
• NVIDIA Multi-Process Service (MPS) provides context switching capabilities so that multiple processes can use the
same GPU.
MPS server
instance
GPU driver
Define
Visible GPU
Start MPS
server
Execute
application
Terminate
MPS server
Define
Visible GPU
Execute
application
Define
Visible GPU
Execute
application
Rank 0
Rank 1
Rank 2

20
• CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU)
– IBM Minsky server - Lid driven cavity flow – 111M-cell grid
Gauss-Seidel
Hiding data
movement
latencies

21
• CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU – cont.)
– IBM Minsky server - Lid driven cavity flow – 111M-cell grid
AMG
coarse grid

22
• CUDA – Code Saturne 5.0+ – Results for multiple ranks (5 ranks per GPU – cont.)
– IBM Minsky server - Lid driven cavity flow – 111M-cell grid – CPU+GPU efficiency 65% @32 nodes
2.39 2.42
2.32
2.22
2.08
2.00
2.53
2.57
2.45
2.37
2.20
2.10
1.5
1.7
1.9
2.1
2.3
2.5
2.7
1 2 4 8 16 32
Speedup over CPU-only (1x)
Nodes
Wall time Solvers time
717.6
369.9
187.4
100.2
54.9
28.9
693.9
358.0
181.5
97.2
53.4
28.1
300.4
153.1
80.8
45.2
26.4
14.4
274.7
139.6
74.0
41.1
24.3
13.410.0
100.0
1000.0
1 2 4 8 16 32
Time (seconds)
Nodes
CPU wall time CPU solvers time CPU+GPU wall time CPU+GPU solvers time

23
POWER8 to POWER9
• A code performing well on POWER8 + P100 GPUs should perform well on POWER9 + V100 GPUs
– No major code refactoring needed.
– More powerful CPUs, GPUs and interconnect.
• Some differences to consider:
– Core vs Pairs of Cores
• POWER9 L3 cache and store queue is shared for each pair of cores
• SMT4 per core or SMT8 per pair-of-cores
– V100 (Volta) drops lock-step execution per warp-threads
• One program-counter per thread
• If code assumes lock-step execution explicit barriers have to be inserted
• No guarantee threads will converge after divergence within a warp
• One has to leverage cooperative groups and thread activity masks
NVLINKTM
NVLINKTM
NVLINKTM
ORNL Summit Socket
(2 sockets per node)
for (cs_lnum_t ii = StartRow; ii < EndRow; ii += bdimy) {
// Depending on the number of rows - warps may diverge here
unsigned AM = __activemask();
…
for (cs_lnum_t kk = 1 ; kk < bdimx ; kk *= 2)
sii += __shfl_down_sync(AM, sii,kk, bdimx);
…
}

24
POWER8 to POWER9
• CUDA – Code Saturne 5.0+ – Results for multiple ranks (3 ranks per GPU) – 6 GPUs per node
– IBM Power 9 and NVLINK 2.0 (Summit) - Lid driven cavity flow – 889M-cell grid – CPU+GPU efficiency 76% @512 nodes
2.31
2.90
2.34
1.5
1.7
1.9
2.1
2.3
2.5
2.7
2.9
3.1
64 256 512
Speedup over CPU-only (1x)
Nodes
Wall time
74.73
21.04
11.16
32.3
7.25
4.76
1.0
10.0
100.0
64 256 512
Time (seconds)
Nodes
CPU wall time CPU+GPU wall time
POWER9 vs POWER8: Better efficiency when scaling to 16x more nodes for 8x larger problem

25
CFD and AI
• Cognitive Enhanced Design
– Design / prototype new pieces of equipment is
expensive (time and finance)
• Parameter sweeps need several expensive simulations
• Want to make decisions faster
• Make decisions on more complex problems
– Use cognitive techniques (e.g. Bayesian neural
networks) to generate a model based on a
parameterized space to relate design parameters to
performance. Use this in Bayesian optimization to
improve design
• Converge to optimal parameters more quickly
– Example: airfoil optimization: Lift/Drag maximization
• Adaptive Expected Improvement (EI) converges faster and
with less variance.
Work package ML1 Cognitive Enhanced Design
§ Problem: Design / prototyping of new pieces
of equipment can be expensive (time and
finance). Want to do more work in silico, and
also use an ‘intelligent’ design process
§ Solution: Use cognitive techniques (e.g.
Bayesian neural networks) to generate a
model based on a parameterized space to relate
design parameters to performance. Use this in
Bayesian optimization to improve design.
a

26
CFD and AI
• Enhanced 3D-Feature Detection
– Typical bottleneck of the design process is the
simulation-led workflow analysis
• Extract feature like:
– flow
– separation
– swirl
– layering
– Extend AI techniques to automatically extract features
in 3D
• Remove analysis bottlenecks
• Semantic querying of simulation data
• Contextual event classification
• Computational steering for rare-event simulation
– Example: Racing car vortex detection
• AI-enabled feature detection
§ Problem: One typical bottleneck in the
simulation-led workflow, is the analysis of
the output produced by the simulation itself
– especially the identification of features
(e.g. for flow; separation, swirl, layering).
§ Solution: Extend deep-feature detection to
3-dimensional problems to remove this
bottleneck from the design workflow
§ Additional extensions planned for the
semantic querying of simulation data, contextual
event classification, and computational steering for
rare-event simulation
Workpackage ML3 Enhanced 3D-Feature Detection

27
CFD and AI
• Enhanced 3D-Feature Detection
– Typical bottleneck of the design process is the
simulation-led workflow analysis
• Extract feature like:
– flow
– separation
– swirl
– layering
– Extend AI techniques to automatically extract features
in 3D
• Remove analysis bottlenecks
• Semantic querying of simulation data
• Contextual event classification
• Computational steering for rare-event simulation
– Example: Racing car vortex detection
• AI-enabled feature detection

28
Questions?
samuel.antao@ibm.com

CFD on Power

More Related Content

What's hot (20)

Similar to CFD on Power (20)

More from Ganesan Narayanasamy (20)

Recently uploaded (20)

CFD on Power