© 2018 IBM Corporation
Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms.
Task-based GPU acceleration in Computational
Fluid Dynamics with OpenMP 4.5 and CUDA in
OpenPOWER platforms.
OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain
June 2018
Samuel Antao
IBM Research, Daresbury, UK
IBM Research @ Daresbury, UK
IBM Research @ Daresbury, UK – STFC Partnership Mission
• 2015: £313 million investment over the next 5 years
• Agreement for IBM Collaborative Research and Development (R&D) that established an IBM Research presence in the UK
• Product and Services Agreement with IBM UK and Ireland
• Access to the latest data-centric and cognitive computing technologies, including IBM's world-class Watson cognitive computing platform
• Joint commercialization of intellectual property assets produced in the partnership
Help UK industries and institutions bring cutting-edge computational science, engineering and applicable technologies, such as data-centric cognitive computing, to boost growth and development of the UK economy
IBM Research @ Daresbury, UK – People
Over 26 computational
scientists and engineers
IBM Research @ Daresbury, UK – Research areas
• Case studies:
– Smart Crop Protection - Precision Agriculture (Data science + Life sciences)
– Improving disease diagnostics and personalised treatments (Life sciences + Machine learning)
– Cognitive treatment plant (Engineering + Machine learning)
– Parameterisation of engineering models (Engineering + Machine learning)
CFD and Algebraic-Multigrid
• Solve a set of partial-differential equations over several time steps
– Discretization:
• Unstructured vs Structured
– Equations:
• Velocity
• Pressure
• Turbulence
• Iterative solvers
– Jacobi
– Gauss-Seidel
– Conjugate Gradient
• Multigrid approaches
– Solve the problem at different resolutions
• Coarse and fine grids/meshes
– Fewer iterations for fine grids
• Algebraic multigrid (AMG) – encode mesh information in algebraic format
– Sparse matrices
source: http://web.utk.edu/~wfeng1/research.html
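Since AMG encodes the mesh in algebraic form, the workhorse operation becomes a sparse matrix-vector product. A minimal CSR (compressed sparse row) version in C might look as follows; this is a generic sketch for illustration, not Code Saturne's implementation, and all names are made up:

```c
#include <assert.h>

/* y = A*x for a sparse matrix A in CSR format:
   row_ptr[i]..row_ptr[i+1] delimit the non-zeros of row i,
   col_idx holds their column indices, val their values. */
static void csr_matvec(int n_rows, const int *row_ptr, const int *col_idx,
                       const double *val, const double *x, double *y)
{
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; ++jj)
            sum += val[jj] * x[col_idx[jj]];
        y[i] = sum;
    }
}
```

The MSR variant, which also shows up in the profile on a later slide, differs mainly in storing the diagonal entries in a separate array.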
CFD and Algebraic-Multigrid
[Figure: cluster topology – GPUs linked by NVLink™ within each node, nodes connected via InfiniBand™, one color per MPI rank; mesh image source: http://web.utk.edu/~wfeng1/research.html]
• Grid partitioned by MPI ranks
• Ranks distributed by nodes
• More than one rank executing in one node
• Challenges:
– Different grids have different compute needs
– Access strides vary, unstructured data accesses
– CPU-GPU data movements
– Regular communication between ranks
• Halo elements
• Residuals
• Synchronizations
CFD and Algebraic-Multigrid – Code Saturne
• Open-source – developed and maintained by EDF
• 350K lines of code:
– 50% C – 37% Fortran – 13% Python
• Rich ecosystem to configure/parameterise simulations, generate meshes
• History of good scalability
Scalability on MIRA (BGQ), 105B and 13B cell meshes:

Cores      Time in Solver  Efficiency
262,144    789.79 s        -
524,288    403.18 s        97%

MPI Tasks  Time in Solver  Efficiency
524,288    70.114 s        -
1,048,576  52.574 s        66%
1,572,864  45.731 s        76%
CFD and Algebraic-Multigrid – Execution time distribution
• Many components (kernels) contribute to total execution time
• There are data dependencies between consecutive kernels
• There are opportunities to keep data in the device between kernels
• Some kernels may have lower compute intensity; it could still be worthwhile computing them on the GPU if the data is already there
[Chart: single-thread profiling, Code Saturne 5.0+ – main contributions to execution time: Gauss-Seidel solver (Velocity); Pressure (AMG), comprising matrix-vector mult. MSR, matrix-vector mult. CSR, dot products, multigrid setup, compute coarse cells from fine cells, and other AMG-related work; Other]
Directive-based programming models
• Porting existing code to accelerators is time consuming…
• The sooner you have code running on the GPU, the sooner you can start …
– … learning where overheads are
– … identifying what data patterns are being used
– … spotting kernels performing poorly
– … making decisions on what strategies can be used to improve performance
• Directive-based programming models can get you started much quicker
– No need to worry about device memory allocation and data pointers
– Implementation defaults already exploit device features
– Easily create data environments where data resides in the GPU
– Improve your code portability
• Clang C/C++ and the IBM XL C/C++/Fortran compilers provide OpenMP 4.5 support
• The PGI C/C++/Fortran compilers provide OpenACC support
• Can be complemented with existing GPU accelerated libraries
– cuSPARSE
– AMGx
Directive-based programming models
• OpenMP 4.5 data environments – Code Saturne 5.0+ snippet – Conjugate Gradient
static cs_sles_convergence_state_t _conjugate_gradient (/* ... */)
{
  /* Move the result vector to the device and copy it back at the end of
     the scope; move the right-hand side vector to the device; allocate
     all auxiliary vectors in the device. */
  #pragma omp target data if(n_rows > GPU_THRESHOLD) \
      map(tofrom: vx[:vx_size]) \
      map(to: rhs[:n_rows]) \
      map(alloc: _aux_vectors[:tmp_size])
  {
    /* Solver code */
  }
}
Listing 2: OpenMP 4.5 data environment for a level of the AMG solver.
The right-hand side does not change during the computation of a level, so it can be copied to the device at the beginning of the level. The result vector can also be kept in the device for a significant part of the execution, and only has to be copied to the host during halo exchange. OpenMP 4.5 makes managing the data according to these observations almost trivial: a single directive suffices to set the scope - see Listing 2.
All arrays reside in the device in this scope!
The programming model manages host/device pointer mapping for you!
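To make the role of the data environment concrete, here is a minimal unpreconditioned conjugate-gradient loop written in the same spirit. This is a simplified sketch, not Code Saturne code: the GPU_THRESHOLD constant and all vector names are illustrative, and in a real port the loops inside the region would themselves be target regions. The `target data` directive keeps rhs, vx and the auxiliary vectors resident on the device across all iterations; without offloading support the pragma is simply ignored and the solver runs on the host.

```c
#include <assert.h>
#include <math.h>

#define GPU_THRESHOLD 10000  /* illustrative: only offload large systems */

/* Minimal unpreconditioned CG for the tridiagonal system
   A = tridiag(-1, 2, -1).  The target data region mirrors Listing 2:
   vx, rhs and the auxiliary vectors stay resident on the device for
   the whole solve. */
static void conjugate_gradient(int n, const double *rhs, double *vx,
                               int max_it)
{
    double r[n], p[n], q[n];  /* auxiliary vectors (C99 VLAs) */

    #pragma omp target data if(n > GPU_THRESHOLD) \
        map(tofrom: vx[:n]) map(to: rhs[:n]) \
        map(alloc: r[:n], p[:n], q[:n])
    {
        double rr = 0.0;
        for (int i = 0; i < n; ++i) {
            vx[i] = 0.0;
            r[i] = p[i] = rhs[i];
            rr += r[i] * r[i];
        }
        for (int it = 0; it < max_it && rr > 1e-24; ++it) {
            double pq = 0.0;
            for (int i = 0; i < n; ++i) {          /* q = A*p */
                q[i] = 2.0 * p[i];
                if (i > 0)     q[i] -= p[i - 1];
                if (i < n - 1) q[i] -= p[i + 1];
                pq += p[i] * q[i];
            }
            const double alpha = rr / pq;
            double rr_new = 0.0;
            for (int i = 0; i < n; ++i) {
                vx[i] += alpha * p[i];
                r[i]  -= alpha * q[i];
                rr_new += r[i] * r[i];
            }
            const double beta = rr_new / rr;
            rr = rr_new;
            for (int i = 0; i < n; ++i)
                p[i] = r[i] + beta * p[i];
        }
    }
}
```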
Directive-based programming models
• OpenMP 4.5 target regions – Code Saturne 5.0+ snippet – Dot products
  vx[ii] += (alpha * dk[ii]);
  rk[ii] += (alpha * zk[ii]);
}

/* ... */

static void _cs_dot_xx_xy_superblock(cs_lnum_t n,
                                     const cs_real_t *restrict x,
                                     const cs_real_t *restrict y,
                                     double *xx,
                                     double *xy)
{
  double dot_xx = 0.0, dot_xy = 0.0;

  #pragma omp target teams distribute parallel for \
      reduction(+: dot_xx, dot_xy) \
      if(n > GPU_THRESHOLD) \
      map(to: x[:n], y[:n]) \
      map(tofrom: dot_xx, dot_xy)
  for (cs_lnum_t i = 0; i < n; ++i) {
    const double tx = x[i];
    const double ty = y[i];
    dot_xx += tx*tx;
    dot_xy += tx*ty;
  }

  /* ... */

  *xx = dot_xx;
  *xy = dot_xy;
}
Listing 3: Example of GPU port for two stream kernels: vector multiply-and-add and dot product.
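The "superblock" in the kernel name refers to blocked summation: accumulating fixed-size blocks separately and then combining the partials, which keeps rounding error in long floating-point reductions under control. A minimal host-side illustration of the idea follows; this is a generic sketch, the block size and names are illustrative, and Code Saturne's actual scheme is more elaborate:

```c
#include <assert.h>

#define BLOCK 8  /* illustrative block size */

/* Sum x[0..n) by accumulating each block into its own partial sum and
   then adding the partials, which keeps the accumulated terms of
   similar magnitude and reduces rounding error compared with one long
   running sum. */
static double blocked_sum(const double *x, int n)
{
    double total = 0.0;
    for (int start = 0; start < n; start += BLOCK) {
        const int end = (start + BLOCK < n) ? start + BLOCK : n;
        double partial = 0.0;
        for (int i = start; i < end; ++i)
            partial += x[i];
        total += partial;
    }
    return total;
}
```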
[Diagram: the OpenMP runtime library allocates the mapped data in the device, the target region's OpenMP team executes as CUDA blocks, and the runtime releases the device data when the region ends on the host.]
Directive-based programming models
• OpenMP 4.5 – Code Saturne 5.0+ – AMG NVPROF timeline
[NVPROF timeline annotations: AMG cycle; AMG coarse grid detail; allocations of small variables; high kernel launch latency; back-to-back kernels]
CUDA-based tuning
• Avoid expensive GPU memory allocation/deallocation:
– Allocate a memory chunk once and reuse it
• Use pinned memory for data copied frequently to the GPU
– Avoid the pageable-to-pinned staging copies done internally by the CUDA implementation
• Explore asynchronous execution of CUDA API calls
– Start copying data to/from the device while the host is preparing the next set of data or the next kernel
• Use CUDA constant memory to copy arguments for multiple kernels at once
– The latency of copying tens of KB to the GPU is similar to copying 1 B
– Dual-buffering enables copies to happen asynchronously
• Produce specialized kernels instead of relying on runtime checks
– CUDA is a C++ extension, and therefore kernels and device functions can be templated
– Leverage compile-time optimizations for the relevant sequences of kernels
– The NVCC toolchain does very aggressive inlining
– Lower register pressure = more occupancy
template <KernelKinds Kind>
__device__ int any_kernel(KernelArgsBase &Arg, unsigned n_rows_per_block) {
  switch (Kind) {
  /* ... */
  // Dot product:
  case DP_xx:
    dot_product<Kind>(
      /* version */          Arg.getArg<cs_lnum_t>(0),
      /* n_rows */           Arg.getArg<cs_lnum_t>(1),
      /* x */                Arg.getArg<cs_real_t *>(2),
      /* y */                nullptr,
      /* z */                nullptr,
      /* res */              Arg.getArg<cs_real_t *>(3),
      /* n_rows_per_block */ n_rows_per_block);
    break;
  /* ... */
  }
  __syncthreads();
  return 0;
}

template <KernelKinds... Kinds>
__global__ void any_kernels(void) {
  auto *KA = reinterpret_cast<KernelArgsSeries *>(&KernelArgsSeriesGPU[0]);
  const unsigned n_rows_per_block = KA->RowsPerBlock;
  unsigned idx = 0;

  int dummy[] = { any_kernel<Kinds>(KA->Args[idx++], n_rows_per_block)... };
  (void) dummy;
}
Listing 10: Device entry-point function for kernel execution.
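The first tuning point above (allocate a memory chunk once and reuse it) can be sketched independently of CUDA as a simple bump allocator over one pre-allocated chunk. This is only a host-side analogue of reserving a single device region and carving sub-buffers out of it; the structure and function names are illustrative, not from Code Saturne:

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

/* One chunk allocated up front; sub-buffers are carved out of it and
   the whole pool is recycled with a single reset instead of per-call
   free().  In a CUDA port, base would come from one cudaMalloc. */
typedef struct {
    char  *base;
    size_t size;
    size_t used;
} pool_t;

static int pool_init(pool_t *p, size_t size)
{
    p->base = malloc(size);
    p->size = size;
    p->used = 0;
    return p->base != NULL;
}

static void *pool_alloc(pool_t *p, size_t n)
{
    n = (n + 15) & ~(size_t)15;          /* keep 16-byte alignment */
    if (p->used + n > p->size)
        return NULL;                     /* chunk exhausted */
    void *ptr = p->base + p->used;
    p->used += n;
    return ptr;
}

static void pool_reset(pool_t *p) { p->used = 0; }  /* recycle, no free */
```

Each solver iteration can then draw its temporaries from the pool and reset it afterwards, so the expensive allocation happens exactly once.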
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – Results for a single rank
– IBM Minsky server - Lid driven cavity flow – 1.5M-cell grid
Execution time (seconds) and GPU speedup, by number of OpenMP threads:

OpenMP threads            1      2      4      8
Wall time CPU         57.21  43.83  34.77  30.28
Solver time CPU       49.86  37.37  29.67  25.63
Wall time CPU+GPU     11.87  10.83   9.55   9.32
Solver time CPU+GPU    4.41   4.34   4.40   4.63

GPU speedup over CPU-only (1x):
Wall time              4.82   4.05   3.64   3.25
Solvers time          11.29   8.60   6.74   5.53
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank
– IBM Minsky server - Lid driven cavity flow – 1.5M-cell grid
[NVPROF timeline annotations: Gauss-Seidel; AMG fine grid]
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank (cont.)
– IBM Minsky server - Lid driven cavity flow – 1.5M-cell grid
[NVPROF timeline annotation: AMG coarse grid]
MPI and GPU acceleration
• Different processes (MPI ranks) will use different CUDA contexts.
• CUDA implementation serializes CUDA contexts by default.
• NVIDIA Multi-Process Service (MPS) provides context switching capabilities so that multiple processes can use the same GPU.
[Diagram: ranks sharing a GPU through an MPS server instance on top of the GPU driver –
Rank 0: define visible GPU → start MPS server → execute application → terminate MPS server
Rank 1: define visible GPU → execute application
Rank 2: define visible GPU → execute application]
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU)
– IBM Minsky server - Lid driven cavity flow – 111M-cell grid
[NVPROF timeline annotations: Gauss-Seidel; hiding data movement latencies]
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU – cont.)
– IBM Minsky server - Lid driven cavity flow – 111M-cell grid
[NVPROF timeline annotation: AMG coarse grid]
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+ – Results for multiple ranks (5 ranks per GPU – cont.)
– IBM Minsky server - Lid driven cavity flow – 111M-cell grid – CPU+GPU efficiency 65% @32 nodes
Execution time (seconds) and speedup over CPU-only, by number of nodes:

Nodes                     1      2      4      8     16     32
CPU wall time         717.6  369.9  187.4  100.2   54.9   28.9
CPU solvers time      693.9  358.0  181.5   97.2   53.4   28.1
CPU+GPU wall time     300.4  153.1   80.8   45.2   26.4   14.4
CPU+GPU solvers time  274.7  139.6   74.0   41.1   24.3   13.4

Speedup over CPU-only (1x):
Wall time              2.39   2.42   2.32   2.22   2.08   2.00
Solvers time           2.53   2.57   2.45   2.37   2.20   2.10
POWER8 to POWER9
• A code performing well on POWER8 + P100 GPUs should perform well on POWER9 + V100 GPUs
– No major code refactoring needed.
– More powerful CPUs, GPUs and interconnect.
• Some differences to consider:
– Core vs pairs of cores
• The POWER9 L3 cache and store queue are shared by each pair of cores
• SMT4 per core or SMT8 per pair-of-cores
– V100 (Volta) drops lock-step execution of the threads in a warp
• One program counter per thread
• If code assumes lock-step execution, explicit barriers have to be inserted
• No guarantee threads will converge after divergence within a warp
• One has to leverage cooperative groups and thread activity masks
[Figure: ORNL Summit socket (2 sockets per node) – CPU and GPUs connected via NVLink™]
for (cs_lnum_t ii = StartRow; ii < EndRow; ii += bdimy) {
  // Depending on the number of rows - warps may diverge here
  unsigned AM = __activemask();
  …
  for (cs_lnum_t kk = 1; kk < bdimx; kk *= 2)
    sii += __shfl_down_sync(AM, sii, kk, bdimx);
  …
}
POWER8 to POWER9
• CUDA – Code Saturne 5.0+ – Results for multiple ranks (3 ranks per GPU) – 6 GPUs per node
– IBM POWER9 with NVLink 2.0 (Summit) - Lid driven cavity flow – 889M-cell grid – CPU+GPU efficiency 76% @512 nodes
Execution time (seconds) and speedup over CPU-only, by number of nodes:

Nodes                  64    256    512
CPU wall time       74.73  21.04  11.16
CPU+GPU wall time    32.3   7.25   4.76
Wall time speedup    2.31   2.90   2.34
POWER9 vs POWER8: Better efficiency when scaling to 16x more nodes for 8x larger problem
CFD and AI
• Cognitive Enhanced Design
– Designing / prototyping new pieces of equipment is expensive (in time and money)
• Parameter sweeps need several expensive simulations
• Want to make decisions faster
• Make decisions on more complex problems
– Use cognitive techniques (e.g. Bayesian neural networks) to generate a model over a parameterized space that relates design parameters to performance, and use this in Bayesian optimization to improve the design
• Converge to optimal parameters more quickly
– Example: airfoil optimization - lift/drag maximization
• Adaptive Expected Improvement (EI) converges faster and with less variance
CFD and AI
• Enhanced 3D-Feature Detection
– A typical bottleneck of the design process is the analysis of the output produced by the simulation-led workflow
• Extract features like:
– flow
– separation
– swirl
– layering
– Extend AI techniques to automatically extract features in 3D
• Remove analysis bottlenecks
• Semantic querying of simulation data
• Contextual event classification
• Computational steering for rare-event simulation
– Example: racing car vortex detection
• AI-enabled feature detection
Questions?
samuel.antao@ibm.com
More Related Content

PDF
BSC LMS DDL
PPTX
2018 bsc power9 and power ai
PDF
SNAP MACHINE LEARNING
PPTX
AI OpenPOWER Academia Discussion Group
PPT
OpenPOWER Webinar
PDF
TAU E4S ON OpenPOWER /POWER9 platform
PDF
Covid-19 Response Capability with Power Systems
PDF
MIT's experience on OpenPOWER/POWER 9 platform
BSC LMS DDL
2018 bsc power9 and power ai
SNAP MACHINE LEARNING
AI OpenPOWER Academia Discussion Group
OpenPOWER Webinar
TAU E4S ON OpenPOWER /POWER9 platform
Covid-19 Response Capability with Power Systems
MIT's experience on OpenPOWER/POWER 9 platform

What's hot (20)

PDF
OpenPOWER/POWER9 AI webinar
PDF
OpenPOWER/POWER9 Webinar from MIT and IBM
PPTX
WML OpenPOWER presentation
PDF
Ac922 cdac webinar
PDF
IBM HPC Transformation with AI
PDF
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
PDF
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
PDF
Deeplearningusingcloudpakfordata
PDF
Summit workshop thompto
PDF
IBM BOA for POWER
PPTX
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
PDF
JMI Techtalk: 한재근 - How to use GPU for developing AI
PPTX
PowerAI Deep dive
PDF
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
PDF
OpenPOWER Latest Updates
PPTX
Large Model support and Distribute deep learning
PPTX
A Primer on FPGAs - Field Programmable Gate Arrays
PDF
Transparent Hardware Acceleration for Deep Learning
PDF
AMD It's Time to ROC
PDF
Heterogeneous Computing : The Future of Systems
OpenPOWER/POWER9 AI webinar
OpenPOWER/POWER9 Webinar from MIT and IBM
WML OpenPOWER presentation
Ac922 cdac webinar
IBM HPC Transformation with AI
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Deeplearningusingcloudpakfordata
Summit workshop thompto
IBM BOA for POWER
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
JMI Techtalk: 한재근 - How to use GPU for developing AI
PowerAI Deep dive
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
OpenPOWER Latest Updates
Large Model support and Distribute deep learning
A Primer on FPGAs - Field Programmable Gate Arrays
Transparent Hardware Acceleration for Deep Learning
AMD It's Time to ROC
Heterogeneous Computing : The Future of Systems
Ad

Similar to CFD on Power (20)

PDF
PDF
Hardware & Software Platforms for HPC, AI and ML
PDF
NASA Advanced Supercomputing (NAS) Division - Programming and Building HPC Ap...
PPTX
OpenACC Monthly Highlights: October2020
PDF
Utilizing AMD GPUs: Tuning, programming models, and roadmap
PDF
GTC 2022 Keynote
PDF
Using GPUs to Handle Big Data with Java
PDF
Newbie’s guide to_the_gpgpu_universe
PDF
LCU13: GPGPU on ARM Experience Report
PPTX
byteLAKE's Alveo FPGA Solutions
PDF
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
PDF
Using GPUs to handle Big Data with Java by Adam Roberts.
PPTX
OpenACC Monthly Highlights: June 2021
PDF
byteLAKE's expertise across NVIDIA architectures and configurations
PDF
Cuda Without a Phd - A practical guick start
PDF
Raul sena - Apresentação Analiticsemtudo - Scientific Applications using GPU
PDF
Accelerating S3D A GPGPU Case Study
PDF
HPC Infrastructure To Solve The CFD Grand Challenge
PDF
Advances in GPU Computing
PPTX
OpenACC Monthly Highlights: January 2021
Hardware & Software Platforms for HPC, AI and ML
NASA Advanced Supercomputing (NAS) Division - Programming and Building HPC Ap...
OpenACC Monthly Highlights: October2020
Utilizing AMD GPUs: Tuning, programming models, and roadmap
GTC 2022 Keynote
Using GPUs to Handle Big Data with Java
Newbie’s guide to_the_gpgpu_universe
LCU13: GPGPU on ARM Experience Report
byteLAKE's Alveo FPGA Solutions
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
Using GPUs to handle Big Data with Java by Adam Roberts.
OpenACC Monthly Highlights: June 2021
byteLAKE's expertise across NVIDIA architectures and configurations
Cuda Without a Phd - A practical guick start
Raul sena - Apresentação Analiticsemtudo - Scientific Applications using GPU
Accelerating S3D A GPGPU Case Study
HPC Infrastructure To Solve The CFD Grand Challenge
Advances in GPU Computing
OpenACC Monthly Highlights: January 2021
Ad

More from Ganesan Narayanasamy (20)

PDF
Empowering Engineering Faculties: Bridging the Gap with Emerging Technologies
PDF
Chip Design Curriculum development Residency program
PDF
Basics of Digital Design and Verilog
PDF
180 nm Tape out experience using Open POWER ISA
PDF
Workload Transformation and Innovations in POWER Architecture
PDF
OpenPOWER Workshop at IIT Roorkee
PDF
Deep Learning Use Cases using OpenPOWER systems
PDF
OpenPOWER System Marconi100
PDF
POWER10 innovations for HPC
PDF
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
PDF
AI in healthcare - Use Cases
PDF
AI in Health Care using IBM Systems/OpenPOWER systems
PDF
AI in Healh Care using IBM POWER systems
PDF
Poster from NUS
PDF
SAP HANA on POWER9 systems
PPTX
Graphical Structure Learning accelerated with POWER9
PDF
AI in the enterprise
PDF
Robustness in deep learning
PDF
Perspectives of Frond end Design
PDF
A2O Core implementation on FPGA
Empowering Engineering Faculties: Bridging the Gap with Emerging Technologies
Chip Design Curriculum development Residency program
Basics of Digital Design and Verilog
180 nm Tape out experience using Open POWER ISA
Workload Transformation and Innovations in POWER Architecture
OpenPOWER Workshop at IIT Roorkee
Deep Learning Use Cases using OpenPOWER systems
OpenPOWER System Marconi100
POWER10 innovations for HPC
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare - Use Cases
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Healh Care using IBM POWER systems
Poster from NUS
SAP HANA on POWER9 systems
Graphical Structure Learning accelerated with POWER9
AI in the enterprise
Robustness in deep learning
Perspectives of Frond end Design
A2O Core implementation on FPGA

Recently uploaded (20)

PDF
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
PDF
The-Future-of-Automotive-Quality-is-Here-AI-Driven-Engineering.pdf
PPTX
Build Your First AI Agent with UiPath.pptx
PDF
sustainability-14-14877-v2.pddhzftheheeeee
PPTX
AI IN MARKETING- PRESENTED BY ANWAR KABIR 1st June 2025.pptx
PDF
NewMind AI Weekly Chronicles – August ’25 Week IV
PDF
sbt 2.0: go big (Scala Days 2025 edition)
PPTX
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
PDF
Transform-Your-Factory-with-AI-Driven-Quality-Engineering.pdf
PDF
5-Ways-AI-is-Revolutionizing-Telecom-Quality-Engineering.pdf
PDF
giants, standing on the shoulders of - by Daniel Stenberg
PDF
Enhancing plagiarism detection using data pre-processing and machine learning...
PPTX
Module 1 Introduction to Web Programming .pptx
PDF
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
PDF
“A New Era of 3D Sensing: Transforming Industries and Creating Opportunities,...
DOCX
Basics of Cloud Computing - Cloud Ecosystem
PDF
Early detection and classification of bone marrow changes in lumbar vertebrae...
PPT
Galois Field Theory of Risk: A Perspective, Protocol, and Mathematical Backgr...
DOCX
search engine optimization ppt fir known well about this
PDF
Rapid Prototyping: A lecture on prototyping techniques for interface design
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
The-Future-of-Automotive-Quality-is-Here-AI-Driven-Engineering.pdf
Build Your First AI Agent with UiPath.pptx
sustainability-14-14877-v2.pddhzftheheeeee
AI IN MARKETING- PRESENTED BY ANWAR KABIR 1st June 2025.pptx
NewMind AI Weekly Chronicles – August ’25 Week IV
sbt 2.0: go big (Scala Days 2025 edition)
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
Transform-Your-Factory-with-AI-Driven-Quality-Engineering.pdf
5-Ways-AI-is-Revolutionizing-Telecom-Quality-Engineering.pdf
giants, standing on the shoulders of - by Daniel Stenberg
Enhancing plagiarism detection using data pre-processing and machine learning...
Module 1 Introduction to Web Programming .pptx
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
“A New Era of 3D Sensing: Transforming Industries and Creating Opportunities,...
Basics of Cloud Computing - Cloud Ecosystem
Early detection and classification of bone marrow changes in lumbar vertebrae...
Galois Field Theory of Risk: A Perspective, Protocol, and Mathematical Backgr...
search engine optimization ppt fir known well about this
Rapid Prototyping: A lecture on prototyping techniques for interface design

CFD on Power

  • 1. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 1 Task-based GPU acceleration in Computational Fluid Dynamics with OpenMP 4.5 and CUDA in OpenPOWER platforms. OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain June 2018 Samuel Antao IBM Research, Daresbury, UK
  • 2. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 2 IBM Research @ Daresbury, UK
  • 3. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 3 IBM Research @ Daresbury, UK – STFC Partnership Mission • 2015: £313 million investment over next 5 years • Agreement for IBM Collaborative Research and Development (R&D) that established IBM Research presence in the UK • Product and Services Agreement with IBM UK and Ireland • Access to the latest data-centric and cognitive computing technologies, including IBMs world-class Watson cognitive computing platform • Joint commercialization of intellectual property assets produced in the partnership Help the UK industries and institutions bringing cutting-edge computational science, engineering and applicable technologies, such as data-centric cognitive computing, to boost growth and development of the UK economy
  • 4. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 4 IBM Research @ Daresbury, UK – People 7 Over 26 computational scientists and engineers
  • 5. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 5 IBM Research @ Daresbury, UK – Research areas • Case studies: – Smart Crop Protection - Precision Agriculture • Data science + Life sciences – Improving disease diagnostics and personalised treatments • Life sciences + Machine learning – Cognitive treatment plant • Engineering + Machine learning – Parameterisation of engineering models • Engineering + Machine learning
  • 6. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 6 Task-based GPU acceleration in Computational Fluid Dynamics with OpenMP 4.5 and CUDA in OpenPOWER platforms. OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain June 2018 Samuel Antao IBM Research, Daresbury, UK
  • 7. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 7 CFD and Algebraic-Multigrid • Solve set of partial-differential equations over several time steps – Discretization: • Unstructured vs Structured – Equations: • Velocity • Pressure • Turbulence • Iterative solvers – Jacobi – Gauss-Seidel – Conjugate Gradient • Multigrid approaches – Solve the problem at different resolutions • Coarse and fine grids/meshes – Less Iterations for fine grids • Algebraic multigrid (AMG) – encode mesh information in algebraic format – Sparse matrices. source: http://web.utk.edu/~wfeng1/research.html
  • 8. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 8 CFD and Algebraic-Multigrid source: http://web.utk.edu/~wfeng1/research.html NVLINKTM NVLINKTM NVLINKTM NVLINKTM InfiniBandTM MPI rank • Grid partitioned by MPI ranks • Ranks distributed by nodes • More than one rank executing in one node • Challenges: – Different grids have different compute needs – Access strides vary, unstructured data accesses. – CPU-GPU data movements – Regular communication between ranks • Halo elements • Residuals • Synchronizations
  • 9. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 9 CFD and Algebraic-Multigrid – Code Saturne • Open-source – developed and maintained by EDF • 350K lines of code: – 50% C – 37% Fortran – 13% Python • Rich ecosystem to configure/parameterise simulations, generate meshes • History of good scalability Cores Time in Solver Efficiency 262,144 789.79 s - 524,288 403.18 s 97% MPI Tasks Time in Solver Efficiency 524,288 70.114 s - 1,048,576 52.574 s 66% 1,572,864 45.731 s 76% 105B Cell Mesh (MIRA, BGQ) 13B Cell Mesh (MIRA, BGQ)
  • 10. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 10 CFD and Algebraic-Multigrid – Execution time distribution • Many components (kernels) contribute to total execution time • There are data dependencies between consecutive kernels • There are opportunities to keep data in the device between kernels • Some kernels may have lower compute intensity, it could still be worthwhile computing them in the GPU if the data is already there Gauss-Seidel solver (Velocity) Other Matrix-vector mult. MSR Matrix-vector mult. CSR Dot products Multigrid setup Compute coarse cells from fine cells Other AMG-related Pressure (AMG) Single thread profiling - Code Saturne 5.0+
Directive-based programming models
• Porting existing code to accelerators is time consuming…
• The sooner the code runs on the GPU, the sooner you can start…
  – …learning where the overheads are
  – …identifying what data patterns are being used
  – …spotting kernels performing poorly
  – …making decisions on what strategies can be used to improve performance
• Directive-based programming models get you started much quicker
  – No need to manage device memory allocation and data pointers by hand
  – Implementation defaults already exploit device features
  – Easily create data environments where data resides on the GPU
  – Improve your code portability
• Clang C/C++ and IBM XL C/C++/Fortran compilers provide OpenMP 4.5 support
• PGI C/C++/Fortran compilers provide OpenACC support
• Can be complemented with existing GPU-accelerated libraries: cuSparse, AMGx
Directive-based programming models
• OpenMP 4.5 data environments – Code Saturne 5.0+ snippet – Conjugate Gradient

  static cs_sles_convergence_state_t _conjugate_gradient(/* ... */)
  {
  #pragma omp target data if (n_rows > GPU_THRESHOLD)           \
      /* Move result vector to device and copy it back          \
         at the end of the scope */                             \
      map(tofrom: vx[:vx_size])                                 \
      /* Move right-hand side vector to the device */           \
      map(to: rhs[:n_rows])                                     \
      /* Allocate all auxiliary vectors in the device */        \
      map(alloc: _aux_vectors[:tmp_size])
    {
      /* Solver code */
    }
  }

  Listing 2: OpenMP 4.5 data environment for a level of the AMG solver.

• The right-hand side does not change during the computation of a level, so it can be copied to the device at the beginning of the level. The result vector can also be kept in the device for a significant part of the execution, and only has to be copied to the host during halo exchange.
• OpenMP 4.5 makes managing the data according to these observations almost trivial: a single directive suffices to set the scope – see Listing 2.
• All arrays reside in the device in this scope!
• The programming model manages host/device pointer mapping for you!
Directive-based programming models
• OpenMP 4.5 target regions – Code Saturne 5.0+ snippet – Dot products

  /* vector multiply-and-add (fragment) */
    vx[ii] += (alpha * dk[ii]);
    rk[ii] += (alpha * zk[ii]);

  static void _cs_dot_xx_xy_superblock(cs_lnum_t n,
                                       const cs_real_t *restrict x,
                                       const cs_real_t *restrict y,
                                       double *xx,
                                       double *xy)
  {
    double dot_xx = 0.0, dot_xy = 0.0;

  #pragma omp target teams distribute parallel for reduction(+: dot_xx, dot_xy) \
      if (n > GPU_THRESHOLD)                                                    \
      map(to: x[:n], y[:n])                                                     \
      map(tofrom: dot_xx, dot_xy)
    for (cs_lnum_t i = 0; i < n; ++i) {
      const double tx = x[i];
      const double ty = y[i];
      dot_xx += tx*tx;
      dot_xy += tx*ty;
    }

    /* ... */

    *xx = dot_xx;
    *xy = dot_xy;
  }

  Listing 3: Example of GPU port for two stream kernels: vector multiply-and-add and dot product.

• OpenMP teams map to CUDA blocks
• The OpenMP runtime library allocates data in the device on entry to the target region and releases it on exit
Directive-based programming models
• OpenMP 4.5 – Code Saturne 5.0+ – AMG NVPROF timeline
• Observations on the AMG cycle and coarse-grid detail:
  – Allocations of small variables
  – High kernel launch latency
  – Back-to-back kernels
CUDA-based tuning
• Avoid expensive GPU memory allocation/deallocation
  – Allocate a memory chunk once and reuse it
• Use pinned memory for data copied frequently to the GPU
  – Avoid pageable-to-pinned memory copies by the CUDA implementation
• Explore asynchronous execution of CUDA API calls
  – Start copying data to/from the device while the host prepares the next set of data or the next kernel
• Use CUDA constant memory to copy arguments for multiple kernels at once
  – The latency of copying tens of KB to the GPU is similar to copying 1 B
  – Double-buffering enables copies to happen asynchronously
• Produce specialized kernels instead of relying on runtime checks
  – CUDA is a C++ extension, so kernels and device functions can be templated
  – Leverage compile-time optimizations for the relevant sequences of kernels
  – The NVCC toolchain does very aggressive inlining
  – Lower register pressure = more occupancy

  template <KernelKinds Kind>
  __device__ int any_kernel(KernelArgsBase &Arg, unsigned n_rows_per_block) {
    switch (Kind) {
    /* ... */
    // Dot product:
    case DP_xx:
      dot_product<Kind>(
        /* version */          Arg.getArg<cs_lnum_t>(0),
        /* n_rows */           Arg.getArg<cs_lnum_t>(1),
        /* x */                Arg.getArg<cs_real_t *>(2),
        /* y */                nullptr,
        /* z */                nullptr,
        /* res */              Arg.getArg<cs_real_t *>(3),
        /* n_rows_per_block */ n_rows_per_block);
      break;
    /* ... */
    }
    __syncthreads();
    return 0;
  }

  template <KernelKinds... Kinds>
  __global__ void any_kernels(void) {
    auto *KA = reinterpret_cast<KernelArgsSeries *>(&KernelArgsSeriesGPU[0]);
    const unsigned n_rows_per_block = KA->RowsPerBlock;
    unsigned idx = 0;
    int dummy[] = { any_kernel<Kinds>(KA->Args[idx++], n_rows_per_block)... };
    (void) dummy;
  }

  Listing 10: Device entry-point function for kernel execution.
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – results for a single rank
  – IBM Minsky server – lid-driven cavity flow – 1.5M-cell grid

  Execution time (seconds):
  OpenMP threads          1      2      4      8
  Wall time CPU         57.21  43.83  34.77  30.28
  Solver time CPU       49.86  37.37  29.67  25.63
  Wall time CPU+GPU     11.87  10.83   9.55   9.32
  Solver time CPU+GPU    4.41   4.34   4.40   4.63

  GPU speedup over CPU-only (1x):
  OpenMP threads          1      2      4      8
  Wall time              4.82   4.05   3.64   3.25
  Solvers time          11.29   8.60   6.74   5.53
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank
  – IBM Minsky server – lid-driven cavity flow – 1.5M-cell grid
  – Timeline highlights: Gauss-Seidel, AMG fine grid
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank (cont.)
  – IBM Minsky server – lid-driven cavity flow – 1.5M-cell grid
  – Timeline highlights: AMG coarse grid
MPI and GPU acceleration
• Different processes (MPI ranks) use different CUDA contexts
• The CUDA implementation serializes CUDA contexts by default
• NVIDIA Multi-Process Service (MPS) provides context-switching capabilities so that multiple processes can share the same GPU
• Typical workflow: define the visible GPU and start the MPS server instance on top of the GPU driver; each rank then defines its visible GPU and executes the application; terminate the MPS server at the end
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU)
  – IBM Minsky server – lid-driven cavity flow – 111M-cell grid
  – Timeline highlights: Gauss-Seidel, hiding data-movement latencies
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU – cont.)
  – IBM Minsky server – lid-driven cavity flow – 111M-cell grid
  – Timeline highlights: AMG coarse grid
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+ – results for multiple ranks (5 ranks per GPU – cont.)
  – IBM Minsky server – lid-driven cavity flow – 111M-cell grid
  – CPU+GPU efficiency 65% @ 32 nodes

  Execution time (seconds):
  Nodes                    1      2      4      8     16     32
  CPU wall time          717.6  369.9  187.4  100.2   54.9   28.9
  CPU solvers time       693.9  358.0  181.5   97.2   53.4   28.1
  CPU+GPU wall time      300.4  153.1   80.8   45.2   26.4   14.4
  CPU+GPU solvers time   274.7  139.6   74.0   41.1   24.3   13.4

  Speedup over CPU-only (1x):
  Nodes                    1      2      4      8     16     32
  Wall time               2.39   2.42   2.32   2.22   2.08   2.00
  Solvers time            2.53   2.57   2.45   2.37   2.20   2.10
POWER8 to POWER9
• A code performing well on POWER8 + P100 GPUs should perform well on POWER9 + V100 GPUs (ORNL Summit socket, 2 sockets per node, NVLINK)
  – No major code refactoring needed
  – More powerful CPUs, GPUs and interconnect
• Some differences to consider:
  – Cores vs pairs of cores
    • POWER9 L3 cache and store queue are shared by each pair of cores
    • SMT4 per core, or SMT8 per pair of cores
  – V100 (Volta) drops lock-step execution of warp threads
    • One program counter per thread
    • If code assumes lock-step execution, explicit barriers have to be inserted
    • No guarantee threads will converge after divergence within a warp
    • One has to leverage cooperative groups and thread activity masks:

      for (cs_lnum_t ii = StartRow; ii < EndRow; ii += bdimy) {
        // Depending on the number of rows, warps may diverge here
        unsigned AM = __activemask();
        ...
        for (cs_lnum_t kk = 1; kk < bdimx; kk *= 2)
          sii += __shfl_down_sync(AM, sii, kk, bdimx);
        ...
      }
POWER8 to POWER9
• CUDA – Code Saturne 5.0+ – results for multiple ranks (3 ranks per GPU, 6 GPUs per node)
  – IBM POWER9 and NVLINK 2.0 (Summit) – lid-driven cavity flow – 889M-cell grid
  – CPU+GPU efficiency 76% @ 512 nodes

  Execution time (seconds):
  Nodes                64     256    512
  CPU wall time       74.73  21.04  11.16
  CPU+GPU wall time   32.3    7.25   4.76

  Speedup over CPU-only (1x):
  Nodes                64     256    512
  Wall time            2.31   2.90   2.34

• POWER9 vs POWER8: better efficiency when scaling to 16x more nodes for an 8x larger problem
CFD and AI
• Cognitive Enhanced Design
  – Designing/prototyping new pieces of equipment is expensive (time and finance)
    • Parameter sweeps need several expensive simulations
    • Want to make decisions faster, and on more complex problems
  – Use cognitive techniques (e.g. Bayesian neural networks) to generate a model over a parameterized space relating design parameters to performance; use this in Bayesian optimization to improve the design
    • Converge to optimal parameters more quickly
  – Example: airfoil optimization – lift/drag maximization
    • Adaptive Expected Improvement (EI) converges faster and with less variance
CFD and AI
• Enhanced 3D-Feature Detection
  – A typical bottleneck of the design process is analysing the output of the simulation-led workflow
    • Extract features like: flow, separation, swirl, layering
  – Extend AI techniques (deep-feature detection) to automatically extract features in 3D
    • Remove analysis bottlenecks
    • Semantic querying of simulation data
    • Contextual event classification
    • Computational steering for rare-event simulation
  – Example: racing-car vortex detection
    • AI-enabled feature detection
Questions?
[email protected]