AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
Krzysztof Rojek, CTO, byteLAKE
krojek@byteLAKE.com
6TH INTERNATIONAL EULAG USERS WORKSHOP
MAY 29, 2018, 14:00-14:45
 Area: Adaptation of real-life scientific codes to the most advanced computing architectures.
 Challenge: Device architectures are constantly changing, and current architectures vary widely. Our codes need to be very portable and flexible.
 Goal: Take HPC to "Industry 4.0" by implementing smart techniques that optimize the codes in terms of performance and energy consumption.
2
 Piz Daint (ranked 3rd on the TOP500 list):
 GPU: NVIDIA Tesla P100 (Pascal)
 1 GPU per node
 Single-GPU design
 5320 nodes (up to 36 used in this work)
 Calculation speed: float is 2x faster than double
 MICLAB:
 GPU: NVIDIA Tesla K80 (Kepler)
 2 GPUs per node
 Dual-GPU design
 2 nodes (remaining nodes with Intel Xeon Phi)
 Calculation speed: float is 3x faster than double
3
 Size of data transfer between nodes: 2x less using float than double
 No sudo access – a problem when the code relies on DVFS (a sketch of privilege-free energy measurement follows below)
Expectation: mixed-precision arithmetic allows us to reduce the energy consumption and execution time, and it can be used on real HPC platforms (without special access)
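Since sudo access is not available, one privilege-free way to estimate GPU energy is to sample board power through NVML and integrate it over time. The sketch below is illustrative only: measureEnergyJoules and the 50 ms sampling period are assumptions, not the measurement tooling actually used in this work.

```cpp
// Minimal sketch, assuming an NVML-capable GPU: integrate power samples over a
// window to estimate energy, without root privileges.
#include <nvml.h>
#include <chrono>
#include <thread>

double measureEnergyJoules(unsigned gpuIndex, double seconds) {
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(gpuIndex, &dev);

    double energy = 0.0;
    const double dt = 0.05;                        // 50 ms sampling period (assumed)
    for (double t = 0.0; t < seconds; t += dt) {
        unsigned int milliwatts = 0;
        nvmlDeviceGetPowerUsage(dev, &milliwatts); // current board power [mW]
        energy += (milliwatts / 1000.0) * dt;      // P [W] * dt [s] = E [J]
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    nvmlShutdown();
    return energy;
}
```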
 Stencil-based algorithm for numerical simulation of geophysical fluid flows on micro-to-planetary scales:
 7 stencils (compressed into 4 kernels) – each depends on one or more of the others (343 flops per element)
 Iterative algorithm – a single iteration represents one time step (see the loop sketch below)
 11 matrices:
 x, xP – scalar quantity (e.g., temperature); input/output matrices between time steps
 v1, v2, v3, v1P, v2P, v3P – velocity vectors in the i, j, and k directions
 h – density matrix
 cp, cn – temporary, intermediate matrices
4
[Figure: data-flow diagram linking Kernel 0–Kernel 3 through the matrices x, xP, v1P, v2P, v3P, cp, and cn; the per-kernel input/output sets shown are (x,v1,v2,v3,h), (v1,v2,v3,h,xP), (x,h,xP,v1P,v2P,v3P), and (x,h,v1P,v2P,v3P,cp,cn).]
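A minimal sketch of the iterative structure described on this slide, not the actual MPDATA code: four dependent kernels per time step and a swap of x/xP between steps. The kernel bodies and the exact mapping of matrices to kernel arguments are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <utility>

// Placeholder stencil kernels (argument lists are assumptions, not the real split).
__global__ void kernel0(const double* x, const double* v1, const double* v2,
                        const double* v3, const double* h,
                        double* v1P, double* v2P, double* v3P) { /* stencil */ }
__global__ void kernel1(const double* v1, const double* v2, const double* v3,
                        const double* h, const double* xP,
                        double* cp, double* cn) { /* stencil */ }
__global__ void kernel2(double* xP) { /* stencil */ }
__global__ void kernel3(double* xP) { /* stencil */ }

void runSimulation(int timeSteps, dim3 grid, dim3 block,
                   double*& x, double*& xP, double* v1, double* v2, double* v3,
                   double* h, double* v1P, double* v2P, double* v3P,
                   double* cp, double* cn) {
    for (int step = 0; step < timeSteps; ++step) {
        // 7 stencils compressed into 4 kernels; each depends on earlier outputs.
        kernel0<<<grid, block>>>(x, v1, v2, v3, h, v1P, v2P, v3P);
        kernel1<<<grid, block>>>(v1, v2, v3, h, xP, cp, cn);
        kernel2<<<grid, block>>>(xP);
        kernel3<<<grid, block>>>(xP);
        cudaDeviceSynchronize();
        std::swap(x, xP);   // xP becomes the input scalar field of the next time step
    }
}
```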
 Idea: Provide highly parametrized code in order to easily map the algorithm onto a GPU
 Mapping: Select the right values of the code parameters (a configuration) with respect to the desired criterion (energy consumption)
 How to: We build the search space of possible configurations and prune it using our machine learning module (MLM)
 MLM: This is still an ongoing task; here we propose to apply a modified random forest algorithm (a sketch of a configuration record follows below)
5
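A hedged sketch of what one point in the search space might look like. The field names follow the parameters listed on these slides (SPG, NDS, TGP, TDS, per-kernel block sizes, alignment, halo-exchange options); the struct itself and the evaluate() hook are hypothetical, not the actual tuner code.

```cpp
struct Configuration {
    int  SPG;                  // number of streams per GPU
    int  NDS;                  // number of nodes
    int  TGP[2];               // topology of streams, e.g. {1,2} or {2,1}
    int  TDS[2];               // topology of nodes
    int  blockX[4], blockY[4]; // CUDA block size for each of the 4 kernels
    int  alignBytes;           // data alignment/padding: 1, 2, 4, ..., 4096 B
    bool useGpuDirect;         // halo exchange: GPU Direct or staged buffers
    bool twoKernelExchange;    // halo packing with one kernel or two kernels
};

// Hypothetical hook: runs a few time steps with this configuration and returns
// the measured energy; the MLM would use such samples to prune the search space.
double evaluate(const Configuration& c);
```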
 We can vary the:
 Number of streams (SPG)
 Number of nodes (NDS)
 With different topologies:
 Topology of streams (TGP)
 Topology of nodes (TDS)
6
[Figure: example topologies – 1 stream/node: 1 topology (1x1); 2 streams/nodes: 2 topologies (1x2, 2x1); 4 streams/nodes: 3 topologies (1x4, 2x2, 4x1). A sketch enumerating such topologies follows below.]
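A small sketch of how the 2D topologies shown above can be enumerated for a given number of streams or nodes: all factorizations p x q with p*q = n. The helper name is illustrative.

```cpp
#include <vector>
#include <utility>

std::vector<std::pair<int,int>> topologies(int n) {
    std::vector<std::pair<int,int>> result;
    for (int p = 1; p <= n; ++p)
        if (n % p == 0)
            result.push_back({p, n / p});   // topology p x (n/p)
    return result;
}
// topologies(1) -> {1x1}; topologies(2) -> {1x2, 2x1};
// topologies(4) -> {1x4, 2x2, 4x1}  (3 topologies, as on the slide)
```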
7
[Figure: the two stream-parallelization schemes compared side by side – each time step runs Kernel 0–Kernel 3 plus a data transfer, with domain synchronization between the streams.]
 Distributed subdomains: each stream works on its own subdomain; computations are independent within a single time step; halo exchange is required after each time step
 Shared subdomain: all streams share the same subdomain; computations depend on the neighboring streams; halo exchange is not required within a single node
(a multi-stream sketch of the distributed-subdomain scheme follows below)
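A sketch of the distributed-subdomain scheme, assuming a simple slab decomposition: each CUDA stream updates its own slab independently within a time step, and halo exchange follows once all streams finish. The kernel, decomposition, and helper names are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void stencilStep(double* field, int slabOffset, int slabSize) {
    // ... update field[slabOffset .. slabOffset + slabSize) ...
}

void timeStepWithStreams(double* field, int domainSize, int numStreams,
                         cudaStream_t* streams, dim3 grid, dim3 block) {
    int slab = domainSize / numStreams;
    for (int s = 0; s < numStreams; ++s)
        stencilStep<<<grid, block, 0, streams[s]>>>(field, s * slab, slab);
    for (int s = 0; s < numStreams; ++s)
        cudaStreamSynchronize(streams[s]);   // all slabs done before the halo exchange
    // exchangeHalos(...);                   // required after each time step
}
```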
 By selecting the right halo-exchange strategy we can favor either more parallelism or fewer operations
 We can use a different strategy within a node and between nodes (see the sketch after the decision tree below)
8
[Figure: halo-exchange decision tree – cudaMemcpy-based exchange with buffers (GPU Direct or no GPU Direct) or without buffers, vs. kernel-based exchange with a single kernel or two kernels.]
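A hedged sketch of two of the options from the decision tree: staging the halo through a host buffer with cudaMemcpyAsync, and a direct device-to-device copy in the GPU Direct spirit (here via cudaMemcpyPeerAsync). Buffer layout, sizes, and function names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// (a) Staged exchange: device -> pinned host buffer -> device.
void exchangeHaloBuffered(double* dstDev, const double* srcDev, double* hostBuf,
                          size_t bytes, cudaStream_t stream) {
    cudaMemcpyAsync(hostBuf, srcDev, bytes, cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(dstDev, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
}

// (b) Direct device-to-device exchange between two GPUs.
void exchangeHaloPeer(double* dstDev, int dstGpu, const double* srcDev,
                      int srcGpu, size_t bytes, cudaStream_t stream) {
    // Requires peer access between the GPUs (cudaDeviceEnablePeerAccess).
    cudaMemcpyPeerAsync(dstDev, dstGpu, srcDev, srcGpu, bytes, stream);
}
```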
 We also take some basic parameters into consideration:
 CUDA block sizes for each of the 4 kernels
 CUDA blocks are of size X times Y, where
 X*Y mod 32 = 0
 X >= Y
 X mod 16 = 0
 X*Y <= 1024
 X <= M and Y <= N, where NxMxL is the size of the grid
 Data alignment and padding within a range from 1 to 4096 B
 Alignment in: {1, 2, 4, 8, …, 4096}
(a sketch enumerating valid block sizes follows below)
9
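A short sketch that enumerates the CUDA block sizes (X, Y) satisfying the constraints listed above for a grid of size NxMxL. The function name is illustrative.

```cpp
#include <vector>
#include <utility>

std::vector<std::pair<int,int>> validBlockSizes(int M, int N) {
    std::vector<std::pair<int,int>> sizes;
    for (int X = 16; X <= 1024 && X <= M; X += 16)     // X mod 16 = 0, X <= M
        for (int Y = 1; Y <= X && Y <= N; ++Y)         // X >= Y, Y <= N
            if ((X * Y) % 32 == 0 && X * Y <= 1024)    // warp multiple, X*Y <= 1024
                sizes.push_back({X, Y});
    return sizes;
}
```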
 Assumption: We believe we can find a really good configuration by testing about 5000 configurations from the search space (more than this is too expensive)
 We consider two possible approaches:
 Positive: Find good solutions and eliminate groups that seem to be worse than ours
 Risk: When we find a branch with a good solution, we may eliminate other branches (also quite good) that only appear worse – in fact, we may eliminate the branch containing the best solution.
 Negative: Find bad solutions and eliminate them
 Risk: After eliminating branches with bad solutions, the worst solution may still remain in the search space (but so does the best one).
 Fact: We test random branches (we may not select the best or the worst one); we are searching for a suboptimal solution (see the pruning sketch below).
10
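A very rough sketch of the "negative" pruning idea under a sampling budget of roughly 5000 configurations: branches whose sampled configurations look bad are dropped. The Sample record, the grouping into branches, and the threshold are hypothetical; the actual MLM is a modified random forest, which is not reproduced here.

```cpp
#include <vector>
#include <algorithm>

struct Sample { int branch; double energy; };   // one measured configuration

std::vector<int> pruneBadBranches(const std::vector<Sample>& samples,
                                  double badThreshold) {
    std::vector<int> eliminated;
    for (const Sample& s : samples)
        if (s.energy > badThreshold)             // a "bad" sample found in a branch
            eliminated.push_back(s.branch);      // mark that branch for elimination
    std::sort(eliminated.begin(), eliminated.end());
    eliminated.erase(std::unique(eliminated.begin(), eliminated.end()),
                     eliminated.end());
    return eliminated;
}
```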
 Precision: DOUBLE
 Diameter: 28.0
 L2 norm: 0.0746
 Diffusion error: 1.7503
 Phase error: 0.7576
11
[Figure annotation: halo exchange (xP or x)]
 Precision: DOUBLE
 Diameter: 28.0
 L2 norm: 0.0746
 Diff. err.: 1.7503
 Phase err.: 0.7576
 Precision: FLOAT
 Diameter: 28.0
 L2 norm: 0.1301
 Diff. err.: 2.2439
 Phase err.: 7.5919
12
[Figure panels: Double vs. Float]
 Goal: Reduce the energy consumption
 Condition: Keep the accuracy at a high level (1% loss is acceptable)
 Assumptions:
 The proposed method is intended for iterative algorithms
 Dynamic approach, self-adaptable to a particular simulation
 Self-adaptation is based on a short training stage (the first 11 time steps)
13
Training stage (a code sketch follows below):
1. Change the i-th matrix from DP to SP
2. Execute a single time step
3. Measure energy and accuracy
4. Restore the i-th matrix to DP
The traditional approach, based on a static selection of the precision arithmetic, is less flexible and may be too restrictive for some simulations.
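A sketch of the training stage listed above: for each of the 11 matrices, switch it from DP to SP, run a single time step, record the energy saving and accuracy loss, then restore it to DP. The two hooks into the solver are hypothetical, not part of the real code base.

```cpp
#include <vector>

void setMatrixPrecision(int matrixIdx, bool singlePrecision); // hypothetical hook
void runTimeStep(double* energy, double* accuracyError);      // hypothetical hook

struct TrainingResult { double deltaE; double deltaA; };      // per-matrix saving / loss

std::vector<TrainingResult> trainingStage(int numMatrices) {
    double baseEnergy = 0.0, baseError = 0.0;
    runTimeStep(&baseEnergy, &baseError);                     // all-double reference step

    std::vector<TrainingResult> results(numMatrices);
    for (int i = 0; i < numMatrices; ++i) {
        setMatrixPrecision(i, true);                          // 1. i-th matrix: DP -> SP
        double e = 0.0, err = 0.0;
        runTimeStep(&e, &err);                                // 2.-3. run one step, measure
        results[i].deltaE = baseEnergy - e;                   // energy saving of this switch
        results[i].deltaA = err - baseError;                  // accuracy loss of this switch
        setMatrixPrecision(i, false);                         // 4. restore the matrix to DP
    }
    return results;
}
```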
14
[Figure: two charts over a single solid body rotation (0/4 to 4/4), one data series per matrix switched from DP to SP. Left: energy consumption [%] – the saving ΔE should be maximized. Right: accuracy loss [%] relative to DOUBLE – the loss ΔA should be minimized.]
15
 Assumptions:
 ΔE – should be maximized
 ΔA – should be minimized
 Conclusion:
 R = ΔE/ΔA – the higher, the better
 Method:
 We estimate the R ratio for each matrix and switch the matrices with the highest R from double to float
 This step is repeated as long as the accuracy loss stays below 1%
[Workflow: calculate R = ΔE/ΔA for each matrix (the higher, the better) → sort the matrices in decreasing order of R → set the matrices with the highest R to float while the accuracy loss stays below 1%. A code sketch of this selection follows below.]
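A sketch of the selection step: compute R = ΔE/ΔA per matrix, sort in decreasing order of R, and greedily switch matrices to float while the accumulated accuracy loss stays below 1%. TrainingResult and setMatrixPrecision are the same hypothetical helpers as in the training-stage sketch; ΔA is assumed to be expressed in percent.

```cpp
#include <vector>
#include <algorithm>
#include <limits>

struct TrainingResult { double deltaE; double deltaA; };      // as in the training sketch
void setMatrixPrecision(int matrixIdx, bool singlePrecision); // hypothetical hook

void selectFloatMatrices(const std::vector<TrainingResult>& t) {
    auto R = [&](int i) {                                     // R = dE/dA
        return t[i].deltaA > 0.0 ? t[i].deltaE / t[i].deltaA
                                 : std::numeric_limits<double>::max();
    };

    std::vector<int> order(t.size());
    for (int i = 0; i < static_cast<int>(t.size()); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return R(a) > R(b); });     // the higher R, the better

    double totalLoss = 0.0;                                   // accumulated accuracy loss [%]
    for (int i : order) {
        if (totalLoss + t[i].deltaA >= 1.0) break;            // keep the loss below 1%
        setMatrixPrecision(i, true);                          // switch this matrix to float
        totalLoss += t[i].deltaA;
    }
}
```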
 Precision: DOUBLE
 Diameter: 28.0
 L2 norm: 0.0746
 Diff. err.: 1.7503
 Phase err.: 0.7576
 Precision: MIXED
 Diameter: 28.0
 L2 norm: 0.0749
 Diff. err.: 1.7504
 Phase err.: 0.7576
16
Float group: x, xP, v3, h, v1P, v3P
Double group: v1, v2, v2P, cp, cn
[Figure panel: Double]
17
[Figure panels: Double vs. Mixed]
 The proposed method was also validated for the other tests
 The difference between the L2 norms for double and mixed precision is 0.00001
 The phase is 44.2135 in both cases
 Test: 512x512x512 – 3909 time steps
18
Precision | Nodes | Time | Speedup vs. double | Energy | Energy reduction [%]
Double | 1 | 335.48 | – | 44 | –
Mixed | 1 | 255.02 | 1.32 | 35 | 19.79
Double | 32 | 27.52 | – | 71 | –
Mixed | 24 | 21.65 | 1.27 | 48 | 32.63
Conclusion: Energy consumption is reduced by 33%
[Chart: Gflop/s vs. # of nodes (1–36), Double vs. Mixed; the "Best performance" marker highlights the fastest configurations.]
 Test: 512x512x512 – 3909 time steps
19
Precision | Configuration | Time | Speedup vs. double | Energy | Energy reduction [%]
Double | 1/2 | 533.65 | – | 80 | –
Mixed | 1/2 | 352.18 | 1.51 | 53 | 34.00
Double | 4/8 | 144.83 | – | 87 | –
Mixed | 4/8 | 96.04 | 1.51 | 57 | 33.66
Conclusion: Energy consumption is reduced by 33%
[Chart: Gflop/s vs. # of GPUs (1–4, 2 GPUs per node), Double vs. Mixed; the "Best performance" marker highlights the fastest configurations.]
 The developed implementation of MPDATA is very flexible and portable
 The proposed method allows us to automate the code adaptation even for a very large number of possible configurations
 Mixed-precision arithmetic allows us to reduce the energy consumption and execution time
 It can be used on real HPC platforms without special access to the machine
 It affects the computation speed, data transfer, and scalability of the application
 The proposed method allows us to reduce the energy consumption by 33% without loss of accuracy
 It also improved the performance by a factor of 1.27 on Piz Daint and 1.51 on MICLAB relative to double-precision arithmetic
20
We build Artificial Intelligence software and integrate it into products.
We port and optimize algorithms for parallel CPU+GPU architectures.
We design and optimize algorithms for HPC supercomputers.
byteLAKE – we are specialists in:
www.byteLAKE.com
Machine Learning
Deep Learning
Computer Vision
High Performance Computing
Heterogeneous Computing
Edge Computing
Our mission: help industries transform for the era of Artificial Intelligence.
We combine business and academia. Our team consists of experts schooled by Fortune 500 corporations as well as PhD researchers.
22
Areas of Our Expertise
Computer Vision
Deep Learning
Machine Learning
Learning Optimization
AI for Edge Devices
HPC
Expertise Consultancy
Proof of Concept