www.bsc.es
ECMWF, Reading, UK, 31 October 2014
George S. Markomanolis, Jesus Labarta, Oriol Jorba
Optimizing an Earth Science
Atmospheric Application with the
OmpSs Programming Model
Workshop on High performance computing in meteorology
Outline
Introduction to NMMB/BSC-CTM
Performance overview of NMMB/BSC-CTM
Experiences with OmpSs
Future work
Severo-Ochoa Earth Sciences Application
Development of a Unified Meteorology/Air Quality/Climate model
• Towards a global high-resolution system for global to local
assessments
International collaborations:
Meteorology, Climate, Global aerosols, Air Quality
University of California, Irvine
Goddard Institute for Space Studies
National Centers for Environmental Prediction
Extending NMMB/BSC-CTM
from coarse regional scales to
global high-resolution
configurations
Coupling with a Data
Assimilation System for
Aerosols
Where do we solve the primitive equations? Grid
discretization
High performance computing resources:
If we plan to resolve small-scale features, we need a higher-resolution mesh, and so more HPC resources are required.
We need to be able to run these models on multi-core architectures.
The model domain is decomposed into patches
Patch: the portion of the model domain allocated to a distributed/shared memory node.
Parallelizing Atmospheric Models
MPI Communication with
neighbours
Patch
NMMB/BSC-CTM
NMMB/BSC-CTM is used operationally by the dust forecast center in Barcelona
NMMB is the operational model of NCEP
The overall goal is to improve its scalability and simulation resolution
MareNostrum III
3,056 compute nodes
2x Intel SandyBridge-EP E5-2670 2.6 GHz
32 GB memory per node
InfiniBand FDR10
OpenMPI 1.8.1
ifort 13.0.1
Performance Overview of NMMB/BSC-CTM Model
Execution diagram – Focus on the Model
Paraver
One hour simulation of NMMB/BSC-CTM, global, 24 km, 64 layers
meteo: 9 tracers
meteo + aerosols: 9 + 16 tracers
meteo + aerosols + gases: 9 + 16 + 53 tracers
Dynamic load balancing
Different simulation dates cause different load-balance issues (useful functions, global 24 km meteo configuration)
20/12/2005
20/07/2005
A "different" view point

LB    Ser   Trf   Eff
0.83  0.97   –    0.80
0.87  0.90   –    0.78
0.88  0.97  0.84  0.73
0.88  0.96  0.75  0.61

eff. = LB * Ser * Trf
One hour simulation
One hour simulation, chemistry configuration for global model,
24 km resolution
One hour simulation
Useful functions
Zoom on EBI solver and subsequent calls
The EBI solver (run_ebi) and the calls until the next invocation of the solver
Zoom between EBI solvers
The useful function calls between two EBI solver invocations
The first two dark blue areas are horizontal diffusion calls and the lighter one is advection chemistry.
Horizontal diffusion
We zoom in on horizontal diffusion and the calls that follow
Horizontal diffusion (blue colour) shows load imbalance
Experiences with OmpSs
Objectives
Apply OmpSs to a real application
Apply an incremental methodology
Identify opportunities
Explore difficulties
OmpSs introduction
Parallel programming model
- Built on an existing standard: OpenMP
- Directive based, to keep a serial version
- Targets: SMP, clusters, and accelerator devices
- Developed at the Barcelona Supercomputing Center (BSC)
Mercurium source-to-source compiler
Nanos++ runtime system
https://pm.bsc.es/ompss
Studying cases
Taskify a computation routine and investigate potential improvements and
overlap between computations of different granularities
Overlap communication with packing and unpacking costs
Overlap independent coarse grain operations composed of communication
and computation phases
Horizontal diffusion + communication
The horizontal diffusion has some load imbalance (blue colour)
There is some computation for packing/unpacking the communication buffers (red area)
Gather (green colour) and scatter for the FFTs
Horizontal diffusion skeleton code
The hdiff subroutine has the following loops and dependencies
Parallelizing loops
Part of hdiff with 2 threads
Parallelizing the most important loops
We obtain a speedup of 1.3 by using worksharing
Comparison
The execution of the hdiff subroutine with 1 thread takes 120 ms
With 2 threads it takes 56 ms; the speedup is 2.14
Issues related to communication
We study the exch4 subroutine (red colour)
The useful function of exch4 contains some computation
The communication creates a pattern, and the duration of the MPI_Wait calls can vary
Issues related to communication
Big load imbalance because of message ordering
There is also some computation
Taskify subroutine exch4
We observe the MPI_Wait calls in the first thread
Meanwhile, the second thread performs the necessary computation, overlapping the communication
Taskify subroutine exch4
The total execution of the exch4 subroutine with 1 thread
The total execution of the exch4 subroutine with 2 threads
With 2 threads the speedup is 1.76 (further improvements have been identified)
Advection chemistry and FFT
Advection chemistry (blue colour) and 54 calls to gather/FFT/scatter until the monotonization chemistry (brown colour)
Initial study to test the improvements of the execution with the OmpSs programming model
Study case: gather/FFT/scatter
Workflow, two iterations, using two threads, declaring
dependencies
Study case: gather/FFT/scatter
Paraver view, two iterations, four tracers in total
Thread 1:
Iteration 1: FFT_1, scatter_1, scatter_2
Iteration 2: gather_4, FFT_4, scatter_4
Thread 2:
Iteration 1: gather_1, gather_2, FFT_2
Iteration 2: gather_3, FFT_3, scatter_3
Study case: gather/FFT/scatter - Performance
Comparing the execution time of 54 calls to gather/FFT/scatter
with one and two threads
The speedup with two threads is 1.56 and we have identified
potential improvements
MPI Bandwidth
MPI bandwidth for gather/scatter
MPI bandwidth over 1 GB/s for gather/scatter
Combination of advection chemistry and FFT
Advection chemistry with worksharing (not for all the loops), FFTs on one thread
The same, but with two threads
The speedup for the advection chemistry routine is 1.6 and overall it is 1.58
Comparison between MPI and MPI+OmpSs
Pure MPI: 128 computation processes and 4 I/O processes
MPI + OmpSs: 64 MPI processes + 64 threads + 4 I/O processes
The load imbalance for the FFT with pure MPI is 28%, while with MPI+OmpSs it is 54%
Incremental methodology with OmpSs
Taskify the loops
Start with 1 thread; use if(0) to serialize tasks
Test that the dependencies are correct (usually trial and error)
Imagine an application crashing after adding 20+ new pragmas (true story)
Do not parallelize loops that do not contain significant computation
Conclusions
The incremental methodology is important for keeping the overhead in the application low
OmpSs can be applied to a real application, but it is not straightforward
It can achieve good speedups, depending on the case
Overlapping communication with computation is a really interesting topic
We are still at the beginning, but OmpSs seems promising
Future improvements
Investigate the usage of multithreaded MPI
One of the main functions of the application is the EBI solver (run_ebi). Global variables make the function non-reentrant; refactoring of the code is needed.
Port more code to OmpSs and investigate MPI calls as tasks
Some computation is independent of the model's layers or tracers. OpenCL kernels are going to be developed to test the performance on accelerators.
Test the versioning scheduler
The dynamic load balancing library should be studied further (http://pm.bsc.es/dlb)
Apply OmpSs for a data assimilation simulation
www.bsc.es
Thank you!
For further information please contact
georgios.markomanolis@bsc.es
"Work funded by the SEV-2011-00067 grant of the Severo
Ochoa Program, awarded by the Spanish Government."
Acknowledgements:
Rosa M. Badia, Judit Gimenez, Roger Ferrer Ibáñez, Julian Morillo,
Victor López, Xavier Teruel, Harald Servat, BSC support
