Parallel and Distributed
  Computing on Low
    Latency Clusters

                Vittorio Giovara
    M. S. Electrical Engineering and Computer Science
              University of Illinois at Chicago
                          May 2009
Contents

•   Motivation
•   Strategy
•   Technologies
      •   OpenMP
      •   MPI
      •   Infiniband
•   Application
      •   Compiler Optimizations
      •   OpenMP and MPI over Infiniband
•   Results
•   Conclusions
Motivation
Motivation

• The scaling trend has to stop for CMOS
  technology:
    ✓ Direct-tunneling limit in SiO2 ~3 nm
    ✓ Distance between Si atoms ~0.3 nm
    ✓ Variability

• Fundamental reason: rising fab cost
Motivation

• Easy to build multi-core processors
• Human effort is required to modify and adapt
  software for concurrency
• New classification for computer
  architectures
Classification
[Figure: Flynn's taxonomy – four quadrants (SISD, SIMD, MISD, MIMD), each drawn as CPUs connected to an instruction pool and a data pool; SISD has a single CPU, SIMD and MIMD have multiple CPUs]
[Figure: algorithm, loop level, and process management plotted against abstraction level and how easily each is parallelized]
Levels
[Figure: issues at each level – recursion, memory management, and profiling at the algorithm level; data dependency, branching overhead, and control flow at the loop level; SMP multiprogramming, multithreading and scheduling at the process-management level]
Backfire

• Difficult to fully exploit the parallelism
  offered
• Automatic tools are required to adapt software
  to parallelism
• Compiler support for manual or semi-automatic
  enhancement
Applications
• OpenMP and MPI are two popular tools
  used to simplify the parallelization of
  both new and old software
• Mathematics and Physics
• Computer Science
• Biomedicine
Specific Problem and Background
• Sally3D is a micromagnetics program suite
  for field analysis and modeling, developed at
  Politecnico di Torino (Department of
  Electrical Engineering)
• Computationally intensive (runs can take
  days of CPU time); a speedup is required
• Previous work does not fully cover the
  problem (no Infiniband or OpenMP+MPI
  solutions)
Strategy
Strategy
• Install a Linux kernel with an ad-hoc
  configuration for scientific computation
• Compile an OpenMP-enabled GCC
  (supported from 4.3.1 onwards)
• Add the Infiniband link between cluster nodes,
  with the proper drivers in kernel and user space
• Select an MPI implementation library
Strategy
• Verify the Infiniband network with some
  MPI test examples
• Install the target software
• Add OpenMP and MPI directives
  to the code
• Run test cases
OpenMP

• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple constructs
OpenMP - example

[Figure: fork-join model – the master thread forks into Thread A and Thread B, which execute Parallel Tasks 1–4 before joining back into the master thread]
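
To make the fork-join picture concrete, here is a minimal C sketch (illustrative only, not code from the presentation) in which each independent task is marked as an OpenMP section; the runtime forks the threads and joins them at the end of the block.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* The master thread forks a team; each section runs as a parallel task. */
        #pragma omp parallel sections
        {
            #pragma omp section
            printf("Task 1 on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("Task 2 on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("Task 3 on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("Task 4 on thread %d\n", omp_get_thread_num());
        }   /* implicit join back into the master thread */
        return 0;
    }

It builds with gcc -fopenmp, matching the OpenMP-enabled GCC (4.3.1 onwards) listed in the strategy.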
OpenMP Scheduler

• Which scheduler is best suited to the hardware?
   - Static
   - Dynamic
   - Guided
  (a timing sketch follows below)
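
A minimal timing sketch (the loop size and body are placeholders, not the Sally3D kernels) shows where the schedule clause and chunk size enter; switching static to dynamic or guided and varying the chunk value reproduces the comparison charted below.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000                     /* placeholder problem size */

    static double a[N];

    int main(void) {
        int i;
        double t0 = omp_get_wtime();

        /* Try schedule(static, c), schedule(dynamic, c) or schedule(guided, c)
           with c = 1, 10, 100, 1000, 10000 as in the charts. */
        #pragma omp parallel for schedule(static, 100)
        for (i = 0; i < N; i++)
            a[i] = i * 0.5;               /* stand-in for the real loop body */

        printf("elapsed: %.0f microseconds\n", (omp_get_wtime() - t0) * 1e6);
        return 0;
    }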
OpenMP Scheduler

[Chart: OpenMP static scheduler – run time in microseconds (0–80000) versus number of threads (1–16), one curve per chunk size: 1, 10, 100, 1000, 10000]
OpenMP Scheduler

[Chart: OpenMP dynamic scheduler – run time in microseconds (0–117000) versus number of threads (1–16), one curve per chunk size: 1, 10, 100, 1000, 10000]
OpenMP Scheduler

[Chart: OpenMP guided scheduler – run time in microseconds (0–80000) versus number of threads (1–16), one curve per chunk size: 1, 10, 100, 1000, 10000]
OpenMP Scheduler

[Figure: side-by-side comparison of the static, dynamic, and guided scheduler charts]
MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
      - OpenMPI
      - MVAPICH
  (a minimal example follows below)
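
A minimal point-to-point sketch in C (the payload is a placeholder); the same source compiles against either OpenMPI or MVAPICH with mpicc and runs with mpirun -np 2.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* rank 0 sends one integer to rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Once the Infiniband drivers and the MPI transport are configured, the same binary can run over the low-latency link.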
Infiniband

• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
MPI over Infiniband
[Chart: MPI over Infiniband, OpenMPI versus Mvapich2 – transfer time on a logarithmic scale from 1 µs to 10,000,000 µs against message size, from 1 kB up to the gigabyte range]
MPI over Infiniband
[Chart: MPI over Infiniband, OpenMPI versus Mvapich2 – transfer time on a logarithmic scale from 1 µs to 10,000,000 µs against message size, from 1 kB up to the megabyte range]
Optimizations

• Active at compile time
• Available only after porting the software to
  standard FORTRAN
• Consistent documentation available
• Unexpected positive results
Optimizations

• -march=native
• -O3
• -ffast-math
• -Wl,-O1
Target Software
Target Software

• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• the program uses a linear formulation of the
  mathematical models
Implementation Scheme
[Figure: implementation scheme – in the standard programming model a sequential loop becomes a parallel loop executed by OpenMP threads; a distributed loop splits the work between Host 1 and Host 2, each running OpenMP threads, with MPI linking the hosts]
Implementation Scheme
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays – synchronization
  objects required
  ➡ send() and recv() mechanism
  ➡ critical regions using OpenMP directives
  ➡ function merging
  ➡ matrix conversion
  (a sketch of the hybrid scheme follows below)
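
A sketch of how these pieces fit together (illustrative only, not the actual Sally3D code; sizes and names are made up): the three-dimensional matrix is split into slabs along one axis, each MPI rank (one per host) updates its slab with an OpenMP loop, a critical region protects a shared temporary, and one boundary plane is exchanged through send()/recv().

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    #define NX 64
    #define NY 64
    #define NZ 64

    static double field[NX][NY][NZ];      /* the three-dimensional matrix */
    static double global_max = 0.0;       /* shared temporary */

    static void update_slab(int x0, int x1) {
        #pragma omp parallel
        {
            double local_max = 0.0;
            int i, j, k;

            /* OpenMP threads share this host's slab of the matrix. */
            #pragma omp for collapse(2)
            for (i = x0; i < x1; i++)
                for (j = 0; j < NY; j++)
                    for (k = 0; k < NZ; k++) {
                        field[i][j][k] *= 0.99;   /* stand-in for the real update */
                        if (field[i][j][k] > local_max)
                            local_max = field[i][j][k];
                    }

            /* Critical region protecting the shared temporary. */
            #pragma omp critical
            if (local_max > global_max)
                global_max = local_max;
        }
    }

    int main(int argc, char **argv) {
        int rank, size, chunk, x0, x1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each host owns a contiguous block of x-planes (distributed loop). */
        chunk = NX / size;
        x0 = rank * chunk;
        x1 = (rank == size - 1) ? NX : x0 + chunk;
        update_slab(x0, x1);

        /* Neighbouring hosts exchange one boundary plane via send()/recv(). */
        if (size > 1) {
            if (rank == 0)
                MPI_Send(&field[x1 - 1][0][0], NY * NZ, MPI_DOUBLE, 1, 0,
                         MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&field[x0 - 1][0][0], NY * NZ, MPI_DOUBLE, 0, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

Compile with mpicc -fopenmp and launch one MPI rank per host, as in the scheme above.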
Results
Results
  OMP    MPI    OPT   seconds
   *      *      *       133
   *      *      -       400
   *      -      *       186
   *      -      -       487
   -      *      *       200
   -      *      -       792
   -      -      *       246
   -      -      -      1062




Total Speed Increase: 87.52%
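(The figure follows from the table: (1062 − 133) / 1062 ≈ 87.5%.)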
Actual Results
                  OMP       MPI      seconds
                   *         *          59
                   *         -         129
                   -         *         174
                   -         -         249




  Function Name   Normal    OpenMP      MPI     OpenMP+MPI
calc_intmudua      24.5 s    4.7 s     14.4 s      2.8 s
calc_hdmg_tet      16.9 s    3.0 s     10.8 s      1.7 s
calc_mudua         12.1 s    1.9 s      7.0 s      1.1 s
campo_effettivo    17.7 s    4.5 s      9.9 s      2.3 s
Actual Results

• OpenMP: 6–8x
• MPI: 2x
• OpenMP + MPI: 14–16x

     Total Raw Speed Increment: 76%
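(The 76% figure is consistent with the table above: (249 − 59) / 249 ≈ 0.76.)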
Conclusions
Conclusions and
        Future Work
• Computational time has been significantly
    decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG ‘09
•   Continue inserting OpenMP and MPI directives
•   Perform algorithm optimizations
•   Increase cluster size
