Presenter :  Nageeb Yahya Alsurmi GS21565 Lecturer : Assoc. Prof. Dr Mohamed Othman Test Suite for Evaluating Performance of MPI Implementations That Support MPI_THREAD_MULTIPLE By: Rajeev Thakur and William Gropp Argonne National Laboratory, USA
Outline: Introduction, Literature Review, Problem Statement, Problem Objective, Methodology, Test Suite, Experimental Results, Conclusion, References
With thread-safe MPI implementations becoming increasingly common, an MPI process may be multithreaded, and each thread can issue MPI calls. Threads are not separately addressable: a rank in a send or receive call identifies a process, not a thread. A message sent to a process can be received by any thread in that process. The user can ensure that two threads in the same process will not issue conflicting communication calls by using a distinct communicator in each thread.
The two main requirements for a thread-compliant implementation: 1- All MPI calls are thread-safe. 2- A blocking MPI call blocks only the calling thread, allowing other threads to execute, if available.
The MPI benchmarks from Ohio State University contain only a multithreaded latency test: a ping-pong test with one thread on the sender side and two (or more) threads on the receiver side. A number of other MPI benchmarks exist, such as SKaMPI and the Intel MPI Benchmarks, but they do not measure the performance of multithreaded MPI programs.
With thread-safe MPI implementations becoming increasingly common, users are able to write multithreaded MPI programs that make MPI calls concurrently from multiple threads. Developing a thread-safe MPI implementation is a fairly complex task. Users, therefore, need a way to measure the outcome and determine how efficiently an implementation can support multiple threads.
The authors proposed a test suite that can shed light on the performance of an MPI implementation in the multithreaded case.
To understand the test suite, you first have to understand the thread-safety specification in MPI. MPI defines four “levels” of thread safety: 1- MPI_THREAD_SINGLE: each process has a single thread of execution. 2- MPI_THREAD_FUNNELED: a process may be multithreaded, but only the main thread, the one that initialized MPI, may make MPI calls.
3- MPI_THREAD_SERIALIZED: a process may be multithreaded, but only one thread at a time may make MPI calls. 4- MPI_THREAD_MULTIPLE: a process may be multithreaded, and multiple threads may simultaneously call MPI functions (with some restrictions).
If your code does not access the same memory location from multiple threads without protection, it is most likely thread-safe. This is fairly minimal thread safety, since you must still ensure that your program's logic is thread-safe if your application is multithreaded. In this context, thread safety means that the execution of multiple threads does not in itself corrupt the state of your objects.
Deadlock occurs when a process holds a lock and then attempts to acquire a second lock. If the second lock is already held by another process, the first process blocks. If the second process then attempts to acquire the lock held by the first, the system has "deadlocked": no progress will ever be made. Locks also cause blocking, meaning some threads/processes have to wait until a lock (or a whole set of locks) is released. Example: in process 0, thread 0 posts MPI_Recv(src1) and thread 1 posts MPI_Send(dest1); in process 1, thread 0 posts MPI_Recv(src0) and thread 1 posts MPI_Send(dest0). When the send buffer fills, each sending thread waits for the receiving thread on the other side to empty (read) the buffer, so the threads end up waiting on one another.
There are many MPI implementations, but this paper uses just four: MPICH2: a portable library implementation of MPI, the message-passing standard. It is a library (not a compiler) and can achieve parallelism using networked machines or multitasking on a single machine. Open MPI: a merger of three well-known MPI implementations (FT-MPI, LA-MPI, and LAM/MPI). Sun MPI: Sun Microsystems' implementation of MPI; runs on Sun machines. IBM MPI: runs on IBM SP systems and AIX workstation clusters.
The test suite was run with multiple MPI implementations on different platforms: a Linux cluster (AMD Opteron, two dual-core CPUs per node) with MPICH2 v1.0.5 and Open MPI v1.2.1; a Sun Fire E2900 SMP (8 dual-core UltraSPARC CPUs) with Sun MPI; and an IBM p566+ SMP (8 Power4+ CPUs) with IBM MPI.
The tests fall into three categories: 1- Cost of thread safety: 1-1 MPI_THREAD_MULTIPLE overhead. 2- Concurrent progress: 2-1 concurrent bandwidth, 2-2 concurrent latency, 2-3 concurrent short-long messages. 3- Computation/communication: 3-1 computation/communication overlap, 3-2 concurrent collective operations, 3-3 concurrent collectives and computation.
1-1 MPI_THREAD_MULTIPLE overhead test: a ping-pong latency test run in two modes. Single-threaded, initialized with MPI_Init(&argc, &argv): mpiexec -n 2 latency. Multithreaded, initialized with MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...): mpiexec -n 2 latency_th 4. The difference between the two measured latencies is the overhead.
MPI_THREAD_MULTIPLE overhead results: Linux cluster (MPICH2 & Open MPI): average overhead <= 0.5 µs. IBM cluster (IBM MPI): average overhead < 0.25 µs. Sun cluster (Sun MPI): average overhead > 3 µs.
2-1 Concurrent bandwidth test (cumulative bandwidth), on large messages. Process case: 4 processes on each node, each sending large messages to its peer on the other node. Thread case: 2 processes per node, each with 2 threads. The cumulative bandwidth across all senders is measured.
Why this test? To measure how much thread locks affect the cumulative bandwidth. Linux cluster (AMD Opteron, two dual-core): MPICH2 shows no measurable difference in bandwidth between threads and processes, while Open MPI shows a decline in bandwidth with threads. IBM MPI & Sun MPI: a substantial decline (more than 50% in some cases) in bandwidth when threads are used.
2-2 Concurrent latency test: similar to the concurrent bandwidth test, except that it measures the time for individual short messages, again comparing 4 processes per node with 2 processes of 2 threads each.
The overhead in latency when using concurrent threads instead of processes: Linux cluster: MPICH2 overhead is about 20 µs; Open MPI overhead is about 30 µs. IBM MPI & Sun MPI: the latency with threads is about 10 times the latency with processes, although IBM and Sun still have lower absolute latency than MPICH2 and Open MPI.
2-3 Concurrent short-long messages test: a blend of the concurrent bandwidth and concurrent latency tests. One thread (or process) sends a series of short messages while another sends a long message, testing the fairness of thread scheduling and locking.
This result demonstrates that, in the threaded case, locks are fairly held and released, and that the thread blocked in the long-message send does not block the other thread.
3-1 Computation/communication overlap test. Test 1 (non-threaded mode) has an iterative loop in which a process communicates with its four nearest neighbors by posting nonblocking sends and receives, followed by a computation phase, followed by an MPI_Waitall for the communication to complete. Test 2 (threaded mode) is similar, except that before the iterative loop each process spawns a thread that blocks on an MPI_Recv; this effectively simulates asynchronous progress by the MPI implementation. If the total time in threaded mode is lower than in non-threaded mode, the implementation overlaps computation and communication; otherwise there is no overlap.
3-2 Concurrent collectives test: compares the performance of concurrent calls to a collective function (MPI_Allreduce) issued from multiple threads with that of the same calls issued from multiple processes.
Results on the Linux cluster: MPICH2 has relatively small overhead for the threaded version compared with Open MPI.
3-3 Concurrent collective and computation test: evaluates the ability to use a thread to hide the latency of a collective operation. It is the same as the previous test, except that on a node with p cores, p+1 threads are spawned; thread p does an MPI_Allreduce with its corresponding threads on the other nodes. The result is compared with the case with no allreduce thread (the higher the better).
The results on the Linux cluster show that MPICH2 demonstrates a better ability than Open MPI to hide the latency of the allreduce.
As MPI implementations supporting MPI_THREAD_MULTIPLE become increasingly available, users need a way to evaluate them. The authors have developed such a test suite and reported its performance on multiple platforms and implementations. The results indicate good performance with MPICH2 and Open MPI on Linux clusters, and poor performance with IBM MPI and Sun MPI on IBM and Sun SMP systems. The authors plan to add more tests to the suite, such as measuring the overlap of computation/communication with the MPI-2 file I/O and connect-accept features.
1. Francisco García, Alejandro Calderón, and Jesús Carretero. MiMPI: A multithread-safe implementation of MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 6th European PVM/MPI Users' Group Meeting, pages 207–214. Lecture Notes in Computer Science 1697, Springer, September 1999. 2. William Gropp and Rajeev Thakur. Issues in developing a thread-safe MPI implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 13th European PVM/MPI Users' Group Meeting, pages 12–21. Lecture Notes in Computer Science 4192, Springer, September 2006. 3. Intel MPI Benchmarks. http://www.intel.com. 4. OSU MPI benchmarks. http://mvapich.cse.ohio-state.edu/benchmarks. 5. Boris V. Protopopov and Anthony Skjellum. A multithreaded message passing interface (MPI) architecture: Performance and program issues. Journal of Parallel and Distributed Computing, 61(4):449–466, April 2001. 6. Ralf Reussner, Peter Sanders, and Jesper Larsson Träff. SKaMPI: A comprehensive benchmark for public benchmarking of MPI. Scientific Programming, 10(1):55–65, January 2002.
Any questions about MPI multithreading?