Designing Parallel Programs
Introduction to High Performance Computing
Agenda
 Automatic vs. Manual Parallelization
 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning
Automatic vs. Manual Parallelization
 Designing and developing parallel programs has
characteristically been a very manual process. The programmer
is typically responsible for both identifying and actually
implementing parallelism.
 Very often, manually developing parallel codes is a time-consuming, complex, error-prone, and iterative process.
 For a number of years now, various tools have been available to
assist the programmer with converting serial programs into
parallel programs. The most common type of tool used to
automatically parallelize a serial program is a parallelizing
compiler or pre-processor.
 A parallelizing compiler generally works in two different ways:
– Fully Automatic
 The compiler analyzes the source code and identifies opportunities for
parallelism.
 The analysis includes identifying inhibitors to parallelism and possibly a
cost weighting on whether or not the parallelism would actually improve
performance.
 Loops (do, for) are the most frequent target for automatic parallelization.
– Programmer Directed
 Using "compiler directives" or possibly compiler flags, the programmer
explicitly tells the compiler how to parallelize the code.
 May be used in conjunction with some degree of automatic parallelization.
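
As an illustration of the programmer-directed approach, the sketch below parallelizes a simple loop with an OpenMP compiler directive. It is a minimal, hypothetical example (the array names and loop body are not from these slides) and assumes an OpenMP-capable Fortran compiler (e.g., gfortran -fopenmp); without that flag the directives are treated as comments and the loop runs serially.

! Minimal sketch of programmer-directed parallelization: the !$omp
! directive tells the compiler to split the loop iterations among threads.
program directed_example
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real :: a(n), b(n)

  b = 1.0

  !$omp parallel do
  do i = 1, n
     a(i) = 2.0 * b(i)      ! iterations are independent: safe to parallelize
  end do
  !$omp end parallel do

  print *, 'a(n) =', a(n)
end program directed_example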
 If you are beginning with an existing serial code and have time
or budget constraints, then automatic parallelization may be the
answer. However, there are several important caveats that
apply to automatic parallelization:
– Wrong results may be produced
– Performance may actually degrade
– Much less flexible than manual parallelization
– Limited to a subset (mostly loops) of code
– May actually not parallelize code if the analysis suggests there are
inhibitors or the code is too complex
– Most automatic parallelization tools are for Fortran
 The remainder of this section applies to the manual method of
developing parallel codes.
Understand the Problem and the Program
 Undoubtedly, the first step in developing parallel software is to understand the problem that you wish to solve in parallel.
 If you are starting with a serial program, this
necessitates understanding the existing code also.
 Before spending time in an attempt to develop a
parallel solution for a problem, determine whether or
not the problem is one that can actually be
parallelized.
Example of Parallelizable Problem
Calculate the potential energy for each of several
thousand independent conformations of a molecule.
When done, find the minimum energy conformation.
 This problem can be solved in parallel. Each of the molecular conformations is independently determinable. The calculation of the minimum energy conformation is also a parallelizable problem.
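
A minimal sketch of how this might look with OpenMP: the conformations are evaluated independently and the minimum is found with a reduction. The energy() function and the number of conformations are placeholders, not part of the original example.

! Minimal sketch: independent energy evaluations plus a min reduction.
program min_energy_sketch
  implicit none
  integer, parameter :: nconf = 5000
  integer :: i
  real :: e, emin

  emin = huge(emin)
  !$omp parallel do private(e) reduction(min:emin)
  do i = 1, nconf
     e = energy(i)              ! each conformation is evaluated independently
     emin = min(emin, e)        ! the minimum is combined across threads
  end do
  !$omp end parallel do

  print *, 'minimum energy =', emin

contains
  real function energy(i)       ! stand-in for the real potential-energy model
    integer, intent(in) :: i
    energy = real(mod(i * 37, 101))
  end function energy
end program min_energy_sketch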
Example of a Non-parallelizable Problem
Calculation of the Fibonacci series (0, 1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula:
F(k + 2) = F(k + 1) + F(k)
 This is a non-parallelizable problem because the
calculation of the Fibonacci sequence as shown
would entail dependent calculations rather than
independent ones. The calculation of the F(k + 2) value uses the values of both F(k + 1) and F(k). These three terms cannot be calculated independently and therefore cannot be calculated in parallel.
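
Writing the recurrence as a loop (a sketch, not from the slides) makes the inhibitor explicit: each iteration consumes the results of the two previous iterations, so the iterations cannot be handed to independent tasks.

! Minimal sketch: iteration k cannot start before k-1 and k-2 have finished.
program fib_sketch
  implicit none
  integer, parameter :: n = 30
  integer :: k
  integer :: f(n)

  f(1) = 0
  f(2) = 1
  do k = 3, n
     f(k) = f(k-1) + f(k-2)     ! loop-carried dependence on the two previous results
  end do

  print *, 'F(', n, ') =', f(n)
end program fib_sketch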
Identify the program's hotspots
 Know where most of the real work is being done. The
majority of scientific and technical programs usually
accomplish most of their work in a few places.
 Profilers and performance analysis tools can help
here
 Focus on parallelizing the hotspots and ignore
those sections of the program that account for little
CPU usage.
Identify bottlenecks in the program
 Are there areas that are disproportionately slow, or
cause parallelizable work to halt or be deferred? For
example, I/O is usually something that slows a
program down.
 May be possible to restructure the program or use a
different algorithm to reduce or eliminate
unnecessary slow areas
Other considerations
 Identify inhibitors to parallelism. One common class
of inhibitor is data dependence, as demonstrated by
the Fibonacci sequence above.
 Investigate other algorithms if possible. This may be
the single most important consideration when
designing a parallel application.
Partitioning
 One of the first steps in designing a parallel program
is to break the problem into discrete "chunks" of work
that can be distributed to multiple tasks. This is
known as decomposition or partitioning.
 There are two basic ways to partition computational
work among parallel tasks:
– domain decomposition
and
– functional decomposition
Domain Decomposition
 In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data.
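
A minimal MPI sketch of a block-style domain decomposition, where each task derives the slice of a global index range it owns from its rank. The global problem size and variable names are illustrative assumptions.

! Minimal sketch: each task computes the block of the global range it owns.
program block_decomp_sketch
  use mpi
  implicit none
  integer, parameter :: nglobal = 1000
  integer :: rank, nprocs, chunk, mystart, myend, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  chunk   = (nglobal + nprocs - 1) / nprocs      ! ceiling division
  mystart = rank * chunk + 1
  myend   = min(nglobal, (rank + 1) * chunk)

  print *, 'task', rank, 'works on elements', mystart, 'to', myend

  call MPI_Finalize(ierr)
end program block_decomp_sketch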
Partitioning Data
 There are different ways to partition data
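
The figure that accompanied this slide is not reproduced here. Two distributions commonly used to illustrate the choices are block (each task owns one contiguous chunk) and cyclic (elements are dealt out round-robin). The sketch below, with illustrative sizes, shows the owner calculation for both.

! Minimal sketch: which task owns element i under block vs. cyclic
! distribution of n elements over p tasks.
program distribution_sketch
  implicit none
  integer, parameter :: n = 12, p = 3
  integer :: i, chunk, owner_block, owner_cyclic

  chunk = (n + p - 1) / p                        ! block size (ceiling)
  do i = 1, n
     owner_block  = (i - 1) / chunk              ! contiguous chunks
     owner_cyclic = mod(i - 1, p)                ! round-robin
     print *, 'element', i, ' block owner:', owner_block, &
              ' cyclic owner:', owner_cyclic
  end do
end program distribution_sketch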
Functional Decomposition
 In this approach, the focus is on the computation that is to be
performed rather than on the data manipulated by the
computation. The problem is decomposed according to the work
that must be done. Each task then performs a portion of the
overall work.
 Functional decomposition lends itself well to problems that can
be split into different tasks. For example
– Ecosystem Modeling
– Signal Processing
– Climate Modeling
Ecosystem Modeling
 Each program calculates the population of a given
group, where each group's growth depends on that of
its neighbors. As time progresses, each process
calculates its current state, then exchanges
information with the neighbor populations. All tasks
then progress to calculate the state at the next time
step.
Signal Processing
 An audio signal data set is passed through four distinct
computational filters. Each filter is a separate process. The first
segment of data must pass through the first filter before
progressing to the second. When it does, the second segment
of data passes through the first filter. By the time the fourth
segment of data is in the first filter, all four tasks are busy.
Climate Modeling
 Each model component can be thought of as a separate task. Components exchange data during the computation: the atmosphere model generates wind velocity data that are used by the ocean model, the ocean model generates sea surface temperature data that are used by the atmosphere model, and so on.
 Combining these two types of problem decomposition is
common and natural.
Communications
Who Needs Communications?
 The need for communications between tasks depends upon your
problem
 You DON'T need communications
– Some types of problems can be decomposed and executed in parallel with
virtually no need for tasks to share data. For example, imagine an image
processing operation where every pixel in a black and white image needs to
have its color reversed. The image data can easily be distributed to multiple
tasks that then act independently of each other to do their portion of the
work.
– These types of problems are often called embarrassingly parallel because they are so straightforward. Very little inter-task communication is required.
 You DO need communications
– Most parallel applications are not quite so simple, and do require tasks to
share data with each other. For example, a 3-D heat diffusion problem
requires a task to know the temperatures calculated by the tasks that have
neighboring data. Changes to neighboring data have a direct effect on that task's data.
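
As a concrete illustration of the heat-diffusion case, the sketch below shows one way neighboring tasks could exchange boundary ("ghost") values with MPI before each update step. The one-dimensional decomposition, array sizes, and periodic neighbor arithmetic are illustrative assumptions, not the code behind this example.

! Minimal sketch: each task trades edge values with its two neighbors.
program halo_sketch
  use mpi
  implicit none
  integer :: rank, nprocs, left, right, ierr
  integer :: status(MPI_STATUS_SIZE)
  real :: u(0:11)              ! 10 local points plus two ghost cells

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  u = real(rank)               ! placeholder local data
  left  = mod(rank - 1 + nprocs, nprocs)
  right = mod(rank + 1, nprocs)

  ! send my right edge to the right neighbor, receive my left ghost cell
  call MPI_Sendrecv(u(10), 1, MPI_REAL, right, 0, &
                    u(0),  1, MPI_REAL, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  ! send my left edge to the left neighbor, receive my right ghost cell
  call MPI_Sendrecv(u(1),  1, MPI_REAL, left,  1, &
                    u(11), 1, MPI_REAL, right, 1, &
                    MPI_COMM_WORLD, status, ierr)

  call MPI_Finalize(ierr)
end program halo_sketch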
Factors to Consider (1)
 There are a number of important factors to consider
when designing your program's inter-task
communications
 Cost of communications
– Inter-task communication virtually always implies overhead.
– Machine cycles and resources that could be used for
computation are instead used to package and transmit data.
– Communications frequently require some type of
synchronization between tasks, which can result in tasks
spending time "waiting" instead of doing work.
– Competing communication traffic can saturate the available
network bandwidth, further aggravating performance
problems.
Factors to Consider (2)
 Latency vs. Bandwidth
– latency is the time it takes to send a minimal (0 byte)
message from point A to point B. Commonly expressed as
microseconds.
– bandwidth is the amount of data that can be communicated
per unit of time. Commonly expressed as megabytes/sec.
– Sending many small messages can cause latency to
dominate communication overheads. Often it is more
efficient to package small messages into a larger message,
thus increasing the effective communications bandwidth.
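
As a rough illustration (the 1 microsecond latency and 1 GB/sec bandwidth figures are assumed only for the arithmetic): sending 1000 separate 8-byte messages costs roughly 1000 x (1 us + 0.008 us) ≈ 1 ms, almost all of it latency, while one aggregated 8000-byte message costs roughly 1 us + 8 us ≈ 9 us. Packing the small messages into one larger message raises the effective bandwidth by about two orders of magnitude.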
Factors to Consider (3)
 Visibility of communications
– With the Message Passing Model, communications are
explicit and generally quite visible and under the control of
the programmer.
– With the Data Parallel Model, communications often occur
transparently to the programmer, particularly on distributed
memory architectures. The programmer may not even be
able to know exactly how inter-task communications are
being accomplished.
Factors to Consider (4)
 Synchronous vs. asynchronous communications
– Synchronous communications require some type of "handshaking"
between tasks that are sharing data. This can be explicitly
structured in code by the programmer, or it may happen at a lower
level unknown to the programmer.
– Synchronous communications are often referred to as blocking
communications since other work must wait until the
communications have completed.
– Asynchronous communications allow tasks to transfer data independently of one another. For example, task 1 can prepare and send a message to task 2 and then immediately begin doing other work; it does not matter when task 2 actually receives the data.
– Asynchronous communications are often referred to as non-blocking communications since other work can be done while the communications are taking place.
– The ability to overlap computation with communication is the single greatest benefit of using asynchronous communications.
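
A minimal MPI sketch of the difference (illustrative buffer size and ranks, not from the slides): the non-blocking MPI_Isend returns immediately so other work can proceed, and the matching MPI_Wait marks the point where the transfer must have completed. The example needs at least two MPI tasks to run.

! Minimal sketch: overlap computation with a non-blocking send.
program nonblocking_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1000
  real :: x(n)
  integer :: rank, ierr, request
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) then
     x = 1.0
     call MPI_Isend(x, n, MPI_REAL, 1, 0, MPI_COMM_WORLD, request, ierr)
     ! ... task 0 can do other useful work here while the data is in flight ...
     call MPI_Wait(request, status, ierr)   ! x must not be reused before this completes
  else if (rank == 1) then
     call MPI_Recv(x, n, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if

  call MPI_Finalize(ierr)
end program nonblocking_sketch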
Factors to Consider (5)
 Scope of communications
– Knowing which tasks must communicate with each other is
critical during the design stage of a parallel code. Both scopings described below can be implemented synchronously or asynchronously.
– Point-to-point - involves two tasks with one task acting as
the sender/producer of data, and the other acting as the
receiver/consumer.
– Collective - involves data sharing between more than two
tasks, which are often specified as being members in a
common group, or collective.
Collective Communications
 Examples
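
The examples on the original slide were given as a figure; typical collective operations include broadcast, scatter, gather, and reduction. Below is a minimal MPI sketch (illustrative values) of a broadcast from one task to all tasks, followed by a sum reduction back to it.

! Minimal sketch: two collectives over every task in MPI_COMM_WORLD.
program collective_sketch
  use mpi
  implicit none
  integer :: rank, ierr
  real :: factor, mine, total

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) factor = 2.5
  call MPI_Bcast(factor, 1, MPI_REAL, 0, MPI_COMM_WORLD, ierr)   ! all tasks receive factor
  mine = factor * real(rank)                                     ! each task's contribution
  call MPI_Reduce(mine, total, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'sum over tasks =', total

  call MPI_Finalize(ierr)
end program collective_sketch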
Factors to Consider (6)
 Efficiency of communications
– Very often, the programmer will have a choice with regard to
factors that can affect communications performance. Only a
few are mentioned here.
– Which implementation for a given model should be used?
Using the Message Passing Model as an example, one MPI
implementation may be faster on a given hardware platform
than another.
– What type of communication operations should be used? As
mentioned previously, asynchronous communication
operations can improve overall program performance.
– Network media - some platforms may offer more than one
network for communications. Which one is best?
Factors to Consider (7)
 Overhead and Complexity
Factors to Consider (8)
 Finally, realize that this is only a partial list of
things to consider!!!
Synchronization
Types of Synchronization
 Barrier
– Usually implies that all tasks are involved
– Each task performs its work until it reaches the barrier. It then stops, or "blocks".
– When the last task reaches the barrier, all tasks are synchronized.
– What happens from here varies. Often, a serial section of work must be done. In other
cases, the tasks are automatically released to continue their work.
 Lock / semaphore
– Can involve any number of tasks
– Typically used to serialize (protect) access to global data or a section of code. Only
one task at a time may use (own) the lock / semaphore / flag.
– The first task to acquire the lock "sets" it. This task can then safely (serially) access
the protected data or code.
– Other tasks can attempt to acquire the lock but must wait until the task that owns the
lock releases it.
– Can be blocking or non-blocking
 Synchronous communication operations
– Involves only those tasks executing a communication operation
– When a task performs a communication operation, some form of coordination is
required with the other task(s) participating in the communication. For example, before
a task can perform a send operation, it must first receive an acknowledgment from the
receiving task that it is OK to send.
– Discussed previously in the Communications section.
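
In directive form, the barrier and lock mechanisms above might look like the following OpenMP sketch (the shared counter is purely illustrative); in a message-passing code, the barrier equivalent would be a call such as MPI_Barrier(MPI_COMM_WORLD, ierr).

! Minimal sketch: a critical section serializes updates to shared data
! (lock-like behavior); the barrier holds every thread until all arrive.
program sync_sketch
  implicit none
  integer :: total
  total = 0

  !$omp parallel
     !$omp critical               ! only one thread at a time in here
     total = total + 1
     !$omp end critical

     !$omp barrier                ! no thread continues until all have counted
     ! ... work that requires the counting step to be finished everywhere ...
  !$omp end parallel

  print *, 'threads counted:', total
end program sync_sketch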
Data Dependencies
Definitions
 A dependence exists between program statements
when the order of statement execution affects the
results of the program.
 A data dependence results from multiple use of the
same location(s) in storage by different tasks.
 Dependencies are important to parallel programming
because they are one of the primary inhibitors to
parallelism.
Examples (1): Loop carried data dependence
 The value of A(J-1) must be computed before the
value of A(J), therefore A(J) exhibits a data
dependency on A(J-1). Parallelism is inhibited.
 If task 2 has A(J) and task 1 has A(J-1), computing the correct value of A(J) necessitates:
– Distributed memory architecture - task 2 must obtain the
value of A(J-1) from task 1 after task 1 finishes its
computation
– Shared memory architecture - task 2 must read A(J-1) after
task 1 updates it
DO 500 J = MYSTART,MYEND
   A(J) = A(J-1) * 2.0
500 CONTINUE
Examples (2): Loop independent data dependence
 As with the previous example, parallelism is inhibited. The value
of Y is dependent on:
– Distributed memory architecture - if or when the value of X is
communicated between the tasks.
– Shared memory architecture - which task last stores the value of X.
 Although all data dependencies are important to identify when
designing parallel programs, loop carried dependencies are
particularly important since loops are possibly the most common
target of parallelization efforts.
task 1           task 2
------           ------
X = 2            X = 4
.                .
.                .
Y = X**2         Y = X**3
How to Handle Data Dependencies?
 Distributed memory architectures - communicate
required data at synchronization points.
 Shared memory architectures - synchronize read/write operations between tasks.
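
As a sketch of the distributed-memory case (hypothetical ranks, bounds, and array names, echoing Example (1) above): the task that owns the needed boundary value sends it at a synchronization point, and the next task does not start its portion of the loop until that value has arrived. Note that the dependence itself does not go away; the tasks effectively form a pipeline.

! Minimal sketch: communicate the required A(J-1) value at a
! synchronization point before each task runs its part of the loop.
program dependence_sketch
  use mpi
  implicit none
  integer, parameter :: nlocal = 4
  real :: a(0:nlocal)            ! a(0) holds the previous task's last value
  integer :: rank, nprocs, j, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  a = 1.0
  if (rank > 0) then             ! wait for the neighbor's final value
     call MPI_Recv(a(0), 1, MPI_REAL, rank-1, 0, MPI_COMM_WORLD, status, ierr)
  end if
  do j = 1, nlocal
     a(j) = a(j-1) * 2.0         ! same dependence as the DO 500 loop
  end do
  if (rank < nprocs-1) then      ! pass my last value to the next task
     call MPI_Send(a(nlocal), 1, MPI_REAL, rank+1, 0, MPI_COMM_WORLD, ierr)
  end if

  call MPI_Finalize(ierr)
end program dependence_sketch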
