Introduction to High Performance Computing
Agenda
 Automatic vs. Manual Parallelization
 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning
Definition
 Load balancing refers to the practice of distributing work among
tasks so that all tasks are kept busy all of the time. It can be
considered a minimization of task idle time.
 Load balancing is important to parallel programs for
performance reasons. For example, if all tasks are subject to a
barrier synchronization point, the slowest task will determine the
overall performance.
How to Achieve Load Balance? (1)
 Equally partition the work each task receives
– For array/matrix operations where each task performs
similar work, evenly distribute the data set among the tasks.
– For loop iterations where the work done in each iteration is
similar, evenly distribute the iterations across the tasks.
– If a heterogeneous mix of machines with varying performance
characteristics is being used, be sure to use some type of
performance analysis tool to detect any load imbalances, and
adjust the work accordingly.
How to Achieve Load Balance? (2)
 Use dynamic work assignment
– Certain classes of problems result in load imbalances even if data
is evenly distributed among tasks:
 Sparse arrays - some tasks will have actual data to work on while
others have mostly "zeros".
 Adaptive grid methods - some tasks may need to refine their mesh
while others don't.
 N-body simulations - particles may migrate from their original task's
domain to another task's, and the particles owned by some tasks may
require more work than those owned by other tasks.
– When the amount of work each task will perform is intentionally
variable, or cannot be predicted, it may be helpful to use a
scheduler / task-pool approach: as each task finishes its work, it
queues to get a new piece of work (see the sketch after this list).
– It may become necessary to design an algorithm which detects and
handles load imbalances as they occur dynamically within the code.
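As an illustration (not from the original slides), dynamic work assignment can be as simple as letting the runtime hand out loop iterations on demand. A minimal Fortran sketch using OpenMP's dynamic schedule, where work() is a hypothetical stand-in for a task whose cost varies per iteration:

   program dynamic_schedule
     implicit none
     integer, parameter :: n = 1000
     integer :: i
     real    :: vals(n)

     ! Iterations are handed out in chunks of 4 as threads become free,
     ! so threads that draw cheap iterations simply take more of them.
     !$omp parallel do schedule(dynamic, 4)
     do i = 1, n
        vals(i) = work(i)
     end do
     !$omp end parallel do

     print *, sum(vals)

   contains
     real function work(i)            ! hypothetical task with uneven cost
       integer, intent(in) :: i
       integer :: j
       work = 0.0
       do j = 1, mod(i, 100) * 1000   ! amount of work varies with i
          work = work + sin(real(j))
       end do
     end function work
   end program dynamic_schedule

A message-passing code achieves the same effect with a master/worker task pool, as in the Pool of Tasks example later in these slides.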
Definitions
 Computation / Communication Ratio:
– In parallel computing, granularity is a qualitative measure of
the ratio of computation to communication.
– Periods of computation are typically separated from periods
of communication by synchronization events.
 Fine grain parallelism
 Coarse grain parallelism
Fine-grain Parallelism
 Relatively small amounts of computational work
are done between communication events
 Low computation to communication ratio
 Facilitates load balancing
 Implies high communication overhead and less
opportunity for performance enhancement
 If granularity is too fine it is possible that the
overhead required for communications and
synchronization between tasks takes longer
than the computation.
Coarse-grain Parallelism
 Relatively large amounts of
computational work are done between
communication/synchronization events
 High computation to communication
ratio
 Implies more opportunity for
performance increase
 Harder to load balance efficiently
Which is Best?
 The most efficient granularity is dependent on the
algorithm and the hardware environment in which it
runs.
 In most cases the overhead associated with
communications and synchronization is high relative
to execution speed so it is advantageous to have
coarse granularity.
 Fine-grain parallelism can help reduce overheads
due to load imbalance.
The Bad News
 I/O operations are generally regarded as inhibitors to
parallelism
 Parallel I/O systems are immature or not available for
all platforms
 In an environment where all tasks see the same filespace, write
operations to the same file can result in tasks overwriting one
another's data
 Read operations will be affected by the fileserver's
ability to handle multiple read requests at the same
time
 I/O that must be conducted over the network (NFS,
non-local) can cause severe bottlenecks
The Good News
 Some parallel file systems are available. For example:
– GPFS: General Parallel File System for AIX (IBM)
– Lustre: for Linux clusters (Cluster File Systems, Inc.)
– PVFS/PVFS2: Parallel Virtual File System for Linux clusters
(Clemson/Argonne/Ohio State/others)
– PanFS: Panasas ActiveScale File System for Linux clusters
(Panasas, Inc.)
– HP SFS: HP StorageWorks Scalable File Share, a Lustre-based
parallel file system (a global file system for Linux) product from HP
 The parallel I/O programming interface specification for MPI has
been available since MPI-2 (1997). Vendor and "free"
implementations are now commonly available.
Some Options
 If you have access to a parallel file system, investigate using it.
If you don't, keep reading...
 Rule #1: Reduce overall I/O as much as possible
 Confine I/O to specific serial portions of the job, and then use
parallel communications to distribute data to parallel tasks. For
example, Task 1 could read an input file and then communicate
required data to other tasks. Likewise, Task 1 could perform the
write operation after receiving the required data from all other
tasks (see the sketch after this list).
 For distributed memory systems with shared filespace, perform
I/O in local, non-shared filespace. For example, each processor
may have local /tmp filespace which can be used. This is usually much
more efficient than performing I/O over the network to one's
home directory.
 Create unique filenames for each task's input/output file(s)
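As a sketch of the "one task reads, then distributes" pattern above (not from the original slides; the file name params.dat and its single unformatted record are assumptions), in Fortran with MPI:

   program serial_io_then_broadcast
     use mpi
     implicit none
     integer, parameter :: n = 1024
     real    :: params(n)
     integer :: rank, ierr

     call MPI_INIT(ierr)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

     ! Only one task touches the file system...
     if (rank == 0) then
        open(unit=10, file='params.dat', form='unformatted', status='old')
        read(10) params
        close(10)
     end if

     ! ...then parallel communication distributes the data to all tasks.
     call MPI_BCAST(params, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

     ! ... each task now works with its own copy of params ...

     call MPI_FINALIZE(ierr)
   end program serial_io_then_broadcast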
Amdahl's Law
Amdahl's Law states that potential program speedup is defined by the
fraction of code (P) that can be parallelized:
speedup = 1 / (1 - P)
 If none of the code can be parallelized, P = 0 and the
speedup = 1 (no speedup). If all of the code is
parallelized, P = 1 and the speedup is infinite (in
theory).
 If 50% of the code can be parallelized, maximum
speedup = 2, meaning the code will run twice as fast.
Amdahl's Law
 Introducing the number of processors performing the
parallel fraction of work, the relationship can be
modeled by
speedup = 1 / (P/N + S)
 where P = parallel fraction, N = number of processors
and S = serial fraction (1 - P).
Amdahl's Law
 It soon becomes obvious that there are limits to the
scalability of parallelism. For example, at P = .50, .90
and .99 (50%, 90% and 99% of the code is
parallelizable):

                        speedup
         -------------------------------------
         N        P = .50   P = .90   P = .99
         10         1.82      5.26      9.17
         100        1.98      9.17     50.25
         1000       1.99      9.91     90.99
         10000      1.99      9.99     99.02
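The table values follow directly from the formula above; an illustrative Fortran snippet (not from the original slides) that reproduces them:

   program amdahl_table
     implicit none
     integer, parameter :: nprocs(4) = (/ 10, 100, 1000, 10000 /)
     real,    parameter :: p(3)      = (/ 0.50, 0.90, 0.99 /)
     real    :: s, speedup
     integer :: i, j

     print '(a10,3a10)', 'N', 'P=.50', 'P=.90', 'P=.99'
     do i = 1, size(nprocs)
        write (*, '(i10)', advance='no') nprocs(i)
        do j = 1, size(p)
           s = 1.0 - p(j)                           ! serial fraction
           speedup = 1.0 / (p(j)/nprocs(i) + s)     ! Amdahl's Law
           write (*, '(f10.2)', advance='no') speedup
        end do
        write (*, *)
     end do
   end program amdahl_table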
Amdahl's Law
 However, certain problems demonstrate increased performance
by increasing the problem size. For example:
– 2D Grid Calculations: 85 seconds (85%)
– Serial fraction: 15 seconds (15%)
 We can increase the problem size by doubling the grid
dimensions and halving the time step. This results in four times
the number of grid points and twice the number of time steps, so
the parallel portion grows by a factor of 8 (85 seconds x 8 = 680
seconds) while the serial fraction stays fixed. The timings then
look like:
– 2D Grid Calculations: 680 seconds (97.84%)
– Serial fraction: 15 seconds (2.16%)
 Problems that increase the percentage of parallel time with their
size are more scalable than problems with a fixed percentage
of parallel time.
Complexity
 In general, parallel applications are much more complex than
corresponding serial applications, perhaps an order of
magnitude. Not only do you have multiple instruction streams
executing at the same time, but you also have data flowing
between them.
 The costs of complexity are measured in programmer time in
virtually every aspect of the software development cycle:
– Design
– Coding
– Debugging
– Tuning
– Maintenance
 Adhering to "good" software development practices is essential
when working with parallel applications - especially if
somebody besides you will have to work with the software.
Portability
 Thanks to standardization in several APIs, such as MPI, POSIX
threads, HPF and OpenMP, portability issues with parallel
programs are not as serious as in years past. However...
 All of the usual portability issues associated with serial
programs apply to parallel programs. For example, if you use
vendor "enhancements" to Fortran, C or C++, portability will be
a problem.
 Even though standards exist for several APIs, implementations
will differ in a number of details, sometimes to the point of
requiring code modifications in order to effect portability.
 Operating systems can play a key role in code portability issues.
 Hardware architectures are characteristically highly variable and
can affect portability.
Resource Requirements
 The primary intent of parallel programming is to decrease
execution wall clock time; however, accomplishing this requires
more total CPU time. For example, a parallel code that runs in
1 hour on 8 processors actually uses 8 hours of CPU time.
 The amount of memory required can be greater for parallel
codes than serial codes, due to the need to replicate data and
for overheads associated with parallel support libraries and
subsystems.
 For short running parallel programs, there can actually be a
decrease in performance compared to a similar serial
implementation. The overhead costs associated with setting up
the parallel environment, task creation, communications and
task termination can comprise a significant portion of the total
execution time for short runs.
Scalability
 The ability of a parallel program's performance to scale is a
result of a number of interrelated factors. Simply adding more
machines is rarely the answer.
 The algorithm may have inherent limits to scalability. At some
point, adding more resources causes performance to decrease.
Most parallel solutions demonstrate this characteristic at some
point.
 Hardware factors play a significant role in scalability. Examples:
– Memory-CPU bus bandwidth on an SMP machine
– Communications network bandwidth
– Amount of memory available on any given machine or set of
machines
– Processor clock speed
 Parallel support libraries and subsystems software can limit
scalability independent of your application.
Performance Analysis and Tuning
 As with debugging, monitoring and analyzing parallel
program execution is significantly more challenging than
for serial programs.
 A number of parallel tools for execution monitoring
and program analysis are available.
 Some are quite useful, and some are also cross-platform.
 One starting point:
Performance Analysis Tools Tutorial
 Work remains to be done, particularly in the area of
scalability.
Parallel Examples
Array Processing
 This example demonstrates calculations on 2-dimensional array
elements, with the computation on each array element being
independent of the other array elements.
 The serial program calculates one element at a time in
sequential order.
 Serial code could be of the form:
do j = 1,n
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
 The calculation of each element is independent of the others -
this leads to an embarrassingly parallel situation.
 The problem should be computationally intensive.
Array Processing Solution 1
 Array elements are distributed so that each processor owns a portion of the
array (a subarray).
 Independent calculation of array elements ensures there is no need for
communication between tasks.
 Distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1)
through the subarrays. Unit stride maximizes cache/memory usage.
 Since it is desirable to have unit stride through the subarrays, the choice of a
distribution scheme depends on the programming language. See the
Block - Cyclic Distributions Diagram for the options.
 After the array is distributed, each task executes the portion of the loop
corresponding to the data it owns. For example, with Fortran block distribution:
do j = mystart, myend
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
 Notice that only the outer loop bounds differ from the serial solution (a sketch of how they might be computed follows).
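For illustration (not part of the original slides), one way to compute mystart and myend for a block distribution of the n columns over ntasks tasks, with the last task absorbing any remainder; taskid and ntasks are hypothetical names that a real code would obtain from the runtime:

   program block_bounds
     implicit none
     integer, parameter :: n = 10
     integer :: ntasks, taskid, chunk, mystart, myend

     ntasks = 4                        ! in a real code this comes from the runtime
     do taskid = 0, ntasks - 1         ! show the bounds each task would compute
        chunk   = n / ntasks
        mystart = taskid * chunk + 1
        myend   = mystart + chunk - 1
        if (taskid == ntasks - 1) myend = n   ! last task absorbs the remainder
        print '(a,i2,a,i3,a,i3)', 'task ', taskid, ': columns ', mystart, ' to ', myend
     end do
   end program block_bounds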
Array Processing Solution 1
One possible implementation
 Implement as SPMD model.
 Master process initializes array, sends info to worker
processes and receives results.
 Worker process receives info, performs its share of
computation and sends results to master.
 Using the Fortran (column-major) storage scheme, perform a
block distribution of the array by columns.
 Pseudo code solution: a sketch of one possible implementation
follows.
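The sketch below uses Fortran + MPI and is illustrative only: the fcn shown, the n = 8 array size, and the tags are assumptions; it assumes at least two MPI tasks and that the number of workers divides n evenly; error handling is omitted.

   program array_solution_1
     use mpi
     implicit none
     integer, parameter :: n = 8              ! small array for illustration
     real    :: a(n, n)
     integer :: rank, ntasks, nworkers, ierr
     integer :: w, chunk, mystart, myend, i, j
     integer :: status(MPI_STATUS_SIZE)

     call MPI_INIT(ierr)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
     call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)
     nworkers = ntasks - 1                    ! run with at least 2 MPI tasks
     chunk    = n / nworkers                  ! assume nworkers divides n evenly

     if (rank == 0) then                      ! ----- master -----
        a = 0.0                               ! initialize the array
        do w = 1, nworkers                    ! send each worker its starting column
           mystart = (w - 1) * chunk + 1
           call MPI_SEND(mystart, 1, MPI_INTEGER, w, 1, MPI_COMM_WORLD, ierr)
        end do
        do w = 1, nworkers                    ! receive each worker's block of columns
           mystart = (w - 1) * chunk + 1
           call MPI_RECV(a(1, mystart), n*chunk, MPI_REAL, w, 2, &
                         MPI_COMM_WORLD, status, ierr)
        end do
        print *, 'sum of a =', sum(a)
     else                                     ! ----- worker -----
        call MPI_RECV(mystart, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, status, ierr)
        myend = mystart + chunk - 1
        do j = mystart, myend                 ! same loops as the serial code,
           do i = 1, n                        ! restricted to the columns this task owns
              a(i, j) = fcn(i, j)
           end do
        end do
        call MPI_SEND(a(1, mystart), n*chunk, MPI_REAL, 0, 2, MPI_COMM_WORLD, ierr)
     end if

     call MPI_FINALIZE(ierr)

   contains
     real function fcn(i, j)                  ! hypothetical per-element computation
       integer, intent(in) :: i, j
       fcn = real(i) * real(j)
     end function fcn
   end program array_solution_1

Only the master holds the full array; each worker computes and returns just the block of columns it was assigned.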
Array Processing Solution 2: Pool of Tasks
 The previous array solution demonstrated static load
balancing:
– Each task has a fixed amount of work to do
– There may be significant idle time for faster or more lightly loaded
processors - the slowest task determines overall performance.
 Static load balancing is not usually a major concern if
all tasks are performing the same amount of work on
identical machines.
 If you have a load balance problem (some tasks work
faster than others), you may benefit by using a "pool
of tasks" scheme.
Array Processing Solution 2
Pool of Tasks Scheme
 Two kinds of processes are employed:
 Master Process:
– Holds pool of tasks for worker processes to do
– Sends worker a task when requested
– Collects results from workers
 Worker Process: repeatedly does the following
– Gets task from master process
– Performs computation
– Sends results to master
 Worker processes do not know before runtime which portion of
array they will handle or how many tasks they will perform.
 Dynamic load balancing occurs at run time: the faster tasks will
get more work to do.
 Pseudo code solution: a sketch of one possible implementation follows.
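The sketch below uses Fortran + MPI and is illustrative only: one "task" is one column of the array, WORKTAG/STOPTAG and fcn are assumed names, it assumes at least one worker and no more workers than columns, and error handling is omitted. The master signals termination via the tag of its final message to each worker.

   program pool_of_tasks
     use mpi
     implicit none
     integer, parameter :: n = 8              ! one "task" = one column of an n x n array
     integer, parameter :: WORKTAG = 1, STOPTAG = 2
     real    :: a(n, n), col(n)
     integer :: rank, ntasks, ierr, i, j, nextcol, ndone, src
     integer :: status(MPI_STATUS_SIZE)

     call MPI_INIT(ierr)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
     call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)

     if (rank == 0) then                      ! ----- master: holds the pool of tasks -----
        nextcol = 1
        do src = 1, ntasks - 1                ! prime every worker with a first column
           call MPI_SEND(nextcol, 1, MPI_INTEGER, src, WORKTAG, MPI_COMM_WORLD, ierr)
           nextcol = nextcol + 1
        end do
        do ndone = 1, n                       ! collect results as they arrive
           call MPI_RECV(col, n, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                         MPI_COMM_WORLD, status, ierr)
           src = status(MPI_SOURCE)
           j   = status(MPI_TAG)              ! worker tags its result with the column index
           a(:, j) = col
           if (nextcol <= n) then             ! hand that worker the next task, if any remain
              call MPI_SEND(nextcol, 1, MPI_INTEGER, src, WORKTAG, MPI_COMM_WORLD, ierr)
              nextcol = nextcol + 1
           else                               ! otherwise tell it to stop
              call MPI_SEND(nextcol, 1, MPI_INTEGER, src, STOPTAG, MPI_COMM_WORLD, ierr)
           end if
        end do
        print *, 'sum of a =', sum(a)
     else                                     ! ----- worker: repeat until told to stop -----
        do
           call MPI_RECV(j, 1, MPI_INTEGER, 0, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
           if (status(MPI_TAG) == STOPTAG) exit
           do i = 1, n
              col(i) = fcn(i, j)              ! compute the assigned column
           end do
           call MPI_SEND(col, n, MPI_REAL, 0, j, MPI_COMM_WORLD, ierr)
        end do
     end if

     call MPI_FINALIZE(ierr)

   contains
     real function fcn(i, j)                  ! hypothetical per-element computation
       integer, intent(in) :: i, j
       fcn = real(i) * real(j)
     end function fcn
   end program pool_of_tasks

Because each worker requests more work only when it finishes, faster workers automatically process more columns - the dynamic load balancing described above.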