Introduction to High Performance Computing
Agenda
 Automatic vs. Manual Parallelization
 Understand the Problem and the Program
 Partitioning
 Communications
 Synchronization
 Data Dependencies
 Load Balancing
 Granularity
 I/O
 Limits and Costs of Parallel Programming
 Performance Analysis and Tuning
Definition
 Load balancing refers to the practice of distributing work among
tasks so that all tasks are kept busy all of the time. It can be
considered a minimization of task idle time.
 Load balancing is important to parallel programs for
performance reasons. For example, if all tasks are subject to a
barrier synchronization point, the slowest task will determine the
overall performance.
How to Achieve Load Balance? (1)
 Equally partition the work each task receives
– For array/matrix operations where each task performs
similar work, evenly distribute the data set among the tasks.
– For loop iterations where the work done in each iteration is
similar, evenly distribute the iterations across the tasks.
– If a heterogeneous mix of machines with varying performance
characteristics is being used, be sure to use some type of
performance analysis tool to detect any load imbalances, and
adjust the work accordingly.
How to Achieve Load Balance? (2)
 Use dynamic work assignment
– Certain classes of problems result in load imbalances even if data
is evenly distributed among tasks:
 Sparse arrays - some tasks will have actual data to work on while
others have mostly "zeros".
 Adaptive grid methods - some tasks may need to refine their mesh
while others don't.
 N-body simulations - particles may migrate from their original task's
domain to another task's, and the particles owned by some tasks may
require more work than those owned by other tasks.
– When the amount of work each task will perform is intentionally
variable, or cannot be predicted, it may be helpful to use a
scheduler / task-pool approach: as each task finishes its work, it
queues to get a new piece of work (see the sketch after this list).
– It may become necessary to design an algorithm which detects and
handles load imbalances as they occur dynamically within the code.
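As an illustration (not from the original slides), dynamic work assignment can be as simple as letting the runtime hand out loop iterations on demand. A minimal Fortran sketch using OpenMP's dynamic schedule, where work() is a hypothetical stand-in for a task whose cost varies per iteration:

   program dynamic_schedule
     implicit none
     integer, parameter :: n = 1000
     integer :: i
     real    :: vals(n)

     ! Iterations are handed out in chunks of 4 as threads become free,
     ! so threads that draw cheap iterations simply take more of them.
     !$omp parallel do schedule(dynamic, 4)
     do i = 1, n
        vals(i) = work(i)
     end do
     !$omp end parallel do

     print *, sum(vals)

   contains
     real function work(i)            ! hypothetical task with uneven cost
       integer, intent(in) :: i
       integer :: j
       work = 0.0
       do j = 1, mod(i, 100) * 1000   ! amount of work varies with i
          work = work + sin(real(j))
       end do
     end function work
   end program dynamic_schedule

A message-passing code achieves the same effect with a master/worker task pool, as in the Pool of Tasks example later in these slides.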
Definitions
 Computation / Communication Ratio:
– In parallel computing, granularity is a qualitative measure of
the ratio of computation to communication.
– Periods of computation are typically separated from periods
of communication by synchronization events.
 Fine grain parallelism
 Coarse grain parallelism
Fine-grain Parallelism
 Relatively small amounts of computational work
are done between communication events
 Low computation to communication ratio
 Facilitates load balancing
 Implies high communication overhead and less
opportunity for performance enhancement
 If granularity is too fine it is possible that the
overhead required for communications and
synchronization between tasks takes longer
than the computation.
Coarse-grain Parallelism
 Relatively large amounts of
computational work are done between
communication/synchronization events
 High computation to communication
ratio
 Implies more opportunity for
performance increase
 Harder to load balance efficiently
Which is Best?
 The most efficient granularity is dependent on the
algorithm and the hardware environment in which it
runs.
 In most cases the overhead associated with
communications and synchronization is high relative
to execution speed so it is advantageous to have
coarse granularity.
 Fine-grain parallelism can help reduce overheads
due to load imbalance.
The Bad News
 I/O operations are generally regarded as inhibitors to
parallelism
 Parallel I/O systems are immature or not available for
all platforms
 In an environment where all tasks see the same filespace, write
operations to the same file can result in tasks overwriting one
another's data
 Read operations will be affected by the fileserver's
ability to handle multiple read requests at the same
time
 I/O that must be conducted over the network (NFS,
non-local) can cause severe bottlenecks
The Good News
 Some parallel file systems are available. For example:
– GPFS: General Parallel File System for AIX (IBM)
– Lustre: for Linux clusters (Cluster File Systems, Inc.)
– PVFS/PVFS2: Parallel Virtual File System for Linux clusters
(Clemson/Argonne/Ohio State/others)
– PanFS: Panasas ActiveScale File System for Linux clusters
(Panasas, Inc.)
– HP SFS: HP StorageWorks Scalable File Share, a Lustre-based
parallel file system (a global file system for Linux) product from HP
 The parallel I/O programming interface specification for MPI has
been available since MPI-2 (1997). Vendor and "free"
implementations are now commonly available.
Some Options
 If you have access to a parallel file system, investigate using it.
If you don't, keep reading...
 Rule #1: Reduce overall I/O as much as possible
 Confine I/O to specific serial portions of the job, and then use
parallel communications to distribute data to parallel tasks. For
example, Task 1 could read an input file and then communicate
required data to other tasks. Likewise, Task 1 could perform the
write operation after receiving the required data from all other
tasks (see the sketch after this list).
 For distributed memory systems with shared filespace, perform
I/O in local, non-shared filespace. For example, each processor
may have local /tmp filespace which can be used. This is usually much
more efficient than performing I/O over the network to one's
home directory.
 Create unique filenames for each task's input/output file(s)
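As a sketch of the "one task reads, then distributes" pattern above (not from the original slides; the file name params.dat and its single unformatted record are assumptions), in Fortran with MPI:

   program serial_io_then_broadcast
     use mpi
     implicit none
     integer, parameter :: n = 1024
     real    :: params(n)
     integer :: rank, ierr

     call MPI_INIT(ierr)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

     ! Only one task touches the file system...
     if (rank == 0) then
        open(unit=10, file='params.dat', form='unformatted', status='old')
        read(10) params
        close(10)
     end if

     ! ...then parallel communication distributes the data to all tasks.
     call MPI_BCAST(params, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

     ! ... each task now works with its own copy of params ...

     call MPI_FINALIZE(ierr)
   end program serial_io_then_broadcast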
Amdahl's Law
Amdahl's Law states that potential program speedup is defined by the
fraction of code (P) that can be parallelized:
speedup = 1 / (1 - P)
 If none of the code can be parallelized, P = 0 and the
speedup = 1 (no speedup). If all of the code is
parallelized, P = 1 and the speedup is infinite (in
theory).
 If 50% of the code can be parallelized, maximum
speedup = 2, meaning the code will run twice as fast.
Amdahl's Law
 Introducing the number of processors performing the
parallel fraction of work, the relationship can be
modeled by
speedup = 1 / (P/N + S)
 where P = parallel fraction, N = number of processors
and S = serial fraction (1 - P).
Amdahl's Law
 It soon becomes obvious that there are limits to the
scalability of parallelism. For example, at P = .50, .90
and .99 (50%, 90% and 99% of the code is
parallelizable):

                        speedup
         -------------------------------------
         N        P = .50   P = .90   P = .99
         10         1.82      5.26      9.17
         100        1.98      9.17     50.25
         1000       1.99      9.91     90.99
         10000      1.99      9.99     99.02
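The table values follow directly from the formula above; an illustrative Fortran snippet (not from the original slides) that reproduces them:

   program amdahl_table
     implicit none
     integer, parameter :: nprocs(4) = (/ 10, 100, 1000, 10000 /)
     real,    parameter :: p(3)      = (/ 0.50, 0.90, 0.99 /)
     real    :: s, speedup
     integer :: i, j

     print '(a10,3a10)', 'N', 'P=.50', 'P=.90', 'P=.99'
     do i = 1, size(nprocs)
        write (*, '(i10)', advance='no') nprocs(i)
        do j = 1, size(p)
           s = 1.0 - p(j)                           ! serial fraction
           speedup = 1.0 / (p(j)/nprocs(i) + s)     ! Amdahl's Law
           write (*, '(f10.2)', advance='no') speedup
        end do
        write (*, *)
     end do
   end program amdahl_table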
Amdahl's Law
 However, certain problems demonstrate increased performance
by increasing the problem size. For example:
– 2D Grid Calculations: 85 seconds (85%)
– Serial fraction: 15 seconds (15%)
 We can increase the problem size by doubling the grid
dimensions and halving the time step. This results in four times
the number of grid points and twice the number of time steps, so
the parallel portion grows by a factor of 8 (85 seconds x 8 = 680
seconds) while the serial fraction stays fixed. The timings then
look like:
– 2D Grid Calculations: 680 seconds (97.84%)
– Serial fraction: 15 seconds (2.16%)
 Problems that increase the percentage of parallel time with their
size are more scalable than problems with a fixed percentage
of parallel time.
Complexity
 In general, parallel applications are much more complex than
corresponding serial applications, perhaps an order of
magnitude. Not only do you have multiple instruction streams
executing at the same time, but you also have data flowing
between them.
 The costs of complexity are measured in programmer time in
virtually every aspect of the software development cycle:
– Design
– Coding
– Debugging
– Tuning
– Maintenance
 Adhering to "good" software development practices is essential
when working with parallel applications - especially if
somebody besides you will have to work with the software.
Portability
 Thanks to standardization in several APIs, such as MPI, POSIX
threads, HPF and OpenMP, portability issues with parallel
programs are not as serious as in years past. However...
 All of the usual portability issues associated with serial
programs apply to parallel programs. For example, if you use
vendor "enhancements" to Fortran, C or C++, portability will be
a problem.
 Even though standards exist for several APIs, implementations
will differ in a number of details, sometimes to the point of
requiring code modifications in order to effect portability.
 Operating systems can play a key role in code portability issues.
 Hardware architectures are characteristically highly variable and
can affect portability.
Resource Requirements
 The primary intent of parallel programming is to decrease
execution wall clock time; however, accomplishing this requires
more total CPU time. For example, a parallel code that runs in
1 hour on 8 processors actually uses 8 hours of CPU time.
 The amount of memory required can be greater for parallel
codes than serial codes, due to the need to replicate data and
for overheads associated with parallel support libraries and
subsystems.
 For short running parallel programs, there can actually be a
decrease in performance compared to a similar serial
implementation. The overhead costs associated with setting up
the parallel environment, task creation, communications and
task termination can comprise a significant portion of the total
execution time for short runs.
Scalability
 The ability of a parallel program's performance to scale is a
result of a number of interrelated factors. Simply adding more
machines is rarely the answer.
 The algorithm may have inherent limits to scalability. At some
point, adding more resources causes performance to decrease.
Most parallel solutions demonstrate this characteristic at some
point.
 Hardware factors play a significant role in scalability. Examples:
– Memory-CPU bus bandwidth on an SMP machine
– Communications network bandwidth
– Amount of memory available on any given machine or set of
machines
– Processor clock speed
 Parallel support libraries and subsystems software can limit
scalability independent of your application.
Performance Analysis and Tuning
 As with debugging, monitoring and analyzing parallel
program execution is significantly more challenging than
for serial programs.
 A number of parallel tools for execution monitoring
and program analysis are available.
 Some are quite useful, and some are also cross-platform.
 One starting point:
Performance Analysis Tools Tutorial
 Work remains to be done, particularly in the area of
scalability.
Parallel Examples
Array Processing
 This example demonstrates calculations on 2-dimensional array
elements, with the computation on each array element being
independent of the other array elements.
 The serial program calculates one element at a time in
sequential order.
 Serial code could be of the form:
do j = 1,n
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
 The calculation of each element is independent of the others -
this leads to an embarrassingly parallel situation.
 The problem should be computationally intensive.
Array Processing Solution 1
 Array elements are distributed so that each processor owns a portion of the
array (a subarray).
 Independent calculation of array elements ensures there is no need for
communication between tasks.
 Distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1)
through the subarrays. Unit stride maximizes cache/memory usage.
 Since it is desirable to have unit stride through the subarrays, the choice of a
distribution scheme depends on the programming language. See the
Block - Cyclic Distributions Diagram for the options.
 After the array is distributed, each task executes the portion of the loop
corresponding to the data it owns. For example, with Fortran block distribution:
do j = mystart, myend
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
 Notice that only the outer loop bounds differ from the serial solution (a sketch of how they might be computed follows).
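For illustration (not part of the original slides), one way to compute mystart and myend for a block distribution of the n columns over ntasks tasks, with the last task absorbing any remainder; taskid and ntasks are hypothetical names that a real code would obtain from the runtime:

   program block_bounds
     implicit none
     integer, parameter :: n = 10
     integer :: ntasks, taskid, chunk, mystart, myend

     ntasks = 4                        ! in a real code this comes from the runtime
     do taskid = 0, ntasks - 1         ! show the bounds each task would compute
        chunk   = n / ntasks
        mystart = taskid * chunk + 1
        myend   = mystart + chunk - 1
        if (taskid == ntasks - 1) myend = n   ! last task absorbs the remainder
        print '(a,i2,a,i3,a,i3)', 'task ', taskid, ': columns ', mystart, ' to ', myend
     end do
   end program block_bounds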
Array Processing Solution 1
One possible implementation
 Implement as SPMD model.
 Master process initializes array, sends info to worker
processes and receives results.
 Worker process receives info, performs its share of
computation and sends results to master.
 Using the Fortran (column-major) storage scheme, perform a
block distribution of the array by columns.
 Pseudo code solution: a sketch of one possible implementation
follows.
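The sketch below uses Fortran + MPI and is illustrative only: the fcn shown, the n = 8 array size, and the tags are assumptions; it assumes at least two MPI tasks and that the number of workers divides n evenly; error handling is omitted.

   program array_solution_1
     use mpi
     implicit none
     integer, parameter :: n = 8              ! small array for illustration
     real    :: a(n, n)
     integer :: rank, ntasks, nworkers, ierr
     integer :: w, chunk, mystart, myend, i, j
     integer :: status(MPI_STATUS_SIZE)

     call MPI_INIT(ierr)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
     call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)
     nworkers = ntasks - 1                    ! run with at least 2 MPI tasks
     chunk    = n / nworkers                  ! assume nworkers divides n evenly

     if (rank == 0) then                      ! ----- master -----
        a = 0.0                               ! initialize the array
        do w = 1, nworkers                    ! send each worker its starting column
           mystart = (w - 1) * chunk + 1
           call MPI_SEND(mystart, 1, MPI_INTEGER, w, 1, MPI_COMM_WORLD, ierr)
        end do
        do w = 1, nworkers                    ! receive each worker's block of columns
           mystart = (w - 1) * chunk + 1
           call MPI_RECV(a(1, mystart), n*chunk, MPI_REAL, w, 2, &
                         MPI_COMM_WORLD, status, ierr)
        end do
        print *, 'sum of a =', sum(a)
     else                                     ! ----- worker -----
        call MPI_RECV(mystart, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, status, ierr)
        myend = mystart + chunk - 1
        do j = mystart, myend                 ! same loops as the serial code,
           do i = 1, n                        ! restricted to the columns this task owns
              a(i, j) = fcn(i, j)
           end do
        end do
        call MPI_SEND(a(1, mystart), n*chunk, MPI_REAL, 0, 2, MPI_COMM_WORLD, ierr)
     end if

     call MPI_FINALIZE(ierr)

   contains
     real function fcn(i, j)                  ! hypothetical per-element computation
       integer, intent(in) :: i, j
       fcn = real(i) * real(j)
     end function fcn
   end program array_solution_1

Only the master holds the full array; each worker computes and returns just the block of columns it was assigned.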
Array Processing Solution 2: Pool of Tasks
 The previous array solution demonstrated static load
balancing:
– Each task has a fixed amount of work to do
– There may be significant idle time for faster or more lightly loaded
processors - the slowest task determines overall performance.
 Static load balancing is not usually a major concern if
all tasks are performing the same amount of work on
identical machines.
 If you have a load balance problem (some tasks work
faster than others), you may benefit by using a "pool
of tasks" scheme.
Array Processing Solution 2
Pool of Tasks Scheme
 Two kinds of processes are employed:
 Master Process:
– Holds pool of tasks for worker processes to do
– Sends worker a task when requested
– Collects results from workers
 Worker Process: repeatedly does the following
– Gets task from master process
– Performs computation
– Sends results to master
 Worker processes do not know before runtime which portion of
array they will handle or how many tasks they will perform.
 Dynamic load balancing occurs at run time: the faster tasks will
get more work to do.
 Pseudo code solution: a sketch of one possible implementation follows.
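The sketch below uses Fortran + MPI and is illustrative only: one "task" is one column of the array, WORKTAG/STOPTAG and fcn are assumed names, it assumes at least one worker and no more workers than columns, and error handling is omitted. The master signals termination via the tag of its final message to each worker.

   program pool_of_tasks
     use mpi
     implicit none
     integer, parameter :: n = 8              ! one "task" = one column of an n x n array
     integer, parameter :: WORKTAG = 1, STOPTAG = 2
     real    :: a(n, n), col(n)
     integer :: rank, ntasks, ierr, i, j, nextcol, ndone, src
     integer :: status(MPI_STATUS_SIZE)

     call MPI_INIT(ierr)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
     call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)

     if (rank == 0) then                      ! ----- master: holds the pool of tasks -----
        nextcol = 1
        do src = 1, ntasks - 1                ! prime every worker with a first column
           call MPI_SEND(nextcol, 1, MPI_INTEGER, src, WORKTAG, MPI_COMM_WORLD, ierr)
           nextcol = nextcol + 1
        end do
        do ndone = 1, n                       ! collect results as they arrive
           call MPI_RECV(col, n, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                         MPI_COMM_WORLD, status, ierr)
           src = status(MPI_SOURCE)
           j   = status(MPI_TAG)              ! worker tags its result with the column index
           a(:, j) = col
           if (nextcol <= n) then             ! hand that worker the next task, if any remain
              call MPI_SEND(nextcol, 1, MPI_INTEGER, src, WORKTAG, MPI_COMM_WORLD, ierr)
              nextcol = nextcol + 1
           else                               ! otherwise tell it to stop
              call MPI_SEND(nextcol, 1, MPI_INTEGER, src, STOPTAG, MPI_COMM_WORLD, ierr)
           end if
        end do
        print *, 'sum of a =', sum(a)
     else                                     ! ----- worker: repeat until told to stop -----
        do
           call MPI_RECV(j, 1, MPI_INTEGER, 0, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
           if (status(MPI_TAG) == STOPTAG) exit
           do i = 1, n
              col(i) = fcn(i, j)              ! compute the assigned column
           end do
           call MPI_SEND(col, n, MPI_REAL, 0, j, MPI_COMM_WORLD, ierr)
        end do
     end if

     call MPI_FINALIZE(ierr)

   contains
     real function fcn(i, j)                  ! hypothetical per-element computation
       integer, intent(in) :: i, j
       fcn = real(i) * real(j)
     end function fcn
   end program pool_of_tasks

Because each worker requests more work only when it finishes, faster workers automatically process more columns - the dynamic load balancing described above.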