Parallel and Distributed
  Computing on Low
    Latency Clusters

                Vittorio Giovara
    M. S. Electrical Engineering and Computer Science
              University of Illinois at Chicago
                          May 2009
Contents

•   Motivation
•   Strategy
•   Technologies
      •   OpenMP
      •   MPI
      •   Infiniband
•   Application
      •   Compiler Optimizations
      •   OpenMP and MPI over Infiniband
•   Results
•   Conclusions
Motivation
Motivation

• The scaling trend has to stop for CMOS
  technology:
    ✓ Direct-tunneling limit in SiO2 ~3 nm
    ✓ Distance between Si atoms ~0.3 nm
    ✓ Variability

• Fundamental reason: rising fab cost
Motivation

• Easy to build multi-core processors
• Human effort is required to modify and adapt
  software for concurrency
• New classification for computer
  architectures
Classification
[Figure: Flynn's taxonomy – four quadrants (SISD, SIMD, MISD, MIMD), each drawn as CPUs connected to an instruction pool and a data pool; SISD has a single CPU, SIMD and MIMD have multiple CPUs]
[Figure: algorithm, loop level, and process management plotted against abstraction level and how easily each is parallelized]
Levels
[Figure: issues at each level – recursion, memory management, and profiling at the algorithm level; data dependency, branching overhead, and control flow at the loop level; SMP multiprogramming, multithreading and scheduling at the process-management level]
Backfire

• Difficult to fully exploit the parallelism
  offered
• Automatic tools are required to adapt software
  to parallelism
• Compiler support for manual or semi-automatic
  enhancement
Applications
• OpenMP and MPI are two popular tools
  used to simplify the parallelization of
  both new and old software
• Mathematics and Physics
• Computer Science
• Biomedicine
Specific Problem and Background
• Sally3D is a micromagnetics program suite
  for field analysis and modeling, developed at
  Politecnico di Torino (Department of
  Electrical Engineering)
• Computationally intensive (runs can take
  days of CPU time); a speedup is required
• Previous work does not fully cover the
  problem (no Infiniband or OpenMP+MPI
  solutions)
Strategy
Strategy
• Install a Linux kernel with an ad-hoc
  configuration for scientific computation
• Compile an OpenMP-enabled GCC
  (supported from 4.3.1 onwards)
• Add the Infiniband link between cluster nodes,
  with the proper drivers in kernel and user space
• Select an MPI implementation library
Strategy
• Verify the Infiniband network with some
  MPI test examples
• Install the target software
• Add OpenMP and MPI directives
  to the code
• Run test cases
OpenMP

• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple constructs
OpenMP - example

[Figure: fork-join model – the master thread forks into Thread A and Thread B, which execute Parallel Tasks 1–4 before joining back into the master thread]
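
To make the fork-join picture concrete, here is a minimal C sketch (illustrative only, not code from the presentation) in which each independent task is marked as an OpenMP section; the runtime forks the threads and joins them at the end of the block.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* The master thread forks a team; each section runs as a parallel task. */
        #pragma omp parallel sections
        {
            #pragma omp section
            printf("Task 1 on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("Task 2 on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("Task 3 on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("Task 4 on thread %d\n", omp_get_thread_num());
        }   /* implicit join back into the master thread */
        return 0;
    }

It builds with gcc -fopenmp, matching the OpenMP-enabled GCC (4.3.1 onwards) listed in the strategy.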
OpenMP Scheduler

• Which scheduler is best suited to the hardware?
   - Static
   - Dynamic
   - Guided
  (a timing sketch follows below)
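
A minimal timing sketch (the loop size and body are placeholders, not the Sally3D kernels) shows where the schedule clause and chunk size enter; switching static to dynamic or guided and varying the chunk value reproduces the comparison charted below.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000                     /* placeholder problem size */

    static double a[N];

    int main(void) {
        int i;
        double t0 = omp_get_wtime();

        /* Try schedule(static, c), schedule(dynamic, c) or schedule(guided, c)
           with c = 1, 10, 100, 1000, 10000 as in the charts. */
        #pragma omp parallel for schedule(static, 100)
        for (i = 0; i < N; i++)
            a[i] = i * 0.5;               /* stand-in for the real loop body */

        printf("elapsed: %.0f microseconds\n", (omp_get_wtime() - t0) * 1e6);
        return 0;
    }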
OpenMP Scheduler

[Chart: OpenMP static scheduler – run time in microseconds (0–80000) versus number of threads (1–16), one curve per chunk size: 1, 10, 100, 1000, 10000]
OpenMP Scheduler

[Chart: OpenMP dynamic scheduler – run time in microseconds (0–117000) versus number of threads (1–16), one curve per chunk size: 1, 10, 100, 1000, 10000]
OpenMP Scheduler

[Chart: OpenMP guided scheduler – run time in microseconds (0–80000) versus number of threads (1–16), one curve per chunk size: 1, 10, 100, 1000, 10000]
OpenMP Scheduler

[Figure: side-by-side comparison of the static, dynamic, and guided scheduler charts]
MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
      - OpenMPI
      - MVAPICH
  (a minimal example follows below)
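
A minimal point-to-point sketch in C (the payload is a placeholder); the same source compiles against either OpenMPI or MVAPICH with mpicc and runs with mpirun -np 2.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* rank 0 sends one integer to rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Once the Infiniband drivers and the MPI transport are configured, the same binary can run over the low-latency link.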
Infiniband

• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
MPI over Infiniband
[Chart: MPI over Infiniband, OpenMPI versus Mvapich2 – transfer time on a logarithmic scale from 1 µs to 10,000,000 µs against message size, from 1 kB up to the gigabyte range]
MPI over Infiniband
[Chart: MPI over Infiniband, OpenMPI versus Mvapich2 – transfer time on a logarithmic scale from 1 µs to 10,000,000 µs against message size, from 1 kB up to the megabyte range]
Optimizations

• Active at compile time
• Available only after porting the software to
  standard FORTRAN
• Consistent documentation available
• Unexpected positive results
Optimizations

• -march=native
• -O3
• -ffast-math
• -Wl,-O1
Target Software
Target Software

• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• the program uses a linear formulation of the
  mathematical models
Implementation Scheme
[Figure: implementation scheme – in the standard programming model a sequential loop becomes a parallel loop executed by OpenMP threads; a distributed loop splits the work between Host 1 and Host 2, each running OpenMP threads, with MPI linking the hosts]
Implementation Scheme
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays – synchronization
  objects required
  ➡ send() and recv() mechanism
  ➡ critical regions using OpenMP directives
  ➡ function merging
  ➡ matrix conversion
  (a sketch of the hybrid scheme follows below)
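
A sketch of how these pieces fit together (illustrative only, not the actual Sally3D code; sizes and names are made up): the three-dimensional matrix is split into slabs along one axis, each MPI rank (one per host) updates its slab with an OpenMP loop, a critical region protects a shared temporary, and one boundary plane is exchanged through send()/recv().

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    #define NX 64
    #define NY 64
    #define NZ 64

    static double field[NX][NY][NZ];      /* the three-dimensional matrix */
    static double global_max = 0.0;       /* shared temporary */

    static void update_slab(int x0, int x1) {
        #pragma omp parallel
        {
            double local_max = 0.0;
            int i, j, k;

            /* OpenMP threads share this host's slab of the matrix. */
            #pragma omp for collapse(2)
            for (i = x0; i < x1; i++)
                for (j = 0; j < NY; j++)
                    for (k = 0; k < NZ; k++) {
                        field[i][j][k] *= 0.99;   /* stand-in for the real update */
                        if (field[i][j][k] > local_max)
                            local_max = field[i][j][k];
                    }

            /* Critical region protecting the shared temporary. */
            #pragma omp critical
            if (local_max > global_max)
                global_max = local_max;
        }
    }

    int main(int argc, char **argv) {
        int rank, size, chunk, x0, x1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each host owns a contiguous block of x-planes (distributed loop). */
        chunk = NX / size;
        x0 = rank * chunk;
        x1 = (rank == size - 1) ? NX : x0 + chunk;
        update_slab(x0, x1);

        /* Neighbouring hosts exchange one boundary plane via send()/recv(). */
        if (size > 1) {
            if (rank == 0)
                MPI_Send(&field[x1 - 1][0][0], NY * NZ, MPI_DOUBLE, 1, 0,
                         MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&field[x0 - 1][0][0], NY * NZ, MPI_DOUBLE, 0, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

Compile with mpicc -fopenmp and launch one MPI rank per host, as in the scheme above.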
Results
Results
  OMP    MPI    OPT   seconds
   *      *      *       133
   *      *      -       400
   *      -      *       186
   *      -      -       487
   -      *      *       200
   -      *      -       792
   -      -      *       246
   -      -      -      1062




Total Speed Increase: 87.52%
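(The figure follows from the table: (1062 − 133) / 1062 ≈ 87.5%.)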
Actual Results
                  OMP       MPI      seconds
                   *         *          59
                   *         -         129
                   -         *         174
                   -         -         249




  Function Name   Normal    OpenMP      MPI     OpenMP+MPI
calc_intmudua      24.5 s    4.7 s     14.4 s      2.8 s
calc_hdmg_tet      16.9 s    3.0 s     10.8 s      1.7 s
calc_mudua         12.1 s    1.9 s      7.0 s      1.1 s
campo_effettivo    17.7 s    4.5 s      9.9 s      2.3 s
Actual Results

• OpenMP: 6–8x
• MPI: 2x
• OpenMP + MPI: 14–16x

     Total Raw Speed Increment: 76%
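(The 76% figure is consistent with the table above: (249 − 59) / 249 ≈ 0.76.)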
Conclusions
Conclusions and
        Future Work
• Computational time has been significantly
    decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG ‘09
•   Continue inserting OpenMP and MPI directives
•   Perform algorithm optimizations
•   Increase cluster size
