Parallel and Distributed Computing
Chapter 2: Parallel Programming Platforms
Muhammad Haroon
mr.harunahmad2014@gmail.com
Cell# +92300-7327761
Department of Computer Science
Hitec University
Taxila Cantt
Pakistan
2.1a: Flynn’s Classical Taxonomy
One of the more widely used parallel computer classifications, in use since 1966, is Flynn's Taxonomy.
It distinguishes multiprocessor computers along the two independent dimensions of Instruction and Data.
 SISD: Single instruction stream, Single data stream
 SIMD: Single instruction stream, Multiple data streams
 MISD: Multiple instruction streams, Single data stream
 MIMD: Multiple instruction streams, Multiple data streams
2.1b: SISD Machines
 A serial (non-parallel) computer
 Single instruction: Only one instruction stream is acted on by the CPU during any one clock cycle
 Single data: Only one data stream is used as input during any one clock cycle
 Deterministic execution
 Oldest and most prevalent form of computer
 Examples: Most PCs, single-CPU workstations and mainframes
2.2a: SIMD Machines (I)
 A type of parallel computer
 Single instruction: All processing units execute the same instruction at any given clock cycle
 Multiple data: Each processing unit can operate on a different data element
 It typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units
 Best suited for specialized problems characterized by a high degree of regularity, e.g., image processing
 Two varieties: Processor Arrays and Vector Pipelines
 Examples: Connection Machines, MasPar MP-1, MasPar MP-2; IBM 9000, Cray C90, Fujitsu VP, etc.
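A minimal C sketch of the SIMD idea (an illustration, not from the original slides): one instruction stream applied to many data elements. A vectorizing compiler can map the loop below onto SIMD hardware; the OpenMP simd pragma (honored with, e.g., gcc -O3 -fopenmp-simd, and harmlessly ignored otherwise) makes the intent explicit.

    #include <stdio.h>

    #define N 8

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) {
            a[i] = (float)i;
            b[i] = (float)(2 * i);
        }

        /* Single instruction (add), multiple data: every iteration applies
           the same operation to a different element, so the compiler can
           pack several iterations into one vector instruction. */
        #pragma omp simd
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }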
2.2b: SIMD Machines (II)
2.2c: SIMD Machines (III)
2.2d: Processing Array
Pipelined Processing
 A pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one
 The elements of a pipeline are often executed in parallel or in time-sliced fashion
 A category of techniques that provide simultaneous, or parallel, processing within the computer
 It refers to overlapping operations by moving data or instructions into a conceptual pipe, with all stages of the pipe processing simultaneously
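To make the stage overlap concrete, here is a small toy simulation in C (an illustration, not from the slides): on clock tick t, stage s works on item t − s, so once the pipe is full every stage is busy on a different item, just as in a hardware pipeline.

    #include <stdio.h>

    #define STAGES 3   /* e.g., fetch, compute, store */
    #define ITEMS  5

    int main(void) {
        /* One line per clock tick; the diagonal pattern in the output
           shows all stages working simultaneously on different items. */
        for (int t = 0; t < ITEMS + STAGES - 1; t++) {
            printf("cycle %d:", t);
            for (int s = 0; s < STAGES; s++) {
                int item = t - s;
                if (item >= 0 && item < ITEMS)
                    printf("  stage%d(item%d)", s, item);
            }
            printf("\n");
        }
        return 0;
    }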
2.2f: Pipelined Processing
A six-step pipeline with IEEE arithmetic hardware
Parallelization happens behind the scenes
Not true parallel computers
Assembly Line
 An assembly line is a manufacturing process in which parts are added in sequence as the semi-finished assembly moves from workstation to workstation, until the final assembly is produced
2.2g: Assembly Line
Vector Processor Pipeline
 The term "vector pipeline" was used in the 1970s to describe vector processing at a time when a single vector instruction might (for example) compute the sum of two vectors of floating-point numbers using a single pipelined floating-point arithmetic unit
2.2h: Vector Processor Pipeline
2.3a: MISD Machines (I)
 A single data stream is fed into multiple processing units
 Each processing unit operates on the data independently via independent instruction streams
 Very few actual machines: CMU's C.mmp computer (1971)
 Possible use: multiple frequency filters operating on a single signal stream (sketched below)
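The frequency-filter use case can be sketched as follows (a hypothetical illustration, not from the slides; filter_a and filter_b are stand-ins, not real DSP code). Each sample of the single data stream is handed to several independent instruction streams:

    #include <stdio.h>

    /* Two stand-in "filters" representing independent instruction streams. */
    static double filter_a(double x) { return 0.5 * x; }
    static double filter_b(double x) { return 2.0 * x; }

    int main(void) {
        double signal[] = {1.0, 2.0, 3.0};   /* the single data stream */
        for (int i = 0; i < 3; i++) {
            /* On an MISD machine these two calls would execute on
               separate processing units in the same cycle. */
            printf("sample %.1f -> A: %.2f, B: %.2f\n",
                   signal[i], filter_a(signal[i]), filter_b(signal[i]));
        }
        return 0;
    }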
2.3b: MISD Machines (II)
2.4a: MIMD Machines (I)
 Multiple instruction: Every processor may execute a different instruction stream
 Multiple data: Every processor may work with a different data stream
 Execution can be synchronous or asynchronous, deterministic or non-deterministic
 Examples: most current supercomputers, grids, networked parallel computers, multiprocessor SMP computers
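In practice, MIMD machines are usually programmed in the SPMD style: one program, but each process branches into its own instruction stream on its own data. A minimal MPI sketch (an illustration, not from the slides; compile with mpicc and launch with, e.g., mpirun -np 4 ./a.out):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* This process follows a coordinator instruction stream... */
            printf("rank 0 of %d: coordinating\n", size);
        } else {
            /* ...while every other process runs a worker stream on its
               own data, asynchronously from the rest. */
            printf("rank %d: working on my own data\n", rank);
        }

        MPI_Finalize();
        return 0;
    }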
2.4b: MIMD Machines (II)
2.4c: MIMD Machines (III)
T3E-Cray
 The Cray T3E was Cray Research's second-generation massively parallel supercomputer architecture, launched in late November 1995. ... Like the previous Cray T3D, it was a fully distributed memory machine using a 3D torus topology interconnection network.
2.4d: MIMD Machines (T3E-Cray)
2.5: Shared Memory and Message Passing
 Parallel computers can also be classified according to memory access:
 Shared memory computers
 Message-passing (distributed memory) computers
 Multi-processor computers
 Multi-computers
 Clusters
 Grids
2.6a: Shared Memory Computers
 All processors have access to all memory as a global address space
 Multiple processors can operate independently, but share the same memory resources
 Changes in a memory location effected by one processor are visible to all other processors
 Two classes of shared memory machines: UMA and NUMA (and COMA)
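Shared-memory machines are commonly programmed with a threading model such as OpenMP, where every thread sees the same global address space. A minimal sketch (not from the slides; compile with, e.g., gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];   /* one array, visible to every thread */
        double sum = 0.0;

        /* Threads share 'a' directly -- no messages needed; the reduction
           clause combines each thread's partial sum safely. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 0.5 * i;
            sum += a[i];
        }

        printf("sum = %.1f with up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }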
2.6b: Shared Memory Architecture
2.6c: Uniform Memory Access (UMA)
 Most commonly represented today by Symmetric Multiprocessor (SMP) machines
 Identical processors
 Equal access and access time to memory
 Sometimes called CC-UMA – Cache Coherent UMA
 Cache Coherence: If one processor updates a location in shared memory, all the other processors know about the update. Cache coherence is accomplished at the hardware level.
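The visibility guarantee that hardware cache coherence provides can be seen in the classic flag-passing pattern. A minimal C11/pthreads sketch (an illustration, not from the slides; compile with, e.g., gcc -pthread): one thread writes data and raises a flag, and the coherence protocol propagates both updates to the other thread's view of memory (the atomics supply the ordering the language requires).

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int flag = 0;
    int payload = 0;

    void *producer(void *arg) {
        payload = 123;            /* write the data first...           */
        atomic_store(&flag, 1);   /* ...then publish it via the flag   */
        return NULL;
    }

    void *consumer(void *arg) {
        while (atomic_load(&flag) == 0)
            ;                     /* spin until coherence delivers the update */
        printf("consumer saw payload = %d\n", payload);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }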
2.6d: Symmetric Multiprocessor (SMP)
2.6e: UMA – with and without caches
2.6f: Nonuniform Memory Access (NUMA)
 Often made by physically linking two or more SMPs
 One SMP can directly access the memory of another SMP
 Not all processors have equal access time to all memories
 Memory access across the link is slower
 If cache coherence is maintained, they may also be called CC-NUMA
2.6g: NUMA
2.6h: Hybrid Machine (UMA & NUMA)
2.6i: Advantages of Shared Memory Machines
 Global address space provides a user-friendly programming perspective on memory
 Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
2.6j: Disadvantages of Shared Memory Machines
 Lack of scalability between memory and CPUs: Adding processors can geometrically increase traffic on the shared memory–CPU path and for cache coherence management
 Synchronization constructs (correct access to memory) are the programmer's responsibility (see the sketch below)
 Expensive to design shared memory computers with increasing numbers of processors
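The synchronization burden looks like this in OpenMP (a minimal sketch, not from the slides): without the atomic directive, concurrent increments of the shared counter would race and updates would be lost.

    #include <stdio.h>

    int main(void) {
        long counter = 0;

        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            /* The programmer must mark this update; the hardware shares
               the memory, but correct access is the programmer's job. */
            #pragma omp atomic
            counter++;
        }

        printf("counter = %ld\n", counter);   /* 100000 only if synchronized */
        return 0;
    }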
2.7a: Distributed Memory Computers
 Processors have their own local memory
 A communication network is required to connect inter-processor memory
 Memory addresses in one processor do not map to another processor – there is no concept of a global address space across all processors
 The concept of cache coherence does not apply
 Data are exchanged explicitly through message-passing (see the MPI sketch below)
 Synchronization between tasks is the programmer's responsibility
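Explicit message passing typically means MPI point-to-point calls. A minimal sketch (not from the slides; compile with mpicc and run with at least two processes, e.g., mpirun -np 2 ./a.out): the value exists only in rank 0's local memory until it is explicitly sent.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* lives only in rank 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* No shared address space: the data must arrive as a message. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }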
2.7b: Distributed Memory Architecture
2.7c: Advantages of Distributed Memory Machines
 Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionally
 Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherence
 Cost effectiveness: can use commodity, off-the-shelf processors and networking
2.7d: Disadvantages of Distributed Memory Machines
 Difficult to program: the programmer has to handle data communication between processors
 Nonuniform memory access (NUMA) times
 It may be difficult to map existing data structures, based on global memory, to a distributed memory organization
2.7e: Networked Cluster of Workstations (PCs)
2.7f: Massively Parallel Processing (MPP) Environment
2.7g: Distributed Shared Memory (DSM) Computer
2.7h: Hybrid and Combinations (Very Large Parallel Computing System)
2.8: Architectural Design Tradeoffs

                   Shared memory    Distributed memory
  Programmability  easier           harder
  Scalability      harder           easier
2.9: Parallel Architectural Issues
 Control mechanism: SIMD vs MIMD
 Operation: synchronous vs asynchronous
 Memory organization: private vs shared
 Address space: local vs global
 Memory access: uniform vs nonuniform
 Granularity: power of individual processors – coarse-grained vs fine-grained systems
 Interconnection network topology
2.10a: Beowulf Cluster System
 A cluster of tightly coupled PCs for distributed parallel computation
 Moderate size: normally 16 to 32 PCs
 Promise of a good price/performance ratio
 Use of commodity off-the-shelf (COTS) components (PCs, Linux, MPI)
 Initiated at NASA (Center of Excellence in Space Data and Information Sciences) in 1994 using 16 DX4 processors
 http://www.beowulf.org
2.10b: NASA 128-Processor Beowulf Cluster
2.11a: Sun Enterprise 10000 Server (I)
 Starfire is a shared-address-space machine: 4 to 64 processors, each an UltraSPARC II running at 400 MHz with a peak rate of 800 Mflops
 Each system board has up to 4 CPUs and 4 GB RAM of local memory
 Up to 64 GB memory in total
 16 such boards can be connected via a Gigaplane-XB interconnect
2.11b: Sun Enterprise 10000 Server (II)
 Intra- and inter-board connectivity use different interconnection networks
 Intra-board connectivity is by a split-transaction system bus (Gigaplane bus)
 The global data router is a 16×16 non-blocking crossbar for the data buses, alongside four global address buses
 http://www.sun.com/servers/highend/e10000/
2.11c: Sun Enterprise 10000 Server (III)
2.11d: Sun Enterprise 10000 Server (IV)
2.12a: SGI Origin Servers (I)
 SGI Origin 3000 Servers support up to 512 MIPS R14000 processors in a NUMA cache-coherent shared-address-space configuration
 Each processor operates at 500 MHz with a peak rate of 1.0 gigaflops
 Modular framework with 4 CPUs and 8 GB of RAM (C-Brick)
 Interconnection by crossbar between C-Bricks and R-Bricks (routers)
2.12b: SGI Origin Servers (II)
 Larger configurations are built by connecting multiple C-Bricks via R-Bricks using 6-port or 8-port crossbar switches and metarouters with full-duplex links operating at 1.6 GB/s
 http://www.sgi.com/products/servers/origin/3000/overview.html
2.12c: C-Brick and R-Brick
2.12d: SGI Origin Server
2.12e: SGI Origin Server Machine
2.13a: Cray T3E/1350
 Uses the Alpha 21164A processor with a 4-way superscalar architecture, 2 floating-point instructions per cycle
 CPU clock 675 MHz, with a peak rating of 1.35 Gigaflops and 512 MB local memory
 Parallel systems with 40 to 2176 processors (in modules of 8 CPUs each)
 3D torus interconnect with a single processor per node
 Each node contains a router and has a processor interface and six full-duplex links (one for each direction of the cube)
2.13b: Cray T3E Topology
2.13c: Cray T3E Machine
2.14a: UK HP Superdome Cluster
 4 HP Superdomes
 256 total processors (64 processors per host/node)
 Itanium-2 processors; 1.6 TF peak rate
 2 GB memory per processor
 7 Terabytes of total disk space
 High-speed, low-latency InfiniBand internal interconnect
2.14b: HP Superdome Cluster
2.15a: The Earth Simulator (I)
 Each processor of the Earth Simulator runs at 2 ns per cycle, with 16 GB shared memory per node
 The total number of processors is 5,120; the aggregate peak performance is 40 TFlops with 10 TB memory
 It has a single-stage crossbar (1,800 miles of cable), 83,000 copper cables, and 16 GB/s cross-station bandwidth
 700 TB disk space and 1.6 PB mass storage
2.15b: The Earth Simulator (II)
 One node = 8 vector processors, each with a peak performance of 8 GFlops (640 nodes × 8 = 5,120 processors)
 Area of the computer = 4 tennis courts, 3 floors
 Sum of all US Department of Energy supercomputers = 24 TFlops in 2002
 Number 1 on the 2002 Top 500 list (five consecutive lists)
 http://www.es.jamstec.go.jp/esc/eng/
2.15c: The Earth Simulator Machine (I)
2.15d: The Earth Simulator Machine (II)
2.16a: IBM Blue Gene (I)
 In December 1999, IBM Research announced a 5-year, $100M project, named Blue Gene, to develop a petaflop computer for research in computational biology
 Proposed architecture: SMASH (simple, multiple, self-healing)
 Collaboration between IBM Thomas J. Watson Research Center, Lawrence Livermore National Laboratory, the US DOE, and academia
2.16b: IBM Blue Gene (II)
 1 million CPUs of 1 gigaflop each = 1 petaflop
 32 CPUs on a single chip, with 16 MB memory
 64 chips are placed on a 20-inch board
 8 boards form a tower, and 64 towers are connected into an 8×8 grid to form Blue Gene
 Blue Gene/L, Blue Gene/C, and Blue Gene/P target different applications
 #1 and #2 on the Top 500 list in December 2005
 http://domino.research.ibm.com/comm/research_projects.nsf/pages/bluegene.index.html
2.17a: Current Number 1 (Blue Gene/L)
 eServer Blue Gene Solution (Blue Gene/L)
 32,768 GB memory
 Installed in 2005 at DOE/NNSA/LLNL
 131,072 processors
 Peak performance 367.8 TFlops; 280.6 TFlops on the LINPACK benchmark
 See more at http://www.top500.org
 The Earth Simulator is currently #10
 Next round announcement: November 2006, Tampa, Florida, at Supercomputing 2006
2.17b: Blue Gene/L Picture
2.18a: The Really Fastest One
 On June 26, 2006, MDGrape-3 at RIKEN in Japan was clocked at 1.0 petaflops. However, it does not run LINPACK and so does not qualify for the Top 500 list
 Special-purpose system for molecular dynamics – 3 times faster than Blue Gene/L
 201 units of 24 custom MDGrape-3 chips (4,808 total), plus 64 servers each with 256 Dual-Core Intel Xeon processors, and 37 servers each containing 74 Intel 3.2 GHz Xeon processors
 http://www.primidi.com/2004/09/01.html
2.18b: MDGrape-3 System
2.18c: One Board with 12 MDGrape-3 Chips