Rehan Azmat
Lecture 26-27
SIMD Architecture
Introduction and Motivation
Architecture classification
Performance of Parallel Architectures
Interconnection Network
Array processors
Vector processors
Cray X1
Multimedia extensions
Manipulation of arrays or vectors is a common operation in
scientific and engineering applications.
Typical operations of array-oriented data include:
Processing one or more vectors to produce a scalar result.
Combining two vectors to produce a third one.
Combining a scalar and a vector to generate a vector.
A combination of the above three operations.
Two architectures suitable for vector processing have evolved:
Pipelined vector processors
Implemented in many supercomputers
Parallel array processors
Compiler does some of the difficult work of finding parallelism,
so the hardware doesn't have to.
Data parallelism.
Strictly speaking, vector processors are not parallel processors.
There are not several CPUs in a vector processor, running in parallel.
Vector computers usually have vector registers, each of which can store 64 up to 128 words.
They only behave like SIMD computers.
They are SISD processors with vector instructions executed on pipelined functional units.
Examples of vector instructions:
Load a vector from memory into a vector register
Store a vector into memory
Arithmetic and logic operations between vectors
Operations between vectors and scalars
The programmers are allowed to use operations on vectors in the
programs, and the compiler translates these operations into vector
instructions at machine level.
A vector unit typically consists of:
pipelined functional units
vector registers:
n general-purpose vector registers Ri, 0 ≤ i ≤ n-1;
a vector length register VL: stores the length l (0 ≤ l ≤ s)
of the currently processed vector; s is the length of the
vector registers;
a mask register M: stores a set of l bits, one for each
element in a vector, interpreted as Boolean values;
vector instructions can be executed in masked mode so that
vector register elements corresponding to a false value in M
are ignored.
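The effect of the mask register can be sketched in C (a minimal software model; the function name and element types are invented for illustration, not taken from a real vector ISA):

```c
#include <stddef.h>

/* Software model of a masked vector add: C[i] = A[i] + B[i] is
   performed only where the corresponding mask bit is true; elements
   whose mask bit is false are ignored, as with the mask register M. */
static void masked_vadd(double *c, const double *a, const double *b,
                        const unsigned char *mask, size_t l)
{
    for (size_t i = 0; i < l; i++)
        if (mask[i])
            c[i] = a[i] + b[i];
}
```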
Consider an element-by-element addition of two N-element vectors A
and B to create the sum vector C.
On an SISD machine, this computation will be implemented as:
for i = 0 to N-1 do
C[i] := A[i] + B[i];
There will be N*K instruction fetches (K instructions are needed for each iteration)
and N additions.
There will also be N conditional branches, if loop unrolling is not used.
A compiler for a vector computer generates something like:
C[0:N-1] := A[0:N-1] + B[0:N-1];
Even though N additions will still be performed, there will only be K instruction
fetches (e.g., Load A, Load B, Add, Write C = 4 instructions).
No conditional branch is needed.
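The SISD version corresponds to the plain C loop below (a sketch with invented names; N is fixed small for the example). Each iteration costs K instruction fetches plus a branch, which is exactly what the single vector instruction avoids:

```c
#include <stddef.h>

#define N 8

/* One scalar addition per iteration: the instruction stream contains
   N*K fetches and N conditional branches, versus a handful of
   instructions (Load A, Load B, Add, Write C) on a vector machine. */
static void vadd_scalar(double *c, const double *a, const double *b)
{
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```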
Advantages
Quick fetch and decode of a single instruction for multiple
operations.
The instruction provides the processor with a regular
source of data, which can arrive at each cycle, and be
processed in a pipelined fashion regularly.
The compiler does the work for you.
Memory-to-memory operation mode
no registers.
can process very long vectors, but start-up time is large.
appeared in the 70s and died in the 80s.
Register-to-register operations are more common
with new machines.
It is composed of N identical processing elements under the
control of a single control unit and a number of memory
modules.
The PEs execute instructions in lock-step mode.
Processing units and memory elements communicate with each
other through an interconnection network.
Different topologies can be used.
Complexity of the control unit is at the same level as that of a
uniprocessor system.
Control unit is a computer with its own high speed registers,
local memory and arithmetic logic unit.
The main memory is the aggregate of the memory modules.
Processing element complexity
Single-bit processors
Connection Machine (CM-2): 65,536 PEs connected by a
hypercube network (by Thinking Machines Corporation).
Multi-bit processors
ILLIAC IV (64-bit), MasPar MP-1 (32-bit)
Processor-memory interconnection
Dedicated memory organization
ILLIAC IV, CM-2, MP-1
Global memory organization
Bulk Synchronous Parallel (BSP) computer
Control and scalar type instructions are executed in the
control unit.
Vector instructions are performed in the processing
elements.
Data structuring and detection of parallelism in a program
are the major issues in the application of array processors.
Operations such as C(i) = A(i) + B(i), 1 ≤ i ≤ n, could be
executed in parallel, if the elements of the arrays A and B
are distributed properly among the processors or memory
modules.
Ex.: PEi is assigned the task of computing C(i).
To compute the inner product s = A(1)B(1) + A(2)B(2) + ... + A(N)B(N)
Assuming:
A dedicated memory organization.
Elements of A and B are properly and perfectly distributed
among processors (the compiler can help here).
We have:
The product terms A(i)B(i) are generated in parallel.
Additions can be performed in log2 N iterations.
The speed-up factor (S) over sequential execution (N multiplications
and N-1 additions, versus one parallel multiplication step and
log2 N addition steps) is:
S = (2N - 1) / (1 + log2 N)
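The two phases can be simulated in C (a sequential sketch with invented names, assuming n is a power of two and n <= 64): one conceptual step produces all product terms, then the additions complete in log2 n pairwise rounds:

```c
#include <stddef.h>

/* Simulates the array-processor inner product: the product terms are
   all formed in one conceptual parallel step, then pairwise sums
   halve the number of active elements each round, so the additions
   finish after log2(n) iterations. */
static double inner_product_tree(const double *a, const double *b, size_t n)
{
    double p[64];                          /* partial results, n <= 64 */
    for (size_t i = 0; i < n; i++)         /* parallel multiply step   */
        p[i] = a[i] * b[i];
    for (size_t s = n / 2; s > 0; s /= 2)  /* log2(n) addition rounds  */
        for (size_t i = 0; i < s; i++)
            p[i] += p[i + s];
    return p[0];
}
```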
ILLIAC IV development started in the late 60s; fully
operational in 1975.
SIMD computer for array processing.
Control Unit + 64 Processing Elements.
CU can access all memory.
PEs can access local memory and communicate with
neighbors.
CU reads program and broadcasts instructions to PEs.
2K words memory per PE.
Cray combines several technologies in the X1
(2003)
12.8 Gflop/s vector processors
Shared caches
4 processor nodes sharing up to 64 GB of memory
Multi-streaming vector processing
Multiple node architecture
MSP: Multi-Streaming vector Processor
Formed by 4 SSPs (each a 2-pipe vector processor)
Balance computations across SSPs.
Compiler will try to vectorize/parallelize across the
MSP, achieving streaming
Many levels of parallelism
Some are automated by the compiler, some
require work by the programmer
Within a processor: vectorization
Within an MSP: streaming
Within a node: shared memory
Across nodes: message passing
This is a common trend
The more complex the architecture, the more difficult it
is for the programmer to exploit it
Hard to fit this machine into a simple taxonomy!
How do we extend general-purpose microprocessors so that they can
handle multimedia applications efficiently?
Analysis of the need:
Video and audio applications very often deal with large arrays of small
data types (8 or 16 bits).
Such applications exhibit a large potential of SIMD (vector) parallelism.
Data parallelism.
Solutions:
New generations of general purpose microprocessors are equipped with
special instructions to exploit this parallelism.
The specialized multimedia instructions perform vector computations on
bytes, half-words, or words.
Several vendors have extended the instruction set of their
processors in order to improve performance with
multimedia applications:
MMX for Intel x86 family
VIS for UltraSparc
MDMX for MIPS
MAX-2 for Hewlett-Packard PA-RISC
The Pentium line provides 57 MMX instructions. They treat
data in a SIMD fashion to improve the performance of
Computer-aided design
Internet applications
Computer visualization
Video games
Speech recognition
The basic idea: sub-word execution
Use the entire width of a processor data path (32 or
64 bits), even when processing the small data types
used in signal processing (8, 12, or 16 bits).
With word size 64 bits, an adder can be used to
implement eight 8-bit additions in parallel.
MMX technology allows a single instruction to work
on multiple pieces of data.
Consequently we have practically a kind of SIMD
parallelism, at a reduced scale.
Three packed data types are defined for
parallel operations: packed byte, packed
word, packed double word.
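Sub-word execution can be sketched in portable C with the classic SWAR high-bit trick (an illustration of the idea only; real MMX code would use dedicated packed-add instructions instead):

```c
#include <stdint.h>

/* Eight independent 8-bit additions in one 64-bit operation.
   The high bit of each byte lane is masked off before the add so
   that carries cannot cross lane boundaries, then restored with XOR;
   each lane wraps around modulo 256 on its own. */
static uint64_t add_packed_bytes(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8080808080808080ULL;  /* high bit of each lane */
    uint64_t low = (x & ~H) + (y & ~H);        /* carry-safe low 7 bits */
    return low ^ ((x ^ y) & H);                /* put high bits back    */
}
```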
[Table: performance of Pentium processors with and without MMX technology.]
Vector processors are SISD processors whose instruction set
includes operations on vectors.
They are implemented using pipelined functional units.
They behave like SIMD machines.
Array processors, being typical SIMD, execute the same operation on a
set of interconnected processing units.
Both vector and array processors are specialized for numerical problems
expressed in matrix or vector formats.
Many modern architectures, such as the Cray X1, deploy several
parallel-architecture concepts at the same time.
Multimedia applications exhibit a large potential of SIMD parallelism.
The instruction set of modern microprocessors has been extended to support SIMD-style parallelism with operations on short vectors.
End of the Lecture