Multiprocessor

Multistage Networks
• Multistage networks consist of multiple stages of
switch boxes, and should be able to connect any
input to any output.
• A multistage network is called blocking if the
simultaneous connections of some multiple
input-output pairs may result in conflicts in the
use of switches or communication links.
• A nonblocking multistage network can perform
all possible connections between inputs and
outputs by rearranging its connections.

Multistage Networks
The schematic of a typical multistage interconnection network.

Multistage Omega Network
• One of the most commonly used
multistage interconnects is the Omega
network.
• This network consists of log p stages,
where p is the number of inputs/outputs.
• At each stage, input i is connected to
output j if:
$$ j = \begin{cases} 2i, & 0 \le i \le p/2 - 1 \\ 2i + 1 - p, & p/2 \le i \le p - 1 \end{cases} $$

Network Topologies:
Each stage of the Omega network implements a perfect
shuffle as follows:
A perfect shuffle interconnection for eight inputs and outputs.
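The perfect shuffle can be sketched in a few lines of Python. This assumes the standard definition, in which input i connects to output 2i when i < p/2 and to 2i + 1 - p otherwise (equivalently, a one-bit left rotation of the binary representation of i). The function name `perfect_shuffle` is illustrative and does not come from the slides.

```python
def perfect_shuffle(i: int, p: int) -> int:
    """Output index that input i connects to in a p-way perfect shuffle.

    Assumes p is a power of two and 0 <= i < p.
    """
    return 2 * i if i < p // 2 else 2 * i + 1 - p


# Wiring for eight inputs and outputs.
wiring = [perfect_shuffle(i, 8) for i in range(8)]
print(wiring)
```

For p = 8 this maps inputs 0 through 7 to outputs 0, 2, 4, 6, 1, 3, 5, 7.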

• The perfect shuffle patterns are connected
using 2×2 switches.
• The switches operate in two modes: cross-over
or pass-through.
Two switching configurations of the 2 × 2 switch:
(a) Pass-through; (b) Cross-over.

A complete Omega network with the perfect shuffle
interconnects and switches can now be illustrated:
A complete omega network connecting eight inputs and eight outputs.
An omega network has p/2 × log p switching nodes, and
the cost of such a network grows as Θ(p log p).
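As a quick check of the p/2 × log p count, the following sketch computes the number of 2×2 switches for a few network sizes. It assumes p is a power of two and that the logarithm is base 2; `omega_switch_count` is a hypothetical helper name.

```python
def omega_switch_count(p: int) -> int:
    """Number of 2x2 switches: p/2 per stage times log2(p) stages.

    Assumes p is a power of two.
    """
    stages = p.bit_length() - 1  # log2(p) for a power of two
    return (p // 2) * stages


for p in (8, 64, 1024):
    print(p, omega_switch_count(p))
```

For the eight-input network this gives 4 switches in each of 3 stages, or 12 in total.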

Multistage Omega Network – Routing
• Let s be the binary representation of the source
and d be that of the destination processor.
• The data traverses the link to the first switching
node. If the most significant bits of s and d are
the same, the switch routes the data in pass-through
mode; otherwise, it routes the data in cross-over mode.
• This process is repeated for each of the log p
switching stages, using the next most significant bit at each stage.
• Note that the Omega network is a blocking network.
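The routing rule can be sketched as follows. This is a minimal model, assuming each stage is a perfect shuffle (a one-bit left rotation of the current position) followed by a 2×2 switch whose upper output is the even link and whose lower output is the odd link. After the shuffle, the low bit of the position equals the source bit being compared, so testing it against the destination bit is the same as comparing the bits of s and d. The name `omega_route` is illustrative.

```python
def omega_route(src: int, dst: int, n_bits: int) -> list[tuple[str, int]]:
    """Return (switch mode, output link) for each stage on the way from src to dst."""
    mask = (1 << n_bits) - 1
    pos = src
    hops = []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the current position left by one bit.
        pos = ((pos << 1) & mask) | (pos >> (n_bits - 1))
        # Destination bit used at this stage, most significant first.
        want = (dst >> (n_bits - 1 - stage)) & 1
        mode = "pass-through" if (pos & 1) == want else "cross-over"
        # The switch output link has the destination bit as its low bit.
        pos = (pos & ~1) | want
        hops.append((mode, pos))
    return hops


for mode, link in omega_route(0b010, 0b111, 3):
    print(mode, format(link, "03b"))
```

In this model, the message from 010 to 111 crosses over, passes through, and crosses over again, and the output link of the last stage is the destination address.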

Multistage Omega Network – Routing
An example of blocking in an omega network: one of the messages
(010 to 111 or 110 to 100) is blocked at link AB.
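The conflict can be reproduced with a small sketch that records which output link each message occupies after every stage. It assumes the same model of a stage as a perfect shuffle followed by destination-bit switching; the numbering of links by (stage, position) is an assumption of this sketch, and the shared link it reports presumably corresponds to link AB in the figure.

```python
def links_used(src: int, dst: int, n_bits: int) -> set[tuple[int, int]]:
    """Set of (stage, output link) pairs occupied by the route src -> dst."""
    mask = (1 << n_bits) - 1
    pos, used = src, set()
    for stage in range(n_bits):
        pos = ((pos << 1) & mask) | (pos >> (n_bits - 1))  # perfect shuffle
        bit = (dst >> (n_bits - 1 - stage)) & 1            # destination bit
        pos = (pos & ~1) | bit                             # switch output link
        used.add((stage, pos))
    return used


# Links needed by both messages at the same stage.
conflicts = links_used(0b010, 0b111, 3) & links_used(0b110, 0b100, 3)
print(sorted(conflicts))
```

Both messages need output link 101 of the first stage, so only one of them can proceed at a time.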

Hypercube Interconnection
• A hypercube, or binary n-cube, multiprocessor
structure is composed of N = 2^n processors
interconnected in an n-dimensional binary cube.
• It is used in loosely coupled systems.
• Each processor forms a node of the cube.
• Each processor has a direct communication path
to n neighboring processors.
• There are 2^n distinct n-bit binary addresses, and
each processor is assigned one of them.
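A short sketch of the addressing scheme: a node's n neighbors are the addresses that differ from its own in exactly one bit. The helper name `hypercube_neighbors` is illustrative.

```python
def hypercube_neighbors(node: int, n: int) -> list[int]:
    """Addresses of the n nodes directly connected to `node` in an n-cube."""
    # Flipping bit k moves one step along dimension k.
    return [node ^ (1 << k) for k in range(n)]


print([format(v, "03b") for v in hypercube_neighbors(0b000, 3)])
```

In a three-cube, node 000 is directly connected to 001, 010, and 100.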

Hypercubes and their Construction
Construction of hypercubes from hypercubes of lower dimension.

Properties of Hypercubes
• The distance between any two nodes is at most
log p.
• Each node has log p neighbors.
• The distance between two nodes is given by
the number of bit positions at which the two
nodes differ.
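The distance property amounts to a Hamming distance between the two node addresses. A minimal sketch, with `hypercube_distance` as an illustrative name:

```python
def hypercube_distance(a: int, b: int) -> int:
    """Number of links on a shortest path between nodes a and b."""
    # XOR leaves a 1 in every bit position where the addresses differ.
    return bin(a ^ b).count("1")


print(hypercube_distance(0b000, 0b011))
print(hypercube_distance(0b000, 0b111))
```

No pair of nodes in a three-cube is more than three links apart, which matches the log p bound for p = 8.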

• Routing messages through an n-cube structure
may take from one to n links from a source node
to a destination node.
• For example, in a three-cube structure, node 000
can communicate directly with node 001.
• It must cross at least two links to communicate
with 011 (from 000 to 001 to 011 or from 000 to
010 to 011).
• It is necessary to go through at least three links to
communicate from node 000 to node 111.
• A routing procedure can be developed by
computing the exclusive-OR of the source node
address with the destination node address. The 1 bits
of the result mark the dimensions in which the two
addresses differ.
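One possible routing procedure based on the exclusive-OR is sketched below. It corrects the differing bits from the lowest dimension to the highest; that ordering is an assumption of this sketch, and any order of the differing dimensions gives a path of the same length. The name `hypercube_route` is illustrative.

```python
def hypercube_route(src: int, dst: int, n: int) -> list[int]:
    """Nodes visited from src to dst, fixing differing bits from low to high."""
    diff = src ^ dst  # 1 bits mark the dimensions that must be crossed
    node = src
    path = [src]
    for k in range(n):
        if diff & (1 << k):
            node ^= 1 << k  # move one link along dimension k
            path.append(node)
    return path


print([format(v, "03b") for v in hypercube_route(0b000, 0b011, 3)])
```

For source 000 and destination 011 this yields the path 000, 001, 011, which uses two links.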

Cache Coherence
• In a shared-memory multiprocessor system, the
processors share main memory, and each also has local
memory (part or all of which may be a cache).
• For the system to execute memory operations
correctly and independently, multiple copies of the
same data must be identical. This property is
called cache coherence.

Conditions for Incoherence
• Incoherence arises when processors need
to share writable data.
• It can occur under both the write-back and
write-through policies.
• It can also occur with DMA: an I/O processor (IOP)
may modify data in main memory that also resides in
a cache, and the cached copy is not updated.

Cache Coherence
in Multiprocessor Systems
Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b)
Update protocol for shared variables.
When the value of a variable changes, all its copies
must either be invalidated or updated.
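The difference between the two protocols can be sketched with caches modeled as plain dictionaries. This is a simplified illustration, not a hardware-accurate model; `write_shared` and its parameters are hypothetical names.

```python
def write_shared(caches, writer, var, value, protocol):
    """Apply a write by cache `writer` to `var` under the given protocol.

    caches: list of dicts mapping variable name -> value; a missing key
    means that cache holds no valid copy.
    """
    caches[writer][var] = value
    for i, cache in enumerate(caches):
        if i == writer or var not in cache:
            continue
        if protocol == "invalidate":
            del cache[var]        # other copies become invalid
        elif protocol == "update":
            cache[var] = value    # other copies receive the new value
        else:
            raise ValueError(f"unknown protocol: {protocol}")
    return caches


inv = write_shared([{"x": 1}, {"x": 1}], 0, "x", 3, "invalidate")
upd = write_shared([{"x": 1}, {"x": 1}], 0, "x", 3, "update")
print(inv)
print(upd)
```

Under invalidation the second cache loses its copy and must fetch x again on its next read; under update it keeps a current copy at the cost of a message on every write.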

Cache Coherence:
Update and Invalidate Protocols
• If a processor just reads a value once and does
not need it again, an update protocol may
generate significant overhead.
• If two processors make interleaved tests and
updates to a variable, an update protocol is
better.
• Both protocols suffer from false sharing
overheads (two words that are not shared
but lie on the same cache line).
• Most current machines use invalidate protocols.
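Whether two words are candidates for false sharing depends only on whether their addresses fall on the same cache line. The sketch below assumes a 64-byte line, which is an illustrative value and not one given in the slides; `same_cache_line` is a hypothetical helper.

```python
LINE_SIZE = 64  # bytes per cache line; assumed value for illustration


def same_cache_line(addr_a: int, addr_b: int, line_size: int = LINE_SIZE) -> bool:
    """True if both byte addresses map to the same cache line."""
    return addr_a // line_size == addr_b // line_size


print(same_cache_line(0x1000, 0x1008))
print(same_cache_line(0x1000, 0x1040))
```

Two variables written by different processors at 0x1000 and 0x1008 would share a line and trigger coherence traffic even though neither processor touches the other's variable.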

Maintaining Coherence
Using Invalidate Protocols
• Each copy of a data item is associated with a state.
• One example of such a set of states is shared, invalid,
and dirty.
• In the shared state, there are multiple valid copies of the
data item (and therefore an invalidate has to
be generated on an update).
• In the dirty state, only one copy exists, and therefore no
invalidates need to be generated.
• In the invalid state, the data copy is invalid; therefore, a
read generates a data request (and associated state
changes).

State diagram of a simple three-state coherence protocol.
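The state diagram itself is not reproduced in the text, so the transition table below is a sketch of one plausible three-state protocol consistent with the bullets above. The event names (`read`, `write` for local accesses; `c_read`, `c_write` for coherence actions caused by other processors; `flush`) are assumptions of this sketch.

```python
# (current state, event) -> next state, for one cached copy of a data item.
TRANSITIONS = {
    ("invalid", "read"): "shared",     # read miss fetches a valid copy
    ("invalid", "write"): "dirty",     # write makes this the only valid copy
    ("shared", "read"): "shared",
    ("shared", "write"): "dirty",      # other copies get invalidated
    ("shared", "c_write"): "invalid",  # another processor wrote the item
    ("dirty", "read"): "dirty",
    ("dirty", "write"): "dirty",
    ("dirty", "c_read"): "shared",     # another processor read the item
    ("dirty", "c_write"): "invalid",
    ("dirty", "flush"): "invalid",
}


def next_state(state: str, event: str) -> str:
    """Follow one edge of the table; unlisted events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = "invalid"
for event in ("read", "write", "c_read", "c_write"):
    state = next_state(state, event)
    print(event, "->", state)
```

Starting from invalid, a local read, a local write, a remote read, and a remote write move the copy through shared, dirty, shared, and back to invalid.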

Considering serial execution of 2 instructions with the simple
three-state coherence protocol.
Treating x = x + y as a load instruction on x.

Considering serial execution of 2 instructions with the simple
three-state coherence protocol.
Treating x = x + y as a read instruction on x.

Considering parallel execution of 2 instructions with the simple
three-state coherence protocol.
Treating x = x + y as a load instruction and a store (write) instruction.

Snoopy Cache Systems
How are invalidates sent to the right processors?
In snoopy cache systems, all caches are attached to a broadcast
medium such as a bus. Each cache listens to all
invalidates and read requests on it and performs the appropriate
coherence operations locally.
A simple snoopy bus based cache coherence system.
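A toy model of a snoopy bus is sketched below. It tracks only the state of each cached line (shared or dirty, with absence meaning invalid) and counts bus messages; data values, memory, and write-back are left out. The class names `Bus` and `SnoopyCache` are illustrative.

```python
class Bus:
    """Broadcast medium: every message is seen by all other caches."""

    def __init__(self):
        self.caches = []
        self.messages = 0

    def broadcast(self, sender, kind, addr):
        self.messages += 1
        for cache in self.caches:
            if cache is not sender:
                cache.snoop(kind, addr)


class SnoopyCache:
    def __init__(self, bus):
        self.state = {}  # line address -> "shared" or "dirty"; absent = invalid
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.state:            # miss: request goes on the bus
            self.bus.broadcast(self, "read", addr)
            self.state[addr] = "shared"

    def write(self, addr):
        if self.state.get(addr) != "dirty":   # others must drop their copies
            self.bus.broadcast(self, "invalidate", addr)
            self.state[addr] = "dirty"

    def snoop(self, kind, addr):
        if addr not in self.state:
            return
        if kind == "invalidate":
            del self.state[addr]
        elif kind == "read" and self.state[addr] == "dirty":
            self.state[addr] = "shared"       # another cache now holds a copy


bus = Bus()
c0, c1 = SnoopyCache(bus), SnoopyCache(bus)
c0.read(0x40)
c1.read(0x40)
c0.write(0x40)   # invalidates c1's copy
c0.write(0x40)   # already dirty: no bus traffic
print(c0.state, c1.state, bus.messages)
```

The second write by the same cache is served locally, so the bus carries three messages for the four operations.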

Performance of Snoopy Caches
• Once copies of data are tagged dirty, all
subsequent operations can be performed locally
on the cache without generating external traffic.
• If a data item is read by a number of processors,
it transitions to the shared state in the cache and
all subsequent read operations become local.
• If processors read and update data at the same
time, they generate coherence requests on the
bus, which is ultimately bandwidth limited.

Directory Based Systems
• In snoopy caches, each coherence operation is
sent to all processors. This is an inherent
limitation.
• Why not send coherence requests to only
those processors that need to be notified?
• This is done using a directory, which maintains
a presence vector for each data item (cache
line) along with its global state.
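A presence vector can be sketched as one bit per processor for each cache line. The model below is simplified: it keeps a single centralized table, ignores data movement, and treats any read as moving the line to the shared state. The class name `Directory` and its methods are hypothetical.

```python
class Directory:
    """Per-line presence bit vector plus a global state."""

    def __init__(self, num_procs: int):
        self.num_procs = num_procs
        self.presence = {}  # line -> int used as a bit vector of sharers
        self.state = {}     # line -> "shared" or "dirty"

    def read(self, proc: int, line: str) -> None:
        # Set this processor's presence bit and mark the line shared.
        self.presence[line] = self.presence.get(line, 0) | (1 << proc)
        self.state[line] = "shared"

    def write(self, proc: int, line: str) -> list[int]:
        """Return the processors that must receive an invalidate."""
        sharers = self.presence.get(line, 0) & ~(1 << proc)
        targets = [q for q in range(self.num_procs) if (sharers >> q) & 1]
        self.presence[line] = 1 << proc  # the writer holds the only copy
        self.state[line] = "dirty"
        return targets


d = Directory(4)
d.read(0, "x")
d.read(2, "x")
print(d.write(0, "x"))
```

Only processor 2 is notified of the write; processors 1 and 3 never held the line and receive no message.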

Directory Based Systems
Architecture of typical directory based systems: (a) a centralized
directory; and (b) a distributed directory.

Performance of
Directory Based Schemes
• The need for a broadcast medium is replaced by
the directory.
• The additional bits to store the directory may
add significant overhead.
• The underlying network must be able to carry
all the coherence requests.
• The directory is a point of contention;
therefore, distributed directory schemes must
be used.
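The storage overhead of a full presence vector can be estimated with simple arithmetic: one bit per processor for every cache line. The sketch below assumes a 64-byte line and ignores the few bits needed for the global state; both the line size and the name `directory_overhead` are assumptions for illustration.

```python
def directory_overhead(num_procs: int, line_bytes: int = 64) -> float:
    """Presence bits per cache line as a fraction of the line's data bits."""
    return num_procs / (line_bytes * 8)


for procs in (64, 256, 1024):
    print(procs, directory_overhead(procs))
```

With 64 processors the presence bits add 12.5% to each line; with 1024 processors they take twice as much space as the data itself.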
