Distributed Shared memory architecture.ppt

22/12/2005 Distributed Shared-Memory
Architectures by Seda Demirağ

Distributed Shared-Memory Architectures
• According to Flynn (1972), computer parallelism can be categorized as follows:
  - Single instruction stream, single data stream (SISD)
  - Single instruction stream, multiple data stream (SIMD)
  - Multiple instruction stream, single data stream (MISD)
  - Multiple instruction stream, multiple data stream (MIMD)

• MIMD structures can in turn be classified into two groups:
  - Centralized (symmetric) shared-memory architectures
  - Distributed shared-memory architectures

Centralized (symmetric) shared-memory architecture
• Caches can contain either private or shared data. This causes the cache coherence problem.
• Access time to all memory is uniform from all processors.

Distributed memory architecture
• Supports larger processor counts.
• Some processors may be connected by a single bus, but this is less scalable than a global interconnection network.
• A cost-effective way to scale the memory bandwidth (if most accesses are to local memory).
• But communicating data between processors becomes more complex and has higher latency.

Models for Communication Among Processors
There are two alternative architectural approaches that differ in the method used for communicating data among processors:
• Distributed shared-memory (DSM) architectures: communication occurs via a shared address space.
• Multicomputers (clusters): the address space consists of multiple private address spaces that are logically disjoint.
Message-passing multiprocessors: communication of data is done by explicitly passing messages among the processors. For an access to or an operation on data, a processor sends a message to the receiver. The receiver performs the operation and sends the result back.
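
The request/reply pattern described for message passing can be sketched in Python. This is only an illustration under assumptions: threads stand in for processors, `queue.Queue` objects stand in for the interconnect, and the names `owner_node`, `remote_access`, and `demo` are hypothetical.

```python
import queue
import threading

def owner_node(inbox, memory):
    """Receiver: owns `memory` and serves requests until it gets None."""
    while True:
        msg = inbox.get()
        if msg is None:              # shutdown signal (assumed convention)
            break
        op, addr, value, reply_to = msg
        if op == "write":
            memory[addr] = value     # perform the requested operation
        reply_to.put(memory[addr])   # send the result back to the sender

def remote_access(inbox, op, addr, value=None):
    """Sender: pass an explicit message and block until the reply arrives."""
    reply = queue.Queue()
    inbox.put((op, addr, value, reply))
    return reply.get()

def demo():
    inbox = queue.Queue()
    memory = {"X": 1}                # private to the owner node
    worker = threading.Thread(target=owner_node, args=(inbox, memory))
    worker.start()
    before = remote_access(inbox, "read", "X")
    after = remote_access(inbox, "write", "X", 0)
    inbox.put(None)
    worker.join()
    return before, after
```

The sender never touches `memory` directly; every access goes through a message, which is the distinction from the shared-address-space approach.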

Distributed-shared memory architecture
The first DSM architectures appeared in the late 1970s and continued through the early 1980s, embodied in three machines: the Carnegie Mellon Cm*, the IBM RP3, and the BBN Butterfly.
In uniprocessors, the long access time to memory is largely hidden through the use of caches. Unfortunately, adapting caches to work in a multiprocessor environment is difficult.
When used in a multiprocessor, caching introduces an additional problem: cache coherence, which arises when different processors cache and update values of the same memory location.

What is cache coherence?
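The problem is easiest to see in an example with two processors, A and B, and one memory location X (* means the location is not cached):

Time  Event       Cache content in A  Cache content in B  Memory content for X
0                 *                   *                   1
1     A reads X   1                   *                   1
2     B reads X   1                   1                   1
3     A writes X  0                   1                   0

After time 3, B's cache still holds the old value 1, so a read of X by B would return stale data.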
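
The sequence in the example above can be replayed with a small Python sketch. It assumes write-through caches with no coherence mechanism at all; the `Cache` class and the helper names are hypothetical and only serve the illustration.

```python
class Cache:
    """A private cache modeled as a dict from address to value."""
    def __init__(self):
        self.lines = {}

def read(cache, memory, addr):
    # On a miss, fetch the value from memory; on a hit, use the cached copy.
    if addr not in cache.lines:
        cache.lines[addr] = memory[addr]
    return cache.lines[addr]

def write(cache, memory, addr, value):
    # Write-through (assumption): update the local copy and memory,
    # but do nothing about copies held by other caches.
    cache.lines[addr] = value
    memory[addr] = value

def run_example():
    memory = {"X": 1}
    a, b = Cache(), Cache()
    read(a, memory, "X")        # time 1: A reads X
    read(b, memory, "X")        # time 2: B reads X
    write(a, memory, "X", 0)    # time 3: A writes X
    return a.lines["X"], b.lines["X"], memory["X"]
```

`run_example()` returns the contents of A's cache, B's cache, and memory after time 3; B's entry differs from the other two, which is the incoherence the example illustrates.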

A memory system is coherent when it satisfies the following conditions:
• For a given location, a read that immediately follows a write by the same processor always returns the written value.
• For a given location, a read by P2 that immediately follows a write by P1 returns the value written by P1.
• Two writes to the same location by any two processors are seen in the same order by all processors.
Together, these conditions ensure that caches do not hold conflicting copies of a shared location.

DSM architectures that exclude cache coherence:
These systems have caches, but shared data are marked as uncacheable and only private data are kept in the caches.
Software can still cache shared data by copying it from the shared portion of the address space to the local private portion of the address space, which is cached.
Coherence is then controlled by software. The advantage is that little hardware support is needed.
Protocols for cache coherence:
• Snooping protocol
• Directory protocol

Snooping Protocol:
In a snooping system, all caches on the bus monitor (snoop) the bus to determine whether they have a copy of the block of data that is requested on the bus. Every cache keeps the sharing status of every block of physical memory it holds.
There are two types of snooping protocol:
• Write-invalidate: the processor that is writing data causes the copies in the caches of all other processors in the system to be rendered invalid before it changes its local copy. The local machine does this by sending an invalidation signal over the bus, which causes all of the other caches to check for a copy of the invalidated block. Once the other cached copies have been invalidated, the data on the local machine can be updated until another processor requests it.
• Write-update: the processor that is writing the data broadcasts the new data over the bus (without issuing the invalidation signal). All caches that contain copies of the data are then updated. Unlike write-invalidate, this scheme does not leave the writer holding the only valid copy.
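
The difference between the two snooping variants can be sketched in Python. This is a simplified model under assumptions: the bus is a loop over all caches, memory is written through on every write, and the `SnoopyBus` class and its method names are hypothetical.

```python
class SnoopyBus:
    """Toy shared bus with one dict-based cache per processor."""
    def __init__(self, n_caches, policy):
        assert policy in ("invalidate", "update")
        self.policy = policy
        self.memory = {}
        self.caches = [dict() for _ in range(n_caches)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                   # miss: fetch from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        # Every other cache snoops the write and reacts if it holds the block.
        for other, cache in enumerate(self.caches):
            if other == cpu or addr not in cache:
                continue
            if self.policy == "invalidate":
                del cache[addr]                 # write-invalidate: drop the copy
            else:
                cache[addr] = value             # write-update: refresh the copy
        self.caches[cpu][addr] = value
        self.memory[addr] = value               # write-through (assumption)

    def holders(self, addr):
        return [i for i, c in enumerate(self.caches) if addr in c]

def demo(policy):
    bus = SnoopyBus(2, policy)
    bus.memory["X"] = 1
    bus.read(0, "X")
    bus.read(1, "X")
    bus.write(0, "X", 0)
    holders_after_write = bus.holders("X")
    return holders_after_write, bus.read(1, "X")
```

Under either policy the second processor ends up reading the new value; what differs is whether its copy survives the write or has to be fetched again.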

Directory-Based Cache Coherence Protocols
Each directory is responsible for tracking the caches that share the memory addresses of the portion of memory in its node.
The directory must track the state of each cache block. The states are Shared, Uncached, and Exclusive.

The possible messages sent among
nodes to maintain coherence,
along with source and destination
node. (P = requesting processor
number, A = requested address,
D = data contents.)

Example of Directory Protocol
State transition diagram for an individual cache block in a directory-based system: requests by the local processor are shown in black and those from the home directory are shown in gray.

The state transition diagram for the
directory: All actions are in gray
because they are all externally caused.
Bold indicates the action taken by
the directory in response to the request.
Bold italics indicate an action that
updates the sharing set, Sharers.

Example of Directory Protocol (cont’d)
The uncached state:
• Read miss: the requesting processor is sent the requested data from memory and is made the only sharing node. The state of the block is made shared.
• Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.

The shared state:
• Read miss: the requesting processor is sent the requested data from memory and is added to the sharing set.
• Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and the Sharers set is set to contain only the identity of the requesting processor. The state of the block is made exclusive.

The exclusive state:
• Read miss: the owner processor is sent a data fetch message. The owner sends the data to the directory, and the state of the block is made shared. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner.
• Data write-back: the owner processor is replacing the block and therefore must write it back. This write-back makes the memory copy up to date, the block is now uncached, and the Sharers set is empty.
• Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to invalidate the block and send the value to the directory. Sharers is set to the identity of the new owner, and the state of the block remains exclusive.
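
The directory transitions listed above can be collected into a small state machine. The Python sketch below is an illustration under assumptions, not the protocol's definitive form: the class name `DirectoryEntry` and the message names `fetch`, `fetch_invalidate`, and `data_value_reply` are hypothetical labels, and the move to the shared state on a read miss to an exclusive block follows the usual textbook description.

```python
class DirectoryEntry:
    """Directory state for one memory block: a state plus the Sharers set."""
    def __init__(self):
        self.state = "uncached"
        self.sharers = set()

    def read_miss(self, p):
        msgs = []
        if self.state == "exclusive":
            owner = next(iter(self.sharers))
            msgs.append(("fetch", owner))        # owner supplies the data
        msgs.append(("data_value_reply", p))     # data goes to the requester
        self.sharers.add(p)                      # old owner stays in Sharers
        self.state = "shared"
        return msgs

    def write_miss(self, p):
        msgs = []
        if self.state == "shared":
            # Invalidate every current sharer other than the requester.
            msgs += [("invalidate", s) for s in sorted(self.sharers) if s != p]
        elif self.state == "exclusive":
            owner = next(iter(self.sharers))
            msgs.append(("fetch_invalidate", owner))
        msgs.append(("data_value_reply", p))
        self.sharers = {p}                       # requester is the new owner
        self.state = "exclusive"
        return msgs

    def write_back(self, p):
        # Only the current owner can write the block back.
        assert self.state == "exclusive" and self.sharers == {p}
        self.sharers = set()
        self.state = "uncached"
        return []
```

Each method returns the list of (message, destination) pairs the directory would send, which makes it easy to trace a sequence of misses by hand.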

Performance of DSM Multiprocessors
In DSM architectures, the distribution of memory requests between local and remote memory is key to performance: it affects both the bandwidth consumed and the latency seen by requests.
In the performance examples, we separate the cache misses into local and remote requests.
We also compare how performance changes for the computational kernels FFT and LU and for the applications Barnes and Ocean.

Performance of DSM Multiprocessors (cont’d)
The miss rates with these cache sizes are not affected much by changes in processor count, with the exception of Ocean. The rise in miss rate at 64 processors results from two factors:
• An increase in mapping conflicts in the cache, which occur when the grid becomes small and lead to a rise in local misses.
• An increase in the number of coherence misses, which are all remote.

This figure shows how the miss rates change as the cache size is increased, assuming a 64-processor execution and 64-byte blocks. By the time we reach the largest cache size shown, 512 KB, the remote miss rate is equal to or greater than the local miss rate.

This example examines the effect of changing the block size. Increases in block size reduce the miss rate, even for large blocks, although the performance benefits of going to the largest blocks are small. Most of the improvement in miss rate comes from a reduction in the local misses.

The number of bytes per data reference
climbs steadily as block size is increased.

The effective latency of memory references
in a DSM multiprocessor depends both on
the relative frequency of cache misses and
on the location of the memory where the
accesses are served.
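
One simple way to make this dependence concrete is a weighted sum of hit time and miss penalties. The Python sketch below assumes that model; the function name and all numbers in the usage note are hypothetical and are not measurements from the slides.

```python
def effective_latency(hit_time, local_miss_rate, remote_miss_rate,
                      local_latency, remote_latency):
    """Average cycles per memory reference under a simple additive model.

    Miss rates are fractions of all references; latencies are in cycles.
    """
    return (hit_time
            + local_miss_rate * local_latency      # misses served locally
            + remote_miss_rate * remote_latency)   # misses served remotely
```

For instance, with an assumed 1-cycle hit time, 2% local misses at 100 cycles, and 1% remote misses at 400 cycles, the remote misses contribute twice as much to the average as the local ones even though they are half as frequent.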


Any Questions?
