Communication Costs in Parallel Machines
• Along with idling and contention, communication is a major overhead
in parallel programs.
• The cost of communication is dependent on a variety of features
including the programming model semantics, the network topology,
data handling and routing, and associated software protocols.
Message Passing Costs in Parallel Computers
• The total time to transfer a message over a network comprises the following:
• Startup time (ts): Time spent at sending and receiving nodes (executing the
routing algorithm, programming routers, etc.).
• Per-hop time (th): This time is a function of the number of hops and includes
factors such as switch latencies, network delays, etc.
• Per-word transfer time (tw): This time includes all overheads that are
determined by the length of the message, such as the bandwidth of links,
error checking and correction, etc.
Store-and-Forward Routing
• A message traversing multiple hops is completely
received at an intermediate hop before being
forwarded to the next hop.
• The total communication cost for a message of size
m words to traverse l communication links is

    tcomm = ts + (th + tw m) l

• In most platforms, th is small and the above
expression can be approximated by (both forms are
evaluated in the sketch below)

    tcomm = ts + tw m l
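As a quick illustration, here is a small Python sketch of this cost model; the parameter values are hypothetical and chosen only to show the model's behavior.

    def sf_cost(ts, th, tw, m, l):
        """Exact store-and-forward cost: ts + (th + tw*m) * l."""
        return ts + (th + tw * m) * l

    def sf_cost_approx(ts, tw, m, l):
        """Approximation for negligible th: ts + tw*m*l."""
        return ts + tw * m * l

    # Hypothetical parameters (microseconds): 50 startup, 0.5 per hop,
    # 0.1 per word; a 1024-word message traversing 4 links.
    print(sf_cost(50.0, 0.5, 0.1, 1024, 4))    # 461.6
    print(sf_cost_approx(50.0, 0.1, 1024, 4))  # 459.6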
Routing Techniques
Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.
Packet Routing
• Store-and-forward makes poor use of communication
resources.
• Packet routing breaks messages into packets and
pipelines them through the network.
• Since packets may take different paths, each packet must
carry routing information, error checking, sequencing,
and other related header information.
• The total communication time for packet routing is
approximated by:

    tcomm = ts + th l + tw m

• The factor tw accounts for overheads in packet headers.
Cut-Through Routing
• Takes the concept of packet routing to an extreme by further dividing
messages into basic units called flits.
• Since flits are typically small, the header information must be
minimized.
• This is done by forcing all flits to take the same path, in sequence.
• A tracer message first programs all intermediate routers. All flits then
take the same route.
• Error checks are performed on the entire message, as opposed to
flits.
• No sequence numbers are needed.
Cut-Through Routing
• The total communication time for cut-through
routing is approximated by:

    tcomm = ts + th l + tw m

• This expression has the same form as that for packet
routing; however, tw is typically much smaller.
Simplified Cost Model for Communicating Messages
• The cost of communicating a message between two
nodes l hops away using cut-through routing is
given by

    tcomm = ts + l th + tw m

• In this expression, th is typically smaller than ts and
tw. For this reason, the second term on the RHS
contributes little to the total, particularly when m is large.
• Furthermore, it is often not possible to control the
routing and the placement of tasks.
• For these reasons, we can approximate the cost of
message transfer by (the three models are compared
in the sketch below)

    tcomm = ts + tw m
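To make the comparison concrete, the sketch below evaluates all three cost models side by side (parameter values hypothetical): the store-and-forward cost grows with the product l·m, while the cut-through cost grows only with l + m.

    def store_and_forward(ts, th, tw, m, l):
        """tcomm = ts + (th + tw*m) * l: payload retransmitted on every link."""
        return ts + (th + tw * m) * l

    def cut_through(ts, th, tw, m, l):
        """tcomm = ts + th*l + tw*m: flits pipelined across the l links."""
        return ts + th * l + tw * m

    def simplified(ts, tw, m):
        """tcomm = ts + tw*m: per-hop term dropped since th << ts, tw*m."""
        return ts + tw * m

    # Hypothetical values: a 4096-word message over 8 hops.
    ts, th, tw, m, l = 50.0, 0.5, 0.1, 4096, 8
    print(store_and_forward(ts, th, tw, m, l))  # 3330.8
    print(cut_through(ts, th, tw, m, l))        # 463.6
    print(simplified(ts, tw, m))                # 459.6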
Simplified Cost Model for Communicating Messages
• It is important to note that the original expression for communication
time is valid only for uncongested networks.
• If a link carries multiple messages, the corresponding tw term must be
scaled up by the number of messages.
• Different communication patterns congest different networks to
varying extents.
• It is important to understand this and to account for it in the
communication time.
Cost Models for Shared Address Space Machines
• While the basic messaging cost applies to these machines as well, a
number of other factors make accurate cost modeling more difficult.
• Memory layout is typically determined by the system.
• Finite cache sizes can result in cache thrashing.
• Overheads associated with invalidate and update operations are
difficult to quantify.
• Spatial locality is difficult to model.
• Prefetching can play a role in reducing the overhead associated with
data access.
• False sharing and contention are difficult to model.
Routing Mechanisms for Interconnection Networks
• How does one compute the route that a message takes from source
to destination?
• Routing must prevent deadlocks; for this reason, dimension-ordered or
E-cube routing is used (see the sketch after this list).
• Routing must avoid hot-spots; for this reason, two-step routing is often used.
In this case, a message from source s to destination d is first sent to a
randomly chosen intermediate processor i and then forwarded to destination d.
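The following is a minimal sketch (names are illustrative) of E-cube routing on a hypercube: XOR the source and destination labels and correct the differing bits in a fixed, lowest-dimension-first order, which is what eliminates the cyclic channel dependencies that cause deadlock.

    def ecube_route(src, dst, dims):
        """Node sequence for dimension-ordered (E-cube) routing on a
        dims-dimensional hypercube: flip differing bits from dimension 0 up."""
        route, node = [src], src
        for dim in range(dims):
            if (node ^ dst) & (1 << dim):  # labels differ in this dimension
                node ^= 1 << dim           # traverse the link along this dimension
                route.append(node)
        return route

    # Route from Ps = 010 to Pd = 111 in a 3-D hypercube (cf. the next slide):
    print([format(n, "03b") for n in ecube_route(0b010, 0b111, 3)])
    # ['010', '011', '111']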
Routing Mechanisms for Interconnection Networks
Routing a message from node Ps (010) to node Pd (111) in a three-dimensional hypercube using E-cube routing.
Mapping Techniques for Graphs
• Often, we need to embed a known communication pattern into a
given interconnection topology.
• We may have an algorithm designed for one network, which we are
porting to another topology.
For these reasons, it is useful to understand mapping between graphs.
Mapping Techniques for Graphs: Metrics
• When mapping a graph G(V,E) into G’(V’,E’), the following metrics are
important (a computational sketch follows the list):
• The maximum number of edges mapped onto any edge in E’ is called
the congestion of the mapping.
• The maximum number of links in E’ that any edge in E is mapped onto
is called the dilation of the mapping.
• The ratio of the number of nodes in the set V’ to that in set V is called
the expansion of the mapping.
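A small sketch of how these metrics could be computed, under the simplifying assumption that the route each edge of E takes through G’ is supplied explicitly (all names here are illustrative):

    from collections import Counter

    def mapping_metrics(num_v, num_vp, edge_paths):
        """edge_paths: for each edge in E, the list of E' links (frozensets
        of endpoints) that the edge is routed over in G'."""
        load = Counter(link for path in edge_paths for link in path)
        congestion = max(load.values())       # most E edges on one E' link
        dilation = max(map(len, edge_paths))  # longest routed path
        expansion = num_vp / num_v            # |V'| / |V|
        return congestion, dilation, expansion

    # Toy example: a 3-node ring mapped into a 4-node ring, routes by hand.
    paths = [
        [frozenset({0, 1})],                     # edge (a, b) -> one link
        [frozenset({1, 2})],                     # edge (b, c) -> one link
        [frozenset({2, 3}), frozenset({3, 0})],  # edge (c, a) -> two links
    ]
    print(mapping_metrics(3, 4, paths))          # (1, 2, 1.333...)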
Embedding a Linear Array into a Hypercube
• A linear array (or a ring) composed of 2^d nodes (labeled 0 through
2^d − 1) can be embedded into a d-dimensional hypercube by mapping
node i of the linear array onto node G(i, d) of the hypercube. The
function G(i, x) is defined as follows:

    G(0, 1) = 0
    G(1, 1) = 1
    G(i, x + 1) = G(i, x)                        if i < 2^x
    G(i, x + 1) = 2^x + G(2^(x+1) − 1 − i, x)    if i ≥ 2^x
Embedding a Linear Array into a Hypercube
The function G is called the binary reflected Gray code (RGC).
Since adjoining entries (G(i, d) and G(i + 1, d)) differ from each
other at only one bit position, corresponding processors are mapped
to neighbors in a hypercube. Therefore, the congestion, dilation, and
expansion of the mapping are all 1.
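The recursive definition translates directly into Python; the closed form i XOR (i >> 1) generates the same sequence and is used below as a cross-check, along with the one-bit-difference property.

    def G(i, x):
        """Binary reflected Gray code of i on x bits (recursive definition)."""
        if x == 1:
            return i                 # base case: G(0,1) = 0, G(1,1) = 1
        half = 1 << (x - 1)          # 2^(x-1)
        if i < half:
            return G(i, x - 1)
        return half + G(2 * half - 1 - i, x - 1)

    d = 3
    codes = [G(i, d) for i in range(1 << d)]
    print([format(c, "03b") for c in codes])
    # ['000', '001', '011', '010', '110', '111', '101', '100']

    # Closed form agrees, and adjacent codes (with wraparound) differ in 1 bit:
    assert codes == [i ^ (i >> 1) for i in range(1 << d)]
    assert all(bin(codes[k] ^ codes[(k + 1) % (1 << d)]).count("1") == 1
               for k in range(1 << d))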
Embedding a Linear Array into a Hypercube: Example
(a) A three-bit reflected Gray code ring; and (b) its embedding into a
three-dimensional hypercube.
Embedding a Mesh into a Hypercube
• A 2^r × 2^s wraparound mesh can be mapped to a 2^(r+s)-node hypercube
by mapping node (i, j) of the mesh onto node G(i, r) || G(j, s) of
the hypercube (where || denotes concatenation of the two Gray
codes; see the sketch below).
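A sketch of this mapping (written here as G(i, r) || G(j, s), i.e. r Gray-code bits for the row index and s for the column index); the assertions check that wraparound mesh neighbors land on hypercube neighbors, i.e. that the dilation is 1.

    def gray(i):
        """Closed-form binary reflected Gray code."""
        return i ^ (i >> 1)

    def mesh_to_hypercube(i, j, r, s):
        """Map node (i, j) of a 2^r x 2^s wraparound mesh onto the
        (r+s)-bit hypercube label G(i, r) || G(j, s)."""
        return (gray(i) << s) | gray(j)

    r, s = 2, 2  # a 4 x 4 wraparound mesh into a 16-node hypercube
    for i in range(1 << r):
        for j in range(1 << s):
            h = mesh_to_hypercube(i, j, r, s)
            for ni, nj in [((i + 1) % (1 << r), j),   # row wraparound neighbor
                           (i, (j + 1) % (1 << s))]:  # column wraparound neighbor
                nh = mesh_to_hypercube(ni, nj, r, s)
                assert bin(h ^ nh).count("1") == 1    # exactly one bit differs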
Embedding a Mesh into a Hypercube
(a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes
in a four-dimensional hypercube; and (b) a 2 × 4 mesh embedded into
a three-dimensional hypercube.
Once again, the congestion, dilation, and expansion
of the mapping are all 1.
Embedding a Mesh into a Linear Array
• Since a mesh has more edges than a linear array, we will not have an
optimal congestion/dilation mapping.
• We first examine the mapping of a linear array into a mesh and then
invert this mapping.
• This gives us an optimal mapping (in terms of congestion).
Embedding a Mesh into a Linear Array: Example
(a) Embedding a 16-node linear array into a 2-D mesh; and (b) the
inverse of the mapping. Solid lines correspond to links in the linear
array and normal lines to links in the mesh.
Embedding a Hypercube into a 2-D Mesh
• Each √p-node subcube of the hypercube is mapped
to a √p-node row of the mesh (see the sketch after
this list).
• This is done by inverting the linear-array to
hypercube mapping.
• This can be shown to be an optimal mapping.
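A minimal sketch of one reading of this construction: the inverse Gray code recovers a node's position within its row subcube, while the high-order bits select the row. This is an illustrative layout, not necessarily the textbook's exact figure.

    def gray_inverse(g):
        """Recover i from its Gray code g = i ^ (i >> 1)."""
        i = 0
        while g:
            i ^= g
            g >>= 1
        return i

    def hypercube_to_mesh(h, d):
        """Place node h of a 2^d-node hypercube (d even) on a
        2^(d/2) x 2^(d/2) mesh: high bits pick the row subcube, and the
        inverted linear-array mapping orders nodes within rows."""
        half = d // 2
        return gray_inverse(h >> half), gray_inverse(h & ((1 << half) - 1))

    # A 16-node hypercube onto a 4 x 4 mesh:
    for h in range(16):
        print(format(h, "04b"), "->", hypercube_to_mesh(h, 4))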
Embedding a Hypercube into a 2-D Mesh: Example
Embedding a hypercube into a 2-D mesh.
