Performance
Discussion and Future Directions
Group Steiner Problem
 Given an undirected weighted graph 𝐺 = (𝑉, 𝐸) and a family 𝑁 = {𝑁1, … , 𝑁𝑘} of 𝑘 disjoint groups of nodes 𝑁𝑖 ⊆ 𝑉, find a
minimum-cost tree that contains at least one node from each group 𝑁𝑖.
 Steiner nodes, i.e., nodes that are not a member of any group 𝑁𝑖, can optionally be used to interconnect the groups. In the
figure, the solid dots represent non-Steiner nodes, hollow dots represent optional Steiner nodes and the boxes represent groups.
 The solution to the Group Steiner Problem (GSP) can be used to find an efficient way of connecting a subset of a group of
vertices in a graph. Applications of this problem and its variants include routing of VLSI circuits and computer-aided design.
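The problem statement above can be illustrated with a small feasibility check. This is a minimal sketch, not the authors' code: the function names `tree_cost` and `is_feasible` and the edge representation as `(u, v, weight)` triples are assumptions made for illustration.

```python
def tree_cost(edges):
    """Total weight of a candidate solution given as (u, v, weight) triples."""
    return sum(w for _, _, w in edges)


def is_feasible(edges, groups):
    """True if the edges form a tree that contains a node from every group.

    Assumes at least one edge; a single-node solution is not handled here.
    """
    nodes = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    # A tree on |nodes| vertices has exactly |nodes| - 1 edges.
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    adjacency = {n: [] for n in nodes}
    for u, v, _ in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    # Depth-first search to confirm the edges are connected.
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adjacency[n])
    if seen != nodes:
        return False
    # Every group must contribute at least one node to the tree.
    return all(nodes & set(group) for group in groups)


# Hypothetical instance: three groups joined through one Steiner node "s".
groups = [{"a1", "a2"}, {"b1"}, {"c1", "c2"}]
solution = [("a1", "s", 2), ("s", "b1", 1), ("s", "c2", 3)]
```

Here `is_feasible(solution, groups)` holds and `tree_cost(solution)` is 6. Dropping the edge to `c2` leaves a valid tree that no longer touches the third group, so it is rejected.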
Figure: A feasible solution for a group Steiner problem instance
 In the VLSI design flow, logical design is followed by planning the physical layout,
which involves partitioning, placement, and routing.
VLSI Routing Estimation
 In practice, however, each terminal consists of a large collection of
electrically equivalent ports, a fact that is not accounted for in
layout steps such as wiring estimation. Each module can also be
rotated or flipped giving eight possible locations for a given port.
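The eight candidate locations mentioned above can be enumerated directly. The sketch below assumes the module is rotated and flipped about a fixed center and that the port is described by its offset from that center; the function name `port_positions` and this coordinate convention are illustrative assumptions, not part of the poster.

```python
def port_positions(center, offset):
    """Enumerate port locations under 90-degree rotations and flips.

    center: (x, y) of the module center, assumed fixed.
    offset: (dx, dy) of the port relative to the center.
    The eight symmetries of a rectangle-aligned module map the offset to
    (+/-dx, +/-dy) and (+/-dy, +/-dx).
    """
    cx, cy = center
    dx, dy = offset
    images = set()
    for a, b in ((dx, dy), (dy, dx)):      # keep or swap the axes
        for sx in (1, -1):                 # mirror horizontally
            for sy in (1, -1):             # mirror vertically
                images.add((cx + sx * a, cy + sy * b))
    return sorted(images)


positions = port_positions((10, 5), (2, 1))
```

A generic offset yields eight distinct positions, which together form one group in the Group Steiner formulation. An offset on an axis or a diagonal yields fewer distinct positions because some symmetries coincide.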
Figure: Different wiring layouts using multiple equivalent ports
Figure: A module is rotated and flipped to induce a group of eight terminal positions
Figure: Placement of standard cells; after global routing; after detailed routing
 Routing:
Input: Terminal List, location of
terminals and pins/ports
Output: Geometric layout of all nets
Objective: Minimize the total wire
length completing all connections
without increasing chip area
 All experiments were run on the Blue Waters supercomputer, a Cray XE6/XK7 system.
 We used several VLSI WRP (Wire Routing Problem) instances from the datasets provided by [3], with
graph sizes ranging from 128 to 3,168 vertices.
 We ran a strong scalability test by increasing the number of processors while keeping the problem
size constant. (Figure 5)
 We also compared our implementation with the best known sequential performance reported in [4]. The
results show that our implementation achieves up to a 302x speedup for a graph of 3,168 vertices.
(Figure 6)
 Table 1 compares the cost of our solutions with the optimum costs from [4].
Figure 5: A strong scalability test. Also shows a comparison with our serial implementation, which took 5852.4 s for a graph size of 2,518 vertices.
Figure 6: Runtime comparison with [4]
[Plot axes: Number of Vertices; Runtime (s)]
GPU-Accelerated VLSI Routing Using Group Steiner Trees
Basileal Imana, Venkata Suhas Maringanti and Peter Yoon
Department of Computer Science, Trinity College, Hartford, CT
Specifications: Master process runs on a GPU-enabled XK7 compute
node while all other slaves run on traditional XE6 compute nodes.
XK7: one AMD 6200 Interlagos CPU, 23GB Host memory; one NVIDIA
GK110 “Kepler” accelerator, 6GB memory
XE6: 2 AMD 6276 Interlagos CPUs, 2.3 GHz, 64GB physical memory,
16 cores
Discussion
 Our implementation outperforms existing serial implementations
while producing accurate solutions for WRP problem instances.
 The algorithm is adaptively refined in that it uses several steps to
refine the solution and is highly dynamic making it hard to predict the
size of the solution before actually computing it. This causes some
load imbalance.
 Due to the dynamic nature of the problem, our implementation
exhibits some irregular memory access patterns.
Future Directions
 Design more efficient work distribution schemes to reduce the
load imbalance.
 Overlap communication with computation where possible.
 Improve the efficiency of memory access patterns.
Graph     Size (vertices)   Terminals   Opt. cost [4]   Approx. cost   Error %
wrp3-11   128               11          1100361         1100427        0.006
wrp3-39   703               39          3900450         3900600        0.004
wrp3-96   2518              96          96001172        96003009       0.002
wrp3-83   3168              83          8300906         8302279        0.017
Table 1: Accuracy Comparison
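The Error % column is consistent with the relative gap between the approximate and optimum costs. The short sketch below recomputes it from the table's own numbers; the function name `error_percent` is an illustrative choice, and the assumption is that the poster rounds to three decimal places.

```python
def error_percent(opt_cost, approx_cost):
    """Relative excess of the approximate cost over the optimum, in percent."""
    return 100.0 * (approx_cost - opt_cost) / opt_cost


# (graph, optimum cost from [4], approximate cost) taken from Table 1.
table_1 = [
    ("wrp3-11", 1100361, 1100427),
    ("wrp3-39", 3900450, 3900600),
    ("wrp3-96", 96001172, 96003009),
    ("wrp3-83", 8300906, 8302279),
]

errors = {name: round(error_percent(opt, approx), 3)
          for name, opt, approx in table_1}
```

Rounded to three decimals, the recomputed values match the table: 0.006, 0.004, 0.002, and 0.017.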
References
[1] Helvig, C.S., Robins, G. and Zelikovsky, A. New Approximation Algorithms for Routing
with Multi-Port Terminals. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 19(10), 1118-1128.
[2] Lund, B. D., and Smith, J. W. A multi-stage CUDA Kernel for Floyd-Warshall. CoRR
abs/1001.4108 (2010).
[3] Koch, T., Martin, A. and Voß, S. SteinLib: an updated library on Steiner tree problems
in graphs. In Cheng, X. and Du, D.Z. eds. Steiner Trees in Industry, Springer US, Berlin,
2001, 285–326.
[4] Polzin, T., Vahdati, S. The Steiner tree challenge: An updated study, 11th DIMACS
Implementation Challenge. Retrieved July 06, 2015, from Princeton University:
http://dimacs11.cs.princeton.edu/papers/PolzinVahdatiDIMACS.pdf
The Group Steiner Heuristic [1]
1. Replace input graph 𝐺 = (𝑉, 𝐸) by its metric closure, i.e., a complete graph
with vertices 𝑉 and edge weights equal to the shortest path lengths.
(Figure 1)
2. For every terminal vertex 𝑣, create a new node 𝑣’ and a new zero-cost edge
(𝑣, 𝑣’). 𝑣’ takes the role of 𝑣; 𝑣’ becomes a port and 𝑣 a non-port. (Figure 2)
3. Construct a rooted 1-star tree, i.e., a tree of depth 1 where all leaves are
terminal nodes, one from each group. (Figure 3)
4. Select intermediate nodes and determine a set of groups that should be
connected to each intermediate node to form a partial-star. (Figure 4)
5. Combine the partial-stars to obtain a 2-star tree, i.e., a tree of depth 2.
(Figure 4)
The 2-star has to be rooted at the center of the optimal solution tree. The
center, however, cannot be determined ahead of time. Therefore, a Steiner
2-star is constructed with each node 𝑣 in 𝐺 as root 𝑟, and the minimum-cost tree
is selected over all possible choices of 𝑟.
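Step 1 of the heuristic, the metric closure, can be sketched with the textbook Floyd-Warshall recurrence. This is a plain serial Python illustration under assumed conventions (vertices numbered 0 to n-1, undirected edges as `(u, v, weight)` triples, function name `metric_closure`); the poster's implementation uses a blocked GPU variant [2], which this sketch does not reproduce.

```python
import math


def metric_closure(n, edges):
    """All-pairs shortest path lengths for an undirected weighted graph."""
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        # Keep the lightest edge if parallel edges are present.
        if w < dist[u][v]:
            dist[u][v] = dist[v][u] = w
    # Floyd-Warshall: allow vertex k as an intermediate stop, one k at a time.
    for k in range(n):
        row_k = dist[k]
        for i in range(n):
            d_ik = dist[i][k]
            if d_ik == math.inf:
                continue
            row_i = dist[i]
            for j in range(n):
                if d_ik + row_k[j] < row_i[j]:
                    row_i[j] = d_ik + row_k[j]
    return dist


# Hypothetical 4-vertex graph: the direct edge 0-2 (weight 5) is
# replaced by the shorter path 0-1-2 (weight 3) in the closure.
closure = metric_closure(4, [(0, 1, 1), (1, 2, 2), (0, 2, 5), (2, 3, 1)])
```

The resulting matrix is the edge-weight table of the complete graph used by the later steps.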
Depth Bounded Tree Approximation
Input: A graph $G = (V, E)$, a family $N$ of $k$ disjoint groups $N_1, \dots, N_k \subseteq V$, and a root $r \in V$
$Approx2(r) \leftarrow \{r\}$ /*add root to solution*/
$N' \leftarrow N$ /*begin with all groups remaining*/
WHILE $N' \neq \emptyset$ DO
    FOR EACH $v \in V$ DO
        Sort $M = N_1, \dots, N_k$ such that
            $\frac{cost(v, N_i)}{cost(r, N_i)} \le \frac{cost(v, N_{i+1})}{cost(r, N_{i+1})}$
        Find $j \in \{1, \dots, k\}$ that minimizes
            $norm(v) = \frac{cost(r, v) + \sum_{i=1}^{j} cost(v, N_i)}{\sum_{i=1}^{j} cost(r, N_i)}$
        $M(v) \leftarrow \{N_1, \dots, N_j\}$ /*store groups connected to $v$*/
    ENDFOR
    Find $v_{min}$ with the minimum $norm(v)$
    $P \leftarrow (r, v_{min}, M(v_{min}))$ /*partial star with root $r$, intermediate $v_{min}$*/
    $N' \leftarrow N' - groups(P)$
    $Approx2(r) \leftarrow Approx2(r) \cup P$ /*add partial star to solution*/
ENDWHILE
Output: A low-cost 2-star $Approx2(r)$ with the root $r$ intersecting each group $N_i$
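The greedy loop of the depth-bounded approximation can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation. Several choices are assumptions: `dist` is a precomputed metric closure of a connected graph, the cost of connecting a vertex to a group is the distance to its nearest member, only the remaining groups are sorted in each round, each intermediate node keeps the best prefix of that sorted order, and groups that already contain the root are treated as covered to avoid dividing by zero. The names `group_cost` and `approx2` are hypothetical.

```python
def group_cost(dist, v, group):
    """Cheapest connection from vertex v to any member of the group."""
    return min(dist[v][u] for u in group)


def approx2(dist, groups, r):
    """Greedy 2-star rooted at r; returns (total cost, partial stars)."""
    n = len(dist)
    cost_r = [group_cost(dist, r, g) for g in groups]
    # Assumption: groups containing r (cost 0) are already covered.
    remaining = {i for i, c in enumerate(cost_r) if c > 0}
    total, stars = 0, []
    while remaining:
        best = None  # (norm, intermediate vertex, chosen group indices)
        for v in range(n):
            cost_v = {i: group_cost(dist, v, groups[i]) for i in remaining}
            # Sort remaining groups by cost(v, N_i) / cost(r, N_i).
            order = sorted(remaining,
                           key=lambda i: (cost_v[i] / cost_r[i], i))
            num, den = dist[r][v], 0
            for j, i in enumerate(order, start=1):
                num += cost_v[i]
                den += cost_r[i]
                if best is None or num / den < best[0]:
                    best = (num / den, v, order[:j])
        _, v_min, chosen = best
        total += dist[r][v_min] + sum(
            group_cost(dist, v_min, groups[i]) for i in chosen)
        stars.append((v_min, chosen))
        remaining -= set(chosen)
    return total, stars


# Hypothetical metric closure: vertex 1 is a hub close to both groups.
dist = [[0, 10, 11, 11],
        [10, 0, 1, 1],
        [11, 1, 0, 2],
        [11, 1, 2, 0]]
groups = [{2}, {3}]
result = approx2(dist, groups, 0)  # -> (12, [(1, [0, 1])])
```

With root 0, the sketch picks vertex 1 as the single intermediate node and connects both groups through it for a cost of 12, instead of paying 22 for two direct connections from the root.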
Figure 1: Construction of metric closure
Figure 2: Transformation of input graph
Figure 3: 1-star tree rooted at 𝑟
Figure 4: 2-star tree made of three partial-stars
The time complexity of the Group Steiner Heuristic is $O(\alpha + |V|^2 \cdot k^2 \cdot \log k)$,
where $k$ is the number of groups and $\alpha$ is the time it takes to compute all-pairs
shortest paths using the Floyd-Warshall algorithm.
CUDA-Aware MPI-Based Approach
 Our implementation of the Group Steiner Heuristic [1] is
based on CUDA-Aware MPI.
 We parallelize the algorithm by exploiting the need to
construct rooted 2-star trees with each vertex 𝑣 in the graph
as a possible root.
 All such trees can be constructed independently on separate
processes.
 The metric closure is constructed by computing all-pairs shortest
paths using a blocked Floyd-Warshall algorithm on a GPU [2].
 There is no communication during 2-star tree construction
(the most time-consuming step), making our implementation
scalable.
 Note the communication pattern when we launch as many processes as there are vertices in
the graph. If fewer processes are launched, a round-robin work distribution scheme is used
to assign roots.
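One common reading of round-robin assignment is that root `r` goes to the process with rank `r mod p`. The sketch below shows that interpretation only; the poster does not spell out its exact scheme, and the function name `roots_for_rank` is hypothetical.

```python
def roots_for_rank(rank, num_procs, num_vertices):
    """Roots handled by one process under an assumed cyclic distribution."""
    return list(range(rank, num_vertices, num_procs))


# Example: 10 candidate roots spread over 4 processes.
assignment = {rank: roots_for_rank(rank, 4, 10) for rank in range(4)}
```

Each root is handled by exactly one process, and the number of roots per process differs by at most one. Because the cost of building a 2-star varies by root, equal counts do not guarantee equal work, which is in line with the load imbalance noted in the discussion.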
Two-star tree construction
IF master
    G ← read_graph(filename)
    (Mc, P) ← FLOYD_APSP_CUDA(G) /*get metric closure and predecessors matrix*/
ENDIF
BROADCAST(Mc) /*from master*/
onestar_all ← [ ]
WHILE there is a remaining root r
    onestar ← build_onestar(Mc, r)
    GATHER(onestar, onestar_all) /*to master*/
ENDWHILE
BROADCAST(onestar_all) /*from master*/
global_min ← { }
local_min ← { }
WHILE there is a remaining root r
    twostar ← build_twostar(Mc, r, onestar_all)
    IF cost(twostar) < cost(local_min)
        local_min ← twostar
    ENDIF
ENDWHILE
REDUCE(local_min, global_min) /*to master*/
IF master
    build_sol_graph(global_min, P, Mc)
ENDIF
Figure: CUDA-Aware MPI-based approach. Flowchart steps: read input graph from file; construct metric closure on a GPU; broadcast metric closure (MPI_Bcast); each process constructs 1-star for assigned root; collect 1-stars (MPI_Gather); broadcast all 1-stars; each process constructs 2-star for assigned root; find global minimum-cost tree (MPI_Reduce); build solution graph.
[Figure labels: A partial-star; Intermediate nodes]
[Plot axes: Runtime (s); Number of Processes]
Acknowledgment
This research was supported by:
 CUDA Teaching Center Program, NVIDIA Research
 Faculty Research Committee and Interdisciplinary Science Program, Trinity College
 Blue Waters Student Internship Program
 Tens of thousands of non-overlapping nets may need to be routed
simultaneously in large-scale circuit design.
 Conventional procedures of VLSI global routing estimation assume a one-to-one
correspondence between terminals and ports.
