post119s1-file2

Performance
Discussion and Future Directions
Group Steiner Problem
 Given an undirected weighted graph 𝐺 = (𝑉, 𝐸) and a family 𝑁 = 𝑁1, … , 𝑁𝑘 of 𝑘 disjoint groups of nodes 𝑁𝑖 ⊆ 𝑉 , find a
minimum-cost tree which contains at least one node from each group 𝑁𝑖.
 Steiner nodes, i.e., nodes that are not a member of any group 𝑁𝑖, can optionally be used to interconnect the groups. In the
figure, the solid dots represent non-Steiner nodes, hollow dots represent optional Steiner nodes and the boxes represent groups.
 The solution to the Group Steiner Problem (GSP) can be used to find an efficient way of connecting a subset of a group of
vertices in a graph. Applications of this problem and its variants include routing of VLSI circuits and computer-aided design.
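The definition above can be made concrete with a small feasibility check. This is a minimal sketch, not part of the original implementation: the helper names `is_feasible` and `tree_cost`, the edge-list representation `(u, v, weight)`, and the toy instance are all assumptions made for illustration.

```python
def tree_cost(edges):
    """Total weight of a list of (u, v, weight) edges."""
    return sum(w for _, _, w in edges)


def is_feasible(edges, groups):
    """True if the edges form a tree that contains at least one node of every group."""
    nodes = set()
    for u, v, _ in edges:
        nodes.update((u, v))
    # A tree on n nodes has exactly n - 1 edges.
    if len(edges) != len(nodes) - 1:
        return False
    # Check connectivity with a depth-first search.
    adj = {n: [] for n in nodes}
    for u, v, _ in edges:
        adj[u].append(v)
        adj[v].append(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    if seen != nodes:
        return False
    # Every group must be touched by the tree.
    return all(nodes & set(g) for g in groups)


# Hypothetical instance: three groups and one optional Steiner node "s".
groups = [{"a1", "a2"}, {"b1"}, {"c1", "c2"}]
tree = [("a1", "s", 2), ("s", "b1", 1), ("s", "c2", 3)]
```

Here `tree` connects one node from each group through the Steiner node `s`, so it is a feasible (not necessarily optimal) solution; dropping its last edge leaves the third group uncovered.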
A feasible solution for a Group Steiner Problem instance
 In the VLSI design flow, logical design is followed by planning the physical layout, which involves partitioning, placement, and routing.
VLSI Routing Estimation
 In practice, however, each terminal consists of a large collection of
electrically equivalent ports, a fact that is not accounted for in
layout steps such as wiring estimation. Each module can also be
rotated or flipped, giving eight possible locations for a given port.
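The eight locations come from combining four rotations with an optional mirror flip. The following is a minimal sketch of that enumeration; the function name `port_orientations` and the convention that a port is given as an offset from the module center are assumptions for illustration.

```python
def port_orientations(x, y):
    """Images of a port offset (x, y) under 4 rotations, each with and without a flip."""
    positions = []
    px, py = x, y
    for _ in range(4):
        positions.append((px, py))    # rotated copy
        positions.append((-px, py))   # rotated copy, mirrored about the y-axis
        px, py = -py, px              # rotate by 90 degrees counter-clockwise
    return positions


# Hypothetical port at offset (2, 1) from the module center.
candidates = port_orientations(2, 1)
```

For an asymmetric offset such as (2, 1), all eight positions are distinct, and together they form one group of the Group Steiner Problem instance. A port lying on a symmetry axis of the module yields fewer distinct positions.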
Different wiring layouts using multiple equivalent ports
A module is rotated and flipped to induce a group of eight terminal positions
Placement of standard cells
After global routing
After detailed routing
 Routing:
Input: Terminal list, locations of terminals and pins/ports
Output: Geometric layout of all nets
Objective: Minimize the total wire length needed to complete all connections without increasing chip area
 All experiments were run on the Blue Waters supercomputer, a Cray XE6/XK7 system.
 We used several VLSI WRP (Wire Routing Problem) instances from the SteinLib datasets [3], with
graph sizes ranging from 128 to 3,168 vertices.
 We ran a strong-scalability test by increasing the number of processors while keeping the problem
size constant (Figure 5).
 We also compared our implementation with the best known sequential performance, reported in [4]. The
results show that our implementation achieves up to a 302x speedup for a graph of 3,168 vertices
(Figure 6).
 Table 1 compares the cost of our solutions with the optimum costs from [4].
Figure 5: Strong-scalability test, including a comparison with our serial implementation, which took 5852.4 s for a graph of 2,518 vertices.
Figure 6: Runtime comparison with [4] (axes: Number of Vertices, Runtime (s)).
GPU-Accelerated VLSI Routing Using Group Steiner Trees
Basileal Imana, Venkata Suhas Maringanti and Peter Yoon
Department of Computer Science, Trinity College, Hartford, CT
Specifications: The master process runs on a GPU-enabled XK7 compute
node, while all other processes (slaves) run on traditional XE6 compute nodes.
XK7: one AMD 6200 Interlagos CPU, 23 GB host memory; one NVIDIA
GK110 “Kepler” accelerator, 6 GB memory
XE6: two AMD 6276 Interlagos CPUs, 2.3 GHz, 64 GB physical memory,
16 cores
Discussion
 Our implementation outperforms existing serial implementations
while producing accurate solutions for WRP problem instances.
 The algorithm refines its solution adaptively over several steps and is
highly dynamic, which makes it hard to predict the size of a solution
before actually computing it. This causes some load imbalance.
 Due to the dynamic nature of the problem, our implementation
exhibits some irregular memory access patterns.
Future Directions
 Design more efficient work-distribution schemes to decrease the
load imbalance.
 Overlap communication with computation where possible.
 Improve the efficiency of memory access patterns.
Graph     Size (vertices)   Terminals   Opt. cost [4]   Approx. cost   Error %
wrp3-11   128               11          1100361         1100427        0.006
wrp3-39   703               39          3900450         3900600        0.004
wrp3-96   2518              96          96001172        96003009       0.002
wrp3-83   3168              83          8300906         8302279        0.017
Table 1: Accuracy Comparison
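The Error % column is consistent with the relative gap between the approximate and optimum costs, 100 · (approx − opt) / opt. A short sketch that recomputes it from the table values follows; the function name `error_percent` is an assumption, and the formula is inferred from the numbers rather than stated on the poster.

```python
# (instance, optimum cost from [4], approximate cost) as listed in Table 1
rows = [
    ("wrp3-11", 1100361, 1100427),
    ("wrp3-39", 3900450, 3900600),
    ("wrp3-96", 96001172, 96003009),
    ("wrp3-83", 8300906, 8302279),
]


def error_percent(opt, approx):
    """Relative excess of the approximate cost over the optimum, in percent."""
    return 100.0 * (approx - opt) / opt


errors = {name: round(error_percent(opt, approx), 3) for name, opt, approx in rows}
```

Rounded to three decimals, the recomputed values match the table entries.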
References
[1] Helvig, C. S., Robins, G., and Zelikovsky, A. New Approximation Algorithms for Routing
with Multi-Port Terminals. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 19(10), 1118–1128.
[2] Lund, B. D., and Smith, J. W. A Multi-Stage CUDA Kernel for Floyd-Warshall. CoRR
abs/1001.4108 (2010).
[3] Koch, T., Martin, A., and Voß, S. SteinLib: An Updated Library on Steiner Tree Problems
in Graphs. In Cheng, X. and Du, D. Z., eds., Steiner Trees in Industry, Springer US, Berlin,
2001, 285–326.
[4] Polzin, T., and Vahdati, S. The Steiner Tree Challenge: An Updated Study. 11th DIMACS
Implementation Challenge. Retrieved July 06, 2015, from Princeton University:
http://dimacs11.cs.princeton.edu/papers/PolzinVahdatiDIMACS.pdf
The Group Steiner Heuristic [1]
1. Replace input graph 𝐺 = (𝑉, 𝐸) by its metric closure, i.e., a complete graph
with vertices 𝑉 and edge weights equal to the shortest path lengths.
(Figure 1)
2. For every terminal vertex 𝑣, create a new node 𝑣’ and a new zero-cost edge
(𝑣, 𝑣’). 𝑣’ takes the role of 𝑣; 𝑣’ becomes a port and 𝑣 a non-port. (Figure 2)
3. Construct a rooted 1-star tree, i.e., a tree of depth 1 where all leaves are
terminal nodes, one from each group. (Figure 3)
4. Select intermediate nodes and determine the set of groups that should be
connected to each intermediate node to form a partial-star. (Figure 4)
5. Combine the partial-stars to obtain a 2-star tree, i.e., a tree of depth 2.
(Figure 4)
The 2-star has to be rooted at the center of the optimal solution-tree. The
center, however, cannot be determined ahead of time. Therefore, a Steiner 2-
star is constructed with each node 𝑣 in 𝐺 as root 𝑟 and the minimum-cost tree
is determined over all possible choices of 𝑟.
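Step 1 above, the metric closure, can be sketched with the textbook Floyd-Warshall recurrence. This is a plain sequential illustration under assumed conventions (vertices numbered 0..n-1, undirected `(u, v, weight)` edges, the name `metric_closure`); it is not the blocked GPU kernel of [2].

```python
import math


def metric_closure(n, edges):
    """All-pairs shortest path lengths of an undirected weighted graph."""
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        # Keep the cheapest edge if parallel edges are present.
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    # Allow paths through intermediate vertex k, for k = 0..n-1.
    for k in range(n):
        for i in range(n):
            d_ik = dist[i][k]
            for j in range(n):
                if d_ik + dist[k][j] < dist[i][j]:
                    dist[i][j] = d_ik + dist[k][j]
    return dist


# Hypothetical triangle: the direct edge 0-2 (weight 5) is beaten by 0-1-2 (weight 3).
closure = metric_closure(3, [(0, 1, 1), (1, 2, 2), (0, 2, 5)])
```

In the closure every pair of vertices is joined by an edge whose weight is the shortest path length, so later steps can treat the graph as complete.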
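Input: A graph $G = (V, E)$, a family $N$ of $k$ disjoint groups $N_1, \dots, N_k \subseteq V$, and a root $r \in V$
$\mathit{Approx}_2(r) \leftarrow \{r\}$ /*add root to solution*/
$N' \leftarrow N$ /*begin with all groups remaining*/
WHILE $N' \neq \emptyset$ DO
    FOR EACH $v \in V$ DO
        Sort $M = N_1, \dots, N_k$ such that
            $$\frac{\mathrm{cost}(v, N_i)}{\mathrm{cost}(r, N_i)} \le \frac{\mathrm{cost}(v, N_{i+1})}{\mathrm{cost}(r, N_{i+1})}$$
        Find $j \in \{1, \dots, k\}$ that minimizes
            $$\mathrm{norm}(v) = \frac{\mathrm{cost}(r, v) + \sum_{i=1}^{j} \mathrm{cost}(v, N_i)}{\sum_{i=1}^{j} \mathrm{cost}(r, N_i)}$$
        $M(v) \leftarrow \{N_1, \dots, N_j\}$ /*store groups connected to $v$*/
    ENDFOR
    Find $v_{min}$ with the minimum $\mathrm{norm}(v)$
    $P \leftarrow (r, v_{min}, M(v_{min}))$ /*partial star with root $r$, intermediate $v_{min}$*/
    $N' \leftarrow N' - \mathrm{groups}(P)$
    $\mathit{Approx}_2(r) \leftarrow \mathit{Approx}_2(r) \cup P$ /*add partial star to solution*/
ENDWHILE
Output: A low-cost 2-star $\mathit{Approx}_2(r)$ with the root $r$ intersecting each group $N_i$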
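The greedy loop above can be sketched in a few lines of Python. This is an illustrative reading of the pseudocode, not the authors' code: the name `approx2`, the data layout (a metric-closure matrix `dist` and groups as lists of vertex indices), the choice to sort only the remaining groups, and the treatment of groups already touched by the root are assumptions. The graph is assumed to be connected so that all distances are finite.

```python
def approx2(dist, groups, r):
    """Greedy 2-star rooted at r; returns (partial stars, total cost)."""

    def cost(v, g):
        # Distance from v to the nearest terminal of group g.
        return min(dist[v][u] for u in groups[g])

    # Groups the root already belongs to (cost 0) need no connection.
    remaining = {g for g in range(len(groups)) if cost(r, g) > 0}
    stars, total = [], 0
    while remaining:
        best = None  # (norm, intermediate node, groups, star cost)
        for v in range(len(dist)):
            # Order groups by how much cheaper they are from v than from r.
            order = sorted(remaining, key=lambda g: cost(v, g) / cost(r, g))
            num, den = dist[r][v], 0
            for j, g in enumerate(order, start=1):
                num += cost(v, g)
                den += cost(r, g)
                if best is None or num / den < best[0]:
                    best = (num / den, v, order[:j], num)
        _, v_min, chosen, star_cost = best
        stars.append((r, v_min, chosen))   # partial star with intermediate v_min
        total += star_cost
        remaining -= set(chosen)
    return stars, total


# Hypothetical metric closure: root 0, hub 1, terminals 2, 3, 4 (one per group).
dist = [
    [0, 1, 2, 2, 2],
    [1, 0, 1, 1, 1],
    [2, 1, 0, 2, 2],
    [2, 1, 2, 0, 2],
    [2, 1, 2, 2, 0],
]
stars, total = approx2(dist, [[2], [3], [4]], 0)
```

In this toy instance the hub (vertex 1) attains the smallest norm, 4/6, when it serves all three groups, so a single partial star of cost 4 covers everything, compared with cost 6 for connecting each terminal directly to the root.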
Depth Bounded Tree Approximation
Figure 1: Construction of metric closure
Figure 2: Transformation of input graph
Figure 3: 1-star tree rooted at 𝑟
Figure 4: 2-star tree made of three partial-stars
The time complexity of the Group Steiner Heuristic is $O(\alpha + |V|^2 \cdot k^2 \cdot \log k)$,
where $k$ is the number of groups and $\alpha$ is the time it takes to compute all-pairs
shortest paths using the Floyd-Warshall algorithm.
CUDA-Aware MPI-Based Approach
 Our implementation of the Group Steiner Heuristic [1] is
based on CUDA-Aware MPI.
 We parallelize the algorithm by exploiting the need to
construct rooted 2-star trees with each vertex 𝑣 in the graph
as a possible root.
 All such trees can be constructed independently on separate
processes.
 The metric closure is constructed by computing all-pairs shortest
paths with a blocked Floyd-Warshall algorithm on a GPU [2].
 There is no communication during 2-star tree construction
(the most time-consuming step), making our implementation
scalable.
 The diagram shows the communication pattern when as many processes are launched as there are vertices in
the graph. If fewer processes are launched, a round-robin work-distribution scheme is used
to assign roots.
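A round-robin assignment of roots to processes can be sketched as follows. The poster does not spell out the exact mapping, so the cyclic rule used here (root i goes to rank i mod p) and the name `roots_for_rank` are assumptions.

```python
def roots_for_rank(rank, num_procs, num_vertices):
    """Roots handled by one process under a cyclic (round-robin) distribution."""
    return list(range(rank, num_vertices, num_procs))


# Hypothetical run: 10 vertices shared by 4 processes.
assignment = {rank: roots_for_rank(rank, 4, 10) for rank in range(4)}
```

With 10 vertices and 4 processes, ranks 0 and 1 each receive three roots and ranks 2 and 3 receive two. Because the cost of building a 2-star varies from root to root, an even split of roots does not by itself guarantee an even split of work.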
Two-star tree construction
IF master
    𝐺 ← read_graph(𝑓𝑖𝑙𝑒𝑛𝑎𝑚𝑒)
    (𝑀𝑐, 𝑃) ← FLOYD_APSP_CUDA(𝐺) /*get metric closure and predecessors matrix*/
ENDIF
BROADCAST(𝑀𝑐) /*from master*/
𝑜𝑛𝑒𝑠𝑡𝑎𝑟_𝑎𝑙𝑙 ← [ ]
WHILE there is a remaining root 𝑟
    𝑜𝑛𝑒𝑠𝑡𝑎𝑟 ← build_onestar(𝑀𝑐, 𝑟)
    GATHER(𝑜𝑛𝑒𝑠𝑡𝑎𝑟, 𝑜𝑛𝑒𝑠𝑡𝑎𝑟_𝑎𝑙𝑙) /*to master*/
ENDWHILE
BROADCAST(𝑜𝑛𝑒𝑠𝑡𝑎𝑟_𝑎𝑙𝑙) /*from master*/
𝑔𝑙𝑜𝑏𝑎𝑙_𝑚𝑖𝑛 ← { }
𝑙𝑜𝑐𝑎𝑙_𝑚𝑖𝑛 ← { }
WHILE there is a remaining root 𝑟
    𝑡𝑤𝑜𝑠𝑡𝑎𝑟 ← build_twostar(𝑀𝑐, 𝑟, 𝑜𝑛𝑒𝑠𝑡𝑎𝑟_𝑎𝑙𝑙)
    IF cost(𝑡𝑤𝑜𝑠𝑡𝑎𝑟) < cost(𝑙𝑜𝑐𝑎𝑙_𝑚𝑖𝑛)
        𝑙𝑜𝑐𝑎𝑙_𝑚𝑖𝑛 ← 𝑡𝑤𝑜𝑠𝑡𝑎𝑟
    ENDIF
ENDWHILE
REDUCE(𝑙𝑜𝑐𝑎𝑙_𝑚𝑖𝑛, 𝑔𝑙𝑜𝑏𝑎𝑙_𝑚𝑖𝑛) /*to master*/
IF master
    build_sol_graph(𝑔𝑙𝑜𝑏𝑎𝑙_𝑚𝑖𝑛, 𝑃, 𝑀𝑐)
ENDIF
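The local-minimum and reduce pattern in the pseudocode can be imitated without MPI by looping over ranks sequentially. The sketch below is a simplification under stated assumptions: it scores each root by its 1-star cost rather than by a full 2-star, the body of `build_onestar` is a guess at what that routine computes (nearest terminal of each group), and `best_root` is a hypothetical helper. The graph is assumed to be connected.

```python
import math


def build_onestar(dist, groups, r):
    """1-star rooted at r: connect r to the nearest terminal of every group."""
    leaves = [min(g, key=lambda u: dist[r][u]) for g in groups]
    return sum(dist[r][u] for u in leaves), leaves


def best_root(dist, groups, num_procs):
    """Simulate per-process local minima followed by a global min-reduction."""
    n = len(dist)
    local_mins = []
    for rank in range(num_procs):              # each pass stands in for one process
        local = (math.inf, -1)                 # (cost, root); nothing found yet
        for r in range(rank, n, num_procs):    # round-robin roots for this rank
            c, _ = build_onestar(dist, groups, r)
            local = min(local, (c, r))
        local_mins.append(local)
    return min(local_mins)                     # stands in for a min-reduce to the master


# Hypothetical metric closure: vertices 0..4, one terminal per group.
dist = [
    [0, 1, 2, 2, 2],
    [1, 0, 1, 1, 1],
    [2, 1, 0, 2, 2],
    [2, 1, 2, 0, 2],
    [2, 1, 2, 2, 0],
]
result = best_root(dist, [[2], [3], [4]], num_procs=2)
```

No data is exchanged while the per-root trees are built; only the final comparison of local minima requires communication, which is the property the pseudocode relies on.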
Figure 4 labels: A partial-star; Intermediate nodes
Figure 5 axes: Runtime (s), Number of Processes
CUDA-Aware MPI-based approach (flowchart):
1. Read input graph from file
2. Construct metric closure on a GPU
3. Broadcast metric closure (MPI_Bcast)
4. Each process constructs 1-star for assigned root
5. Collect 1-stars (MPI_Gather)
6. Broadcast all 1-stars
7. Each process constructs 2-star for assigned root
8. Find global minimum-cost tree (MPI_Reduce)
9. Build solution graph
Acknowledgment
This research was supported by:
 CUDA Teaching Center Program, NVIDIA Research
 Faculty Research Committee and Interdisciplinary Science Program, Trinity College
 Blue Waters Student Internship Program
 Tens of thousands of non-overlapping nets may need to be routed
simultaneously in large-scale circuit design.
 Conventional procedures of VLSI global routing estimation assume a one-to-one
correspondence between terminals and ports.
