NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Mohammad Opada Al-Bosh Mohammad Tahsin Al-Shalabi
Ruba Break Mariam Al-kassar Nagham Ballan

Outline
 Background
 Breadth-first Search (BFS)
 NUMA-optimized parallel BFS
 Numerical Results
 Conclusion

Background
 Large-scale graphs in various fields
 US road network: 58 million edges
 Twitter follower network: 1.47 billion edges
 Neuronal network: 100 trillion edges

Background
 Fast and scalable graph processing by using HPC

Importance of graph processing
 Application field
 Transportation
 Social network
 Cyber-security
 Bioinformatics
- Step 3:
• concurrent search (breadth first search)
• optimization (single source shortest path)
• edge-oriented (maximal independent set)

Breadth-first search
 BFS is an important and fundamental graph-processing kernel
– Standalone: obtains the distance (number of hops) between vertices
– Building block of many algorithms (BC (betweenness centrality), max-flow, maximal independent set)
 Obstacles to fast and scalable BFS computation
- low arithmetic intensity
- irregular memory accesses

Graph500 Benchmark
 Measures computer performance in graph processing such as BFS (breadth-first search) using the TEPS ratio
 TEPS ratio = number of traversed edges per second
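
The TEPS definition above can be written out as a one-line computation. This is a minimal sketch, not the Graph500 reference code: the function name `teps` and the sample edge count and timing are hypothetical values chosen only for illustration.

```python
def teps(traversed_edges: int, seconds: float) -> float:
    """Traversed edges per second for a single BFS run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return traversed_edges / seconds


# Hypothetical run: 1,000,000 edges traversed in 0.5 seconds.
rate = teps(1_000_000, 0.5)
print(f"{rate / 1e9:.3f} GTEPS")
```

A GTEPS figure is the same ratio divided by 10^9.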

Contribution
 Efficient hybrid BFS algorithm [Beamer2011,2012]
 reduces unnecessary edge traversal
Our proposal
- NUMA-optimized hybrid algorithm
- Improves locality of memory access
– Library for careful NUMA-aware binding
– Column-wise graph partitioning

 Example
4-way Intel Xeon E5 (64 logical CPU cores)
• Scalable: scales well up to 64 threads
• Fast: 11.15 GTEPS, a 2.2x speedup over the original hybrid algorithm

Breadth-first Search (BFS)
 Obtains the level of each vertex from the source vertex
 Level = number of hops away from the source
Input:
Graph G and source vertex
Output:
BFS tree rooted at the source
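
The input and output described above can be illustrated with a small sequential BFS. This is a sketch under assumptions: the graph is given as a dict of adjacency lists (a format chosen here, not taken from the slides), and the tree is returned as a parent map in which the source is its own parent.

```python
from collections import deque


def bfs(adj, source):
    """Return (level, parent) for every vertex reachable from source.

    adj maps each vertex to a list of its neighbors (assumed input format).
    """
    level = {source: 0}
    parent = {source: source}  # the root points to itself
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:  # first visit fixes level and tree edge
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent


# Small undirected example graph.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
level, parent = bfs(graph, 0)
print(level)
print(parent)
```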

Hybrid BFS for low-diameter graphs
 Efficient for low-diameter graphs
– scale-free and/or small-world graphs such as social networks
 Used at the higher ranks of the Graph500 benchmark
 Hybrid algorithm
– combines the top-down and bottom-up algorithms
– reduces unnecessary edge traversal

Hybrid algorithm
Top-down algorithm: efficient for a small frontier
Bottom-up algorithm: efficient for a large frontier

Top-down algorithm
 Explores outgoing edges of the frontier queue QF
 Appends unvisited vertices to the neighbor queue QN
• Efficient for a small frontier
• Traverses edges unnecessarily for a large frontier
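
One top-down step can be sketched as follows. The representation is an assumption made for illustration: vertices are numbered 0..n-1, `adj` is a list of adjacency lists, and `parent[v] == -1` marks an unvisited vertex. The step also counts scanned edges so the wasted work on a large frontier becomes visible.

```python
def top_down_step(adj, frontier, parent):
    """Scan the outgoing edges of QF; return (QN, number of edges scanned)."""
    neighbors = []
    scanned = 0
    for v in frontier:
        for w in adj[v]:
            scanned += 1
            if parent[w] == -1:  # unvisited: claim it and enqueue
                parent[w] = v
                neighbors.append(w)
    return neighbors, scanned


# Vertex 0 is connected to 1, 2, 3; vertices 1 and 2 are also connected.
graph = [[1, 2, 3], [0, 2], [0, 1], [0]]
parent = [-1] * len(graph)
parent[0] = 0

qn1, scanned1 = top_down_step(graph, [0], parent)   # small frontier
qn2, scanned2 = top_down_step(graph, qn1, parent)   # large frontier
print(qn1, scanned1)
print(qn2, scanned2)
```

In the second step every scanned edge leads to an already visited vertex, which is the unnecessary traversal the slide refers to.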

Bottom-up algorithm
 Searches from each unvisited vertex for a neighbor in the frontier queue QF
 Appends each unvisited vertex adjacent to the frontier to the neighbor queue QN
• Efficient for a large frontier
• Traverses edges unnecessarily for a small frontier
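
A bottom-up step, and one way a hybrid driver might switch between the two directions, can be sketched as below. Several things here are assumptions for illustration only: the graph is undirected and stored as a list of adjacency lists, `parent[v] == -1` means unvisited, and the switch uses a plain frontier-size `threshold` rather than the heuristic of the cited hybrid algorithm.

```python
def top_down_step(adj, frontier, parent):
    """Frontier vertices claim their unvisited neighbors."""
    neighbors = []
    for v in frontier:
        for w in adj[v]:
            if parent[w] == -1:
                parent[w] = v
                neighbors.append(w)
    return neighbors


def bottom_up_step(adj, in_frontier, parent):
    """Each unvisited vertex looks for any neighbor that is in the frontier."""
    neighbors = []
    for w in range(len(adj)):
        if parent[w] != -1:
            continue
        for v in adj[w]:
            if in_frontier[v]:
                parent[w] = v
                neighbors.append(w)
                break  # one frontier neighbor is enough; skip the rest
    return neighbors


def hybrid_bfs(adj, source, threshold):
    """Use bottom-up when the frontier is larger than threshold (assumed rule)."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    while frontier:
        if len(frontier) > threshold:
            in_frontier = [False] * n
            for v in frontier:
                in_frontier[v] = True
            frontier = bottom_up_step(adj, in_frontier, parent)
        else:
            frontier = top_down_step(adj, frontier, parent)
    return parent


graph = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
tree = hybrid_bfs(graph, 0, threshold=1)
print(tree)
```

The early `break` is what saves edge traversals on a large frontier: an unvisited vertex stops scanning as soon as it finds a parent.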

How to speed up the hybrid algorithm?
 NUMA architecture
– Non-uniform memory access
– Each CPU socket has a local RAM
– Local RAM is fast; non-local RAM is slow
4-socket Intel Xeon E5 system

 Frequent non-local memory accesses on NUMA architecture

Difficulty of considering NUMA architecture
1. How to distribute the graph and data to each local RAM?
2. How to bind each partial graph and its data to a NUMA unit?

ULIBC: Ubiquity Library for Intelligently Binding Cores
1. NUMACTL (command line tool, library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
3. ULIBC (Our library, library for C/C++)
– Processor ID : index of logical processor core
– Package ID : index of CPU socket
– Core ID : index of physical core in each CPU socket
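
The relationship between the three IDs can be illustrated with a small mapping function. This is not ULIBC's API: the function `describe`, the `CpuIds` type, and the assumed enumeration order (logical processors numbered socket by socket, with hardware threads of one core adjacent) are all hypothetical, and real operating systems may number processors differently.

```python
from typing import NamedTuple


class CpuIds(NamedTuple):
    processor_id: int  # index of the logical processor core
    package_id: int    # index of the CPU socket
    core_id: int       # index of the physical core within its socket


def describe(processor_id, sockets, cores_per_socket, threads_per_core):
    """Derive package and core IDs under the assumed enumeration order."""
    per_socket = cores_per_socket * threads_per_core
    if not 0 <= processor_id < sockets * per_socket:
        raise ValueError("processor_id out of range")
    package_id = processor_id // per_socket
    core_id = (processor_id % per_socket) // threads_per_core
    return CpuIds(processor_id, package_id, core_id)


# Assumed topology: 4 sockets x 8 physical cores x 2 hardware threads = 64.
ids = describe(17, sockets=4, cores_per_socket=8, threads_per_core=2)
print(ids)
```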

NUMA-opt. Column-wise Graph Partitioning
 Divides G=(V,A) into partial graphs Gk=(Vk,Ak) and binds Gk to local RAM k
– Ak is the set of adjacency lists that hold the incoming edges to Vk
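
The partitioning rule can be sketched in a few lines. The assumptions are labeled here and in the code: vertices are split into contiguous, equally sized blocks (the slides do not state how Vk is chosen), edges are (source, target) pairs, and the function name `partition_columnwise` is hypothetical. Binding each part to a local RAM cannot be expressed in plain Python and is left out.

```python
def partition_columnwise(n, edges, k_parts):
    """Return a list of k_parts dicts; part k maps each w in V_k to the
    sources of its incoming edges, i.e. the adjacency lists A_k."""
    block = (n + k_parts - 1) // k_parts  # assumed: contiguous blocks
    parts = [{} for _ in range(k_parts)]
    for w in range(n):
        parts[w // block][w] = []
    for v, w in edges:
        # An edge v -> w is stored with the owner of its target w.
        parts[w // block][w].append(v)
    return parts


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 0)]
parts = partition_columnwise(4, edges, k_parts=2)
for k, a_k in enumerate(parts):
    print(k, a_k)
```

Because every incoming edge of a vertex in Vk lives in part k, a thread working on part k only writes visited and tree data for vertices it owns.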

NUMA-optimized Top-down
 Explores outgoing edges of the frontier queue QF
 Appends unvisited vertices to the neighbor queue QN
• Efficient for a small frontier
• Traverses edges unnecessarily for a large frontier

Details of NUMA-optimized Top-down

NUMA-optimized Bottom-up
 Searches from each unvisited vertex for a neighbor in the frontier queue QF
 Appends each unvisited vertex adjacent to the frontier to the neighbor queue QN
• Efficient for a large frontier
• Traverses edges unnecessarily for a small frontier

Details of NUMA-optimized Bottom-up

Machine specification
 4-way Intel Xeon E5
– CentOS 6.4 (Kernel 2.6.32)
– GCC 4.4.7
– 64 logical CPU cores
– 4 NUMA units x 16 logical cores
 4-way AMD Opteron 6174
– Fedora 19 (Kernel 3.11.2)
– GCC 4.8.1
– 48 CPU cores
– 8 NUMA units x 6 cores

TEPS ratio varied with problem size
 NUMA-optimized BFS is 2.2x faster than the original hybrid algorithm
 NUMA-optimized BFS achieves 11.15 GTEPS for a Kronecker graph (SCALE 26)

Strong scaling on Intel/AMD System
Scales well as the number of threads grows up to the number of cores

Graph500 benchmark
 Fastest single-node entry on the 4th list (June 2012)
 Fastest CPU-based single-node entry on the 6th list (June 2013)

1st Green Graph500 list in June 2013
 Measures power efficiency using the TEPS/W ratio
 Results on various systems such as Android, Linux, and Mac
NUMA
Small Data category

Conclusion
 NUMA-optimized Hybrid BFS algorithm
– Reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA
 Numerical results on 4-way Intel Xeon
– scales well up to 64 threads (scalable)
– achieves 11.15 GTEPS (fast)
– 2.2x speedup compared with the original hybrid algorithm
 Graph500 and Green Graph500
– Fastest single-node in June 2012
– Most power-efficient in June 2013

Future work
 Further optimizing NUMA-optimized BFS
 Distributed-memory parallel computation

References
 Parallel Breadth-First Search on Distributed Memory Systems
[Aydın Buluç, Kamesh Madduri; Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA; {ABuluc, KMadduri}@lbl.gov]
 A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
[Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, Ümit Çatalyürek]
 Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
[Scott Beamer, Aydın Buluç, Krste Asanović, David A. Patterson]
 Distributed Breadth First Search
[CSE 6220, Spring 2013, Georgia Institute of Technology, April 18; guest lecture by Anita Zakrzewska]
 Evaluation and Optimization of Breadth-First Search on NUMA Cluster
[Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate School of Chinese Academy of Sciences; {cuizehan, chenlicheng, cmy, baoyg, huangyongbing, lvhuiwei}@ict.ac.cn]

References
 Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory
[Roger Pearce, Maya Gokhale, Nancy M. Amato; Parasol Laboratory, Dept. of Computer Science and Engineering, Texas A&M University, College Station, TX; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA; {rpearce, mayag}@llnl.gov, {rpearce, amato}@cse.tamu.edu]
 Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems
[Huiwei Lu, Guangming Tan, Mingyu Chen, Ninghui Sun; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Argonne National Laboratory; huiweilu@mcs.anl.gov, tgm@ict.ac.cn, cmy@ict.ac.cn, snh@ncic.ac.cn]
 Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems
[Rudolf Berrendorf and Matthias Makulla; Computer Science Department, Bonn-Rhein-Sieg University, Sankt Augustin, Germany; rudolf.berrendorf@h-brs.de, mathias.makulla@h-brs.de]
