NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
Mohammad Opada Al-Bosh Mohammad Tahsin Al-Shalabi
Ruba Break Mariam Al-kassar Nagham Ballan
Outline
• Background
• Breadth-first Search (BFS)
• NUMA-optimized parallel BFS
• Numerical Results
• Conclusion
Background
• Large-scale graphs arise in various fields
– US road network: 58 million edges
– Twitter follower network: 1.47 billion edges
– Neuronal network: 100 trillion edges
Background
• Fast and scalable graph processing using HPC
Importance of graph processing
• Application fields
– Transportation
– Social networks
– Cyber-security
– Bioinformatics
- Step 3:
• concurrent search (breadth-first search)
• optimization (single-source shortest path)
• edge-oriented (maximal independent set)
Breadth-first search
• BFS is an important and fundamental graph-processing kernel
– Obtains distance relationships (hops) as a stand-alone computation
– Serves as a building block of many algorithms (BC, max-flow, maximal independent set)
• Obstacles to fast and scalable BFS computation
– Low arithmetic intensity
– Irregular memory accesses
Graph500 Benchmark
• Measures computer performance on graph processing such as BFS (breadth-first search) using the TEPS ratio
• TEPS ratio = number of traversed edges per second
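As a minimal sketch of the definition above, TEPS can be computed by dividing the number of edges a BFS run traversed by its running time. The function names and the sample numbers are illustrative assumptions, not Graph500 reference code.

```python
def teps(traversed_edges: int, seconds: float) -> float:
    """Traversed edges per second for a single BFS run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return traversed_edges / seconds


def gteps(traversed_edges: int, seconds: float) -> float:
    """Same ratio expressed in units of 10^9 edges per second (GTEPS)."""
    return teps(traversed_edges, seconds) / 1e9


# Hypothetical run: 1,000,000,000 edges traversed in 0.5 seconds.
example_gteps = gteps(1_000_000_000, 0.5)
```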
Contribution
• Efficient hybrid BFS algorithm [Beamer2011,2012]
– Reduces unnecessary edge traversals
Our proposal
• NUMA-optimized hybrid algorithm
• Improves locality of memory accesses
– Library that carefully accounts for NUMA
– Column-wise graph partitioning
• Example: 4-way Intel Xeon E5 (64 logical CPU cores)
– Scalable: scales well up to 64 threads
– Fast: 11.15 GTEPS, a 2.2x speedup over the original hybrid algorithm
Breadth-first Search (BFS)
• Obtains the level of each vertex relative to the source vertex
• Level = number of hops from the source
Input: graph G and a source vertex
Output: BFS tree rooted at the source
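The input and output described above can be sketched with a textbook sequential BFS. This is a plain reference version for illustration, not the parallel algorithm of the slides; the adjacency-list dictionary and the small example graph are assumptions.

```python
from collections import deque


def bfs(adj, source):
    """Return (level, parent) for every vertex reachable from source.

    level[v]  = number of hops from source to v
    parent[v] = predecessor of v in the BFS tree (the source is its own parent)
    """
    level = {source: 0}
    parent = {source: source}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:          # first time w is reached
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent


# Small undirected example graph (hypothetical).
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
level, parent = bfs(graph, 0)
```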
Hybrid BFS for low-diameter graphs
• Efficient for low-diameter graphs
– Graphs with scale-free and/or small-world properties, such as social networks
• Used by high-ranking entries in the Graph500 benchmark
• Hybrid algorithm
– Combines the top-down algorithm and the bottom-up algorithm
– Reduces unnecessary edge traversals
Hybrid algorithm
• Top-down algorithm: efficient for a small frontier
• Bottom-up algorithm: efficient for a large frontier
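A minimal sketch of the direction switch follows. The switching rule used here (go bottom-up when the frontier holds more than a fixed fraction of all vertices) and the `threshold` parameter are simplifying assumptions for illustration; the slides do not state the rule actually used. The graph is assumed undirected, so one adjacency list serves both directions.

```python
def hybrid_bfs(adj, source, threshold=0.25):
    """Level-synchronous BFS that picks a direction per level.

    Returns the parent array and the list of directions chosen per level.
    """
    n = len(adj)
    parent = [-1] * n                   # -1 marks an unvisited vertex
    parent[source] = source
    frontier = [source]
    directions = []
    while frontier:
        use_bottom_up = len(frontier) > threshold * n
        directions.append("bottom-up" if use_bottom_up else "top-down")
        next_frontier = []
        if use_bottom_up:
            in_frontier = set(frontier)
            for w in range(n):          # scan unvisited vertices
                if parent[w] == -1:
                    for v in adj[w]:
                        if v in in_frontier:
                            parent[w] = v
                            next_frontier.append(w)
                            break       # one frontier neighbor is enough
        else:
            for v in frontier:          # scan edges leaving the frontier
                for w in adj[v]:
                    if parent[w] == -1:
                        parent[w] = v
                        next_frontier.append(w)
        frontier = next_frontier
    return parent, directions


# Hypothetical graph: vertex 0 fans out to 1..5, which all connect to 6.
fan = [[1, 2, 3, 4, 5], [0, 6], [0, 6], [0, 6], [0, 6], [0, 6], [1, 2, 3, 4, 5]]
fan_parent, fan_directions = hybrid_bfs(fan, 0)
```

In this example the middle level has a frontier of five vertices out of seven, so the sketch switches to bottom-up for that level only.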
Top-down algorithm
• Explores the outgoing edges of the vertices in the frontier queue QF
• Appends unvisited neighbors to the neighbor queue QN
• Efficient for a small frontier
– Performs unnecessary edge traversals for a large frontier
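One top-down level can be sketched as follows. The names `top_down_step`, `frontier` (QF), and `next_frontier` (QN) and the list-based `parent` array with -1 for unvisited vertices are illustrative assumptions; the version is sequential.

```python
def top_down_step(adj, frontier, parent):
    """Expand one BFS level from the frontier outward.

    Every edge leaving a frontier vertex is checked, which is wasteful
    when the frontier is large and most targets are already visited.
    """
    next_frontier = []
    for v in frontier:
        for w in adj[v]:
            if parent[w] == -1:         # w has not been visited yet
                parent[w] = v
                next_frontier.append(w)
    return next_frontier


# Hypothetical example: first level from source 0.
td_adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
td_parent = [0, -1, -1, -1, -1]
td_next = top_down_step(td_adj, [0], td_parent)
```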
Bottom-up algorithm
• For each unvisited vertex, searches its neighbors for a vertex in the frontier queue QF
• Appends unvisited vertices adjacent to the frontier to the neighbor queue QN
• Efficient for a large frontier
– Performs unnecessary edge traversals for a small frontier
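One bottom-up level can be sketched in the same style. The function name and data layout are illustrative assumptions, and the graph is assumed undirected so that `adj[w]` also lists the incoming neighbors of `w`.

```python
def bottom_up_step(adj, frontier, parent):
    """Expand one BFS level by scanning unvisited vertices.

    Each unvisited vertex stops at its first neighbor found in the
    frontier, so few edges are checked when the frontier is large.
    """
    in_frontier = set(frontier)
    next_frontier = []
    for w in range(len(adj)):
        if parent[w] != -1:             # already visited
            continue
        for v in adj[w]:
            if v in in_frontier:
                parent[w] = v
                next_frontier.append(w)
                break                   # remaining edges of w are skipped
    return next_frontier


# Hypothetical example: second level, frontier is [1, 2].
bu_adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
bu_parent = [0, 0, 0, -1, -1]
bu_next = bottom_up_step(bu_adj, [1, 2], bu_parent)
```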
How to speed up the hybrid algorithm?
• NUMA architecture
– Non-uniform memory access
– Each CPU socket has its own local RAM
– Local RAM is fast; non-local RAM is slow
– Example: 4-socket Intel Xeon E5 system
• Frequent non-local memory accesses on a NUMA architecture
Difficulty of accounting for the NUMA architecture
1. How to distribute the graph and data across the local RAM of each socket?
2. How to bind each partial graph and its data to a NUMA unit?
ULIBC: Ubiquity Library for Intelligently Binding Cores
1. NUMACTL (command-line tool and library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
3. ULIBC (our library, for C/C++)
– Processor ID: index of a logical processor core
– Package ID: index of a CPU socket
– Core ID: index of a physical core within a CPU socket
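The relationship between the three IDs can be illustrated with a small lookup table. The topology below is hypothetical (2 sockets, 2 physical cores per socket, 2 hardware threads per core), and the function is not part of ULIBC's API; it only shows how logical processors group by socket.

```python
from collections import defaultdict

# Hypothetical topology rows: (processor_id, package_id, core_id).
# Processors 0 and 4 are the two hardware threads of package 0, core 0.
topology = [
    (0, 0, 0), (1, 0, 1), (2, 1, 0), (3, 1, 1),
    (4, 0, 0), (5, 0, 1), (6, 1, 0), (7, 1, 1),
]


def processors_by_package(rows):
    """Group logical processor IDs by the CPU socket they belong to."""
    groups = defaultdict(list)
    for processor_id, package_id, _core_id in rows:
        groups[package_id].append(processor_id)
    return dict(groups)


package_groups = processors_by_package(topology)
```

On Linux, such a group could be passed to Python's `os.sched_setaffinity` to pin the calling process to one socket; memory placement would still need a separate mechanism such as numactl.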
NUMA-optimized column-wise graph partitioning
• Divides G = (V, A) into partial graphs Gk = (Vk, Ak) and binds Gk to local RAM k
– Ak is the set of adjacency lists that hold the incoming edges to Vk
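A minimal sketch of such a partitioning is shown below. Splitting V into contiguous, equally sized vertex ranges is an assumption made for illustration (the slides do not say how Vk is chosen), and the sketch only builds the data structures; it does not bind memory to a NUMA unit.

```python
def partition_columnwise(num_vertices, edges, num_units):
    """Split a directed edge list into per-unit incoming adjacency lists.

    parts[k][w] lists every source v with an edge v -> w, for each w in Vk.
    Vk is assumed to be a contiguous range of vertex IDs.
    """
    size = -(-num_vertices // num_units)        # ceiling division
    parts = [{} for _ in range(num_units)]
    for k in range(num_units):
        for w in range(k * size, min((k + 1) * size, num_vertices)):
            parts[k][w] = []
    for v, w in edges:
        parts[w // size][w].append(v)           # the edge goes to the owner of w
    return parts


# Hypothetical 4-vertex directed graph split over 2 units.
sample_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 0)]
sample_parts = partition_columnwise(4, sample_edges, 2)
```

Because every edge is stored with the owner of its destination, a unit that updates only vertices in its own Vk never writes to another unit's vertex data.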
NUMA-optimized top-down
• Explores the outgoing edges of the vertices in the frontier queue QF
• Appends unvisited neighbors to the neighbor queue QN
• Efficient for a small frontier
– Performs unnecessary edge traversals for a large frontier
Details of NUMA-optimized Top-down
NUMA-optimized bottom-up
• For each unvisited vertex, searches its neighbors for a vertex in the frontier queue QF
• Appends unvisited vertices adjacent to the frontier to the neighbor queue QN
• Efficient for a large frontier
– Performs unnecessary edge traversals for a small frontier
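Combined with a partitioning by incoming edges, a bottom-up level can be sketched so that each unit scans only its own vertices. This is a sequential emulation under assumptions: the `parts` layout (one dictionary of incoming adjacency lists per unit) and the per-unit output queues are hypothetical, and a real implementation would run one pinned thread group per NUMA unit.

```python
def bottom_up_partitioned(parts, frontier, parent):
    """One bottom-up level over per-unit incoming adjacency lists.

    parts[k][w] lists the sources of edges entering w, for w owned by
    unit k. Unit k writes only parent[w] for its own vertices and fills
    its own neighbor queue.
    """
    in_frontier = set(frontier)
    next_queues = [[] for _ in parts]           # one local queue per unit
    for k, local_in_edges in enumerate(parts):
        for w, sources in local_in_edges.items():
            if parent[w] != -1:                 # already visited
                continue
            for v in sources:
                if v in in_frontier:
                    parent[w] = v
                    next_queues[k].append(w)
                    break
    return next_queues


# Hypothetical partitioned graph: unit 0 owns {0, 1}, unit 1 owns {2, 3}.
numa_parts = [{0: [3], 1: [0]}, {2: [0], 3: [1, 2]}]
numa_parent = [0, -1, -1, -1]
numa_queues = bottom_up_partitioned(numa_parts, [0], numa_parent)
```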
Details of NUMA-optimized Bottom-up
Machine specification
• 4-way Intel Xeon E5
– CentOS 6.4 (Kernel 2.6.32)
– GCC 4.4.7
– 64 logical CPU cores
– 4 NUMA units x 16 logical cores
• 4-way AMD Opteron 6174
– Fedora 19 (Kernel 3.11.2)
– GCC 4.8.1
– 48 CPU cores
– 8 NUMA units x 6 cores
TEPS ratio versus problem size
• The NUMA-optimized algorithm achieves a 2.2x speedup over the original hybrid algorithm
• The NUMA-optimized algorithm achieves 11.15 GTEPS on a Kronecker graph (SCALE 26)
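To give a sense of scale for these figures, the sketch below works out the size of a SCALE 26 problem. It assumes the Graph500 convention of 2^SCALE vertices and an edge factor of 16, which is not stated on the slides; the time per BFS is a rough estimate because TEPS counts only edges in the traversed component.

```python
SCALE = 26
EDGEFACTOR = 16             # Graph500 default; an assumption here

num_vertices = 2 ** SCALE                   # 67,108,864 vertices
num_edges = EDGEFACTOR * num_vertices       # 1,073,741,824 edges

reported_gteps = 11.15      # from the slides
reported_speedup = 2.2      # from the slides

# Rough time for one BFS if every generated edge counted as traversed.
seconds_per_bfs = num_edges / (reported_gteps * 1e9)

# Rate of the original hybrid algorithm implied by the reported speedup.
implied_baseline_gteps = reported_gteps / reported_speedup
```

Under these assumptions one BFS takes a little under 0.1 seconds, and the original hybrid algorithm would run at roughly 5 GTEPS.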
Strong scaling on Intel/AMD systems
• Scales well until the number of threads reaches the number of cores
Twitter network
Graph500 benchmark
• Fastest single-node entry on the 4th list (June 2012)
• Fastest CPU-based single-node entry on the 6th list (June 2013)
1st Green Graph500 list (June 2013)
• Measures power efficiency using the TEPS/W ratio
• Includes results on various systems such as Android, Linux, and Mac
• NUMA-optimized BFS: Small Data category
Conclusion
• NUMA-optimized hybrid BFS algorithm
– Reduces unnecessary edge traversals and remote RAM accesses by carefully accounting for NUMA
• Numerical results on the 4-way Intel Xeon
– Scales well up to 64 threads (scalable)
– Achieves 11.15 GTEPS (fast)
– 2.2x speedup over the original hybrid algorithm
• Graph500 and Green Graph500
– Fastest single-node entry in June 2012
– Most power-efficient in June 2013
Future work
• Further optimization of the NUMA-optimized BFS
• Distributed-memory parallel computation
References
• Parallel Breadth-First Search on Distributed Memory Systems
[Aydın Buluç, Kamesh Madduri; Computational Research Division, Lawrence Berkeley National Laboratory]
• A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
[Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, Ümit Çatalyürek]
• Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
[Scott Beamer, Aydın Buluç, Krste Asanović, David A. Patterson]
• Distributed Breadth First Search
[Guest lecture by Anita Zakrzewska; CSE 6220, Spring 2013, Georgia Institute of Technology, April 18]
• Evaluation and Optimization of Breadth-First Search on NUMA Cluster
[Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate School of Chinese Academy of Sciences]
• Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory
[Roger Pearce, Maya Gokhale, Nancy M. Amato; Parasol Laboratory, Dept. of Computer Science and Engineering, Texas A&M University; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory]
• Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems
[Huiwei Lu, Guangming Tan, Mingyu Chen, Ninghui Sun; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Argonne National Laboratory]
• Level-Synchronous Parallel Breadth-First Search Algorithms for Multicore and Multiprocessor Systems
[Rudolf Berrendorf, Matthias Makulla; Computer Science Department, Bonn-Rhein-Sieg University, Sankt Augustin, Germany]
