Indian Institute of Technology, Patna
Large Scale Graph Processing: Neo4j Vs Apache
Giraph Vs Hadoop-MapReduce
(Survey Report)
Nishant M Gandhi
M.Tech. CSE
IIT Patna
Contents
1. Introduction
2. Graph Processing Platforms
a. Hadoop-MapReduce
b. Giraph
c. Neo4j
3. Analysis of Platforms
a. Hadoop-MapReduce
b. Giraph
c. Neo4j
4. Conclusion
5. References
1. Introduction
We live in the era of big data. Sources ranging from social media to scientific
experiments, and from desktop computers to mobile devices, generate huge amounts of
data every day. Storing and processing this data is a challenge in itself. Many real-life
problems can be solved using this data, and many of them can be mapped to graph
problems.
Many solutions have been created to process large-scale data. One of the most popular
is [4] Hadoop with its [2] MapReduce programming model. Google addressed the lack of a
programming model dedicated to graphs with [3] Pregel, which uses the Bulk Synchronous
Parallel (BSP) model for graph processing. An open-source implementation of Pregel
is [1] Giraph. Another platform is [6] Neo4j, which is a graph database.
This document examines these three platforms and their pros and cons.
2. Graph Processing Platforms
There are many platforms available for large-scale graph processing. We consider only
these three because each uses a very different programming model to process
graphs.
• [6] Neo4j: desktop platform, NoSQL, graph database, version
• [2] Hadoop-MapReduce: cluster platform, generic large-scale data processing
platform
• [1] Giraph: cluster platform, specialized large-scale graph processing platform
a. Hadoop-MapReduce
[2] Hadoop is an open-source platform for storing and processing huge amounts of data.
It has been widely used in data analytics applications. Hadoop uses the MapReduce
programming model, which is inspired by the map and reduce functions of functional
programming. A MapReduce program processes its input as key/value pairs: the map
function emits intermediate key/value pairs, and the reduce function merges all values
that share the same key.
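The map and reduce roles described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code: the helper `run_mapreduce` and the functions `map_edge` and `reduce_degree` are hypothetical names chosen for this illustration, and the example assumes an undirected graph given as an edge list. It computes the degree of every vertex in one map-shuffle-reduce cycle.

```python
from collections import defaultdict


def map_edge(edge):
    """Map step: emit a (vertex, 1) pair for each endpoint of an undirected edge."""
    u, v = edge
    yield (u, 1)
    yield (v, 1)


def reduce_degree(vertex, counts):
    """Reduce step: sum all counts that were emitted for one vertex."""
    return vertex, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Simulate one map-shuffle-reduce cycle inside a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group intermediate values by key
    return dict(reducer(key, values) for key, values in groups.items())


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = run_mapreduce(edges, map_edge, reduce_degree)
print(degrees)
```

In a real cluster the grouping step is performed by the framework across machines; here it is a dictionary, which keeps the sketch short but hides the network and disk cost of the shuffle.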
Data used by [4] Hadoop is stored in the Hadoop Distributed File System (HDFS). HDFS is
a component separate from the MapReduce engine, although MapReduce relies on it as its
default storage layer. Datasets stored in HDFS are divided into N blocks of similar
size, and each of these blocks is used as the input for one Mapper.
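As a rough illustration of this block-per-Mapper arrangement, the following Python sketch splits a small dataset into blocks and runs one mapper call per block. It is a simplification under stated assumptions: real HDFS splits files by bytes rather than by records, and `split_into_blocks` and `run_mappers` are hypothetical helper names, not part of any Hadoop API.

```python
def split_into_blocks(records, block_size):
    """Cut a dataset into consecutive blocks of at most block_size records."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def run_mappers(blocks, mapper):
    """Apply one mapper call per block, mirroring one Mapper task per block."""
    return [mapper(block) for block in blocks]


edge_list = [(i, i + 1) for i in range(10)]      # 10 edge records
blocks = split_into_blocks(edge_list, block_size=4)
edges_per_block = run_mappers(blocks, len)       # trivial mapper: count the records
print(len(blocks), edges_per_block)
```

The last block is smaller than the others, which is why the text speaks of blocks of similar rather than equal size.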
[8] Hadoop's programming model yields low performance and high resource consumption
for iterative graph algorithms, because such algorithms require multiple
MapReduce cycles. For example, for an iterative graph traversal algorithm, Hadoop
often needs to store and load the entire graph structure during each iteration, to transfer
data between the map and reduce processes through the disk-intensive HDFS, and to
run the convergence check for each iteration as an additional job.
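The cost pattern described above can be made concrete with a small Python simulation of breadth-first distances computed in repeated map-reduce rounds. This is a sketch, not Hadoop code: all function names are hypothetical, the graph is held in a dictionary instead of HDFS, and the convergence check is a plain comparison rather than a separate job. Note how every round re-emits the whole adjacency structure so that it survives into the next round.

```python
from collections import defaultdict

INF = float("inf")


def bfs_map(vertex, record):
    """Map step: pass the graph structure along and propose distances to neighbors."""
    dist, neighbors = record
    yield vertex, ("graph", neighbors)   # the full structure is re-emitted every round
    yield vertex, ("dist", dist)
    if dist != INF:
        for n in neighbors:
            yield n, ("dist", dist + 1)


def bfs_reduce(vertex, values):
    """Reduce step: keep the adjacency list and the smallest proposed distance."""
    neighbors, best = [], INF
    for kind, payload in values:
        if kind == "graph":
            neighbors = payload
        else:
            best = min(best, payload)
    return vertex, (best, neighbors)


def one_round(state):
    """One complete map-shuffle-reduce cycle over every vertex record."""
    groups = defaultdict(list)
    for vertex, record in state.items():
        for key, value in bfs_map(vertex, record):
            groups[key].append(value)
    return dict(bfs_reduce(k, vs) for k, vs in groups.items())


def bfs_distances(adjacency, source):
    """Repeat rounds until nothing changes; return the distances and the round count."""
    state = {v: (0 if v == source else INF, ns) for v, ns in adjacency.items()}
    rounds = 0
    while True:
        new_state = one_round(state)
        rounds += 1
        if new_state == state:           # convergence check after each round
            return {v: d for v, (d, _) in new_state.items()}, rounds
        state = new_state


adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
distances, rounds = bfs_distances(adjacency, "a")
print(distances, rounds)
```

Even on this four-vertex graph, three full passes over all records are needed, the last one only to confirm that nothing changed.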
b. Giraph
[1] Giraph is an open-source distributed platform built specifically for graph
processing. Giraph uses the Pregel programming model, a vertex-centric programming
abstraction that adapts the Bulk Synchronous Parallel (BSP) model. A BSP computation
proceeds in a series of global supersteps. Within each superstep, active vertices execute
the same user-defined compute function and create and deliver inter-vertex messages.
Barriers ensure synchronization between vertex computations. The computation terminates
once there are no messages to process and all vertices have voted to halt.
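The superstep, message, and vote-to-halt mechanics can be sketched with a single-process Python simulation that propagates the maximum vertex value through a graph. This is an illustration of the BSP idea only and does not use the Giraph API; the function `pregel_max` and its data layout are assumptions made for this example.

```python
def pregel_max(initial_values, edges):
    """Propagate the maximum value; return final values and the superstep count."""
    values = dict(initial_values)
    active = set(values)                      # every vertex starts active
    inbox = {v: [] for v in values}
    superstep = 0
    while active or any(inbox.values()):
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if v not in active and not messages:
                continue                      # halted and not woken by a message
            new_value = max([values[v]] + messages)
            if superstep == 0 or new_value > values[v]:
                values[v] = new_value
                for n in edges.get(v, []):
                    outbox[n].append(new_value)
            active.discard(v)                 # vote to halt; a message reactivates it
        inbox = outbox                        # barrier: delivery happens next superstep
        superstep += 1
    return values, superstep


edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, supersteps = pregel_max({"a": 3, "b": 6, "c": 2, "d": 1}, edges)
print(final_values, supersteps)
```

Only vertices that received a message do any work after the first superstep, which is the behavior that distinguishes this model from re-running a full pass over the graph on every iteration.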
[8] Giraph builds on the design of Hadoop, from which it leverages only the Map phase.
The single biggest difference between Hadoop and Giraph is that Giraph keeps the graph
in memory, which speeds up job execution. For fault tolerance, Giraph uses periodic
checkpoints. To coordinate superstep execution, it uses [5] ZooKeeper.
c. Neo4j
Neo4j is a popular open-source NoSQL graph database implemented in Java.
Neo4j stores data in graphs rather than in tables. Every graph stored in Neo4j consists of
relationships and vertices annotated with properties. Neo4j can execute graph-processing
algorithms efficiently on a single machine because of optimization techniques that
favor response time. [8] Neo4j uses a two-level, main-memory caching
mechanism to improve its performance. The file buffer caches the storage file data in
the same format in which it is stored on the durable storage media. The object buffer
caches vertices and relationships in a format that is optimized for high traversal speeds
and transactional writes.
Neo4j processes graphs by traversing all vertices, using either a breadth-first search
(BFS) or a depth-first search (DFS) traversal. To start a traversal, a program has to
define a special reference vertex. This vertex is not part of the original graph; it is an
additional artificial vertex that is added to the graph structure and acts as the starting
point of the traversal. All graph operations are performed as ACID transactions.
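The difference between the two traversal orders can be shown with a small Python sketch over an in-memory property graph. It does not use the Neo4j API: the `nodes` and `relationships` structures, the `KNOWS` relationship type, and the `traverse` function are all hypothetical, and vertex 1 simply plays the role of the starting vertex.

```python
from collections import deque

# A toy property graph: vertices carry properties, relationships carry a type.
nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
    4: {"name": "Dave"},
}
relationships = [(1, "KNOWS", 2), (1, "KNOWS", 3), (2, "KNOWS", 4), (3, "KNOWS", 4)]


def neighbors(node_id, rel_type):
    """Follow outgoing relationships of one type from a vertex."""
    return [dst for src, kind, dst in relationships
            if src == node_id and kind == rel_type]


def traverse(start, rel_type, breadth_first=True):
    """Return vertex ids in visit order; BFS uses a queue, DFS uses a stack."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        node = frontier.popleft() if breadth_first else frontier.pop()
        order.append(node)
        for n in neighbors(node, rel_type):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return order


bfs_order = traverse(1, "KNOWS")
dfs_order = traverse(1, "KNOWS", breadth_first=False)
print([nodes[n]["name"] for n in bfs_order])
print([nodes[n]["name"] for n in dfs_order])
```

The only change between the two modes is which end of the frontier is consumed, so the same code visits the graph level by level or branch by branch.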
3. Analysis of Platforms
The performance of these platforms has been analyzed several times. This section draws
on two sources and their results. The first is the 2013 M.Sc. thesis of Marcin Biczak
at Delft University of Technology [7]. The second is the report titled
[8] "How well do graph-processing platforms perform?"
The important findings from these two sources are listed below.
[8] Across all platforms, performance is stable, with the largest variance around 10%.
a. Hadoop-MapReduce
i. [7] Hadoop-MapReduce performs worse than the other platforms on graph
algorithms in general.
ii. [7] Multi-iteration algorithms suffer additional performance
penalties.
b. Giraph
i. [8] Giraph processes graphs in memory and implements a dynamic computation
mechanism by which only selected vertices are processed in each
iteration of an algorithm. This reduces computation time.
ii. [7] With large numbers of messages or big datasets, Giraph can
crash due to lack of memory.
c. Neo4j
i. [8] Limited by the resources of a single machine, Neo4j's performance
becomes significantly worse when the graph exceeds the memory
capacity.
ii. [7] Neo4j was designed as a single-machine database. To scale across multiple
machines, users of Neo4j have to implement communication between these
machines and manage partitioning, consistency, etc. This requires a
significant amount of additional work beyond the application
implementation.
iii. [7] The two-level cache allows Neo4j to achieve excellent hot-cache execution
times, especially when the graph data accessed by the algorithm fits in the cache.
iv. [8] The data ingestion time of Neo4j closely follows the characteristics of
the graph. Overall, data ingestion takes much longer for Neo4j than for
HDFS.
4. Conclusion
Based on this survey, we can draw the following conclusions.
Modern computers can handle most small or sparse graph datasets. However,
once the dataset size increases significantly or the graph is dense, the execution time
increases significantly. For this reason, single-machine graph-processing platforms
cannot compete with distributed systems on such datasets.
We have considered two graph-processing platforms (Giraph, Neo4j) and a generic
data-processing platform (Hadoop). Platforms that focus on processing graph
datasets achieve significant performance advantages over generic platforms in most
cases. However, Hadoop does not maintain the relations between data items and treats
every vertex as disjoint, whereas the other platforms have to maintain these relations
and pay a performance penalty for doing so. Thus, for certain datasets Hadoop can
achieve better performance than the graph-processing platforms.
Two factors are significant for large-scale graph-processing platforms: the
programming model and the platform design. Pregel performs much better than
MapReduce for iterative algorithms. Giraph is limited by its in-memory
computation, a limitation that Hadoop-MapReduce does not have. [7] Neo4j achieves
good performance for small or sparse datasets on a single system. It is well
documented and hence easy to learn. [7] Hadoop-MapReduce is considered the slowest
of the evaluated platforms, but Neo4j's performance on large or dense
datasets is lower than that of Hadoop-MapReduce. [7] The Giraph platform, which
represents distributed large-scale graph-processing platforms, was the fastest platform
in all the experiments conducted by Marcin Biczak.
5. References
1. Apache Software Foundation, “Giraph.” http://giraph.apache.org
2. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,”
Comm. ACM, vol. 51, no. 1, 2008, pp. 107–112.
3. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and
G. Czajkowski, “Pregel: A System for Large-Scale Graph Processing” (abstract), in
Proceedings of the 28th ACM Symposium on Principles of Distributed Computing,
PODC '09, p. 6, New York, NY, USA, 2009. ACM.
4. Apache Software Foundation, “Hadoop.” Website, 2011. http://hadoop.apache.org
5. Apache Software Foundation, “ZooKeeper.” Website, 2010. http://zookeeper.apache.org
6. Neo Technology, http://www.neo4j.org
7. M. Biczak, “LudoGraph: A Sampling Capable Cloud-Based System for Large-Scale Graph
Processing Based on the Pregel Programming Model,” Master of Science thesis, Delft
University of Technology, 2013.
8. Y. Guo, M. Biczak, A. Varbanescu, A. Iosup, C. Martella, and T. Willke, “How well do
graph-processing platforms perform? An empirical performance evaluation and analysis:
Extended report,” tech. rep., Delft University of Technology, 2013.
