Scalable graph analysis with
Apache Giraph and Spark GraphX
Roman Shaposhnik rvs@apache.org @rhatr
Director of Open Source, Pivotal Inc.
Introduction into
scalable graph analysis with
Apache Giraph and Spark GraphX
Shameless plug #1
Agenda:
Let's define some terms
•  A graph is G = (V, E), where E ⊆ V × V
•  Directed multigraphs with properties attached to each vertex and edge

(Figure: example directed multigraph; labels shown: foo, bar, fee, fum; values shown: 1, 2, 42)
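A minimal sketch of the definition above, assuming a single-process, in-memory representation. The names `PropertyGraph`, `Edge`, `addVertex`, `addEdge`, `vertexAttr`, `outEdges` and `edgeCount` are hypothetical and chosen for illustration; they are not the Giraph or GraphX API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the Giraph or GraphX API): a directed multigraph
// whose vertices carry a value of type VD and whose edges carry a value of type ED.
class PropertyGraph<VD, ED> {
    // One directed edge; several edges may share the same (src, dst) pair.
    static final class Edge<A> {
        final long src;
        final long dst;
        final A attr;

        Edge(long src, long dst, A attr) {
            this.src = src;
            this.dst = dst;
            this.attr = attr;
        }
    }

    private final Map<Long, VD> vertices = new LinkedHashMap<>();
    private final List<Edge<ED>> edges = new ArrayList<>();

    void addVertex(long id, VD attr) {
        vertices.put(id, attr);
    }

    // Edges connect pairs from V x V, so both endpoints must already exist.
    void addEdge(long src, long dst, ED attr) {
        if (!vertices.containsKey(src) || !vertices.containsKey(dst)) {
            throw new IllegalArgumentException("both endpoints must be vertices");
        }
        edges.add(new Edge<>(src, dst, attr));
    }

    VD vertexAttr(long id) {
        return vertices.get(id);
    }

    // All edges leaving the given vertex, parallel edges included.
    List<Edge<ED>> outEdges(long src) {
        List<Edge<ED>> out = new ArrayList<>();
        for (Edge<ED> e : edges) {
            if (e.src == src) {
                out.add(e);
            }
        }
        return out;
    }

    int edgeCount() {
        return edges.size();
    }
}
```

Storing edges in a list rather than a set is what makes this a multigraph: two edges with the same endpoints but different properties are both kept.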
What kind of graphs are we talking about?
• Page ranking on Facebook social graph (mid 2013)
•  10^9 (billions) vertices
•  10^12 (trillion) edges
•  10^15 (petabyte) cold storage data scale
•  200 servers
•  …all in under 4 minutes!
“On day one Doug created
HDFS and MapReduce”
Google papers that started it all
• GFS (file system)
•  distributed
•  replicated
•  non-POSIX

• MapReduce (computational framework)
•  distributed
•  batch-oriented (long jobs; final results)
•  data-gravity aware
•  designed for “embarrassingly parallel” algorithms
HDFS pools and abstracts direct-attached storage
(Figure: diagram with boxes labeled HDFS, MR, MR)
A Unix analogy
§ It is as though instead of:
$ grep foo bar.txt | tr "," " " | sort -u

§ We are doing:
$ grep foo < bar.txt > /tmp/1.txt
$ tr "," " " < /tmp/1.txt > /tmp/2.txt
$ sort -u < /tmp/2.txt
Enter Apache Spark
RAM is the new disk, Disk is the new tape
Source: UC Berkeley Spark project (just the image)
RDDs instead of HDFS files, RAM instead of Disk
warnings = textFile(…).filter(_.contains("warning"))
                      .map(_.split(' ')(1))
(Figure: lineage held in pooled RAM: HadoopRDD (path = hdfs://) → FilteredRDD (contains…) → MappedRDD (split…))
RDDs: resilient, distributed, datasets
§ Distributed on a cluster in RAM
§ Immutable (mostly)
§ Can be evicted, snapshotted, etc.
§ Manipulated via parallel operators (map, etc.)
§ Automatically rebuilt on failure
§ A parallel ecosystem
§ A solution to iterative and multi-stage apps
What’s so special about Graphs and big data?
Graph relationships
§ Entities in your data: tuples
-  customer data
-  product data
-  interaction data
§ Connection between entities: graphs
-  social network of my customers
-  clustering of customers vs. products
A word about Graph databases
§  Plenty available
-  Neo4J, Titan, etc.
§  Benefits
-  Query language
-  Tightly integrated systems with few moving parts
-  High performance on known data sets
§  Shortcomings
-  Not easy to scale horizontally
-  Don’t integrate with HDFS
-  Combine storage and computational layers
-  A sea of APIs
What’s the key API?
§ Directed multi-graph with labels attached to vertices and edges
§ Defining vertices and edges dynamically
§ Selecting sub-graphs
§ Mutating the topology of the graph
§ Partitioning the graph
§ Computing model that is
-  iterative
-  scalable (shared nothing)
-  resilient
-  easy to manage at scale
Bulk Synchronous Parallel
BSP compute model
BSP in a nutshell
(Figure: timeline alternating "local processing" and "communications" phases, separated by barrier #1, barrier #2, barrier #3)
Vertex-centric BSP application
(Figure: example graph with vertices @rhatr, @TheASF, @c0sin)
“Think like a vertex”
•  I know my local state
•  I know my neighbors
•  I can send messages to vertices
•  I can declare that I am done
•  I can mutate graph topology
Local state, global messaging
(Figure: timeline with communications between superstep #1 and superstep #2. Annotations: "vertices are doing local computing and pooling messages"; "all vertices are done computing".)
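The superstep cycle can be sketched as a single-process simulation. This is an illustration under assumptions, not Giraph code: the class `BspMaxValue` and its `run` method are hypothetical, and the example algorithm (every vertex adopts the largest value that reaches it) is chosen only because it is small. Messages sent in one superstep become visible only after the barrier, and a vertex that has halted is woken up only by an incoming message.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process simulation of vertex-centric BSP.
// Each vertex ends up with the largest value that can reach it along edges.
class BspMaxValue {
    static Map<Integer, Integer> run(Map<Integer, Integer> initial,
                                     Map<Integer, List<Integer>> outEdges) {
        Map<Integer, Integer> value = new HashMap<>(initial);
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        int superstep = 0;
        while (true) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (Integer v : initial.keySet()) {
                List<Integer> msgs = inbox.getOrDefault(v, Collections.emptyList());
                // A halted vertex is only reactivated by an incoming message.
                if (superstep > 0 && msgs.isEmpty()) {
                    continue;
                }
                // Local processing: fold the received values into local state.
                int best = value.get(v);
                for (int m : msgs) {
                    best = Math.max(best, m);
                }
                boolean changed = best != value.get(v);
                value.put(v, best);
                // Communication: tell neighbors only when there is news.
                if (superstep == 0 || changed) {
                    for (Integer n : outEdges.getOrDefault(v, Collections.emptyList())) {
                        outbox.computeIfAbsent(n, k -> new ArrayList<>()).add(best);
                    }
                }
                // Falling through here plays the role of voting to halt.
            }
            // Barrier: nothing was sent, so every vertex is halted and we stop.
            if (outbox.isEmpty()) {
                return value;
            }
            inbox = outbox;
            superstep++;
        }
    }
}
```

On a three-vertex cycle 1 → 2 → 3 → 1 with initial values 3, 6 and 2, every vertex finishes with 6 once no messages remain in flight.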
Let's put it all together
Hadoop ecosystem view
(Figure: stack diagram with boxes HDFS, Pig, Sqoop, Flume, MR, Hive, Tez, Giraph, Mahout, Spark, SparkSQL, MLlib, GraphX, HAWQ, Kafka, YARN, MADlib)
Spark view
(Figure: stack diagram with boxes HDFS, Ceph, GlusterFS, S3; Hive; Spark; SparkSQL; MLlib; GraphX; Kafka; YARN, Mesos, MR)
Enough boxology!
Let's look at some code
Our toy for the rest of this talk
Adjacency lists stored on HDFS
$ hadoop fs -cat /tmp/graph/1.txt
1
2 1 3
3 1 2
(Figure: the toy graph drawn with vertex IDs 1, 2, 3 and labels @rhatr, @TheASF, @c0sin)
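For reference, the adjacency-list format above (a vertex ID followed by the IDs of its out-neighbors) can be turned into edge pairs with a few lines of plain Java. This is a sketch, not part of either framework; `AdjacencyListParser`, `parseLine` and `parse` are hypothetical names, and whitespace-separated tokens are an assumption about the file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads "src dst1 dst2 ..." lines into (src, dst) pairs.
class AdjacencyListParser {
    // Turns a line such as "2 1 3" into the edges (2,1) and (2,3).
    // A line holding only a vertex ID, such as "1", yields no edges.
    static List<long[]> parseLine(String line) {
        List<long[]> edges = new ArrayList<>();
        String[] tokens = line.trim().split("\\s+");
        if (tokens[0].isEmpty()) {
            return edges; // blank line
        }
        long src = Long.parseLong(tokens[0]);
        for (int i = 1; i < tokens.length; i++) {
            edges.add(new long[] {src, Long.parseLong(tokens[i])});
        }
        return edges;
    }

    // Parses a whole file that has already been split into lines.
    static List<long[]> parse(List<String> lines) {
        List<long[]> edges = new ArrayList<>();
        for (String line : lines) {
            edges.addAll(parseLine(line));
        }
        return edges;
    }
}
```

Applied to the three lines of the toy file, this yields four directed edges: (2,1), (2,3), (3,1) and (3,2).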
Graph modeling in GraphX
§  The property graph is parameterized over the vertex (VD) and edge (ED) types
class Graph[VD, ED] {
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
}
§  Graph[(String, String), String]
Hello world in GraphX
$ spark*/bin/spark-shell
scala> import org.apache.spark.graphx._
scala> val inputFile = sc.textFile("hdfs:///tmp/graph/1.txt")
scala> val edges = inputFile.flatMap(s => { // "2 1 3"
         val l = s.split("\t") // [ "2", "1", "3" ]
         l.drop(1).map(x => (l.head.toLong, x.toLong)) // [ (2, 1), (2, 3) ]
       })
scala> val graph = Graph.fromEdgeTuples(edges, "") // Graph[String, Int]
scala> val result = graph.collectNeighborIds(EdgeDirection.Out).map(x =>
         println("Hello world from the: " + x._1 + " : " + x._2.mkString(" ")))
scala> result.collect() // don't try this @home
Hello world from the: 1 :
Hello world from the: 2 : 1 3
Hello world from the: 3 : 1 2
Graph modeling in Giraph
BasicComputation<I extends WritableComparable, // VertexID    -- vertex ref
                 V extends Writable,           // VertexData  -- a vertex datum
                 E extends Writable,           // EdgeData    -- an edge label
                 M extends Writable>           // MessageData -- message payload

V is sort of like VD
E is sort of like ED
Hello world in Giraph
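package giraph;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class GiraphHelloWorld extends
    BasicComputation<IntWritable, IntWritable, NullWritable, NullWritable> {
  public void compute(Vertex<IntWritable, IntWritable, NullWritable> vertex,
      Iterable<NullWritable> messages) {
    // Print this vertex's ID followed by the IDs of its out-neighbors.
    System.out.print("Hello world from the: " + vertex.getId() + " : ");
    for (Edge<IntWritable, NullWritable> e : vertex.getEdges()) {
      System.out.print(" " + e.getTargetVertexId());
    }
    System.out.println("");
    vertex.voteToHalt();
  }
}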
How to run it
$ giraph target/*.jar giraph.GiraphHelloWorld \
    -vip /tmp/graph/ \
    -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat \
    -w 1 \
    -ca giraph.SplitMasterWorker=false,giraph.logLevel=error
Hello world from the: 1 :
Hello world from the: 2 : 1 3
Hello world from the: 3 : 1 2
Anatomy of Giraph run
BSP assumes an exclusively vertex view
Turning Twitter into Facebook
(Figure: two drawings of the graph with vertices @rhatr, @TheASF, @c0sin)
Hello world in Giraph
public void compute(Vertex<Text, DoubleWritable, DoubleWritable> vertex, Iterable<Text> ms) {
  if (getSuperstep() == 0) {
    // Superstep 0: tell every out-neighbor who points at it.
    sendMessageToAllEdges(vertex, vertex.getId());
  } else {
    // Later supersteps: add a reverse edge to each sender not yet linked.
    for (Text m : ms) {
      if (vertex.getEdgeValue(m) == null) {
        vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE));
      }
    }
  }
  vertex.voteToHalt();
}
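The effect of these two supersteps can be reproduced without a cluster. The sketch below is a plain-Java simulation under assumptions, not Giraph code: `Symmetrize` and `makeSymmetric` are hypothetical names, vertex IDs are plain strings, and edge values are ignored. It turns a directed "follows" relation into a symmetric "friends" relation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical simulation of the two supersteps shown in the compute() method.
class Symmetrize {
    static Map<String, Set<String>> makeSymmetric(Map<String, Set<String>> outEdges) {
        // Work on a copy, and give every edge target its own (possibly empty) entry.
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : outEdges.entrySet()) {
            result.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            for (String target : e.getValue()) {
                result.computeIfAbsent(target, k -> new TreeSet<>());
            }
        }

        // Superstep 0: every vertex sends its own ID along each outgoing edge.
        Map<String, List<String>> inbox = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : result.entrySet()) {
            for (String target : e.getValue()) {
                inbox.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }

        // Superstep 1: every vertex adds an edge back to each sender it lacks.
        // Set.addAll skips senders that are already neighbors.
        for (Map.Entry<String, List<String>> e : inbox.entrySet()) {
            result.get(e.getKey()).addAll(e.getValue());
        }
        return result;
    }
}
```

Because all messages are collected before any edge is added, the simulation keeps the BSP property that a superstep only sees the topology and messages produced by the previous one.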
BSP in GraphX
Single source shortest path
scala> val sssp = graph.pregel(Double.PositiveInfinity)( // Initial message
         (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
         triplet => { // Send Message
           if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
             Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
           } else {
             Iterator.empty
           }
         },
         (a, b) => math.min(a, b)) // Merge Messages
scala> println(sssp.vertices.collect.mkString("\n"))
(Figure: two builds of the example graph, labeled 2, 42, 0, 3 and then 2, 5, 0, 3)
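The same three roles (vertex program, send message, merge messages) can be sketched in plain Java to show what each Pregel iteration does. This is an illustrative single-process version under assumptions, not the GraphX implementation: `PregelSssp`, `Edge` and `run` are hypothetical names, and edge weights are assumed to be non-negative so that the loop terminates.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process sketch of Pregel-style shortest paths.
class PregelSssp {
    static final class Edge {
        final int src;
        final int dst;
        final double weight;

        Edge(int src, int dst, double weight) {
            this.src = src;
            this.dst = dst;
            this.weight = weight;
        }
    }

    static Map<Integer, Double> run(List<Edge> edges, int source) {
        // Initial state: the source is at distance 0, everything else at infinity.
        Map<Integer, Double> dist = new HashMap<>();
        for (Edge e : edges) {
            dist.putIfAbsent(e.src, Double.POSITIVE_INFINITY);
            dist.putIfAbsent(e.dst, Double.POSITIVE_INFINITY);
        }
        dist.put(source, 0.0);

        boolean anyMessage = true;
        while (anyMessage) {
            Map<Integer, Double> inbox = new HashMap<>();
            for (Edge e : edges) {
                // Send message: only when the path through src improves on dst.
                double candidate = dist.get(e.src) + e.weight;
                if (candidate < dist.get(e.dst)) {
                    // Merge messages: keep the smallest offer per destination.
                    inbox.merge(e.dst, candidate, (a, b) -> Math.min(a, b));
                }
            }
            // Vertex program: adopt the offer if it beats the current distance.
            for (Map.Entry<Integer, Double> m : inbox.entrySet()) {
                dist.put(m.getKey(), Math.min(dist.get(m.getKey()), m.getValue()));
            }
            anyMessage = !inbox.isEmpty();
        }
        return dist;
    }
}
```

All messages of an iteration are computed from the distances of the previous iteration before any distance is updated, which mirrors the barrier between supersteps.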
Operational views of the graph
Masking instead of mutation
§ def subgraph(
    epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
    vpred: (VertexID, VD) => Boolean = ((v, d) => true))
  : Graph[VD, ED]
§ def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
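The idea of masking can be illustrated with an immutable graph whose `subgraph` method returns a new, filtered graph and leaves the original untouched. This is a simplified sketch under assumptions, not the GraphX implementation: `ImmutableGraph` and its `Edge` class are hypothetical, properties are plain strings, and the edge predicate sees only the edge rather than a full triplet. An edge survives only if it passes the edge predicate and both of its endpoints pass the vertex predicate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Predicate;

// Hypothetical immutable graph: filtering produces a new graph instead of mutating.
class ImmutableGraph {
    static final class Edge {
        final long src;
        final long dst;
        final String attr;

        Edge(long src, long dst, String attr) {
            this.src = src;
            this.dst = dst;
            this.attr = attr;
        }
    }

    final Map<Long, String> vertices;
    final List<Edge> edges;

    ImmutableGraph(Map<Long, String> vertices, List<Edge> edges) {
        // Defensive copies wrapped as read-only views.
        this.vertices = Collections.unmodifiableMap(new LinkedHashMap<>(vertices));
        this.edges = Collections.unmodifiableList(new ArrayList<>(edges));
    }

    ImmutableGraph subgraph(Predicate<Edge> epred, BiPredicate<Long, String> vpred) {
        Map<Long, String> keptVertices = new LinkedHashMap<>();
        for (Map.Entry<Long, String> v : vertices.entrySet()) {
            if (vpred.test(v.getKey(), v.getValue())) {
                keptVertices.put(v.getKey(), v.getValue());
            }
        }
        List<Edge> keptEdges = new ArrayList<>();
        for (Edge e : edges) {
            // Drop edges that would dangle after vertex filtering.
            if (keptVertices.containsKey(e.src)
                    && keptVertices.containsKey(e.dst)
                    && epred.test(e)) {
                keptEdges.add(e);
            }
        }
        return new ImmutableGraph(keptVertices, keptEdges);
    }
}
```

Filtering out the vertex labeled "bar" from a path foo → bar → fee, for example, leaves two vertices and no edges, while the original graph still has all three vertices and both edges.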
Built-in algorithms
§  def pageRank(tol: Double, resetProb: Double = 0.15):
Graph[Double, Double]
§  def connectedComponents(): Graph[VertexID, ED]
§  def triangleCount(): Graph[Int, ED]
§  def stronglyConnectedComponents(numIter: Int): Graph[VertexID, ED]
Final thoughts
Giraph
§ An unconstrained BSP framework
§ Specialized fully mutable,
dynamically balanced in-memory
graph representation
§ Very procedural, vertex-centric
programming model
§ Genuine part of Hadoop ecosystem
§ Definitely a 1.0
GraphX
§ An RDD framework
§ Graphs are “views” on RDDs and
thus immutable
§ Functional-like, “declarative”
programming model
§ Genuine part of Spark ecosystem
§ Technically still an alpha
Q&A
Thanks!
