Spark GraphX & Pregel
Challenges and Best Practices
Ashutosh Trivedi (IIIT Bangalore)
Kaushik Ranjan (IIIT Bangalore)
Sigmoid-Meetup Bangalore
https://github.com/anantasty/SparkAlgorithms
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Agenda
• Introduction to GraphX
– How to describe a graph
– RDDs to store a graph
– Algorithms available
• Application in graph algorithms
– Feedback Vertex Set of a graph
– Identifying parallel parts of the solution
• Challenges we faced
• Best practices
GraphX - Representation
Graph Representation
class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])
}
• VertexRDD[A] extends RDD[(VertexId, A)] and adds the constraint that
each VertexId occurs only once.
• A VertexRDD[A] therefore represents a set of vertices, each with an
attribute of type A.
• EdgeRDD[ED] extends RDD[Edge[ED]].
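As a rough illustration of the two-table layout described above, here is a minimal plain-Python sketch. It is not GraphX code: `build_graph` and the sample data are hypothetical, and the only rule it enforces is the one stated above, that each vertex ID occurs once.

```python
def build_graph(vertex_rows, edge_rows):
    """Build a (vertices, edges) pair from two tables.

    vertex_rows: iterable of (vertex_id, attribute)
    edge_rows:   iterable of (src_id, dst_id, attribute)
    """
    vertices = {}
    for vid, attr in vertex_rows:
        # Mirror the vertex-table constraint: one row per vertex ID.
        if vid in vertices:
            raise ValueError(f"duplicate vertex id: {vid}")
        vertices[vid] = attr
    # Edges are kept as plain (src, dst, attribute) rows.
    edges = [(src, dst, attr) for src, dst, attr in edge_rows]
    return vertices, edges


# Hypothetical sample data.
vertices, edges = build_graph(
    [(1, "A"), (2, "B")],
    [(1, 2, "follows")],
)
```

A dictionary keyed by vertex ID makes the uniqueness rule easy to check; the edge table stays a list because parallel edges are allowed.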
Vertex and Edges
[Figure: a vertex and an edge, drawn with vertices labelled A and B.]
Triplets Join Vertices and Edges
• The triplets operator joins vertices and edges:
Vertices: A, B, C, D
Edges: A → B, A → C, B → C, C → D
Triplets: (A, A → B, B), (A, A → C, C), (B, B → C, C), (C, C → D, D)
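The join performed by the triplets operator can be sketched in plain Python. This is only an illustration of the idea, not the GraphX implementation; the function name `triplets` and the vertex and edge attributes are made up for the example, while the vertices A to D and the four edges follow the slide.

```python
def triplets(vertices, edges):
    """Join each edge with the attributes of its source and destination.

    vertices: dict mapping vertex_id -> attribute
    edges:    list of (src_id, dst_id, edge_attribute)
    Returns a list of ((src_id, src_attr), edge_attr, (dst_id, dst_attr)).
    """
    return [
        ((src, vertices[src]), attr, (dst, vertices[dst]))
        for src, dst, attr in edges
    ]


# Hypothetical attributes for the four vertices and four edges.
vertices = {"A": 1, "B": 2, "C": 3, "D": 4}
edges = [("A", "B", "e1"), ("A", "C", "e2"), ("B", "C", "e3"), ("C", "D", "e4")]

result = triplets(vertices, edges)
```

Each triplet carries everything a per-edge function needs, which is why message-passing operators work on triplets rather than on bare edges.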
Triplets elements
Subgraphs
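The subgraph operator takes a vertex predicate (vpred) and an edge predicate (epred) and keeps only the vertices and edges that satisfy them.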
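A minimal sketch of predicate-based subgraph extraction, in plain Python rather than GraphX. The function name `subgraph` and the sample data are hypothetical, and the edge predicate here receives only the edge row, which is a simplification of a predicate over full triplets.

```python
def subgraph(vertices, edges,
             vpred=lambda vid, attr: True,
             epred=lambda src, dst, attr: True):
    """Keep vertices that satisfy vpred, and edges that satisfy epred
    and whose two endpoints were both kept."""
    kept_vertices = {vid: attr for vid, attr in vertices.items()
                     if vpred(vid, attr)}
    kept_edges = [
        (src, dst, attr) for src, dst, attr in edges
        if src in kept_vertices and dst in kept_vertices
        and epred(src, dst, attr)
    ]
    return kept_vertices, kept_edges


# Hypothetical data: drop vertex C, keep every edge that survives.
vertices = {"A": 1, "B": 2, "C": 3, "D": 4}
edges = [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0), ("C", "D", 1.0)]

sub_vertices, sub_edges = subgraph(
    vertices, edges, vpred=lambda vid, attr: vid != "C")
```

Removing a vertex this way also removes every edge touching it, which is the operation the feedback vertex set algorithm relies on.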
Feedback Vertex Set
• A feedback vertex set of a graph is a set of vertices
whose removal leaves a graph without cycles.
• Equivalently, a feedback vertex set contains at least one vertex of
every cycle in the graph.
• Finding a minimum feedback vertex set is NP-hard; the decision
version of the problem is NP-complete.
• Naive approach: enumerate each simple cycle.
Strongly Connected Components
[Figure: a directed graph on vertices 1 to 10.]
Strongly connected components can be processed in parallel,
since no cycle spans two components.
SC1 – (1) SC2 – (5) SC3 – (8) SC4 – (9)
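To make the decomposition concrete, here is a small plain-Python sketch that computes strongly connected components with Kosaraju's two-pass algorithm. It is a stand-in for illustration only, not the method GraphX uses, and the example graph is hypothetical (it is not the ten-vertex graph from the slide).

```python
def strongly_connected_components(vertices, edges):
    """Return the SCCs of a directed graph as a list of frozensets."""
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)

    # Pass 1: iterative DFS on the graph, recording finish order.
    order, seen = [], set()
    for root in vertices:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            v, it = stack[-1]
            nxt = next((w for w in it if w not in seen), None)
            if nxt is None:
                order.append(v)      # v is finished
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(succ[nxt])))

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        comp, stack = set(), [root]
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in pred[v]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        components.append(frozenset(comp))
    return components


# Hypothetical graph: two cycles joined by a one-way edge 3 -> 4.
sccs = strongly_connected_components(
    [1, 2, 3, 4, 5],
    [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)],
)
```

The edge 3 → 4 lies on no cycle, so the two components can be handed to separate workers without losing any cycle.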
FVS Algorithm
# Greedy recursive solution
FVS(G)
    sccGraph = scc(G)
    for each graph in sccGraph
        if the graph has no cycle, skip it
        for each vertex
            remove the vertex and recompute scc
        V = the vertex whose removal gives the max number of SCCs
        # which means it kills the maximum number of cycles
        add V to the feedback vertex set
        subGraph = subgraph(remove V)
        FVS(subGraph)
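The greedy recursion above can be sketched as runnable plain Python. This is an illustration under assumptions, not the authors' Spark code: the helper names are hypothetical, SCCs are found with a simple forward/backward reachability intersection (adequate for small graphs), ties are broken by the smallest vertex ID, and the result is a heuristic that is not guaranteed to be a minimum feedback vertex set.

```python
def reachable(start, adj):
    """Vertices reachable from start (including start) via adj."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def scc(vertices, edges):
    """SCCs as frozensets: forward reach intersected with backward reach."""
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for s, d in edges:
        succ[s].append(d)
        pred[d].append(s)
    components, assigned = [], set()
    for v in vertices:
        if v not in assigned:
            comp = reachable(v, succ) & reachable(v, pred)
            assigned |= comp
            components.append(frozenset(comp))
    return components


def induced(vertices, edges):
    """Edges with both endpoints inside vertices."""
    vs = set(vertices)
    return [(s, d) for s, d in edges if s in vs and d in vs]


def greedy_fvs(vertices, edges):
    """Greedy feedback vertex set: in each cyclic SCC, remove the vertex
    whose removal yields the most SCCs, then recurse on what is left."""
    fvs = []
    for comp in scc(vertices, edges):
        comp_edges = induced(comp, edges)
        # A single vertex without a self-loop contains no cycle.
        if len(comp) == 1 and not comp_edges:
            continue

        def scc_count_without(v):
            rest = comp - {v}
            return len(scc(rest, induced(rest, comp_edges)))

        best = max(sorted(comp), key=scc_count_without)
        fvs.append(best)
        rest = comp - {best}
        fvs.extend(greedy_fvs(rest, induced(rest, comp_edges)))
    return fvs


# Hypothetical graph: two triangles sharing vertex 3.
V = [1, 2, 3, 4, 5]
E = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 3)]
result = greedy_fvs(V, E)
```

Every candidate removal triggers a fresh SCC computation, so the cost grows quickly with component size; that is the part worth parallelising across components.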
[Figure: worked example on a four-vertex graph (vertices 1, 2, 3, 4). A table with columns Graph, Iteration and SCC count shows the graph and its SCC count after removing vertex 1, vertex 2 and vertex 3 in turn.]
Feedback Vertex Set: 1, 5, 8, 9
FVS – Spark Implementation
sccGraph carries one extra property, sccID, on each vertex; extract it.
sccGraph = scc(G)
for each graph in sccGraph
    # greedy recursive function
    for each vertex
        remove the vertex and recompute scc
    # Z is a list of SCC counts after removing each vertex
    V = the vertex that gives the max number of SCCs
    # which means it kills the maximum number of cycles
    subGraph = subgraph(remove V)
    FVS(subGraph)
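The first step, extracting the sccID property and splitting the graph into one subgraph per component, might look like the following plain-Python sketch. It assumes the SCC step has already labelled every vertex with a component ID; `split_by_scc` and the sample labels are hypothetical and stand in for the Spark code shown on the slides.

```python
from collections import defaultdict


def split_by_scc(scc_ids, edges):
    """Group vertices by component ID and keep only intra-component edges.

    scc_ids: dict mapping vertex_id -> component ID
    edges:   list of (src_id, dst_id)
    Returns a dict: component ID -> (set of vertices, list of edges).
    """
    members = defaultdict(set)
    for vid, scc_id in scc_ids.items():
        members[scc_id].add(vid)

    parts = {}
    for scc_id, vs in members.items():
        inner = [(s, d) for s, d in edges
                 if scc_ids[s] == scc_id and scc_ids[d] == scc_id]
        parts[scc_id] = (vs, inner)
    return parts


# Hypothetical labelling: each vertex carries the smallest ID in its SCC.
scc_ids = {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)]
parts = split_by_scc(scc_ids, edges)
```

Edges that cross components, such as 3 → 4 here, are dropped because they cannot lie on a cycle.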
Pregel
• Graph DB
– Data Storage
– Data Mining
• Advantages
– Large-scale distributed computations
– Parallel algorithms for graphs on multiple machines
– Fault tolerance and distributability
Oldest Follower
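What is the age of the oldest follower of each user?
// assumes each vertex attribute has an Int field `age`
val oldestFollowerAge = graph.aggregateMessages[Int](
  // map: each follower (edge source) sends its age to the user it follows
  ctx => ctx.sendToDst(ctx.srcAttr.age),
  // reduce: keep the largest age received
  (a, b) => math.max(a, b)
)
Note: mapReduceTriplets is now aggregateMessages.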
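The map and reduce halves of this query can be simulated in plain Python to show what ends up at each vertex. This is a sketch of the idea only: `aggregate_messages`, its callback signature and the sample users are hypothetical and do not mirror the GraphX API.

```python
def aggregate_messages(vertices, edges, send_msg, merge_msg):
    """Run send_msg on every edge and merge the messages per target vertex.

    send_msg(src, src_attr, dst, dst_attr, send) may call send(target, msg).
    merge_msg(a, b) combines two messages addressed to the same vertex.
    Returns a dict with an entry only for vertices that received a message.
    """
    inbox = {}

    def send(target, msg):
        inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg

    for src, dst in edges:
        send_msg(src, vertices[src], dst, vertices[dst], send)
    return inbox


# Hypothetical data: vertex attribute is the user's age,
# and an edge (follower, user) means "follower follows user".
ages = {"alice": 28, "bob": 35, "carol": 42}
follows = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]

oldest_follower_age = aggregate_messages(
    ages, follows,
    # map: the follower sends its age to the user it follows
    lambda src, src_age, dst, dst_age, send: send(dst, src_age),
    # reduce: keep the largest age
    max,
)
```

Users with no followers (carol in this sample) receive no message and therefore do not appear in the result.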
New in GraphX
In aggregateMessages:
• An EdgeContext exposes the triplet fields.
• It provides functions to explicitly send messages to the source and
destination vertices.
• It requires the user to indicate which fields in the triplet are
actually required.
Theory – it’s good
How it works – that’s awesome
Graphs are recursive data structures: the property of a vertex
depends on the properties of its neighbors, which in turn depend
on the properties of their neighbors.
Architecture
Graph.Pregel( initialMessage ) (
    # message consumption
    ( vertexID, initialProperty, message ) → compute new property
    ,
    # message generation
    triplet → .. code ..
        Iterator( vertexID, message )    # to send a message
        Iterator.empty                   # to send nothing
    ,
    # message aggregation
    ( existing message set, new message ) → NEW message set
)
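The three callbacks fit together in a superstep loop. The following plain-Python sketch shows one way that loop could work; it is an approximation for illustration, not the GraphX implementation, and the function name `pregel`, its parameters and the sample values are all hypothetical. The example propagates the maximum vertex value around a ring.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg,
           max_iterations=20):
    """Minimal single-machine Pregel loop.

    vprog(vid, attr, msg)            -> new attribute (message consumption)
    send_msg(src, s_attr, dst, d_attr) -> iterable of (target, msg) pairs
    merge_msg(a, b)                  -> combined message (aggregation)
    """
    # Superstep 0: every vertex consumes the initial message.
    state = {vid: vprog(vid, attr, initial_msg)
             for vid, attr in vertices.items()}

    for _ in range(max_iterations):
        # Message generation and aggregation.
        inbox = {}
        for src, dst in edges:
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = (merge_msg(inbox[target], msg)
                                 if target in inbox else msg)
        if not inbox:          # no messages in flight: converged
            break
        # Message consumption, only at vertices that received something.
        for vid, msg in inbox.items():
            state[vid] = vprog(vid, state[vid], msg)
    return state


# Hypothetical ring 1 -> 2 -> 3 -> 4 -> 1 with made-up values.
values = {1: 3, 2: 6, 3: 2, 4: 1}
ring = [(1, 2), (2, 3), (3, 4), (4, 1)]

final = pregel(
    values, ring,
    initial_msg=float("-inf"),
    vprog=lambda vid, attr, msg: max(attr, msg),
    # Send only when the source knows a larger value than the destination.
    send_msg=lambda s, sv, d, dv: [(d, sv)] if sv > dv else [],
    merge_msg=max,
)
```

Returning an empty list from `send_msg` plays the role of `Iterator.empty`: once no edge has anything to send, the loop stops.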
[Figure: Pregel supersteps on a four-vertex graph (vertices 1 to 4). Messages arriving at a vertex are aggregated with max, for example max[30, 10, 20], max[20] and max[10].]
Example - output
[Figure: final vertex values for the same four-vertex graph.]
Applications - GIS
• Algorithm: compute all vertices in a directed graph that can
reach a given vertex.
• Can be used for watershed delineation in Geographic Information
Systems.
[Figure: example graph in which the vertices that can reach E are A and B.]
Algorithm
Graph.Pregel( Seq[vertexIDs] ) (
    # message consumption
    if vertex.state == 1
        vertex.state → 2
    else if vertex.state == 0
        if ( vertex.adjacentVertices ∩ Seq[vertexIDs] ) isNotEmpty
            vertex.state → 2
    ,
    # message generation
    for each triplet
        if destinationVertex.state == 1
            message( sourceVertexID, Seq[destinationVertexID] )
            message( destinationVertexID, Seq[destinationVertexID] )
        else if sourceVertex.state == 1 and destinationVertex.state == 2
            message( sourceVertexID, Seq[destinationVertexID] )
        else message( empty )
    ,
    # message aggregator
    Seq[existing vertex IDs] ∪ Seq[new vertex ID]
)
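The underlying idea, marking vertices in waves that travel backwards along edges from the target, can be sketched in plain Python. This is a simplified variant that tracks a frontier set instead of the per-vertex states in the pseudocode; `vertices_reaching` and the example graph are hypothetical, chosen so that A and B are the vertices that reach E.

```python
def vertices_reaching(edges, target):
    """Return all vertices that can reach target in a directed graph.

    Each round plays the part of one superstep: edges whose destination
    was marked in the previous round notify their source vertex.
    """
    reached = {target}
    frontier = {target}
    while frontier:
        # Message generation: sources of edges that point into the frontier.
        notified = {src for src, dst in edges if dst in frontier}
        # Message consumption: only vertices not marked before join the
        # next frontier, so every vertex is expanded at most once.
        frontier = notified - reached
        reached |= frontier
    return reached - {target}


# Hypothetical graph: A -> B -> E -> C -> D.
edges = [("A", "B"), ("B", "E"), ("E", "C"), ("C", "D")]
upstream_of_e = vertices_reaching(edges, "E")
```

For watershed delineation the edges would be flow directions between cells and the target the outlet; the returned set is then the contributing area.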
References
• Fork our repository at
• https://github.com/anantasty/SparkAlgorithms
• Follow us at
• https://github.com/codeAshu
• https://github.com/kaushikranjan
• https://spark.apache.org/docs/latest/graphx-programming-guide.html
