GraphFrames
DataFrame-based graphs for Apache® Spark™
Joseph K. Bradley
4/14/2016
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and Apache
Spark PMC member working on MLlib at
Databricks. Previously, he was a postdoc at UC
Berkeley after receiving his Ph.D. in Machine
Learning from Carnegie Mellon U. in 2013. His
research included probabilistic graphical models,
parallel sparse regression, and aggregation
mechanisms for peer grading in MOOCs.
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with
Databricks; he is a hands-on data sciences engineer
with more than 15 years of experience developing
internet-scale infrastructure, data platforms, and
distributed systems for both on-premises and cloud.
Prior to joining Databricks, Denny worked as a
Senior Director of Data Sciences Engineering at
Concur and was part of the incubation team that
built Hadoop on Windows and Azure (currently
known as HDInsight).
We are Databricks, the company behind Apache Spark
Founded by the creators of Apache Spark in 2013
Share of Spark code contributed by Databricks in 2014: 75%
Data Value
Created Databricks on top of Spark to make big data simple.
…
Apache Spark Engine
Spark Core
Standard libraries: Spark Streaming, Spark SQL, MLlib, GraphX
APIs: Python, Java, Scala, and R
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Notable users that presented at Spark Summit 2015 San Francisco
Source: Slide 5 of Spark Community Update
Outline
GraphFrames overview
GraphFrames vs. GraphX and other libraries
Details for power users
Roadmap and resources
Graphs
Example: airports & flights between them
Airports in the example graph: JFK, IAD, LAX, SFO, SEA, DFW
vertex
id      City        State
“JFK”   “New York”  NY
edge
src     dst     delay   tripID
“JFK”   “SEA”   45      1058923
Apache Spark’s GraphX library
Overview
• General-purpose graph processing library
• Optimized for fast distributed computing
• Library of algorithms: PageRank, Connected Components, etc.
Challenges
• No Java, Python APIs
• Lower-level RDD-based API (vs. DataFrames)
• Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
Enter GraphFrames
Goal: DataFrame-based graphs on Apache Spark
• Simplify interactive queries
• Support motif-finding for structural pattern search
• Benefit from DataFrame optimizations
Collaboration between Databricks, UC Berkeley & MIT
+ Now with community contributors!
GraphFrames
“vertices” DataFrame
• 1 vertex per Row
• id: column with unique ID
“edges” DataFrame
• 1 edge per Row
• src, dst: columns using IDs from vertices.id
Extra columns store vertex or edge data
(a.k.a. attributes or properties).
id      City        State
“JFK”   “New York”  NY
“SEA”   “Seattle”   WA
src     dst     delay   tripID
“JFK”   “SEA”   45      1058923
“DFW”   “SFO”   -7      4100224
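The two-table layout can be pictured without Spark. The sketch below is plain Python, not the GraphFrames API: lists of dicts stand in for DataFrame rows, `dangling_edges` is a hypothetical helper, and the DFW and SFO vertex rows are added for illustration only. It checks the one structural rule stated on this slide, namely that every src and dst refers to an id in the vertices table.

```python
# Plain-Python stand-in for the two DataFrames; each dict plays the role of a Row.
vertices = [
    {"id": "JFK", "City": "New York", "State": "NY"},
    {"id": "SEA", "City": "Seattle", "State": "WA"},
    # The next two rows are illustrative additions, not taken from the slides.
    {"id": "DFW", "City": "Dallas/Fort Worth", "State": "TX"},
    {"id": "SFO", "City": "San Francisco", "State": "CA"},
]

edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45, "tripID": 1058923},
    {"src": "DFW", "dst": "SFO", "delay": -7, "tripID": 4100224},
]


def dangling_edges(vertices, edges):
    """Return the edges whose src or dst is not a known vertex id (hypothetical helper)."""
    ids = {v["id"] for v in vertices}
    return [e for e in edges if e["src"] not in ids or e["dst"] not in ids]


# Every edge endpoint is present in the vertices table, so nothing dangles.
print(dangling_edges(vertices, edges))
```

Dropping a vertex row while keeping its edges makes the helper report those edges, which is the situation the "IDs from vertices.id" rule is meant to rule out.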
Demo:
Building a GraphFrame
Queries
Simple queries
Motif finding
Graph algorithms
Simple queries
SQL queries on vertices & edges
E.g., what trips are most likely to have significant delays?
Graph queries
• Vertex degrees
  • # edges per vertex (incoming, outgoing, total)
• Triplets
  • Join vertices and edges to get (src, edge, dst)
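What the two graph queries compute can be sketched in plain Python. This is an illustration of the semantics only, not GraphFrames code: lists of dicts stand in for DataFrames, the function names are hypothetical, and the SEA to SFO edge is made-up sample data. In GraphFrames the corresponding results are, to the best of my knowledge, exposed as `inDegrees`, `outDegrees`, `degrees` and `triplets`.

```python
from collections import Counter

vertices = [
    {"id": "JFK", "City": "New York"},
    {"id": "SEA", "City": "Seattle"},
    {"id": "DFW", "City": "Dallas/Fort Worth"},
    {"id": "SFO", "City": "San Francisco"},
]
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "DFW", "dst": "SFO", "delay": -7},
    {"src": "SEA", "dst": "SFO", "delay": 10},  # illustrative edge, not from the slides
]


def degrees(edges):
    """Count incoming, outgoing and total edges per vertex id."""
    in_deg = Counter(e["dst"] for e in edges)
    out_deg = Counter(e["src"] for e in edges)
    return in_deg, out_deg, in_deg + out_deg


def triplets(vertices, edges):
    """Join each edge with its source and destination vertex rows."""
    by_id = {v["id"]: v for v in vertices}
    return [{"src": by_id[e["src"]], "edge": e, "dst": by_id[e["dst"]]} for e in edges]


in_deg, out_deg, total = degrees(edges)
first = triplets(vertices, edges)[0]
print(total["SFO"], first["src"]["City"], first["dst"]["City"])
```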
Motif finding
Search for structural patterns within a graph.
val paths: DataFrame =
  g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
Then filter using vertex & edge data.
paths.filter("e1.delay > 20")
[Figure: the airport graph (IAD, JFK, LAX, SFO, SEA, DFW) with one match of the pattern highlighted as (a), (b), (c)]
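The motif above reads: an edge e1 from a to b, an edge e2 from b to c, and no edge from c back to a. A plain-Python sketch of that search follows. It is a nested-loop illustration of the semantics, not how GraphFrames executes the query (the slides describe motif finding as a series of DataFrame joins). The function name and all edges other than JFK to SEA and DFW to SFO are assumptions made for the example.

```python
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "SEA", "dst": "SFO", "delay": 10},   # illustrative
    {"src": "SFO", "dst": "JFK", "delay": 5},    # illustrative
    {"src": "DFW", "dst": "SFO", "delay": -7},
    {"src": "LAX", "dst": "JFK", "delay": 30},   # illustrative
]


def find_two_hops_without_return(edges):
    """Match (a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a) over a list of edge dicts."""
    existing = {(e["src"], e["dst"]) for e in edges}
    matches = []
    for e1 in edges:                      # (a)-[e1]->(b)
        for e2 in edges:                  # (b)-[e2]->(c)
            if e1["dst"] != e2["src"]:
                continue
            a, c = e1["src"], e2["dst"]
            if (c, a) in existing:        # negated term: no edge from c back to a
                continue
            matches.append({"a": a, "b": e1["dst"], "c": c, "e1": e1, "e2": e2})
    return matches


paths = find_two_hops_without_return(edges)
# Same idea as paths.filter("e1.delay > 20") on the slide.
delayed = [m for m in paths if m["e1"]["delay"] > 20]
print([(m["a"], m["b"], m["c"]) for m in delayed])
```

The JFK, SEA, SFO cycle produces no matches because every two-hop path inside it is closed by a return edge.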
Graph algorithms
Find important vertices
• PageRank
Find paths between sets of vertices
• Breadth-first search (BFS)
• Shortest paths
Find groups of vertices (components, communities)
• Connected components
• Strongly connected components
• Label Propagation Algorithm (LPA)
Other
• Triangle counting
• SVDPlusPlus
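As a reminder of what "find important vertices" means, here is a textbook PageRank power iteration in plain Python. It is a normalized variant written for illustration, not the GraphX implementation that GraphFrames wraps, so absolute scores may differ from what the library returns. The function, its parameter names and the sample edges are assumptions of this sketch.

```python
def pagerank(edges, reset_probability=0.15, max_iter=20):
    """Normalized PageRank by power iteration over (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Every vertex starts each round with the "random jump" share.
        new_rank = {n: reset_probability / len(nodes) for n in nodes}
        for n in nodes:
            # A vertex without out-links spreads its rank over all vertices.
            targets = out_links[n] or nodes
            share = (1.0 - reset_probability) * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Illustrative flights between the airports used in the slides.
edges = [("JFK", "SEA"), ("SEA", "SFO"), ("SFO", "JFK"), ("DFW", "SFO"), ("LAX", "JFK")]
ranks = pagerank(edges)
print(sorted(ranks, key=ranks.get, reverse=True))
```

DFW and LAX have no incoming flights, so they keep only the reset share and rank lowest.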
Algorithm implementations
Mostly wrappers for GraphX
• PageRank
• Shortest paths
• Connected components
• Strongly connected components
• Label Propagation Algorithm (LPA)
• SVDPlusPlus
Some algorithms implemented using DataFrames
• Breadth-first search
• Triangle counting
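Breadth-first search is a natural fit for a join-and-filter formulation: each round joins the current set of paths with the edges table on "last vertex = src", then filters for paths that reached the target. The plain-Python sketch below mimics that loop with list comprehensions. It is an illustration under my own assumptions (function name, `max_path_length` default, sample edges), not the GraphFrames implementation.

```python
def bfs_paths(edges, start, goal, max_path_length=10):
    """Return all shortest paths from start to goal as lists of vertex ids."""
    if start == goal:
        return [[start]]
    paths = [[start]]
    visited = {start}
    for _ in range(max_path_length):
        # "Join": extend every path by each edge leaving its last vertex.
        extended = [
            path + [dst]
            for path in paths
            for (src, dst) in edges
            if src == path[-1] and dst not in visited
        ]
        # "Filter": keep the paths that have reached the goal.
        found = [path for path in extended if path[-1] == goal]
        if found:
            return found
        if not extended:
            return []
        visited |= {path[-1] for path in extended}
        paths = extended
    return []


# Illustrative flights between the airports used in the slides.
edges = [("LAX", "JFK"), ("JFK", "SEA"), ("SEA", "SFO"), ("SFO", "JFK"), ("DFW", "SFO")]
print(bfs_paths(edges, "LAX", "SFO"))
print(bfs_paths(edges, "SEA", "DFW"))
```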
Saving & loading graphs
Save & load the DataFrames.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)
g.vertices.write.parquet(...)
g.edges.write.parquet(...)
In the future...
• SQL data sources for graph formats
APIs: Scala, Java, Python
API available from all 3 languages
→ First time GraphX functionality has been available to
Java & Python users
2 missing items (WIP)
• Java-friendliness is currently in alpha.
• Python does not have aggregateMessages
(for implementing your own graph algorithms).
Outline
GraphFrames overview
GraphFrames vs. GraphX and other libraries
Details for power users
Roadmap and resources
2 types of graph libraries
Graph algorithms
• Standard & custom algorithms
• Optimized for batch processing
Graph queries
• Motif finding
• Point queries & updates
GraphFrames: Both algorithms & queries (but not point updates)
GraphFrames vs. GraphX
                         GraphFrames                       GraphX
Built on                 DataFrames                        RDDs
Languages                Scala, Java, Python               Scala
Use cases                Queries & algorithms              Algorithms
Vertex IDs               Any type (in Catalyst)            Long
Vertex/edge attributes   Any number of DataFrame columns   Any type (VD, ED)
Return types             GraphFrame or DataFrame           Graph[VD, ED], or RDD[(Long, VD)]
GraphX compatibility
Simple conversions between GraphFrames & GraphX.
val g: GraphFrame = ...
// Convert GraphFrame → GraphX
val gx: Graph[Row, Row] = g.toGraphX
// Convert GraphX → GraphFrame
val g2: GraphFrame = GraphFrame.fromGraphX(gx)
Vertex & edge attributes are Rows in order to handle non-Long IDs
Wrapping existing GraphX code: See Belief Propagation example:
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/examples/BeliefPropagation.scala
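One way to picture why the attributes are whole Rows: GraphX needs Long vertex IDs, so a graph keyed by strings such as "JFK" has to be re-keyed, and the original row (including the original id) travels along as the attribute so results can be mapped back. The plain-Python sketch below assumes a simple enumerate-based indexing. The slides do not say how GraphFrames actually assigns the Long IDs, so `to_long_ids` is a hypothetical helper and not the library's method.

```python
vertices = [
    {"id": "JFK", "City": "New York", "State": "NY"},
    {"id": "SEA", "City": "Seattle", "State": "WA"},
]
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45, "tripID": 1058923},
]


def to_long_ids(vertices, edges):
    """Re-key a graph by integer ids, keeping each original row as the attribute."""
    # Assumption of this sketch: ids are assigned by position in the vertex list.
    index = {v["id"]: i for i, v in enumerate(vertices)}
    long_vertices = [(index[v["id"]], v) for v in vertices]
    long_edges = [(index[e["src"]], index[e["dst"]], e) for e in edges]
    return long_vertices, long_edges


long_vertices, long_edges = to_long_ids(vertices, edges)
src_id, dst_id, edge_row = long_edges[0]
# The original string id is still recoverable from the attribute row.
print(src_id, dst_id, long_vertices[dst_id][1]["id"], edge_row["delay"])
```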
Outline
GraphFrames overview
GraphFrames vs. GraphX and other libraries
Details for power users
Roadmap and resources
Scalability
Current status
• DataFrame-based parts benefit from DataFrame scalability +
performance optimizations (Catalyst, Tungsten).
• GraphX wrappers are as fast as GraphX (+ conversion overhead).
WIP
• GraphX has optimizations which are not yet ported to GraphFrames.
• See next slide…
WIP optimizations
Join elimination
• GraphFrame algorithms require lots of joins.
• Not all joins are necessary.
Solution:
• Vertex IDs serve as unique keys.
• Tracking keys allows Catalyst to eliminate some joins.
Materialized views
• Data locality for common use cases
• Message-passing algorithms often need “triplet view” (src, edge, dst)
Solution:
• Materialize specific views
• Analogous to GraphX’s “replicated vertex view”
For more info & benchmark results, see Ankur Dave’s SSE 2016 talk.
https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/
Implementing new algorithms
Method 1: DataFrame & GraphFrame operations
Motif finding
• Series of DataFrame joins
Triangle count
• DataFrame ops + motif finding
BFS
• DataFrame joins & filters
Method 2: Message passing
aggregateMessages
• Same primitive as GraphX
• Specify messages & aggregation using DataFrame expressions
Belief propagation example code
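The message-passing primitive named on this slide, aggregateMessages, can be pictured as two steps: each edge sends a message to its source and/or destination vertex, then each vertex aggregates what it received. The plain-Python sketch below shows that shape with ordinary callables. In GraphFrames the messages and the aggregation are given as DataFrame expressions instead, so the signature here is an assumption made for illustration, as is the SEA to SFO edge.

```python
vertices = [{"id": "JFK"}, {"id": "SEA"}, {"id": "DFW"}, {"id": "SFO"}]
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "DFW", "dst": "SFO", "delay": -7},
    {"src": "SEA", "dst": "SFO", "delay": 10},  # illustrative edge, not from the slides
]


def aggregate_messages(vertices, edges, send_to_dst=None, send_to_src=None, agg=sum):
    """Send one message per edge to dst and/or src, then aggregate per vertex."""
    by_id = {v["id"]: v for v in vertices}
    inbox = {v["id"]: [] for v in vertices}
    for e in edges:
        src, dst = by_id[e["src"]], by_id[e["dst"]]
        if send_to_dst is not None:
            inbox[dst["id"]].append(send_to_dst(src, e, dst))
        if send_to_src is not None:
            inbox[src["id"]].append(send_to_src(src, e, dst))
    # Vertices that received no message are left out of the result.
    return {vid: agg(msgs) for vid, msgs in inbox.items() if msgs}


# Total delay of flights arriving at each airport.
arrival_delay = aggregate_messages(
    vertices, edges, send_to_dst=lambda src, e, dst: e["delay"]
)
print(arrival_delay)
```

Swapping `agg` for `max` or `len` gives the worst arriving delay or the in-degree with the same loop, which is the appeal of having a single primitive.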
Outline
GraphFrames overview
GraphFrames vs. GraphX and other libraries
Details for power users
Roadmap and resources
Current status
Published
• Open source (Apache 2.0) on GitHub
https://github.com/graphframes/graphframes
• Spark package http://spark-packages.org/package/graphframes/graphframes
Compatible
• Spark 1.4, 1.5, 1.6
• Databricks Community Edition
Documented
• http://graphframes.github.io/
Roadmap
• Merge WIP speed optimizations
• Java API tests & examples
• Migrate more algorithms to DataFrame-based
implementations for greater scalability
• Get community feedback!
Contribute
• Tracking issues on GitHub
• Thanks to those who have
already sent pull requests!
Resources for learning more
User guide + API docs http://graphframes.github.io/
• Quick-start
• Overview & examples for all algorithms
• Also available as executable notebooks:
  • Scala: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-scala.html
  • Python: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-python.html
Blog posts
• Intro: https://databricks.com/blog/2016/03/03/introducing-graphframes.html
• Flight delay analysis: https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-spark-graphframes.html
Thank you!
Thanks to
• Denny Lee & Bill Chambers (demo)
• Tim Hunter, Xiangrui Meng, Ankur Dave & others (GraphFrames development)
