© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Graph Analytics
with Greenplum and Apache MADlib
Pivotal Korea
HongDon Lee, Sr. Data Scientist
30th January 2019
Agenda
1. Why Graph Analytics?
2. What is Graph Analytics?
3. Graph Analytics w/ MADlib
[ Graph Analytics: Why, Where to use, What, How ]
Everything is connected!
“Nothing ever exists entirely alone. Everything is in relation to everything else (緣起, dependent origination)”
- Buddha -
“Learn how to see: Everything is connected to everything else”
- Leonardo da Vinci -
“In nature we never see anything isolated, but everything in connection with something else which is before it, beside it, under it and over it”
- Goethe -
What a Small World!
“6 Degrees of Separation”
Stanley Milgram, small-world experiment (1967)
From Reductionism to Holism
Reductionism: “Divide and Conquer”
vs.
Holism: “Everything has to be understood in relation to the whole”
From Individual to Relation
At the individual level: “Who are you?”
● Features (Cross-sectional Perspective): Demographics, Behaviors, Preferences, Economic Status, Education Background, ...
● Time (Longitudinal Perspective): 2019.01.01, 2019.01.02, 2019.01.03, 2019.01.04, 2019.01.05, 2019.01.06, 2019.01.30, ...
From Individual to Relation
● Features (Cross-sectional Perspective): Demographics, Behaviors, Preferences, Economic Status, Education Background, ...
● Time (Longitudinal Perspective): 2019.01.01, 2019.01.02, 2019.01.03, 2019.01.04, 2019.01.05, 2019.01.06, 2019.01.30, ...
● Relation/ Connection: Family, Friends, Colleagues, Community, ...
“Tell me who your friends are and I’ll tell you who you are”
- Mexican Proverb -
Graph Analytics, one of the Data Scientist’s knives
❏ Graph Analytics
❏ t-Test, ANOVA
❏ CNN, RNN, GAN
❏ Random Forest, XGBoost
❏ Bayesian Statistics
❏ Regression, Logistic Regression
❏ PCA, factor analysis
❏ Clustering
❏ Text Analysis, NLP
Which knife to use depends on the business problem and the data.
Network: Everywhere with Everything, All the time
❏ MMO Role-Playing Game (* www.researchgate.net)
❏ Chemistry (* https://www.nature.com/articles/)
❏ Social Network, Epidemiology (* http://www.netminer.com/community, * Grandjean, M. (2016))
❏ Bank Risk (* https://cambridge-intelligence.com)
❏ 1st Party Fraud, Manufacturing, Gene (* www.infoglide.com, * https://blog.trifinance.com, * www.researchgate.net)
Use Cases - PageRank
● Measures the importance of a vertex in a graph by counting the number and
quality of the links to that vertex
❏ Web Search
❏ Scientific impact of researchers
❏ Neuroscience
❏ Street and space usage
* Image from https://en.wikipedia.org/wiki/PageRank
Use Cases - Single Source Shortest Path
● Find a path to every vertex so that the sum of the weights of its constituent
edges is minimized
❏ Vehicle routing/ navigation
❏ Degrees of separation in a social network
❏ Min-delay path in a telecommunications network
❏ Plant and facility layout
❏ VLSI (Very-Large-Scale Integration) design
Use Cases - Cyber-security by Graph model
● Using historical window events data to build
historical graphs* of typical user behavior
• Which machines does the user log in to?
• Which machines does the user log in from?
• How often?
• In which order?
● Is this behavior typical?
• Is it typical for this user?
• Is it typical for someone in a particular department?
• Is this typical for someone in the user’s job role?
● Graph models are sensitive to direction, order, and frequency.
[ Figure: Typical Behavior vs. Anomalous Behavior; login graphs over hosts 34.23.123.4, 34.23.123.51, 34.23.1.1, 34.23.0.1 and 34.23.2.8, including a DB with financial information ]
*Reference: Alexander D. Kent, Lorie M. Liebrock, Joshua C. Neil. Authentication graphs: Analyzing user behavior within an enterprise network.
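The questions on this slide (which machines, from where, how often) can be sketched as a directed, counted login graph. The following is a minimal Python illustration, not the method of the cited paper: the event list, the function names and the `min_count` threshold are all hypothetical, and login order is not modeled here.

```python
from collections import Counter


def build_login_graph(events):
    """Count how often each directed (login from, login to) edge occurs."""
    return Counter((src, dst) for src, dst in events)


def is_typical(graph, src, dst, min_count=1):
    """Treat a login as typical if that exact directed edge was seen often enough."""
    return graph[(src, dst)] >= min_count


# Hypothetical login history of one user: (login from, login to)
history = [
    ("34.23.123.4", "34.23.1.1"),
    ("34.23.1.1", "34.23.0.1"),
    ("34.23.123.4", "34.23.1.1"),
    ("34.23.0.1", "34.23.2.8"),
]
graph = build_login_graph(history)

seen_before = is_typical(graph, "34.23.123.4", "34.23.1.1")      # edge seen twice
reversed_edge = is_typical(graph, "34.23.1.1", "34.23.123.4")    # direction matters
new_target = is_typical(graph, "34.23.123.4", "34.23.123.51")    # never seen
```

Because the edges are directed and counted, the same pair of machines in the opposite direction is treated as a different, unseen behavior.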
Use Cases - Connected Component
● Calculate the Jaccard Dissimilarity Scores for each pair of materials
● If materials X and Y are potential duplicates and materials Y and Z are potential duplicates, then X, Y, Z form a connected component in the graph of all materials, i.e., a cluster
⇒ Connected component analysis identified 10% of materials as potential duplicates based on their bill of material attributes
[ Figure: materials X, Y, Z linked into one connected component ]
Features for each material
• part type
• material type & group
• product line & family
• revision key
• weld, material & coating specs
• quality matrix
• unit of measurement
• weight
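The duplicate-detection idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the feature sets, the material names and the 0.4 threshold are invented, and a small union-find stands in for whatever connected-component routine was actually used.

```python
from itertools import combinations


def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two feature sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def duplicate_clusters(materials, threshold):
    """Link pairs with dissimilarity <= threshold, then return the connected components."""
    parent = {m: m for m in materials}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(materials, 2):
        if jaccard_dissimilarity(materials[a], materials[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for m in materials:
        clusters.setdefault(find(m), set()).add(m)
    # Only components with more than one material are potential duplicates
    return [c for c in clusters.values() if len(c) > 1]


# Hypothetical feature sets per material
materials = {
    "X": {"bolt", "steel", "line-A", "zinc", "kg"},
    "Y": {"bolt", "steel", "line-A", "zinc", "g"},
    "Z": {"bolt", "steel", "line-B", "zinc", "g"},
    "W": {"gasket", "rubber", "line-C", "none", "kg"},
}
clusters = duplicate_clusters(materials, threshold=0.4)
```

In this toy data X and Z are not similar enough to be linked directly, but both are linked to Y, so all three end up in one cluster, which is the transitivity described on the slide.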
Agenda
1. Why Graph Analytics?
2. What is Graph Analytics?
3. Graph Analytics w/ MADlib
[ Graph Analytics: Why, Where to use, What, How ]
The Origin of Graph Theory
● Seven Bridges of Königsberg problem
“The problem was to devise a walk through the city that would cross each of those bridges once and only once.”
● Leonhard Euler, a mathematician, proved in 1736 that the problem has no solution
[ Portrait: Euler, 1753 ]
What is Graph Theory?
● Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects.
[ Terminology of Graph theory ]
● Vertex (V), also called Node, Point, Actor
● Edge (E), also called Link, Arc, Line; Directed or Undirected
● Weight: the number attached to an edge (e.g., 10)
[ Figure: Directed Network Graph with Weight (example); vertices 0 to 7, edge weights 1, 2, 10, 1, 10, 1, 1, 3, 1, -2, 1, 1 ]
Graph Algorithms and Measures
Graph-based Features, by type:
1. Group
   Question: “What are the sub-graphs, components, communities?”
   Features: Weakly-connected component
2. Structure
   Question: “What is the character of the network structure?”
   Features: Density, Diameter, Average path length, Modularity
3. Centrality
   Question: “What are the most important vertices within a graph?”
   Features: Degree (in/out, weight), Closeness, PageRank, Hub, Authority, Betweenness, Clustering coefficient
4. Path
   Question: “What is the shortest path (distance) among vertices?”
   Features: Single source shortest path, All pairs shortest path, Breadth-First Search
Graph Algorithms and Measures - (1) Group
Weakly-Connected Component
● A Connected Component (or just Component) of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the supergraph
* source: Wikipedia
● In a directed graph, a weakly connected component is a connected component of the graph obtained by ignoring the edge directions
[ Figure: A supergraph with three connected components (Component 1, Component 2, Component 3) ]
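As a minimal sketch of the definition above, the following Python function finds weakly connected components with a breadth-first search that ignores edge direction. The vertex ids and edges are made up so that the result has three components, like the figure. This is not how MADlib implements it.

```python
from collections import deque


def weakly_connected_components(vertices, edges):
    """Return the components of a directed graph, ignoring edge direction."""
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # treat every edge as undirected

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            v = queue.popleft()
            component.add(v)
            for n in neighbors[v]:
                if n not in seen:
                    seen.add(n)
                    queue.append(n)
        components.append(component)
    return components


# Hypothetical directed graph with an isolated vertex 7
vertices = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (3, 2), (4, 5), (6, 5)]
components = weakly_connected_components(vertices, edges)
```

Vertices 1 and 3 are in the same component even though no directed path joins them, because both point at vertex 2.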
Graph Algorithms and Measures - (2) Structure
Density
● A dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. The distinction between sparse and dense graphs is rather vague, and depends on the context.
* source: Wikipedia
❏ For Undirected simple graphs: D = |E| / ( |V| (|V| - 1) / 2 )
❏ For Directed simple graphs: D = |E| / ( |V| (|V| - 1) )
|E| : the number of Edges, |V| : the number of Vertices
[ Density by components (example) ]
4 vertices, 6 edges: D = 6 / ( 4 (4 - 1) / 2 ) = 1
4 vertices, 3 edges: D = 3 / ( 4 (4 - 1) / 2 ) = 0.5
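The two density formulas translate directly into code. The function below is a small Python sketch (the function name and signature are illustrative) that reproduces the two example values from the slide.

```python
def density(num_vertices, num_edges, directed=False):
    """Density of a simple graph: edges present / maximal possible edges."""
    if num_vertices < 2:
        raise ValueError("density needs at least two vertices")
    max_edges = num_vertices * (num_vertices - 1)  # directed: ordered pairs
    if not directed:
        max_edges /= 2  # undirected: unordered pairs
    return num_edges / max_edges


dense_component = density(4, 6)                   # 6 / (4 * 3 / 2)
sparse_component = density(4, 3)                  # 3 / (4 * 3 / 2)
same_edges_directed = density(4, 6, directed=True)
```

The same six edges give a density of 1 when read as undirected and 0.5 when read as directed, since a directed simple graph can hold twice as many edges.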
Graph Algorithms and Measures - (3) Path
Single Source Shortest Path (SSSP)
● Given a graph and a source vertex, the Single Source Shortest Path (SSSP) algorithm finds a path from the source vertex to every other vertex in the graph, such that the sum of the weights of the edges on the path is minimized.
[ Figure: directed weighted graph, vertices 0 to 7 ]
[ Shortest paths from vertex ‘0’ (example) ]
ID  weight      parent
0   0           0
1   1           0
2   1           0
3   2 (= 1+1)   2
4   10          0
5   2           2
6   3           5
7   4           6
* weight : The total weight of the shortest path from the source vertex to this particular vertex.
* parent : The parent of this vertex in the shortest path from source.
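The table above can be reproduced with a short Bellman-Ford sketch in Python. Bellman-Ford is chosen here because the example graph contains a negative weight (-2). The edge endpoints in `EDGES` are an assumption: the slide shows only the weights in a figure, and this list was reconstructed so that it agrees with the shortest-path and degree tables in the deck.

```python
import math

# (source, destination, weight); endpoints are assumed, see the note above
EDGES = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2),
    (1, 3, 10), (2, 3, 1), (2, 5, 1), (2, 6, 3),
    (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]


def bellman_ford(num_vertices, edges, source):
    """Return (weight, parent) lists of the shortest paths from source."""
    weight = [math.inf] * num_vertices
    parent = [None] * num_vertices
    weight[source], parent[source] = 0, source

    # Relax every edge up to |V| - 1 times; stop early once nothing changes
    for _ in range(num_vertices - 1):
        changed = False
        for src, dst, w in edges:
            if weight[src] + w < weight[dst]:
                weight[dst] = weight[src] + w
                parent[dst] = src
                changed = True
        if not changed:
            break

    # One more pass: any further improvement means a negative cycle
    for src, dst, w in edges:
        if weight[src] + w < weight[dst]:
            raise ValueError("graph contains a negative-weight cycle")
    return weight, parent


weight, parent = bellman_ford(8, EDGES, source=0)
```

With the assumed edge list, `weight` and `parent` match the ID / weight / parent columns of the table, for example vertex 3 is reached through vertex 2 with total weight 2.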
Graph Algorithms and Measures - (4) Centrality
In-Degree, Out-Degree
● The node in-degree is the number of edges pointing in to the node
● The node out-degree is the number of edges pointing out of the node
[ Figure: directed weighted graph, vertices 0 to 7 ]
[ The in-out degree for each node (example) ]
ID  In-degree  Out-degree
0   2          3
1   1          2
2   2          3
3   2          1
4   1          1
5   1          1
6   2          1
7   1          0
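Counting degrees is a single pass over the edge list. The Python sketch below uses an assumed edge list (the slide gives only a figure) whose endpoints were chosen to be consistent with the degree table above.

```python
from collections import Counter

# (source, destination); endpoints are an assumption consistent with the table
EDGES = [
    (0, 1), (0, 2), (0, 4), (1, 2),
    (1, 3), (2, 3), (2, 5), (2, 6),
    (3, 0), (4, 0), (5, 6), (6, 7),
]

out_degree = Counter(src for src, _ in EDGES)  # edges pointing out of a node
in_degree = Counter(dst for _, dst in EDGES)   # edges pointing in to a node

# One row per vertex: (ID, in-degree, out-degree); missing keys count as 0
degree_table = [(v, in_degree[v], out_degree[v]) for v in range(8)]
```

Every edge contributes exactly one in-degree and one out-degree, so both columns sum to the number of edges (12 here).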
Graph Algorithms and Measures - (4) Centrality
PageRank (1 / 2)
● PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites
* source: Wikipedia
[ Figure: The size of each face is proportional to the total size of the other faces which are pointing to it. ]
PR(A) = (1 - d) / N + d * ( PR(B)/L(B) + PR(C)/L(C) + ... ), summed over the nodes B, C, ... that link to A
- PR(A): PageRank of node A
- N: the total number of Nodes
- L(B): the number of Links from node B
- d: damping factor (probability, at any step, that a surfer will continue randomly clicking on links)
Graph Algorithms and Measures - (4) Centrality
PageRank (2 / 2)
[ Example: four nodes A, B, C, D; N = 4, d = 0.85 ]
1st round (initial values):
PR(A) = 0.25
PR(B) = 0.25
PR(C) = 0.25
PR(D) = 0.25
2nd round:
PR(A) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/1 + 0.25/3) = 0.427
PR(B) = (1-0.85)/4 + 0.85*(0.25/3) = 0.108
PR(C) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/3) = 0.215
PR(D) = (1-0.85)/4 + 0.85*0 = 0.038
...
Final round (recursive calculation → converged):
PR(A) = 0.127
PR(B) = 0.048
PR(C) = 0.069
PR(D) = 0.038
* PR(A): PageRank of node A, N: the total number of Nodes, L(B): the number of Links from node B, d: damping factor (typically 0.85)
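The rounds above can be replayed with a few lines of Python. The link structure in `LINKS` is not stated on the slide. It is inferred from the arithmetic of the 2nd round (B links to A and C, C links to A, D links to A, B and C, A has no outgoing links), so treat it as an assumption. This is a plain iteration for illustration, not MADlib's implementation.

```python
# node -> nodes it links to (inferred from the slide's arithmetic, see above)
LINKS = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}


def pagerank_step(ranks, links, d=0.85):
    """One round: PR(X) = (1 - d) / N + d * sum(PR(Y) / L(Y)) over Y linking to X."""
    n = len(ranks)
    new_ranks = {node: (1 - d) / n for node in ranks}
    for src, targets in links.items():
        for dst in targets:
            new_ranks[dst] += d * ranks[src] / len(targets)
    return new_ranks


def pagerank(links, d=0.85, tol=1e-10, max_iter=100):
    """Repeat rounds until no value changes by more than tol."""
    ranks = {node: 1 / len(links) for node in links}
    for _ in range(max_iter):
        new_ranks = pagerank_step(ranks, links, d)
        if max(abs(new_ranks[k] - ranks[k]) for k in ranks) < tol:
            return new_ranks
        ranks = new_ranks
    return ranks


second_round = pagerank_step({node: 0.25 for node in LINKS}, LINKS)
final_round = pagerank(LINKS)
```

Because node A has no outgoing links and its rank is not redistributed in this simplified scheme, the converged values do not sum to 1, which is consistent with the final-round numbers on the slide.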
Graph Algorithms and Measures - (4) Centrality
Closeness
● Closeness of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph (i.e., based on All Pairs Shortest Path)
[ Figure: directed weighted graph, vertices 0 to 7 ]
[ All Pairs Shortest Path ]
source  destination  weight
0       0            0
0       1            1
0       2            1
0       3            2
0       4            10
0       5            2
0       6            3
0       7            4
1       0            4
1       1            0
1       2            2
1       3            3
1       4            14
1       5            3
1       6            4
1       7            5
2       0            2
...
[ Closeness Centrality ]
src_id  closeness
0       0.043 (= 1/23)
1       0.028 (= 1/35)
2       0.041
3       0.035
* N: The number of nodes, d(y, x): The distance between nodes y and x
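A minimal Python sketch of this calculation: Floyd-Warshall for all pairs shortest paths, then closeness as 1 / (sum of shortest-path weights from the vertex). The edge endpoints are an assumption reconstructed to match the tables in the deck, and skipping unreachable vertices (returning 0.0 when nothing is reachable) is a convention chosen for this sketch, not something the slide specifies.

```python
import math

# (source, destination, weight); endpoints are assumed, see the note above
EDGES = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2),
    (1, 3, 10), (2, 3, 1), (2, 5, 1), (2, 6, 3),
    (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]


def all_pairs_shortest_path(num_vertices, edges):
    """Floyd-Warshall; dist[i][j] is math.inf when j is unreachable from i."""
    dist = [[math.inf] * num_vertices for _ in range(num_vertices)]
    for v in range(num_vertices):
        dist[v][v] = 0
    for src, dst, w in edges:
        dist[src][dst] = min(dist[src][dst], w)
    for k in range(num_vertices):
        for i in range(num_vertices):
            for j in range(num_vertices):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def closeness(dist, vertex):
    """Reciprocal of the summed distances to the reachable vertices."""
    total = sum(d for d in dist[vertex] if d != math.inf)
    return 1 / total if total > 0 else 0.0


dist = all_pairs_shortest_path(8, EDGES)
closeness_by_vertex = {v: closeness(dist, v) for v in range(4)}
```

For vertex 0 the distances sum to 23 and for vertex 1 to 35, which gives the first two closeness values in the table (the slide appears to truncate to three decimals).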
Big Issue of Graph Algorithms - High Complexity
* image source: https://www.xkcd.com/399/
Big Issue of Graph Algorithms - High Complexity
Type        Algorithms/ Measures          Time Complexity
Group       Weakly-Connected Component    O(|V| + |E|)
Structure   Density                       O(|V| + |E| log|E|)
Structure   Diameter                      O(|V|^3)
Path        All Pairs Shortest Path       O(|V|^3)
Path        Single Source Shortest Path   O(|V|^2)
Path        Breadth-First Search          O(|V| + |E|)
Centrality  In-Degree, Out-Degree         O(|V| + |E|)
Centrality  Closeness Centrality          O(|V|^3)
Centrality  PageRank                      O(log(network size) / (1 - damping factor))
Centrality  Betweenness Centrality        O(|V|^2 log|V| + |V||E|)
* |V|: the number of Vertices in graph
* |E|: the number of Edges in graph
● Computationally intensive: the cost grows polynomially, up to cubically, with the number of vertices and edges
⇒ Graph Analysis at Scale, parallel processing with MADlib on Greenplum
Agenda
1. Why Graph Analytics?
2. What is Graph Analytics?
3. Graph Analytics w/ MADlib
[ Graph Analytics: Why, Where to use, What, How ]
Tools for Graph Analytics
● Graph Analytics at Scale with Open Source MADlib on Greenplum
[ Figure: tools placed by Data Size & Processing (Small Data vs. Big Data/ Parallel Processing) and by whether OSS or not (Commercial vs. Open Source) ]
❏ sna, igraph, ergm, network
❏ NetworkX, graph-tool, SNAP, pygraphviz
❏ ...
❏ Interactive visualization focused
❏ Graph DB
Graph Analytics with MADlib on Greenplum
Analytics Platform, GPDB
❏ Advanced Analytics In Database
❏ REGRESSION, CLASSIFICATION, CLUSTERING, GEOSPATIAL, GRAPH, TEXT, IMAGE
❏ Extended Language
❏ GPText
❏ Scale Out
❏ MPP (Massively Parallel Processing) Architecture
Graph Analytics at Scale
● Designed for very large graphs (billions of vertices/edges)
● No need to move data and transform for external graph engine
- One analytics database to deploy and manage
● Familiar SQL interface
● Combine context-based graph analytics with other content-based techniques
Apache MADlib: Big Data Machine Learning in SQL
Scalable, In-Database Machine Learning
● Open source https://github.com/apache/madlib
● Downloads and docs http://madlib.apache.org/
● Wiki https://cwiki.apache.org/confluence/display/MADLIB/
❏ Open source, top level Apache project
❏ For PostgreSQL and Greenplum Database
❏ Powerful machine learning, graph, statistics and analytics for data scientists
MADlib Functions (April 2018)
Data Types and Transformations
Array and Matrix Operations
Matrix Factorization
• Low Rank
• Singular Value Decomposition (SVD)
Norms and Distance Functions
Sparse Vectors
Encoding Categorical Variables
Path Functions
Pivot
Sessionize
Stemming
Graph
All Pairs Shortest Path (APSP)
Breadth-First Search
Hyperlink-Induced Topic Search (HITS)
Average Path Length
Closeness Centrality
Graph Diameter
In-Out Degree
PageRank and Personalized PageRank
Single Source Shortest Path (SSSP)
Weakly Connected Components
Model Selection
Cross Validation
Prediction Metrics
Train-Test Split
Statistics
Descriptive Statistics
• Cardinality Estimators
• Correlation and Covariance
• Summary
Inferential Statistics
• Hypothesis Tests
Probability Functions
Supervised Learning
Neural Networks
Support Vector Machines (SVM)
Conditional Random Field (CRF)
Regression Models
• Clustered Variance
• Cox-Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Naïve Bayes
• Ordinal Regression
• Robust Variance
Tree Methods
• Decision Tree
• Random Forest
Time Series Analysis
• ARIMA
Unsupervised Learning
Association Rules (Apriori)
Clustering (k-Means)
Principal Component Analysis (PCA)
Topic Modelling (Latent Dirichlet Allocation)
Utility Functions
Columns to Vector
Conjugate Gradient
Linear Solvers
• Dense Linear Systems
• Sparse Linear Systems
Mini-Batching
PMML Export
Term Frequency for Text
Vector to Columns
Nearest Neighbors
• k-Nearest Neighbors
Sampling
Balanced/ Random/ Stratified Sampling
Graph Representation in MADlib
[ Vertex Table ]
Vertex  Vertex Params
0       ...
1       ...
2       ...
3       ...
...     ...
[ Edge Table ]
Source Vertex  Dest Vertex  Edge Weight  Edge Params
0              3            1.0          ...
1              0            5.0          ...
1              2            3.0          ...
2              3            8.0          ...
3              0            3.0          ...
3              1            2.0          ...
...            ...          ...          ...
[ Figure: Directed Graph (example); vertices 0 to 3, edge weights as in the Edge Table ]
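To make the two-table representation concrete, here is a small Python sketch that turns the rows of the example vertex and edge tables into an adjacency list. The variable and function names are illustrative only, and the parameter columns are left out. MADlib itself works on the tables directly in SQL.

```python
# Rows of the example tables above (parameter columns omitted)
vertex_rows = [0, 1, 2, 3]
edge_rows = [
    (0, 3, 1.0), (1, 0, 5.0), (1, 2, 3.0),
    (2, 3, 8.0), (3, 0, 3.0), (3, 1, 2.0),
]


def to_adjacency(vertices, edges):
    """Map each vertex to its outgoing (destination, weight) pairs."""
    adjacency = {v: [] for v in vertices}
    for src, dest, weight in edges:
        adjacency[src].append((dest, weight))
    return adjacency


adjacency = to_adjacency(vertex_rows, edge_rows)
```

Keeping a separate vertex table means that a vertex without any edges still appears in the graph, here as a key with an empty list.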
Example: PageRank in MADlib
● Create vertex and edge tables to represent the graph
* http://madlib.apache.org/docs/latest/group__grp__pagerank.html
DROP TABLE IF EXISTS vertex;
CREATE TABLE vertex(
id INTEGER
);
INSERT INTO vertex VALUES
(0),
(1),
(2),
(3),
(4),
(5),
(6);
DROP TABLE IF EXISTS edge;
CREATE TABLE edge(
src INTEGER,
dest INTEGER,
user_id INTEGER
)
DISTRIBUTED BY (user_id);
INSERT INTO edge VALUES
(0, 1, 1), (0, 2, 1), -- user id 1
(0, 4, 1), (1, 2, 1),
(1, 3, 1), (2, 3, 1),
(2, 5, 1), (2, 6, 1),
(3, 0, 1), (4, 0, 1),
(5, 6, 1), (6, 3, 1),
(0, 1, 2), (0, 2, 2), -- user id 2
(0, 4, 2), (1, 2, 2),
(1, 3, 2), (2, 3, 2),
(3, 0, 2), (4, 0, 2),
(5, 6, 2), (6, 3, 2);
Example: PageRank in MADlib
● Compute the PageRank with All IDs
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
SELECT madlib.pagerank(
'vertex' -- Vertex table
, 'id' -- Vertex id column
, 'edge' -- Edge table
, 'src=src, dest=dest' -- Comma delimited string of edge arguments
, 'pagerank_out' -- Output table of PageRank
, NULL); -- Damping factor (default 0.85)
SELECT * FROM pagerank_out
ORDER BY pagerank DESC;
Example: PageRank in MADlib
● Network Diagram by Graphviz and PyGraphviz
* PyGraphviz is a Python interface to the Graphviz graph layout and visualization package
[ Figure: network diagrams for User ID 1 and User ID 2; each node is labeled with its ID and (PageRank), and the size of a node is proportional to its PageRank value ]
A vertex with a high PageRank is usually considered more "important" or more "influential" or more "relevant" than a vertex with a low PageRank.
Example: PageRank in MADlib
● PageRank of vertices associated with each user by the grouping feature
DROP TABLE IF EXISTS pagerank_gr_out, pagerank_gr_out_summary;
SELECT madlib.pagerank(
'vertex' -- Vertex table
, 'id' -- Vertex id column
, 'edge' -- Edge table
, 'src=src, dest=dest' -- Comma delimited string of edge arguments
, 'pagerank_gr_out' -- Output table of PageRank
, NULL -- Default damping factor (0.85)
, NULL -- Default max iterations (100)
, 0.00000001 -- Threshold
, 'user_id' -- Grouping column name
);
SELECT * FROM pagerank_gr_out
ORDER BY user_id, pagerank DESC;
[ Output: PageRank of user id 1, PageRank of user id 2 ]
Example: PageRank in MADlib
● Personalized PageRank of vertices {2, 4}, for Recommendations
DROP TABLE IF EXISTS pagerank_pers_out,
pagerank_pers_out_summary;
SELECT madlib.pagerank(
'vertex' -- Vertex table
, 'id' -- Vertex id column
, 'edge' -- Edge table
, 'src=src, dest=dest' -- Comma delimited string of edge arguments
, 'pagerank_pers_out' -- Output table of PageRank
, NULL -- Default damping factor (0.85)
, NULL -- Default max iterations (100)
, NULL -- Default threshold (1/(number of vertices * 1000))
, NULL -- No Grouping
, '{2, 4}' -- Personalization vertices
);
SELECT * FROM pagerank_pers_out
ORDER BY pagerank DESC;
SELECT * FROM
pagerank_pers_out_summary;
* Personalized PageRank = p*Ax + (1-p)*E , where ‘E’ is the list of vertices for personalized PageRank, ‘p’ is the damping factor
Example: PageRank in MADlib
PageRank Performance on Greenplum w/ MADlib
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
Normal random graphs with a mean degree of 50 edges per vertex (i.e., 5B edges in the largest case)
[ Figure: log-log plot; horizontal axis ticks 1K, 10K, 100K, 1M, 10M, 100M; vertical axis ticks 1s, 100s, 10K s, 1M s; the largest case is marked 5B edges ]
In Summary
● Capture the Relationship in Networks using Graph Analytics
→ Community, Structure, Path, Centrality
→ Combine context-based graph analytics with other content-based insights
● Graph analytics at SCALE with Open Source Software
→ Apache MADlib on Greenplum, massively parallel processing
One more thing...
GREENPLUM SUMMIT at PostgresConf 2019
by Pivotal
Transforming How The World Builds Software


Graph Analytics with Greenplum and Apache MADlib

  • 1. © Copyright 2019 Pivotal Software, Inc. All rights Reserved. Graph Analytics with Greenplum and Apache MADlib Pivotal Korea HongDon Lee, Sr. Data Scientist 30th January 2019
  • 2. Agenda 1. Why Graph Analytics? 2. What is Graph Analytics? 3. Graph Analytics w/ MADlib Graph Analytics Why Where to use What How
  • 3. Everything is connected! “Nothing ever exists entirely alone. Everything is in relation to everything else (緣起)” “Learn how to see: Everything is connected to everything else” “In nature we never see anything isolated, but everything in connection with something else which is before it, beside it, under it and over it” Buddha Leonardo da vinci Goethe 3
  • 4. What a Small World! “6 Degrees of Separation” 1973, Stanley Milgram, Small-world experiment 1 2 3 4 5 6 4
  • 5. From Reductionism to Holism Reductionism Holism “Divide and Conquer” vs. “Everything has to be understood in relation to the whole” 5
  • 6. From Individual to Relation Time Features 2019.01.01 2019.01.02 2019.01.03 2019.01.04 2019.01.05 2019.01.06 2019.01.30 ... Cross-sectional Perspective Longitudinal Perspective At the individual levelDemographics Behaviors Preferences Economic Status Education Background ... W ho are you? 6
  • 7. From Individual to Relation Time Features Demographics Behaviors Preferences Economic Status Education Background 2019.01.01 ... 2019.01.02 2019.01.03 2019.01.04 2019.01.05 2019.01.06 2019.01.30 ... Cross-sectional Perspective Longitudinal Perspective Relation/ Connection Family Friends Colleagues Community ... “Tell me who your friends are and I’ll tell you who you are” - Mexican Proverb - 7
  • 8. Graph Analytics, one of the Data Scientist’s knifes Graph Analytics t-Test, ANOVA CNN, RNN, GAN Random Forest, XGBoost Bayesian Statistics Regression, Logistic Regression PCA, factor analysis Clustering Text Analysis, NLP Depends on business problem and data 8
  • 9. Network: Everywhere with Everything, All the time MMO Role-Playing Game * www.researchgate.net Chemistry * https://www.nature.com/articles/ Social Network Epidemiology * http://www.netminer.com/community* Grandjean, M. (2016) Bank Risk * https://cambridge-intelligence.com 1st Party Fraud Manufacturing * www.infoglide.com * https://blog.trifinance.com* www.researchgate.net Gene 9
  • 10. Use Cases - PageRank ● Measures the importance of a vertex in a graph by counting the number and quality of the links to that vertex ❏ Web Search ❏ Scientific impact of researchers ❏ Neuroscience ❏ Street and space usage * Image from https://en.wikipedia.org/wiki/PageRank 10
  • 11. Use Cases - Single Source Shortest Path ● Find a path to every vertex so that the sum of the weights of its constituent edges is minimized ❏ Vehicle routing/ navigation ❏ Degrees of separation in a social network ❏ Mid-delay path in a telecommunications network ❏ Plant and facility layout ❏ VLSI (Very-Large-Scale Integration) design 11
  • 12. Use Cases - Cyber-security by Graph model ● Using historical window events data to build historical graphs* of typical user behavior • Which machines does the user log in to? • Which machines does the user log in from? • How often? • In which order? ● Is this behavior typical? • Is it typical for this user? • Is it typical for someone in a particular department? • Is this typical for someone in the user’s job role? ● Graph models are sensitive to direction, order, and frequency. 34.23.123.4 Typical Behavior Anomalous Behavior DB with financial information 34.23.123.51 34.23.1.1 34.23.0.1 34.23.2.8 34.23.123.4 34.23.1.1 34.23.0.1 34.23.2.8 34.23.123.51 *Reference: Alexander D. Kenta, Lorie M. Liebrock, Joshua C. Neila. Authentication graphs: Analyzing user behavior within an enterprise network. vs. 12
  • 13. Use Cases - Connected Component ● Calculate the Jaccard Dissimilarity Scores for each pair of materials ● If material X and Y are potential duplicates and material Y and Z are potential duplicates then X, Y, Z is a connected component in the graph of all materials and form a cluster ⇒ Connected component analysis resulted in 10% of materials identified as potential duplicates based on their bill of material attributes Z X Y Z Features for each material • part type • material type & group • product line & family • revision key • weld, material & coating specs • quality matrix • unit of measurement • Weight : 13
  • 14. Agenda 1. Why Graph Analytics? 2. What is Graph Analytics? 3. Graph Analytics w/ MADlib Graph Analytics Why Where to use What How
  • 15. The Origin of Graph Theory ● Seven bridges of Konigsberg problem ● Leonhard Euler, a mathematician, proved that the problem has no solution “The problem was to devise a walk through the city that would cross each of those bridges once and only once.” Euler, 1753 15
  • 16. What is Graph Theory?
    ● Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects.
    [ Terminology of Graph theory ]
      V: Vertex, also called Node, Point, or Actor
      E: Edge, also called Link, Arc, or Line; edges are Directed or Undirected and may carry a Weight
    [ Figure: Directed Network Graph with Weight (example), vertices 0 to 7, with one vertex, one edge, and one weight (10) labeled ]
  • 17. Graph Algorithms and Measures
    Types, the question each answers, and the graph-based features it yields:
    ● Group
      Question: “What are the sub-graphs, components, communities?”
      Features: Weakly-connected component
    ● Structure
      Question: “What is the character of the network structure?”
      Features: Density, Diameter, Average path length, Modularity
    ● Centrality
      Question: “What are the most important vertices within a graph?”
      Features: Degree (in/out, weight), Closeness, PageRank, Hub, Authority, Betweenness, Clustering coefficient
    ● Path
      Question: “What is the shortest path (distance) between vertices?”
      Features: Single source shortest path, All pairs shortest path, Breadth-First Search
  • 18. Graph Algorithms and Measures - (1) Group
    Weakly-Connected Component
    ● A Connected Component (or just Component) of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the supergraph * source: Wikipedia
    [ Figure: A supergraph with three connected components (Component 1, Component 2, Component 3) ]
  • 19. Graph Algorithms and Measures - (2) Structure
    Density
    ● A dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. The distinction between sparse and dense graphs is rather vague, and depends on the context. * source: Wikipedia
      ❏ For undirected simple graphs: $D = \dfrac{|E|}{|V|\,(|V|-1)/2}$
      ❏ For directed simple graphs: $D = \dfrac{|E|}{|V|\,(|V|-1)}$
      |E|: the number of edges, |V|: the number of vertices
    [ Density by components (example) ]
      $D = \dfrac{6}{4\,(4-1)/2} = 1$
      $D = \dfrac{3}{4\,(4-1)/2} = 0.5$
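The two density formulas can be checked with a short Python sketch. The function name and its `directed` flag are illustrative choices, not part of any library mentioned in the deck; the two undirected calls reproduce the component examples on the slide.

```python
def density(num_vertices, num_edges, directed=False):
    """Graph density: actual edges divided by the maximum possible edges
    in a simple graph with the given number of vertices."""
    max_edges = num_vertices * (num_vertices - 1)
    if not directed:
        max_edges = max_edges / 2  # each undirected edge counts once
    if max_edges == 0:
        return 0.0  # a graph with fewer than two vertices has no possible edges
    return num_edges / max_edges


dense_component = density(4, 6)    # 4 vertices, 6 edges (complete graph)
sparse_component = density(4, 3)   # 4 vertices, 3 edges
directed_case = density(4, 6, directed=True)
```

The same six edges give density 1 when the graph is undirected but only 0.5 when it is directed, because a directed simple graph on four vertices can hold twelve edges.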
  • 20. Graph Algorithms and Measures - (3) Path
    Single Source Shortest Path (SSSP)
    ● Given a graph and a source vertex, the Single Source Shortest Path (SSSP) algorithm finds a path from the source vertex to every other vertex in the graph, such that the sum of the edge weights along each path is minimized.
    [ Figure: directed weighted graph, vertices 0 to 7, edge weights 1, 2, 10, 1, 10, 1, 1, 3, 1, -2, 1, 1 ]
    [ Shortest paths from vertex ‘0’ (example) ]
      ID   weight        parent
      0    0             0
      1    1             0
      2    1             0
      3    2 (= 1+1)     2
      4    10            0
      5    2             2
      6    3             5
      7    4             6
    * weight : The total weight of the shortest path from the source vertex to this particular vertex.
    * parent : The parent of this vertex in the shortest path from source.
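The table on this slide can be reproduced with a plain Bellman-Ford relaxation, sketched below in Python. Two assumptions are worth flagging: the slide does not state which algorithm produced the table, and the edge list is not recoverable from the transcript, so the edges here were chosen to be consistent with the twelve weights and the result table shown. Bellman-Ford is used because one edge has a negative weight (-2).

```python
# Assumed edge list (src, dest, weight), chosen to match the slide's weights and results.
EDGES = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2), (1, 3, 10), (2, 3, 1),
    (2, 5, 1), (2, 6, 3), (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]


def bellman_ford(num_vertices, edges, source):
    """Return (dist, parent) lists for shortest paths from `source`.

    Handles negative edge weights; raises ValueError on a negative cycle.
    """
    inf = float("inf")
    dist = [inf] * num_vertices
    parent = [None] * num_vertices
    dist[source] = 0
    parent[source] = source  # the slide lists the source as its own parent

    for _ in range(num_vertices - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                parent[v] = u
                changed = True
        if not changed:
            break  # no relaxation succeeded, distances are final

    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist, parent


dist, parent = bellman_ford(8, EDGES, source=0)
for vertex, (d, p) in enumerate(zip(dist, parent)):
    print(vertex, d, p)
```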
  • 21. Graph Algorithms and Measures - (4) Centrality
    In-Degree, Out-Degree
    ● The node in-degree is the number of edges pointing in to the node
    ● The node out-degree is the number of edges pointing out of the node
    [ Figure: directed weighted graph, vertices 0 to 7, edge weights 1, 2, 10, 1, 10, 1, 1, 3, 1, -2, 1, 1 ]
    [ The in-out degree for each node (example) ]
      ID   In-degree   Out-degree
      0    2           3
      1    1           2
      2    2           3
      3    2           1
      4    1           1
      5    1           1
      6    2           1
      7    1           0
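Counting degrees from an edge list is a one-pass operation, as the Python sketch below shows. The edge list is an assumption: the figure is not recoverable from the transcript, so the (src, dest) pairs were chosen to be consistent with the degree table on the slide.

```python
from collections import Counter

# Assumed (src, dest) pairs, consistent with the degree table on the slide.
EDGES = [
    (0, 1), (0, 2), (0, 4), (1, 2), (1, 3), (2, 3),
    (2, 5), (2, 6), (3, 0), (4, 0), (5, 6), (6, 7),
]

in_degree = Counter(dest for _, dest in EDGES)   # edges pointing in to each node
out_degree = Counter(src for src, _ in EDGES)    # edges pointing out of each node

# Counter returns 0 for missing keys, so vertex 7 gets out-degree 0.
degree_table = [(v, in_degree[v], out_degree[v]) for v in range(8)]
for row in degree_table:
    print(row)
```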
  • 22. Graph Algorithms and Measures - (4) Centrality
    PageRank (1 / 2)
    ● PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites * source: Wikipedia
    $PR(A) = \dfrac{1-d}{N} + d \sum_{B \to A} \dfrac{PR(B)}{L(B)}$, where the sum runs over every node B that links to A
      - PR(A): PageRank of node A
      - N: the total number of Nodes
      - L(B): the number of Links from node B
      - d: damping factor (probability, at any step, that a surfer will continue randomly clicking on links)
    [ Figure: The size of each face is proportional to the total size of the other faces which are pointing to it. ]
  • 23. Graph Algorithms and Measures - (4) Centrality
    PageRank (2 / 2)
    [ Example: four nodes A, B, C, D ]
    1st round
      PR(A) = 0.25
      PR(B) = 0.25
      PR(C) = 0.25
      PR(D) = 0.25
    2nd round
      PR(A) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/1 + 0.25/3) = 0.427
      PR(B) = (1-0.85)/4 + 0.85*(0.25/3) = 0.108
      PR(C) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/3) = 0.214
      PR(D) = (1-0.85)/4 + 0.85*0 = 0.037
    ... Recursive calculation → converged
    Final round
      PR(A) = 0.127
      PR(B) = 0.048
      PR(C) = 0.069
      PR(D) = 0.038
    * PR(A): PageRank of node A, N: the total number of Nodes, L(B): the number of Links from node B, d: damping factor (typically 0.85)
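The rounds on this slide can be replayed with a minimal Python sketch of the iteration. The link structure is an assumption inferred from the arithmetic on the slide (B links to A and C, C links to A, D links to A, B, and C, and A has no outgoing links); the figure itself is not recoverable. The sketch also follows the slide's simplified formula, in which the rank held by a node without outgoing links is not redistributed, so the converged values do not sum to 1.

```python
# Assumed link structure, inferred from the second-round arithmetic on the slide.
LINKS = {
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}


def pagerank_step(ranks, links, d=0.85):
    """One synchronous update: PR(A) = (1-d)/N + d * sum(PR(B)/L(B))."""
    n = len(ranks)
    new_ranks = {node: (1 - d) / n for node in ranks}
    for src, targets in links.items():
        for dest in targets:
            new_ranks[dest] += d * ranks[src] / len(targets)
    return new_ranks


def pagerank(links, d=0.85, tol=1e-10, max_iter=100):
    """Repeat pagerank_step from a uniform start until the ranks stop changing."""
    ranks = {node: 1 / len(links) for node in links}
    for _ in range(max_iter):
        new_ranks = pagerank_step(ranks, links, d)
        if max(abs(new_ranks[k] - ranks[k]) for k in ranks) < tol:
            return new_ranks
        ranks = new_ranks
    return ranks


first_round = {node: 0.25 for node in LINKS}
second_round = pagerank_step(first_round, LINKS)
converged = pagerank(LINKS)
```

Because this small example has no cycles, the iteration settles after a handful of rounds; graphs with cycles generally need more iterations, which is why the threshold and maximum-iteration parameters exist.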
  • 24. Graph Algorithms and Measures - (4) Centrality
    Closeness
    ● Closeness of a node is a measure of centrality in a network, calculated from the sum of the lengths of the shortest paths between the node and all other nodes in the graph (i.e., based on All Pairs Shortest Path). In the example below, closeness is the reciprocal of that sum: for vertex 0, 1 / (0+1+1+2+10+2+3+4) = 1/23 ≈ 0.043
    [ Figure: directed weighted graph, vertices 0 to 7, edge weights 1, 2, 10, 1, 10, 1, 1, 3, 1, -2, 1, 1 ]
    [ All Pairs Shortest Path ]
      source   destination   weight
      0        0             0
      0        1             1
      0        2             1
      0        3             2
      0        4             10
      0        5             2
      0        6             3
      0        7             4
      1        0             4
      1        1             0
      1        2             2
      1        3             3
      1        4             14
      1        5             3
      1        6             4
      1        7             5
      2        0             2
      ...
    [ Closeness Centrality ]
      src_id   closeness
      0        0.043
      1        0.028
      2        0.041
      3        0.035
    * N: The number of nodes
    * d(y, x) : The distance between y and x node
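The closeness values in the table follow from the all-pairs distances, which the Python sketch below recomputes. Assumptions: the edge list is chosen to be consistent with the weights and distances shown on the slide, Floyd-Warshall is used here only because it is short (the slide does not say how the distances were obtained), and closeness is taken as the reciprocal of the summed distances to reachable vertices, which is what the slide's numbers correspond to.

```python
INF = float("inf")

# Assumed edge list (src, dest, weight), consistent with the slide's tables.
EDGES = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2), (1, 3, 10), (2, 3, 1),
    (2, 5, 1), (2, 6, 3), (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]


def floyd_warshall(num_vertices, edges):
    """All-pairs shortest path distances as a matrix (INF if unreachable)."""
    dist = [[0 if i == j else INF for j in range(num_vertices)]
            for i in range(num_vertices)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    for k in range(num_vertices):
        for i in range(num_vertices):
            for j in range(num_vertices):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def closeness(dist_row):
    """Reciprocal of the summed distances to every reachable vertex."""
    total = sum(d for d in dist_row if d != INF)
    return 1 / total if total > 0 else 0.0


apsp = floyd_warshall(8, EDGES)
closeness_by_vertex = [closeness(row) for row in apsp]
```

The cubic triple loop in `floyd_warshall` is also a concrete view of why all-pairs measures such as closeness become expensive on large graphs.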
  • 25. Big Issue of Graph Algorithms - High Complexity * image source: https://www.xkcd.com/399/
  • 26. Big Issue of Graph Algorithms - High Complexity
      Type         Algorithms/ Measures           Time Complexity
      Group        Weakly-Connected Component     O(|V| + |E|)
      Structure    Density                        O(|V| + |E| log|E|)
                   Diameter                       O(|V|^3)
      Path         All Pairs Shortest Path        O(|V|^3)
                   Single Source Shortest Path    O(|V|^2)
                   Breadth-First Search           O(|V| + |E|)
      Centrality   In-Degree, Out-Degree          O(|V| + |E|)
                   Closeness Centrality           O(|V|^3)
                   PageRank                       O(log(network size)/(1 - damping factor))
                   Betweenness Centrality         O(|V|^2 log|V| + |V||E|)
    * |V|: the number of Vertices in graph
    * |E|: the number of Edges in graph
    ● Computationally intensive: cost grows polynomially, up to cubically, with the number of vertices and edges
    ⇒ Graph Analysis at Scale, parallel processing with MADlib on Greenplum
  • 27. Agenda
    1. Why Graph Analytics?
    2. What is Graph Analytics?
    3. Graph Analytics w/ MADlib
  • 28. Tools for Graph Analytics
    ● Graph Analytics at Scale with Open Source MADlib on Greenplum
    [ Figure: tools positioned by Data Size & Processing (Small Data vs. Big Data/ Parallel Processing) and by whether OSS or not (Commercial vs. Open Source). Labels shown: sna, igraph, ergm, network; NetworkX, graph-tool, SNAP, pygraphviz ...; Interactive visualization focused; Graph DB ]
  • 29. Graph Analytics at Scale
    Graph Analytics with MADlib on Greenplum
    ● Designed for very large graphs (billions of vertices/edges)
    ● No need to move and transform data for an external graph engine - one analytics database to deploy and manage
    ● Familiar SQL interface
    ● Combine context-based graph analytics with other content-based techniques
    Analytics Platform, GPDB
      ❏ Advanced Analytics In Database, Extended Language, GPText
      ❏ Scale Out
      ❏ MPP (Massively Parallel Processing) Architecture
    [ Figure labels: REGRESSION, CLASSIFICATION, CLUSTERING, GEOSPATIAL, GRAPH, TEXT, IMAGE ]
  • 30. Apache MADlib: Scalable, In-Database Machine Learning
    Apache MADlib: Big Data Machine Learning in SQL
    Open source, top level Apache project
    For PostgreSQL and Greenplum Database
    Powerful machine learning, graph, statistics and analytics for data scientists
    ● Open source https://github.com/apache/madlib
    ● Downloads and docs http://madlib.apache.org/
    ● Wiki https://cwiki.apache.org/confluence/display/MADLIB/
  • 31. Apache MADlib: Functions (April 2018)
    Data Types and Transformations
      Array and Matrix Operations
      Matrix Factorization
        • Low Rank
        • Singular Value Decomposition (SVD)
      Norms and Distance Functions
      Sparse Vectors
      Encoding Categorical Variables
      Path Functions
      Pivot
      Sessionize
      Stemming
    Graph
      All Pairs Shortest Path (APSP)
      Breadth-First Search
      Hyperlink-Induced Topic Search (HITS)
      Average Path Length
      Closeness Centrality
      Graph Diameter
      In-Out Degree
      PageRank and Personalized PageRank
      Single Source Shortest Path (SSSP)
      Weakly Connected Components
    Model Selection
      Cross Validation
      Prediction Metrics
      Train-Test Split
    Statistics
      Descriptive Statistics
        • Cardinality Estimators
        • Correlation and Covariance
        • Summary
      Inferential Statistics
        • Hypothesis Tests
      Probability Functions
    Supervised Learning
      Neural Networks
      Support Vector Machines (SVM)
      Conditional Random Field (CRF)
      Regression Models
        • Clustered Variance
        • Cox-Proportional Hazards Regression
        • Elastic Net Regularization
        • Generalized Linear Models
        • Linear Regression
        • Logistic Regression
        • Marginal Effects
        • Multinomial Regression
        • Naïve Bayes
        • Ordinal Regression
        • Robust Variance
      Tree Methods
        • Decision Tree
        • Random Forest
    Time Series Analysis
        • ARIMA
    Unsupervised Learning
      Association Rules (Apriori)
      Clustering (k-Means)
      Principal Component Analysis (PCA)
      Topic Modelling (Latent Dirichlet Allocation)
    Utility Functions
      Columns to Vector
      Conjugate Gradient
      Linear Solvers
        • Dense Linear Systems
        • Sparse Linear Systems
      Mini-Batching
      PMML Export
      Term Frequency for Text
      Vector to Columns
    Nearest Neighbors
        • k-Nearest Neighbors
    Sampling
      Balanced/ Random/ Stratified Sampling
  • 32. Apache MADlib: Graph Representation in MADlib
    [ Directed Graph (example): vertices 0, 1, 2, 3 with edge weights 5, 3, 8, 2, 1, 3 ]
    Vertex Table
      Vertex   Vertex Params
      0        ...
      1        ...
      2        ...
      3        ...
      ...      ...
    Edge Table
      Source Vertex   Dest Vertex   Edge Weight   Edge Params
      0               3             1.0           ...
      1               0             5.0           ...
      1               2             3.0           ...
      2               3             8.0           ...
      3               0             3.0           ...
      3               1             2.0           ...
      ...             ...           ...           ...
  • 33. example : PageRank in MADlib
    ● Create vertex and edge tables to represent the graph
    * http://madlib.apache.org/docs/latest/group__grp__pagerank.html

      DROP TABLE IF EXISTS vertex;
      CREATE TABLE vertex(
          id INTEGER
      );
      INSERT INTO vertex VALUES
          (0), (1), (2), (3), (4), (5), (6);

      DROP TABLE IF EXISTS edge;
      CREATE TABLE edge(
          src INTEGER,
          dest INTEGER,
          user_id INTEGER
      ) DISTRIBUTED BY (user_id);
      INSERT INTO edge VALUES
          -- user id 1
          (0, 1, 1), (0, 2, 1), (0, 4, 1), (1, 2, 1), (1, 3, 1), (2, 3, 1),
          (2, 5, 1), (2, 6, 1), (3, 0, 1), (4, 0, 1), (5, 6, 1), (6, 3, 1),
          -- user id 2
          (0, 1, 2), (0, 2, 2), (0, 4, 2), (1, 2, 2), (1, 3, 2), (2, 3, 2),
          (3, 0, 2), (4, 0, 2), (5, 6, 2), (6, 3, 2);
  • 34. example : PageRank in MADlib
    ● Compute the PageRank with all IDs

      DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
      SELECT madlib.pagerank(
            'vertex'              -- Vertex table
          , 'id'                  -- Vertex id column
          , 'edge'                -- Edge table
          , 'src=src, dest=dest'  -- Comma-delimited string of edge arguments
          , 'pagerank_out'        -- Output table of PageRank
          , NULL                  -- Damping factor (default 0.85)
      );
      SELECT * FROM pagerank_out ORDER BY pagerank DESC;
  • 35. example : PageRank in MADlib
    ● Network Diagram by Graphviz and PyGraphviz
    * PyGraphviz is a Python interface to the Graphviz graph layout and visualization package
    A vertex with a high PageRank is usually considered more "important" or more "influential" or more "relevant" than a vertex with a low PageRank.
    [ Figure: network diagrams for User ID 1 and User ID 2; nodes are labeled "ID (PageRank)" and the size of each node is proportional to its PageRank value ]
  • 36. example : PageRank in MADlib
    ● PageRank of vertices associated with each user by the grouping feature

      DROP TABLE IF EXISTS pagerank_gr_out, pagerank_gr_out_summary;
      SELECT madlib.pagerank(
            'vertex'              -- Vertex table
          , 'id'                  -- Vertex id column
          , 'edge'                -- Edge table
          , 'src=src, dest=dest'  -- Comma-delimited string of edge arguments
          , 'pagerank_gr_out'     -- Output table of PageRank
          , NULL                  -- Default damping factor (0.85)
          , NULL                  -- Default max iterations (100)
          , 0.00000001            -- Threshold
          , 'user_id'             -- Grouping column name
      );
      SELECT * FROM pagerank_gr_out ORDER BY user_id, pagerank DESC;

    [ Output: PageRank of user id 1, PageRank of user id 2 ]
  • 37. example : PageRank in MADlib
    ● Personalized PageRank of vertices {2, 4}, for Recommendations

      DROP TABLE IF EXISTS pagerank_pers_out, pagerank_pers_out_summary;
      SELECT madlib.pagerank(
            'vertex'              -- Vertex table
          , 'id'                  -- Vertex id column
          , 'edge'                -- Edge table
          , 'src=src, dest=dest'  -- Comma-delimited string of edge arguments
          , 'pagerank_pers_out'   -- Output table of PageRank
          , NULL                  -- Default damping factor (0.85)
          , NULL                  -- Default max iterations (100)
          , NULL                  -- Default threshold (1/(number of vertices * 1000))
          , NULL                  -- No grouping
          , '{2, 4}'              -- Personalization vertices
      );
      SELECT * FROM pagerank_pers_out ORDER BY pagerank DESC;
      SELECT * FROM pagerank_pers_out_summary;

    * Personalized PageRank = d*Ax + (1-d)*E, where ‘E’ is the teleport vector concentrated on the list of vertices for personalized PageRank, and ‘d’ is the damping factor
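To make the personalization idea concrete, here is a small Python sketch of personalized PageRank on the user id 1 edges from the edge table. It is an illustration under stated assumptions, not MADlib's implementation: the damping factor d multiplies the link term (as on the earlier PageRank slides), the teleport mass (1 - d) is split evenly over the personalization vertices, and the starting vector, tolerance, and iteration cap are arbitrary choices. Its numbers should therefore not be expected to match the output of `madlib.pagerank` exactly.

```python
# Edges for user id 1, copied from the edge table earlier in the deck (src, dest).
EDGES = [
    (0, 1), (0, 2), (0, 4), (1, 2), (1, 3), (2, 3),
    (2, 5), (2, 6), (3, 0), (4, 0), (5, 6), (6, 3),
]


def personalized_pagerank(num_vertices, edges, seeds, d=0.85,
                          tol=1e-12, max_iter=1000):
    """Iterate x = d * A x + (1 - d) * E, with E uniform over `seeds`.

    Assumes every vertex has at least one outgoing edge, which holds here.
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    teleport = [1 / len(seeds) if v in seeds else 0.0
                for v in range(num_vertices)]
    ranks = teleport[:]  # start from the personalization vector
    for _ in range(max_iter):
        new_ranks = [(1 - d) * t for t in teleport]
        for src, dest in edges:
            new_ranks[dest] += d * ranks[src] / out_degree[src]
        if max(abs(a - b) for a, b in zip(new_ranks, ranks)) < tol:
            return new_ranks
        ranks = new_ranks
    return ranks


ppr = personalized_pagerank(7, EDGES, seeds={2, 4})
for vertex, score in sorted(enumerate(ppr), key=lambda item: -item[1]):
    print(vertex, round(score, 4))
```

Random jumps land only on vertices 2 and 4, so scores measure proximity to those two vertices rather than global importance, which is what makes the result usable for recommendations.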
  • 38. example : PageRank in MADlib
    PageRank Performance on Greenplum w/ MADlib
    Greenplum cluster:
    ● 1 master
    ● 4 segment hosts with 6 segments per host
    Normal random graphs with a mean degree of 50 edges per vertex (i.e., 5B edges in the largest case)
    [ Figure: performance plot, * Note: log-log scale; one axis has ticks 1K, 10K, 100K, 1M, 10M, 100M and the other has ticks 1s, 100s, 10K s, 1M s; the largest case is annotated "5B edges" ]
  • 39. In Summary
    ● Capture the Relationship in Networks using Graph Analytics
      → Community, Structure, Path, Centrality
      → Combine context-based graph analytics with other content-based insights
    ● Graph analytics at SCALE with Open Source Software
      → Apache MADlib on Greenplum, massively parallel processing
  • 40. One more thing... GREENPLUM SUMMIT at PostgresConf 2019 by Pivotal
  • 41. Transforming How The World Builds Software