Dynamic Community Detection for Large-scale e-Commerce data
with Spark Streaming and GraphX
Ming Huang
Meng Zhang, Bin Wei
GuangYuan Huang, Jinkui Shi
Community Detection
Scenarios
•  VIP Customer
•  Reputation Escalator
•  Fraud Seller
•  ………
Algorithms
•  LPA (Label Propagation Algorithm)
•  GN (Girvan-Newman)
•  Fast Unfolding (Louvain method)
•  …….
How to make it Dynamic?
Static Communities Streaming Data
Make sophisticated, real-time decisions
Definition & Solution
Dynamic Community Detection
1.  Decide the new node's community
2.  Update the graph's physical topology
3.  Update the affected communities and modularity
Spark Streaming + GraphX → Streaming Graph
REAL-TIME
Streaming Graph
[Diagram: each batch of the Edges DStream becomes a graph in the Graph DStream, and each batch graph is merged into the Stock Graph in turn.]
Models and Algorithms
Quick Overview of Fast Unfolding
Modularity:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
$$Q = \sum_{i}^{c} Q_i = \sum_{i}^{c} \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 \right]$$
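The per-community form of the modularity can be checked on a toy graph. The following is a minimal plain-Scala sketch (no Spark), assuming an undirected, unweighted edge list in which each edge appears once, and assuming that the "in" term counts every intra-community edge in both directions. The object name, method name and vertex ids are illustrative and do not come from the slides.

```scala
object ModularitySketch {
  // Undirected, unweighted edge; each edge is listed once (assumption).
  type Edge = (Int, Int)

  /** Q = sum over communities of [ in/2m - (tot/2m)^2 ].
    * `in`  counts each intra-community edge twice (once per direction),
    * `tot` is the sum of the degrees of the community's vertices. */
  def modularity(edges: Seq[Edge], community: Map[Int, Int]): Double = {
    val twoM = 2.0 * edges.size

    // Degree of every vertex.
    val degree: Map[Int, Int] =
      edges.flatMap { case (u, v) => Seq(u, v) }
        .groupBy(identity)
        .map { case (v, xs) => v -> xs.size }

    // tot: total degree per community.
    val tot: Map[Int, Double] =
      degree.toSeq
        .groupBy { case (v, _) => community(v) }
        .map { case (c, vs) => c -> vs.map(_._2).sum.toDouble }

    // in: twice the number of edges with both endpoints in the community.
    val in: Map[Int, Double] =
      edges.filter { case (u, v) => community(u) == community(v) }
        .groupBy { case (u, _) => community(u) }
        .map { case (c, es) => c -> 2.0 * es.size }

    tot.map { case (c, t) =>
      in.getOrElse(c, 0.0) / twoM - math.pow(t / twoM, 2)
    }.sum
  }
}
```

For two triangles joined by a single edge, with each triangle as its own community, both communities have in = 6 and tot = 7 with 2m = 14, so the sum is 2 × (6/14 − 1/4) = 5/14. Putting all vertices into one community gives 0.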
Incremental Algorithms
•  JV (Streaming with RDD): Join & Vote
•  UMG (Streaming with Graph): Union & Modularity Greedy
JV
[Diagram: incEdgeRDD (new node D linked to A, B and C) is joined with stockCommunityRDD (A in C1, B in C2, C in C2), giving D the candidates C1, C2, C2; the vote assigns D to C2.]
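The join-then-vote idea can be sketched with plain Scala collections standing in for the two RDDs. This is an illustration under stated assumptions, not the authors' implementation: the names `joinAndVote`, `incEdges` and `stockCommunity` are hypothetical, each incremental edge is assumed to be written as (existing vertex, new vertex), and ties are broken by taking the lexicographically largest community id.

```scala
object JoinVoteSketch {
  /** For each new vertex, join its incident edges with the stock
    * vertex -> community table and keep the most frequent community. */
  def joinAndVote(incEdges: Seq[(String, String)],        // (existing vertex, new vertex)
                  stockCommunity: Map[String, String]     // vertex -> community
                 ): Map[String, String] =
    incEdges
      // Join: look up the community of the already-known endpoint.
      .flatMap { case (known, fresh) => stockCommunity.get(known).map(c => (fresh, c)) }
      .groupBy(_._1)
      .map { case (fresh, votes) =>
        // Vote: most frequent community wins; ties go to the larger id (assumption).
        val winner = votes.map(_._2)
          .groupBy(identity)
          .maxBy { case (c, xs) => (xs.size, c) }
          ._1
        fresh -> winner
      }
}
```

With edges A-D, B-D, C-D and the stock table A → C1, B → C2, C → C2, node D collects the votes C1, C2, C2 and is assigned to C2.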
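UMG 1 - Union
[Diagram: the stock graph (nodes A, B, C in communities C1, C2, C3) is unioned with the incremental graph; the new node D must then be assigned to a neighboring community (C1 or C2?).]
   newGraph = stockGraph.union(incGraph)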
UMG 2 - findBestCommunity
[Diagram: new node D next to communities C1, C2 and C3; a gain is computed for each candidate community.]
gain1 = G(node(d), community(1))
gain2 = G(node(d), community(2))
val incVertexWithNeighbors = newGraph.mapReduceTriplets[Array[VertexData]](
  collectNeighborFunc, _ ++ _, Some((incGraph.vertices, EdgeDirection.Either)))
val idCommunity = incVertexWithNeighbors.map {
  case (vid, neighbors) => (vid, findBestCommunity(neighbors))
}.cache()
$$C_i = \arg\max_{C_j} G(\mathrm{node}_i, C_j)$$
$$\Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]$$
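The gain G(node, community) used to choose the best community can be written out directly from the ΔQ expression of Fast Unfolding. The sketch below is plain Scala and makes several assumptions: the case class `CommunityStat`, the parameter names, and the way neighbor information is passed in (a map from community id to the node's link weight into that community) are all hypothetical, and the real `findBestCommunity` in the slides takes a neighbor array whose layout is not shown.

```scala
object BestCommunitySketch {
  // Per-community summary holding the in and tot terms of the formula.
  final case class CommunityStat(in: Double, tot: Double)

  /** Modularity gain of moving an isolated node of degree `ki` into a community,
    * where `kIn` is the weight of the node's links into that community. */
  def gain(c: CommunityStat, ki: Double, kIn: Double, twoM: Double): Double = {
    val after  = (c.in + kIn) / twoM - math.pow((c.tot + ki) / twoM, 2)
    val before = c.in / twoM - math.pow(c.tot / twoM, 2) - math.pow(ki / twoM, 2)
    after - before
  }

  /** Pick the neighboring community with the largest gain. */
  def findBestCommunity(ki: Double,
                        linksInto: Map[String, Double],      // community id -> kIn
                        stats: Map[String, CommunityStat],
                        twoM: Double): String =
    linksInto.maxBy { case (cid, kIn) => gain(stats(cid), ki, kIn, twoM) }._1
}
```

Expanding the squares shows that the in terms cancel and the gain reduces to k_in/2m − 2·tot·k_i/(2m)², so in this form only the candidate community's tot and the node's link weight into it are needed to rank candidates.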
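UMG 3 - updateCommunities
[Diagram: nodes A, B, C and D grouped into communities C1 and C2, with per-community modularity (Q1, Q2).]
val newCommunityRdd = idCommunity.updateCommunities(communityRdd)
val newModularity = newCommunityRdd.map(community => community.modularity).reduce(_ + _)
$$Q = \sum_{i}^{c} Q_i = \sum_{i}^{c} \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 \right]$$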
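Because total modularity is a plain sum of per-community terms, it can be recomputed with a map followed by a reduce once the changed community summaries have been written back. A minimal plain-Scala sketch of that step follows; the `Community` case class, the `updateCommunities` signature and the numbers in the usage note are assumptions made for illustration, with in counting each internal edge twice.

```scala
object ModularityUpdateSketch {
  final case class Community(in: Double, tot: Double) {
    // Q_i = in/2m - (tot/2m)^2
    def modularity(twoM: Double): Double = in / twoM - math.pow(tot / twoM, 2)
  }

  /** Overwrite the summaries of the communities touched by the new batch. */
  def updateCommunities(stock: Map[String, Community],
                        changed: Map[String, Community]): Map[String, Community] =
    stock ++ changed

  /** Sum of per-community modularity. 2m grows with every batch,
    * so every community term is re-evaluated against the new 2m. */
  def totalModularity(cs: Map[String, Community], twoM: Double): Double =
    cs.values.map(_.modularity(twoM)).reduce(_ + _)
}
```

Take two communities with (in, tot) = (6, 7) each and 2m = 14, so Q = 5/14. If a new node joins C2 through two new edges, C2 becomes (10, 11) and 2m becomes 18, which gives Q = 6/18 − (7/18)² + 10/18 − (11/18)² = 118/324.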
Flow Example Code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One streaming batch every 60 seconds.
val conf = new SparkConf().setMaster(……).setAppName(……)
val ssc = new StreamingContext(conf, Seconds(60))

// Seed the model with the full (stock) graph.
val totalGraph = initGraph(totalEdgesRdd)
val streamingFU = new StreamingFU().setTotalGraph(totalGraph)

val onlineDataFlow = getDataFlow(ssc.sparkContext)
val edgeStreamRDD = ssc.queueStream(onlineDataFlow, true)

// For each batch: build the incremental graph, update the model, persist the results.
edgeStreamRDD.foreachRDD {
  incEdgeRdd => {
    val incGraph = buildIncGraph(incEdgeRdd)
    val (communityInfoRDD, modularity) = streamingFU.trainOn(incGraph)
    outputToHBase(communityInfoRDD)
    outputToHBase(modularity)
  }
}

ssc.start()
ssc.awaitTermination()
Experiment Results
Autonomous Systems Graphs
Stanford Large Network Dataset Collection(as-733)
https://snap.stanford.edu/data/
Modularity Trend – AS
Online Trading Graph
Buyer Seller
C-C
Modularity Trend – OT
Streaming Graph → Better Result
Key Points
•  Operator
   •  Merge small graph into large graph
•  Model
   •  Local changes
   •  Index or summary
•  Algorithm
   •  Delicate formula
   •  Commutative law & Associative law
   •  Parallelly & Incrementally
Complex GraphX Operators
Graph Union Operator
[Diagram: GRAPH(G) ∪ GRAPH(H) = GRAPH(G ∪ H). G has vertices A to F, H has vertices E to H, and the union has vertices A to H.]
https://issues.apache.org/jira/browse/SPARK-7894
[GraphX] Complex Operators between Graphs: Union
https://github.com/apache/spark/pull/6685
   newGraph = stockGraph.union(incGraph)
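What a graph union has to do can be shown on small in-memory maps. The sketch below is plain Scala and does not reproduce the signature proposed in the linked issue or pull request; `SimpleGraph`, `union`, `mergeV` and `mergeE` are hypothetical names. It assumes vertices are keyed by id and edges by (source, destination), and that attributes present in both graphs are combined by caller-supplied merge functions.

```scala
object GraphUnionSketch {
  // Toy graph: vertex id -> attribute, (src, dst) -> attribute.
  final case class SimpleGraph[VD, ED](vertices: Map[Long, VD],
                                       edges: Map[(Long, Long), ED])

  /** Union of two graphs; vertices or edges present in both are
    * combined with the supplied merge functions. */
  def union[VD, ED](g: SimpleGraph[VD, ED], h: SimpleGraph[VD, ED])
                   (mergeV: (VD, VD) => VD, mergeE: (ED, ED) => ED): SimpleGraph[VD, ED] = {
    def mergeMaps[K, V](a: Map[K, V], b: Map[K, V], f: (V, V) => V): Map[K, V] =
      b.foldLeft(a) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
      }
    SimpleGraph(
      mergeMaps(g.vertices, h.vertices, mergeV),
      mergeMaps(g.edges, h.edges, mergeE))
  }
}
```

If the merge functions are commutative and associative, the order in which incremental batches are folded into the stock graph does not change the result, which is what makes a merge of this kind usable in a streaming loop.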
Complex GraphX Operators
•  Union of Graphs ( G ∪ H )
•  Intersection of Graphs ( G ∩ H )
•  Graph Join
•  Difference of Graphs ( G – H )
•  Graph Complement
•  Line Graph ( L(G) )
Issues:
Complex Operators between Graphs
https://issues.apache.org/jira/browse/SPARK-7893
Streaming Optimization
Monitoring and Correction
[Diagram labels: Ω, Data Loading, Streaming-FU, Modularity Threshold Checking, FastUnfolding; schedule labels: [Streaming], [Hourly Monitoring], [Daily Running]]

commRDDTable
communityID | communityInfo
community1  | (in1, tot1, degree1, modularity1)
……          | ……

modularityTable
mTime      | mValue
timestamp1 | totalModularity1
……         | ……
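The slides do not state the exact rule used for the modularity threshold check, so the following is only one plausible shape for it, written in plain Scala. It assumes rows of (mTime, mValue) as in the modularity table, a baseline modularity taken from the last full FastUnfolding run, and a relative-drop threshold; `ModularityRow`, `needsCorrection`, `baseline` and `maxRelativeDrop` are hypothetical names.

```scala
object ModularityMonitorSketch {
  // One row of the modularity table: (mTime, mValue).
  final case class ModularityRow(mTime: Long, mValue: Double)

  /** True when the most recent streaming modularity has fallen more than
    * `maxRelativeDrop` below the baseline from the last full run (assumed rule). */
  def needsCorrection(rows: Seq[ModularityRow],
                      baseline: Double,
                      maxRelativeDrop: Double): Boolean =
    rows.sortBy(_.mTime).lastOption.exists { latest =>
      baseline > 0 && (baseline - latest.mValue) / baseline > maxRelativeDrop
    }
}
```

A periodic job could call `needsCorrection` on the latest rows and, when it returns true, trigger a full FastUnfolding run instead of waiting for the next scheduled rebuild.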
Streaming Resource Allocation
•  Driver-Memory: 20G
•  Executors: 100
•  Core: 2
•  Executor-Memory: 20G
Not Enough for Peak Period!
Streaming Buffer
[Diagram labels: TT Receiver, Kafka Stream, HDFS Stream, Split, Join, Streaming-FU, StreamingFUModel, Streaming-Buffer, Buffer Writer, Kafka, HDFS; buffer roles: Modularity Correction Buffer, Resource Peak Buffer]
Conclusion
•  Streaming Graph
•  Complex Operators will help
•  Daily Rebuild & Threshold Check
•  Costs more memory and time
•  Open Question: checkpoint with Streaming or Graph?
Acknowledgements
1.  Limits of community detection
    http://www.slideshare.net/vtraag/comm-detect
2.  Community Detection
    http://www.traag.net/projects/community-detection/
3.  Social Network Analysis
    http://lorenzopaoliani.info/topics/
4.  Community detection in complex networks using Extremal Optimization
    http://arxiv.org/pdf/cond-mat/0501368.pdf
Q & A
Agenda
•  Dynamic Community Detection
•  Streaming Graph
•  Models and Algorithms
•  Complex GraphX Operators
•  Streaming Optimization
•  Conclusion
Static vs. Dynamic
Static Model Dynamic Model

Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX-(Ming Huang, Taobao)
