Analyzing Complex Networks
Using Open Source Software
Open Data Science Conference (@ODSC)
Ken Cherven
@kc2519
visual-baseball.com
visualidity.com
Boston | May 20-22, 2016
A Brief Outline
• Network Graph Analysis overview
• Tools
• Case Studies
• Conclusions
Network Graph Analysis, also known as
Social Network Analysis (SNA), is the study
of connections (links) between actors
(nodes) within a network
[Figure: example network with five labeled nodes]
Network Graph Analysis has many use
cases, ranging from the familiar SNA
(Facebook, Twitter networks) to the
more specialized visual and statistical
investigation of political, criminal, or
terrorist networks
The use cases for Network Graph
Analysis are almost endless – any
dataset where relationships can be
mapped can be analyzed both
statistically and visually; all we need are
nodes and links
We have two primary approaches to
assess patterns in a network:
• Statistical measures are used to
understand the underlying structure and
relationships between nodes
• Visual assessment allows us to leverage
size, color, spacing, and structure to
understand patterns at a network level
Statistical measures are employed to
understand structural patterns within the
network:
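• Degree (number of connections per node)
• Centrality (influence of a node within the network)
• Density (share of possible connections that actually exist)
• Homophily (tendency of similar nodes to connect,
forming common groupings)
• Diameter (longest shortest path between any two nodes)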
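As a rough illustration of the measures listed above, the sketch below computes degree, density, and diameter for a small made-up network using only the Python standard library. The graph, the node names, and the helper function names are assumptions for illustration and are not taken from the talk; the diameter helper also assumes the network is connected.

```python
from collections import deque

# Hypothetical toy network: an undirected graph stored as an adjacency mapping.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

def degrees(g):
    """Number of connections for each node."""
    return {node: len(neighbors) for node, neighbors in g.items()}

def density(g):
    """Share of possible undirected links that actually exist."""
    n = len(g)
    edge_count = sum(len(neighbors) for neighbors in g.values()) / 2
    return edge_count / (n * (n - 1) / 2)

def distances_from(g, start):
    """Shortest-path length (in hops) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in g[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def diameter(g):
    """Longest shortest path between any two nodes (assumes a connected graph)."""
    return max(max(distances_from(g, node).values()) for node in g)

print(degrees(graph), density(graph), diameter(graph))
```

Tools such as Gephi report these statistics directly; the sketch is only meant to show what the numbers measure.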
Visual assessment allows us to use our
visual sense to interpret network
patterns:
• Node location to represent related nodes
• Node sizes to represent degrees
• Node coloring to represent common
groupings (clusters, categories)
• Edge weights that show the strength of
connections between nodes
Some open source network graph tools:
• Gephi (http://gephi.org)
• Cytoscape (http://cytoscape.org)
• GraphViz (http://graphviz.org)
• Sigma.js (http://sigmajs.org)
• NodeXL (http://nodexl.codeplex.com/)
• Pajek (http://mrvar.fdv.uni-lj.si/pajek/)
• Tulip (http://tulip.labri.fr/TulipDrupal/)
We’ll use Gephi and Sigma.js for the
following examples:
• Miles Davis album network (tripartite
network)
• Boston Red Sox player network
• GDELT event networks
Miles Davis Album Network
The goal of the Miles Davis
network is to understand the multiple
phases within his long and varied
career, and to see the shifting
patterns in his musical partnerships
and styles.
http://visual-baseball.com/gephi/jazz/miles_davis/#
Miles Davis Network Topology
[Figure labels: Miles Davis; Albums (pink); Musicians (colored by instrument)]
Five Album Clusters to Investigate
[Figure: album network with five clusters, numbered 1 to 5]
What do these clusters represent?
Five Album Clusters Revealed
• 1950s small groups
• Early 60s big bands
• Mid-60s small group
• 1970s fusion, electric sounds
• Late career: 1980s experimentation, eclectic
instrumentation
A quick exploration of the network
reveals information about time periods,
the number of musicians, and the types of
instruments involved.
With just a few minutes of traversing the
network, we gain a greater
understanding of Miles Davis’ musical
career.
Red Sox Historical Player Network
The goal for the Red Sox player
network is to understand connections
between players across eras, and to
understand influence and groupings
within the network, as defined by
degrees and other centrality
measures
http://visual-baseball.com/gephi/teams/redsox_network/
Red Sox Network Topology
Player nodes are sized and
colored based on number
of years with team and
cluster assignment
Players are positioned based
on common years with team
Links are built using the number
of seasons two players were
on the team roster together
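The link-building rule described above can be sketched as follows: every pair of players who share a roster in a season gets a link, and the link weight counts the seasons they shared. The talk's own edges were built with SQL against the Lahman database; the Python version, the roster data, and the player names below are hypothetical stand-ins.

```python
from collections import Counter
from itertools import combinations

# Hypothetical roster data: season -> players on the roster that year.
rosters = {
    2001: ["player_a", "player_b", "player_c"],
    2002: ["player_a", "player_b"],
    2003: ["player_a", "player_b", "player_d"],
}

def shared_season_edges(rosters_by_season):
    """Count, for each pair of players, the seasons they shared a roster."""
    weights = Counter()
    for players in rosters_by_season.values():
        # Sorting gives each pair a single canonical (a, b) ordering.
        for a, b in combinations(sorted(set(players)), 2):
            weights[(a, b)] += 1
    return weights

edges = shared_season_edges(rosters)
for (a, b), seasons in sorted(edges.items()):
    print(a, b, seasons)
```

Each key in `edges` becomes one undirected link and the count becomes its weight, which a layout tool can then use to pull long-time teammates closer together.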
Individual Network Footprints
• Ted Williams: 19 seasons, 269 degrees, eccentricity 6,
betweenness 126,355, closeness 3.30
• Carl Yastrzemski: 23 seasons, 283 degrees, eccentricity 5,
betweenness 596,003, closeness 2.64
• Jason Varitek: 15 seasons, 379 degrees, eccentricity 7,
betweenness 120,696, closeness 3.36
A quick look at three prominent players
shows some readily observable
differences in their centrality measures:
• Despite playing several fewer seasons
than either Williams or Yastrzemski,
Varitek has the most connections (379
degrees); but Yastrzemski, with the highest
betweenness (596,003) and the lowest
closeness value (2.64), could get you to
more players faster by being very central
to the network structure
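To make the distance-based measures more concrete, here is a minimal sketch that computes eccentricity and average shortest-path distance for each node of a toy network. It assumes the closeness values reported for the three players are average distances to all other players, so that a lower value means a more central player; that reading, the toy graph, and the function names are assumptions rather than details confirmed by the talk.

```python
from collections import deque

# Hypothetical toy network: a simple chain a - b - c - d - e.
chain = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

def distances_from(g, start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in g[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def eccentricity(g, node):
    """Distance from node to the node farthest away from it."""
    return max(distances_from(g, node).values())

def average_distance(g, node):
    """Mean shortest-path distance from node to all other nodes."""
    dist = distances_from(g, node)
    return sum(dist.values()) / (len(dist) - 1)

for name in sorted(chain):
    print(name, eccentricity(chain, name), average_distance(chain, name))
```

In this chain the middle node has the smallest eccentricity and the smallest average distance, which mirrors the pattern in the player figures: the most central player combines the lowest values on both measures.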
GDELT Network Analysis
GDELT data offers a great many
opportunities for viewing
network data based on published
accounts of news events around the
world. Our exploration focuses on US
Government threats reported
between March 1 and April 30, 2016
GDELT Network Topology (Geo Layout)
Using Geo Layout
Connections are between Actor1
and Actor2 within a specific event
instance; Actor1 is often the
Protagonist, Actor2 the Target
Nodes are positioned by lat/lon coordinates;
most are concentrated in the Northeast US
Node and edge colors are based on the
GDELT GoldsteinScale variable; darker colors
indicate higher destabilization potential
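The Actor1-to-Actor2 structure described above could be turned into weighted, directed edges roughly as follows. The talk refined its data with SQL; this Python sketch is an alternative illustration only. The field names Actor1Name, Actor2Name, and GoldsteinScale follow the GDELT event columns, but the rows, the actor labels, and the aggregation choice (event count plus mean GoldsteinScale per pair) are assumptions.

```python
from collections import defaultdict

# Hypothetical event rows; values are invented for illustration.
events = [
    {"Actor1Name": "ACTOR_A", "Actor2Name": "ACTOR_B", "GoldsteinScale": -5.0},
    {"Actor1Name": "ACTOR_A", "Actor2Name": "ACTOR_B", "GoldsteinScale": -7.0},
    {"Actor1Name": "ACTOR_C", "Actor2Name": "ACTOR_A", "GoldsteinScale": 2.0},
]

def build_edges(rows):
    """Group events by (Actor1, Actor2) into directed, weighted edges."""
    scores = defaultdict(list)
    for row in rows:
        pair = (row["Actor1Name"], row["Actor2Name"])
        scores[pair].append(row["GoldsteinScale"])
    return {
        pair: {"weight": len(vals), "mean_goldstein": sum(vals) / len(vals)}
        for pair, vals in scores.items()
    }

gdelt_edges = build_edges(events)
for (source, target), attrs in gdelt_edges.items():
    print(source, "->", target, attrs)
```

The mean score per edge is one possible input for the color scale, with the event count serving as the edge weight.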
Exploring the Graph Geographically
Using Geo Layout
GDELT Network Topology (Dual Circle)
Using Dual Circle Layout
Prominent nodes are positioned in the inner
circle, based on the number of articles on
cumulative events (speeches, press
conferences, negotiations, etc.)
Secondary nodes are positioned around the
outer circle; these may be either primary or
secondary actors in an event
Node colors are again based on the GDELT
GoldsteinScale variable
Exploring Nodes Using Sigma.js
Using Dual Circle Layout
A few minutes of network exploration
reveals topic patterns based on news
reporting, and allows us to understand
which actors are directing actions
against others, and what the tone of
those actions is. Tracking these measures
over time will enable us to spot trends,
both positive and negative.
Conclusions
• Network graph analysis is a powerful tool for
visually and statistically assessing complex
networks
• Network graphs are proliferating, due to the
availability of multiple open source tools and
increasing amounts of open data
• Network graph analysis can be used to tell
powerful stories wherever connected data is
present
Thanks –
and happy networking!
Backup
Miles Davis network specs:
• Data sourced from Wikipedia
• Nodes and edges created in Excel
• Graph created in Gephi using the Yifan Hu
Proportional algorithm
• Exported to Sigma.js (JSON format)
• 348 nodes, 596 edges
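For readers unfamiliar with the export step, the sketch below writes a small graph as JSON with a "nodes" list and an "edges" list, a layout commonly consumed by Sigma.js. In the talk this file came from Gephi's export rather than hand-written code, so the function name, the node attributes chosen, and the sample nodes are all assumptions for illustration.

```python
import json

# Hypothetical nodes with layout positions, as a layout tool might assign them.
sample_nodes = [
    {"id": "n1", "label": "Album 1", "x": 0.0, "y": 0.0, "size": 3},
    {"id": "n2", "label": "Musician 1", "x": 1.0, "y": 0.5, "size": 1},
]
# Hypothetical links, given as (source id, target id) pairs.
sample_edges = [("n1", "n2")]

def to_sigma_json(nodes, edges):
    """Serialize nodes and edges into a nodes/edges JSON document."""
    payload = {
        "nodes": [
            {"id": n["id"], "label": n["label"],
             "x": n["x"], "y": n["y"], "size": n["size"]}
            for n in nodes
        ],
        "edges": [
            {"id": f"e{i}", "source": source, "target": target}
            for i, (source, target) in enumerate(edges)
        ],
    }
    return json.dumps(payload, indent=2)

print(to_sigma_json(sample_nodes, sample_edges))
```

Each edge refers to its endpoints by node id, so node ids must be unique within the file.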
Red Sox Player Network specs:
• Data sourced from Lahman Database at
seanlahman.com
• Nodes and edges created using SQL code in
Toad for MySQL
• Graphs created in Gephi using the ARF layout
algorithm
• JSON file exported to Sigma.js
• 1668 nodes, 51,223 edges
GDELT classifications:
• Type refers to groupings such as
Government, Media, Education, and many
more
• Event codes reference the type of event –
riots, protests, sanctions, and so on
• The GoldsteinScale runs from -10 to 10 in
describing the relative destabilizing potential
of the event
GDELT Network specs:
• Data sourced from the GDELT event database
at gdeltproject.org (3/1 to 4/30/16)
• Nodes and edges refined using SQL code in
Toad for MySQL
• Graphs created in Gephi using the Geo Layout
and Dual Circle algorithms
• GEXF files exported for use with Sigma.js
• 414 nodes, 11,975 edges

Editor's Notes

  • #14: A tripartite network