Social network analysis with Hadoop

                                        Jake Hofman

                                        Yahoo! Research


                                      October 2, 2009




Jake Hofman   (Yahoo! Research)   Social network analysis with Hadoop   October 2, 2009
Social networks

  • Rapid increase in amount and variety of social network data




  • Valuable information for products (recommendations, advertising,
     etc.) and research (structure/dynamics, diffusion, etc.)


Social networks




  Goal: to enable analysis of large-scale social network data with readily
                       available software/hardware

1970s ∼ 10^1 nodes

  [Figure: pages 453 and 456 of Zachary (1977), "Conflict and Fission in Small Groups", Journal of Anthropological Research, including Figure 1, "Social Network Model of Relationships in the Karate Club"]

  • Few direct observations; highly detailed info on nodes and edges
  • E.g. karate club (Zachary, 1977)

1990s ∼ 10^4 nodes

  • Larger, indirect samples; relatively few details on nodes and edges
  • E.g. APS co-authorship network (http://bit.ly/aps08jmh)

Present ∼ 10^8 nodes +




  • Very large, dynamic samples; many details in node and edge metadata
  • E.g. Mail, Messenger, Facebook, Twitter, etc.



Scale

 • Example numbers:
     • ∼ 10^7 nodes
     • ∼ 10^2 edges/node (degree)
     • no node/edge data
     • static
     • ∼8GB

 [Figure: User 1 linked to User 2]

    Simple, static networks push memory limit for commodity machines
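
As a rough check on the ∼8GB figure, the sketch below multiplies out the example numbers, reading the node count as 10^7 and the degree as 10^2. The cost of 8 bytes per edge (for example, two 4-byte node IDs) is an assumption made here for illustration; the slide states only the totals.

```python
# Back-of-the-envelope size of a simple, static edge list.
# ASSUMPTION: each edge costs 8 bytes (e.g. two 4-byte integer node IDs);
# the slide gives only the node count, the degree, and the ~8GB total.
NUM_NODES = 10**7          # ~10^7 nodes (from the slide)
EDGES_PER_NODE = 10**2     # ~10^2 edges/node (from the slide)
BYTES_PER_EDGE = 8         # assumed storage cost per edge


def edge_list_bytes(num_nodes, edges_per_node, bytes_per_edge):
    """Return the raw storage needed for every (source, target) pair."""
    return num_nodes * edges_per_node * bytes_per_edge


total = edge_list_bytes(NUM_NODES, EDGES_PER_NODE, BYTES_PER_EDGE)
print(f"{total / 10**9:.0f} GB")
```

Under that assumption the edge list alone comes to about 8 GB, before any index or per-node overhead is counted.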




Scale

 • Example numbers:
     • ∼ 10^7 nodes
     • ∼ 10^2 edges/node (degree)
     • node/edge metadata
     • dynamic
     • ∼100GB/day

 [Figure: User 1 and User 2, each with profile and history metadata, linked by a message with header and content]

     Dynamic, data-rich social networks exceed memory limits; require
                           considerable storage




Distributed network analysis




 MapReduce convenient for
 parallelizing individual
 node/edge-level calculations
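
A node-level calculation such as degree counting fits the map/reduce pattern directly. The sketch below is not taken from the package described in these slides: the function names and the toy edge list are hypothetical, and the shuffle/sort that Hadoop performs between the two phases is simulated with a local `sorted` call.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (node, 1) for both endpoints of each undirected edge."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        source, target = fields
        yield source, 1
        yield target, 1


def reducer(pairs):
    """Sum the counts for each node; pairs must arrive sorted by key."""
    for node, group in groupby(pairs, key=itemgetter(0)):
        yield node, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce on one machine."""
    return dict(reducer(sorted(mapper(lines))))


edges = ["a b", "a c", "b c", "c d"]  # hypothetical toy edge list
degrees = run_local(edges)
print(degrees)
```

Each edge is processed independently in the map phase, and each node's total is computed independently in the reduce phase, which is why this kind of calculation parallelizes easily.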




Distributed network analysis




 Higher-order calculations more
 difficult when network exceeds
 memory constraints, but can be
 adapted to MapReduce
 framework
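
One common way to adapt a global calculation is to run repeated map/reduce rounds in which each node sees only its own neighbors. The sketch below finds connected components by propagating the smallest node ID; it is an illustrative technique chosen here, not necessarily the algorithm used in the package, and the toy graph is hypothetical. In a real deployment each loop iteration would be a separate MapReduce job.

```python
from collections import defaultdict


def map_phase(adjacency, labels):
    """Each node sends its current label to itself and to every neighbor."""
    for node, neighbors in adjacency.items():
        yield node, labels[node]
        for neighbor in neighbors:
            yield neighbor, labels[node]


def reduce_phase(pairs):
    """Each node keeps the smallest label it received."""
    received = defaultdict(list)
    for node, label in pairs:
        received[node].append(label)
    return {node: min(candidates) for node, candidates in received.items()}


def connected_components(adjacency):
    """Repeat map/reduce rounds until no label changes."""
    labels = {node: node for node in adjacency}
    while True:
        new_labels = reduce_phase(map_phase(adjacency, labels))
        if new_labels == labels:
            return labels
        labels = new_labels


# Hypothetical undirected toy graph: two components, {1, 2, 3} and {4, 5}.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = connected_components(graph)
print(components)
```

The number of rounds grows with the diameter of the largest component, which is one reason such calculations are harder to distribute than node-level counts.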




Package details

     • Network creation/manipulation
          • Logs → edges
          • Edge list ↔ adjacency list
          • Directed ↔ undirected
          • Edge thresholds
     • First-order descriptive statistics
          • Number of nodes
          • Number of edges
          • Node degrees
     • Higher-order node-level descriptive statistics
          • Clustering coefficient
          • Implicit degree
          • ...
     • Global calculations
          • Pairwise connectivity
          • Connected components
          • Minimum spanning tree
          • Breadth-first search
          • PageRank
          • Community detection

                     Currently implemented in Streaming with Python
                     Algorithms exist/developed for additional features
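
Two of the manipulation steps listed on this slide, edge list ↔ adjacency list and directed → undirected, can be sketched in a few lines. The function names and toy data below are hypothetical and are not the package's interface; the grouping step mirrors what a reducer would do once edges arrive sorted by source node.

```python
from itertools import groupby
from operator import itemgetter


def edges_to_adjacency(edges):
    """Group (source, target) pairs by source, as a reducer would after the sort."""
    ordered = sorted(edges)
    return {
        source: [target for _, target in group]
        for source, group in groupby(ordered, key=itemgetter(0))
    }


def adjacency_to_edges(adjacency):
    """Flatten each adjacency row back into (source, target) pairs."""
    return [(source, target)
            for source, targets in adjacency.items()
            for target in targets]


def to_undirected(edges):
    """Add the reverse of every directed edge and drop duplicates."""
    return sorted({(a, b) for a, b in edges} | {(b, a) for a, b in edges})


directed = [("a", "b"), ("a", "c"), ("b", "c")]  # hypothetical toy edges
adjacency = edges_to_adjacency(directed)
print(adjacency)
print(to_undirected(directed))
```

Note that a node with no outgoing edges (here "c") does not appear as a key in the adjacency list unless the graph is first made undirected.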


Application: Twitter




  • Distributed crawl of Twitter social network + public messages
     (crawler by Eytan Bakshy, http://bit.ly/eytanb)
  • ∼ 25 million nodes, ∼ 800 million edges
Twitter: Degree Distribution
  [Figure: log-log histogram of degree for out-degree (friends) and in-degree (followers); x-axis: degree (10^0 to 10^6), y-axis: count (10^0 to 10^8)]

  • Aggregates users by number of friends/followers seen in crawl
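
The aggregation behind such a plot can be sketched in two steps: count friends and followers per user, then count users per degree value. This is a single-machine illustration with hypothetical names and toy data; it assumes each edge is stored as a (follower, followed) pair, which the slides do not specify.

```python
from collections import Counter


def out_and_in_degrees(edges):
    """Count friends (out-degree) and followers (in-degree) per user."""
    out_degree, in_degree = Counter(), Counter()
    for follower, followed in edges:
        out_degree[follower] += 1
        in_degree[followed] += 1
    return out_degree, in_degree


def degree_histogram(degree_by_user, users):
    """Map each degree value to the number of users with that degree."""
    # A Counter returns 0 for users it has never seen, so users with
    # no friends or no followers are counted under degree 0.
    return Counter(degree_by_user[user] for user in users)


# Hypothetical toy crawl: ("a", "b") means a follows b (assumed convention).
follows = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
users = ["a", "b", "c", "d"]
out_degree, in_degree = out_and_in_degrees(follows)
out_hist = degree_histogram(out_degree, users)
in_hist = degree_histogram(in_degree, users)
print(dict(out_hist), dict(in_hist))
```

Users with degree 0 show up in the counts but cannot be drawn on a logarithmic degree axis, so a plot like the one on this slide would need to handle them separately.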

              Many people not followed by anyone; few followed by many
                     Most people follow at least a few others
Twitter: Node-level clustering coefficient

  • Fraction of edges amongst a node’s friends/followers (Watts &
     Strogatz, 1998)
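
For a single node, the definition can be sketched as follows. This is the undirected form: with k neighbors, the coefficient is the number of links among those neighbors divided by k(k-1)/2. The Twitter graph is directed, so the variant used for the slides may differ; the toy graph and the choice to return 0 for nodes with fewer than two neighbors are assumptions made here.

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here: undefined cases count as 0
    links = sum(1 for u, v in combinations(sorted(neighbors), 2)
                if v in adjacency[u])
    return links / (k * (k - 1) / 2)


# Hypothetical undirected toy graph stored as sets of neighbors.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
for name in sorted(graph):
    print(name, clustering_coefficient(graph, name))
```

Computing this for every node requires each node to know which of its neighbors are connected to each other, which is what makes it a higher-order statistic rather than a simple per-edge count.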

  [Figure: histogram of node-level clustering coefficient for followers and friends; x-axis: clustering coefficient (0 to 0.8), y-axis: count (log scale, 10^0 to 10^8)]

                Surprisingly high density at 0.5 (many isolated triangles)

Future plans



  • Open-source release
  • “A Model of Computation for MapReduce”, Karloff, Suri, &
     Vassilvitskii, Symposium on Discrete Algorithms, 2010 (Accepted)
  • Twitter analysis publication (In progress)



  Goal: to enable analysis of large-scale social network data with readily
                       available software/hardware




Collaborators


    • Eytan Bakshy (m, y)
    • Sharad Goel (y)
    • Winter Mason (y)
    • Sid Suri (y)
    • Sergei Vassilvitskii (y)
    • Duncan Watts (y)
    • (You?)


(y) Yahoo! Research (http://research.yahoo.com)
(m) University of Michigan



Thanks.




                                         Questions?

   hofman@yahoo-inc.com, jakehofman.com

HW09 Social network analysis with Hadoop

  • 1. Social network analysis with Hadoop Jake Hofman Yahoo! Research October 2, 2009 Jake Hofman (Yahoo! Research) Social network analysis with Hadoop October 2, 2009
  • 2. Social networks • Rapid increase in amount and variety of social network data • Valuable information for products (recommendations, advertising, etc.) and research (structure/dynamics, diffusion, etc.)
  • 3. Social networks • Goal: to enable analysis of large-scale social network data with readily available software/hardware
  • 4. 1970s ∼ 10^1 nodes • Few direct observations; highly detailed info on nodes and edges • E.g. karate club (Zachary, 1977) [Figure: page excerpt from Zachary (1977), Journal of Anthropological Research, showing Figure 1, "Social Network Model of Relationships in the Karate Club"]
  • 5. 1990s ∼ 10^4 nodes • Larger, indirect samples; relatively few details on nodes and edges • E.g. APS co-authorship network (http://bit.ly/aps08jmh)
  • 6. Present ∼ 10^8+ nodes • Very large, dynamic samples; many details in node and edge metadata • E.g. Mail, Messenger, Facebook, Twitter, etc.
  • 7. Scale • Example numbers: • ∼ 10^7 nodes • ∼ 10^2 edges/node (degree) • no node/edge data • static • ∼8GB [Diagram: edge list linking User 1, User 2, ...]
  • 8. Scale • Example numbers: • ∼ 10^7 nodes • ∼ 10^2 edges/node (degree) • no node/edge data • static • ∼8GB [Diagram: edge list linking User 1, User 2, ...] • Simple, static networks push memory limit for commodity machines
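The ∼8GB figure on this slide can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes 8 bytes per stored edge (for example, two 4-byte integer node IDs); that per-edge size is an assumption for illustration and is not stated in the slides.

```python
def edge_list_bytes(num_nodes, mean_degree, bytes_per_edge=8):
    """Approximate size of a static edge list with no node/edge data.

    bytes_per_edge=8 is an assumed value (two 4-byte node IDs per edge).
    """
    num_edges = num_nodes * mean_degree
    return num_edges * bytes_per_edge


# Example numbers from the slide: ~10^7 nodes, ~10^2 edges per node.
size_bytes = edge_list_bytes(10**7, 10**2)
size_gb = size_bytes / 10**9
print(f"{size_gb:.0f} GB")
```

Under that assumption the edge list alone is about 10^9 edges, which is why even a static network with no metadata strains the memory of a single commodity machine.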
  • 9. Scale • Example numbers: • ∼ 10^7 nodes • ∼ 10^2 edges/node (degree) • node/edge metadata • dynamic • ∼100GB/day [Diagram: User 1 and User 2, each with a user profile and history, linked by messages with header and content]
  • 10. Scale • Example numbers: • ∼ 10^7 nodes • ∼ 10^2 edges/node (degree) • node/edge metadata • dynamic • ∼100GB/day [Diagram: User 1 and User 2, each with a user profile and history, linked by messages with header and content] • Dynamic, data-rich social networks exceed memory limits; require considerable storage
  • 11. Distributed network analysis: MapReduce convenient for parallelizing individual node/edge-level calculations
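A node-level calculation such as out-degree illustrates why MapReduce fits here: each edge can be mapped independently, and each node's count can be reduced independently. The following is a minimal local sketch, not the package's code; the function names and the tab-separated `src<TAB>dst` edge format are assumptions, and the in-process sort stands in for Hadoop's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter


def map_edge(line):
    """Emit (source node, 1) for one 'src<TAB>dst' edge line (assumed format)."""
    src, _dst = line.rstrip("\n").split("\t")
    yield src, 1


def reduce_counts(pairs):
    """Sum the counts for each key; pairs must already be sorted by key."""
    for node, group in groupby(pairs, key=itemgetter(0)):
        yield node, sum(count for _, count in group)


def out_degrees(lines):
    """Run map, sort, and reduce in one process to mimic a single job."""
    mapped = [pair for line in lines for pair in map_edge(line)]
    mapped.sort(key=itemgetter(0))  # stand-in for the shuffle/sort step
    return dict(reduce_counts(mapped))


edges = ["a\tb", "a\tc", "b\tc", "c\ta"]
degrees = out_degrees(edges)
print(degrees)
```

In a real streaming job the mapper and reducer would be separate scripts reading lines from standard input and writing tab-separated key/value lines to standard output.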
  • 12. Distributed network analysis: Higher-order calculations more difficult when network exceeds memory constraints, but can be adapted to MapReduce framework
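As one illustration of adapting a global calculation to repeated passes over the edge list, connected components can be found by propagating the minimum label across edges until nothing changes. This is a common approach and only a sketch: the slides do not say which algorithm the package uses, and the function names here are hypothetical. Each call to `propagate_labels` corresponds to what one MapReduce round would compute.

```python
def propagate_labels(edges, labels):
    """One round: every node adopts the smallest label seen across its edges."""
    new_labels = dict(labels)
    for u, v in edges:
        smallest = min(labels[u], labels[v])
        new_labels[u] = min(new_labels[u], smallest)
        new_labels[v] = min(new_labels[v], smallest)
    return new_labels


def connected_components(edges):
    """Repeat rounds until labels stop changing; equal labels mean same component."""
    nodes = {node for edge in edges for node in edge}
    labels = {node: node for node in nodes}
    while True:
        updated = propagate_labels(edges, labels)
        if updated == labels:
            return labels
        labels = updated


edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(edges)
print(components)
```

The number of rounds grows with the longest path a label has to travel, which is one reason higher-order calculations cost more than single-pass node statistics.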
  • 13. Package details • Network creation/manipulation: logs → edges; edge list ↔ adjacency list; directed ↔ undirected; edge thresholds • First-order descriptive statistics: number of nodes; number of edges; node degrees • Higher-order node-level descriptive statistics: clustering coefficient; implicit degree; ... • Global calculations: pairwise connectivity; connected components; minimum spanning tree; breadth-first search; PageRank; community detection
  • 14. Package details • Network creation/manipulation: logs → edges; edge list ↔ adjacency list; directed ↔ undirected; edge thresholds • First-order descriptive statistics: number of nodes; number of edges; node degrees • Higher-order node-level descriptive statistics: clustering coefficient; implicit degree; ... • Global calculations: pairwise connectivity; connected components; minimum spanning tree; breadth-first search; PageRank; community detection • Currently implemented in Streaming with Python • Algorithms exist/developed for additional features
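Two of the manipulation steps listed above, directed ↔ undirected and edge list ↔ adjacency list, can be sketched in memory as follows. This is an illustrative sketch on small data with hypothetical function names, not the package's streaming implementation.

```python
from collections import defaultdict


def to_undirected(edges):
    """Collapse directed edges into a set of undirected (sorted) pairs."""
    return {tuple(sorted((u, v))) for u, v in edges if u != v}


def to_adjacency_list(edges):
    """Build node -> sorted neighbor list from undirected edges."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return {node: sorted(neighbors) for node, neighbors in adjacency.items()}


def to_edge_list(adjacency):
    """Recover the sorted undirected edge list from an adjacency list."""
    return sorted({tuple(sorted((u, v)))
                   for u, neighbors in adjacency.items()
                   for v in neighbors})


directed = [("a", "b"), ("b", "a"), ("b", "c")]
undirected = to_undirected(directed)
adjacency = to_adjacency_list(undirected)
print(adjacency)
print(to_edge_list(adjacency))
```

The adjacency-list form groups all of a node's neighbors on one record, which is convenient for node-level statistics that need the full neighborhood at once.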
  • 15. Application: Twitter • Distributed crawl of Twitter social network + public messages (crawler by Eytan Bakshy, http://bit.ly/eytanb)
  • 16. Application: Twitter • Distributed crawl of Twitter social network + public messages (crawler by Eytan Bakshy, http://bit.ly/eytanb) • ∼ 25 million nodes, ∼ 800 million edges
  • 17. Twitter: Degree Distribution [Figure: log-log plot of count (10^0 to 10^8) versus degree (10^0 to 10^6); series: out-degree (friends), in-degree (followers)] • Aggregates users by number of friends/followers seen in crawl
  • 18. Twitter: Degree Distribution [Figure: log-log plot of count (10^0 to 10^8) versus degree (10^0 to 10^6); series: out-degree (friends), in-degree (followers)] • Many people not followed by anyone; few followed by many • Most people follow at least a few others
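The aggregation behind a degree distribution plot can be sketched as counting how many nodes share each degree value. The sketch below assumes each edge is a `(follower, followed)` pair, so out-degree counts friends and in-degree counts followers as in the figure legend; the toy edges are made up for illustration and are not crawl data.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (out-degree histogram, in-degree histogram) over all nodes seen."""
    out_degree = Counter()
    in_degree = Counter()
    nodes = set()
    for follower, followed in edges:
        out_degree[follower] += 1
        in_degree[followed] += 1
        nodes.update((follower, followed))
    # Counter lookups return 0 for missing keys, so nodes with no
    # followers (or no friends) are counted in the degree-0 bin.
    out_hist = Counter(out_degree[node] for node in nodes)
    in_hist = Counter(in_degree[node] for node in nodes)
    return out_hist, in_hist


edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
out_hist, in_hist = degree_distributions(edges)
print(dict(out_hist))
print(dict(in_hist))
```

Keeping the degree-0 bin matters for the in-degree curve, since the slide's first observation is about users who are followed by no one.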
  • 19. Twitter: Node-level clustering coefficient • Fraction of edges amongst a node’s friends/followers (Watts & Strogatz, 1998)
  • 20. Twitter: Node-level clustering coefficient [Figure: count (log scale, 10^0 to 10^8) versus clustering coefficient (0 to 0.8); series: followers, friends] • Fraction of edges amongst a node’s friends/followers (Watts & Strogatz, 1998)
  • 21. Twitter: Node-level clustering coefficient [Figure: count (log scale, 10^0 to 10^8) versus clustering coefficient (0 to 0.8); series: followers, friends] • Surprisingly high density at 0.5 (many isolated triangles)
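The node-level clustering coefficient defined on these slides, the fraction of possible edges among a node's neighbors that actually exist, can be sketched for a small undirected graph as below. Treating the graph as undirected and returning 0.0 for nodes with fewer than two neighbors are simplifying assumptions; the talk computes the measure separately over friends and followers in a directed network.

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """Fraction of neighbor pairs of `node` that are themselves connected."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention when no neighbor pair exists
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return links / (k * (k - 1) / 2)


# Toy undirected graph: a-b-c form a triangle, d hangs off a.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
for node in sorted(adjacency):
    print(node, clustering_coefficient(adjacency, node))
```

Because the calculation needs each neighbor's own neighbor set, it is a higher-order statistic: a single pass over edges is not enough, and the adjacency information has to be joined across nodes.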
  • 22. Future plans • Open-source release • “A Model of Computation for MapReduce”, Karloff, Suri, & Vassilvitskii, Symposium on Discrete Algorithms, 2010 (Accepted) • Twitter analysis publication (In progress) • Goal: to enable analysis of large-scale social network data with readily available software/hardware
  • 23. Collaborators • Eytan Bakshy (University of Michigan; Yahoo! Research) • Sharad Goel (Yahoo! Research) • Winter Mason (Yahoo! Research) • Sid Suri (Yahoo! Research) • Sergei Vassilvitskii (Yahoo! Research) • Duncan Watts (Yahoo! Research) • (You?) • Yahoo! Research: http://research.yahoo.com
  • 24. Thanks. Questions? • Contact: [email protected], jakehofman.com