Design Patterns for Efficient Graph Algorithms in MapReduce
  Jimmy Lin and Michael Schatz
  University of Maryland
  Tuesday, June 29, 2010
  This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
@lintool
Talk Outline
  • Graph algorithms
  • Graph algorithms in MapReduce
  • Making it efficient
  • Experimental results
  Punch line: per-iteration running time -69% on 1.4b link webgraph!
What’s a graph?
  • G = (V, E), where
      • V represents the set of vertices (nodes)
      • E represents the set of edges (links)
      • Both vertices and edges may contain additional information
  • Graphs are everywhere:
      • E.g., hyperlink structure of the web, interstate highway system, social networks, etc.
  • Graph problems are everywhere:
      • E.g., random walks, shortest paths, MST, max flow, bipartite matching, clustering, etc.
Source: Wikipedia (Königsberg)
Graph Representation
  • G = (V, E)
  • Typically represented as adjacency lists:
      • Each node is associated with its neighbors (via outgoing edges)
  • Example:
      1: 2, 4
      2: 1, 3, 4
      3: 1
      4: 1, 3
  [Figure: the corresponding directed graph on nodes 1 to 4]
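
As a rough sketch, the adjacency lists on this slide map directly onto a Python dictionary. The names `graph`, `out_degree`, and `edges` are illustrative and do not come from the talk.

```python
# Adjacency lists from the slide: node -> neighbors reached via outgoing edges.
graph = {
    1: [2, 4],
    2: [1, 3, 4],
    3: [1],
    4: [1, 3],
}


def out_degree(adj, node):
    """Number of outgoing edges of `node`."""
    return len(adj[node])


def edges(adj):
    """Flatten the adjacency lists into (source, target) pairs."""
    return [(src, dst) for src, nbrs in adj.items() for dst in nbrs]
```

Storing only outgoing edges keeps each record self-contained, which is what lets a mapper process one node at a time.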
“Message Passing” Graph Algorithms
  • Large class of iterative algorithms on sparse, directed graphs
  • At each iteration:
      • Computations at each vertex
      • Partial results (“messages”) passed (usually) along directed edges
      • Computations at each vertex: messages aggregate to alter state
  • Iterate until convergence
A Few Examples…
  • Parallel breadth-first search (SSSP)
      • Messages are distances from source
      • Each node emits current distance + 1
      • Aggregation = MIN
  • PageRank
      • Messages are partial PageRank mass
      • Each node evenly distributes mass to neighbors
      • Aggregation = SUM
  • DNA Sequence assembly
      • Michael Schatz’s dissertation
  [Slide callouts: “Boring!”, “Still boring!”]
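
The parallel breadth-first search example can be sketched as a loop of map-like and reduce-like steps. This is a single-process simulation under assumed names (`bfs_iteration`, `parallel_bfs`, and the toy `graph`), not code from the talk: each reached node emits its distance + 1 along its outgoing edges, and each node aggregates incoming messages with MIN.

```python
import math
from collections import defaultdict

# Toy graph (an assumption for illustration): node -> outgoing neighbors.
graph = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}


def bfs_iteration(adj, dist):
    """One round of message passing for unit-weight shortest paths."""
    messages = defaultdict(list)
    # "Map": every reached node sends (distance + 1) to each neighbor.
    for node, nbrs in adj.items():
        if dist[node] < math.inf:
            for nbr in nbrs:
                messages[nbr].append(dist[node] + 1)
    # "Reduce": each node keeps the MIN of its old distance and its messages.
    return {node: min([dist[node]] + messages[node]) for node in adj}


def parallel_bfs(adj, source):
    """Iterate until no distance changes (convergence)."""
    dist = {node: math.inf for node in adj}
    dist[source] = 0
    while True:
        new_dist = bfs_iteration(adj, dist)
        if new_dist == dist:
            return dist
        dist = new_dist
```

Swapping the emitted value and the aggregation function (SUM instead of MIN) gives the PageRank variant of the same loop.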
PageRank in a nutshell…
  • Random surfer model:
      • User starts at a random Web page
      • User randomly clicks on links, surfing from page to page
      • With some probability, user randomly jumps around
  • PageRank…
      • Characterizes the amount of time spent on any given page
      • Mathematically, a probability distribution over pages
PageRank: Defined
  • Given page x with inlinks t1…tn, where
      • C(t) is the out-degree of t
      • α is the probability of random jump
      • N is the total number of nodes in the graph
  [Figure: page X with inlinks from t1, t2, …, tn]
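
The formula on this slide was an image and did not survive in the transcript. A standard statement of PageRank that is consistent with the definitions listed on the slide, writing α for the probability of a random jump (the symbol itself is an assumption), would be:

```latex
PR(x) = \alpha \left( \frac{1}{N} \right) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}
```

The first term is the mass a page receives from random jumps; the second is the mass arriving over its inlinks, with each linking page t_i splitting its PageRank evenly over its C(t_i) outgoing links.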
Sample PageRank Iteration (1)
  Iteration 1 (PageRank before -> after):
      n1: 0.2 -> 0.066
      n2: 0.2 -> 0.166
      n3: 0.2 -> 0.166
      n4: 0.2 -> 0.3
      n5: 0.2 -> 0.3
  Mass passed along edges: 0.1 (four edges), 0.066 (three edges), 0.2 (two edges)
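Sample PageRank Iteration (2)
  Iteration 2 (PageRank before -> after):
      n1: 0.066 -> 0.1
      n2: 0.166 -> 0.133
      n3: 0.166 -> 0.183
      n4: 0.3 -> 0.2
      n5: 0.3 -> 0.383
  Mass passed along edges: 0.033 (two edges), 0.083 (two edges), 0.1 (three edges), 0.166 (one edge), 0.3 (one edge)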
PageRank in MapReduce
  [Figure: nodes n1 to n5 flowing from the Map phase to the Reduce phase]
PageRank Pseudo-Code
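
The pseudo-code on this slide was an image and is not part of the transcript. What follows is a minimal sketch of one basic PageRank iteration, simulated in plain Python rather than Hadoop; it is not the authors' code. The edge list in `GRAPH` is an assumption: the graph figure is lost, but this set of edges reproduces the node values on the two sample-iteration slides. Like those slides, the sketch ignores random jumps and dangling nodes.

```python
from collections import defaultdict

# Assumed edge list (figure lost); reproduces the sample-iteration values.
GRAPH = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}


def map_node(node, rank, neighbors):
    """Emit the graph structure plus one mass message per outgoing edge."""
    yield node, ("structure", neighbors)
    for nbr in neighbors:
        yield nbr, ("mass", rank / len(neighbors))


def reduce_node(node, values):
    """Recover the adjacency list and SUM the incoming mass."""
    neighbors, rank = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbors = payload
        else:
            rank += payload
    return node, rank, neighbors


def pagerank_iteration(graph, ranks):
    """Run map, group by key (standing in for the shuffle), then reduce."""
    shuffled = defaultdict(list)
    for node, neighbors in graph.items():
        for key, value in map_node(node, ranks[node], neighbors):
            shuffled[key].append(value)
    return {node: reduce_node(node, vals)[1] for node, vals in shuffled.items()}


ranks = {node: 0.2 for node in GRAPH}
ranks = pagerank_iteration(GRAPH, ranks)  # state after iteration 1
```

Note that the mapper emits the adjacency list alongside the mass messages, so the graph structure is shuffled on every iteration. That overhead is what the design patterns later in the talk set out to reduce.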
Why don’t distributed algorithms scale?
Source: http://www.flickr.com/photos/fusedforces/4324320625/
Three Design Patterns
  • In-mapper combining: efficient local aggregation
  • Smarter partitioning: create more opportunities
  • Schimmy: avoid shuffling the graph
In-Mapper Combining
  • Use combiners
      • Perform local aggregation on map output
      • Downside: intermediate data is still materialized
  • Better: in-mapper combining
      • Preserve state across multiple map calls, aggregate messages in buffer, emit buffer contents at end
      • Downside: requires memory management
  [Figure: a buffer shared across configure, map, and close]
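
A minimal sketch of in-mapper combining, assuming a mapper object with the configure/map/close lifecycle named on the slide. The class `InMapperCombiningMapper` and its `emit` callback are hypothetical stand-ins written in Python; they are not Hadoop's API.

```python
from collections import defaultdict


class InMapperCombiningMapper:
    """Hypothetical mapper that aggregates PageRank mass before emitting."""

    def __init__(self, emit):
        self.emit = emit  # callback receiving (key, value) pairs

    def configure(self):
        # Called once before any input: create the in-memory buffer.
        self.buffer = defaultdict(float)

    def map(self, node, rank, neighbors):
        # Accumulate mass per destination instead of emitting per edge.
        for nbr in neighbors:
            self.buffer[nbr] += rank / len(neighbors)

    def close(self):
        # Called once after all input: emit one message per destination.
        for nbr, mass in self.buffer.items():
            self.emit(nbr, mass)
        self.buffer.clear()


out = []
mapper = InMapperCombiningMapper(lambda key, value: out.append((key, value)))
mapper.configure()
mapper.map("n1", 0.2, ["n2", "n4"])
mapper.map("n3", 0.2, ["n4"])
mapper.close()
```

Here three edge messages collapse into two emitted records, because both messages for n4 are summed in the buffer. The sketch never flushes early; a real mapper would need to bound the buffer size, which is the memory-management downside noted on the slide.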
Better Partitioning
  • Default: hash partitioning
      • Randomly assign nodes to partitions
  • Observation: many graphs exhibit local structure
      • E.g., communities in social networks
      • Better partitioning creates more opportunities for local aggregation
  • Unfortunately… partitioning is hard!
      • Sometimes, chicken-and-egg
      • But in some domains (e.g., webgraphs) take advantage of cheap heuristics
      • For webgraphs: range partition on domain-sorted URLs
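
To make the contrast concrete, here is a small sketch of the two partitioning schemes. The function names, the CRC32 hash, and the use of the URL host as the sort key are all assumptions for illustration; the talk does not specify how the domain sort was implemented.

```python
import zlib
from urllib.parse import urlsplit


def hash_partition(url, num_partitions):
    """Hash partitioning: placement depends only on a hash of the key."""
    return zlib.crc32(url.encode("utf-8")) % num_partitions


def range_partition(urls, num_partitions):
    """Sort URLs by host, then cut the sorted list into contiguous ranges."""
    ordered = sorted(urls, key=lambda u: (urlsplit(u).hostname or "", u))
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return {url: index // size for index, url in enumerate(ordered)}


urls = [
    "http://a.example/1",
    "http://b.example/1",
    "http://a.example/2",
    "http://b.example/2",
]
parts = range_partition(urls, 2)
```

With range partitioning, pages from the same host land in the same partition, so links between them (common on the web) produce messages that can be aggregated locally.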
Schimmy Design Pattern
  • Basic implementation contains two dataflows:
      • Messages (actual computations)
      • Graph structure (“bookkeeping”)
  • Schimmy: separate the two data flows, shuffle only the messages
      • Basic idea: merge join between graph structure and messages
  [Figure: relations S and T, both sorted by join key]
  [Figure: partitions S1/T1, S2/T2, S3/T3, both relations consistently partitioned and sorted by join key]
Do the Schimmy!
  • Schimmy = reduce-side parallel merge join between graph structure and messages
      • Consistent partitioning between input and intermediate data
      • Mappers emit only messages (actual computation)
      • Reducers read graph structure directly from HDFS
  [Figure: three reducers, each joining one partition pair (S1/T1, S2/T2, S3/T3) of intermediate data (messages) and graph structure read from HDFS]
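
The reduce-side merge join can be sketched as a single pass over two sorted streams. The function `schimmy_reduce` and its inputs are hypothetical, and the sketch assumes that both streams are sorted by node id and that every message is addressed to a node present in the structure stream.

```python
def schimmy_reduce(structure, messages):
    """Merge join of two streams sorted by node id.

    structure: iterable of (node, neighbors), standing in for the graph
               partition a reducer would read directly from HDFS
    messages:  iterable of (node, mass), the shuffled intermediate data
    """
    messages = iter(messages)
    pending = next(messages, None)
    for node, neighbors in structure:
        rank = 0.0
        # Consume every message addressed to the current node.
        while pending is not None and pending[0] == node:
            rank += pending[1]
            pending = next(messages, None)
        yield node, rank, neighbors


structure = [("n1", ["n2", "n4"]), ("n2", ["n3", "n5"]), ("n3", ["n4"])]
messages = [("n1", 0.066), ("n2", 0.1), ("n2", 0.066)]
result = list(schimmy_reduce(structure, messages))
```

Because the adjacency lists come from the structure stream, the mappers no longer need to emit them, and only the mass messages pass through the shuffle.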
Experiments
  • Cluster setup:
      • 10 workers, each 2 cores (3.2 GHz Xeon), 4 GB RAM, 367 GB disk
      • Hadoop 0.20.0 on RHELS 5.3
  • Dataset:
      • First English segment of ClueWeb09 collection
      • 50.2m web pages (1.53 TB uncompressed, 247 GB compressed)
      • Extracted webgraph: 1.4 billion links, 7.0 GB
      • Dataset arranged in crawl order
  • Setup:
      • Measured per-iteration running time (5 iterations)
      • 100 partitions
Results
  [Figure: results chart, built up over several slides; surviving labels: “Best Practices”; +18%, -15%, -60%, -69%; 1.4b, 674m, 86m]
Take-Away Messages
  • Lots of interesting graph problems!
      • Social network analysis
      • Bioinformatics
  • Reducing intermediate data is key
      • Local aggregation
      • Better partitioning
      • Less bookkeeping
Complete details in Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs (MLG-2010), July 2010, Washington, D.C.
  http://mapreduce.me/
Source code available in Cloud9
  http://cloud9lib.org/
@lintool

Design Patterns for Efficient Graph Algorithms in MapReduce (Hadoop Summit 2010)
