Map & Reduce
Christopher Schleiden, Christian Corsten, Michael Lottko, Jinhui Li
The slides are licensed under a Creative Commons Attribution 3.0 License
Web Technologies

Outline
  • Motivation
  • Concept
  • Parallel Map & Reduce
  • Google’s MapReduce
  • Example: Word Count
  • Demo: Hadoop
  • Summary

Today the web is all about data!
  • Google: processing of 20 PB/day (2008)
  • LHC: will generate about 15 PB/year
  • Facebook: 2.5 PB of data + 15 TB/day (4/2009)
BUT: It takes ~2.5 hours to read one terabyte off a typical hard disk!
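
As a rough check of the 2.5-hour figure, the sketch below back-computes the sustained throughput it implies and shows how the read time shrinks when the data is spread evenly over several disks that are read in parallel. It assumes decimal units (1 TB = 10^12 bytes) and perfect scaling, both of which are simplifications; the function names are illustrative only.

```python
TB = 10**12  # assumption: decimal terabyte


def implied_throughput_mb_s(size_bytes: int, hours: float) -> float:
    """Sustained read rate in MB/s implied by reading size_bytes in `hours`."""
    return size_bytes / (hours * 3600) / 10**6


def parallel_read_hours(hours_single_disk: float, disks: int) -> float:
    """Idealised read time when the data is split evenly over `disks` drives."""
    return hours_single_disk / disks


rate = implied_throughput_mb_s(TB, 2.5)
print(f"implied throughput: {rate:.0f} MB/s")  # implied throughput: 111 MB/s

for n in (1, 10, 100):
    minutes = parallel_read_hours(2.5, n) * 60
    print(f"{n:>3} disks: {minutes:.1f} minutes")
```

Under these assumptions, one hundred disks bring the read time down from 150 minutes to about 1.5 minutes, which is the motivation for going parallel.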

Solution: Going Parallel!
However, parallel programming is hard!
  • Data distribution
  • Synchronization
  • Load balancing
  • …

Map & Reduce
  • Programming model and framework
  • Designed for processing large volumes of data in parallel
  • Based on the functional map and reduce concept
      • i.e., the output of a function depends only on its input; there are no side effects

Functional Concept
  • Map
      • Apply a function to each value of a sequence
      • map(k, v) → <k’, v’>*
  • Reduce/Fold
      • Combine all elements of a sequence using a binary operator
      • reduce(k’, <v’>*) → <k’, v’>*
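
The functional concept can be tried directly with Python's built-in `map` and `functools.reduce`. This is a minimal sketch of the plain functional operations, not of the key/value signatures used by the MapReduce framework; the sample values are made up.

```python
from functools import reduce
from operator import add

values = [1, 2, 3, 4]  # hypothetical input sequence

# Map: apply a function to each value of a sequence.
squares = list(map(lambda v: v * v, values))

# Reduce/fold: combine all elements of a sequence using a binary operator.
total = reduce(add, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Neither call mutates `values`: each result depends only on its inputs, and that property is what makes the work safe to distribute.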

Typical problem
  • Iterate over a large number of records
  • Extract something interesting (Map)
  • Shuffle & sort intermediate results
  • Aggregate intermediate results (Reduce)
  • Write final output
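
The five steps of the typical problem can be sketched as a tiny single-process pipeline. This is only an illustration of the data flow, under the assumption that everything fits in memory; `run_mapreduce`, the sample records, and the "maximum temperature per city" task are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run mapper and reducer over (key, value) records in one process."""
    # Steps 1-2: iterate over the records and extract intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Step 3: shuffle & sort, i.e. group the values by intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Steps 4-5: aggregate each group and collect the final output.
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output


def mapper(line_no, line):
    city, temp = line.split(",")
    yield city, int(temp)


def reducer(city, temps):
    yield city, max(temps)


records = [(0, "aachen,12"), (1, "berlin,9"), (2, "aachen,17")]
result = run_mapreduce(records, mapper, reducer)
print(result)  # [('aachen', 17), ('berlin', 9)]
```

Only `mapper` and `reducer` are specific to the problem; the surrounding loop is the part a framework can take over and parallelise.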

Parallel Map & Reduce

Parallel Map & Reduce
  • Published (2004) and patented (2010) by Google Inc.
  • C++ runtime with bindings to Java/Python
  • Other implementations:
      • Apache Hadoop/Hive project (Java)
          • Developed at Yahoo!
          • Used by: Facebook, Hulu, IBM, and many more
      • Microsoft COSMOS (Scope, based on SQL and C#)
      • Starfish (Ruby)
      • …

Parallel Map & Reduce /2
  • Parallel execution of the Map and Reduce stages
  • Scheduling through the Master/Worker pattern
  • The runtime handles:
      • Assigning workers to map and reduce tasks
      • Data distribution
      • Detecting crashed workers
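
Because map tasks are independent of each other, the map stage can be handed to a pool of workers. The sketch below uses threads from Python's standard `concurrent.futures` module as stand-ins for worker machines; the input splits are hypothetical, and a real runtime would also handle data distribution and crashed workers, which this example does not attempt.

```python
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """One map task: emit (word, 1) for every word in its input split."""
    return [(word, 1) for word in split.split()]


splits = ["a b a", "b c", "a c c"]  # hypothetical input splits

with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map returns results in the order of the input splits.
    per_task_output = list(pool.map(map_task, splits))

# Flatten the per-task outputs into one list of intermediate pairs.
pairs = [pair for task in per_task_output for pair in task]
print(len(pairs))  # 8
```

Threads keep the example self-contained; they illustrate the structure of the parallel stage rather than any speed-up.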

Parallel Map & Reduce Execution
[Figure: Input (DATA) → Map → Shuffle & Sort → Reduce → Output (RESULT)]

Components in Google’s MapReduce

Google Filesystem (GFS)
  • Stores…
      • Input data
      • Intermediate results
      • Final results
  • …in 64 MB chunks on at least three different machines
[Figure: a file split into chunks that are distributed across nodes]
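
The chunking and replication idea can be sketched as follows. The round-robin placement is an assumption made for illustration and is not GFS's actual placement policy; the function names and node names are hypothetical, and 64 MB is interpreted here as 64 × 2^20 bytes.

```python
MIB = 2**20


def split_into_chunks(file_size, chunk_size=64 * MIB):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, nodes, replicas=3):
    """Assign every chunk to `replicas` distinct nodes (round-robin, illustrative)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk: [nodes[(chunk + r) % len(nodes)] for r in range(replicas)]
        for chunk in range(num_chunks)
    }


chunks = split_into_chunks(200 * MIB)
print(len(chunks))            # 4
print(chunks[-1][1] // MIB)   # 8  (the last chunk is only partly filled)

nodes = ["n0", "n1", "n2", "n3", "n4"]  # hypothetical machines
placement = place_replicas(len(chunks), nodes)
print(placement[3])           # ['n3', 'n4', 'n0']
```

With three copies on different machines, losing a single node never makes a chunk unavailable.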

Scheduling (Master/Worker)
  • One master, many workers
      • Input data split into M map tasks (~64 MB in size; GFS)
      • Reduce phase partitioned into R tasks
      • Tasks are assigned to workers dynamically
  • Master assigns each map task to a free worker
  • Master assigns each reduce task to a free worker
  • Fault handling via redundancy
      • Master checks whether a worker is still alive via heartbeat
      • Reschedules the work item if the worker has died
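
The assignment and rescheduling logic can be simulated in a few lines. This is a deliberately simplified, sequential sketch: workers are picked round-robin, and a crash is injected through the hypothetical `dies_on` parameter instead of being detected by a real heartbeat. None of the names come from Google's implementation.

```python
from collections import deque


def schedule(tasks, workers, dies_on=None):
    """Assign tasks to workers; reschedule the task of a worker that died.

    dies_on maps a worker name to the task on which it crashes
    (hypothetical failure injection standing in for a missed heartbeat).
    """
    dies_on = dies_on or {}
    pending = deque(tasks)
    alive = list(workers)
    done = {}  # task -> worker that completed it
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no workers left")
        worker = alive[turn % len(alive)]
        task = pending.popleft()
        if dies_on.get(worker) == task:
            alive.remove(worker)   # master notices the missing heartbeat
            pending.append(task)   # the work item is rescheduled
            continue
        done[task] = worker
        turn += 1
    return done


tasks = ["map-0", "map-1", "map-2", "map-3"]
done = schedule(tasks, ["w1", "w2"], dies_on={"w2": "map-1"})
print(done["map-1"])  # w1
```

Here worker `w2` crashes on `map-1`, so the master hands that task to the surviving worker `w1` and all four tasks still complete.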

Scheduling Example
[Figure: the master assigns map and reduce tasks to workers; data flows from Input through the map workers to Temp, then through the reduce workers to Output]

Google’s M&R vs. Hadoop
  • Google MapReduce
      • Main language: C++
      • Google Filesystem (GFS)
          • GFS master
          • GFS chunkserver
  • Hadoop MapReduce
      • Main language: Java
      • Hadoop Filesystem (HDFS)
          • Hadoop namenode
          • Hadoop datanode

Word Count
The Map & Reduce “Hello World” example

Word Count - Input
Set of text files:
  • bar.txt: This is the bar file
  • foo.txt: Sweet, this is the foo file
Expected output:
  sweet (1), this (2), is (2), the (2), foo (1), bar (1), file (2)

Word Count - Map
  Mapper(filename, file-contents):
      for each word
          emit(word, 1)
Output:
  • bar.txt: this (1), is (1), the (1), bar (1), file (1)
  • foo.txt: sweet (1), this (1), is (1), the (1), foo (1), file (1)
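
A runnable version of the mapper could look like the sketch below. Lowercasing the text and stripping punctuation are assumptions, suggested by the lowercase words in the expected output; the regular expression is one simple way to do this. Note that "file" occurs in both input files, so it is emitted twice.

```python
import re


def mapper(filename, contents):
    """Emit (word, 1) for each word; lowercases and drops punctuation."""
    for word in re.findall(r"[a-z]+", contents.lower()):
        yield (word, 1)


files = {
    "bar.txt": "This is the bar file",
    "foo.txt": "Sweet, this is the foo file",
}

pairs = [pair for name, text in files.items() for pair in mapper(name, text)]
print(len(pairs))  # 11
print(pairs[:5])   # [('this', 1), ('is', 1), ('the', 1), ('bar', 1), ('file', 1)]
```

The mapper never looks at other files or keeps counts of its own; summing is left entirely to the reduce stage.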

Word Count - Shuffle & Sort
Before:
  this (1), is (1), the (1), bar (1), file (1), sweet (1), this (1), is (1), the (1), foo (1), file (1)
After:
  this (1), this (1), is (1), is (1), the (1), the (1), file (1), file (1), sweet (1), foo (1), bar (1)
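
The shuffle & sort step brings all values for the same key together. A minimal in-memory sketch using `sorted` and `itertools.groupby` is shown below; the input list is the map output for the two example files, and sorting keys alphabetically is an assumption (the framework only has to guarantee that equal keys end up in the same group).

```python
from itertools import groupby
from operator import itemgetter

pairs = [
    ("this", 1), ("is", 1), ("the", 1), ("bar", 1), ("file", 1),
    ("sweet", 1), ("this", 1), ("is", 1), ("the", 1), ("foo", 1), ("file", 1),
]


def shuffle_and_sort(pairs):
    """Group values by key; keys come out in sorted order."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


grouped = shuffle_and_sort(pairs)
for word, values in grouped:
    print(word, values)
```

`groupby` only merges adjacent equal keys, which is why the pairs are sorted first.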

Word Count - Reduce
  reducer(word, values):
      sum = 0
      for each value in values:
          sum = sum + value
      emit(word, sum)
Output:
  sweet (1), this (2), is (2), the (2), foo (1), bar (1), file (2)
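
The reducer pseudocode translates almost line by line into Python. In this sketch the grouped input is written out by hand as it would arrive from the shuffle & sort step for the two example files; `total` replaces `sum` only to avoid shadowing Python's built-in function.

```python
def reducer(word, values):
    """Sum all counts emitted for one word."""
    total = 0
    for value in values:
        total = total + value
    yield (word, total)


grouped = [
    ("bar", [1]), ("file", [1, 1]), ("foo", [1]), ("is", [1, 1]),
    ("sweet", [1]), ("the", [1, 1]), ("this", [1, 1]),
]

counts = dict(pair for word, values in grouped for pair in reducer(word, values))
print(counts)
# {'bar': 1, 'file': 2, 'foo': 1, 'is': 2, 'sweet': 1, 'the': 2, 'this': 2}
```

Each call to `reducer` sees only the values of a single key, so different keys can be reduced on different workers.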

Demo: Hadoop - Word Count

Summary
  • Lots of data processed on the web (e.g., Google)
  • Performance solution: Go parallel
  • Input, Map, Shuffle & Sort, Reduce, Output
  • Google File System
  • Scheduling: Master/Worker
  • Word Count example
  • Hadoop
  • Questions?

References
Inspirations for presentation:
  • http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf
  • http://www.scribd.com/doc/23844299/Map-Reduce-Hadoop-Pig
  • RWTH Map Reduce Talk: http://bit.ly/f5oM7p
Papers:
  • Dean et al., MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
  • Ghemawat et al., The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.

Editor's Notes

  • #4: These days the web is all about data. All major websites rely on huge amounts of data in some form in order to provide services to users. For example Google … and Facebook …. Facilities like the LHC will also produce data measured in petabytes each year. However, it takes about 2.5 hours to read one terabyte off a typical hard drive. The solution that comes immediately to mind, of course, is going parallel. Concrete example [TODO], [context of cloud computing]
  • #5: Parallel programming is still hard. Programmers have to deal with a lot of boilerplate code and have to manually write code for things like scheduling and load balancing. People also want to use the company cluster in parallel, so something like a batch system is needed. As more and more companies use huge amounts of data, a kind of standard framework or platform has emerged in recent years: the Map/Reduce framework.
  • #7: Map and reduce have been known for years as functional programming concepts.
  • #16: Actual execution and scheduling
  • #25: http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf