Lecture 2 – MapReduce. CPE 458 – Parallel Programming, Spring 2009. Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. http://creativecommons.org/licenses/by/2.5
Outline MapReduce: Programming Model MapReduce Examples A Brief History  MapReduce Execution Overview Hadoop MapReduce Resources
MapReduce: “A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.” Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.
MapReduce More simply, MapReduce is: A parallel programming model and associated implementation.
Programming Model. Description: the mental model the programmer has about the detailed execution of their application. Purpose: improve programmer productivity. Evaluation: expressibility, simplicity, performance.
Programming Models: von Neumann model. Execute a stream of instructions (machine code). Instructions can specify arithmetic operations, data addresses, and the next instruction to execute. Complexity: track billions of data locations and millions of instructions. Manage with modular design and high-level programming languages (isomorphic).
Programming Models: Parallel Programming Models. Message passing: independent tasks encapsulating local data; tasks interact by exchanging messages. Shared memory: tasks share a common address space; tasks interact by reading and writing this space asynchronously. Data parallelization: tasks execute a sequence of independent operations; data usually evenly partitioned across tasks; also referred to as “embarrassingly parallel”.
MapReduce: Programming Model. Process data using special map() and reduce() functions. The map() function is called on every item in the input and emits a series of intermediate key/value pairs. All values associated with a given key are grouped together. The reduce() function is called on every unique key, and its value list, and emits a value that is added to the output.
MapReduce: Programming Model
Input: "How now brown cow" and "How does it work now"
Map output: <How,1> <now,1> <brown,1> <cow,1> <How,1> <does,1> <it,1> <work,1> <now,1>
Grouped by the MapReduce framework: <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>
Reduce output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
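The word-count flow on this slide can be sketched in a few lines of Python. This is a single-process illustration only, not how Google's or Hadoop's runtime is implemented; the names run_mapreduce, map_words, and reduce_counts are hypothetical and chosen for this example.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a <word, 1> pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce: sum the list of counts collected for one word."""
    return sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    # Map phase: call map_fn on every input item.
    intermediate = []
    for item in inputs:
        intermediate.extend(map_fn(item))

    # Grouping phase: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: call reduce_fn once per unique key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


lines = ["How now brown cow", "How does it work now"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

In a real deployment the three phases run on different machines; here they are ordinary loops so that the data flow is easy to follow.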
MapReduce: Programming Model More formally, Map(k1,v1) --> list(k2,v2) Reduce(k2, list(v2)) --> list(v2)
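One way to read these signatures is as type aliases. The sketch below writes them with Python's typing module and pairs them with a made-up example in which the intermediate key type (an integer word length) differs from the input key type (a document name). The functions words_by_length and distinct_sorted are hypothetical and only illustrate the shapes of Map and Reduce.

```python
from typing import Callable, List, Tuple, TypeVar

K1 = TypeVar("K1")  # input key type
V1 = TypeVar("V1")  # input value type
K2 = TypeVar("K2")  # intermediate key type
V2 = TypeVar("V2")  # intermediate value type

# Map(k1, v1) --> list(k2, v2)
MapFn = Callable[[K1, V1], List[Tuple[K2, V2]]]
# Reduce(k2, list(v2)) --> list(v2)
ReduceFn = Callable[[K2, List[V2]], List[V2]]


def words_by_length(doc_name: str, text: str) -> List[Tuple[int, str]]:
    """A Map with k1=str, v1=str, k2=int, v2=str: key each word by its length."""
    return [(len(word), word) for word in text.split()]


def distinct_sorted(length: int, words: List[str]) -> List[str]:
    """A Reduce with k2=int, v2=str: return the distinct words of one length."""
    return sorted(set(words))


pairs = words_by_length("doc1", "to be or not")
print(pairs)
print(distinct_sorted(2, ["to", "be", "or", "to"]))
```

Note that Reduce returns a list of values, so it may emit zero, one, or several results per key.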
MapReduce Runtime System: partitions input data; schedules execution across a set of machines; handles machine failure; manages interprocess communication.
MapReduce Benefits. Greatly reduces parallel programming complexity: reduces synchronization complexity, automatically partitions data, provides failure transparency, handles load balancing. Practical: approximately 1000 Google MapReduce jobs run every day.
MapReduce Examples: Word frequency. Map reads a doc and emits <word,1> for each occurrence of a word; the runtime system groups these into <word,1,1,1>; Reduce emits <word,3>.
MapReduce Examples. Distributed grep: the Map function emits <word, line_number> if the word matches the search criteria; the Reduce function is the identity function. URL access frequency: the Map function processes web logs and emits <url, 1>; the Reduce function sums the values and emits <url, total>.
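Both examples can be sketched as small Python functions. The code below is illustrative only: the helper group_by_key stands in for the runtime's grouping step, and the log format "<client> <url> <status>" is an assumption made for this example, since the slides do not specify one.

```python
import re
from collections import defaultdict


def grep_map(line_number, line, pattern):
    """Map for distributed grep: emit <word, line_number> for matching words."""
    return [(word, line_number) for word in line.split() if re.search(pattern, word)]


def identity_reduce(key, values):
    """Reduce for distributed grep: pass the grouped values through unchanged."""
    return values


def url_map(log_line):
    """Map for URL access frequency: emit <url, 1>.

    Assumes a hypothetical log format "<client> <url> <status>".
    """
    fields = log_line.split()
    return [(fields[1], 1)]


def sum_reduce(url, counts):
    """Reduce for URL access frequency: total the counts for one URL."""
    return sum(counts)


def group_by_key(pairs):
    """Stand-in for the runtime's grouping of intermediate pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


text = ["map reduce", "reduce again"]
grep_pairs = [p for n, line in enumerate(text, start=1) for p in grep_map(n, line, r"^red")]
grep_result = {k: identity_reduce(k, v) for k, v in group_by_key(grep_pairs).items()}

logs = ["10.0.0.1 /index.html 200", "10.0.0.2 /about.html 200", "10.0.0.3 /index.html 304"]
url_pairs = [p for line in logs for p in url_map(line)]
url_totals = {k: sum_reduce(k, v) for k, v in group_by_key(url_pairs).items()}

print(grep_result)
print(url_totals)
```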
A Brief History: functional programming (e.g., Lisp). The map() function applies a function to each value of a sequence. The reduce() function combines all elements of a sequence using a binary operator.
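Python inherits the same two functional primitives, which makes the lineage easy to see. The short sketch below uses the built-in map() and functools.reduce(); the input list is arbitrary.

```python
from functools import reduce
import operator

numbers = [1, 2, 3, 4]

# map(): apply a function to each value of a sequence.
squares = list(map(lambda x: x * x, numbers))

# reduce(): combine all elements of a sequence using a binary operator.
total = reduce(operator.add, squares)

print(squares)
print(total)
```

MapReduce borrows the names and the general idea, but its map emits key/value pairs and its reduce runs once per key rather than once over the whole sequence.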
MapReduce Execution Overview: The user program, via the MapReduce library, shards the input data. Shards are typically 16-64 MB in size. (Diagram: the user program splits the input data into Shard 0 through Shard 6.)
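Sharding can be pictured as cutting the input into fixed-size pieces. The function below is a simplified sketch: it splits a byte string at exact byte offsets, whereas a real library would also have to respect record boundaries. The name shard and the tiny sizes are assumptions chosen so the example runs instantly.

```python
def shard(data: bytes, shard_size: int):
    """Split data into consecutive shards of at most shard_size bytes."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


# 100 bytes in 16-byte shards stands in for a large input in 16-64 MB shards.
data = b"x" * 100
shards = shard(data, 16)
print(len(shards), [len(s) for s in shards])
```

The number of shards produced here plays the role of M, the number of map tasks.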
MapReduce Execution Overview: The user program creates process copies distributed on a machine cluster. One copy will be the “Master” and the others will be workers. (Diagram: the user program starts one Master and several Workers.)
MapReduce Resources: The master distributes M map and R reduce tasks to idle workers. M is the number of shards; R is the number of parts into which the intermediate key space is divided. (Diagram: the Master sends Message(Do_map_task) to an idle worker.)
MapReduce Resources: Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs. Output is buffered in RAM. (Diagram: a map worker reads Shard 0 and produces key/value pairs.)
MapReduce Execution Overview: Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process. (Diagram: the map worker writes to local storage and sends the disk locations to the Master.)
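Partitioning into R regions is commonly done by hashing the key modulo R, so that every pair with the same key lands in the same region and therefore reaches the same reduce task. The sketch below assumes that scheme and uses zlib.crc32 as the hash, because Python's built-in hash() for strings varies between runs; both choices are assumptions for illustration.

```python
import zlib


def partition(key: str, num_regions: int) -> int:
    """Pick a region for a key: a stable hash of the key modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


def partition_pairs(pairs, num_regions):
    """Distribute intermediate <key, value> pairs over R in-memory regions."""
    regions = [[] for _ in range(num_regions)]
    for key, value in pairs:
        regions[partition(key, num_regions)].append((key, value))
    return regions


pairs = [("How", 1), ("now", 1), ("brown", 1), ("cow", 1), ("How", 1), ("now", 1)]
regions = partition_pairs(pairs, 2)
for index, region in enumerate(regions):
    print(index, region)
```

In the system described on the slide each region would be flushed to a file on the map worker's local disk rather than kept in a list.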
MapReduce Execution Overview: The Master process gives the disk locations to an available reduce-task worker, which reads all associated intermediate data. (Diagram: the Master sends disk locations to the reduce worker, which reads from remote storage.)
MapReduce Execution Overview: Each reduce-task worker sorts its intermediate data, then calls the reduce function once per unique key, passing in the key and its associated values. The reduce function's output is appended to the reduce task's partition output file. (Diagram: the reduce worker sorts data and writes a partition output file.)
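The sort-then-reduce step can be sketched with itertools.groupby, which groups adjacent items and therefore depends on the data being sorted by key first. The function run_reduce_task is a hypothetical name, and the returned list stands in for the partition output file.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(intermediate, reduce_fn):
    """Sort pairs by key, then call reduce_fn once per unique key."""
    # Sorting brings all pairs with the same key next to each other.
    ordered = sorted(intermediate, key=itemgetter(0))

    output = []  # stands in for the reduce task's partition output file
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key, reduce_fn(key, values)))
    return output


intermediate = [("now", 1), ("How", 1), ("now", 1), ("How", 1)]
result = run_reduce_task(intermediate, lambda key, values: sum(values))
print(result)
```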
MapReduce Execution Overview: The Master process wakes up the user process when all tasks have completed. The output is contained in R output files. (Diagram: the Master sends a wakeup to the user program, which reads the output files.)
MapReduce Execution Overview: Fault Tolerance. The Master process periodically pings workers. Map-task failure: re-execute, because all output was stored locally on the failed worker. Reduce-task failure: only re-execute partially completed tasks, because all output of completed tasks is stored in the global file system.
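The re-execution rule can be written down as a small decision function. This is a sketch of the rule as stated on the slide, not of any real scheduler; the Task record, its field names, and the state strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    kind: str                # "map" or "reduce"
    state: str               # "idle", "in_progress", or "completed"
    worker: Optional[str]    # worker the task was assigned to, if any


def tasks_to_reexecute(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Select the tasks the master must reschedule after a worker stops responding."""
    redo = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        if task.kind == "map" and task.state in ("in_progress", "completed"):
            # Map output lives on the failed worker's local disk, so it is lost.
            redo.append(task)
        elif task.kind == "reduce" and task.state == "in_progress":
            # Completed reduce output is already in the global file system.
            redo.append(task)
    return redo


tasks = [
    Task("map-0", "map", "completed", "w1"),
    Task("map-1", "map", "in_progress", "w2"),
    Task("reduce-0", "reduce", "completed", "w1"),
    Task("reduce-1", "reduce", "in_progress", "w1"),
]
redo_names = [task.name for task in tasks_to_reexecute(tasks, "w1")]
print(redo_names)
```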
Hadoop: Open source MapReduce implementation, http://hadoop.apache.org/core/index.html. Uses the Hadoop Distributed Filesystem (HDFS), http://hadoop.apache.org/core/docs/current/hdfs_design.html. Java; ssh.
References
Introduction to Parallel Programming and MapReduce, Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html
Distributed Systems, http://code.google.com/edu/parallel/index.html
MapReduce: Simplified Data Processing on Large Clusters, http://labs.google.com/papers/mapreduce.html
Hadoop, http://hadoop.apache.org/core/
