MapReduce: Simplified Data Processing on Large Clusters
Dae Ho Kim, Dept of Computer Science, Sangmyung Univ.
Introduction
The Age of Big Data
SNS, IoT, Smart Phone -> Large-scale Computation
Introduction
The Age of Big Data
Automatic, Powerful, Simple -> MapReduce
MapReduce
Concept Description
Map: Input Data -> intermediate key/value pairs
Reduce: Merge the values for each intermediate key
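The Map/Reduce concept above can be sketched as a minimal word-count example. This is an illustration of the programming model only, not the actual MapReduce library API; the function names and the in-memory shuffle are assumptions:

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit an intermediate (key, value) pair for each word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: merge all values that share the same intermediate key.
    return sum(values)

def run_mapreduce(documents):
    # Shuffle: group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}

print(run_mapreduce(["big data", "big cluster"]))  # {'big': 2, 'data': 1, 'cluster': 1}
```

In the real system the grouping step happens across machines, but the user only ever writes the two small functions.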
Implementation
Overview, Fault Tolerance, …
Execution Overview
[Figure: execution overview. The master assigns map and reduce tasks to workers; task states are idle, in-progress, and completed.]
Implementation
Fault Tolerance
1. Worker Failure
• If no response is received from a worker in a certain amount of time, the master marks the
worker as failed.
• Any map task or reduce task in progress on a failed worker is reset to idle and becomes
eligible for rescheduling.
• Completed map tasks are re-executed on a failure because their output is stored on the
local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do
not need to be re-executed since their output is stored in a global file system.
• When a map task is executed first by worker A and then later executed by worker B
(because A failed), all workers executing reduce tasks are notified of the re-execution. Any
reduce task that has not already read the data from worker A will read the data from
worker B.
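The worker-failure rules above can be sketched as a small state-transition routine. This is a simplified model for illustration, not the paper's implementation; the `Task` class and state names are assumptions:

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

class Task:
    def __init__(self, kind, worker=None, state=IDLE):
        self.kind = kind        # "map" or "reduce"
        self.worker = worker
        self.state = state

def handle_worker_failure(tasks, failed_worker):
    """Apply the failure rules when the master marks a worker as failed."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        if task.state == IN_PROGRESS:
            # Any in-progress map or reduce task is reset to idle
            # and becomes eligible for rescheduling.
            task.state, task.worker = IDLE, None
        elif task.state == COMPLETED and task.kind == "map":
            # Completed map output lived on the failed machine's local
            # disk, so the task must be re-executed on another worker.
            task.state, task.worker = IDLE, None
        # Completed reduce tasks keep their state: their output already
        # sits in the global file system.
```

Note that only map tasks lose completed work, which is exactly why reduce workers must be told when a map task is re-executed elsewhere.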
Implementation
Fault Tolerance
2. Master Failure
• Our current implementation aborts the MapReduce computation if the master fails.
Clients can check for this condition and retry the MapReduce operation if they desire.
3. Semantics in the Presence of Failures
• When the user-supplied map and reduce operators are deterministic functions of their
input values, our distributed implementation produces the same output as would have
been produced by a non-faulting sequential execution of the entire program.
• If the master receives a completion message for an already completed map task, it
ignores the message.
• If the same reduce task is executed on multiple machines, multiple rename calls will be
executed for the same final output file. We rely on the atomic rename operation provided
by the underlying file system to guarantee that the final file system state contains just the
data produced by one execution of the reduce task.
• The vast majority of our map and reduce operators are deterministic, and the fact that our
semantics are equivalent to a sequential execution in this case makes it very easy for
programmers to reason about their program’s behavior.
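The atomic-rename guarantee above can be illustrated on an ordinary POSIX-style file system. This is only a sketch of the idea: the real system writes to a global file system (GFS), and the function and file names here are made up:

```python
import os
import tempfile

def commit_reduce_output(final_path, data):
    # Each reduce execution first writes to its own private temporary file...
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    # ...then atomically renames it to the final name. If duplicate
    # executions of the same reduce task race, the final file still
    # contains exactly one execution's complete output, never a mix.
    os.replace(tmp_path, final_path)
```

`os.replace` maps to the atomic `rename(2)` call on POSIX systems, which is the property the paper relies on.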
Implementation
Backup Tasks
 Backup Tasks
• One of the common causes that lengthens the total
time taken for a MapReduce operation is a “straggler”:
a machine that takes an unusually long time to
complete one of the last few map or reduce tasks in
the computation.
• When a MapReduce operation is close to completion,
the master schedules backup executions of the
remaining in-progress tasks. The task is marked as
completed whenever either the primary or the backup
execution completes.
• The sort program described in Section 5.3 takes 44%
longer to complete when the backup task mechanism is
disabled.
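The backup-task policy above amounts to: once the operation is close to done, duplicate the remaining in-progress tasks and accept whichever copy finishes first. A simplified sketch, where the completion threshold and data shapes are assumptions (the paper does not specify an exact cutoff):

```python
def schedule_backups(tasks, threshold=0.95):
    """Return ids of in-progress tasks to duplicate near completion.

    `tasks` maps task id -> state ("idle" / "in-progress" / "completed").
    """
    total = len(tasks)
    done = sum(1 for s in tasks.values() if s == "completed")
    if total == 0 or done / total < threshold:
        return []
    # Near the end, each remaining in-progress task gets a backup
    # execution; a task is marked completed as soon as either the
    # primary or the backup execution finishes.
    return [tid for tid, s in tasks.items() if s == "in-progress"]
```

Because this only fires on the last few tasks, it typically increases total resource use by no more than a few percent while removing straggler delays.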
Refinements
Partitioning Function, Combiner Function, …
Refinements
Partitioning Function, Combiner Function
 Partitioning Function
• Data gets partitioned across the R reduce tasks using a partitioning function on the intermediate key.
• A default partitioning function is provided that uses hashing (e.g. “hash(key) mod R”).
• For example, using “hash(Hostname(urlkey)) mod R” as the partitioning function causes all
URLs from the same host to end up in the same output file.
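The default and custom partitioning functions above might look like the following sketch. The stable CRC-32 checksum stands in for the paper's unspecified hash function, and extracting the hostname with `urlparse` is an assumption about what `Hostname(urlkey)` does:

```python
from urllib.parse import urlparse
import zlib

def default_partition(key, R):
    # Default: hash(key) mod R spreads keys evenly over the R reduce tasks.
    return zlib.crc32(key.encode()) % R

def host_partition(url_key, R):
    # Custom: hash(Hostname(urlkey)) mod R sends every URL from the same
    # host to the same reduce task, and hence the same output file.
    host = urlparse(url_key).netloc
    return zlib.crc32(host.encode()) % R
```

A stable hash matters here: Python's built-in `hash()` is randomized per process, which would scatter the same key across runs.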
 Combiner Function
• In some cases, there is significant repetition in the intermediate keys produced by each map
task, and the user-specified Reduce function is commutative and associative.
• We allow the user to specify an optional Combiner function that does partial merging of this
data before it is sent over the network.
• The Combiner function is executed on each machine that performs a map task.
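For a commutative and associative Reduce such as word-count summation, a combiner that partially merges one map task's output before it crosses the network might look like this sketch (an illustration, not the real library API):

```python
from collections import defaultdict

def combine(intermediate_pairs):
    """Partially merge (word, count) pairs produced by one map task."""
    partial = defaultdict(int)
    for word, count in intermediate_pairs:
        partial[word] += count
    # Far fewer pairs cross the network: one per distinct key per map task,
    # instead of one per occurrence.
    return list(partial.items())
```

In word count this often collapses thousands of `(the, 1)` pairs into a single `(the, N)` before the shuffle.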
Refinements
Skipping Bad Records, Counters
 Skipping Bad Records
• Sometimes it is acceptable to ignore a few records, for example when doing statistical
analysis on a large data set.
• We provide an optional mode of execution where the MapReduce library detects which
records cause deterministic crashes and skips these records in order to make forward
progress.
• When the master has seen more than one failure on a particular record (reported by a
worker's signal handler), it indicates that the record should be skipped when it issues the
next re-execution of the corresponding Map or Reduce task.
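The skip decision above can be modeled as the master counting crash reports per record. A simplified sketch; in the paper the worker's signal handler reports the record's sequence number to the master, and the class structure here is an assumption:

```python
from collections import Counter

class SkipTracker:
    def __init__(self):
        self.failures = Counter()

    def report_crash(self, record_id):
        # A worker's signal handler reports which record it was
        # processing when the deterministic crash occurred.
        self.failures[record_id] += 1

    def records_to_skip(self):
        # More than one failure on the same record: tell the next
        # re-execution of the task to skip it.
        return {r for r, n in self.failures.items() if n > 1}
```

Requiring a second failure avoids skipping records that merely happened to be in flight when a machine died for unrelated reasons.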
 Counters
• To use this facility, user code creates a named counter object and then increments the
counter appropriately in the Map and/or Reduce function.
• The current counter values are also displayed on the master status page so that a human can
watch the progress of the live computation.
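Use of the counter facility might look like the sketch below. The paper's example counts uppercase words in C++; this Python rendering, including the `Counter` class and `map_fn`, is an assumption for illustration:

```python
class Counter:
    def __init__(self, name):
        self.name, self.value = name, 0

    def increment(self, delta=1):
        # Workers piggyback counter values on their periodic status
        # messages; the master aggregates them per named counter.
        self.value += delta

uppercase = Counter("uppercase-words")

def map_fn(_key, text):
    for word in text.split():
        if word.isupper():
            uppercase.increment()
        yield (word, 1)

list(map_fn(None, "GNU is NOT unix"))
# uppercase.value is now 2 ("GNU" and "NOT")
```

The master also deduplicates counter updates from duplicate task executions, so counts stay exact despite backup tasks and re-execution.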
Editor's Notes

  • #6: This is where distributed computing comes in: a method in which multiple computers divide a single job and process it together. MapReduce was created to make this kind of distributed computing easier and more convenient.
  • #7: This is where distributed computing comes in: a method in which multiple computers divide a single job and process it together. MapReduce was created to make this kind of distributed computing easier and more convenient.
  • #16: Common causes of stragglers: contention for CPU, memory, local disk, network bandwidth, etc.