MapReduce: Simplified Data Processing on Large Clusters
Dae Ho Kim, Dept of Computer Science, Sangmyung Univ.
Introduction
The Age of Big Data
SNS, IoT, Smart Phone -> Large-scale Computation
Introduction
The Age of Big Data
Automatic, Powerful, Simple -> MapReduce
MapReduce
Concept Description
Map: Input Data -> intermediate key/value pairs
Reduce: Merge the values for each intermediate key
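The Map/Reduce concept above can be sketched as a minimal word-count example. This is an illustration of the programming model only, not the actual MapReduce library API; the function names and the in-memory shuffle are assumptions:

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit an intermediate (key, value) pair for each word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: merge all values that share the same intermediate key.
    return sum(values)

def run_mapreduce(documents):
    # Shuffle: group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}

print(run_mapreduce(["big data", "big cluster"]))  # {'big': 2, 'data': 1, 'cluster': 1}
```

In the real system the grouping step happens across machines, but the user only ever writes the two small functions.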
Implementation
Overview, Fault Tolerance, …
Execution Overview
[Figure: execution overview. The master assigns map and reduce tasks to workers; task states are idle, in-progress, and completed.]
Implementation
Fault Tolerance
1. Worker Failure
• If no response is received from a worker in a certain amount of time, the master marks the
worker as failed.
• Any map task or reduce task in progress on a failed worker is reset to idle and becomes
eligible for rescheduling.
• Completed map tasks are re-executed on a failure because their output is stored on the
local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do
not need to be re-executed since their output is stored in a global file system.
• When a map task is executed first by worker A and then later executed by worker B
(because A failed), all workers executing reduce tasks are notified of the re-execution. Any
reduce task that has not already read the data from worker A will read the data from
worker B.
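The worker-failure rules above can be sketched as a small state-transition routine. This is a simplified model for illustration, not the paper's implementation; the `Task` class and state names are assumptions:

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

class Task:
    def __init__(self, kind, worker=None, state=IDLE):
        self.kind = kind        # "map" or "reduce"
        self.worker = worker
        self.state = state

def handle_worker_failure(tasks, failed_worker):
    """Apply the failure rules when the master marks a worker as failed."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        if task.state == IN_PROGRESS:
            # Any in-progress map or reduce task is reset to idle
            # and becomes eligible for rescheduling.
            task.state, task.worker = IDLE, None
        elif task.state == COMPLETED and task.kind == "map":
            # Completed map output lived on the failed machine's local
            # disk, so the task must be re-executed on another worker.
            task.state, task.worker = IDLE, None
        # Completed reduce tasks keep their state: their output already
        # sits in the global file system.
```

Note that only map tasks lose completed work, which is exactly why reduce workers must be told when a map task is re-executed elsewhere.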
Implementation
Fault Tolerance
2. Master Failure
• Our current implementation aborts the MapReduce computation if the master fails.
Clients can check for this condition and retry the MapReduce operation if they desire.
3. Semantics in the Presence of Failures
• When the user-supplied map and reduce operators are deterministic functions of their
input values, our distributed implementation produces the same output as would have
been produced by a non-faulting sequential execution of the entire program.
• If the master receives a completion message for an already completed map task, it
ignores the message.
• If the same reduce task is executed on multiple machines, multiple rename calls will be
executed for the same final output file. We rely on the atomic rename operation provided
by the underlying file system to guarantee that the final file system state contains just the
data produced by one execution of the reduce task.
• The vast majority of our map and reduce operators are deterministic, and the fact that our
semantics are equivalent to a sequential execution in this case makes it very easy for
programmers to reason about their program’s behavior.
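The atomic-rename guarantee above can be illustrated on an ordinary POSIX-style file system. This is only a sketch of the idea: the real system writes to a global file system (GFS), and the function and file names here are made up:

```python
import os
import tempfile

def commit_reduce_output(final_path, data):
    # Each reduce execution first writes to its own private temporary file...
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    # ...then atomically renames it to the final name. If duplicate
    # executions of the same reduce task race, the final file still
    # contains exactly one execution's complete output, never a mix.
    os.replace(tmp_path, final_path)
```

`os.replace` maps to the atomic `rename(2)` call on POSIX systems, which is the property the paper relies on.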
Implementation
Backup Tasks
 Backup Tasks
• One of the common causes that lengthens the total
time taken for a MapReduce operation is a “straggler”:
a machine that takes an unusually long time to
complete one of the last few map or reduce tasks in
the computation.
• When a MapReduce operation is close to completion,
the master schedules backup executions of the
remaining in-progress tasks. The task is marked as
completed whenever either the primary or the backup
execution completes.
• The sort program described in Section 5.3 takes 44%
longer to complete when the backup task mechanism is
disabled.
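The backup-task policy above amounts to: once the operation is close to done, duplicate the remaining in-progress tasks and accept whichever copy finishes first. A simplified sketch, where the completion threshold and data shapes are assumptions (the paper does not specify an exact cutoff):

```python
def schedule_backups(tasks, threshold=0.95):
    """Return ids of in-progress tasks to duplicate near completion.

    `tasks` maps task id -> state ("idle" / "in-progress" / "completed").
    """
    total = len(tasks)
    done = sum(1 for s in tasks.values() if s == "completed")
    if total == 0 or done / total < threshold:
        return []
    # Near the end, each remaining in-progress task gets a backup
    # execution; a task is marked completed as soon as either the
    # primary or the backup execution finishes.
    return [tid for tid, s in tasks.items() if s == "in-progress"]
```

Because this only fires on the last few tasks, it typically increases total resource use by no more than a few percent while removing straggler delays.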
Refinements
Partitioning Function, Combiner Function, …
Refinements
Partitioning Function, Combiner Function
 Partitioning Function
• Data gets partitioned across the R reduce tasks using a partitioning function on the intermediate key.
• A default partitioning function is provided that uses hashing (e.g. “hash(key) mod R”).
• For example, using “hash(Hostname(urlkey)) mod R” as the partitioning function causes all
URLs from the same host to end up in the same output file.
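The default and custom partitioning functions above might look like the following sketch. The stable CRC-32 checksum stands in for the paper's unspecified hash function, and extracting the hostname with `urlparse` is an assumption about what `Hostname(urlkey)` does:

```python
from urllib.parse import urlparse
import zlib

def default_partition(key, R):
    # Default: hash(key) mod R spreads keys evenly over the R reduce tasks.
    return zlib.crc32(key.encode()) % R

def host_partition(url_key, R):
    # Custom: hash(Hostname(urlkey)) mod R sends every URL from the same
    # host to the same reduce task, and hence the same output file.
    host = urlparse(url_key).netloc
    return zlib.crc32(host.encode()) % R
```

A stable hash matters here: Python's built-in `hash()` is randomized per process, which would scatter the same key across runs.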
 Combiner Function
• In some cases, there is significant repetition in the intermediate keys produced by each map
task, and the user-specified Reduce function is commutative and associative.
• We allow the user to specify an optional Combiner function that does partial merging of this
data before it is sent over the network.
• The Combiner function is executed on each machine that performs a map task.
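For a commutative and associative Reduce such as word-count summation, a combiner that partially merges one map task's output before it crosses the network might look like this sketch (an illustration, not the real library API):

```python
from collections import defaultdict

def combine(intermediate_pairs):
    """Partially merge (word, count) pairs produced by one map task."""
    partial = defaultdict(int)
    for word, count in intermediate_pairs:
        partial[word] += count
    # Far fewer pairs cross the network: one per distinct key per map task,
    # instead of one per occurrence.
    return list(partial.items())
```

In word count this often collapses thousands of `(the, 1)` pairs into a single `(the, N)` before the shuffle.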
Refinements
Skipping Bad Records, Counters
 Skipping Bad Records
• Sometimes it is acceptable to ignore a few records, for example when doing statistical
analysis on a large data set.
• We provide an optional mode of execution where the MapReduce library detects which
records cause deterministic crashes and skips these records in order to make forward
progress.
• When the master has seen more than one failure on a particular record (reported by a
worker's signal handler), it indicates that the record should be skipped when it issues the
next re-execution of the corresponding Map or Reduce task.
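The skip decision above can be modeled as the master counting crash reports per record. A simplified sketch; in the paper the worker's signal handler reports the record's sequence number to the master, and the class structure here is an assumption:

```python
from collections import Counter

class SkipTracker:
    def __init__(self):
        self.failures = Counter()

    def report_crash(self, record_id):
        # A worker's signal handler reports which record it was
        # processing when the deterministic crash occurred.
        self.failures[record_id] += 1

    def records_to_skip(self):
        # More than one failure on the same record: tell the next
        # re-execution of the task to skip it.
        return {r for r, n in self.failures.items() if n > 1}
```

Requiring a second failure avoids skipping records that merely happened to be in flight when a machine died for unrelated reasons.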
 Counters
• To use this facility, user code creates a named counter object and then increments the
counter appropriately in the Map and/or Reduce function.
• The current counter values are also displayed on the master status page so that a human can
watch the progress of the live computation.
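Use of the counter facility might look like the sketch below. The paper's example counts uppercase words in C++; this Python rendering, including the `Counter` class and `map_fn`, is an assumption for illustration:

```python
class Counter:
    def __init__(self, name):
        self.name, self.value = name, 0

    def increment(self, delta=1):
        # Workers piggyback counter values on their periodic status
        # messages; the master aggregates them per named counter.
        self.value += delta

uppercase = Counter("uppercase-words")

def map_fn(_key, text):
    for word in text.split():
        if word.isupper():
            uppercase.increment()
        yield (word, 1)

list(map_fn(None, "GNU is NOT unix"))
# uppercase.value is now 2 ("GNU" and "NOT")
```

The master also deduplicates counter updates from duplicate task executions, so counts stay exact despite backup tasks and re-execution.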
Editor's Notes

  • #6: This is where distributed computing comes in: a method in which multiple computers divide a single job and process it together. MapReduce was created to make this kind of distributed computing easier and more convenient.
  • #7: This is where distributed computing comes in: a method in which multiple computers divide a single job and process it together. MapReduce was created to make this kind of distributed computing easier and more convenient.
  • #16: Common causes of stragglers: contention for CPU, memory, local disk, network bandwidth, etc.