Hadoop tutorial for beginners - tibacademy.in
TIB Academy,
5/3, Varathur Road, Kundalahalli Gate,
Bangalore-560066.
+91-9513332301 / 02 www.tibacademy.in
 Open-source software framework designed
for storage and processing of large-scale
data on clusters of commodity hardware
 Created by Doug Cutting and Mike Cafarella
in 2005.
 Cutting named the program after his son’s
toy elephant.
 Data-intensive text processing
 Assembly of large genomes
 Graph mining
 Machine learning and data mining
 Large scale social network analysis
 Hadoop Common
› Contains libraries and other modules
 Hadoop HDFS
› The Hadoop Distributed File System
 Hadoop YARN
› Yet Another Resource Negotiator
 Hadoop MapReduce
› A programming model for large-scale data processing
 What were the limitations of earlier large-
scale computing?
 What requirements should an alternative
approach have?
 How does Hadoop address those
requirements?
 Historically computation was processor-
bound
› Data volume has been relatively small
› Complicated computations were performed on that data
 Advances in computer technology have historically centered on improving the power of a single machine
 Moore’s Law
› The number of transistors on a dense integrated
circuit doubles every two years
 Single-core computing can’t scale with
current computing needs
 Power consumption limits the speed
increase we get from transistor density
 Distributed systems allow developers to use multiple machines for a single task
 Programming on a distributed system is
much more complex
› Synchronizing data exchanges
› Managing finite bandwidth
› Controlling computation timing is complicated
“You know you have a distributed system when
the crash of a computer you’ve never
heard of stops you from getting any work
done.” –Leslie Lamport
 Distributed systems must be designed with
the expectation of failure
 Typically divided into Data Nodes and
Compute Nodes
 At compute time, data is copied to the
Compute Nodes
 Fine for relatively small amounts of data
 Modern systems deal with far more data than was gathered in the past
 Facebook
› 500 TB per day
 Yahoo
› Over 170 PB
 eBay
› Over 6 PB
 Getting the data to the processors becomes
the bottleneck
 Must support partial
failure
 Must be scalable
 Failure of a single component must not cause the failure of the entire system, only a degradation of application performance
 Failure should not
result in the loss of
any data
 If a component fails, it should be able to
recover without restarting the entire system
 Component failure or recovery during a job
must not affect the final output
 Increasing resources should increase load
capacity
 Increasing the load on the system should
result in a graceful decline in performance
for all jobs
› Not system failure
 Based on work done by Google in the early
2000s
› “The Google File System” in 2003
› “MapReduce: Simplified Data Processing on
Large Clusters” in 2004
 The core idea was to distribute the data as it
is initially stored
› Each node can then perform computation on the
data it stores without moving the data for the
initial processing
 Applications are written in a high-level
programming language
› No network programming or temporal dependency
 Nodes should communicate as little as possible
› A “shared nothing” architecture
 Data is spread among the machines in advance
› Perform computation where the data is already
stored as often as possible
 When data is loaded onto the system, it is divided into blocks (see the configuration sketch after this list)
› Typically 64MB or 128MB
 Tasks are divided into two phases
› Map tasks which are done on small portions of data
where the data is stored
› Reduce tasks which combine data to produce the
final output
 A master program allocates work to individual
nodes
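 The block size is a configuration property; a minimal, hypothetical sketch (assuming the Hadoop 2.x Java client, the dfs.blocksize property, and placeholder paths) of requesting 128MB blocks when loading a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: request a 128 MB block size for files written with this
// configuration, then copy a local file into HDFS. Paths are placeholders.
public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Cluster default block size: "
                + fs.getDefaultBlockSize(new Path("/")));
        fs.copyFromLocalFile(new Path("input.txt"), new Path("/data/input.txt"));
        fs.close();
    }
}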
 Failures are detected by the master program
which reassigns the work to a different node
 Restarting a task does not affect the nodes
working on other portions of the data
 If a failed node restarts, it is added back to the
system and assigned new tasks
 The master can redundantly execute the same task to guard against slow-running nodes (speculative execution)
 Responsible for storing data on the cluster
 Data files are split into blocks and distributed
across the nodes in the cluster
 Each block is replicated multiple times
 HDFS is a file system written in Java, based on Google's GFS (the Google File System)
 Provides redundant storage for massive
amounts of data
 HDFS works best with a smaller number of
large files
› Millions as opposed to billions of files
› Typically 100MB or more per file
 Files in HDFS are write once
 Optimized for streaming reads of large files
and not random reads
 Files are split into blocks
 Blocks are split across many machines at load
time
› Different blocks from the same file will be stored on
different machines
 Blocks are replicated across multiple machines
 The NameNode keeps track of which blocks
make up a file and where they are stored
 Default replication is 3-fold
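 As a hedged illustration of the NameNode's block map, the standard Java FileSystem API can report which blocks make up a file and which hosts store each replica (the file path below is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: ask the NameNode which blocks make up a file and which
// DataNodes hold each replica.
public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
        for (BlockLocation block :
                fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}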
 When a client wants to retrieve data
› It communicates with the NameNode to determine which blocks make up the file and on which DataNodes those blocks are stored
› It then communicates directly with the DataNodes to read the data
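 A minimal sketch of this read path using the standard Java client: open() consults the NameNode, and the returned stream reads directly from the DataNodes (the path is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the client read path: open() asks the NameNode for block
// locations, then the stream pulls the data from the DataNodes.
public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}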
 A method for distributing computation across
multiple nodes
 Each node processes the data that is stored at
that node
 Consists of two main phases
› Map
› Reduce
 Automatic parallelization and distribution
 Fault-Tolerance
 Provides a clean abstraction for
programmers to use
 Reads data as key/value pairs
› The key is often discarded
 Outputs zero or more key/value pairs
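 A minimal word-count Mapper is the standard illustration (a sketch, not code from this deck): the input key, the byte offset of the line, is discarded, and one (word, 1) pair is emitted per token:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: discards the input key (byte offset) and emits
// zero or more (word, 1) pairs for each input line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}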
 Output from the mapper is sorted by key
 All values with the same key are guaranteed
to go to the same machine
 Called once for each unique key
 Gets a list of all values associated with a key
as input
 The reducer outputs zero or more final
key/value pairs
› Usually just one output per input key
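 The matching word-count Reducer (again a sketch) is called once per unique word with all of its counts and writes a single (word, total) pair:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: sums the list of counts for one word and
// emits a single (word, total) output pair.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}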
 NameNode
› Holds the metadata for the HDFS
 Secondary NameNode
› Performs housekeeping functions for the NameNode
 DataNode
› Stores the actual HDFS data blocks
 JobTracker
› Manages MapReduce jobs
 TaskTracker
› Monitors individual Map and Reduce tasks
 Stores the HDFS file system metadata in an fsimage file
 Updates to the file system (adding/removing blocks) do not change the fsimage file
› They are instead written to an edit log
 When starting, the NameNode loads the fsimage file and then applies the changes from the edit log
 NOT a backup for the NameNode
 Periodically reads the edit log and applies the changes to the fsimage file, bringing it up to date
 Allows the NameNode to restart faster when
required
 JobTracker
› Determines the execution plan for the job
› Assigns individual tasks
 TaskTracker
› Keeps track of the performance of an individual
mapper or reducer
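 To see how a job is handed to these daemons, a minimal driver for the word-count sketch above could look like this (input and output paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: describes the job and submits it; the cluster then
// plans the execution, assigns tasks, and monitors them.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}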
 MapReduce is very powerful, but can be
awkward to master
 These tools allow programmers who are
familiar with other programming styles to
take advantage of the power of MapReduce
 Hive
› Hadoop processing with SQL
 Pig
› Hadoop processing with scripting
 Cascading
› Pipe and Filter processing model
 HBase
› Database model built on top of Hadoop
 Flume
› Designed for large scale data movement
Editor's Notes
• Slide 15: Example of failure issues. The Linux lab uses a distributed file system; what happens if the file server fails?
• Slide 24: Example of map and reduce.
• Slide 30: Default replication is 3-fold.