Unit 4
 What is Hadoop
 Motivation for Hadoop
 Hadoop Distributed File System
 MapReduce
 Hadoop ecosystem
 Open source software framework designed
for storage and processing of large scale data
on clusters of commodity hardware
 Created by Doug Cutting and Mike Cafarella
in 2005.
 Cutting named the program after his son’s
toy elephant.
 Data-intensive text processing
 Assembly of large genomes
 Graph mining
 Machine learning and data mining
 Large scale social network analysis
•Hadoop Common: contains libraries and other modules
•HDFS: Hadoop Distributed File System
•Hadoop YARN: Yet Another Resource Negotiator
•Hadoop MapReduce: a programming model for large-scale
data processing
 What were the limitations of earlier large-
scale computing?
 What requirements should an alternative
approach have?
 How does Hadoop address those
requirements?
 Historically computation was processor-
bound
◦ Data volume has been relatively small
◦ Complicated computations are performed on that
data
 Advances in computer technology have
historically centered on improving the
power of a single machine
 Power consumption limits the speed increase
we get from transistor density
 Distributed systems allow
developers to use multiple
machines for a single
task
 Programming on a distributed system is
much more complex
◦ Synchronizing data exchanges
◦ Managing finite bandwidth
◦ Controlling computation timing is complicated
“You know you have a distributed system when
the crash of a computer you’ve never
heard of stops you from getting any work
done.” –Leslie Lamport
 Distributed systems must be designed with
the expectation of failure
 Typically divided into Data Nodes and
Compute Nodes
 At compute time, data is copied to the
Compute Nodes
 Fine for relatively small amounts of data
 Modern systems deal with far more data than
was gathered in the past
 Facebook
◦ 500 TB per day
 Yahoo
◦ Over 170 PB
 eBay
◦ Over 6 PB
 Getting the data to the processors becomes
the bottleneck
 If a component fails, it should be able to
recover without restarting the entire system
 Component failure or recovery during a job
must not affect the final output
 Increasing resources should increase load
capacity
 Increasing the load on the system should
result in a graceful decline in performance for
all jobs
◦ Not system failure
 Based on work done by Google in the early
2000s
◦ “The Google File System” in 2003
◦ “MapReduce: Simplified Data Processing on Large
Clusters” in 2004
 The core idea was to distribute the data as it
is initially stored
◦ Each node can then perform computation on the
data it stores without moving the data for the initial
processing
 Applications are written in a high-level
programming language
◦ No network programming or temporal dependency
 Nodes should communicate as little as
possible
◦ A “shared nothing” architecture
 Data is spread among the machines in
advance
◦ Perform computation where the data is already
stored as often as possible
 HDFS is a file system written in Java, based on
Google’s GFS
 Provides redundant storage for massive
amounts of data
 HDFS works best with a smaller number of
large files
◦ Millions as opposed to billions of files
◦ Typically 100MB or more per file
 Files in HDFS are write once
 Optimized for streaming reads of large files
and not random reads
 Files are split into blocks
 Blocks are distributed across many machines at load
time
◦ Different blocks from the same file will be stored on
different machines
 Blocks are replicated across multiple
machines
 The NameNode keeps track of which blocks
make up a file and where they are stored
 Default replication is 3-fold
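For concreteness, block size and replication factor are cluster-level settings. Below is a minimal sketch of how they might appear in hdfs-site.xml; dfs.replication and dfs.blocksize are standard HDFS configuration keys, but the values shown are only illustrative and are not taken from the slides.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- number of copies kept for each block -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>  <!-- block size in bytes (128 MB) -->
  </property>
</configuration>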
 When a client wants to retrieve data
◦ Communicates with the NameNode to determine
which blocks make up a file and on which data
nodes those blocks are stored
◦ Then communicates directly with the data nodes to
read the data
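As a rough illustration of this read path, the sketch below uses the HDFS Java client API (FileSystem, Path). It assumes fs.defaultFS in the cluster configuration points at the NameNode; the file path /data/sample.txt is purely hypothetical.

// Minimal sketch of an HDFS streaming read, assuming the Hadoop client
// libraries are on the classpath and the cluster config files are available.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf);
         // open() asks the NameNode for the file's block locations; the returned
         // stream then reads each block directly from a DataNode that stores it.
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/sample.txt")),
                                   StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}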
 A method for distributing computation across
multiple nodes
 Each node processes the data that is stored at
that node
 Consists of two main phases
◦ Map
◦ Reduce
 Automatic parallelization and distribution
 Fault-Tolerance
 Provides a clean abstraction for programmers
to use
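To make the two phases concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). It follows the standard example shipped with Hadoop, so the class and method names are the library's own; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on the node holding each input split and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper runs wherever its input split is stored and emits (word, 1) pairs; the framework shuffles all pairs with the same key to one reducer, which sums them. This is exactly the Map/Reduce division described above.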
 Node-based flat file storage
 Suitable for structured and unstructured data;
supports a variety of data formats, such as
XML, JSON, and text-based flat file formats,
in real time
 Designed for analytical, big data processing
 Big data processing that does not require
consistent relationships between data items
 In a Hadoop cluster, a node requires only a
processor, a network card, and a few hard
drives.
 Key aspects of Hadoop
 Hadoop components
 Hadoop conceptual layer
 High-level architecture of Hadoop
 Open source software
 Framework
 Distributed
 Massive storage
 Faster processing
 Hadoop core components
1. HDFS:
a) Storage component.
b) Distributes data across several nodes.
c) Natively redundant.
2. MapReduce:
a) Computational framework.
b) Splits a task across multiple nodes.
c) Processes data in parallel.
 Hadoop ecosystem: the Hadoop ecosystem
comprises projects that extend the
functionality of the core Hadoop components.
These ecosystem projects include:
1. Hive
2. Pig
3. Sqoop
4. HBase
5. Flume
6. Oozie
Editor's Notes

  • #13: Example of failure issues: the Linux lab uses a distributed file system; if the file server fails, what happens?
  • #22: Default replication is 3-fold