Hadoop, a distributed framework for
Big Data
Presented By:
Bhushan Kulkarni
T.E(I.T)
Contents
1. Introduction and Hadoop’s history
2. Architecture in detail
3. Hadoop in industry
What is Hadoop?
• An Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
• It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
What is Hadoop?
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
• Large datasets → terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• Hadoop is based on a simple programming model called MapReduce
• Hadoop is based on a simple data model: any data will fit
Brief History of Hadoop
• Google introduced the MapReduce algorithm.
• Doug Cutting and his team took the approach published by Google and in 2005 started an open-source project called Hadoop, which Doug named after his son's toy elephant.
Hadoop’s Developers
Doug Cutting
2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.
2006: Yahoo gave the project to the Apache Software Foundation.
Large-Scale Data Analytics
• MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems
• Many enterprises are turning to Hadoop
• Especially applications generating big data
• Web applications, social networks, scientific applications
Why is Hadoop able to compete?
Hadoop (MapReduce):
• Scalability (petabytes of data, thousands of machines)
• Flexibility in accepting all data formats (no schema)
• Commodity, inexpensive hardware
• Efficient and simple fault-tolerant mechanism
Traditional databases:
• Performance (tons of indexing, tuning, and data-organization techniques)
• Structured data
Key Components
• The Hadoop framework consists of two main layers
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop: How it Works
Hadoop Architecture
• Master node (single node) and many slave nodes
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)
• Centralized namenode: maintains metadata about files
• Many datanodes (1000s): store the actual data
  - A file is divided into blocks (64 MB each)
  - Each block is replicated N times (default N = 3)
Main Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
• Replication: each data block is replicated many times (default is 3)
• Failure: with this many machines, failure is the norm rather than the exception
• Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
• The namenode constantly monitors the datanodes
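As an illustration of these properties, the block size, replication factor, and replica locations of a stored file can be inspected through Hadoop's Java FileSystem API. This is a minimal sketch, not from the slides; the path /data/fileF and the class name are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
  public static void main(String[] args) throws Exception {
    // Connects to the filesystem named in the cluster configuration.
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/fileF")); // hypothetical file
    System.out.println("block size:  " + status.getBlockSize());   // e.g. 64 MB
    System.out.println("replication: " + status.getReplication()); // e.g. 3
    // One BlockLocation per block; getHosts() names the datanodes holding each
    // replica, the same information the scheduler uses to place map tasks.
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(loc.getOffset() + " : " + String.join(", ", loc.getHosts()));
    }
  }
}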
What is MapReduce?
• MapReduce is a programming model
• Programs written in this functional style are automatically
parallelized and executed on a large cluster of commodity
machines
• MapReduce is an associated implementation for processing
and generating large data sets.
MapReduce
• Map: a function that processes a key/value pair to generate a set of intermediate key/value pairs
• Reduce: a function that merges all intermediate values associated with the same intermediate key
Properties of the MapReduce Engine
• The Job Tracker is the master node (runs with the namenode)
• Receives the user’s job
• Decides how many tasks will run (number of mappers)
• Decides where to run each mapper (concept of locality)
• Example: a file with 5 blocks → run 5 map tasks; to run the task reading block 1, try to schedule it on a node that holds a replica of that block (e.g., Node 1 or Node 3)
Properties of the MapReduce Engine (Cont’d)
• The Task Tracker is the slave node (runs on each datanode)
• Receives tasks from the Job Tracker
• Runs each task to completion (either a map or a reduce task)
• Stays in constant communication with the Job Tracker, reporting progress
The Programming Model of MapReduce
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key.
The Reduce function, also written by the user, accepts an intermediate key I and the set of values for that key. It merges these values together to form a possibly smaller set of values.
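In the notation of the original MapReduce paper, the two user-written functions have the following types (a notational summary added here for reference):

map    (k1, v1)        → list(k2, v2)
reduce (k2, list(v2))  → list(v2)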
MapReduce data flow with a single reduce task: all map outputs are merged at one reducer, which writes a single output file
MapReduce data flow with multiple reduce tasks: map output is partitioned by key (e.g., by hash), one partition per reducer, so each reducer writes its own output part
MapReduce data flow with no reduce tasks: each map task writes its output directly to HDFS
Example 1: Color Count
Job: count the number of occurrences of each color in a data set
• Input blocks on HDFS feed the map tasks; each map parses its records and produces (k, v) = (color, 1)
• Shuffle & sorting based on k: each pair is parse-hashed to a partition, so all pairs for one color reach the same reduce task
• Each reduce task consumes (k, [v]) = (color, [1,1,1,1,1,1, ...]) and produces (k’, v’) = (color, count), e.g. (blue, 100)
• The output file has 3 parts (Part0001, Part0002, Part0003), probably on 3 different machines
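This job maps directly onto Hadoop's Java MapReduce API. The sketch below is illustrative, not from the original slides: the class names are invented, and it assumes each input record holds a single color name.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse each record and emit (color, 1).
public class ColorCountMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text color = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    color.set(value.toString().trim()); // assumes one color name per record
    context.write(color, ONE);
  }
}

// Reduce: sum the 1s for each color, e.g. (blue, [1,1,1,...]) -> (blue, 100).
class ColorCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

The hand-off between the two classes is what the slide calls "shuffle & sorting based on k": the framework hash-partitions the map output by key so that each reducer sees every value for its colors.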
Example 2: Color Filter
Job: select only the blue and the green colors
• Input blocks on HDFS feed the map tasks; each map task selects only the blue or green records and writes its output directly to HDFS
• No need for a reduce phase
• The output file has 4 parts (Part0001 through Part0004), probably on 4 different machines
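Skipping the reduce phase is a one-line setting in Hadoop's Java API: with zero reduce tasks, each mapper's output is written straight to HDFS as its own part file. A minimal sketch under the same assumptions as before (invented class names, one color name per record):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only filter: keep a record only if it is blue or green.
class ColorFilterMapper extends Mapper<Object, Text, Text, NullWritable> {
  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String color = value.toString().trim();
    if (color.equals("blue") || color.equals("green")) {
      context.write(value, NullWritable.get());
    }
  }
}

public class ColorFilterJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "color filter");
    job.setJarByClass(ColorFilterJob.class);
    job.setMapperClass(ColorFilterMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(0); // no reduce phase: mappers write directly to HDFS
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With four input blocks, four map tasks run and the output directory holds four part files, matching the slide's Part0001 through Part0004.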
Why use Hadoop?
• Need to process multi-petabyte datasets
• Data may not have a strict schema
• Expensive to build reliability into each application
• Need for common infrastructure
• Very large distributed file system
• Assumes commodity hardware
• Optimized for batch processing
Who Uses MapReduce/Hadoop?
• Google: inventors of the MapReduce computing paradigm
• Yahoo: driving development of Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others, plus universities and research labs
THANK YOU!!