Hadoop/MapReduce Computing Paradigm
Special Topics in DBs: Large-Scale Data Management

Large-Scale Data Analytics
The MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems:
- Many enterprises are turning to Hadoop
- Especially applications generating big data
- Web applications, social networks, scientific applications
Why is Hadoop able to compete?

Hadoop:
- Scalability (petabytes of data, thousands of machines)
- Flexibility in accepting all data formats (no schema)
- Commodity, inexpensive hardware
- Efficient and simple fault-tolerance mechanism

Databases:
- Performance (tons of indexing, tuning, and data-organization techniques)
- Features: provenance tracking, annotation management, and more
What is Hadoop?

Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
- Large datasets → terabytes or petabytes of data
- Large clusters → hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model: any data will fit
What is Hadoop? (Cont’d)

The Hadoop framework consists of two main layers:
- Distributed file system (HDFS)
- Execution engine (MapReduce)
Hadoop Master/Slave Architecture

Hadoop is designed as a master-slave, shared-nothing architecture:
- Master node (single node)
- Many slave nodes
Design Principles of Hadoop

- Need to process big data
- Need to parallelize computation across thousands of nodes
- Commodity hardware: a large number of low-end, cheap machines working in parallel to solve a computing problem
- This is in contrast to parallel DBs, which use a small number of high-end, expensive machines
Design Principles of Hadoop (Cont’d)

- Automatic parallelization & distribution: hidden from the end-user
- Fault tolerance and automatic recovery: nodes/tasks will fail and will recover automatically
- Clean and simple programming abstraction: users only provide two functions, “map” and “reduce”
Who Uses MapReduce/Hadoop?

- Google: inventors of the MapReduce computing paradigm
- Yahoo: developed Hadoop, the open-source implementation of MapReduce
- IBM, Microsoft, Oracle
- Facebook, Amazon, AOL, Netflix
- Many others, plus universities and research labs
Hadoop: How it Works
Hadoop Architecture

- Master node (single node)
- Many slave nodes
Two layers span the cluster:
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)

Centralized namenode
- Maintains metadata info about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks
- Each block is replicated N times (default = 3)
[Diagram: file F divided into five 64 MB blocks (1–5), distributed across the datanodes]
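As a concrete illustration (not from the slides), here is a minimal sketch of how a client might talk to HDFS through Hadoop's Java FileSystem API; the file path is hypothetical, and the replication factor and block size come from the cluster's configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks and replicates each block.
        Path file = new Path("/demo/fileF.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("some data...");
        }

        // Ask the namenode where each block's replicas live.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
    }
}
```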
Main Properties of HDFS

- Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
- Replication: each data block is replicated many times (default is 3)
- Failure: failure is the norm rather than the exception
- Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS; the namenode is constantly checking on the datanodes
Map-Reduce Execution Engine (Example: Color Count)

[Diagram: input blocks on HDFS feed four Map tasks; each Map produces (k, v) pairs such as (color, 1); a parse-hash step routes the pairs into shuffle & sorting based on k; three Reduce tasks each consume (k, [v]) lists such as (color, [1,1,1,1,1,1,...]) and produce (k’, v’) results such as (color, 100)]

Users only provide the “Map” and “Reduce” functions.
Properties of MapReduce Engine

The Job Tracker is the master node (runs alongside the namenode)
- Receives the user’s job
- Decides how many tasks will run (number of mappers)
- Decides where to run each mapper (concept of locality)
Example: a file has 5 blocks → run 5 map tasks. Where should the task reading block “1” run? Try to run it on a node that holds a replica of that block (e.g., Node 1 or Node 3 in the diagram).
Properties of MapReduce Engine (Cont’d)

The Task Tracker is the slave node (runs on each datanode)
- Receives the task from the Job Tracker
- Runs the task until completion (either a map or a reduce task)
- Always in communication with the Job Tracker, reporting progress

[Diagram: four Map tasks feeding, via parse-hash, into three Reduce tasks]
In this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks.
Key-Value Pairs

Mappers and Reducers are users’ code (provided functions); they just need to obey the key-value pair interface.
Mappers:
- Consume <key, value> pairs
- Produce <key, value> pairs
Reducers:
- Consume <key, <list of values>>
- Produce <key, value>
Shuffling and sorting:
- Hidden phase between mappers and reducers
- Groups all identical keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
A sketch of these interfaces in Java follows below.
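As a rough illustration of this interface, here is a minimal sketch using Hadoop's Java API; the generic type parameters are the key/value types the user chooses, and the class names and emitted values here are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// A Mapper consumes <KEYIN, VALUEIN> pairs and produces <KEYOUT, VALUEOUT> pairs.
class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit one or more <key, value> pairs per input record.
        context.write(new Text("someKey"), new IntWritable(1)); // illustrative
    }
}

// A Reducer consumes <key, list-of-values> and produces <key, value>.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get(); // aggregate the grouped values
        context.write(key, new IntWritable(sum));
    }
}
```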
MapReduce Phases

Deciding what will be the key and what will be the value is the developer’s responsibility. For word count, for example, the key is the word and the value is a count of 1.
Example 1: Word Count

Job: count the occurrences of each word in a data set.
[Diagram: map tasks emit (word, 1) pairs; reduce tasks sum the counts per word]
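The canonical Hadoop word count looks roughly like the sketch below: a minimal version assuming the standard org.apache.hadoop.mapreduce API, with input and output paths supplied on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each word in the input line, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the 1s for each word and emit (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```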
Example 2: Color Count

Job: count the number of occurrences of each color in a data set.
[Diagram: input blocks on HDFS feed four Map tasks; each Map produces (color, 1) pairs; parse-hash routes them into shuffle & sorting based on k; three Reduce tasks each consume (color, [1,1,1,1,1,1,...]) and produce totals such as (color, 100)]
The output consists of 3 part files (Part0001–Part0003), most likely on 3 different machines. Structurally, this job is word count with colors as the keys.
Example 3: Color Filter

Job: select only the blue and the green colors.
[Diagram: input blocks on HDFS feed four Map tasks; each Map emits only the matching (color, 1) pairs and writes its output directly to HDFS]
- Each map task selects only the blue or green colors
- No need for a reduce phase
The output consists of 4 part files (Part0001–Part0004), most likely on 4 different machines. A minimal sketch of such a map-only job follows below.
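A map-only job like this filter can be sketched as follows (hypothetical class names and record format, assuming one color per input line); the key detail is setting the number of reduce tasks to zero so each mapper writes its output directly to HDFS:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorFilter {

    // Map: pass through only records whose color is blue or green; drop the rest.
    public static class FilterMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String color = value.toString().trim(); // assume one color per line
            if (color.equals("blue") || color.equals("green")) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color filter");
        job.setJarByClass(ColorFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0); // map-only: each mapper writes its own part file to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```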
Bigger Picture: Hadoop vs. Other Systems

Computing model:
- Distributed databases: notion of transactions; a transaction is the unit of work; ACID properties and concurrency control
- Hadoop: notion of jobs; a job is the unit of work; no concurrency control

Data model:
- Distributed databases: structured data with a known schema; read/write mode
- Hadoop: any data will fit in any format, (un)structured or semi-structured; read-only mode

Cost model:
- Distributed databases: expensive servers
- Hadoop: cheap commodity machines

Fault tolerance:
- Distributed databases: failures are rare; recovery mechanisms
- Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance

Key characteristics:
- Distributed databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance
Cloud Computing
- A computing model where any computing infrastructure can run on the cloud
- Hardware & software are provided as remote services
- Elastic: grows and shrinks based on the user’s demand
- Example: Amazon EC2
Thank You

Presented by Kelly Technologies (Hadoop training in Hyderabad)