Hadoop/MapReduce Computing Paradigm
Special Topics in DBs: Large-Scale Data Management

Large-Scale Data Analytics
The MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems:
- Many enterprises are turning to Hadoop
- Especially applications generating big data
- Web applications, social networks, scientific applications
Why is Hadoop able to compete?

Hadoop:
- Scalability (petabytes of data, thousands of machines)
- Flexibility in accepting all data formats (no schema)
- Commodity, inexpensive hardware
- Efficient and simple fault-tolerance mechanism

Databases:
- Performance (tons of indexing, tuning, and data-organization techniques)
- Features: provenance tracking, annotation management, and more
What is Hadoop?

Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
- Large datasets → terabytes or petabytes of data
- Large clusters → hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model: any data will fit
What is Hadoop? (Cont’d)

The Hadoop framework consists of two main layers:
- Distributed file system (HDFS)
- Execution engine (MapReduce)
Hadoop Master/Slave Architecture

Hadoop is designed as a master-slave, shared-nothing architecture:
- Master node (single node)
- Many slave nodes
Design Principles of Hadoop

- Need to process big data
- Need to parallelize computation across thousands of nodes
- Commodity hardware: a large number of low-end, cheap machines working in parallel to solve a computing problem
- This is in contrast to parallel DBs, which use a small number of high-end, expensive machines
Design Principles of Hadoop (Cont’d)

- Automatic parallelization & distribution: hidden from the end-user
- Fault tolerance and automatic recovery: nodes/tasks will fail and will recover automatically
- Clean and simple programming abstraction: users only provide two functions, “map” and “reduce”
Who Uses MapReduce/Hadoop?

- Google: inventors of the MapReduce computing paradigm
- Yahoo: developed Hadoop, the open-source implementation of MapReduce
- IBM, Microsoft, Oracle
- Facebook, Amazon, AOL, Netflix
- Many others, plus universities and research labs
Hadoop: How it Works
Hadoop Architecture

- Master node (single node)
- Many slave nodes
Two layers span the cluster:
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)

Centralized namenode
- Maintains metadata info about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks
- Each block is replicated N times (default = 3)
[Diagram: file F divided into five 64 MB blocks (1–5), distributed across the datanodes]
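As a concrete illustration (not from the slides), here is a minimal sketch of how a client might talk to HDFS through Hadoop's Java FileSystem API; the file path is hypothetical, and the replication factor and block size come from the cluster's configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks and replicates each block.
        Path file = new Path("/demo/fileF.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("some data...");
        }

        // Ask the namenode where each block's replicas live.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
    }
}
```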
Main Properties of HDFS

- Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
- Replication: each data block is replicated many times (default is 3)
- Failure: failure is the norm rather than the exception
- Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS; the namenode is constantly checking on the datanodes
Map-Reduce Execution Engine (Example: Color Count)

[Diagram: input blocks on HDFS feed four Map tasks; each Map produces (k, v) pairs such as (color, 1); a parse-hash step routes the pairs into shuffle & sorting based on k; three Reduce tasks each consume (k, [v]) lists such as (color, [1,1,1,1,1,1,...]) and produce (k’, v’) results such as (color, 100)]

Users only provide the “Map” and “Reduce” functions.
Properties of MapReduce Engine

The Job Tracker is the master node (runs alongside the namenode)
- Receives the user’s job
- Decides how many tasks will run (number of mappers)
- Decides where to run each mapper (concept of locality)
Example: a file has 5 blocks → run 5 map tasks. Where should the task reading block “1” run? Try to run it on a node that holds a replica of that block (e.g., Node 1 or Node 3 in the diagram).
Properties of MapReduce Engine (Cont’d)

The Task Tracker is the slave node (runs on each datanode)
- Receives the task from the Job Tracker
- Runs the task until completion (either a map or a reduce task)
- Always in communication with the Job Tracker, reporting progress

[Diagram: four Map tasks feeding, via parse-hash, into three Reduce tasks]
In this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks.
Key-Value Pairs

Mappers and Reducers are users’ code (provided functions); they just need to obey the key-value pair interface.
Mappers:
- Consume <key, value> pairs
- Produce <key, value> pairs
Reducers:
- Consume <key, <list of values>>
- Produce <key, value>
Shuffling and sorting:
- Hidden phase between mappers and reducers
- Groups all identical keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
A sketch of these interfaces in Java follows below.
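As a rough illustration of this interface, here is a minimal sketch using Hadoop's Java API; the generic type parameters are the key/value types the user chooses, and the class names and emitted values here are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// A Mapper consumes <KEYIN, VALUEIN> pairs and produces <KEYOUT, VALUEOUT> pairs.
class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit one or more <key, value> pairs per input record.
        context.write(new Text("someKey"), new IntWritable(1)); // illustrative
    }
}

// A Reducer consumes <key, list-of-values> and produces <key, value>.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get(); // aggregate the grouped values
        context.write(key, new IntWritable(sum));
    }
}
```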
MapReduce Phases

Deciding what will be the key and what will be the value is the developer’s responsibility. For word count, for example, the key is the word and the value is a count of 1.
Example 1: Word Count

Job: count the occurrences of each word in a data set.
[Diagram: map tasks emit (word, 1) pairs; reduce tasks sum the counts per word]
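The canonical Hadoop word count looks roughly like the sketch below: a minimal version assuming the standard org.apache.hadoop.mapreduce API, with input and output paths supplied on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each word in the input line, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the 1s for each word and emit (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```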
Example 2: Color Count

Job: count the number of occurrences of each color in a data set.
[Diagram: input blocks on HDFS feed four Map tasks; each Map produces (color, 1) pairs; parse-hash routes them into shuffle & sorting based on k; three Reduce tasks each consume (color, [1,1,1,1,1,1,...]) and produce totals such as (color, 100)]
The output consists of 3 part files (Part0001–Part0003), most likely on 3 different machines. Structurally, this job is word count with colors as the keys.
Example 3: Color Filter

Job: select only the blue and the green colors.
[Diagram: input blocks on HDFS feed four Map tasks; each Map emits only the matching (color, 1) pairs and writes its output directly to HDFS]
- Each map task selects only the blue or green colors
- No need for a reduce phase
The output consists of 4 part files (Part0001–Part0004), most likely on 4 different machines. A minimal sketch of such a map-only job follows below.
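A map-only job like this filter can be sketched as follows (hypothetical class names and record format, assuming one color per input line); the key detail is setting the number of reduce tasks to zero so each mapper writes its output directly to HDFS:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorFilter {

    // Map: pass through only records whose color is blue or green; drop the rest.
    public static class FilterMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String color = value.toString().trim(); // assume one color per line
            if (color.equals("blue") || color.equals("green")) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color filter");
        job.setJarByClass(ColorFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0); // map-only: each mapper writes its own part file to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```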
Bigger Picture: Hadoop vs. Other Systems

Computing model:
- Distributed databases: notion of transactions; a transaction is the unit of work; ACID properties and concurrency control
- Hadoop: notion of jobs; a job is the unit of work; no concurrency control

Data model:
- Distributed databases: structured data with a known schema; read/write mode
- Hadoop: any data will fit in any format, (un)structured or semi-structured; read-only mode

Cost model:
- Distributed databases: expensive servers
- Hadoop: cheap commodity machines

Fault tolerance:
- Distributed databases: failures are rare; recovery mechanisms
- Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance

Key characteristics:
- Distributed databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance
Cloud Computing
- A computing model where any computing infrastructure can run on the cloud
- Hardware & software are provided as remote services
- Elastic: grows and shrinks based on the user’s demand
- Example: Amazon EC2
Thank You

Presented by Kelly Technologies (Hadoop training in Hyderabad)