Hadoop/MapReduce Computing Paradigm
Special Topics in DBs: Large-Scale Data Management
Large-Scale Data Analytics

MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems

- Many enterprises are turning to Hadoop
  - Especially applications generating big data
  - Web applications, social networks, scientific applications
Why Is Hadoop Able to Compete?

Hadoop's strengths:
- Scalability (petabytes of data, thousands of machines)
- Flexibility in accepting all data formats (no schema)
- Commodity, inexpensive hardware
- Efficient and simple fault-tolerance mechanism

Databases' strengths:
- Performance (tons of indexing, tuning, and data-organization techniques)
- Features: provenance tracking, annotation management, ...
What Is Hadoop?

- Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  - Large datasets: terabytes or petabytes of data
  - Large clusters: hundreds or thousands of nodes
- Hadoop is an open-source implementation of Google's MapReduce
- Hadoop is based on a simple programming model called MapReduce
- Hadoop is based on a simple data model: any data will fit
What Is Hadoop? (Cont'd)

- The Hadoop framework consists of two main layers:
  - Distributed file system (HDFS)
  - Execution engine (MapReduce)
Hadoop Master/Slave Architecture

- Hadoop is designed as a master-slave, shared-nothing architecture
  - Master node (single node)
  - Many slave nodes
Design Principles of Hadoop

- Need to process big data
- Need to parallelize computation across thousands of nodes
- Commodity hardware
  - A large number of low-end, cheap machines working in parallel to solve a computing problem
  - This is in contrast to parallel DBs, which use a small number of high-end, expensive machines
Design Principles of Hadoop (Cont'd)

- Automatic parallelization & distribution
  - Hidden from the end user
- Fault tolerance and automatic recovery
  - Nodes/tasks will fail and will recover automatically
- Clean and simple programming abstraction
  - Users only provide two functions: "map" and "reduce"
Who Uses MapReduce/Hadoop?

- Google: inventors of the MapReduce computing paradigm
- Yahoo: developing Hadoop, the open-source implementation of MapReduce
- IBM, Microsoft, Oracle
- Facebook, Amazon, AOL, Netflix
- Many others + universities and research labs
Hadoop: How It Works
Hadoop Architecture

- Master node (single node)
- Many slave nodes
- Distributed file system (HDFS)
- Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)

- Centralized namenode
  - Maintains metadata info about files
- Many datanodes (1000s)
  - Store the actual data
  - Files are divided into blocks (64 MB each)
  - Each block is replicated N times (default N = 3)

[Figure: a file F divided into blocks 1-5, distributed and replicated across datanodes]
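As a rough illustration of what this layout looks like to a client, the following minimal sketch uses the HDFS Java API to ask the namenode for a file's metadata; the path /user/demo/fileF.txt is a hypothetical placeholder, and the printed values depend on the cluster's configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileInfo {
  public static void main(String[] args) throws Exception {
    // Cluster settings (e.g. the namenode address) come from the Hadoop
    // configuration files on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file: the namenode keeps its metadata, while the
    // datanodes store (and replicate) its blocks.
    Path file = new Path("/user/demo/fileF.txt");
    FileStatus status = fs.getFileStatus(file);

    System.out.println("Block size : " + status.getBlockSize());   // e.g. 64 MB
    System.out.println("Replication: " + status.getReplication()); // e.g. 3
    System.out.println("File length: " + status.getLen());
  }
}
```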
Main Properties of HDFS

- Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
- Replication: each data block is replicated many times (default is 3)
- Failure: failure is the norm rather than the exception
- Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
  - The namenode is constantly checking the datanodes
Map-Reduce Execution Engine (Example: Color Count)

[Figure: four Map tasks read input blocks from HDFS and produce (k, v) pairs, e.g. (color, 1); a parse/hash step followed by shuffle & sorting based on k routes the pairs; three Reduce tasks consume (k, [v]) lists, e.g. (color, [1,1,1,1,1,1,...]), and produce (k', v') results, e.g. (color, 100)]

Users only provide the "Map" and "Reduce" functions.
Properties of MapReduce Engine

- JobTracker is the master node (runs with the namenode)
  - Receives the user's job
  - Decides how many tasks will run (number of mappers)
  - Decides where to run each mapper (concept of locality)

[Figure: a file with 5 blocks spread over Node 1, Node 2, and Node 3 -> run 5 map tasks; the task reading block "1" should preferably run on Node 1 or Node 3, the nodes holding a replica of that block]
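A minimal job-submission sketch against the classic Hadoop 1.x API (the one that talks to the JobTracker). It uses the library's identity mapper and reducer so that only the submission mechanics matter; the paths and job name are hypothetical placeholders:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitJob.class);
    conf.setJobName("demo-job");

    // Identity map/reduce just copy records through; the point here is
    // only the submission path, not the user logic.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);  // keys from the default TextInputFormat
    conf.setOutputValueClass(Text.class);        // values from the default TextInputFormat

    // The JobTracker looks at the HDFS blocks under this input path, starts
    // one map task per block, and tries to place each task on a node that
    // already holds a replica of its block (locality).
    FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

    JobClient.runJob(conf);  // submit to the JobTracker and wait for completion
  }
}
```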
Properties of MapReduce Engine (Cont'd)

- TaskTracker is the slave node (runs on each datanode)
  - Receives tasks from the JobTracker
  - Runs each task until completion (either a map or a reduce task)
  - Always in communication with the JobTracker, reporting progress

[Figure: in this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks]
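Progress reporting is visible in user code through the Reporter handle that every map and reduce call receives. A hedged fragment (the counter group/name and the per-record work are illustrative only):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowRecordMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    // ... expensive per-record work would go here ...

    // Tell the local TaskTracker (and, through it, the JobTracker) that the
    // task is alive and making progress, so it is not declared failed.
    reporter.progress();
    reporter.setStatus("processed offset " + key.get());
    reporter.incrCounter("demo", "records", 1);

    out.collect(new Text("records"), new LongWritable(1));
  }
}
```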
Key-Value Pairs

- Mappers and Reducers are the users' code (the provided functions)
  - They just need to obey the key-value pair interface
- Mappers:
  - Consume <key, value> pairs
  - Produce <key, value> pairs
- Reducers:
  - Consume <key, <list of values>>
  - Produce <key, value>
- Shuffling and sorting:
  - Hidden phase between mappers and reducers
  - Groups all identical keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
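For reference, the key-value contract of the classic org.apache.hadoop.mapred interfaces looks roughly as follows. This is a simplified paraphrase (the real interfaces also inherit configure/close hooks), with K1/V1 the mapper input pair, K2/V2 the intermediate pair, and K3/V3 the reducer output pair:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Simplified paraphrase of the classic Mapper/Reducer contracts.
interface Mapper<K1, V1, K2, V2> {
  // Consumes one <K1, V1> pair; may emit any number of intermediate
  // <K2, V2> pairs through the output collector.
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      throws IOException;
}

interface Reducer<K2, V2, K3, V3> {
  // Consumes one key plus the list of all values grouped under it by the
  // shuffle/sort phase; typically emits one <K3, V3> result pair.
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output,
              Reporter reporter) throws IOException;
}
```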
MapReduce Phases

Deciding what will be the key and what will be the value is the developer's responsibility.
Example 1: Word Count

- Job: count the occurrences of each word in a data set

[Figure: word count data flow through map tasks and reduce tasks]
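A hedged sketch of the word-count map and reduce functions with the classic org.apache.hadoop.mapred API (class names are illustrative; the driver wiring is the same as in the submission sketch shown earlier):

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map: consume (byte offset, line of text), emit (word, 1) for every word.
  public static class WordMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  // Reduce: consume (word, [1, 1, 1, ...]), emit (word, total count).
  public static class WordReduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }
}
```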
Example 2: Color Count

- Job: count the number of each color in a data set

[Figure: same pipeline as the Color Count engine example above: four Map tasks produce (color, 1) pairs from HDFS input blocks, shuffle & sorting based on k groups them, and three Reduce tasks produce (color, count) results]

- The output is a single logical file with 3 parts (Part0001, Part0002, Part0003), probably stored on 3 different machines
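The number of output parts follows directly from the number of reduce tasks. A hedged driver sketch that asks for three reducers, and therefore three output parts; for brevity it reuses the word-count classes sketched above rather than a real color-count mapper, and the paths are placeholders:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ColorCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ColorCountDriver.class);
    conf.setJobName("color-count");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Reusing the word-count classes from the earlier sketch; a real
    // color-count mapper would emit (color, 1) instead of (word, 1).
    conf.setMapperClass(WordCount.WordMap.class);
    conf.setReducerClass(WordCount.WordReduce.class);

    // Three reduce tasks -> three output part files in the output
    // directory, likely written on three different machines.
    conf.setNumReduceTasks(3);

    FileInputFormat.setInputPaths(conf, new Path("/user/demo/colors-in"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/demo/colors-out"));
    JobClient.runJob(conf);
  }
}
```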
Example 3: Color Filter

- Job: select only the blue and the green colors

[Figure: four Map tasks read input blocks from HDFS, produce (k, v) pairs for the selected colors, and write their results directly to HDFS]

- Each map task selects only the blue or green colors
- No need for a reduce phase
- The output is a single logical file with 4 parts (Part0001-Part0004), probably stored on 4 different machines
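A map-only job is configured by setting the number of reduce tasks to zero, so each mapper writes its output straight to HDFS. A hedged sketch (the comma-separated record format, class names, and paths are assumptions for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ColorFilter {

  // Map: keep only records whose color field is blue or green; drop the rest.
  public static class FilterMap extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, Text> out, Reporter reporter)
        throws IOException {
      // Assume one record per line with the color as the first field.
      String color = value.toString().split(",")[0].trim().toLowerCase();
      if (color.equals("blue") || color.equals("green")) {
        out.collect(NullWritable.get(), value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ColorFilter.class);
    conf.setJobName("color-filter");

    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(FilterMap.class);

    // No reduce phase: each map task writes its own output part to HDFS,
    // so a job with 4 map tasks produces 4 output parts.
    conf.setNumReduceTasks(0);

    FileInputFormat.setInputPaths(conf, new Path("/user/demo/colors-in"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/demo/filtered-out"));
    JobClient.runJob(conf);
  }
}
```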
Bigger Picture: Hadoop vs. Other Systems

Distributed databases vs. Hadoop:

- Computing model:
  - Distributed databases: notion of transactions; a transaction is the unit of work; ACID properties, concurrency control
  - Hadoop: notion of jobs; a job is the unit of work; no concurrency control
- Data model:
  - Distributed databases: structured data with a known schema; read/write mode
  - Hadoop: any data will fit in any format, (un)(semi)structured; read-only mode
- Cost model:
  - Distributed databases: expensive servers
  - Hadoop: cheap commodity machines
- Fault tolerance:
  - Distributed databases: failures are rare; recovery mechanisms
  - Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance
- Key characteristics:
  - Distributed databases: efficiency, optimizations, fine-tuning
  - Hadoop: scalability, flexibility, fault tolerance

Cloud computing:
- A computing model where any computing infrastructure can run on the cloud
- Hardware & software are provided as remote services
- Elastic: grows and shrinks based on the user's demand
- Example: Amazon EC2
Thank You

Presented by Kelly Technologies (Hadoop training in Hyderabad)