Unit 4
 What is Hadoop
 Motivation for Hadoop
 Hadoop Distributed File System
 MapReduce
 Hadoop ecosystem
 Open source software framework designed
for storage and processing of large scale data
on clusters of commodity hardware
 Created by Doug Cutting and Mike Cafarella
in 2005.
 Cutting named the program after his son’s
toy elephant.
 Data-intensive text processing
 Assembly of large genomes
 Graph mining
 Machine learning and data mining
 Large scale social network analysis
•Hadoop Common: contains libraries and other modules
•HDFS: Hadoop Distributed File System
•Hadoop YARN: Yet Another Resource Negotiator
•Hadoop MapReduce: a programming model for large-scale
data processing
 What were the limitations of earlier large-
scale computing?
 What requirements should an alternative
approach have?
 How does Hadoop address those
requirements?
 Historically computation was processor-
bound
◦ Data volume has been relatively small
◦ Complicated computations are performed on that
data
 Advances in computer technology have
historically centered on improving the
power of a single machine
 Power consumption limits the speed increase
we get from transistor density
 Distributed systems allow
developers to use multiple
machines for a single
task
 Programming on a distributed system is
much more complex
◦ Synchronizing data exchanges
◦ Managing finite bandwidth
◦ Controlling computation timing is complicated
“You know you have a distributed system when
the crash of a computer you’ve never
heard of stops you from getting any work
done.” –Leslie Lamport
 Distributed systems must be designed with
the expectation of failure
 Typically divided into Data Nodes and
Compute Nodes
 At compute time, data is copied to the
Compute Nodes
 Fine for relatively small amounts of data
 Modern systems deal with far more data than
was gathered in the past
 Facebook
◦ 500 TB per day
 Yahoo
◦ Over 170 PB
 eBay
◦ Over 6 PB
 Getting the data to the processors becomes
the bottleneck
 If a component fails, it should be able to
recover without restarting the entire system
 Component failure or recovery during a job
must not affect the final output
 Increasing resources should increase load
capacity
 Increasing the load on the system should
result in a graceful decline in performance for
all jobs
◦ Not system failure
 Based on work done by Google in the early
2000s
◦ “The Google File System” in 2003
◦ “MapReduce: Simplified Data Processing on Large
Clusters” in 2004
 The core idea was to distribute the data as it
is initially stored
◦ Each node can then perform computation on the
data it stores without moving the data for the initial
processing
 Applications are written in a high-level
programming language
◦ No network programming or temporal dependency
 Nodes should communicate as little as
possible
◦ A “shared nothing” architecture
 Data is spread among the machines in
advance
◦ Perform computation where the data is already
stored as often as possible
 HDFS is a file system written in Java, based on
Google’s GFS
 Provides redundant storage for massive
amounts of data
 HDFS works best with a smaller number of
large files
◦ Millions as opposed to billions of files
◦ Typically 100MB or more per file
 Files in HDFS are write once
 Optimized for streaming reads of large files
and not random reads
 Files are split into blocks
 Blocks are distributed across many machines at load
time
◦ Different blocks from the same file will be stored on
different machines
 Blocks are replicated across multiple
machines
 The NameNode keeps track of which blocks
make up a file and where they are stored
 Default replication is 3-fold
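For concreteness, block size and replication factor are cluster-level settings. Below is a minimal sketch of how they might appear in hdfs-site.xml; dfs.replication and dfs.blocksize are standard HDFS configuration keys, but the values shown are only illustrative and are not taken from the slides.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- number of copies kept for each block -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>  <!-- block size in bytes (128 MB) -->
  </property>
</configuration>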
 When a client wants to retrieve data
◦ Communicates with the NameNode to determine
which blocks make up a file and on which data
nodes those blocks are stored
◦ Then communicates directly with the data nodes to
read the data
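As a rough illustration of this read path, the sketch below uses the HDFS Java client API (FileSystem, Path). It assumes fs.defaultFS in the cluster configuration points at the NameNode; the file path /data/sample.txt is purely hypothetical.

// Minimal sketch of an HDFS streaming read, assuming the Hadoop client
// libraries are on the classpath and the cluster config files are available.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf);
         // open() asks the NameNode for the file's block locations; the returned
         // stream then reads each block directly from a DataNode that stores it.
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/sample.txt")),
                                   StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}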
 A method for distributing computation across
multiple nodes
 Each node processes the data that is stored at
that node
 Consists of two main phases
◦ Map
◦ Reduce
 Automatic parallelization and distribution
 Fault-Tolerance
 Provides a clean abstraction for programmers
to use
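To make the two phases concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). It follows the standard example shipped with Hadoop, so the class and method names are the library's own; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on the node holding each input split and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper runs wherever its input split is stored and emits (word, 1) pairs; the framework shuffles all pairs with the same key to one reducer, which sums them. This is exactly the Map/Reduce division described above.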
 Node-based flat file storage
 Suitable for structured and unstructured data;
supports a variety of data formats, such as
XML, JSON, and text-based flat file formats,
in real time
 Designed for analytical, big data processing
 Big data processing that does not require
consistent relationships between data items
 In a Hadoop cluster, a node requires only a
processor, a network card, and a few hard
drives.
 Key aspects of Hadoop
 Hadoop components
 Hadoop conceptual layer
 High-level architecture of Hadoop
 Open source software
 Framework
 Distributed
 Massive storage
 Faster processing
 Hadoop core components
1. HDFS:
a) Storage component.
b) Distributes data across several nodes.
c) Natively redundant.
2. MapReduce:
a) Computational framework.
b) Splits a task across multiple nodes.
c) Processes data in parallel.
 Hadoop ecosystem: the Hadoop ecosystem
comprises projects that extend the
functionality of the core Hadoop components.
These ecosystem projects include:
1. Hive
2. Pig
3. Sqoop
4. HBase
5. Flume
6. Oozie
Editor's Notes

  • #13: Example of failure issues: the Linux lab uses a distributed file system; if the file server fails, what happens?
  • #22: Default replication is 3-fold