A SOFT INTRODUCTION
Contents 
Hadoop Introduction 
Ecosystem 
Architecture 
HDFS 
Map-Reduce 
Characteristics 
Flavors
A large ecosystem
Zookeeper (Coordination)
Oozie (Workflow)
Sqoop (Data Exchange)
HBase (Columnar Store)
Hive (SQL Query)
Pig (Scripting)
Mahout (Data Mining)
MapReduce (Distributed Processing Framework)
HDFS (Hadoop Distributed File System)
Flume (Data Exchange)
Avro (Data Serialization System)
Storm, Kafka, Tajo, Scala, Impala (Real-Time Processing)
Who uses Hadoop?
Hadoop 
An open source project from the Apache Software Foundation 
It provides a software framework for distributing and running 
applications on clusters of servers 
It is inspired by Google's Map-Reduce programming model as well as its 
file system (GFS) 
Hadoop was originally written for the Nutch search engine project
Hadoop 
Hadoop is an open source framework written in Java
It efficiently processes large volumes of data on a cluster of commodity
hardware
Hadoop can be set up on a single machine, but its real power comes with
a cluster of machines; it can be scaled from a single machine to
thousands of nodes on the fly
Hadoop consists of two key parts: the Hadoop Distributed File System
(HDFS) and Map-Reduce
Architecture 
[Figure: Hadoop cluster showing a user, a master, a secondary master, and multiple slaves]
HDFS 
HDFS is a highly fault-tolerant, distributed, reliable, and scalable file
system for data storage
HDFS stores multiple copies of data on different nodes; a file is split up
into blocks (default 64 MB in Hadoop 1.x, 128 MB from Hadoop 2.x onward)
and stored across multiple machines
A Hadoop cluster typically has a single NameNode and a number of
DataNodes that together form the HDFS cluster
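The block splitting and replication described above can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the function names, the node names, and the round-robin placement are hypothetical, and the replication factor of 3 is assumed as the usual HDFS default. Real HDFS placement is rack-aware rather than round-robin.

```python
BLOCK_SIZE_MB = 64   # default block size stated in the slide
REPLICATION = 3      # assumption: the usual HDFS default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file is split into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes (round-robin sketch)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# A 200 MB file becomes three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200)
# Hypothetical datanode names; every block ends up on three different nodes.
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Because each block lives on several nodes, losing a single datanode still leaves other copies of every block it held.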
Map-Reduce 
Map-Reduce is a programming model designed for processing large 
volumes of data in parallel by dividing the work into a set of 
independent tasks 
Map-Reduce is a paradigm for distributed processing of large data sets
over a cluster of nodes
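To make the model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using word count as the task. It does not use the Hadoop API; the function names are hypothetical, and in a real cluster the map and reduce calls would run as independent tasks on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is what lets the work be divided into independent tasks.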
Hadoop Daemons 
NameNode (Master) (HDFS)
SecondaryNameNode (3rd system / Slave) 
JobTracker (Master) (MR) 
DataNode (Slave) (HDFS) 
TaskTracker (Slave) (MR)
Hadoop Daemons Architecture
Computation layer (Map-Reduce): JobTracker, TaskTracker, TaskTracker, TaskTracker
MapReduce jobs are submitted to the JobTracker
Storage layer (HDFS): NameNode, DataNode, DataNode, DataNode
The NameNode stores metadata
[Figure: Hadoop cluster with a master and slaves. Map-Reduce layer: job tracker and task trackers. Storage layer: name node and data nodes.]
Characteristics 
Open-source 
◦ Code can be modified according to business requirements 
Distributed Processing 
◦ Data is processed in parallel on a cluster of nodes in a distributed manner
Fault Tolerance
◦ Failures of nodes or tasks are recovered automatically by the framework
Reliability
◦ Data is reliably stored on the cluster of machines despite machine failures
High Availability
◦ Data is highly available and accessible despite hardware failure
Scalability
◦ Vertical scalability: new hardware can be added to existing nodes
◦ Horizontal scalability: new nodes can be added on the fly
Economic
◦ Runs on a cluster of commodity hardware
Easy to use
◦ Clients do not need to deal with distributed computing; the framework
takes care of it
Data Locality
◦ Move computation to the data instead of data to the computation
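The data locality idea can be illustrated with a small scheduling sketch. It is not Hadoop's scheduler: `pick_node`, the block IDs, and the node names are hypothetical, and the only rule shown is "prefer a free node that already holds the block".

```python
def pick_node(block_id, block_locations, free_nodes):
    """Prefer a free node that already stores the block; otherwise use any free node.

    Returns (node, is_local), where is_local says whether the data is already there.
    """
    local = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local:
        return local[0], True
    if not free_nodes:
        raise RuntimeError("no free node available")
    # Fallback: the task runs remotely and the block must be read over the network.
    return sorted(free_nodes)[0], False


# Hypothetical cluster state: which nodes hold each block, and which are idle.
block_locations = {"b0": ["dn1", "dn2"], "b1": ["dn3"]}
free = {"dn2", "dn4"}

node_b0 = pick_node("b0", block_locations, free)  # dn2 holds b0 and is free
node_b1 = pick_node("b1", block_locations, free)  # dn3 is busy, so the task runs remotely
```

Running the task where the block already lives avoids copying the block across the network, which is the point of moving computation to the data.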
Flavors 
Apache 
Cloudera 
MapR 
IBM 
Pivotal 
Connectors 
◦ Almost all databases provide a connector to Hadoop for fast data transfer
QUESTIONS ??

