Introduction to Apache Hadoop
Agenda
•   Need for a new processing platform (Big Data)
•   Origin of Hadoop
•   What Hadoop is and what it is not
•   Hadoop architecture
•   Hadoop components (Common/HDFS/MapReduce)
•   Hadoop ecosystem
•   When should we go for Hadoop?
•   Real world use cases
•   Questions
Need for a new processing platform (Big Data)
• What is Big Data?
     - Twitter (over 7 TB/day)
     - Facebook (over 10 TB/day)
     - Google (over 20 PB/day)
• Where does it come from?
• Why go to so much trouble?
     - Information is everywhere, but where is the
       knowledge?
• Existing systems (vertical scalability: a bigger machine)
• Why Hadoop? (horizontal scalability: more machines)
Origin of Hadoop
• Seminal Google whitepapers on the Google File
  System (2003) and on MapReduce (2004), a new
  programming paradigm to handle data at
  internet scale
• Hadoop started as a part of the Nutch project.
• In Jan 2006 Doug Cutting started working on
  Hadoop at Yahoo
• Factored out of Nutch in Feb 2006
• First release of Apache Hadoop (0.1.0) in April
  2006
• Jan 2008 - Hadoop became a top level Apache
  project
Hadoop distributions
•   Amazon
•   Cloudera
•   MapR
•   Hortonworks
•   Microsoft Windows Azure
•   IBM InfoSphere BigInsights
•   Datameer
•   EMC Greenplum HD Hadoop distribution
•   Hadapt
What is Hadoop?

• Flexible infrastructure for large-scale
  computation & data processing on a network
  of commodity hardware
• Written almost entirely in Java
• Open source & distributed under the Apache
  license
• Core components: Hadoop Common, HDFS & MapReduce
What Hadoop is not
• A replacement for existing data warehouse
  systems

• An online transaction processing (OLTP)
  system

• A database
Hadoop architecture
• High-level view: NameNode (NN), DataNode (DN),
  JobTracker (JT), TaskTracker (TT)
HDFS
•   Hadoop Distributed File System
•   Default storage for the Hadoop cluster
•   NameNode (metadata) / DataNode (block storage)
•   A file system namespace similar to that of a
    local file system
•   Master/slave architecture (1 master, 'n' slaves)
•   Virtual, not physical: it is layered on top of
    the nodes' local file systems
•   Provides configurable replication (the user
    can set it per file)
•   Files are split into blocks (64 MB by default in
    Hadoop 1.x, but configurable) that are spread
    across the DataNodes
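
To make the block (chunk) size and replication points above concrete, the sketch below estimates how many blocks a file is split into and how much raw cluster storage its replicas use. The 64 MB block size is the figure from the slide; the replication factor of 3 is assumed here as the commonly used HDFS default, and the helper name `hdfs_footprint` and the 1 GB example file are made up for illustration, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide (configurable)
REPLICATION = 3      # assumed: the commonly used HDFS default (configurable)


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, raw storage in MB) for one file.

    Hypothetical helper for illustration only.
    The last block of a file only occupies as much space as the data it
    holds, so raw storage is file size times replication, not
    blocks times block size.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb


blocks, raw_mb = hdfs_footprint(1024)   # a 1 GB file
print(blocks, raw_mb)                   # 16 blocks, 3072 MB of raw storage
```

Under these assumptions a 1 GB file becomes 16 blocks, and each block is stored three times on different DataNodes, so losing one node does not lose data.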
HDFS architecture
Data replication in HDFS
Rack awareness
MapReduce
• Framework provided by Hadoop to process
  large amounts of data in parallel across a
  cluster of machines
• A job typically comprises three classes:
   Mapper class
   Reducer class
   Driver class
• JobTracker (master) / TaskTracker (workers)
• The reduce function runs only after all map
  tasks are done
• Takes (k, v) pairs as input and emits (k, v)
  pairs as output
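
The mapper/reducer/driver split and the (k, v) flow described above can be sketched without a cluster. The example below is a single-process Python simulation of a word count, not Hadoop's Java API: the function names `mapper`, `reducer` and `run_job` and the two sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: collapse all values for one key into a single (k, v) pair."""
    return word, sum(counts)


def run_job(lines):
    """Driver: wire the phases together (map -> shuffle -> reduce)."""
    # Map phase: each line is processed independently, which is what
    # lets a real cluster spread this work across many machines.
    mapped = [pair for line in lines for pair in mapper(line)]

    # Shuffle phase: group values by key. Reducing starts only after
    # all mapper output has been collected.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


result = run_job(["big data needs hadoop", "hadoop stores big data"])
print(result)
```

On a real cluster the framework performs the shuffle step and schedules the map and reduce tasks on the worker nodes; only the mapper, reducer and driver are written by the user.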
MapReduce structure
MapReduce job flow
Modes of operation

• Standalone mode

• Pseudo-distributed mode

• Fully-distributed mode
Hadoop ecosystem
When should we go for Hadoop?

•   Data is too large for a single machine
•   Processing tasks are independent of one another
•   Online analytical processing (OLAP)
•   Better scalability is needed
•   The work can be parallelized
•   Data is unstructured
Real world use cases
• Clickstream analysis

• Sentiment analysis

• Recommendation engines

• Ad targeting

• Search quality
Questions?