Agenda
   Need for a new processing platform (Big Data)
   Origin of Hadoop
   What Hadoop is and what it is not
   Hadoop architecture
   Hadoop components (Common/HDFS/MapReduce)
   Hadoop ecosystem
   When should we go for Hadoop?
   Real world use cases
   Questions
Need for a new processing platform (Big Data)
   What is Big Data?
       - Twitter (over 7 TB/day)
       - Facebook (over 10 TB/day)
       - Google (over 20 PB/day)
   Where does it come from?
   Why take so much pain?
        - Information is everywhere, but where is the knowledge?
   Existing systems (vertical scalability)
   Why Hadoop (horizontal scalability)?
Origin of Hadoop
   Seminal white papers by Google in 2004 on a new programming paradigm to handle data at internet scale
   Hadoop started as a part of the Nutch project
   In Jan 2006 Doug Cutting started working on Hadoop at Yahoo
   Factored out of Nutch in Feb 2006
   First release of Apache Hadoop in September 2007
   Jan 2008: Hadoop became a top-level Apache project
Hadoop distributions
   Amazon
   Cloudera
   MapR
   Hortonworks
   Microsoft Windows Azure
   IBM InfoSphere BigInsights
   Datameer
   EMC Greenplum HD Hadoop distribution
   Hadapt
What is Hadoop?
   Flexible infrastructure for large-scale computation and data processing on a network of commodity hardware
   Written in Java
   Open source, distributed under the Apache license
   Core components: Hadoop Common, HDFS and MapReduce
What Hadoop is not
   A replacement for existing data warehouse systems
   A file system
   An online transaction processing (OLTP) system
   A replacement for all programming logic
   A database
Hadoop architecture
   High-level view: NameNode (NN), DataNode (DN), JobTracker (JT), TaskTracker (TT)
HDFS (Hadoop Distributed File System)
   Default storage for the Hadoop cluster
   NameNode/DataNode
   File system namespace (similar to a local file system)
   Master/slave architecture (1 master, n slaves)
   Virtual, not physical
   Provides configurable replication (user-specified)
   Data is stored as chunks, i.e. blocks (64 MB by default, but configurable), spread across the nodes
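The chunking and replication points above come down to simple arithmetic. The sketch below is plain Java, not the HDFS API: the class and method names are made up for illustration, and the replication factor of 3 is an assumption for the example (the slide only says replication is configurable).

```java
// Illustrative sketch only: plain Java arithmetic, not the HDFS API.
public class HdfsBlockMath {
    // 64 MB default block size quoted on the slide (configurable per cluster).
    static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024;

    // Number of blocks needed to hold a file (the last block may be partial).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Total number of block replicas stored across the cluster for one file.
    static long replicaCount(long fileSizeBytes, long blockSizeBytes, int replication) {
        return blockCount(fileSizeBytes, blockSizeBytes) * replication;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        int replication = 3; // assumed replication factor for this example
        System.out.println("blocks   = " + blockCount(oneGb, DEFAULT_BLOCK_SIZE));   // blocks   = 16
        System.out.println("replicas = " + replicaCount(oneGb, DEFAULT_BLOCK_SIZE, replication)); // replicas = 48
    }
}
```

Under these assumptions a 1 GB file is split into 16 blocks, and the cluster holds 48 block replicas for it in total.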
HDFS architecture
Data replication in HDFS
Rack awareness
Large Hadoop clusters are typically arranged in racks, and network
traffic between nodes within the same rack is preferable to network
traffic across racks. In addition, the NameNode tries to place replicas
of a block on multiple racks for improved fault tolerance. A default
installation assumes that all nodes belong to the same rack.
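One way to picture rack-aware placement is the simplified model below. It is a hypothetical sketch, not Hadoop's placement code: the class, the node and rack names, and the fixed count of three replicas are all assumptions. It follows the commonly described rule of putting the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that second rack, falling back to any free node when the cluster has only one rack.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of rack-aware replica placement; not Hadoop code.
public class RackAwarePlacement {

    // Picks up to three distinct nodes: the writer's node, a node on another rack,
    // then a second node on that remote rack. Falls back to any unused node when
    // the preferred rack has none (for example, in a single-rack cluster).
    static List<String> chooseTargets(Map<String, String> rackOfNode, String writerNode) {
        List<String> targets = new ArrayList<>();
        targets.add(writerNode);
        String localRack = rackOfNode.get(writerNode);

        // Second replica: prefer a node on a different rack than the writer.
        String second = pick(rackOfNode, targets, localRack, false);
        if (second == null) {
            second = pickAny(rackOfNode, targets);
        }
        if (second == null) {
            return targets; // only one node in the cluster
        }
        targets.add(second);

        // Third replica: prefer another node on the second replica's rack.
        String third = pick(rackOfNode, targets, rackOfNode.get(second), true);
        if (third == null) {
            third = pickAny(rackOfNode, targets);
        }
        if (third != null) {
            targets.add(third);
        }
        return targets;
    }

    // First unused node whose rack matches (or does not match) the given rack.
    static String pick(Map<String, String> rackOfNode, List<String> used,
                       String rack, boolean wantSameRack) {
        for (Map.Entry<String, String> e : rackOfNode.entrySet()) {
            boolean sameRack = e.getValue().equals(rack);
            if (!used.contains(e.getKey()) && sameRack == wantSameRack) {
                return e.getKey();
            }
        }
        return null;
    }

    // First unused node, regardless of rack.
    static String pickAny(Map<String, String> rackOfNode, List<String> used) {
        for (String node : rackOfNode.keySet()) {
            if (!used.contains(node)) {
                return node;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("n1", "/rack1");
        cluster.put("n2", "/rack1");
        cluster.put("n3", "/rack2");
        cluster.put("n4", "/rack2");
        System.out.println(chooseTargets(cluster, "n1")); // [n1, n3, n4]
    }
}
```

With every node mapped to the same rack, as in a default installation, the same method simply returns three distinct nodes, so the replicas survive a node failure but not a rack failure.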
MapReduce
   Framework provided by Hadoop to process large amounts of data in parallel across a cluster of machines
   A job comprises three classes:
    Mapper class
    Reducer class
    Driver class
   JobTracker/TaskTracker
   The reduce phase starts only after the map phase is done
   Takes (k,v) pairs as input and emits (k,v) pairs
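The flow described above can be imitated in memory with plain Java. This is only a sketch of the idea using word count as the example: it uses no Hadoop classes, runs in a single process rather than across a cluster, and the class and method names are invented for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map -> shuffle -> reduce flow; no Hadoop classes involved.
public class MiniMapReduce {

    // Map phase: one input line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Integer>(token, 1));
            }
        }
        return out;
    }

    // Shuffle: group all emitted values by key (kept sorted by key in a TreeMap).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: one key and all of its values in, one summed count out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: runs every map call to completion before any reduce call starts.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not", "to be"))); // {be=2, not=1, or=1, to=2}
    }
}
```

The three roles map loosely onto the three classes named on the slide: `map` plays the Mapper, `reduce` the Reducer, and `run` the Driver that wires them together.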
Introduction to apache hadoop
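import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper, nested inside the job's outer class: emits (word, 1) per token.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}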
MapReduce job flow
Modes of operation
   Standalone mode
   Pseudo-distributed mode
   Fully distributed mode
Hadoop ecosystem
When should we go for Hadoop?
   Data is very large
   Processes are independent
   Online analytical processing (OLAP)
   Better scalability
   Parallelism
   Unstructured data
Real world use cases
   Clickstream analysis
   Sentiment analysis
   Recommendation engines
   Ad targeting
   Search quality
   What I have been doing…
     Seismic Data Management & Processing
     WITSML Server & Drilling Analytics
     Orchestra Permission Map management for Search
     SDIS (just started)
   Next steps: Get your hands dirty with code in a workshop on …
     Hadoop configuration
     HDFS data loading
     MapReduce programming
     HBase
     Hive & Pig
QUESTIONS ?
