Hadoop


    Rahul Jain
    Software Engineer
    http://www.linkedin.com/in/rahuldausa



Agenda
• Hadoop
   –   Introduction
   –   Hadoop (Why)
   –   Hadoop History
   –   Uses of Hadoop
   –   High Level Architecture
   –   Map-Reduce
• HDFS
   – GFS (Google File System)
   – HDFS Architecture
• Installation/Configuration
• Examples

Introduction
 An open source software framework.
 Supports data-intensive distributed applications.
 Enables applications to work with thousands of computationally independent computers and petabytes of data.
 Derived from Google’s Map-Reduce and Google File System papers.
 Written in the Java programming language.
 Started by Doug Cutting, who named it after his son’s toy elephant, to support distribution for Nutch (a sub-project of Lucene).
Hadoop (Why)
• Need to process huge datasets on a large number of computers.
• It is expensive to build reliability into each application.
• Nodes fail every day
    - Failure is expected, rather than exceptional.
• Need common infrastructure
    - Efficient, reliable, easy to use.
    - Open sourced, Apache License.
Hadoop History
•   Dec 2004 – Google publishes the Map-Reduce paper (following the 2003 GFS paper)
•   July 2005 – Nutch uses Map-Reduce
•   Jan 2006 – Doug Cutting joins Yahoo!
•   Feb 2006 – Becomes a Lucene subproject
•   Apr 2007 – Yahoo! runs it on a 1000-node cluster
•   Jan 2008 – Becomes an Apache top-level project
•   Feb 2008 – Yahoo!’s production search index is generated with Hadoop
What is Hadoop Used for?
•   Searching (Yahoo!)
•   Log processing
•   Recommendation systems (Facebook, LinkedIn, eBay, Amazon)
•   Analytics (Facebook, LinkedIn)
•   Video and image analysis (NASA)
•   Data retention
Hadoop High Level Architecture

(Figure: high-level architecture diagram)
Map-Reduce
A framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically distributed sites and use more heterogeneous hardware).

It consists of two steps (see the sketch after this slide):
1. Map step – The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.
2. Reduce step – The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

(Figure: multiple Map-Reduce phases)
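To make the two steps concrete, below is the classic word-count job written against the Hadoop MapReduce Java API – a minimal sketch rather than a production program; the class names and the use of the reducer as a combiner are illustrative choices, not taken from the slides. The map step emits a (word, 1) pair for every word; the reduce step sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every word in this worker's input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the per-word counts collected from all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // Job.getInstance(conf, ...) in later releases
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it would be launched with something like bin/hadoop jar wordcount.jar WordCount <input dir> <output dir> (the jar name is hypothetical).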
Map-Reduce Life-Cycle




Credit : http://code.google.com/edu/parallel/mapreduce-tutorial.html
HDFS
Hadoop Distributed File System




Let’s Understand GFS First…
Google File System
GFS Architecture

(Figure: GFS architecture diagram)
Goals of HDFS
1. Very large distributed file system
      - 10K nodes, 100 million files, 10 PB
2. Assumes commodity hardware
      - Files are replicated to handle hardware failure
      - Detects failures and recovers from them
3. Optimized for batch processing
      - Data locations are exposed so that computation can move to where the data resides (see the sketch after this list).
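As a concrete illustration of “data locations exposed”, the FileSystem Java API lets a client ask the NameNode which hosts hold each block of a file – this is what lets the MapReduce scheduler place map tasks near their data. A minimal sketch (the class name is illustrative; the file path comes from the command line):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    // Ask which DataNodes hold each block of the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", hosts " + Arrays.toString(block.getHosts()));
    }
  }
}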
HDFS Architecture

(Figure: HDFS architecture diagram)
Installation/Configuration

[rjain@10.1.110.12 hadoop-1.0.3]$ pwd
/home/rjain/hadoop-1.0.3

[rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/rjain/rahul/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/rjain/rahul/hdfs/name</value>
  </property>
</configuration>

[rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

[rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Start all daemons, or the DFS and MapReduce daemons separately:

[rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-all.sh
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-dfs.sh
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-mapred.sh

Verify the running daemons with jps:

[rjain@10.1.110.12 hadoop-1.0.3]$ jps
29756 SecondaryNameNode
29483 NameNode
29619 DataNode
19711 JobTracker
19847 TaskTracker
18756 Jps

File system shell commands:

[rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs
Usage: java FsShell
  [-ls <path>]
  [-lsr <path>] : Recursive version of ls, similar to Unix ls -R.
  [-du <path>] : Displays the aggregate length of files contained in the directory, or the length of a file.
  [-dus <path>] : Displays a summary of file lengths.
  [-count[-q] <path>]
  [-mv <src> <dst>]
  [-cp <src> <dst>]
  [-rm [-skipTrash] <path>]
  [-rmr [-skipTrash] <path>] : Recursive version of delete (rm).
  [-expunge] : Empties the trash.
  [-put <localsrc> ... <dst>] : Copies a single src, or multiple srcs, from the local file system to the destination filesystem.
  [-copyFromLocal <localsrc> ... <dst>]
  [-moveFromLocal <localsrc> ... <dst>]
  [-get [-ignoreCrc] [-crc] <src> <localdst>]
  [-getmerge <src> <localdst> [addnl]]
  [-cat <src>]
  [-text <src>] : Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
  [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
  [-moveToLocal [-crc] <src> <localdst>]
  [-mkdir <path>]
  [-setrep [-R] [-w] <rep> <path/file>] : Changes the replication factor of a file.
  [-touchz <path>] : Creates a file of zero length.
  [-test -[ezd] <path>] : -e checks whether the file exists; -z checks whether the file is zero length; -d checks whether the path is a directory. Each returns 0 if true.
  [-stat [format] <path>] : Returns stat information on the path, such as the creation time of a directory.
  [-tail [-f] <file>] : Displays the last kilobyte of the file to stdout.
  [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
  [-chown [-R] [OWNER][:[GROUP]] PATH...]
  [-chgrp [-R] GROUP PATH...]
  [-help [cmd]]
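As a quick sanity check that HDFS is up, a first session with the shell typically looks something like this (the directory and file names are illustrative, not from the slides):

[rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs -mkdir /user/rjain/input
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs -put conf/core-site.xml /user/rjain/input
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs -ls /user/rjain/input
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs -cat /user/rjain/input/core-site.xml
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs -setrep -w 1 /user/rjain/input/core-site.xml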
HDFS – Read/Write Example

The snippets below sketch a small HDFS copy utility using the FileSystem Java API. They assume the imports org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FSDataInputStream and org.apache.hadoop.fs.FSDataOutputStream, plus a small printAndExit(String) helper that prints the message and exits.

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

Given input and output file names as strings, we construct inFile/outFile Path objects. Most of the FileSystem APIs accept Path objects.

Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);

Validate the input/output paths before reading/writing.

if (!fs.exists(inFile))
  printAndExit("Input file not found");
if (!fs.isFile(inFile))
  printAndExit("Input should be a file");
if (fs.exists(outFile))
  printAndExit("Output already exists");

Open inFile for reading.

FSDataInputStream in = fs.open(inFile);

Open outFile for writing.

FSDataOutputStream out = fs.create(outFile);

Read from the input stream and write to the output stream until EOF.

byte[] buffer = new byte[4096]; // buffer and bytesRead were left implicit on the original slide
int bytesRead;
while ((bytesRead = in.read(buffer)) > 0) {
  out.write(buffer, 0, bytesRead);
}

Close the streams when done.

in.close();
out.close();
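These snippets would be compiled against the Hadoop jars and launched through bin/hadoop jar, like the word-count example earlier. One point worth noting: FileSystem.get(conf) resolves the filesystem named by fs.default.name in core-site.xml (hdfs://localhost:9000 above), so the same code runs against HDFS or the local filesystem depending purely on configuration.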
Hadoop Sub-Projects
•   Hadoop Common: The common utilities that support the other Hadoop subprojects.
•   Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput
    access to application data.
•   Hadoop MapReduce: A software framework for distributed processing of large data sets on
    compute clusters.

Other Hadoop-related projects at Apache include:

•   Avro™: A data serialization system.
•   Cassandra™: A scalable multi-master database with no single points of failure.
•   Chukwa™: A data collection system for managing large distributed systems.
•   HBase™: A scalable, distributed database that supports structured data storage for large tables.
•   Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
•   Mahout™: A scalable machine learning and data mining library.
•   Pig™: A high-level data-flow language and execution framework for parallel computation.
•   ZooKeeper™: A high-performance coordination service for distributed applications.




Questions?
