Big Data and Hadoop
Module 1: Introduction to Big Data and Hadoop
Slide 2
Session Objectives
This Session will help you to:
ᗍ Understand what Big Data is
ᗍ List the challenges associated with Big Data
ᗍ Understand the difference between Real-time and Batch Processing
ᗍ Understand Hadoop capabilities
ᗍ Understand Hadoop ecosystem
Slide 3
Definition of Big Data
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it is difficult to process using traditional database and software techniques.
In most enterprise scenarios the data is too big, moves too fast, or exceeds current processing capacity. Big data has the potential to help companies improve operations and make faster, more intelligent decisions.
Big Data is the term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process.
Slide 4
Walmart
ᗍ US$ 485.651 billion US retailer
ᗍ Handles more than a million transactions every day and produces more than 2.5 petabytes of data daily
ᗍ Has dedicated data centers across the world to handle the above data (has one in Bangalore as well)
Slide 5
Facebook
ᗍ It has about a billion users as we speak
ᗍ Generates close to 500 TB of data per day
ᗍ Fires 70 thousand queries on that data every day
ᗍ Inventor and one of the biggest users of Hive
Slide 6
Big Data Context with Case Studies
Cricket Telecast on Star Sports
ᗍ Keys to success for a team
ᗍ Batsman's strong and weak zones; run-scoring graph
ᗍ Bowler's speed, swing and wicket-taking delivery graph
Slide 7
What is Big Data?
ᗍ Huge Amount of Data (Terabytes or Petabytes)
ᗍ Big data is the term for a collection of data sets
so large and complex that it becomes difficult
to process using on-hand database
management tools or traditional data
processing applications
ᗍ The challenges include capture, curation,
storage, search, sharing, transfer, analysis, and
visualization
Slide 8
Types of Data
Three types of data can be identified:
ᗍ Unstructured Data
• Data which do not have a pre-defined data model
• E.g. Text files, log files
ᗍ Semi-structured Data
• Data which do not have a formal data model
• E.g. XML files
ᗍ Structured Data
• Data which is represented in a tabular format
• E.g. Databases
Slide 9
Characteristics of Big Data – 4 V’s
[Diagram: the V's of Big Data, with Volume ranging from MB and GB to TB and PB, and Variety illustrated by audio, photo, video and web data]
Slide 10
The V’s of Big Data
ᗍ Volume: 12 terabytes of Tweets created each day
ᗍ Velocity: Scrutinize 5 million trade events created each day to identify potential fraud
ᗍ Variety: Trade data, Sensor data, Audio, Video, Flight Tracking, R&D, Log files, Social media and more
ᗍ Veracity: The quality of the data being captured can vary greatly. Accuracy of analysis depends on the
veracity of the source data
Slide 11
Limitations of Big Data/Existing DWH
Solutions
ᗍ Two aspects: Storage of data and Analysis of data
ᗍ Limitation of existing IT infrastructure and resources
ᗍ Vertical Scalability is not always a solution: Upgrading server and
storage
ᗍ RDBMS is not designed to scale out
ᗍ Cannot handle unstructured data
ᗍ Cost of commercially available solutions is significantly high
Slide 12
Need for New Approach
ᗍ A new approach to the problem is required:
ᗍ Process all types of data: Structured, Semi-Structured and Unstructured
ᗍ Store and process massive amounts of data easily
ᗍ Cost of system; process and manage data economically
ᗍ Speed of processing
Slide 13
What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model
It is an open-source data management framework with scale-out storage and distributed processing
Slide 14
What is Hadoop? (Cont’d)
ᗍ Apache Hadoop is a framework that allows for distributed processing of large data sets stored across clusters of commodity computers using a simple programming model
ᗍ A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment
ᗍ Based on Google File System (GFS)
ᗍ Runs applications on distributed systems with thousands of nodes
Slide 15
Hadoop Key Features
ᗍ Simple architecture
ᗍ Scalability; Designed for Massive scale
ᗍ Availability; High degree of fault tolerance; Designed to recover from Failures; Robust
ᗍ Low Cost; Low software and hardware costs; Designed to run on commodity servers
ᗍ Speed of Operations; Distributed file system provides fast data transfers among nodes
ᗍ Parallel Programming Model; An easy to use programming paradigm that scales through 1000s of
nodes and petabytes of data
ᗍ Allows data to be analyzed without first being modeled, cleansed and loaded
Slide 16
Hadoop Key Characteristics
ᗍ Reliable
ᗍ Economical
ᗍ Flexible
ᗍ Scalable
Slide 17
Hadoop Core Components
ᗍ HDFS: Data Storage Framework
ᗍ MapReduce: Data Processing Framework
Slide 18
Hadoop Ecosystem
[Ecosystem diagram, from the storage layer upwards:]
ᗍ HDFS (Hadoop Distributed File System): distributed storage
ᗍ YARN: cluster resource management
ᗍ MapReduce framework, HBase, and other YARN frameworks (Spark, Giraph)
ᗍ Hive: DW system; Pig Latin: data analysis
ᗍ Apache Oozie: workflow
ᗍ Sqoop: import or export of structured data
ᗍ Flume: ingestion of unstructured or semi-structured data
Slide 19
Hadoop Services
The core services of Hadoop are:
ᗍ NameNode
ᗍ DataNode
ᗍ Resource Manager [Job Tracker in 1.0]
ᗍ Node Manager [TaskTracker in 1.0]
ᗍ Secondary NameNode
Slide 20
Different Hadoop Modes
You can use Hadoop in the following modes:
ᗍ Standalone (or Local) Mode
• No Hadoop daemons; the entire process runs in a single JVM
• Suitable for running Hadoop programs during initial installation and Hadoop software testing
• It doesn't have any DFS available
ᗍ Pseudo-Distributed Mode
• Hadoop daemons up, but on a single machine
• Best suited for development
ᗍ Fully-Distributed/Clustered/Prod Mode
• Hadoop daemons run on a cluster of machines
• Best suited for production environments
Slide 21
Hadoop Deployment Modes
ᗍ Standalone or Local Mode
• Everything runs on a single JVM
• Good for development
ᗍ Pseudo-Distributed Mode
• All services running on a single machine; a cluster simulation on one machine
• Good for a test environment
ᗍ Fully Distributed Mode
• Hadoop services running on multiple machines in a cluster
• Production environment
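In practice you can tell which mode an installation is running in by checking which Hadoop daemons are up. A minimal sketch, assuming the JDK's jps utility is on the PATH:

# Standalone mode: no Hadoop daemons appear.
# Pseudo-distributed mode: NameNode, DataNode, SecondaryNameNode,
#   ResourceManager and NodeManager all appear on the single machine.
# Fully distributed mode: the same daemons are spread across the cluster nodes.
jps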
Slide 22
Blocks
ᗍ It is the physical division of a data file done by HDFS while storing it
ᗍ Block size is 128 MB by default in Hadoop 2.0
ᗍ Example: a 256 MB file (File 1) is split into two blocks, A and B, of 128 MB each
Slide 23
Blocks (Cont’d)
ᗍ A 250 MB file (File 2) is split into block A of 128 MB and block B of 122 MB
ᗍ A 300 MB file (File 3) is split into blocks A and B of 128 MB each and block C of 44 MB
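To see how HDFS actually split a stored file into blocks, you can ask for an fsck report on it. A minimal sketch, assuming the employee.csv file used in the command examples later in this module has already been copied to /user/root:

# Hypothetical check: list the file, its blocks and where the replicas live.
hdfs fsck /user/root/employee.csv -files -blocks -locations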
Slide 24
Computer Racks & Block Replication
ᗍ Computer Racks
• A computer rack is a physical chassis that can house multiple computers or servers simultaneously. It is a mounting rack that has the ability to hold more than one computer
ᗍ Block Replication in HDFS
• Provides redundancy and fault tolerance for the data saved
• The default replication factor is 3
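The replication factor can also be changed per file or directory from the command line. A minimal sketch, assuming the /root/training directory created in the command examples later in this module:

# Hypothetical example: set replication to 3 for everything under /root/training
# and wait (-w) until the blocks actually reach that replication factor.
hadoop fs -setrep -w 3 /root/training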
Slide 25
HDFS Rack Awareness
HDFS stores blocks on the cluster in a rack-aware fashion, i.e. one replica of a block is kept on one rack and the other two replicas on another rack
[Diagram: Rack 1 (nodes 1-4), Rack 2 (nodes 5-8) and Rack 3 (nodes 9-12), showing the replicas of blocks A, B and C spread across the racks]
Slide 26
Hadoop Distributed File System (HDFS)
The key features of Hadoop HDFS are:
ᗍ Storing large sets of data files (in TB/ PB)
ᗍ Distributed across multiple machines
ᗍ Inbuilt Fault tolerance & Reliability; Data replication
Creating multiple replicas of each data block and distributing them on computers throughout the cluster
to enable reliable and rapid data access
ᗍ Providing high-throughput access to data blocks; Low Latency data access
ᗍ Write once read many concept
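A quick way to see this storage and replication picture on a running cluster is the dfsadmin report. A minimal sketch; it assumes you have permission to run HDFS admin commands:

# Prints configured capacity, remaining space and the status of each DataNode.
hdfs dfsadmin -report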
Slide 27
Hadoop Distributed File System (HDFS) (Cont'd)
Slide 28
File System Namespace
 Master/slave architecture
 An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
 There are a number of DataNodes, usually one per node in a cluster.
 The DataNodes manage the storage attached to the nodes that they run on.
 HDFS exposes a file system namespace and allows user data to be stored in files.
 A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
 DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode.
HDFS Read Anatomy
Hadoop Write Anatomy
Job Tracker and TaskTracker
The primary function of the job tracker is resource management
(managing the task trackers), tracking resource availability and task life
cycle management (tracking its progress, fault tolerance etc.)
The task tracker has a simple function of following the orders of the job
tracker and updating the job tracker with its progress status periodically
Rack Awareness
HDFS is rack aware in the sense that the NameNode and the JobTracker obtain a list of rack IDs corresponding to each of the slave nodes (DataNodes) and create a mapping between the IP address and the rack ID. HDFS uses this knowledge to replicate data across different racks so that data is not lost in the event of a complete rack power outage or switch failure.
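The rack mapping itself is usually supplied by the administrator as a topology script, referenced from core-site.xml through the net.topology.script.file.name property. A minimal sketch of such a script; the IP ranges and rack names are purely hypothetical:

#!/bin/bash
# Hadoop passes one or more DataNode IPs/hostnames as arguments and expects
# one rack id per argument on standard output.
while [ $# -gt 0 ]; do
  case "$1" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
  shift
done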
Slide 34
Mapper & Reducer – Basic Concepts
ᗍ Mappers:
Mappers are Java programs conforming to Google's MapReduce algorithm framework. These programs run on each of the blocks of the big data file saved on the cluster
ᗍ Reducers:
Similar to Mappers, Reducers are also Java programs conforming to Google's MapReduce algorithm framework. They are aggregate functions that run on the outputs coming from the mappers
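To see a mapper and reducer in action without writing any Java, you can run the word-count example that ships with Hadoop. A minimal sketch; the jar path and the words.txt input file are assumptions based on a typical Hadoop 2.x layout:

# Hypothetical run of the bundled word-count MapReduce job.
hadoop fs -mkdir -p /user/root/wc-input
hadoop fs -put /home/cloudera/Desktop/words.txt /user/root/wc-input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/root/wc-input /user/root/wc-output
# Each mapper emits (word, 1) pairs for its block; the reducers sum the counts.
hadoop fs -cat /user/root/wc-output/part-r-00000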
Slide 35
Hadoop Configuration Files
ᗍ hadoop-env.sh: Environment variables that are used in the scripts to run Hadoop
ᗍ core-site.xml: Core Hadoop configuration settings which are common to HDFS and MapReduce
ᗍ hdfs-site.xml: HDFS configuration settings for the HDFS daemons (the NameNode, the Secondary NameNode and the DataNodes)
ᗍ mapred-site.xml: MapReduce-specific configuration settings, such as the Job History Server
ᗍ yarn-site.xml: Configuration settings for the shuffle mechanism with respect to the YARN implementation
ᗍ masters: A list of machines (one per line) that each run a Secondary NameNode
ᗍ slaves: A list of machines (one per line) that each run the DataNode and NodeManager daemons
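Once these files are in place, you can confirm which values are actually in effect from the command line. A minimal sketch using the hdfs getconf utility; the property keys shown (fs.defaultFS, dfs.replication, dfs.blocksize) are standard Hadoop 2.x names:

# Print the effective values picked up from core-site.xml and hdfs-site.xml.
hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.blocksize
# List the configured NameNode host(s).
hdfs getconf -namenodes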
Slide 36
Hadoop Ecosystem
CREATE A DIRECTORY IN HDFS
 Usage: hadoop fs -mkdir <paths>
 Example: hadoop fs -mkdir /root
 Example: hadoop fs -mkdir /root/training
PUT Command
 Copies a single src file from the local file system to the Hadoop Distributed File System
 Usage: hadoop fs -put <local-src> ... <HDFS_dest_path>
 Example: hadoop fs -put /home/cloudera/Desktop/Employee.csv /root/training
Ls Command
 Lists the contents of a directory
 Usage: hadoop fs -ls <args>
 Example: hadoop fs -ls /root/
 Try yourself: hadoop fs -lsr /root/
 The -lsr option performs a recursive listing
Get Command
 Copies/downloads files from HDFS to the local file system
 Usage: hadoop fs -get <hdfs_src> <localdst>
 Example: hadoop fs -get /user/root/employee.csv /home/cloudera/desktop/employee.csv
 If the destination file already exists, the copy fails; to avoid this error give the file a new name or a different local path:
 hadoop fs -get /user/root/employee.csv /home/cloudera/desktop/employee123.csv
Cp Command
 Copies a file from one HDFS location to another
 Usage: hadoop fs -cp <source> <dest>
 Example: hadoop fs -cp /root/Employee.csv /root/training/Employee.csv
 Now try hadoop fs -lsr /root/ again
copyFromLocal
 Same purpose as the put command
 Usage: hadoop fs -copyFromLocal <localsrc> URI
 Example: hadoop fs -copyFromLocal /home/cloudera/Desktop/Student.csv /root/training
copyToLocal
 Same purpose as the get command
 Usage: hadoop fs -copyToLocal <hdfs_src> <localdst>
 Example: hadoop fs -copyToLocal /user/root/employee.csv /home/cloudera/desktop/employee321.csv
Tail Command
 Display last few lines of a file
 Usage: hadoop fs -tail <path[filename]>
 Example: hadoop fs -tail /user/root/employee.csv
Cat Command
 To display the complete file
 Usage: hadoop fs -cat <arg as file name>
 Example: hadoop fs -cat /user/root/employee.csv
Rm Command
 hadoop fs -rm: Removes the specified list of files and empty directories. An example is shown below:
 hadoop fs -rm /root/employee.csv
 Also try the -rm -r option and see the difference for /root/
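Putting the commands above together, a short end-to-end session might look like the following sketch; the local Employee.csv path matches the earlier examples, while Employee_copy.csv is just an illustrative name:

# Hypothetical session combining the HDFS shell commands covered above.
hadoop fs -mkdir -p /root/training
hadoop fs -put /home/cloudera/Desktop/Employee.csv /root/training
hadoop fs -ls /root/training
hadoop fs -cat /root/training/Employee.csv
hadoop fs -get /root/training/Employee.csv /home/cloudera/Desktop/Employee_copy.csv
hadoop fs -rm /root/training/Employee.csv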
Slide 47
Any Questions