Big data- HDFS(2nd presentation)

Presentation on Big Data:
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Presented by –
TAKRIM UL ISLAM LASKAR(120103006)

Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM : Blue Cloud?
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
• Highly fault tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware

Commodity Hardware
Typically a two-level architecture
– Nodes are commodity PCs
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit

Why DFS?
READ 1 TB DATA
1 MACHINE
• 4 I/O channels
• Each channel: 100 MB/s
• Read time: about 45 minutes
10 MACHINES
• 4 I/O channels per machine
• Each channel: 100 MB/s
• Read time: about 4.5 minutes
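
The read times above follow from simple arithmetic. The sketch below reproduces them under two assumptions that the slide does not state: 1 TB is taken as 2**40 bytes, and 100 MB/s as 100 * 2**20 bytes per second. The function name and parameters are illustrative only.

```python
# Back-of-the-envelope read time for 1 TB, using the slide's figures.
# Assumptions: 1 TB = 2**40 bytes, 100 MB/s = 100 * 2**20 bytes/s,
# and all channels on all machines read in parallel at full speed.

def read_minutes(data_bytes, machines, channels_per_machine=4, channel_mb_per_s=100):
    """Minutes needed to read data_bytes with every channel working in parallel."""
    bytes_per_s = machines * channels_per_machine * channel_mb_per_s * 2**20
    return data_bytes / bytes_per_s / 60

ONE_TB = 2**40

one_machine = read_minutes(ONE_TB, machines=1)
ten_machines = read_minutes(ONE_TB, machines=10)
print(f"1 machine:   {one_machine:.1f} min")
print(f"10 machines: {ten_machines:.1f} min")
```

With these assumptions the results land slightly under the slide's rounded 45 and 4.5 minutes; the point is the tenfold speed-up from spreading the data over ten machines.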

Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth

Functions of a NameNode
• Manages File System Namespace
– Maps a file name to a set of blocks
– Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
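
The two namespace mappings can be pictured as two lookup tables. This is a toy sketch, not the NameNode's real data structures; the file path, block IDs and DataNode names are hypothetical.

```python
# Toy model of the two NameNode lookups; all names below are hypothetical.
file_to_blocks = {"/foodir/myfile.txt": ["blk_1", "blk_2"]}
block_to_datanodes = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}

def locate(path):
    """Return [(block_id, [datanodes])] for a file, in block order."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

locations = locate("/foodir/myfile.txt")
print(locations)
```

A client would use an answer like this to contact the DataNodes directly; the file data itself never passes through the NameNode.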

NameNode Metadata
• Meta-data in Memory
– The entire metadata is in main memory (RAM)
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.

DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
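
Block reports are how the NameNode learns which DataNode holds which block. The sketch below shows one simplified way to combine such reports into a block-to-DataNodes map; the report format (a dictionary of DataNode name to block IDs) is an assumption made for illustration.

```python
from collections import defaultdict

def apply_block_reports(reports):
    """Build {block_id: set of DataNodes} from {datanode: [block_ids]} reports."""
    block_map = defaultdict(set)
    for datanode, blocks in reports.items():
        for blk in blocks:
            block_map[blk].add(datanode)
    return dict(block_map)

# Hypothetical reports from two DataNodes.
reports = {"dn1": ["blk_1"], "dn2": ["blk_1", "blk_2"]}
block_map = apply_block_reports(reports)
print(block_map)
```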

Block Placement
• Current Strategy
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from the nearest replica
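
The placement strategy above can be sketched in a few lines. This is a simplified illustration, not HDFS's actual placement code: the rack layout is hypothetical, and the function assumes at least two racks, at least two nodes on the chosen remote rack, and enough nodes for the requested replication factor.

```python
import random

def place_replicas(local_node, racks, replication=3, rng=random):
    """Pick DataNodes for one block.

    racks maps a rack name to its list of node names (hypothetical layout).
    """
    local_rack = next(r for r, nodes in racks.items() if local_node in nodes)
    chosen = [local_node]                               # first: local node
    if replication >= 2:
        remote_rack = rng.choice([r for r in racks if r != local_rack])
        chosen.append(rng.choice(racks[remote_rack]))   # second: remote rack
        if replication >= 3:
            same_rack = [n for n in racks[remote_rack] if n not in chosen]
            chosen.append(rng.choice(same_rack))        # third: same remote rack
    all_nodes = [n for nodes in racks.values() for n in nodes]
    while len(chosen) < replication:                    # extras: random unused nodes
        chosen.append(rng.choice([n for n in all_nodes if n not in chosen]))
    return chosen

racks = {"rack1": ["a", "b"], "rack2": ["c", "d", "e"]}
replicas = place_replicas("a", racks, replication=4, rng=random.Random(0))
print(replicas)
```

Putting two replicas on one remote rack keeps a copy safe if the local rack fails, while limiting cross-rack traffic during the write.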

Heartbeats
• DataNodes send heartbeats to the NameNode
– Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
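
Failure detection from heartbeats amounts to a timeout check. In the sketch below the 3-second interval comes from the slide, while the number of missed heartbeats that marks a node as dead is an assumption chosen for illustration; a real cluster uses its own configured timeout.

```python
HEARTBEAT_INTERVAL_S = 3   # from the slide
DEAD_AFTER_MISSED = 10     # assumption for illustration only

def dead_datanodes(last_heartbeat, now):
    """Return DataNodes whose last heartbeat is older than the timeout.

    last_heartbeat maps a DataNode name to the time (in seconds) it last reported.
    """
    timeout = HEARTBEAT_INTERVAL_S * DEAD_AFTER_MISSED
    return sorted(dn for dn, seen in last_heartbeat.items() if now - seen > timeout)

# Hypothetical timestamps: dn2 has been silent for 41 seconds.
last_heartbeat = {"dn1": 100.0, "dn2": 60.0}
dead = dead_datanodes(last_heartbeat, now=101.0)
print(dead)
```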

Replication Engine
• NameNode detects DataNode failures
– Chooses new DataNodes for new replicas
– Balances disk usage
– Balances communication traffic to DataNodes
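
One way to picture the replication engine: after a failure, find the blocks that have too few live replicas and pick new targets, preferring DataNodes with the least disk used. The sketch is a simplification under stated assumptions; the data structures, names and the least-used-first rule are illustrative, and traffic balancing is not modelled.

```python
def plan_rereplication(block_map, live_nodes, used_bytes, replication=3):
    """Return {block_id: [new target DataNodes]} for under-replicated blocks."""
    plan = {}
    for blk, holders in block_map.items():
        alive = [dn for dn in holders if dn in live_nodes]
        missing = replication - len(alive)
        if missing <= 0:
            continue
        # Prefer the emptiest live nodes that do not already hold the block.
        candidates = sorted(
            (dn for dn in live_nodes if dn not in alive),
            key=lambda dn: (used_bytes[dn], dn),
        )
        plan[blk] = candidates[:missing]
    return plan

# Hypothetical cluster state after dn1 has failed.
block_map = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}
live_nodes = {"dn2", "dn3", "dn4", "dn5"}
used_bytes = {"dn2": 50, "dn3": 40, "dn4": 10, "dn5": 20}
plan = plan_rereplication(block_map, live_nodes, used_bytes)
print(plan)
```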

HDFS: 2nd generation
NameNode:
• Active NameNode
• Passive (standby) NameNode

Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes the block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the pipeline
• When all replicas are written, the client moves on to write the next block in the file
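
The pipeline steps above can be sketched with a small in-memory model. This is a deliberately simplified illustration: the DataNode class and write_file helper are hypothetical, and whole blocks are forwarded at once, with no packets, acknowledgements or failure handling.

```python
class DataNode:
    """In-memory stand-in for a DataNode (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        """Store the block, then forward it to the next DataNode in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data, downstream[1:])

def write_file(blocks, pipeline_for):
    """Write each block to the first DataNode of its pipeline, one block at a time."""
    for block_id, data in blocks:
        pipeline = pipeline_for(block_id)   # list of DataNodes for this block
        pipeline[0].receive(block_id, data, pipeline[1:])

dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
write_file([("blk_1", b"aaa"), ("blk_2", b"bbb")], lambda block_id: [dn1, dn2, dn3])
print({dn.name: sorted(dn.blocks) for dn in (dn1, dn2, dn3)})
```

The client sends each block only once; the DataNodes carry the cost of creating the remaining replicas.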

Secondary NameNode
• Copies FsImage and Transaction Log from the NameNode to a temporary directory
• Merges FsImage and Transaction Log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
– Transaction Log on the NameNode is purged
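
The checkpoint steps above can be sketched as replaying a log on top of a snapshot. The dictionary layout, the operation names and the default attribute below are assumptions made for illustration; they are not the real FsImage or edit-log formats.

```python
import copy

def merge(fsimage, log):
    """Replay logged operations on top of an FsImage and return the result."""
    image = dict(fsimage)
    for op, path in log:
        if op == "create":
            image[path] = {"replication": 3}   # hypothetical default attribute
        elif op == "delete":
            image.pop(path, None)
    return image

def checkpoint(namenode):
    # 1. Copy FsImage and transaction log to a temporary location.
    tmp_image = copy.deepcopy(namenode["fsimage"])
    tmp_log = list(namenode["log"])
    # 2. Merge them into a new FsImage.
    new_image = merge(tmp_image, tmp_log)
    # 3. Upload the new FsImage and purge the log entries it now covers.
    namenode["fsimage"] = new_image
    namenode["log"] = namenode["log"][len(tmp_log):]

namenode = {
    "fsimage": {"/a": {"replication": 3}},
    "log": [("create", "/b"), ("delete", "/a")],
}
checkpoint(namenode)
print(namenode)
```

Checkpointing keeps the transaction log short, so a restarting NameNode has less to replay.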

User Interface
• Commands for HDFS users:
– hadoop dfs -mkdir /foodir
– hadoop dfs -cat /foodir/myfile.txt
– hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS administrators:
– hadoop dfsadmin -report
– hadoop dfsadmin -decommission datanodename
• Web Interface
– http://host:port/dfshealth.jsp

References :
1. YouTube lecture video on the channel ‘Training on Big Data and Hadoop’ by user ‘Edureka’.
2. ‘White Book of Big Data’ by Fujitsu.
3. ‘Big Data For Dummies’, published by Wiley.
4. Research paper by Kalapriya Kannan, IBM Research Labs.
5. Data collected from slideshare.com.

Useful Links
• HDFS Design:
– http://hadoop.apache.org/core/docs/current/hdfs_design.html
• Hadoop Information:
– http://hadoop.apache.org/
• Hadoop API:
– http://hadoop.apache.org/core/docs/current/api/

-THANK YOU.
I appreciate your patience.
