Presentation on Big Data:
Hadoop Distributed File System (HDFS)
Presented by:
Takrim Ul Islam Laskar (120103006)
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM : Blue Cloud?
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Hadoop Distributed File System (HDFS)
• Highly fault tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware
Commodity Hardware
Typically a two-level architecture
– Nodes are commodity PCs
– Uplink from each rack is 3-4 Gbit/s
– Rack-internal links are 1 Gbit/s
Why DFS?
Reading 1 TB of data:
• 1 machine
– 4 I/O channels
– Each channel: 100 MB/s
– Read time: about 45 minutes
• 10 machines
– 4 I/O channels per machine
– Each channel: 100 MB/s
– Read time: about 4.5 minutes
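The read times on this slide can be checked with a short calculation. This is only a sketch: it assumes "100mb/s" means 100 megabytes per second, that 1 TB is 10**12 bytes, and that every channel reads in parallel with no overhead.

```python
# Back-of-the-envelope read time for 1 TB, using the figures from the slide.
# Assumptions: 100 MB/s per channel, 1 TB = 10**12 bytes, perfect parallelism.

TERABYTE = 10**12             # bytes
CHANNEL_RATE = 100 * 10**6    # bytes per second per I/O channel
CHANNELS_PER_MACHINE = 4


def read_minutes(total_bytes: int, machines: int) -> float:
    """Minutes needed when all channels on all machines read in parallel."""
    aggregate_rate = machines * CHANNELS_PER_MACHINE * CHANNEL_RATE
    return total_bytes / aggregate_rate / 60


one = read_minutes(TERABYTE, machines=1)
ten = read_minutes(TERABYTE, machines=10)
print(f"1 machine:   {one:.1f} minutes")
print(f"10 machines: {ten:.1f} minutes")
```

Under these assumptions the result is roughly 42 minutes on one machine and a tenth of that on ten machines, which the slide rounds to 45 and 4.5 minutes.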
Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to where
data resides
– Provides very high aggregate bandwidth
HDFS Architecture
Functions of a NameNode
• Manages File System Namespace
– Maps a file name to a set of blocks
– Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
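The two namespace mappings can be pictured as a pair of dictionaries. The sketch below is a toy model only: the class name, method names, block IDs and DataNode names are hypothetical and are not part of the Hadoop API.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Toy model of the two NameNode mappings (hypothetical, not the Hadoop API)."""
    file_to_blocks: dict[str, list[str]] = field(default_factory=dict)
    block_to_datanodes: dict[str, set[str]] = field(default_factory=dict)

    def add_block(self, path: str, block_id: str, datanodes: set[str]) -> None:
        # File name -> ordered list of blocks
        self.file_to_blocks.setdefault(path, []).append(block_id)
        # Block -> DataNodes that hold a replica
        self.block_to_datanodes[block_id] = set(datanodes)

    def locate(self, path: str) -> list[tuple[str, set[str]]]:
        """Return (block, DataNodes) pairs in file order."""
        return [(b, self.block_to_datanodes[b]) for b in self.file_to_blocks[path]]


ns = Namespace()
ns.add_block("/foodir/myfile.txt", "blk_1", {"dn1", "dn2", "dn3"})
ns.add_block("/foodir/myfile.txt", "blk_2", {"dn2", "dn3", "dn4"})
locations = ns.locate("/foodir/myfile.txt")
```

A client that wants to read the file would walk `locations` in order and fetch each block from one of the listed DataNodes.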
NameNode Metadata
• Meta-data in Memory
– The entire metadata is in main memory (RAM)
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replica.
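The placement strategy above can be sketched as follows. This is an illustration under stated assumptions, not the HDFS implementation: `place_replicas` and the `topology` dictionary are hypothetical, the writer is assumed to run on a DataNode, and the chosen remote rack is assumed to hold at least two nodes.

```python
import random


def place_replicas(writer, topology, replication=3, rng=None):
    """Pick DataNodes for one block following the strategy on the slide.

    topology maps rack name -> list of node names (a hypothetical structure).
    """
    rng = rng or random.Random()
    node_rack = {n: r for r, nodes in topology.items() for n in nodes}

    chosen = [writer]                                   # 1st replica: local node
    remote_racks = [r for r in topology if r != node_rack[writer]]
    remote = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote], 2)     # 2nd and 3rd: same remote rack
    chosen += [second, third]

    # Additional replicas: random nodes that are not used yet
    remaining = [n for n in node_rack if n not in chosen]
    rng.shuffle(remaining)
    chosen += remaining
    return chosen[:replication]


topology = {"rack1": ["a", "b"], "rack2": ["c", "d"], "rack3": ["e", "f"]}
replicas = place_replicas("a", topology, replication=4, rng=random.Random(0))
```

Keeping two replicas on one remote rack means a block survives the loss of either rack while only one copy has to cross the rack uplink twice.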
Heartbeats
• DataNodes send heartbeats to the NameNode
– Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
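Heartbeat-based failure detection amounts to remembering when each DataNode was last heard from. In the sketch below, the 3-second interval comes from the slide; the 30-second timeout, the class name and the node names are illustrative assumptions, not HDFS defaults.

```python
HEARTBEAT_INTERVAL = 3.0   # seconds, from the slide
DEAD_AFTER = 30.0          # seconds; illustrative assumption, not an HDFS default


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each DataNode (hypothetical helper)."""

    def __init__(self, dead_after=DEAD_AFTER):
        self.dead_after = dead_after
        self.last_seen = {}

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """DataNodes whose last heartbeat is older than the timeout."""
        return sorted(dn for dn, t in self.last_seen.items()
                      if now - t > self.dead_after)


monitor = HeartbeatMonitor()
for tick in range(14):                      # simulated clock: 0, 3, 6, ... 39 seconds
    now = tick * HEARTBEAT_INTERVAL
    monitor.heartbeat("dn1", now)           # dn1 keeps reporting
    if now <= 3:
        monitor.heartbeat("dn2", now)       # dn2 goes silent after t = 3
dead = monitor.dead_nodes(now=40.0)
```

Here `dn2` is reported as failed because nothing has been heard from it for more than the timeout, while `dn1` is still considered alive.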
Replication Engine
• NameNode detects DataNode failures
– Chooses new DataNodes for new replicas
– Balances disk usage
– Balances communication traffic to DataNodes
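A simplified view of re-replication after a DataNode failure: for every block that lost a replica, pick the least-loaded live node that does not already hold it. The function name, the data layout and the "one unit of disk per replica" accounting are assumptions made for illustration; traffic balancing is not modelled.

```python
def plan_re_replication(block_locations, disk_used, failed, target=3):
    """Return {block: [new DataNodes]} after the DataNode `failed` is lost.

    block_locations: block id -> set of DataNodes holding a replica
    disk_used:       DataNode -> units of disk in use (illustrative measure)
    """
    used = dict(disk_used)                  # work on a copy
    live = [dn for dn in used if dn != failed]
    plan = {}
    for block, holders in sorted(block_locations.items()):
        survivors = holders - {failed}
        missing = max(target - len(survivors), 0)
        # Prefer the emptiest live nodes that do not hold this block yet
        candidates = sorted((dn for dn in live if dn not in survivors),
                            key=lambda dn: (used[dn], dn))
        picks = candidates[:missing]
        for dn in picks:
            used[dn] += 1                   # account for the new replica
        if picks:
            plan[block] = picks
    return plan


block_locations = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn3", "dn4"}}
disk_used = {"dn1": 5, "dn2": 2, "dn3": 4, "dn4": 1, "dn5": 0}
plan = plan_re_replication(block_locations, disk_used, failed="dn1")
```

In this example both blocks are re-replicated to `dn5`, the emptiest node, because it still has the lowest usage after receiving the first new replica.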
HDFS: 2nd generation
NameNode:
• Active NameNode
• Passive (standby) NameNode
Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a
block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the
Pipeline
• When all replicas are written, the Client moves on to write the next
block in file
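The write pipeline can be sketched with a few toy objects. For simplicity this example forwards whole blocks in one call and treats the return value as the acknowledgement; the class, the `choose_pipeline` callback (standing in for the NameNode's answer) and all names are hypothetical.

```python
class DataNode:
    """Toy DataNode that stores blocks in memory (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        """Store the block, then forward it to the next DataNode in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            return downstream[0].receive(block_id, data, downstream[1:])
        return True   # last node: the acknowledgement travels back up the chain


def write_file(blocks, choose_pipeline):
    """Write blocks one at a time, moving on only after all replicas are written."""
    for block_id, data in blocks:
        pipeline = choose_pipeline(block_id)        # list of DataNodes for this block
        acked = pipeline[0].receive(block_id, data, pipeline[1:])
        if not acked:
            raise IOError(f"pipeline failed for {block_id}")


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
write_file([("blk_1", b"hello"), ("blk_2", b"world")], lambda _block: nodes)
```

The client only ever sends data to the first DataNode; the remaining replicas are created by the DataNodes forwarding to each other.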
Secondary NameNode
• Copies FsImage and Transaction Log from the NameNode to a temporary
directory
• Merges FsImage and Transaction Log into a new FsImage in the
temporary directory
• Uploads the new FsImage to the NameNode
– Transaction Log on NameNode is purged
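The checkpoint steps above can be illustrated with plain dictionaries and lists. This is a sketch only: the `namenode` dictionary, the operation tuples and both function names are hypothetical stand-ins for the FsImage and Transaction Log.

```python
def apply_log(fsimage, log):
    """Replay logged operations onto a copy of the namespace image."""
    image = dict(fsimage)
    for op, path, attrs in log:
        if op == "create":
            image[path] = attrs
        elif op == "delete":
            image.pop(path, None)
    return image


def checkpoint(namenode):
    """Merge the image and the log, then upload the result and purge the log."""
    # 1. Copy FsImage and Transaction Log to a temporary location
    tmp_image = dict(namenode["fsimage"])
    tmp_log = list(namenode["edits"])
    # 2. Merge them into a new FsImage
    new_image = apply_log(tmp_image, tmp_log)
    # 3. Upload the new FsImage and drop the log entries it already contains
    namenode["fsimage"] = new_image
    namenode["edits"] = namenode["edits"][len(tmp_log):]


namenode = {
    "fsimage": {"/a": {"replication": 3}},
    "edits": [("create", "/b", {"replication": 2}), ("delete", "/a", None)],
}
checkpoint(namenode)
```

After the checkpoint the image reflects every logged operation, so the log can start again from empty and a restart has less to replay.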
User Interface
• Commands for HDFS users:
– hadoop dfs -mkdir /foodir
– hadoop dfs -cat /foodir/myfile.txt
– hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS administrators:
– hadoop dfsadmin -report
– hadoop dfsadmin -decommission datanodename
• Web Interface
– http://host:port/dfshealth.jsp
Hadoop Map/Reduce
• The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in generic framework
• Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
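The correspondence between the shell pipeline and map | shuffle | reduce can be shown in a few lines running in a single process. The function names and the sample log lines are made up for illustration; a real Hadoop job distributes these phases across the cluster.

```python
from collections import defaultdict


def map_phase(lines, pattern):
    """Emit (line, 1) for each matching line: the 'grep' step."""
    for line in lines:
        if pattern in line:
            yield line, 1


def shuffle(pairs):
    """Group values by key and order the keys: the 'sort' step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Sum the values of each key: the 'uniq -c' step."""
    return [(key, sum(values)) for key, values in groups]


lines = ["GET /a", "POST /b", "GET /a", "GET /c"]
counts = reduce_phase(shuffle(map_phase(lines, "GET")))
print(counts)
```

Only the map and reduce functions are specific to the problem; the grouping in between is generic, which is the part the framework provides.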
Useful Links
• HDFS Design:
– http://hadoop.apache.org/core/docs/current/hdfs_design.html
• Hadoop Information:
– http://hadoop.apache.org/
• Hadoop API:
– http://hadoop.apache.org/core/docs/current/api/
Thank you.
I appreciate your patience.
