HDFS Internals
Bhupesh Chawda
bhupesh@apache.org
DataTorrent
Image Source: https://help.marklogic.com/news/list/Index/10
Agenda
What are Blocks?
● A physical storage disk has a block size: the minimum amount of data it can
read or write. Normally 512 bytes.
● File systems for a single disk also deal with data in blocks, normally a few
kilobytes (4 KB).
● HDFS has a much larger block size. By default it is 64 MB (128 MB from
Hadoop 2.x onward).
● Files in HDFS are broken down into block-sized chunks, which are stored as
independent units.
● However, a file smaller than the block size does not occupy a full block's
worth of underlying storage.
○ Should I care?
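As a rough illustration of the chunking described above, the sketch below computes the block sizes a file would be split into. The function name `split_into_blocks` and the 64 MB default are assumptions made for this example; it does not call any HDFS API.

```python
from typing import List

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 64 * MB) -> List[int]:
    """Return the size in bytes of each block a file of file_size bytes maps to.

    Hypothetical helper for illustration only; it models the arithmetic,
    not the real HDFS client.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    for size in (1 * MB, 150 * MB):
        blocks = split_into_blocks(size)
        print(size // MB, "MB file ->", [b // MB for b in blocks], "MB blocks")
```

Under these assumptions a 150 MB file maps to two full 64 MB blocks plus one 22 MB block, and a 1 MB file maps to a single 1 MB block.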
Why such large blocks?
● Minimize the cost of disk seeks
● Assuming a seek time of 10 ms and a disk transfer rate of 100 MB/s, a
100 MB block takes about 1 s to transfer, so the seek time is about 1% of the
transfer time, which is small enough to ignore.
● Hence the default is 64 MB, while many production environments use 128
MB.
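The arithmetic behind the 1% figure can be checked with a small sketch. The function name `seek_overhead` is an assumption for this example; the default seek time and transfer rate are the values quoted on the slide.

```python
def seek_overhead(block_size_mb: float,
                  seek_ms: float = 10.0,
                  transfer_mb_per_s: float = 100.0) -> float:
    """Return seek time as a fraction of the time needed to transfer one block."""
    if block_size_mb <= 0 or transfer_mb_per_s <= 0:
        raise ValueError("block size and transfer rate must be positive")
    # Time to stream one whole block from disk, in milliseconds.
    transfer_ms = block_size_mb / transfer_mb_per_s * 1000.0
    return seek_ms / transfer_ms


if __name__ == "__main__":
    for size_mb in (4 / 1024, 64, 100, 128):
        print(f"{size_mb:>10.4f} MB block: seek/transfer = {seek_overhead(size_mb):.4f}")
```

With these numbers a 4 KB block spends far more time seeking than transferring, while blocks of 64 MB and above keep the seek cost to roughly 1 to 2% of the transfer time.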
HDFS Architecture
Image Source: https://hadoop.apache.org
Namenode and Datanode
● Master - Namenode
○ Manages file system namespace
○ File system tree and metadata for all files and directories
○ Stores this info in -
■ Namespace image
■ Edit log
○ Knows which datanodes hold the blocks of a given file. This mapping is not
persisted; it is reconstructed at startup from the datanodes' block reports
● Worker - Datanode
○ Store and retrieve blocks as requested by clients
○ Periodically report back to the namenode on the list of blocks they are storing
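The way block locations can be rebuilt from datanode reports is sketched below as a toy model. The names `build_block_map` and `locate_file`, the datanode names, and the block IDs are all hypothetical; the real namenode data structures and wire protocol are considerably more involved.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_block_map(block_reports: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Invert per-datanode block reports into a block -> datanodes mapping."""
    locations: Dict[str, Set[str]] = defaultdict(set)
    for datanode, blocks in block_reports.items():
        for block_id in blocks:
            locations[block_id].add(datanode)
    return dict(locations)


def locate_file(file_blocks: List[str],
                block_map: Dict[str, Set[str]]) -> List[Set[str]]:
    """For each block of a file, return the datanodes known to hold it."""
    # A block nobody has reported yet maps to an empty set.
    return [block_map.get(block_id, set()) for block_id in file_blocks]


if __name__ == "__main__":
    reports = {
        "dn1": ["blk_1", "blk_2"],
        "dn2": ["blk_1"],
        "dn3": ["blk_2"],
    }
    block_map = build_block_map(reports)
    print(locate_file(["blk_1", "blk_2"], block_map))
```

In this model the file-to-block list is the part that would live in the namespace image and edit log, while the block-to-datanode map exists only in memory and is refreshed by each report.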
HDFS Storage
Image Source: https://developer.yahoo.com/hadoop/tutorial/module2.html
Secondary Namenode
Image Source: http://www.quickmeme.com/meme/35ke38
Secondary Namenode
● Not a backup namenode
● Periodically merges the namespace image with the edit log, so that the edit
log does not grow too large
● Usually runs on a different machine from the namenode
● However, the secondary always lags behind the primary, so its merged copy
cannot be used as-is if the primary fails
● In the event of a primary failure, copy the primary's namespace image to the
secondary and run it as the new primary.
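The merge step can be pictured with a toy checkpoint routine. Everything here is an assumption for illustration: the `checkpoint` function, the dictionary-based image, and the three operation kinds are invented for this sketch and do not reflect the real on-disk formats.

```python
from typing import Dict, List, Tuple

Namespace = Dict[str, dict]
# Hypothetical edit operations:
#   ("mkdir", path) | ("create", path, replication) | ("delete", path)
EditOp = Tuple


def checkpoint(image: Namespace, edit_log: List[EditOp]) -> Tuple[Namespace, List[EditOp]]:
    """Replay the edit log on a copy of the image; return (new image, empty log)."""
    merged: Namespace = {path: dict(meta) for path, meta in image.items()}
    for op in edit_log:
        kind = op[0]
        if kind == "mkdir":
            merged[op[1]] = {"type": "dir"}
        elif kind == "create":
            merged[op[1]] = {"type": "file", "replication": op[2]}
        elif kind == "delete":
            merged.pop(op[1], None)
        else:
            raise ValueError(f"unknown edit operation: {kind!r}")
    # After a checkpoint the merged image replaces the old one and the log restarts.
    return merged, []


if __name__ == "__main__":
    image = {"/": {"type": "dir"}}
    edits = [("mkdir", "/data"), ("create", "/data/a.txt", 3),
             ("create", "/tmp.txt", 1), ("delete", "/tmp.txt")]
    new_image, new_log = checkpoint(image, edits)
    print(sorted(new_image), new_log)
```

Any edits the primary records after the log was handed over are missing from the merged image, which is one way to see why the secondary's copy lags behind.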
Writing a File in HDFS
Image Source: Hadoop: The Definitive Guide, 4th edition
Reading a File in HDFS
Image Source: Hadoop: The Definitive Guide, 4th edition
HDFS Block Placement
Image Source: Hadoop: The Definitive Guide, 4th edition
Small File Problem?
Each file occupies space in the namenode's namespace irrespective of its size!
Image Source: http://www.bodhtree.com/blog/2012/09/28/hadoop-how-to-manage-huge-numbers-of-small-files-in-hdfs/
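A back-of-the-envelope sketch shows why many small files cost more namenode memory than a few large ones holding the same data. The figure of about 150 bytes per namespace object is a commonly quoted rule of thumb and is not taken from these slides; treat it, the function name `namenode_memory_bytes`, and the decision to ignore directories as assumptions.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Assumed rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files: int,
                          file_size: int,
                          block_size: int = 64 * MB,
                          bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Estimate namenode memory for num_files files of file_size bytes each.

    Counts one object per file plus one per block; directories are ignored.
    """
    if num_files < 0 or file_size < 0 or block_size <= 0:
        raise ValueError("invalid arguments")
    blocks_per_file = -(-file_size // block_size)  # ceiling division
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


if __name__ == "__main__":
    one_large = namenode_memory_bytes(1, 1 * GB)
    many_small = namenode_memory_bytes(1024, 1 * MB)
    print("1 x 1 GB file   :", one_large, "bytes")
    print("1024 x 1 MB files:", many_small, "bytes")
```

Under these assumptions the same 1 GB of data needs roughly 120 times more namenode memory when stored as 1 MB files than as a single file.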
Further Reading
HDFS Comics :-)
https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1
Sample:
Thank You!!
Please send your questions to:
bhupesh@apache.org
