Presentation on Big Data:
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Presented by: TAKRIM UL ISLAM LASKAR (120103006)
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM : Blue Cloud?
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
HADOOP DISTRIBUTED FILE SYSTEM(HDFS)
• Highly fault tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware
Commodity Hardware
Typically in a 2-level architecture
– Nodes are commodity PCs
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
Why DFS?
READ 1 TB DATA
1 MACHINE
• 4 I/O channels
• Each channel: 100 MB/s
• 45 minutes
10 MACHINES
• 4 I/O channels per machine
• Each channel: 100 MB/s
• 4.5 minutes
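The two read times follow from simple arithmetic. A minimal sketch, assuming 1 TB is taken as 2^40 bytes, 100 MB/s as 100 × 10^6 bytes per second, and that all channels on all machines read in parallel with no overhead (the function name and constants are illustrative, not from the slide):

```python
def read_time_minutes(data_bytes, machines, channels_per_machine, bytes_per_sec_per_channel):
    """Idealized scan time when every channel on every machine reads in parallel."""
    aggregate = machines * channels_per_machine * bytes_per_sec_per_channel
    return data_bytes / aggregate / 60


ONE_TB = 2 ** 40          # assumption: 1 TB taken as 2^40 bytes
CHANNEL = 100 * 10 ** 6   # assumption: 100 MB/s per channel

one = read_time_minutes(ONE_TB, 1, 4, CHANNEL)
ten = read_time_minutes(ONE_TB, 10, 4, CHANNEL)
print(f"1 machine: {one:.1f} min, 10 machines: {ten:.1f} min")
```

Under these assumptions the single machine needs roughly 45 minutes and ten machines roughly a tenth of that, which matches the slide.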
Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
HDFS Architecture
Functions of a NameNode
• Manages File System Namespace
– Maps a file name to a set of blocks
– Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
NameNode Metadata
• Meta-data in Memory
– The entire metadata is in main memory (RAM)
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
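The metadata listed above can be pictured as a few in-memory maps plus an append-only log. This is a toy sketch only: the class and method names (`ToyNameNode`, `create`, `add_block`, `locate`) are hypothetical and are not the Hadoop API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    replication: int
    created: float
    blocks: list = field(default_factory=list)   # ordered block IDs of the file


class ToyNameNode:
    """Hypothetical in-memory model of the metadata named on the slide."""

    def __init__(self):
        self.files = {}            # path -> FileMeta (list of files + attributes)
        self.block_locations = {}  # block ID -> set of DataNode names
        self.edit_log = []         # transaction log: namespace changes only

    def create(self, path, replication=3):
        self.files[path] = FileMeta(replication, time.time())
        self.edit_log.append(("create", path, replication))

    def add_block(self, path, block_id, datanodes):
        self.files[path].blocks.append(block_id)
        self.block_locations[block_id] = set(datanodes)
        self.edit_log.append(("add_block", path, block_id))

    def delete(self, path):
        for block_id in self.files.pop(path).blocks:
            self.block_locations.pop(block_id, None)
        self.edit_log.append(("delete", path))

    def locate(self, path):
        """Return [(block ID, sorted DataNode names), ...] for a file."""
        return [(b, sorted(self.block_locations[b])) for b in self.files[path].blocks]


nn = ToyNameNode()
nn.create("/foodir/myfile.txt")
nn.add_block("/foodir/myfile.txt", "blk_1", ["dn1", "dn2", "dn3"])
print(nn.locate("/foodir/myfile.txt"))
```

The sketch keeps block locations out of the log on purpose: they can be rebuilt from the block reports that DataNodes send.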
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
– One replica on the local node
– Second replica on a node in a remote rack
– Third replica on a different node in the same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replica.
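The placement strategy can be sketched as a small function. This is an illustration under assumptions, not the Hadoop implementation: the topology is a plain dict of rack name to node names, and `place_replicas` is a hypothetical name.

```python
import random


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick target nodes for one block.

    topology: dict mapping rack name -> list of node names.
    writer:   name of the node the client is writing from (must be in topology).
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    chosen = [writer]                                     # 1st replica: local node
    remote_racks = [r for r in topology if r != rack_of[writer] and topology[r]]
    if replication >= 2 and remote_racks:
        rack = rng.choice(remote_racks)
        count = min(2, len(topology[rack]), replication - 1)
        chosen.extend(rng.sample(topology[rack], count))  # 2nd and 3rd: same remote rack
    remaining = [n for n in rack_of if n not in chosen]
    rng.shuffle(remaining)
    chosen.extend(remaining[: max(0, replication - len(chosen))])  # extras: random
    return chosen[:replication]


topology = {"rack1": ["a", "b", "c"], "rack2": ["d", "e", "f"]}
print(place_replicas("a", topology, replication=3))
```

Putting two replicas in one remote rack keeps the block available if the writer's whole rack fails, while only one copy has to cross the rack uplink.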
Heartbeats
• DataNodes send heartbeat to the NameNode
– Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
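Heartbeat-based failure detection amounts to tracking when each DataNode was last heard from. In this sketch the 3-second interval comes from the slide, while the timeout of ten missed heartbeats and the `HeartbeatMonitor` class are hypothetical choices made for illustration.

```python
HEARTBEAT_INTERVAL = 3.0                # seconds, from the slide
DEAD_AFTER = 10 * HEARTBEAT_INTERVAL    # hypothetical timeout, not from the slide


class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}   # DataNode name -> time of last heartbeat

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """DataNodes whose last heartbeat is older than the timeout."""
        return sorted(dn for dn, t in self.last_seen.items() if now - t > DEAD_AFTER)


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=27.0)      # dn2 has gone silent
print(monitor.dead_nodes(now=31.0))
```

Time is passed in explicitly so the logic can be exercised without waiting on a real clock.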
Replication Engine
• NameNode detects DataNode failures
– Chooses new DataNodes for new replicas
– Balances disk usage
– Balances communication traffic to DataNodes
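A sketch of what the replication engine does after a failure, assuming a simplified model: each block maps to the set of nodes holding it, and the least-loaded live node is chosen as the new target. The function name and the "fewest blocks stored" balancing rule are assumptions for illustration.

```python
def rereplicate(block_locations, dead, live_usage, replication=3):
    """Schedule new copies for blocks that lost a replica on the dead node.

    block_locations: block ID -> set of node names (updated in place).
    live_usage:      live node name -> number of blocks stored (updated in place).
    Returns a list of (block ID, target node) copies to perform.
    """
    plan = []
    for block in sorted(block_locations):
        holders = block_locations[block]
        holders.discard(dead)
        while len(holders) < replication:
            candidates = [n for n in live_usage if n not in holders]
            if not candidates:
                break                                   # not enough live nodes
            # balance disk usage: prefer the node storing the fewest blocks
            target = min(candidates, key=lambda n: (live_usage[n], n))
            holders.add(target)
            live_usage[target] += 1
            plan.append((block, target))
    return plan


locations = {"b1": {"dn1", "dn2", "dn3"}, "b2": {"dn1", "dn3", "dn4"}}
usage = {"dn2": 1, "dn3": 2, "dn4": 1, "dn5": 0}
print(rereplicate(locations, "dn1", usage))
```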
HDFS: 2nd generation
NameNode:
• Active NameNode
• Passive NameNode
Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes the block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the pipeline
• When all replicas are written, the client moves on to write the next block in the file
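The pipeline steps above can be sketched with toy objects. This is a simplification: each node here forwards a whole block at once, and `ToyDataNode`, `write_file` and `allocate_pipeline` are hypothetical names, with `allocate_pipeline` standing in for the request to the NameNode.

```python
class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block ID -> bytes stored on this node

    def receive(self, block_id, data, downstream):
        """Store the block, forward it down the pipeline, return names of nodes that stored it."""
        self.blocks[block_id] = data
        if downstream:
            return [self.name] + downstream[0].receive(block_id, data, downstream[1:])
        return [self.name]


def write_file(data, block_size, allocate_pipeline):
    """allocate_pipeline(block_id) -> list of ToyDataNode for that block."""
    acks = []
    for offset in range(0, len(data), block_size):
        block_id = f"blk_{offset // block_size}"
        pipeline = allocate_pipeline(block_id)
        # the client writes only to the first node; the rest is forwarded
        acked = pipeline[0].receive(block_id, data[offset:offset + block_size], pipeline[1:])
        assert len(acked) == len(pipeline)   # all replicas written before the next block
        acks.append((block_id, acked))
    return acks


nodes = [ToyDataNode(n) for n in ("dn1", "dn2", "dn3")]
print(write_file(b"abcdefgh", 3, lambda block_id: nodes))
```

Because each node forwards to the next, the client sends every block over the network once, regardless of the replication factor.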
Secondary NameNode
• Copies FsImage and Transaction Log from the NameNode to a temporary directory
• Merges FsImage and Transaction Log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
– Transaction Log on NameNode is purged
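The checkpoint described above is a replay of the log on top of the last image. A minimal sketch, assuming the image is a dict of path to attributes and the log holds only `create` and `delete` records (both formats are invented for this example):

```python
def apply_edit(namespace, edit):
    """Apply one logged transaction to a namespace dict (path -> attributes)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"replication": edit[2]}
    elif op == "delete":
        namespace.pop(path, None)
    return namespace


def checkpoint(fsimage, edit_log):
    """Merge the image and the log into a new image; return it with an emptied log."""
    merged = dict(fsimage)          # work on a copy, as in a temporary directory
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []               # the log is purged once the new image is in place


image = {"/a": {"replication": 3}}
log = [("create", "/b", 2), ("delete", "/a")]
new_image, new_log = checkpoint(image, log)
print(new_image, new_log)
```

Checkpointing keeps the log short, so a restarting NameNode has less to replay.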
User Interface
• Commands for HDFS users:
– hadoop dfs -mkdir /foodir
– hadoop dfs -cat /foodir/myfile.txt
– hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS administrators:
– hadoop dfsadmin -report
– hadoop dfsadmin -decommission datanodename
• Web Interface
– http://host:port/dfshealth.jsp
Hadoop Map/Reduce
• The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in generic framework
• Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
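The correspondence between the shell pipeline and map | shuffle | reduce can be shown in a few lines of plain Python. This is a single-process sketch of the programming model, not Hadoop code; the function names and the log-line input are made up for the example.

```python
from collections import defaultdict


def map_phase(lines, pattern):
    """Like grep: emit (line, 1) for every line containing the pattern."""
    for line in lines:
        if pattern in line:
            yield line, 1


def shuffle(pairs):
    """Like sort: bring all values for the same key together, keys in order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Like uniq -c: collapse each group of values to a count."""
    return [(key, sum(values)) for key, values in groups]


log_lines = ["GET /a", "POST /b", "GET /a", "GET /c"]
print(reduce_phase(shuffle(map_phase(log_lines, "GET"))))
```

In the framework, the user supplies only the map and reduce functions; the shuffle and the distribution across nodes are handled generically.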
Useful Links
• HDFS Design:
– http://hadoop.apache.org/core/docs/current/hdfs_design.html
• Hadoop Information:
– http://hadoop.apache.org/
• Hadoop API:
– http://hadoop.apache.org/core/docs/current/api/
THANK YOU.
I appreciate your patience.
