SlideShare a Scribd company logo
Big data and HDFS
Section 2
Agenda
2
 DFS
 what are the Problems with big data?
 Hadoop is the solution! What is Hadoop?
 HDFS
 YARN
DFS
3
• A distributed file system (DFS) is a file system that enables clients to access
file storage from multiple hosts through a computer network as if the user
was accessing local storage.
• Files are spread across multiple storage servers and in multiple locations,
which enables users to share data and storage resources.
Long-term information
storage
Access result of a process
later
Store large amounts of
information
Why to store Data in general?
Enable access of multiple
processes
File
Big data and Hadoop Section..............
Buy a bigger disk?
?
Copy data to an external hard
drive?
?
OR
PERSONAL
WORK
Big data and Hadoop Section..............
Rack
Distributed File System (DFS)
Rack
Cluster Node
Rack
1 2 3 4 5
Data
1
2
3 4
5
Block
Rack
1 2 3 4 5
Data
1
2
3 4
5
Analyze part 5 here!
Rack
1 2 3 4 5
Data
1
3
5
5
1
1 2
2
1
2
4
5
3
3
3
2
4
5
3
1
4
4
2 5 4
Rack
1 2 3 4 5
Data
1
3
5
5
1
1 2
2
1
2
4
5
3
3
3
2
4
5
3
1
4
4
2 5 4
Rack
1 2 3 4 5
Data
1
2
3 4
5
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
4
4
4
3 4
5: Reader 1
5: Reader 3
5: Reader 2
High Concurrency
vs.
Low Consistency
Rack
Distributed File System
(DFS)
Rack
Data scalability
Fault tolerance
High concurrency
Data replication
Data partitioning
When many storage computers (racks or nodes) are
connected throw the network we call it a DFS
Big data problems
• Storing huge and
exponentially
growing datasets.
• Processing data
having
complex
structure.
• the bottleneck of
bringing huge
amount data to
computation unit.
1
7
Hadoop is the
solution!
What is Hadoop?
Hadoop Ecosystem
19
• It is a services or frameworks (Open-source projects) which
solves big data problems.
• You can consider it as a suite which encompasses a number of
services (ingesting, storing, analyzing and maintaining) inside
it.
• Ex: To analyze the transaction data from a RDBMS, we need to ingest it into the Hadoop
Distributed File System (HDFS).
20
Big data and Hadoop Section..............
Big data and Hadoop Section..............
Big data and Hadoop Section..............
Big data and Hadoop Section..............
Big data and Hadoop Section..............
Big data and Hadoop Section..............
Big data and Hadoop Section..............
HDFS
28
• Hadoop Distributed File System is the core component or you can say, the backbone of
Hadoop Ecosystem.
• HDFS is the one, which makes it possible to store different types of large data sets (i.e.
structured, unstructured and semi structured data).
• It helps us in storing our data across various nodes and maintaining the log file about the
stored data (metadata).
• HDFS has two core components, i.e. NameNode and DataNode.
HDFS
29
1. The NameNode is the main node and it doesn’t store the actual data. It
contains metadata, just like a log file or you can say as a table of content.
Therefore, it requires less storage and high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence it
requires more storage resources. These DataNodes are commodity
hardware (like your laptops and desktops) in the distributed
environment. That’s the reason, why Hadoop solutions are very cost
effective.
3. You always communicate to the NameNode while writing the data. Then,
it internally sends a request to the client to store and replicate data on
various DataNodes.
30
31
For more information on how it works visit : https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/
Big data and Hadoop Section..............
YAR
N
33
• Yet Another Resource Negotiator (YARN)
• Consider YARN as the brain of your Hadoop Ecosystem. It performs all
your processing activities by allocating resources and scheduling tasks.
• It has two major components, i.e. ResourceManager and NodeManager.
• Recourse Manager controls all the recourses and decide who gets what.
• The NodeManager in YARN is responsible for launching the containers with
specified resource constraints.
• NodeManagers are installed on every DataNode.
YAR
N
Resourc
e
Manager
Node Manager
Scheduler App Manager Container
= machine
App Master
2/23/24 PRESENTATION TITLE 34
It performs the
scheduling of jobs
based on policies and
priorities defined by
the administrator.
It monitors the
health of the App
Master in the
cluster and
manages failover in
case of failures.
Manage the execution of
tasks within these
containers, monitor their
progress, and handle any
task failures or
reassignments.
35
Big data and Hadoop Section..............
Hive: A data warehousing and
SQL-like query language tool that
allows analysts to interact with
data stored in HDFS using familiar
SQL syntax.
Pig: A high-level scripting platform
for processing and analyzing large
datasets. It simplifies complex
data transformations using a
simple scripting language.
Big data and Hadoop Section..............
Spark: Although not
originally part of Hadoop,
Apache Spark is often
integrated into the
ecosystem. It offers an in-
memory processing engine
for faster data processing,
machine learning, and graph
processing.
Zookeeper in Hadoop can be
considered a centralized
repository where distributed
applications can put data into
and retrieve data from. It
makes a distributed system
work together as a whole
using its synchronization,
serialization, and
coordination goals.
Big data and Hadoop Section..............
Big data and Hadoop Section..............
Big data and Hadoop Section..............
Big data and Hadoop Section..............
Big data and Hadoop Section..............
Advantages of Hadoop for Big Data
46
• Speed. Hadoop’s concurrent processing, MapReduce model, and HDFS
lets users run complex queries in just a few seconds.
• Diversity. Hadoop’s HDFS can store different data formats, like structured,
semi- structured, and unstructured.
• Cost-Effective. Hadoop is an open-source data framework.
• Resilient . Data stored in a node is replicated in other cluster nodes,
ensuring fault tolerance.
• Scalable. Since Hadoop functions in a distributed environment, you can easily
add more servers.
Next we will begin
with Hadoop
47

More Related Content

Similar to Big data and Hadoop Section.............. (20)

PPTX
Topic 9a-Hadoop Storage- HDFS.pptx
DanishMahmood23
 
PPTX
Managing Big data with Hadoop
Nalini Mehta
 
PPTX
Module 1- Introduction to Big Data and Hadoop
SiddheshMhatre27
 
PDF
hdfs readrmation ghghg bigdats analytics info.pdf
ssuser2d043c
 
PPTX
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
PPTX
Big Data-Session, data engineering and scala
ssusera3b277
 
PDF
Hadoop architecture-tutorial
vinayiqbusiness
 
PPTX
Seminar ppt
RajatTripathi34
 
PPTX
Big Data and Hadoop
Mr. Ankit
 
DOCX
project report on hadoop
Manoj Jangalva
 
PPT
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
ManiMaran230751
 
PPTX
Big Data & Hadoop
Ankan Banerjee
 
PDF
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
maharajothip1
 
PPSX
Hadoop – big deal
Abhishek Kumar
 
PPTX
Big data Analytics Hadoop
Mishika Bharadwaj
 
PPTX
Big Data UNIT 2 AKTU syllabus all topics covered
chinky1118
 
PDF
Hadoop overview.pdf
Sunil D Patil
 
PDF
big data hadoop technonolgy for storing and processing data
preetik9044
 
PPTX
Apache Hadoop Big Data Technology
Jay Nagar
 
PPTX
Hadoop - HDFS
KavyaGo
 
Topic 9a-Hadoop Storage- HDFS.pptx
DanishMahmood23
 
Managing Big data with Hadoop
Nalini Mehta
 
Module 1- Introduction to Big Data and Hadoop
SiddheshMhatre27
 
hdfs readrmation ghghg bigdats analytics info.pdf
ssuser2d043c
 
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
Big Data-Session, data engineering and scala
ssusera3b277
 
Hadoop architecture-tutorial
vinayiqbusiness
 
Seminar ppt
RajatTripathi34
 
Big Data and Hadoop
Mr. Ankit
 
project report on hadoop
Manoj Jangalva
 
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
ManiMaran230751
 
Big Data & Hadoop
Ankan Banerjee
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
maharajothip1
 
Hadoop – big deal
Abhishek Kumar
 
Big data Analytics Hadoop
Mishika Bharadwaj
 
Big Data UNIT 2 AKTU syllabus all topics covered
chinky1118
 
Hadoop overview.pdf
Sunil D Patil
 
big data hadoop technonolgy for storing and processing data
preetik9044
 
Apache Hadoop Big Data Technology
Jay Nagar
 
Hadoop - HDFS
KavyaGo
 

Recently uploaded (20)

PDF
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
PDF
MiniTool Power Data Recovery 8.8 With Crack New Latest 2025
bashirkhan333g
 
PPTX
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
PDF
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
PDF
iTop VPN With Crack Lifetime Activation Key-CODE
utfefguu
 
PDF
AOMEI Partition Assistant Crack 10.8.2 + WinPE Free Downlaod New Version 2025
bashirkhan333g
 
PDF
Driver Easy Pro 6.1.1 Crack Licensce key 2025 FREE
utfefguu
 
PDF
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
PDF
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
PDF
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
PPTX
Finding Your License Details in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PPTX
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
PPTX
Comprehensive Risk Assessment Module for Smarter Risk Management
EHA Soft Solutions
 
PPTX
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
PDF
How to Hire AI Developers_ Step-by-Step Guide in 2025.pdf
DianApps Technologies
 
PPTX
Coefficient of Variance in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PPTX
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
PDF
NEW-Viral>Wondershare Filmora 14.5.18.12900 Crack Free
sherryg1122g
 
PDF
Generic or Specific? Making sensible software design decisions
Bert Jan Schrijver
 
PPTX
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
MiniTool Power Data Recovery 8.8 With Crack New Latest 2025
bashirkhan333g
 
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
iTop VPN With Crack Lifetime Activation Key-CODE
utfefguu
 
AOMEI Partition Assistant Crack 10.8.2 + WinPE Free Downlaod New Version 2025
bashirkhan333g
 
Driver Easy Pro 6.1.1 Crack Licensce key 2025 FREE
utfefguu
 
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
Finding Your License Details in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
Comprehensive Risk Assessment Module for Smarter Risk Management
EHA Soft Solutions
 
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
How to Hire AI Developers_ Step-by-Step Guide in 2025.pdf
DianApps Technologies
 
Coefficient of Variance in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
NEW-Viral>Wondershare Filmora 14.5.18.12900 Crack Free
sherryg1122g
 
Generic or Specific? Making sensible software design decisions
Bert Jan Schrijver
 
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
Ad

Big data and Hadoop Section..............

  • 1. Big data and HDFS Section 2
  • 2. Agenda 2  DFS  what are the Problems with big data?  Hadoop is the solution! What is Hadoop?  HDFS  YARN
  • 3. DFS 3 • A distributed file system (DFS) is a file system that enables clients to access file storage from multiple hosts through a computer network as if the user was accessing local storage. • Files are spread across multiple storage servers and in multiple locations, which enables users to share data and storage resources.
  • 4. Long-term information storage Access result of a process later Store large amounts of information Why to store Data in general? Enable access of multiple processes
  • 7. Buy a bigger disk? ? Copy data to an external hard drive? ? OR
  • 10. Rack Distributed File System (DFS) Rack Cluster Node
  • 11. Rack 1 2 3 4 5 Data 1 2 3 4 5 Block
  • 12. Rack 1 2 3 4 5 Data 1 2 3 4 5 Analyze part 5 here!
  • 13. Rack 1 2 3 4 5 Data 1 3 5 5 1 1 2 2 1 2 4 5 3 3 3 2 4 5 3 1 4 4 2 5 4
  • 14. Rack 1 2 3 4 5 Data 1 3 5 5 1 1 2 2 1 2 4 5 3 3 3 2 4 5 3 1 4 4 2 5 4
  • 15. Rack 1 2 3 4 5 Data 1 2 3 4 5 5 5 5 5 1 1 1 1 2 2 2 2 3 3 3 4 4 4 3 4 5: Reader 1 5: Reader 3 5: Reader 2 High Concurrency vs. Low Consistency
  • 16. Rack Distributed File System (DFS) Rack Data scalability Fault tolerance High concurrency Data replication Data partitioning When many storage computers (racks or nodes) are connected throw the network we call it a DFS
  • 17. Big data problems • Storing huge and exponentially growing datasets. • Processing data having complex structure. • the bottleneck of bringing huge amount data to computation unit. 1 7
  • 19. Hadoop Ecosystem 19 • It is a services or frameworks (Open-source projects) which solves big data problems. • You can consider it as a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining) inside it. • Ex: To analyze the transaction data from a RDBMS, we need to ingest it into the Hadoop Distributed File System (HDFS).
  • 20. 20
  • 28. HDFS 28 • Hadoop Distributed File System is the core component or you can say, the backbone of Hadoop Ecosystem. • HDFS is the one, which makes it possible to store different types of large data sets (i.e. structured, unstructured and semi structured data). • It helps us in storing our data across various nodes and maintaining the log file about the stored data (metadata). • HDFS has two core components, i.e. NameNode and DataNode.
  • 29. HDFS 29 1. The NameNode is the main node and it doesn’t store the actual data. It contains metadata, just like a log file or you can say as a table of content. Therefore, it requires less storage and high computational resources. 2. On the other hand, all your data is stored on the DataNodes and hence it requires more storage resources. These DataNodes are commodity hardware (like your laptops and desktops) in the distributed environment. That’s the reason, why Hadoop solutions are very cost effective. 3. You always communicate to the NameNode while writing the data. Then, it internally sends a request to the client to store and replicate data on various DataNodes.
  • 30. 30
  • 31. 31 For more information on how it works visit : https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/
  • 33. YAR N 33 • Yet Another Resource Negotiator (YARN) • Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing activities by allocating resources and scheduling tasks. • It has two major components, i.e. ResourceManager and NodeManager. • Recourse Manager controls all the recourses and decide who gets what. • The NodeManager in YARN is responsible for launching the containers with specified resource constraints. • NodeManagers are installed on every DataNode.
  • 34. YAR N Resourc e Manager Node Manager Scheduler App Manager Container = machine App Master 2/23/24 PRESENTATION TITLE 34 It performs the scheduling of jobs based on policies and priorities defined by the administrator. It monitors the health of the App Master in the cluster and manages failover in case of failures. Manage the execution of tasks within these containers, monitor their progress, and handle any task failures or reassignments.
  • 35. 35
  • 37. Hive: A data warehousing and SQL-like query language tool that allows analysts to interact with data stored in HDFS using familiar SQL syntax. Pig: A high-level scripting platform for processing and analyzing large datasets. It simplifies complex data transformations using a simple scripting language.
  • 39. Spark: Although not originally part of Hadoop, Apache Spark is often integrated into the ecosystem. It offers an in- memory processing engine for faster data processing, machine learning, and graph processing.
  • 40. Zookeeper in Hadoop can be considered a centralized repository where distributed applications can put data into and retrieve data from. It makes a distributed system work together as a whole using its synchronization, serialization, and coordination goals.
  • 46. Advantages of Hadoop for Big Data 46 • Speed. Hadoop’s concurrent processing, MapReduce model, and HDFS lets users run complex queries in just a few seconds. • Diversity. Hadoop’s HDFS can store different data formats, like structured, semi- structured, and unstructured. • Cost-Effective. Hadoop is an open-source data framework. • Resilient . Data stored in a node is replicated in other cluster nodes, ensuring fault tolerance. • Scalable. Since Hadoop functions in a distributed environment, you can easily add more servers.
  • 47. Next we will begin with Hadoop 47

Editor's Notes

  • #16: commodity cluster instead of
  • #17: It breaks down complex tasks into smaller sub-tasks, which are processed in parallel by map and reduce functions
  • #20: Apache Sqoop, which is part of CDH, is that tool. Sqoop is a tool used to automatically load our relational data from MySOL into HDFS, while preserving the structure. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data into HDFS.
  • #31: here are two files associated with the metadata: FsImage: It contains the complete state of the file system namespace since the start of the NameNode. EditLogs: It contains all the recent modifications made to the file system with respect to the most recent FsImage.
  • #40: Distributed applications are difficult to coordinate and work with ,Because many machines are involved, race conditions and deadlocks are common problems when implementing distributed applications. A race condition occurs when a machine tries to perform two or more operations at once, and this can be resolved with ZooKeeper’s serialization feature. A deadlock occurs when two or more computers attempt to access the same shared resource simultaneously.