Big data and Hadoop Section..............

Agenda
2
 DFS
 what are the Problems with big data?
 Hadoop is the solution! What is Hadoop?
 HDFS
 YARN

DFS
3
• A distributed file system (DFS) is a file system that enables clients to access
file storage from multiple hosts through a computer network as if the user
was accessing local storage.
• Files are spread across multiple storage servers and in multiple locations,
which enables users to share data and storage resources.

Long-term information
storage
Access result of a process
later
Store large amounts of
information
Why to store Data in general?
Enable access of multiple
processes

Buy a bigger disk?
?
Copy data to an external hard
drive?
?
OR

Rack
Distributed File System (DFS)
Rack
Cluster Node

Rack
1 2 3 4 5
Data
1
2
3 4
5
Block

Rack
1 2 3 4 5
Data
1
2
3 4
5
Analyze part 5 here!

Rack
1 2 3 4 5
Data
1
3
5
5
1
1 2
2
1
2
4
5
3
3
3
2
4
5
3
1
4
4
2 5 4

Rack
1 2 3 4 5
Data
1
2
3 4
5
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
4
4
4
3 4
5: Reader 1
5: Reader 3
5: Reader 2
High Concurrency
vs.
Low Consistency

Rack
Distributed File System
(DFS)
Rack
Data scalability
Fault tolerance
High concurrency
Data replication
Data partitioning
When many storage computers (racks or nodes) are
connected throw the network we call it a DFS

Big data problems
• Storing huge and
exponentially
growing datasets.
• Processing data
having
complex
structure.
• the bottleneck of
bringing huge
amount data to
computation unit.
1
7

Hadoop is the
solution!
What is Hadoop?

Hadoop Ecosystem
19
• It is a services or frameworks (Open-source projects) which
solves big data problems.
• You can consider it as a suite which encompasses a number of
services (ingesting, storing, analyzing and maintaining) inside
it.
• Ex: To analyze the transaction data from a RDBMS, we need to ingest it into the Hadoop
Distributed File System (HDFS).

HDFS
28
• Hadoop Distributed File System is the core component or you can say, the backbone of
Hadoop Ecosystem.
• HDFS is the one, which makes it possible to store different types of large data sets (i.e.
structured, unstructured and semi structured data).
• It helps us in storing our data across various nodes and maintaining the log file about the
stored data (metadata).
• HDFS has two core components, i.e. NameNode and DataNode.

HDFS
29
1. The NameNode is the main node and it doesn’t store the actual data. It
contains metadata, just like a log file or you can say as a table of content.
Therefore, it requires less storage and high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence it
requires more storage resources. These DataNodes are commodity
hardware (like your laptops and desktops) in the distributed
environment. That’s the reason, why Hadoop solutions are very cost
effective.
3. You always communicate to the NameNode while writing the data. Then,
it internally sends a request to the client to store and replicate data on
various DataNodes.

31
For more information on how it works visit : https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/

YAR
N
33
• Yet Another Resource Negotiator (YARN)
• Consider YARN as the brain of your Hadoop Ecosystem. It performs all
your processing activities by allocating resources and scheduling tasks.
• It has two major components, i.e. ResourceManager and NodeManager.
• Recourse Manager controls all the recourses and decide who gets what.
• The NodeManager in YARN is responsible for launching the containers with
specified resource constraints.
• NodeManagers are installed on every DataNode.

YAR
N
Resourc
e
Manager
Node Manager
Scheduler App Manager Container
= machine
App Master
2/23/24 PRESENTATION TITLE 34
It performs the
scheduling of jobs
based on policies and
priorities defined by
the administrator.
It monitors the
health of the App
Master in the
cluster and
manages failover in
case of failures.
Manage the execution of
tasks within these
containers, monitor their
progress, and handle any
task failures or
reassignments.

Hive: A data warehousing and
SQL-like query language tool that
allows analysts to interact with
data stored in HDFS using familiar
SQL syntax.
Pig: A high-level scripting platform
for processing and analyzing large
datasets. It simplifies complex
data transformations using a
simple scripting language.

Spark: Although not
originally part of Hadoop,
Apache Spark is often
integrated into the
ecosystem. It offers an in-
memory processing engine
for faster data processing,
machine learning, and graph
processing.

Zookeeper in Hadoop can be
considered a centralized
repository where distributed
applications can put data into
and retrieve data from. It
makes a distributed system
work together as a whole
using its synchronization,
serialization, and
coordination goals.

Advantages of Hadoop for Big Data
46
• Speed. Hadoop’s concurrent processing, MapReduce model, and HDFS
lets users run complex queries in just a few seconds.
• Diversity. Hadoop’s HDFS can store different data formats, like structured,
semi- structured, and unstructured.
• Cost-Effective. Hadoop is an open-source data framework.
• Resilient . Data stored in a node is replicated in other cluster nodes,
ensuring fault tolerance.
• Scalable. Since Hadoop functions in a distributed environment, you can easily
add more servers.

Next we will begin
with Hadoop
47

Big data and Hadoop Section..............

More Related Content

Similar to Big data and Hadoop Section.............. (20)

Recently uploaded (20)

Big data and Hadoop Section..............

Editor's Notes