Big Data Storage Concepts from the "Big Data concepts Technology and Architecture" book.pptx

Big Data Storage Concepts
Big Data concepts Technology and Architecture
Raghad Joukhadar
2023-2024

Plan
• Introduction
• Cluster computing
• Types of cluster
• Cluster Structure
• Distribution Models
• Sharding
• Data Replication
• Sharding and Replication
• Distributed File System
• Relational and Non-Relational Databases
• RDBMS Databases
• NoSQL Databases
• NewSQL Databases
• Scaling Up and Scaling Out Storage

Introduction
• The big data revolution has brought significant improvements to
data storage architecture.
• A framework is needed for storing data on clusters of commodity
hardware.
– Example: Hadoop, an open-source framework that allows organizations
to store and analyze large volumes of data effectively.

Cluster Computing
• A group of loosely coupled
computers that work together
closely, so that the group can be
viewed as a "single larger and more
powerful virtual computer".
• The cluster components are
connected together through
local area networks (LANs).

Overview of Cluster computing
• The login node acts as the
gateway into the cluster.
• Users who access the cluster
from a public network must first
log in to the login node.
• This prevents unauthorized
access to the cluster.

Cluster Benefits
• Scalability
– Nodes can be added or removed on demand
without disrupting the system
• Availability
– Nodes within the cluster back each other up in the
event of a failure
• Performance
– Connecting multiple computing resources together in a
cluster increases performance

TYPES OF CLUSTER (purpose)
• High Availability Clusters
– Nodes in a high availability cluster must have access to
shared storage
– If a node becomes inoperative, continuous service is
provided by failing over service from the inoperative cluster
node to another, without administrative intervention

TYPES OF CLUSTER cont..
• Load Balancing Cluster
– Distributes incoming requests among multiple nodes running the
same programs or having the same content
– If a node in a load-balancing cluster goes down, the load from that
node is switched over to another node
– Optimizes the use of resources and minimizes response time
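The load-balancing behavior described above can be sketched in a few lines. This is a minimal illustration only: the `RoundRobinBalancer` class, the node names, and the round-robin policy are assumptions chosen for the example, not something the slides prescribe.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical balancer: hands requests to healthy nodes in turn."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        # Record a node failure so that later requests skip it.
        self.healthy.discard(node)

    def route(self, request):
        # The request payload is ignored here; only the target node is chosen.
        # Skip failed nodes and give up after one full pass over the ring.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
first = [balancer.route(f"req-{i}") for i in range(3)]

# When node-b goes down, its share of the load moves to the other nodes.
balancer.mark_down("node-b")
after_failure = [balancer.route(f"req-{i}") for i in range(4)]
```

Real load balancers usually also weigh node load and run health checks; the sketch only shows the distribution and switch-over idea.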

TYPES OF CLUSTER (Structure)
• Symmetric
– Each node functions as an
individual computer capable
of running applications.
– Additional machines can be
added as needed.

Cluster Structure
• Asymmetric
– One machine acts as the
head node
– The head node serves as the
gateway between the user and
the remaining nodes.

Distribution Models
• There are several distribution models:
– Replication: placing the same set of data over multiple nodes
– Sharding: placing different sets of data on different nodes
– Sharding and replication can be used alone or together

Replication
• Replication is the process of creating copies of the same set
of data across multiple servers.
• A copy of a block is called a replica.
• Replication overcomes issues such as:
– when a node crashes, the data stored on that node is lost
– when a node is down for maintenance, its data is not available until
the maintenance process is over.

Replication Advantages
• Replication makes the system fault tolerant since the data is
not lost when an individual node fails as the data is
redundant across the nodes.
• Replication increases the data availability as the same copy
of data is available across multiple nodes.

Replication Models
Master-slave
• Master controls one or more
devices known as slaves
• The flow of control is only
from master to the slaves
• Incoming data are written on
the master node
• Read requests are handled by
slave nodes
• This architecture supports
intensive read requests
• The cluster still has a single
point of failure: the master
• Writes are limited to the
maximum capacity that the master
can handle
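A minimal sketch of the master-slave model may help here. The class names are hypothetical, and the sketch assumes the master pushes every write to all slaves synchronously, which the slides do not specify.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveCluster:
    """Hypothetical cluster: one master takes writes, slaves serve reads."""

    def __init__(self, slave_count):
        self.master = Node("master")
        self.slaves = [Node(f"slave-{i}") for i in range(slave_count)]
        self._next_read = 0

    def write(self, key, value):
        # All writes go to the master, which then pushes the change to slaves.
        self.master.data[key] = value
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Reads are spread over the slaves in round-robin order.
        slave = self.slaves[self._next_read % len(self.slaves)]
        self._next_read += 1
        return slave.name, slave.data.get(key)


cluster = MasterSlaveCluster(slave_count=2)
cluster.write("user:1", "Alice")
reads = [cluster.read("user:1") for _ in range(3)]
```

Adding slaves increases read capacity, but every write still passes through the single master.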

Replication Models
• All the nodes have the same
responsibility and are at the
same level
• Either of the devices involved
in the process can initiate
communication
• The nodes consume as well
as donate the resources
• Reliability is improved through
replication
Peer-Peer
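A minimal sketch of peer-to-peer replication, under simplifying assumptions: the `Peer` class is hypothetical, every peer forwards its writes to all others immediately, and conflicting concurrent writes are not handled.

```python
class Peer:
    """Hypothetical peer: accepts reads and writes, like every other peer."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peers = []

    def connect(self, others):
        self.peers = [p for p in others if p is not self]

    def write(self, key, value):
        # Any peer accepts a write and forwards it to every other peer.
        self.data[key] = value
        for peer in self.peers:
            peer.data[key] = value

    def read(self, key):
        return self.data.get(key)


peers = [Peer(f"peer-{i}") for i in range(3)]
for p in peers:
    p.connect(peers)

peers[0].write("x", 1)   # write initiated by one peer
peers[2].write("y", 2)   # write initiated by a different peer
snapshot = [(p.read("x"), p.read("y")) for p in peers]
```

Because no node is special, losing any single peer leaves the others able to serve both reads and writes.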

Sharding
• Partitioning very large data sets into smaller and easily
manageable chunks called shards.
• The shards are stored by distributing them across multiple
machines called nodes.
• No two shards of the same file are stored in the same node
• Shards spread across multiple nodes collectively constitute the
data set.
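One common way to decide which shard holds a record is to hash its key. The sketch below assumes hash-based sharding with a fixed shard count; the function names and the `user:N` keys are illustrative, and the slides do not say which partitioning scheme is used.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def build_shards(records, shard_count):
    # One dict per shard; in a real cluster each shard would live on a node.
    shards = [dict() for _ in range(shard_count)]
    for key, value in records.items():
        shards[shard_for(key, shard_count)][key] = value
    return shards


records = {f"user:{i}": i for i in range(100)}
shards = build_shards(records, shard_count=4)

# Each record lives in exactly one shard; together the shards form the data set.
merged = {}
for shard in shards:
    merged.update(shard)
```

`hashlib` is used instead of the built-in `hash()` because string hashes from `hash()` vary between Python processes, which would move records between shards on restart.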

Sharding Advantages
• Scalability where new shards can be added at runtime
without shutting down the application for maintenance
• Improves the fault tolerance of the system as the failure of a
node affects only the block of the data stored in that
particular node.

Sharding & Replication
• With sharding alone, when a node goes down, the data stored on
that node is lost.
• So sharding by itself provides only limited fault tolerance.
• Sharding and replication can be combined to make the system
fault tolerant and highly available.
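The combination can be sketched as a placement table that assigns each shard to several distinct nodes. The round-robin placement policy, the function names, and the node and shard names are assumptions made for illustration.

```python
def place_replicas(shard_ids, nodes, replication_factor):
    """Assign each shard to `replication_factor` distinct nodes (round-robin)."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, shard in enumerate(shard_ids):
        placement[shard] = [
            nodes[(i + r) % len(nodes)] for r in range(replication_factor)
        ]
    return placement


def available_shards(placement, failed_nodes):
    """Return the shards that still have at least one live replica."""
    failed = set(failed_nodes)
    return {s for s, holders in placement.items() if set(holders) - failed}


nodes = ["node-a", "node-b", "node-c"]
shards = ["shard-1", "shard-2", "shard-3"]
placement = place_replicas(shards, nodes, replication_factor=2)

# With two replicas per shard, one node failure loses no data.
survives_one = available_shards(placement, {"node-a"})
# Losing both holders of a shard still makes that shard unavailable.
survives_two = available_shards(placement, {"node-a", "node-b"})
```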

Sharding & Replication Example

Distributed File System (DFS)
• A file system is a way of storing and organizing data on storage devices
(hard disks, DVDs, ...) and of keeping track of the files stored on them.
• The file is the smallest unit of storage that the file system defines to hold data.
• File systems store and retrieve data so that applications can run effectively and
efficiently on the operating system.
• A distributed file system stores files across cluster nodes and allows
clients to access the files from the cluster.
• Files are distributed across the nodes, but logically they appear as if they are
residing on the client's local machine.
• Since a DFS provides access to more than one client simultaneously, the
server coordinates updates so that clients access the current
version of the file and no version conflicts arise.
• Big data systems widely adopt a distributed file system known as the Hadoop
Distributed File System (HDFS).

DFS Key concepts
• Data replication where the copies of data are distributed on
multiple cluster nodes so that there is no single point of failure,
which increases the reliability.
• The client can communicate with any of the closest available
nodes to reduce latency and network traffic
• Fault tolerance is achieved through data replication as the data
will not be lost in case of node failure due to the redundancy in
the data across nodes.
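The key concepts above can be illustrated with a toy model. It is not how HDFS is implemented: the block size, the placement of two replicas per block, and the `distance` table (a stand-in for network proximity) are all assumptions for the sketch.

```python
def split_into_blocks(data, block_size):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def read_file(block_locations, storage, distance, down=()):
    """Reassemble a file, reading each block from the closest live replica."""
    parts = []
    for block_id, holders in enumerate(block_locations):
        live = [n for n in holders if n not in down]
        if not live:
            raise OSError(f"block {block_id} has no live replica")
        nearest = min(live, key=lambda n: distance[n])
        parts.append(storage[nearest][block_id])
    return b"".join(parts)


data = b"big data storage concepts"
blocks = split_into_blocks(data, block_size=8)
nodes = ["node-a", "node-b", "node-c"]
storage = {n: {} for n in nodes}

# Store two replicas of every block on two different nodes.
block_locations = []
for block_id, block in enumerate(blocks):
    holders = [nodes[block_id % 3], nodes[(block_id + 1) % 3]]
    for n in holders:
        storage[n][block_id] = block
    block_locations.append(holders)

# Hypothetical client-to-node distances; lower means closer.
distance = {"node-a": 1, "node-b": 5, "node-c": 9}

normal_read = read_file(block_locations, storage, distance)
degraded_read = read_file(block_locations, storage, distance, down=("node-a",))
```

The read with `node-a` down still returns the full file because every block has a second replica on another node.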

Relational and Non-Relational Databases

Relational Databases
• Organize data into tables of rows
(records) and columns
(attributes or fields)
• Unsuitable when organizations
collect vast amounts of customer
data, transactions, and other
data that may not be structured to
fit into relational tables.
Non-Relational
• This limitation has led to the evolution of
non-relational databases, which are
schema-less.
• NoSQL databases are non-relational databases

Properties of RDBMS Databases
• Are vertically scalable (by increasing server hardware power)
• Exhibit ACID (atomicity, consistency, isolation, durability) properties
• Support data that adheres to a specific schema
• Can no longer keep pace with the volume, velocity, and variety of data being
generated and consumed
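Atomicity, the "A" in ACID, can be demonstrated with Python's built-in `sqlite3` module. The `accounts` table, the names, and the amounts are made up for the example; SQLite stands in for any relational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer the credit to `bob` has already run when the debit from `alice` violates the constraint, and the rollback undoes the credit as well, so the balances stay at their values from the first transfer.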

Properties of NoSQL Databases
• Include all non-relational databases
• Exhibit the BASE (basically available, soft state, eventually consistent) model
• Are not appropriate for implementing large transactions
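The "eventually consistent" part of BASE can be sketched with a toy store. The `EventuallyConsistentStore` class and its explicit `sync()` step are assumptions made for illustration; real NoSQL systems propagate updates in the background with their own protocols.

```python
class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and spreads on sync()."""

    def __init__(self, replica_count):
        self.replicas = [dict() for _ in range(replica_count)]
        self.pending = []  # (key, value) updates not yet propagated

    def write(self, key, value):
        self.replicas[0][key] = value      # acknowledged immediately
        self.pending.append((key, value))  # propagated later

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def sync(self):
        # Apply all pending updates, in order, to the remaining replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore(replica_count=3)
store.write("cart", ["book"])
stale = store.read("cart", replica=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.read("cart", replica=2)  # all replicas now agree
```

Between the write and the sync, different replicas return different answers (the "soft state"); the store stays available for reads throughout.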

Properties of NewSQL Databases
• Aim to combine the scalability and performance benefits of NoSQL
databases with the familiar relational data model and ACID transaction
guarantees of traditional SQL databases
• Horizontally scalable
• Fault tolerant
• Support the relational data model with three layers: the administrative,
transactional, and storage layers.
• Suitable applications: those that execute the same queries repeatedly with
different inputs and have a large number of transactions

NewSQL Databases comparison
           high performance   fault tolerant   distributed   in-memory   scale-out
Clustrix   yes                yes              yes           -           -
NuoDB      -                  yes              yes           -           yes
VoltDB     yes                yes              yes           yes         yes
MemSQL     yes                yes              yes           yes         -

Scaling up vs. Scaling out
• Scaling up (vertical): increasing the hardware power of a single server
• Scaling out (horizontal): adding more nodes to the cluster
