Big Data Storage Concepts
Big Data concepts Technology and Architecture
Raghad Joukhadar
2023-2024
Plan
• Introduction
• Cluster computing
• Types of cluster
• Cluster Structure
• Distribution Models
• Sharding
• Data Replication
• Sharding and Replication
• Distributed File System
• Relational and Non-Relational Databases
• RDBMS Databases
• NoSQL Databases
• NewSQL Databases
• Scaling Up and Scaling Out Storage
Introduction
• The big data revolution has driven significant improvements in
data storage architecture.
• Big data needs a framework for storing data on clusters of commodity
hardware.
– Example: Hadoop, an open-source framework that allows organizations
to store and analyze large volumes of data effectively.
Cluster Computing
• A group of loosely coupled
computers that work together
closely, so that they can be viewed
as a “single larger and more
powerful virtual computer”.
• The cluster components are
connected through
local area networks (LANs).
Overview of Cluster computing
• The login node acts as the
gateway into the cluster.
• When users access the cluster
from a public network, they must
log in to the login node.
• This prevents unauthorized
access to the cluster.
Cluster Benefits
• Scalability
– Nodes can be added or removed according to demand
without disrupting the system
• Availability
– Nodes within the cluster back each other up in the
event of a failure
• Performance
– Connecting multiple computing resources in a
cluster increases performance
TYPES OF CLUSTER (purpose)
• High Availability Clusters
– Nodes in a high availability cluster must have access to
shared storage
– If a node becomes inoperative, continuous service is
provided by failing over from the inoperative cluster
node to another, without administrative intervention
TYPES OF CLUSTER (purpose), cont.
• Load Balancing Cluster
– Distributes incoming requests among multiple nodes running the
same programs or having the same content
– If a node in a load-balancing cluster goes down, the load from that
node is switched over to another node
– Optimizes the use of resources and minimizes response time
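The request distribution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RoundRobinBalancer` class, its method names, and the node names are hypothetical, and round-robin is just one of several possible distribution policies.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out nodes in rotation, skipping nodes marked as down (hypothetical helper)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.down = set()
        self._rotation = cycle(self.nodes)

    def mark_down(self, node):
        # Record a failed node so that its share of requests moves to the others.
        self.down.add(node)

    def next_node(self):
        # Look at each node at most once per call, so we never loop forever.
        for _ in range(len(self.nodes)):
            node = next(self._rotation)
            if node not in self.down:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
first_round = [balancer.next_node() for _ in range(3)]

balancer.mark_down("node-b")  # simulate a node failure
after_failure = [balancer.next_node() for _ in range(4)]
```

After `node-b` is marked down, its requests are served by the remaining nodes, which is the switch-over behaviour the slide describes.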
TYPES OF CLUSTER (Structure)
• Symmetric
– Each node functions as an
individual computer capable
of running applications.
– Additional machines can be
added as needed.
Cluster Structure
• Asymmetric
– One machine acts as the
head node
– The head node serves as the
gateway between the user
and the remaining nodes.
Distribution Models
• There are several distribution models:
– Replication: placing the same set of data on multiple nodes.
– Sharding: placing different sets of data on different nodes.
– Sharding and replication can be used alone or together.
Replication
• Replication is the process of creating copies of the same set
of data across multiple servers.
• Each copy of a block is called a replica.
• Replication overcomes issues such as:
– when a node crashes, the data stored on that node is lost
– when a node is down for maintenance, its data is not available until
the maintenance process is over.
Data Replication Example
Replication Advantages
• Replication makes the system fault tolerant since the data is
not lost when an individual node fails as the data is
redundant across the nodes.
• Replication increases the data availability as the same copy
of data is available across multiple nodes.
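A minimal sketch of why redundant copies make data survive a node failure. The `ReplicatedStore` class and the node names are assumptions made for illustration; real systems also have to handle partial writes and recovery, which are omitted here.

```python
class ReplicatedStore:
    """Keeps a full copy of every key on every live node (illustrative only)."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self.alive = {name: True for name in node_names}

    def put(self, key, value):
        # Write the same data to every node that is currently up.
        for name, data in self.nodes.items():
            if self.alive[name]:
                data[key] = value

    def get(self, key):
        # Any live replica can answer the read.
        for name, data in self.nodes.items():
            if self.alive[name] and key in data:
                return data[key]
        raise KeyError(key)

    def fail(self, name):
        # Simulate a crash or a maintenance window.
        self.alive[name] = False


store = ReplicatedStore(["n1", "n2", "n3"])
store.put("block-1", b"payload")
store.fail("n1")
value_after_failure = store.get("block-1")  # served by n2
```

The read still succeeds after `n1` fails because the same block is held by `n2` and `n3`.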
Replication Models
Master-slave
• Master controls one or more
devices known as slaves
• The flow of control is only
from master to the slaves
• Incoming data are written to
the master node
• Read requests are handled by
slave nodes
• This architecture supports
intensive read requests
• The cluster still has a single
point of failure: the master
• Writes are limited to the
maximum capacity that the master
can handle
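The master-slave behaviour listed above can be sketched as follows. The `MasterSlaveCluster` class is hypothetical, and it assumes synchronous propagation from master to slaves, which is a simplification that the slides do not specify.

```python
from itertools import cycle


class MasterSlaveCluster:
    """Writes go to the master; reads are spread over the slaves (toy model)."""

    def __init__(self, slave_count):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self.master_up = True
        self._read_rotation = cycle(range(slave_count))

    def write(self, key, value):
        if not self.master_up:
            # The master is the single point of failure for writes.
            raise RuntimeError("master is down: writes are unavailable")
        self.master[key] = value
        # Assumption: the master pushes each write to every slave immediately.
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        # Rotate reads across the slaves to support read-heavy workloads.
        slave = self.slaves[next(self._read_rotation)]
        return slave[key]


cluster = MasterSlaveCluster(slave_count=2)
cluster.write("order-17", "shipped")
reads = [cluster.read("order-17") for _ in range(3)]

cluster.master_up = False  # simulate a master failure
read_while_master_down = cluster.read("order-17")
try:
    cluster.write("order-18", "pending")
    write_succeeded = True
except RuntimeError:
    write_succeeded = False
```

Reads keep working while the master is down, but writes fail, which illustrates both the read scalability and the single point of failure.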
Replication Models
Peer-to-peer
• All the nodes have the same
responsibility and are at the
same level
• Any of the devices involved
in the process can initiate
communication
• The nodes consume as well
as contribute resources
• Reliability is improved through
replication
Sharding
• Sharding partitions very large data sets into smaller, easily
manageable chunks called shards.
• The shards are stored by distributing them across multiple
machines called nodes.
• No two shards of the same file are stored on the same node
• Shards spread across multiple nodes collectively constitute the
data set.
Sharding Examples
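One common way to decide which shard a record belongs to is to hash its key. The sketch below assumes hash-based sharding (range-based and directory-based schemes also exist); `shard_for`, `build_shards`, and the sample records are hypothetical names used only for illustration.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), whose result for
    strings changes between interpreter runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def build_shards(records, shard_count):
    # Each shard holds a different subset of the data set.
    shards = [dict() for _ in range(shard_count)]
    for key, value in records.items():
        shards[shard_for(key, shard_count)][key] = value
    return shards


records = {f"user-{i}": {"id": i} for i in range(100)}
shards = build_shards(records, shard_count=4)

# Taken together, the shards make up the complete data set again.
merged = {}
for shard in shards:
    merged.update(shard)
```

Each shard would then be stored on a different node, so that the nodes collectively hold the whole data set.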
Sharding Advantages
• Scalability: new shards can be added at runtime
without shutting down the application for maintenance
• Improved fault tolerance: the failure of a
node affects only the block of data stored on that
particular node.
Sharding & Replication
• With sharding alone, when a node goes down, the data stored on
that node is lost.
• Sharding by itself therefore provides only limited fault tolerance.
• Sharding and replication can be combined to make the system
fault tolerant and highly available.
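The difference between sharding alone and sharding combined with replication can be sketched by tracking which nodes hold each shard. The placement rule below (consecutive nodes, wrapping around) and all names are assumptions chosen for brevity, not a prescribed algorithm.

```python
def place_shards(shard_ids, nodes, replication_factor):
    """Assign each shard to `replication_factor` distinct nodes."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for i, shard_id in enumerate(shard_ids):
        # Consecutive nodes, wrapping around, so replicas never share a node.
        placement[shard_id] = [
            nodes[(i + r) % len(nodes)] for r in range(replication_factor)
        ]
    return placement


def available_shards(placement, failed_nodes):
    # A shard stays available while at least one of its holders is up.
    return {
        shard_id
        for shard_id, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["node-a", "node-b", "node-c", "node-d"]
shard_ids = ["shard-0", "shard-1", "shard-2", "shard-3"]

sharded_only = place_shards(shard_ids, nodes, replication_factor=1)
sharded_and_replicated = place_shards(shard_ids, nodes, replication_factor=2)

lost_without_replicas = set(shard_ids) - available_shards(sharded_only, {"node-b"})
lost_with_replicas = set(shard_ids) - available_shards(sharded_and_replicated, {"node-b"})
```

With one copy per shard, losing `node-b` makes one shard unavailable; with two copies, the same failure loses nothing.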
Sharding & Replication Example
Distributed File System (DFS)
• A file system is a way of storing and organizing data on storage devices
(hard disks, DVDs, ...) and of keeping track of the files stored on them.
• The file is the smallest unit of storage defined by the file system to hold data.
• File systems store and retrieve data so that applications can run effectively and
efficiently on the operating system.
• A distributed file system stores files across cluster nodes and allows
clients to access the files from the cluster.
• Files are distributed across the nodes, but logically they appear as if they
reside on the client's local machine.
• Since a DFS provides access to more than one client simultaneously, the
server coordinates updates so that clients access the current
version of the file and no version conflicts arise.
• Big data widely adopts a distributed file system known as the Hadoop Distributed
File System (HDFS)
DFS Key concepts
• Data replication: copies of data are distributed across
multiple cluster nodes so that there is no single point of failure,
which increases reliability.
• The client can communicate with the closest available
node to reduce latency and network traffic
• Fault tolerance is achieved through data replication: data
is not lost when a node fails because it is stored redundantly
across nodes.
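The key concepts above can be illustrated with a toy in-memory model. This is not the HDFS API: `TinyDFS`, its block placement rule, the tiny block size, and the `distance` table are all assumptions made to keep the sketch short.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class TinyDFS:
    """Toy distributed file system: blocks are replicated across nodes."""

    def __init__(self, node_names, replication_factor=2):
        # Assumption: replication_factor does not exceed the number of nodes.
        self.storage = {name: {} for name in node_names}  # node -> {(path, index): block}
        self.metadata = {}  # path -> list of replica holders, one list per block
        self.replication_factor = replication_factor

    def write(self, path, data, block_size=4):
        names = list(self.storage)
        block_locations = []
        for index, block in enumerate(split_into_blocks(data, block_size)):
            holders = [
                names[(index + r) % len(names)]
                for r in range(self.replication_factor)
            ]
            for node in holders:
                self.storage[node][(path, index)] = block
            block_locations.append(holders)
        self.metadata[path] = block_locations

    def read(self, path, distance, failed=()):
        parts = []
        for index, holders in enumerate(self.metadata[path]):
            live = [node for node in holders if node not in failed]
            if not live:
                raise IOError(f"block {index} of {path} has no live replica")
            # Read each block from the closest live replica.
            closest = min(live, key=lambda node: distance[node])
            parts.append(self.storage[closest][(path, index)])
        return b"".join(parts)


dfs = TinyDFS(["n1", "n2", "n3"], replication_factor=2)
dfs.write("/logs/a.txt", b"hello big data")

distance = {"n1": 1, "n2": 5, "n3": 9}  # hypothetical network distances from the client
normal_read = dfs.read("/logs/a.txt", distance)
read_after_failure = dfs.read("/logs/a.txt", distance, failed={"n1"})
```

The client sees a single file even though its blocks live on different nodes, and the read still succeeds when `n1` is unavailable because every block has a second replica.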
Relational and Non-Relational Databases
Relational Databases
• Organize data into tables of rows
(records) and columns
(attributes or fields)
• Unsuitable when organizations
collect vast amounts of customer
data, transactions, and other
data that may not be structured to
fit into relational databases.
Non-Relational
• This has led to the evolution of
non-relational databases, which are
schema-less.
• NoSQL databases are non-relational
Properties of RDBMS Databases
• Are vertically scalable (by increasing server hardware power)
• Exhibit ACID (atomicity, consistency, isolation, durability) properties
• Support data that adheres to a specific schema
• Can no longer keep pace with the volume, velocity, and variety of data being
generated and consumed
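Atomicity and consistency can be demonstrated with Python's built-in `sqlite3` module. The `accounts` table, the `transfer` function, and the sample balances are hypothetical; the sketch only shows that a failed transaction leaves no partial changes behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the consistency rule: balances may never go negative.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, source, target, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        # 'with conn' commits if the block succeeds and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)         # succeeds
rejected = transfer(conn, "alice", "bob", 500)  # violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the rejected transfer, the credit to `bob` is applied first and then undone by the rollback, so neither account changes.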
Properties of NoSQL Databases
• Include all non-relational databases
• Exhibit the BASE (basically available, soft state, eventually consistent) model
• Are not appropriate for implementing large transactions
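The "eventually consistent" part of BASE can be sketched with replicas that receive updates asynchronously. The `EventuallyConsistentStore` class and its explicit `sync` step are assumptions for illustration; real NoSQL systems propagate updates in the background with their own protocols.

```python
from collections import deque


class EventuallyConsistentStore:
    """Acknowledges a write after one replica; the others catch up later."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # (replica_index, key, value) updates not yet applied

    def write(self, key, value):
        # The write is accepted as soon as the first replica has it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        # A read may hit a replica that has not caught up yet.
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Deliver all queued updates; afterwards every replica agrees.
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index][key] = value


store = EventuallyConsistentStore(replica_count=3)
store.write("cart:42", ["book"])

stale_read = store.read("cart:42", replica_index=2)  # replica 2 has not seen the write
store.sync()
fresh_read = store.read("cart:42", replica_index=2)
```

Between the write and the `sync` call the replicas disagree (soft state); once the pending updates are delivered they converge to the same value.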
Properties of NewSQL Databases
• Aim to combine the scalability and performance benefits of NoSQL
databases with the familiar relational data model and ACID transaction
guarantees of traditional SQL databases
• Horizontally scalable
• Fault tolerant
• Support the relational data model with three layers: the administrative,
transactional, and storage layers.
• Target applications: those that execute the same queries repeatedly with
different inputs and have a large number of transactions
NewSQL Databases comparison
           High performance   Fault tolerant   Distributed   In-memory   Scale-out
Clustrix   yes                yes              yes           -           -
NuoDB      -                  yes              yes           -           yes
VoltDB     yes                yes              yes           yes         yes
MemSQL     yes                yes              yes           yes         -
Scaling up vs. Scaling out
• Scaling up (vertical): increasing the hardware power of a single server
• Scaling out (horizontal): adding more nodes to the cluster
THANK YOU
ANY QUESTIONS?
