Nilesh Salpe
A collection of computers that appears to its users as a single computer.
Characteristics
 The computers operate concurrently.
 The computers fail independently.
 The computers do not share a global clock.
 Examples
 Amazon.com
 Cassandra database
 Is a multi-core processor a distributed system?
 Is a single-core computer with peripherals (Wi-Fi, printer, multiple displays, etc.) a distributed system?
 Distributed Storage
 Relational, MongoDB, Cassandra, HDFS (Hadoop Distributed File System), HBase, Redis
 Distributed Computation
 Hadoop, Spark, Storm, Akka, Apache Flink
 Distributed Synchronization
 NTP (Network Time Protocol), vector clocks
 Distributed Consensus
 Paxos, ZooKeeper
 Distributed Messaging
 Apache Kafka, RabbitMQ
 Load Balancers
 Round robin, weighted round robin, min load, weighted load, session-aware, etc.
 Serialization
 Protocol Buffers, Thrift, Avro, etc.
 Single-master storage
 One powerful machine; scaling up (vertical scaling).
 Types of load
Read-heavy load
Write-heavy load
Mixed read/write load
 Scaling Strategies / Data distribution
 Read Replication (scaling out)
 Sharding (scaling out)
Master node – updates must pass through the master node.
Follower nodes – receive asynchronous replication (data propagation) from the master node.
Read requests can be served by follower nodes, which increases the overall read I/O of the system.
Problems with this design:
 Increased complexity of replication.
 No guarantee of strong consistency.
 Read-after-write scenarios do not guarantee the latest value.
 The master node can become a bottleneck for write requests.
The model is suitable for read-heavy workloads.
Examples: Google search engine, relational database with a read-replica cluster.
[Diagram: the master holds X = 5 and is asynchronously propagating the update (F1: X = 5) to followers that still hold the stale value X = 3.]
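The read-after-write gap in this design can be sketched in a few lines of Python (class and method names are illustrative, not any real database's API): a write lands on the master, and a follower read returns the stale value until replication catches up.

```python
# Illustrative sketch of asynchronous master-follower replication.

class Node:
    def __init__(self):
        self.data = {}

class Cluster:
    def __init__(self, followers=2):
        self.master = Node()
        self.followers = [Node() for _ in range(followers)]
        self.pending = []                      # replication log not yet applied

    def write(self, key, value):
        self.master.data[key] = value          # master applies immediately
        self.pending.append((key, value))      # followers get it later

    def replicate(self):
        # asynchronous replication catching up
        for key, value in self.pending:
            for f in self.followers:
                f.data[key] = value
        self.pending.clear()

    def read_from_follower(self, key):
        return self.followers[0].data.get(key)

cluster = Cluster()
cluster.write("X", 5)
stale = cluster.read_from_follower("X")   # replication has not run yet
cluster.replicate()
fresh = cluster.read_from_follower("X")
print(stale, fresh)                       # -> None 5 (the follower lagged)
```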
Data distribution techniques
Sharding
 Used in relational databases and distributed databases.
 Can range from manual to completely automated, depending on the scheme.
Consistent hashing
 Used in distributed databases.
 It is automated.
 Used to partition data across multiple nodes based on some key or attribute.
 Techniques range from manual to automated sharding.
 Functional partitioning (burden on the client)
Example: store all user data on one node and all transaction data on another node.
 Horizontal partitioning (popular)
 Ranges
 Hashes
 Directory
 Vertical partitioning (less popular)
 Data belonging to the same set/table/relation is distributed across nodes.
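Hash-based horizontal partitioning can be sketched as follows (the shard count and function names are illustrative). Note that this is the modular scheme whose resharding weakness motivates consistent hashing later in the deck.

```python
# Minimal sketch of hash-based horizontal partitioning: each record is
# routed to a shard by hashing its key modulo the shard count.

import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Stable hash (unlike Python's built-in hash(), which is salted per run)
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shards = {i: {} for i in range(NUM_SHARDS)}

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

for user in ["john", "jane", "alice", "bob"]:
    put(user, {"name": user})

print(get("jane"))   # -> {'name': 'jane'}
```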
[Diagram: a shard router receives a write (X = 10) and routes it to the owning shard, N2, among nodes N1–N4.]
Basics of Distributed Systems - Distributed Storage
 A shard routing layer distributes reads/writes.
 More complexity
 Routing layer, awareness of network topology, handling a dynamic cluster.
 Limited data model
 Every data model must have a key, which is used for routing.
 Limited data access patterns
 All read/write/update/delete queries must include the key.
 Redundant data for additional access patterns.
 To access data by more than one key, the data must also be stored under those keys, i.e. multiple copies or de-normalized data.
 Data access patterns need to be considered before designing the models.
 Too much scatter-gather for aggregations; reads might slow down the system.
 OLTP will slow down the system.
 The number of shards needs to be decided early in system design.
 In an in-memory hash map, when the load factor crosses a certain threshold we must re-hash all keys. So a modular hash function does not work when the number of buckets changes dynamically.
 Consistent hashing is a technique used to limit the reshuffling of keys when a hash table structure is rebalanced (e.g. when the number of buckets changes dynamically).
 The hash space is shared by the key hash space and the virtual-node hash space.
 Keys and virtual nodes hash to the same values regardless of the number of physical nodes; the only difference is which physical node they are stored on.
 Advantages
 Avoids re-hashing all keys when nodes leave or join.
 Example: in Cassandra, each physical node is mapped to many virtual nodes (e.g. 128).
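A toy consistent-hash ring, assuming MD5 as the hash and 8 virtual nodes per physical node (an illustration, not Cassandra's actual implementation). It shows that when a node joins, only a fraction of keys move, unlike the modular scheme, which would move most of them.

```python
# Toy consistent-hash ring: each physical node owns several virtual
# nodes, and a key is stored on the first virtual node at or after the
# key's position, moving clockwise around the ring.

import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 2**32

class Ring:
    def __init__(self, vnodes_per_node=8):
        self.vnodes = vnodes_per_node
        self.tokens = []          # sorted virtual-node positions
        self.owner = {}           # token -> physical node

    def add_node(self, node):
        for i in range(self.vnodes):
            token = h(f"{node}#{i}")
            bisect.insort(self.tokens, token)
            self.owner[token] = node

    def node_for(self, key):
        token = h(key)
        idx = bisect.bisect_left(self.tokens, token) % len(self.tokens)
        return self.owner[self.tokens[idx]]

ring = Ring()
for n in ["A", "B"]:
    ring.add_node(n)

before = {k: ring.node_for(k) for k in (f"key{i}" for i in range(1000))}
ring.add_node("C")                       # one node joins the ring
after = {k: ring.node_for(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
# Only a fraction of keys move; a modular hash would move most of them.
print(f"{moved / 1000:.0%} of keys moved")
```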
With 2 physical nodes (A, B) mapped to 8 virtual nodes:

Node  Hash ranges
A     (0-8), (16-24), (32-40), (48-56)
B     (8-16), (24-32), (40-48), (56-64)

With 4 physical nodes (A, B, C, D) mapped to 8 virtual nodes:

Node  Hash ranges
A     (0-8), (32-40)
B     (8-16), (40-48)
C     (16-24), (48-56)
D     (24-32), (56-64)

Removing nodes C and D: 50% of keys are affected.
Adding nodes C and D: 50% of keys are affected.
Say we have a hash function that gives a 6-bit hash, so the hash space is 2^6 = 64.
hash(John) = 111100 (60), hash(Jane) = 011000 (24). Each hash is assigned to the virtual-node segment immediately ahead of it in the clockwise direction (CWD) on the ring.
 Eventual Consistency
 Consistency Tuning
R + W > N
R – number of replicas that must respond for a successful read
W – number of replicas that must respond for a successful write/update
N – total number of replicas
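The R + W > N rule is just arithmetic: it forces every read quorum to overlap every write quorum in at least one replica, so a read always sees at least one copy of the latest write. A minimal sketch:

```python
# Quorum overlap check for tunable consistency with N replicas.

def overlaps(r: int, w: int, n: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return r + w > n

# Typical settings for N = 3 replicas:
print(overlaps(2, 2, 3))   # quorum reads + quorum writes -> True (strong)
print(overlaps(1, 1, 3))   # read ONE, write ONE -> False (eventual only)
print(overlaps(1, 3, 3))   # read ONE, write ALL -> True (slow writes)
```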
 Failures
 Node offline
 Network latency
 A GC-like process making a node unresponsive
 Hinted handoffs, read repairs
 Huge impact on design (write-then-read scenarios)
 CAP
 Consistency – every read gets the most recent value after a write/update/delete.
 Availability – every request receives a non-error response (with no guarantee it is the most recent value).
 Partition tolerance – the system continues to respond despite an arbitrary number of node failures (nodes cannot communicate with other nodes, temporarily or permanently, due to network partition, congestion, communication delays, or a GC pause in the case of the JVM).
 Ground reality
 Partition tolerance is a must; nobody wants data loss.
 The practical choice is always between consistency and availability.
 Example:
 Amazon S3 chooses availability over consistency, so it is an AP system.
 In relational databases
 Two-phase commit in distributed relational databases (throughput suffers).
 ACID properties of transactions in relational databases:
A – Atomicity: a transaction (a bundle of statements) completes entirely or not at all.
C – Consistency: the database is kept in a valid state before and after the transaction.
I – Isolation: each transaction acts as if it were working on the data alone (serializable, repeatable reads, read committed, read uncommitted (dirty reads, phantom reads)).
D – Durability: once a transaction is committed, the changes are permanent.
 A transaction can be rolled back as if it never happened.
 Options in distributed storage systems
 Lighter transactions are supported, such as update-if-present.
 Write-off (no money back; no guarantee of delivery)
 Retry (with exponentially increasing intervals)
 Compensating actions (say, reverting a credit-card payment)
 Distributed transactions (2PC) (slow you down)
 The main reason to sacrifice transactions is availability.
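The retry option above can be sketched with exponentially increasing delays between attempts (the function names here are illustrative):

```python
# Retry a flaky operation with exponential backoff between attempts.

import time

def retry(operation, attempts=5, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                               # give up after the last try
            time.sleep(base_delay * 2 ** attempt)   # 0.01s, 0.02s, 0.04s, ...

calls = {"count": 0}

def flaky_write():
    # Fails twice, then succeeds - simulating a briefly unavailable replica.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("replica unavailable")
    return "ok"

print(retry(flaky_write))   # -> ok (after two failed attempts)
```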
 Impact on the design of applications using distributed storage.
 Aspects to consider
 Scale
 Transactional needs
 High availability
 Designing for failure
Storage options and scenarios
 Relational databases
 Strong transactional requirements (OLTP systems)
 NoSQL
 A giant distributed hash table (one that cannot fit on a single machine) with nested keys.
 Key-value stores: Map<K,V>
 Document databases: Map<K, {k1:v1, k2:v2, …}>, where the value is generally JSON or some other serializable/de-serializable format, or a binary file.
 Columnar databases: SortedMap<<K1,K2,K3,…>, V>
 Graph databases: AdjacencyMap<K, [K1,K2,K3,…]> – lots of small relations or links
 Search engines: lots of indexes based on search requirements – Map<K1,K>, Map<K2,K>, Map<K3,K>, … plus the actual raw document storage Map<K,V>
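The map analogies above can be written out as plain Python structures (all of the data here is made up for illustration):

```python
# Key-value store: Map<K, V> - the value is an opaque blob
kv = {"user:42": b"...opaque blob..."}

# Document database: Map<K, {k1: v1, ...}> - a JSON-like value
doc = {"user:42": {"name": "Jane", "city": "Pune"}}

# Columnar database: SortedMap<(K1, K2, ...), V> - composite sorted key
from collections import OrderedDict
col = OrderedDict(sorted({
    ("user:42", "2021-01-01", "login"): 1,
    ("user:42", "2021-01-02", "login"): 2,
}.items()))

# Graph database: AdjacencyMap<K, [K1, K2, ...]> - many small links
graph = {"user:42": ["user:7", "user:13"]}

# Search engine: secondary indexes pointing back at the primary key
by_city = {"Pune": ["user:42"]}

print(doc["user:42"]["name"], list(col)[0][2], graph["user:42"][0])
```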
 Distributed Computation …
Editor's Notes
  • #3: The computers operate concurrently – no shared resources such as CPU, GPU, or memory. The computers fail independently – computers might fail due to hardware failures, power outages, etc.
  • #4: Rough overview of all of these in the next session. Messaging is a key part of the observer/pub-sub design pattern. Paxos – a protocol to resolve conflicting values between multiple machines.
  • #5: Let's start with our own e-commerce site with a small customer base.
  • #10: By range: easy, but no even distribution – it depends on the data's characteristics. By hash: better distribution if the hash function is good, but re-hashing requires transferring most keys, which increases internal network traffic. By directory: create a virtual directory mapping, say, servers to objects, but management becomes challenging as data grows.
  • #12: Used for tabular or relational data. Less common, as relations grow in row count, not column count. Binary-object columns might be vertically partitioned.
  • #15: Example: Riak uses the 160-bit SHA-1 hash function, so the hash space can be as large as 2^160 − 1.
  • #19: Companies avoid two-phase commit for speed and throughput. XA transactions are an implementation of distributed transactions, with a coordinator node and participant nodes. Commit request (voting/prepare phase): send the query to all participants; each executes it but does not commit, then votes yes or no. Commit phase: if all vote yes, commit; otherwise abort. Example transaction: receive order, process payment, enqueue order, process order, deliver the order – parallel, separate systems. Failures: payment failure, out of stock, hardware failures, service failures. Global locks hold a critical resource, affecting the whole chain. Lightweight transactions – like update-if-exists, etc.