Designing Large-Scale Distributed Systems


   Ashwani Priyedarshi
“the network is the computer.”

 John Gage, Sun Microsystems
“A distributed system is one in which the failure 
 of a computer you didn’t even know existed can 
       render your own computer unusable.”

                 Leslie Lamport
“Of three properties of distributed data systems –
 consistency, availability, partition-tolerance –
                   choose two.”

       Eric Brewer, CAP Theorem, PODC 2000
Agenda
●   Consistency Models
●   Transactions
●   Why distribute?
●   Decentralized Architecture
●   Design Techniques & Tradeoffs
●   A Few Real-World Examples
●   Conclusions
Consistency Model
• Restricts possible values that a read operation on 
  an item can return
  – Some are very restrictive, others are less
  – The less restrictive ones are easier to implement


• The most natural semantics for a storage system is:
  “a read should return the last written value”
  – With concurrent accesses and multiple replicas, it is
    not easy to define what “last write” means
Strict Consistency
●   Assumes the existence of absolute global time
●   It is impossible to implement on a large distributed 
    system
●   No two operations (in different clients) allowed at the 
    same time
●   Example: Sequence (a) satisfies strict consistency, but 
    sequence (b) does not
Sequential Consistency
●   The result of any execution is the same as if 
     ●   the read and write operations by all processes on the data 
         store were executed in some sequential order
     ●   the operations of each individual process appear in this 
         sequence in the order specified by its program
●   All processes see the same interleaving of operations
●   Many interleavings are valid
●   Different runs of a program might act differently
●   Example: Sequence (a) satisfies sequential consistency, 
    but sequence (b) does not
Consistency vs Availability
•   In large shared-data distributed systems, network
    partitions are a given

•   During a partition the system must give up either
    consistency or availability

•   Either choice requires the client developer to be aware
    of what the system is offering
Eventual Consistency
•   An eventually consistent storage system guarantees that,
    if no new updates are made to an object, eventually
    all accesses will return the last updated value

•   If no failures occur, the maximum size of the 
    inconsistency window can be determined based on factors 
    such as:
    – load on the system
    – communication delays
    – number of replicas


•   The most popular system that implements eventual 
    consistency is DNS
Quorum-based Technique
•   Used to enforce consistent operation in a distributed
    system
•   Consider the following parameters:
    – N = Total number of replicas
    – W = Replicas to wait for acknowledgement during writes
    – R = Replicas to access during reads
•   If W+R > N
    – the read set and the write set always overlap and one can 
      guarantee strong consistency
•   If W+R <= N
    – the read and write set might not overlap and consistency 
      cannot be guaranteed
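
The W+R > N rule above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the brute-force variant simply enumerates every possible write set and read set to confirm that the arithmetic rule and the "sets always overlap" statement agree for small N.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Arithmetic rule from the slide: overlap is guaranteed iff W + R > N."""
    return w + r > n


def overlap_by_enumeration(n: int, w: int, r: int) -> bool:
    """Brute-force check for small n: every write set of size w must
    share at least one replica with every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


if __name__ == "__main__":
    # N=3, W=2, R=2: any read touches at least one replica holding the latest write.
    print(quorums_overlap(3, 2, 2), overlap_by_enumeration(3, 2, 2))
    # N=3, W=1, R=1: a read may hit a replica the write never reached.
    print(quorums_overlap(3, 1, 1), overlap_by_enumeration(3, 1, 1))
```

Lowering W makes writes faster and raising R makes reads slower, so the same inequality can be tuned toward read-heavy or write-heavy workloads.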
Transactions
●   Extended form of consistency across multiple operations
●   Example: Transfer money from A to B
    ●   Subtract from A
    ●   Add to B
●   What if something happens in between?
    ●   Another transaction on A or B
    ●   Machine Crashes
    ●   ...
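
The transfer example above can be sketched with a single-node database transaction. This is an illustration only, assuming an in-memory SQLite database and a hypothetical `accounts` table; it shows the two updates either both taking effect or neither, not how a distributed system would achieve the same thing.

```python
import sqlite3


def make_bank() -> sqlite3.Connection:
    """Create a hypothetical two-account bank: A holds 100, B holds 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Subtract from src and add to dst as one atomic unit."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Invariant violated "in between": abort, undoing the subtraction.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def balances(conn: sqlite3.Connection) -> dict:
    """Return {account name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A failed transfer leaves both balances untouched, which is the invariant ("money is neither created nor destroyed") that the transaction enforces.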
Why Transactions?
●   Correctness
●   Consistency
●   Enforce Invariants
●   ACID
Why distribute?
●   Catastrophic Failures
●   Expected Failures
●   Routine Maintenance
●   Geolocality
    ●   CDN, edge caching
Why NOT to distribute?
●   Within a Datacenter
    ●   High bandwidth: 1-100 Gbps interconnects
    ●   Low latency: < 1 ms within a rack, < 5 ms across
    ●   Little to no cost
●   Between Datacenters
    ●   Low bandwidth: 10 Mbps-1 Gbps
    ●   High latency: expect 100s of ms
    ●   High cost for fiber
Decentralized Architecture
●   Operating from multiple data centers simultaneously
●   Hard problem
●   Maintaining consistency? Harder
●   Transactions? Hardest
Option 1: Don't
●   Most common
    ●   Make sure the data center never goes down
●   Bad at catastrophic failure
    ●   Large scale data loss
●   Not great for serving
    ●   No geolocation
Option 2: Primary with hot failover(s)
●   Better, but not ideal
    ●   Mediocre at catastrophic failure
    ●   Window of lost data
    ●   Failover data may be inconsistent
●   Geolocated for reads, not for writes
Option 3: Truly Distributed
●   Simultaneous writes in different DCs, maintaining 
    consistency
●   Two-way: Hard
●   N-way: Harder
●   Handles catastrophic failure, geolocality
●   But high latency
Tradeoffs

               Backups   M/S   MM   2PC   Paxos
Consistency
Transactions
Latency
Throughput
Data Loss
Failover
Backups
●   Make a copy
●   Weak consistency
●   Usually no transactions
Tradeoffs – Backups

               Backups   M/S         MM           2PC          Paxos
Consistency    Weak
Transactions   No
Latency        Low
Throughput     High
Data Loss      High
Failover       Down
Master/slave replication
●   Usually asynchronous
    ●   Good for throughput, latency
●   Weak/eventual consistency
●   Supports transactions
Tradeoffs – Master/Slave

               Backups   M/S         MM           2PC          Paxos
Consistency    Weak      Eventual
Transactions   No        Full
Latency        Low       Low
Throughput     High      High
Data Loss      High      Some
Failover       Down      Read Only
Multi-master replication
●   Asynchronous, eventual consistency
●   Concurrent writes
●   Need serialization protocol
    ●   e.g. monotonically increasing timestamps
    ●   Either with master election or distributed consensus protocol
●   No strong consistency
●   No global transactions
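
One way to picture the serialization point above is a last-writer-wins register stamped with a monotonically increasing counter. This is a minimal sketch under stated assumptions: the `Master` class, its fields, and the tie-break on node id are hypothetical and are not described in the slides; real multi-master systems use a variety of conflict-resolution schemes.

```python
from dataclasses import dataclass


@dataclass
class Master:
    """One writable replica holding a single value (hypothetical model)."""
    node_id: str
    clock: int = 0            # monotonically increasing logical timestamp
    value: object = None
    stamp: tuple = (0, "")    # (timestamp, node_id) of the stored write

    def write(self, value) -> None:
        """Accept a local write and stamp it with the next timestamp."""
        self.clock += 1
        self.value, self.stamp = value, (self.clock, self.node_id)

    def merge(self, other: "Master") -> None:
        """Asynchronously apply another master's state; the highest stamp wins.
        Ties on the timestamp are broken by node id so all masters agree."""
        self.clock = max(self.clock, other.stamp[0])
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp
```

Both masters converge on one value after exchanging state, but the losing concurrent write is silently discarded, which is one reading of the "Some" data loss entry for multi-master in the tradeoff table.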
Tradeoffs – Multi-master

               Backups   M/S         MM           2PC          Paxos
Consistency    Weak      Eventual    Eventual
Transactions   No        Full        Local
Latency        Low       Low         Low
Throughput     High      High        High
Data Loss      High      Some        Some
Failover       Down      Read Only   Read/write
Two Phase Commit
●   Semi-distributed consensus protocol
    ●   deterministic coordinator
●   1: Request 2: Commit/Abort
●   Heavyweight, synchronous, high latency
●   3PC: Asynchronous (One extra round trip)
●   Poor Throughput
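
The two phases listed above (1: Request, 2: Commit/Abort) can be sketched as an in-memory simulation. The `Participant` class and `two_phase_commit` function are hypothetical names for illustration; the sketch ignores timeouts, logging, and coordinator failure, which are the hard parts of a real implementation.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """A resource manager that votes in phase 1 and obeys the decision in phase 2."""
    name: str
    can_commit: bool = True
    state: str = "init"

    def prepare(self) -> bool:
        """Phase 1: vote yes (and hold resources) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1: request a vote from every participant (all are asked, synchronously).
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the single global decision.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

Every participant waits on the coordinator between the two phases, which is where the latency and throughput costs noted above come from.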
Tradeoffs – 2PC

               Backups   M/S         MM           2PC          Paxos
Consistency    Weak      Eventual    Eventual     Strong
Transactions   No        Full        Local        Full
Latency        Low       Low         Low          High
Throughput     High      High        High         Low
Data Loss      High      Some        Some         None
Failover       Down      Read Only   Read/write   Read/write
Paxos
●   Decentralized, distributed consensus protocol
●   Protocol similar to 2PC/3PC
    ●   Lighter, but still high latency
●   Three classes of agents: proposers, acceptors, learners
●   1. a) prepare b) promise 2. a) accept b) accepted 
●   Survives minority failure
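
The prepare/promise and accept/accepted exchange above can be sketched for a single value (single-decree Paxos). This is a simplified, synchronous, in-memory illustration: the `Acceptor` class, the `propose` function, and the `reachable` parameter are hypothetical, learners are omitted, and no real networking or persistence is modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                          # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None  # (number, value) last accepted

    def prepare(self, n: int):
        """1b) promise: refuse anything numbered below n from now on."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """2b) accepted: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n: int, value: str, reachable=None):
    """Run one proposer round; return the chosen value, or None on failure."""
    reachable = acceptors if reachable is None else reachable
    majority = len(acceptors) // 2 + 1
    # 1a) prepare: collect promises from the acceptors that can be reached.
    promises = [prev for ok, prev in (a.prepare(n) for a in reachable) if ok]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    # 2a) accept: ask the reachable acceptors to accept (n, value).
    acks = sum(a.accept(n, value) for a in reachable)
    return value if acks >= majority else None
```

With five acceptors, a round still succeeds when two are unreachable, and a later proposer with a different value ends up re-proposing the value already chosen.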
Tradeoffs

               Backups   M/S         MM           2PC          Paxos
Consistency    Weak      Eventual    Eventual     Strong       Strong
Transactions   No        Full        Local        Full         Full
Latency        Low       Low         Low          High         High
Throughput     High      High        High         Low          Medium
Data Loss      High      Some        Some         None         None
Failover       Down      Read Only   Read/write   Read/write   Read/write
Examples
●   Megastore
    ●   Google's Scalable, Highly Available Datastore
    ●   Strong Consistency, Paxos
    ●   Optimized for reads
●   Dynamo
    ●   Amazon’s Highly Available Key-value Store
    ●   Eventual Consistency, Consistent Hashing, Vector Clocks
    ●   Optimized for writes
●   PNUTS
    ●   Yahoo's Massively Parallel & Distributed Database System
    ●   Timeline Consistency 
    ●   Optimized for reads
Conclusions
●   No silver bullet
    ●   There are no simple solutions
●   Design systems based on application needs
The End
Backup Slides
Vector Clocks
• Used to capture causality between different 
  versions of the same object.
• A vector clock is a list of (node, counter) pairs.
• Every version of every object is associated with 
  one vector clock.
• If every counter in the first object’s clock is
  less than or equal to the counter for the same node in the
  second clock, then the first is an ancestor of the
  second and can be forgotten.
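
The ancestor rule above can be written as a short comparison function. This is a minimal sketch that represents a vector clock as a plain dict from node name to counter; the function names and the node names `Sx`, `Sy`, `Sz` in the usage note are hypothetical.

```python
def descends(a: dict, b: dict) -> bool:
    """True if every counter in clock a is <= the same node's counter in b.
    Nodes missing from b are treated as having counter 0."""
    return all(count <= b.get(node, 0) for node, count in a.items())


def compare(a: dict, b: dict) -> str:
    """Classify the causal relationship of clock a relative to clock b."""
    a_le_b, b_le_a = descends(a, b), descends(b, a)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"      # a happened before b; a can be forgotten
    if b_le_a:
        return "descendant"    # b happened before a
    return "concurrent"        # neither dominates: versions must be reconciled


def increment(clock: dict, node: str) -> dict:
    """Return a new clock with node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new
```

For example, two writes handled by `Sx` followed by independent writes handled by `Sy` and `Sz` produce two clocks that `compare` reports as concurrent, so neither version can be discarded automatically.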
Vector Clock Example
Partitioning Algorithm

• Consistent hashing:
  – The output range of a hash 
    function is treated as a 
    fixed circular space or 
    “ring”.
• Virtual Nodes
  – Each node can be responsible 
    for more than one virtual 
    node.
  – When a new node is added, it 
    is assigned multiple 
    positions.
  – Various Advantages
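
The ring and virtual-node ideas above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the scheme of any particular system: the `HashRing` class, the use of SHA-256, and the `node#i` naming of virtual nodes are all choices made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring where each physical node owns several positions."""

    def __init__(self, vnodes: int = 8):
        self.vnodes = vnodes   # number of virtual nodes per physical node
        self._ring = []        # sorted list of (position, node)

    @staticmethod
    def _hash(key: str) -> int:
        """Map a string onto the circular hash space."""
        return int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")

    def add_node(self, node: str) -> None:
        """Assign the new node multiple positions on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        """Drop every position owned by the node."""
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the first virtual node."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring
```

When a node joins, only keys that now fall just before one of its positions change owner; every other key stays where it was, which is the main advantage over hashing modulo the node count.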
