Handling Data in Mega Scale Web Systems
Vineet Gupta | GM – Software Engineering | Directi | http://www.vineetgupta.com
Intelligent People. Uncommon Ideas. Licensed under Creative Commons Attribution Sharealike Noncommercial
Digg: 22M+ users. Dozens of DB servers, dozens of web servers, and six specialized graph database servers to run the recommendations engine. Source: http://highscalability.com/digg-architecture
Technorati: 1 TB / day. 100 M blogs indexed / day, 10 B objects indexed / day, 0.5 B photos and videos. Data doubles in 6 months; users double in 6 months. Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
2 PB raw storage. 470 M photos, 4-5 sizes each; 400 k photos added / day; 35 M photos in Squid cache (total), 2 M photos in Squid RAM; 38k reqs / sec to Memcached; 4 B queries / day. Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
eBay: a virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters. 2 PB of data; 26 B queries / day; 1 B page views / day; 3 B API calls / month; 15,000 app servers. Source: http://highscalability.com/ebay-architecture/
Google: 450,000 low-cost commodity servers in 2006. Indexed 8 B web pages in 2005. 200 GFS clusters (1 cluster = 1,000 – 5,000 machines); read/write throughput of 40 GB / sec across a cluster. MapReduce: 100k jobs / day, 20 PB of data processed / day, 10k MapReduce programs. Source: http://highscalability.com/google-architecture/
Data Size ~ PB Data Growth ~ TB / day No of servers – 10s to 10,000 No of datacenters – 1 to 10 Queries – B+ / day
[Diagram: a single host running the app server and DB server; vertical scaling adds more CPUs and RAM to this one box]
Sun Fire X4640 M2: 8 x 6-core 2.6 GHz, $27k to $170k. PowerEdge R200: dual-core 2.8 GHz, around $550.
Vertical scaling: increasing the hardware resources on a host. Pros: simple to implement; fast turnaround time. Cons: finite limit; hardware does not scale linearly (diminishing returns for each incremental unit); upgrades require downtime; a bigger box increases the impact of downtime; incremental costs rise much faster than capacity.
[Diagram: app layer in front of a single database holding tables T1–T4]
[Diagram: app layer in front of five database nodes, each holding a full copy of tables T1–T4] Each node has its own copy of the data: a shared-nothing cluster.
Read : Write is roughly 4:1, so scale reads at the cost of writes! Duplicate the data: each node has its own copy. Master-Slave: writes are sent to one node and cascaded to the others. Multi-Master: writes can be sent to multiple nodes, which can lead to deadlocks and requires conflict management. (A routing sketch in Python follows the diagrams below.)
[Diagram: master-slave replication; the app layer sends writes to one master, which replicates to four slaves] n x writes, async vs. sync; the master is a SPOF; with async replication, critical reads must go to the master!
[Diagram: multi-master replication; the app layer can write to either of two masters, with three slaves behind them] n x writes, async vs. sync; no SPOF, but conflicts! Resolution cost grows as O(N²) or O(N³).
[Diagram: with a 4:1 read/write mix, every replica still takes every write; per-server load drops from 4R, 1W on one node to 2R, 1W on two nodes to 1R, 1W on four nodes, so replication scales reads but not writes]
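To make the read/write split concrete, here is a minimal routing sketch in Python (not from the talk). The MASTER and SLAVES names and the statement-prefix check are illustrative stand-ins for real connection handling, assuming the master-slave setup pictured above.

```python
import random

MASTER = "master-db"                                   # hypothetical connection handles
SLAVES = ["slave-db-1", "slave-db-2", "slave-db-3", "slave-db-4"]

def route(query):
    """Send writes to the master, spread reads across the slaves."""
    is_write = query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
    if is_write:
        return MASTER               # every write still hits the master (n x writes overall)
    return random.choice(SLAVES)    # reads scale out across replicas

print(route("SELECT * FROM posts"))    # one of the slaves
print(route("UPDATE users SET ..."))   # master

# With asynchronous replication a slave may lag, so "critical" reads that must
# see the latest write are routed to the master as well.
```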
Vertical partitioning: divide data by tables / columns; scales to at most as many boxes as there are tables or columns, so it is finite. Horizontal partitioning: divide data by rows; scales to as many boxes as there are rows, giving effectively limitless scaling.
[Diagram: app layer in front of a single database holding tables T1–T5]
[Diagram: vertical partitioning; tables T1–T5 placed on separate nodes behind the app layer] At Facebook, for example, the user table and the posts table can sit on separate nodes. Joins then need to be done in code (so why have them at all?).
[Diagram: horizontal partitioning; every node holds all tables T1–T5, with the first node holding the first million rows, the second node the second million rows, and the third node the third million rows]
Value-based: split on the timestamp of posts, or on the first letter of the user name. Hash-based: use a hash function on the key to determine the cluster. Lookup map: an explicit directory, assigned first-come-first-serve or round-robin. (The hash-based and lookup-map splits are sketched below.)
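A minimal sketch of the hash-based and lookup-map splits, in Python; the shard names and the modulo placement are assumptions for illustration only.

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]   # hypothetical shard names

def hash_partition(key):
    """Hash-based split: a stable hash of the key picks the shard."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# Lookup-map split: an explicit directory remembers where each key lives,
# assigned round-robin (or first-come-first-serve) as new keys arrive.
lookup_map = {}
next_shard = 0

def lookup_partition(key):
    global next_shard
    if key not in lookup_map:
        lookup_map[key] = SHARDS[next_shard % len(SHARDS)]
        next_shard += 1
    return lookup_map[key]

print(hash_partition("alice"), lookup_partition("alice"))
```

Note that plain modulo hashing forces mass relocation of keys when a shard is added; the consistent-hashing ring used by Dynamo (sketched later) avoids that.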
Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
In distributed systems, much weaker forms of consistency are often acceptable, e.g. when there are only a few (or even one) possible writers of the data, when data is read-mostly (seldom modified), and/or when stale data may be acceptable. Eventual consistency: if no updates take place for a long time, all replicas will eventually become consistent. Implementation: we need only ensure that updates eventually reach all of the replicated copies of the data.
Monotonic Reads: if a process sees version x of a data item at time t, it will never see an older version at a later time. Monotonic Writes: a write operation by a process on a data item x is completed before any successive write operation on x by the same process. Read your writes: the effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process. Writes follow Reads: a write occurs on a copy of x that is at least as recent as the last copy read by the process.
Many kinds of computing are “append-only”: lots of observations are made about the world (debits, credits, purchase orders, customer change requests, etc.), and as time moves on more observations are added; you can’t change the history, but you can add new observations. Derived results may be calculated, such as an estimate of the “current” inventory (frequently inaccurate). Historic rollups are calculated, such as monthly bank statements.
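As a small illustration of the append-only pattern (not from the talk), the sketch below keeps an immutable log of observations and recomputes a derived result from it; the ledger layout is an assumption.

```python
from datetime import date

# Append-only log of observations: entries are added, never updated or deleted.
ledger = [
    {"date": date(2009, 3, 1), "type": "credit", "amount": 500},
    {"date": date(2009, 3, 5), "type": "debit",  "amount": 120},
]

def record(entry):
    ledger.append(entry)            # history only grows

def balance():
    """A derived result, recomputed from the immutable history (it may lag reality)."""
    return sum(e["amount"] if e["type"] == "credit" else -e["amount"] for e in ledger)

record({"date": date(2009, 3, 9), "type": "debit", "amount": 30})
print(balance())                    # 350: a monthly statement is just such a rollup of the log
```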
 
5 joins for 1 query! Do you think FB would do this? And how would you do joins with partitioned data? De-normalization removes joins, but it increases data volume (though disk is cheap and getting cheaper) and can lead to inconsistent data (but only if we do UPDATEs and DELETEs).
Normalization’s goal is eliminating update anomalies: data can be changed without “funny behavior”, and each data item lives in one place.

Emp #   Emp Name   Mgr #   Mgr Name   Emp Phone   Mgr Phone
47      Joe        13      Sam        5-1234      6-9876
18      Sally      38      Harry      3-3123      5-6782
91      Pete       13      Sam        2-1112      6-9876
66      Mary       02      Betty      5-7349      4-0101

The classic problem with de-normalization: you can’t update Sam’s phone # since there are many copies. De-normalization is OK if you aren’t going to update! Source: http://blogs.msdn.com/pathelland/
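A tiny sketch of that anomaly (illustrative only), with the de-normalized rows held as Python dicts: updating Sam's phone number means finding and changing every copy.

```python
# De-normalized rows: the manager's phone number is copied into every employee row.
rows = [
    {"emp": 47, "emp_name": "Joe",   "mgr": 13, "mgr_name": "Sam",   "mgr_phone": "6-9876"},
    {"emp": 18, "emp_name": "Sally", "mgr": 38, "mgr_name": "Harry", "mgr_phone": "5-6782"},
    {"emp": 91, "emp_name": "Pete",  "mgr": 13, "mgr_name": "Sam",   "mgr_phone": "6-9876"},
]

# Changing Sam's phone means touching every row that carries a copy;
# miss one and the data is inconsistent.
for row in rows:
    if row["mgr"] == 13:
        row["mgr_phone"] = "6-1111"

# If the data is immutable (append-only), this never arises, which is why
# de-normalization is safe when you never UPDATE or DELETE.
```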
Partitioning for scaling Replication for availability No ACID transactions No JOINs Immutable data No cascaded UPDATEs and DELETEs
 
Partitioning – for R/W scaling Replication – for availability Versioning – for immutable data Eventual Consistency Error detection and handling
Google – BigTable Amazon – Dynamo Facebook – Cassandra (BigTable + Dynamo) LinkedIn – Voldemort (similar to Dynamo) Many more
Tens of millions of customers served at peak times, tens of thousands of servers, and both customers and servers distributed worldwide.
Source: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html. An eventually consistent data store: always writable, decentralized, and all nodes have the same responsibilities.
 
Similar to Chord: each node gets an ID from the space of keys, and nodes are arranged in a ring. Data is stored on the first node clockwise of the data key's position. Replication: a preference list of N nodes following the associated node.
A problem with the Chord scheme: nodes are placed randomly on the ring, which leads to uneven data and load distribution. Dynamo uses “virtual” nodes: each physical node has multiple virtual nodes, more powerful machines get more virtual nodes, and the virtual nodes are distributed across the ring (sketched below).
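A minimal consistent-hashing sketch with virtual nodes (MD5 and the node names are arbitrary choices, and this is only in the spirit of the Chord/Dynamo ring, not Dynamo's actual implementation):

```python
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes):
        # nodes: physical node -> number of virtual nodes
        # (more powerful machines get more virtual nodes).
        self.ring = sorted(
            (_hash("%s#%d" % (node, i)), node)
            for node, count in nodes.items()
            for i in range(count)
        )

    def preference_list(self, key, n=3):
        """The first n distinct physical nodes clockwise of the key's position."""
        start = bisect.bisect(self.ring, (_hash(key),))
        nodes = []
        for offset in range(len(self.ring)):
            _, node = self.ring[(start + offset) % len(self.ring)]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

ring = Ring({"node-a": 16, "node-b": 16, "node-c": 32})   # node-c is the beefier machine
print(ring.preference_list("user:1234"))                  # coordinator + replicas for this key
```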
Each update generates a new timestamp; vector clocks are used. With eventual consistency, multiple versions of the same object might co-exist. Syntactic reconciliation: the system might be able to resolve conflicts automatically. Semantic reconciliation: conflict resolution is pushed to the application.
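A bare-bones vector-clock sketch (the node names are hypothetical): each replica bumps its own counter on a write, descendant versions can be reconciled syntactically, and concurrent versions surface as a conflict for the application.

```python
def increment(clock, node):
    """Return a new clock with this node's counter bumped (one write handled by `node`)."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if version a supersedes version b (every counter in a is >= its value in b)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

v1 = increment({}, "node-a")     # {'node-a': 1}
v2 = increment(v1, "node-a")     # {'node-a': 2}: supersedes v1, reconciled syntactically
v3 = increment(v1, "node-b")     # {'node-a': 1, 'node-b': 1}: concurrent with v2

print(descends(v2, v1))                     # True
print(descends(v2, v3), descends(v3, v2))   # False False: a conflict, so semantic
                                            # reconciliation falls to the application
```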
 
A request arrives at a node (the coordinator), ideally the node responsible for the particular key; otherwise the request is forwarded to the node responsible for that key, and that node becomes the coordinator. The first N healthy and distinct nodes following the key position are considered for the request. The application defines N = the number of nodes participating in a request (the replicas), R = the number of nodes required for a successful read, and W = the number of nodes required for a successful write. R + W > N gives a quorum.
Writes: the coordinator generates a new vector clock and writes locally, then forwards to the N nodes; if W-1 respond, the write is successful. Reads: the coordinator forwards to the N nodes; if R-1 respond, it forwards the result to the user, with only unique responses (versions) forwarded; the user handles merging if multiple versions exist.
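A toy in-memory simulation of the quorum rule (the stores dict, the alive set and the node names are invented scaffolding, and vector clocks are omitted for brevity):

```python
stores = {n: {} for n in ("node-a", "node-b", "node-c")}   # one key/value store per replica
alive = {"node-a", "node-b", "node-c"}                     # replicas that answer in time

def quorum_write(pref_list, key, value, n=3, w=2):
    acks = 0
    for node in pref_list[:n]:      # coordinator writes locally, then the other replicas
        if node in alive:
            stores[node][key] = value
            acks += 1
    return acks >= w                # success only if W replicas stored the value

def quorum_read(pref_list, key, n=3, r=2):
    replies = [stores[node][key] for node in pref_list[:n]
               if node in alive and key in stores[node]]
    if len(replies) < r:
        raise TimeoutError("read quorum not reached")
    return set(replies)             # distinct versions only; the caller merges if > 1

alive.discard("node-c")             # one replica is down
print(quorum_write(["node-a", "node-b", "node-c"], "cart:42", "v1"))   # True (2 acks, W=2)
print(quorum_read(["node-a", "node-b", "node-c"], "cart:42"))          # {'v1'}
```

With R + W > N (here 2 + 2 > 3), any read quorum overlaps any write quorum, so a read sees at least one up-to-date replica.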
Sloppy quorum: read/write ops are performed on the first N healthy nodes, which increases availability. Hinted handoff: if a node in the preference list is not available, the replica is sent to a node further down the list, with a hint containing the identity of the original node; the receiving node keeps checking for the original and transfers the replica back once it becomes available.
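Continuing the same toy simulation, a sketch of sloppy quorum plus hinted handoff; the hints list, the tick() helper and the node names are hypothetical.

```python
stores = {n: {} for n in ("node-a", "node-b", "node-c", "node-d")}
alive = {"node-a", "node-b", "node-d"}      # node-c, a "home" replica, is unreachable
hints = []                                  # (stand_in, intended_owner, key, value)

def sloppy_write(pref_list, key, value, n=3):
    healthy = [node for node in pref_list if node in alive][:n]   # first N healthy nodes
    for node in healthy:
        stores[node][key] = value
    down_owners = [node for node in pref_list[:n] if node not in alive]
    stand_ins = [node for node in healthy if node not in pref_list[:n]]
    for stand_in, owner in zip(stand_ins, down_owners):           # hint names the intended node
        hints.append((stand_in, owner, key, value))

def tick():
    """Run periodically: if an intended owner is reachable again, hand the replica off."""
    for stand_in, owner, key, value in list(hints):
        if owner in alive:
            stores[owner][key] = value
            del stores[stand_in][key]
            hints.remove((stand_in, owner, key, value))

sloppy_write(["node-a", "node-b", "node-c", "node-d"], "cart:42", "v2")
alive.add("node-c")                         # node-c recovers
tick()
print("cart:42" in stores["node-c"])        # True: the replica was handed back to its owner
```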
Replica synchronization: each node maintains a separate Merkle tree for each key range it hosts; to synchronize, two nodes exchange the roots of the trees for their common key ranges and quickly determine divergent keys by comparing hashes.
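A very small Merkle-root sketch (SHA-1 and the flat item list are arbitrary simplifications; Dynamo keeps one tree per key range): replicas compare roots and only walk down the tree when the roots differ.

```python
import hashlib

def h(data):
    return hashlib.sha1(data).digest()

def merkle_root(items):
    """Root hash of a small Merkle tree built over the items in a key range."""
    level = [h(item) for item in items] or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:      # duplicate the last node when a level has an odd count
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Equal roots mean the key range is already in sync; different roots mean the
# replicas recurse into the subtrees to find exactly which keys diverge.
replica_a = [b"k1=v1", b"k2=v2", b"k3=v3"]
replica_b = [b"k1=v1", b"k2=v2", b"k3=stale"]
print(merkle_root(replica_a) == merkle_root(replica_b))   # False: needs synchronization
```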
Ring membership: membership is explicit, to avoid re-balancing of partition assignment; background gossip builds a one-hop DHT; an external entity bootstraps the system to avoid partitioned rings. Failure detection: if node A finds node B unreachable (for servicing a request), A uses other nodes to service requests and periodically checks B; A does not assume B to have failed; there is no globally consistent view of failure (because ring membership is explicit).
(N, R, W) is application-configurable. Every node is aware of the data hosted by its peers, which requires gossiping the full routing table with the other nodes; scalability is limited by this to a few hundred nodes, and a hierarchy may help to overcome the limitation.
The typical Dynamo configuration for (N, R, W) is (3, 2, 2). Some implementations vary (N, R, W): an always-writable store might use W=1 (shopping cart), while a product catalog might use R=1 and W=N. The response requirement is 300 ms for any request (read or write).
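Those tunings could be written down as simple presets; the names and the R value for the cart case are illustrative guesses rather than values from the Dynamo paper.

```python
# Per-application (N, R, W) tunings; "shopping_cart" R is an illustrative guess.
PRESETS = {
    "default":         dict(n=3, r=2, w=2),   # R + W > N: quorum reads and writes
    "shopping_cart":   dict(n=3, r=2, w=1),   # always writable: one ack completes a write
    "product_catalog": dict(n=3, r=1, w=3),   # written rarely (W=N), read fast from one replica
}
```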
Consistency vs. availability: 99.94% of requests saw one version, 0.00057% saw two, 0.00047% saw three, and 0.00009% saw four. Coordination can be server-driven or client-driven. Server-driven uses load balancers and forwards requests to the desired set of nodes. Client-driven is about 50% faster, requires polling Dynamo membership updates, and makes the client responsible for determining the appropriate nodes to send the request to. Successful responses (without time-out): 99.9995%.
 
 
 
 
Enormous data (and high growth); traditional solutions don’t work; distributed databases; lots of interesting work happening. A great time for young programmers with problem-solving ability!
 
Intelligent People. Uncommon Ideas. Licensed under Creative Commons Attribution Sharealike Noncommercial