Dissecting Scalable Database Architectures
Doug Judd
CEO, Hypertable Inc.
Talk Outline
• Scalable “NoSQL” Architectures
• Next-generation Architectures
• Future Evolution - Hardware Trends
Scalable NoSQL Architecture Categories
• Auto-sharding
• Dynamo
• Bigtable
Auto-Sharding
Auto-Sharding
Auto-sharding Systems
• Oracle NoSQL Database
• MongoDB
Dynamo
• “Dynamo: Amazon’s Highly Available Key-value Store”
   – Amazon.com, 2007
• Distributed Hash Table (DHT)
• Handles inter-datacenter replication
• Designed for High Write Availability
Consistent Hashing
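A minimal sketch of consistent hashing, to make the idea behind this slide concrete. The class name `HashRing`, the node names, and the use of eight virtual points per node are illustrative assumptions, not details taken from the talk; MD5 is used only as a stable hash function.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a position on the ring (MD5 serves only as a stable hash)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual points per node."""

    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several points so load spreads more evenly.
        self._points = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._points]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key, wrapping around.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._points[idx % len(self._points)][1]


keys = [f"cart:{n}" for n in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]

# Adding a node only pulls keys onto the new node; none move between old nodes.
assert all(after.node_for(k) == "node-d" for k in moved)
print(f"{len(moved)} of {len(keys)} keys moved to node-d")
```

Only the keys that now fall just before one of the new node's points change owner, which is what makes incremental scaling cheap.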
Eventual Consistency
Vector Clocks
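The ancestor rule for vector clocks can be sketched as follows. This is an illustration under assumptions: clocks are modelled as plain dictionaries from node name to counter, and the function names `descends` and `reconcile` are hypothetical.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a is equal to, or a descendant of, clock b."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def reconcile(versions):
    """Drop versions that are strict ancestors of another version.

    versions is a list of (clock, value) pairs. Whatever remains is a set of
    concurrent siblings that the application has to merge itself.
    """
    kept = []
    for i, (clock, value) in enumerate(versions):
        is_ancestor = any(
            j != i and other != clock and descends(other, clock)
            for j, (other, _) in enumerate(versions)
        )
        if not is_ancestor:
            kept.append((clock, value))
    return kept


d1 = ({"Sx": 1}, "v1")
d2 = ({"Sx": 2}, "v2")            # written after d1 on the same node
d3 = ({"Sx": 2, "Sy": 1}, "v3")   # descends from d2 via node Sy
d4 = ({"Sx": 2, "Sz": 1}, "v4")   # descends from d2 via node Sz

survivors = reconcile([d1, d2, d3, d4])
assert survivors == [d3, d4]      # d3 and d4 are concurrent, so both survive
```

Versions with identical clocks are both kept here; a real system would deduplicate them.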
Dynamo-based Systems
•   Cassandra
•   DynamoDB
•   Riak
•   Voldemort
Bigtable
• “Bigtable: A Distributed Storage System for Structured Data”
   - Google, Inc., OSDI ’06
• Ordered
• Consistent
• Not designed to handle inter-datacenter replication
Google Architecture
Google File System
Google File System
Table: Growth Process
Scaling (part 1)
Scaling (part 2)
Scaling (part 3)
System overview
Database Model
• Sparse, two-dimensional table with cell versions
• Cells are identified by a 4-part key
   •   Row (string)
   •   Column Family
   •   Column Qualifier (string)
   •   Timestamp
Table: Visual Representation
Table: Actual Representation
Anatomy of a Key
• Column Family is represented with 1 byte
• Timestamp and revision are stored big-endian,
  ones' complement
• Simple byte-wise comparison
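A sketch of why this encoding allows plain byte-wise comparison. The exact field layout below (NUL terminators after the row and qualifier, 8-byte timestamp and revision) is an assumption made for illustration and should not be read as the actual on-disk format.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def encode_key(row: str, family_id: int, qualifier: str,
               timestamp: int, revision: int) -> bytes:
    """Serialize a cell key so that bytes.__lt__ gives the desired sort order.

    Assumes row and qualifier contain no NUL bytes, family_id fits in one
    byte, and timestamp and revision are non-negative 64-bit integers.
    """
    return (
        row.encode("utf-8") + b"\x00"
        + bytes([family_id])                      # column family: 1 byte
        + qualifier.encode("utf-8") + b"\x00"
        + struct.pack(">Q", ~timestamp & MASK64)  # big-endian ones' complement
        + struct.pack(">Q", ~revision & MASK64)
    )


older = encode_key("com.example", 1, "title", timestamp=100, revision=1)
newer = encode_key("com.example", 1, "title", timestamp=200, revision=2)
other_row = encode_key("com.example.www", 1, "title", timestamp=50, revision=3)

# Within one cell the newest version sorts first; rows stay in string order.
assert sorted([older, other_row, newer]) == [newer, older, other_row]
```

Complementing the timestamp inverts its order, so a scan meets the most recent version of a cell first without any custom comparator.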
Log Structured Merge Tree
Range Server: CellStore
• Sequence of 65K blocks of
  compressed key/value pairs
Bloom Filter
• Associated with each Cell Store
• Dramatically reduces disk access
• Tells you if key is definitively not present
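A minimal Bloom filter sketch showing the "definitely not present" guarantee. The sizes, the SHA-256-based hashing, and the class and method names are illustrative assumptions rather than the filter used by any particular system.

```python
import hashlib


class BloomFilter:
    """Hypothetical Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means the key was never added; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


cell_store_filter = BloomFilter()
for row in (b"row-001", b"row-002", b"row-003"):
    cell_store_filter.add(row)

assert all(cell_store_filter.might_contain(r) for r in (b"row-001", b"row-002", b"row-003"))
if not cell_store_filter.might_contain(b"row-999"):
    print("row-999 is definitely absent, so the disk read can be skipped")
```

A lookup consults the filter first and only touches the store on disk when the answer is "possibly present".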
Request Routing
Bigtable-based Systems
• Accumulo
• HBase
• Hypertable
Next-generation Architectures
• PNUTS (Yahoo, Inc.)
• Spanner (Google, Inc.)
• Dremel (Google, Inc.)
PNUTS
• Geographically distributed database
• Designed for low-latency access
• Manages hashed or ordered tables of records
  • Hashed tables implemented via proprietary disk-based hash
  • Ordered tables implemented with MySQL+InnoDB
• Not optimized for bulk storage (images, videos, …)
• Runs as a hosted service inside Yahoo!
PNUTS System Architecture
Record-level Mastering
• Provides per-record timeline consistency
• Master is adaptively changed to suit workload
• Region names are two bytes associated with each record
PNUTS API
•   Read-any
•   Read-critical(required_version)
•   Read-latest
•   Write
•   Test-and-set-write(required_version)
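The semantics of these calls can be sketched with a toy in-memory record. Everything here is an assumption for illustration: one master copy, one lagging replica, integer versions, and hypothetical method names modelled on the bullet list above.

```python
class Record:
    """Toy model of a single record with a master copy and one stale replica."""

    def __init__(self, value):
        self.master = (1, value)    # (version, value) held at the record's master
        self.replica = (1, value)   # copy in another region, may lag behind

    def write(self, value) -> int:
        version = self.master[0] + 1
        self.master = (version, value)
        return version

    def replicate(self) -> None:
        # Stand-in for asynchronous propagation to the other region.
        self.replica = self.master

    def read_any(self):
        return self.replica         # cheapest read, possibly stale

    def read_latest(self):
        return self.master          # always the newest version

    def read_critical(self, required_version: int):
        # Any copy at least as new as required_version is acceptable.
        if self.replica[0] >= required_version:
            return self.replica
        return self.master

    def test_and_set_write(self, required_version: int, value):
        # Apply the write only if nobody else has written in the meantime.
        if self.master[0] != required_version:
            return None
        return self.write(value)


rec = Record("draft")
v2 = rec.write("published")
assert rec.read_any() == (1, "draft")             # replica has not caught up
assert rec.read_critical(v2) == (2, "published")  # forced to a fresh copy
assert rec.test_and_set_write(1, "lost update") is None
assert rec.test_and_set_write(2, "edited") == 3
```

The calls trade latency for freshness: the looser the version requirement, the more likely a nearby replica can answer.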
Spanner
•   Globally distributed database (cross-datacenter replication)
•   Synchronously Replicated
•   Externally-consistent distributed transactions
•   Globally distributed transaction management
•   SQL-based query language
Spanner Server Organization
Spanserver
• Manages 100-1000 tablets
• A tablet is similar to a Bigtable tablet and manages a bag of
  mappings:
              (key:string, timestamp:int64) -> string
• Single Paxos state machine implemented on top of each tablet
• Tablet may contain multiple directories
  • Set of contiguous keys that share a common prefix
  • Unit of data placement
  • Can be moved between Tablets for performance reasons
TrueTime
• Universal Clock
• Set of time master servers per-datacenter
  • GPS clock via GPS receivers with dedicated antennas
  • Atomic clock
• Time daemon runs on every machine
• TrueTime API:
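The Spanner paper describes the TrueTime API as three calls: `TT.now()` returns an interval that is guaranteed to contain the true time, `TT.after(t)` is true once `t` has definitely passed, and `TT.before(t)` is true while `t` has definitely not arrived. A sketch under assumptions (the class names, the fixed uncertainty `epsilon_s`, and the injectable clock are all hypothetical):

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: local clock plus or minus a fixed uncertainty bound."""

    def __init__(self, epsilon_s: float = 0.005, clock=time.time):
        self.epsilon_s = epsilon_s   # assumed constant; the real bound varies
        self._clock = clock

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, t: float) -> bool:
        return self.now().earliest > t    # t has definitely passed

    def before(self, t: float) -> bool:
        return self.now().latest < t      # t has definitely not arrived


# Deterministic demo with a fake clock instead of time.time.
fake_now = [100.0]
tt = TrueTime(epsilon_s=0.005, clock=lambda: fake_now[0])

commit_ts = tt.now().latest        # pick a timestamp no earlier than true time
assert not tt.after(commit_ts)     # too early to expose the commit
fake_now[0] += 0.011               # wait out the uncertainty window
assert tt.after(commit_ts)         # commit_ts is now safely in the past
```

Waiting until `after(commit_ts)` holds is the idea behind the commit wait the paper uses to obtain external consistency.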
Spanner Software Stack
Externally-consistent Operations
•   Read-Write Transaction
•   Read-Only Transaction
•   Snapshot Read (client-provided timestamp)
•   Snapshot Read (client-provided bound)
•   Schema Change Transaction
Dremel
•   Scalable, interactive ad-hoc query system
•   Designed to operate on read-only data
•   Handles nested data (Protocol Buffers)
•   Can run aggregation queries over trillion-row tables in seconds
Columnar Storage Format
• Novel format for storing lists of nested records (Protocol
  Buffers)
• Highly space-efficient
• Algorithm for dissecting list of nested records into columns
• Algorithm for reassembling columns into list of records
Multi-level Execution Trees
• Execution model for one-pass aggregations returning small
  and medium-sized results (very common at Google)
• Query gets re-written as it passes down the execution tree.
• On the way up, intermediate servers perform a parallel
  aggregation of partial results.
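A sketch of the one-pass aggregation pattern for a query of the form SUM(CountWords(txtField)) / COUNT(*). The two-level tree, the sample rows, and the whitespace-based word count are assumptions for illustration only.

```python
def leaf_scan(rows):
    """Leaf server: scan local rows once and emit a partial (sum, count)."""
    return (sum(len(text.split()) for text in rows), len(rows))


def combine(partials):
    """Intermediate or root server: merge partial results from its children."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)


# Three tablets of text rows, spread over two intermediate servers.
tablets = [["a b c", "d e"], ["f"], ["g h", "i j k l"]]

intermediate_1 = combine([leaf_scan(tablets[0]), leaf_scan(tablets[1])])
intermediate_2 = combine([leaf_scan(tablets[2])])
root_sum, root_count = combine([intermediate_1, intermediate_2])

average_words = root_sum / root_count
assert (root_sum, root_count) == (12, 5)
print(f"average words per row: {average_words:.1f}")
```

The average itself is not combinable, so the rewritten query carries sums and counts down the tree and the division happens only at the root.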
Performance
Example Queries
• SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1

• SELECT country, SUM(item.amount) FROM T2
  GROUP BY country

• SELECT domain, SUM(item.amount) FROM T2
  WHERE domain CONTAINS '.net'
  GROUP BY domain

• SELECT COUNT(DISTINCT a) FROM T5
Future Evolution - Hardware Trends
• SSD Drives
• Disk Drives
• Networking
Flash Memory Rated Lifetime (P/E Cycles)
     Source: The Bleak Future of NAND Flash Memory,
             Grupp et al., FAST 2012
Flash Memory Average BER at Rated Lifetime
     Source: The Bleak Future of NAND Flash Memory,
             Grupp et al., FAST 2012
Disk: Areal Density Trend
     Source: GPFS Scans 10 Billion Files in 43 Minutes.
             © Copyright IBM Corporation 2011
Disk: Maximum Sustained Bandwidth Trend
     Source: GPFS Scans 10 Billion Files in 43 Minutes.
             © Copyright IBM Corporation 2011
Time Required to Sequentially Fill a
SATA Drive
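The speaker notes cite roughly 40% annual growth in areal density against 15% annual growth in sustained bandwidth. A sketch of what that implies for fill time, assuming drive capacity tracks areal density; the starting point of a 1,000 GB drive at 100 MB/s is a hypothetical figure, not data from the talk.

```python
def fill_time_hours(capacity_gb: float, bandwidth_mb_s: float) -> float:
    """Hours needed to write the whole drive sequentially at the sustained rate."""
    return capacity_gb * 1000.0 / bandwidth_mb_s / 3600.0


def projected_fill_times(capacity_gb, bandwidth_mb_s, years,
                         capacity_cagr=0.40, bandwidth_cagr=0.15):
    """Fill time for each year, compounding both growth rates."""
    times = []
    for year in range(years + 1):
        capacity = capacity_gb * (1 + capacity_cagr) ** year
        bandwidth = bandwidth_mb_s * (1 + bandwidth_cagr) ** year
        times.append(fill_time_hours(capacity, bandwidth))
    return times


# Hypothetical starting drive: 1,000 GB at 100 MB/s sustained.
for year, hours in enumerate(projected_fill_times(1000, 100, 5)):
    print(f"year {year}: {hours:.1f} h to fill")
```

Under these assumptions the fill time grows by a factor of 1.40 / 1.15, about 22% per year, because capacity compounds faster than bandwidth.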
Average Seek Time
     Source: GPFS Scans 10 Billion Files in 43 Minutes.
             © Copyright IBM Corporation 2011
Average Rotational Latency
     Source: GPFS Scans 10 Billion Files in 43 Minutes.
             © Copyright IBM Corporation 2011
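Average rotational latency is the time for half a revolution, so it follows directly from spindle speed. A short worked calculation for the spindle speeds listed in the speaker notes; the function name is an assumption.

```python
def average_rotational_latency_ms(rpm: float) -> float:
    """Half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


for rpm in (7_200, 10_000, 15_000):
    print(f"{rpm:>6} RPM: {average_rotational_latency_ms(rpm):.2f} ms")
```

This gives about 4.17 ms at 7,200 RPM, 3.00 ms at 10,000 RPM, and 2.00 ms at 15,000 RPM. The figure can only improve by spinning the platters faster, which is why it has stayed nearly flat.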
Time Required to Randomly
Read a SATA Drive
Ethernet
• 10GbE
  • Starting to replace 1GbE for server NICs
  • De facto network port for new servers in 2014
• 40GbE
  • Data center core & aggregation
  • Top-of-rack server aggregation
• 100GbE
   • Service Provider core and aggregation
   • Metro and large Campus core
   • Data center core & aggregation
• No technology currently exists to transport 40 Gbps or 100 Gbps as a
  single stream over existing copper or fiber
• 40GbE & 100GbE solved using either 4 or 10 parallel 10GbE
  “lanes”
10GbE Adoption Curve (?)
     Source: CREHAN RESEARCH Inc. © Copyright 2012
The End
Thank you!

Editor's Notes

  • #6: Strengths: Multiple servers can satisfy reads. Weaknesses: 1. Failover, 2. Mapping service on a single machine, 3. Irregular growth patterns can cause imbalance.
  • #7: Strengths: Multiple servers can satisfy reads. Weaknesses: 1. Failover, 2. Mapping service on a single machine, 3. Irregular growth patterns can cause imbalance.
  • #8: Designed for their “Shopping Cart” service
  • #9: An important part of any DHT is the mechanism by which keys get mapped … Supports incremental scalability. A gossip protocol is used to propagate membership changes.
  • #10: Dynamo was designed for high write availability (Shopping Cart service). This is also how they handle inter-datacenter replication. Read repair (the downside is that reading becomes expensive).
  • #11: Dynamo uses vector clocks to help reconcile divergent copies of objects in the system. Vector clocks track the revision history of objects so that versions can be reconciled after a divergence. Any storage node in Dynamo is eligible to receive client get and put operations for any key. One vector clock is associated with every version of every object. If every counter in the first object’s clock is less than or equal to the corresponding counter in the second object’s clock, then the first is an ancestor of the second and can be forgotten.
  • #12: Strengths: Low-latency writes, handles inter-datacenter replication. Weaknesses: Not ordered, read repair can impact read latency.
  • #30: Strengths: Ordered, consistent. Weaknesses: Does not handle inter-datacenter replication.
  • #33: This diagram shows two regions … Data replication happens via the Yahoo! Message Broker (YMB), a pub/sub system that offers reliable message delivery. Storage units manage tablets. The tablet controller holds the authoritative mapping of tablets to storage units and also orchestrates tablet movement.
  • #34: PNUTS offers relaxed consistency guarantees across the dataset, but per-record timeline consistency through the use of record-level mastering. They’ve found this to be sufficient for most of their web use cases.
  • #35: Read-any: reads any (possibly stale) recent version. Low latency. Read-critical: reads any record that
  • #37: Placement driver handles automated movement of data across zones.
  • #39: Time daemons synchronize with time masters every 30 seconds and apply a drift rate of 200 microseconds/s between synchronizations. The reason for the two kinds of clock (GPS and atomic) is that they have different failure modes.
  • #41: Read-write transactions use pessimistic concurrency control (lock table). Reads are lock-free and can happen at any replica that is sufficiently up-to-date.
  • #44: Another key aspect of Dremel is how it handles certain common aggregation queries
  • #48: SLC – Single Level Cell
  • #50: Kryder’s law
  • #52: 40% CAGR for areal density, 15% CAGR for sustained write bandwidth
  • #54: 7,200 RPM, 10,000 RPM, 15,000 RPM