And now for something completely
different...
WiredTiger:
Fast data structures in C
Keith Bostic
MongoDB WiredTiger team
keith.bostic@mongodb.com
#MDBE16
You are here: database layers
Middleware
Networking
Query APIs
Storage Engine
#MDBE16
Storage engines are performance critical
Middleware
Networking
Query APIs
mmapV1
Storage Engine
RocksDB
Storage Engine
WiredTiger
Storage Engine
ACID
transactional guarantees
#MDBE16
WiredTiger
•  From (some of) the folks that brought you Berkeley DB
•  High performance data engine
•  scalable throughput with low latency
•  MongoDB’s default storage engine
•  a general-purpose workhorse
Next
Ø  Hardware (is the problem)
•  Hazard pointers
•  Skiplists
•  Ticket locks
#MDBE16
Modern servers have many CPUs/cores
[Diagram: cores 1, 2, 3, ... N]
#MDBE16
Each core has multiple memory caches
[Diagram: cores 1, 2, 3, ... N, each with two or more caches]
#MDBE16
Cache coherence: cores “snoop” on writes
[Diagram: cores 1, 2, 3, ... N, each with two or more caches, sharing Main Memory]
#MDBE16
Traditional data engines struggle with this architecture
•  Writing “shared” memory is slow
•  but databases exist to manage shared access to data!
•  Snoopy cache-coherence scales poorly
#MDBE16
Programmers solve with locking
•  Locks are complex objects
•  get exclusive access to the lock state
•  review and update the lock state
•  “publish” (ensure every CPU sees the changes)
•  release exclusive access
#MDBE16
Locking is slow
•  Every operation requires exclusive access
•  even shared (“read”) locks require a lock/unlock cycle
•  thread stall is inevitable
•  Locks require notification of every CPU
•  Locks require exclusive access to the memory bus
#MDBE16
Locking is expensive
•  A lock per object is too much memory
•  POSIX locks cache-aligned, up to 128B
•  grouping objects under locks makes contention worse
•  More complexity to make locks “fair” and avoid starvation
•  add thread queues
•  wake up the next thread waiting for the lock
#MDBE16
We need to find something else
If we can’t use locks, what do we use instead?
Today we’re going to talk about ways to get rid of locks.
#MDBE16
WiredTiger is written in C
•  Java or C++ are better choices for system programming
•  automatic memory management vs. malloc/free
•  exception handling vs. explicit error paths
•  widespread availability of reusable components
•  Giving up programmer productivity
#MDBE16
C is “portable assembler”
•  Marshal typed values to/from unaligned memory
•  streaming compression, encryption, checksums
•  unstructured I/O to/from stable storage
•  Light-weight access to shared data
•  use the underlying machine primitives that make up locks
•  algorithms/structures based on those primitives
You may have seen this last year:
Next
•  Hardware
Ø  Hazard pointers
•  Skiplists
•  Ticket locks
#MDBE16
Pages in the WiredTiger cache
page 2
page 6
page 8
page 9
Lots and lots (and lots) of pages
MongoDB worker threads read from disk
WiredTiger server threads evict to disk
#MDBE16
A reasonable page-locking implementation
•  MongoDB worker threads read, modify pages
•  WiredTiger server threads evict pages from the cache
•  Allocate a lock per page
•  MongoDB worker threads share pages
•  WiredTiger eviction threads require exclusive access
#MDBE16
Page locking in the WiredTiger cache
[Diagram: pages 2, 6, 8 and 9, each with its own lock; threads shown: eviction, writer, reader]
thread stall on read locks!
vulnerable to starvation
too much memory
#MDBE16
Introducing memory barriers
•  Memory barriers
•  order reads, writes or both across a line of code
•  compiler won’t cache values or reorder across a barrier
•  Locks imply memory barriers
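A minimal sketch of the publish pattern that barriers make possible, written with C11 `<stdatomic.h>` fences. The names `payload`, `ready`, `publish` and `consume` are hypothetical and chosen for illustration; WiredTiger's own barrier macros are not shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state: a payload plus a flag saying it is ready. */
static int payload;
static atomic_bool ready;

/* Writer: store the payload, fence, then set the flag. */
static void publish(int value) {
    payload = value;
    /* Release fence: the payload write cannot move below the flag write. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: returns true and fills *out only once the payload is published. */
static bool consume(int *out) {
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    /* Acquire fence: the payload read cannot move above the flag read. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

Without the two fences, the compiler or CPU would be free to reorder the payload access around the flag access, and a reader could observe the flag before the data.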
#MDBE16
Something faster
•  Hazard pointers: a technique for avoiding locks
•  MongoDB worker threads
•  “log” intention to access a page
•  publish: a memory barrier to ensure global CPU visibility
•  Write to a per-thread memory location
•  write won’t collide with other worker threads
#MDBE16
What about eviction starvation?
•  Add a per-page “blocker”
•  MongoDB worker won’t proceed if the page is blocked
•  Cheap:
•  it’s only a bit of information
•  a read-only operation for workers
#MDBE16
Worker threads
•  Publish intent to access the page
•  Memory barrier to ensure global CPU visibility
•  If the page is not blocked, it’s accessible
•  Clear intent to access when done
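The worker-side steps can be sketched in C11 as follows. This is an illustration under stated assumptions, not WiredTiger's code: `struct page`, `struct hazard_slot`, `page_acquire` and `page_release` are hypothetical names, and each worker is assumed to own exactly one slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page carrying the per-page "blocker" flag set by eviction. */
struct page {
    atomic_bool blocked;
};

/* Hypothetical per-thread slot: only the owning worker ever writes it. */
struct hazard_slot {
    _Atomic(struct page *) page;
};

/* Try to access a page; returns false if eviction has blocked it. */
static bool page_acquire(struct hazard_slot *slot, struct page *p) {
    assert(slot != NULL && p != NULL);
    /* Publish intent to access the page. */
    atomic_store_explicit(&slot->page, p, memory_order_relaxed);
    /* Full barrier: the intent must be visible before the flag is read. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&p->blocked, memory_order_acquire)) {
        /* Blocked: withdraw the intent and let the caller retry later. */
        atomic_store_explicit(&slot->page, NULL, memory_order_release);
        return false;
    }
    return true;
}

/* Clear intent to access when done. */
static void page_release(struct hazard_slot *slot) {
    assert(slot != NULL);
    atomic_store_explicit(&slot->page, NULL, memory_order_release);
}
```

The only shared write a worker makes is to its own slot, so workers never contend with each other on the same memory location.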
#MDBE16
Hazard pointers for workers
[Diagram: pages 2, 6, 8 and 9, each with a flag; the writer and reader threads each keep a per-thread list of the pages they are accessing (entries shown: pages 2, 6 and 9)]
#MDBE16
Eviction server
•  Block future worker thread access
•  Memory barrier to ensure global CPU visibility
•  Review worker thread access intentions
•  can either wait or quit
•  Unblock worker thread access when done
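The eviction-side steps, sketched in C11 under the same kind of assumptions: `struct page`, `struct hazard_slot`, `evict_try_lock` and `evict_unlock` are hypothetical names, and the worker slots are assumed to live in one array. This version takes the "quit" option rather than waiting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct page {
    atomic_bool blocked;          /* per-page blocking flag */
};

struct hazard_slot {
    _Atomic(struct page *) page;  /* page a worker intends to access */
};

/*
 * Try to get exclusive access to a page. Returns true, leaving the page
 * blocked, if no worker has published intent; otherwise unblocks and quits.
 */
static bool evict_try_lock(struct page *p, struct hazard_slot *slots,
    size_t nslots) {
    assert(p != NULL && (slots != NULL || nslots == 0));
    /* Block future worker thread access. */
    atomic_store_explicit(&p->blocked, true, memory_order_relaxed);
    /* Full barrier: the block must be visible before slots are reviewed. */
    atomic_thread_fence(memory_order_seq_cst);
    /* Review worker thread access intentions. */
    for (size_t i = 0; i < nslots; i++) {
        if (atomic_load_explicit(&slots[i].page, memory_order_acquire) == p) {
            atomic_store_explicit(&p->blocked, false, memory_order_release);
            return false;
        }
    }
    return true;
}

/* Unblock worker thread access when done. */
static void evict_unlock(struct page *p) {
    assert(p != NULL);
    atomic_store_explicit(&p->blocked, false, memory_order_release);
}
```

The scan over every slot is where the cost has moved: page access is cheap for workers, and the eviction server pays for the review.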
#MDBE16
Hazard pointers for workers and eviction
[Diagram: pages 2, 6, 8 and 9, each with a flag; the writer and reader threads with their per-thread lists (entries shown: pages 2, 6 and 9); the eviction thread]
#MDBE16
Something faster: hazard pointers
Replaces two lock/unlock pairs for each page access
... with a single memory barrier instruction.
•  Transfers work to the eviction server
•  MongoDB worker latency is what we care about
•  Memory costs
•  per-worker-thread list
•  per-page blocking flag
Next
•  Hardware
•  Hazard pointers
Ø  Skiplists
•  Ticket locks
#MDBE16
Introducing atomic instructions
•  Atomic increment or decrement
•  read a value
•  change it and store it back without the possibility of racing
•  Based on the compare-and-swap (CAS) instruction
•  read value
•  update value if the value is unchanged
•  but fail if the value has changed
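As an illustration, an atomic increment can be built from compare-and-swap with a retry loop. This sketch uses C11 `<stdatomic.h>`; `cas_increment` is a hypothetical name. C11 also provides `atomic_fetch_add`, which compilers typically map to a native instruction where one exists.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically add 1 to *counter and return the new value. */
static uint64_t cas_increment(_Atomic uint64_t *counter) {
    assert(counter != NULL);
    /* Read a value. */
    uint64_t old = atomic_load(counter);
    /*
     * Update the value only if it is unchanged. On failure the call
     * reloads `old` with the current value, so the loop simply retries.
     */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1))
        ;
    return old + 1;
}
```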
#MDBE16
Atomic prepend to singly-linked list
Update head if (and only if) head’s value is unchanged
head
NEW
new.next = head
compare_and_swap(head, new.next, new)
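The same prepend written out in C11, as a sketch: `struct node` and `list_prepend` are hypothetical names. The retry loop covers the case where another thread changes head between the read and the swap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Prepend n to the list whose head pointer is *head, without locking. */
static void list_prepend(_Atomic(struct node *) *head, struct node *n) {
    assert(head != NULL && n != NULL);
    /* new.next = head */
    n->next = atomic_load(head);
    /*
     * Swing head to n only if head still equals n->next. On failure the
     * call stores the current head into n->next, and the loop retries.
     */
    while (!atomic_compare_exchange_weak(head, &n->next, n))
        ;
}
```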
#MDBE16
How WiredTiger uses skiplists
•  WiredTiger pages start with a disk image
•  a compact representation we don’t want to modify
•  Inserts and updates for the disk image stored in skiplists
#MDBE16
Skiplists start with a linked list
Singly-linked list with sorted values: 7, 10, 13, 18, 21, 24
7 10 13 18 21 24
#MDBE16
Skiplists: add additional linked lists
Each higher level “skips” over more of the list
Level 3: 7, 21
Level 2: 7, 13, 21, 24
Level 1: 7, 10, 13, 18, 21, 24
#MDBE16
Search for 18
search starts at the top-level
Level 3: 7, 21
Level 2: 7, 13, 21, 24
Level 1: 7, 10, 13, 18, 21, 24
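A short sketch of the search in C, assuming a three-level list like the one on the slide. `struct skip_node`, `struct skiplist` and `skip_search` are hypothetical names; array index 0 holds level 1 (the full list) and index 2 holds level 3. For 18 the search visits 7 on level 3, sees that 21 overshoots, drops to level 2 and moves to 13, sees 21 again, drops to level 1 and finds 18.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVEL 3

struct skip_node {
    int key;
    struct skip_node *next[MAX_LEVEL]; /* next[0] is the bottom, full list */
};

/* head[i] points at the first node on level i + 1. */
struct skiplist {
    struct skip_node *head[MAX_LEVEL];
};

/* Return the node holding key, or NULL if the key is absent. */
static struct skip_node *skip_search(const struct skiplist *s, int key) {
    assert(s != NULL);
    /* Search starts at the top level, at the head's link array. */
    struct skip_node *const *links = s->head;
    for (int level = MAX_LEVEL - 1; level >= 0;) {
        struct skip_node *n = links[level];
        if (n != NULL && n->key == key)
            return n;
        if (n != NULL && n->key < key)
            links = n->next; /* move right along this level */
        else
            level--;         /* overshoot or end of level: drop down */
    }
    return NULL;
}
```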
#MDBE16
Skiplists, the great
Replaces a lock/unlock pair over the entire skiplist
with one atomic memory instruction per object level
•  Insert without locking
•  Search without locking, while inserting
•  Forward & backward traversal without locking, while inserting
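A sketch of insert without locking, using one compare-and-swap per level of the new node. It assumes nodes are never removed while inserts run, and all names (`struct skip_node`, `skip_find`, `skip_insert`) are hypothetical. It repeats the search for every level for simplicity, which a tuned implementation would likely avoid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LEVEL 3

struct skip_node {
    int key;
    int level; /* number of lists this node joins, 1..MAX_LEVEL */
    _Atomic(struct skip_node *) next[MAX_LEVEL];
};

struct skiplist {
    _Atomic(struct skip_node *) head[MAX_LEVEL];
};

/* Per level: the link that should point at key, and its current target. */
static void skip_find(struct skiplist *s, int key,
    _Atomic(struct skip_node *) *prev[MAX_LEVEL],
    struct skip_node *succ[MAX_LEVEL]) {
    _Atomic(struct skip_node *) *links = s->head;
    for (int i = MAX_LEVEL - 1; i >= 0; i--) {
        struct skip_node *n;
        while ((n = atomic_load(&links[i])) != NULL && n->key < key)
            links = n->next;
        prev[i] = &links[i];
        succ[i] = n;
    }
}

/* Insert n; returns false if its key is already present. */
static bool skip_insert(struct skiplist *s, struct skip_node *n) {
    _Atomic(struct skip_node *) *prev[MAX_LEVEL];
    struct skip_node *succ[MAX_LEVEL];
    assert(s != NULL && n != NULL && n->level >= 1 && n->level <= MAX_LEVEL);
    /* Link bottom-up so the node is reachable on the full list first. */
    for (int i = 0; i < n->level; i++) {
        for (;;) {
            skip_find(s, n->key, prev, succ);
            if (i == 0 && succ[0] != NULL && succ[0]->key == n->key)
                return false;
            /* Set the new node's link before making it visible. */
            atomic_store(&n->next[i], succ[i]);
            struct skip_node *expected = succ[i];
            if (atomic_compare_exchange_strong(prev[i], &expected, n))
                break; /* linked at this level; otherwise search again */
        }
    }
    return true;
}
```

A concurrent search sees a node either fully linked at a level or not at all, because the node's own link is stored before the compare-and-swap publishes it.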
#MDBE16
Skiplists, the good
•  Simpler code than a Btree
•  WiredTiger binary search ~200 lines of code
•  a typical skiplist search < 20
•  Fast search
•  a Btree guarantees search in logarithmic time
•  skiplists don’t offer a guarantee, but are usually close
#MDBE16
Skiplists, the not-so-good
•  Cache-unfriendly
•  every indirection a CPU cache miss
•  Memory-unfriendly
•  needs more memory for a data set than a Btree
•  Removal requires locking
•  WiredTiger is an MVCC engine (multiple values per key)
•  removal less important to WiredTiger
Next
•  Hardware
•  Hazard pointers
•  Skiplists
Ø  Ticket locks
#MDBE16
Ticket locks
•  WiredTiger still needs to lock objects
•  but we can make locks faster
•  Ticket locks
•  customers take a unique ticket number
•  customers served in ticket order
#MDBE16
Ticket locks
Please Take a Number
39 40 41 42 43
Now Serving
#MDBE16
Ticket locks
•  Two incrementing counters:
ticket: the next available ticket number
serving: the ticket number now being served
•  Thread takes a ticket number
•  Thread increments “next available”
•  Thread waits for “serving” to match its ticket number
•  When thread finishes, increments “serving”
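The two-counter scheme can be sketched in C11 as a simple spinning lock. `struct ticket_lock`, `ticket_lock_acquire` and `ticket_lock_release` are hypothetical names; a production lock would likely yield or sleep instead of spinning indefinitely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct ticket_lock {
    _Atomic uint16_t ticket;  /* the next available ticket number */
    _Atomic uint16_t serving; /* the ticket number now being served */
};

static void ticket_lock_acquire(struct ticket_lock *l) {
    assert(l != NULL);
    /* Take a ticket and increment "next available" in one atomic step. */
    uint16_t mine =
        atomic_fetch_add_explicit(&l->ticket, 1, memory_order_relaxed);
    /* Wait for "serving" to match our ticket number. */
    while (atomic_load_explicit(&l->serving, memory_order_acquire) != mine)
        ;
}

static void ticket_lock_release(struct ticket_lock *l) {
    assert(l != NULL);
    /* Finished: serve the next ticket holder. */
    atomic_fetch_add_explicit(&l->serving, 1, memory_order_release);
}
```

Both counters are unsigned and wrap around together, so the lock keeps working after 65,536 acquisitions.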
#MDBE16
Ticket locks serialize threads
[Diagram: Threads A and B holding ticket numbers against the “Now Serving” counter; values shown range from 39 to 41]
#MDBE16
Ticket locks are almost what we need
•  Ticket locks avoid starvation and are “fair”
•  Smaller memory footprint
•  Can be made significantly faster than POSIX locks
•  remember our compare-and-swap instructions!
•  But POSIX locks are shared between readers
#MDBE16
Ticket locks: shared vs. exclusive
•  Three incrementing counters:
ticket: the next available ticket number
readers: the next reader to be served
writers: the next writer to be served
#MDBE16
Readers run in parallel
[Diagram: Threads A, B and C holding ticket numbers against the “Writers” and “Readers” counters; values shown range from 39 to 42]
#MDBE16
Multiple variable updates without locking
•  Updating multiple counters would require locking
... but we can write the bus width atomically
•  Encode the entire lock state in a single 8B value
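#include <stdint.h>
struct lock {
    uint16_t readers; /* the next reader to be served */
    uint16_t writers; /* the next writer to be served */
    uint16_t ticket;  /* next available ticket: 64K simultaneous threads */
    uint16_t unused;  /* pads the lock state to 8 bytes */
};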
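One way the packed state might be used, sketched in C11: copy the 8-byte word, adjust the counters in the copy, and publish all of them with a single compare-and-swap. The union layout and the function names are assumptions made for illustration, and only non-blocking "try" variants are shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The lock state viewed as one 8-byte word or as separate counters. */
union rwlock_state {
    uint64_t u;
    struct {
        uint16_t readers; /* the next reader to be served */
        uint16_t writers; /* the next writer to be served */
        uint16_t ticket;  /* the next available ticket number */
        uint16_t unused;
    } s;
};
_Static_assert(sizeof(union rwlock_state) == 8, "state must fit one word");

/* Readers share: take a ticket and admit the next reader as well. */
static bool try_read_lock(_Atomic uint64_t *lock) {
    union rwlock_state old, next;
    assert(lock != NULL);
    old.u = atomic_load(lock);
    if (old.s.readers != old.s.ticket)
        return false; /* a writer holds or is queued ahead of us */
    next = old;
    next.s.ticket++;
    next.s.readers++;
    return atomic_compare_exchange_strong(lock, &old.u, next.u);
}

/* Writers are exclusive: succeed only if every earlier ticket is done. */
static bool try_write_lock(_Atomic uint64_t *lock) {
    union rwlock_state old, next;
    assert(lock != NULL);
    old.u = atomic_load(lock);
    if (old.s.writers != old.s.ticket)
        return false;
    next = old;
    next.s.ticket++;
    return atomic_compare_exchange_strong(lock, &old.u, next.u);
}

/* A finished reader lets the next writer advance. */
static void read_unlock(_Atomic uint64_t *lock) {
    union rwlock_state old, next;
    old.u = atomic_load(lock);
    do {
        next = old;
        next.s.writers++;
    } while (!atomic_compare_exchange_weak(lock, &old.u, next.u));
}

/* A finished writer lets the next reader or writer advance. */
static void write_unlock(_Atomic uint64_t *lock) {
    union rwlock_state old, next;
    old.u = atomic_load(lock);
    do {
        next = old;
        next.s.readers++;
        next.s.writers++;
    } while (!atomic_compare_exchange_weak(lock, &old.u, next.u));
}
```

Because all three counters live in one word, no thread can ever observe a state where one counter has been updated and another has not.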
#MDBE16
Ticket locks
Replaces two higher-level lock/unlock calls
... with two atomic instructions.
#MDBE16
That’s a (very) fast introduction....
•  Hazard pointers
•  Skiplists
•  Ticket locks
Open Source implementations are available in WiredTiger, including Public Domain ticket locks.
#MDBE16
WiredTiger distribution
•  Standalone application database toolkit library
•  key-value store (NoSQL)
•  row-store, column-store and LSM engines
•  schema layer includes data types and indexes
•  Another MongoDB Open Source contribution
•  WiredTiger available for other applications
•  https://github.com/wiredtiger
Thank you!
Keith Bostic
keith.bostic@mongodb.com
MongoDB Europe 2016 - Building WiredTiger
