Demystifying data structures and algorithms used by database storage engines
Adewumi Sunkanmi D.
Senior Software Engineer at Acronis, working on Advanced Automation, one of the cloud services offered by Acronis Cyber Cloud.
Outline
1. Overview of a three-tier application
2. Criteria for selecting the best database for an application
3. Overview of database architecture
4. Types of database storage engines and their tradeoffs
5. Q/A
[Diagram: three-tier application. The client sends POST and GET requests to the server, and the server issues WRITE and READ operations to the database.]
Which database should we use🤔?
Candidates named on the slides: BigTable, Neo4J
How do we select the best database for an application?
1. Structure of the data we want to store: structured, unstructured, or graph?
2. Scalability of the database
- Horizontal or vertical scaling
- Sharding (partitioning data across nodes)
- Replication (copies of data on multiple nodes)
3. Support for the database and developers' familiarity with it
4. Rate of writes and reads, and how EXACTLY these operations are handled at the hardware level
[Figure: overview of database architecture, from the query down to the disk. Source: https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/ch01.html]
“SELECT firstname, lastname FROM students WHERE score > 70;”
[Diagram: parse tree of the query above, with a SELECT node (columns firstname, lastname), a FROM node (students) and a WHERE node (score > 70).]
Types of storage engines
- Log Structured Merge (LSM) Tree
- Page Oriented (B-Tree)
https://www.cs.umb.edu/~poneil/lsmtree.pdf
Log Structured Merge Tree Storage Engine
The LSM tree is an immutable, disk-resident data structure. It is optimized for sequential writes while maintaining acceptable read performance.
Three components
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
Incoming writes, in order: ben 300, josh 500, bin 220, ben 177, mia 220, eve 173
1. write ben: 300 goes into the memtable (e.g. a red-black tree in RAM).
2. write josh: 500 goes into the memtable.
3. write bin: 220 goes into the memtable. Threshold reached!
4. The memtable is flushed, in sorted key order, to an SSTable file T1 on SSD/HDD:
ben: 300
bin: 220
josh: 500
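The write path above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Memtable` class, its `threshold` (counted in keys rather than bytes) and the `sstables` list standing in for files on disk are all hypothetical names, and a plain dict sorted at flush time replaces the red-black tree.

```python
class Memtable:
    """Toy memtable: a dict whose contents are sorted by key when flushed."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed limit, counted in keys
        self.entries = {}

    def put(self, key, value):
        self.entries[key] = value

    def is_full(self):
        return len(self.entries) >= self.threshold

    def flush(self):
        """Return the contents as one sorted run (an SSTable) and reset."""
        sstable = sorted(self.entries.items())
        self.entries = {}
        return sstable


memtable = Memtable(threshold=3)
sstables = []  # oldest first; each element stands in for one file on disk

for key, value in [("ben", 300), ("josh", 500), ("bin", 220)]:
    memtable.put(key, value)
    if memtable.is_full():
        sstables.append(memtable.flush())

print(sstables[0])  # [('ben', 300), ('bin', 220), ('josh', 500)]
```

The flushed run comes out in key order even though the writes arrived unsorted, which is what makes the SSTable searchable later.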
The SSTable file T1 on SSD/HDD (e.g. 40MB) is divided into blocks (e.g. four 10MB blocks), each holding a sorted run of keys:
Block 1: alexandar: 10, andreas: 50, …
Block 2: arik: 500, erling: 200, …
Block 3: jan: 11, johan: 300, …
Block 4: robert: 499, roy: 200, …
In-memory index
Key | Byte offset
alexandar | 0
arik | 303
jan |
robert | 500
Find(apa): apa sorts between alexandar and arik in the in-memory index, so only the block that starts at byte offset 0 needs to be scanned.
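A sparse in-memory index like the one above could be sketched as follows. The list `sstable`, the `BLOCK_SIZE` of two entries and the `find` function are assumptions made for illustration; list positions stand in for byte offsets, and a real engine would read the block from disk instead of slicing a list.

```python
import bisect

# One sorted SSTable, held here as a list of (key, value) pairs.
sstable = [
    ("alexandar", 10), ("andreas", 50),
    ("arik", 500), ("erling", 200),
    ("jan", 11), ("johan", 300),
    ("robert", 499), ("roy", 200),
]

BLOCK_SIZE = 2  # entries per block (assumption; the slides use 10MB blocks)

# Sparse index: the first key of every block and the position of that block.
index_pos = list(range(0, len(sstable), BLOCK_SIZE))
index_keys = [sstable[pos][0] for pos in index_pos]


def find(key):
    """Return the value for key, or None if this SSTable does not hold it."""
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first key in the file
    start = index_pos[slot]
    for k, v in sstable[start:start + BLOCK_SIZE]:  # scan a single block
        if k == key:
            return v
    return None


print(find("apa"))    # None: only the block starting at "alexandar" is scanned
print(find("johan"))  # 300
```

Only one entry per block is kept in memory, so the index stays small while a lookup still touches a single block.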
The remaining writes go into a fresh memtable: ben: 177, mia: 220, eve: 173.
When the threshold is reached again, the memtable is flushed, in sorted key order, to a second SSTable file T2 on SSD/HDD:
ben: 177
eve: 173
mia: 220
Read(ben)
[Diagram: the read is resolved in three steps, marked Step 1, Step 2 and Step 3 on the diagram of the memtable, the in-memory index and the SSTable files T1 and T2. The search goes from the most recent data to the oldest.]
How do we handle update?
Since we return from the most recent memtable or segment file, we just insert the key with the new value. ben will be returned from T2, not T1.
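A minimal sketch of why an update needs no in-place edit: the lookup checks the newest data first. The `read` function and the dicts `t1` and `t2` (standing in for the two SSTable files) are hypothetical names used only for this illustration.

```python
def read(key, memtable, sstables):
    """Look in the memtable first, then in SSTables from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):  # sstables is ordered oldest first
        if key in table:
            return table[key]
    return None


t1 = {"ben": 300, "bin": 220, "josh": 500}  # older SSTable
t2 = {"ben": 177, "eve": 173, "mia": 220}   # newer SSTable
memtable = {}

print(read("ben", memtable, [t1, t2]))   # 177, from T2 rather than T1
print(read("josh", memtable, [t1, t2]))  # 500, only T1 has it
```

The stale value ben: 300 is still on disk in T1; it is simply never reached.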
How do we handle delete?
Insert the key with a delete marker called a tombstone. Since this will be the most recent entry, we can tell the key has been deleted, e.g.
ben->null
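The tombstone idea can be sketched like this. `TOMBSTONE`, `delete` and `read` are hypothetical names; a sentinel object plays the role of the slide's null marker so that it can be told apart from a key that was never written.

```python
TOMBSTONE = object()  # hypothetical delete marker (the slide writes ben->null)


def delete(key, memtable):
    """A delete is just another write: record the marker as the newest entry."""
    memtable[key] = TOMBSTONE


def read(key, memtable, sstables):
    """Search newest to oldest; a tombstone hides every older value."""
    for table in [memtable] + list(reversed(sstables)):
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None


t1 = {"ben": 300, "bin": 220, "josh": 500}  # older SSTable
t2 = {"ben": 177, "eve": 173, "mia": 220}   # newer SSTable
memtable = {}

delete("ben", memtable)
print(read("ben", memtable, [t1, t2]))  # None: the tombstone is the newest entry
print(read("eve", memtable, [t1, t2]))  # 173
```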
But now we have duplicates, space wastage :(
Yes, but compaction will help.
Compaction
Compaction merges SSTable files in the background and keeps only the most recent value of each key.
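As a rough sketch of compaction, the two SSTables from the example can be merged so that each key keeps only its newest value. The `compact` function is a hypothetical name, and building a dict in memory is a simplification: engines typically stream a merge over the sorted files instead.

```python
def compact(sstables):
    """Merge SSTables (ordered oldest first) into one sorted table.

    Later tables overwrite earlier ones, so only the newest value
    of each key survives.
    """
    merged = {}
    for table in sstables:
        merged.update(dict(table))
    return sorted(merged.items())


t1 = [("ben", 300), ("bin", 220), ("josh", 500)]  # older SSTable
t2 = [("ben", 177), ("eve", 173), ("mia", 220)]   # newer SSTable

print(compact([t1, t2]))
# [('ben', 177), ('bin', 220), ('eve', 173), ('josh', 500), ('mia', 220)]
```

The duplicate ben: 300 disappears, and the two files become one.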
What if we don’t find the key? Do we search all the SSTable files?
Optimize reads with Bloom filters
Key present? A Bloom filter answers either a strict NO (the key is definitely not there) or a MAYBE (the key may or may not be there, about 99% accurate).
https://brilliant.org/wiki/bloom-filter/
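A small Bloom filter sketch shows where the strict NO and the MAYBE come from. The class name, the 64-bit size, the three hash functions and the use of SHA-256 with a seed prefix are all assumptions chosen for illustration; production filters size these from the expected key count and target error rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=64, num_hashes=3):
        self.size = size              # number of bits (assumed)
        self.num_hashes = num_hashes  # hash functions per key (assumed)
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False is a strict NO; True only means MAYBE.
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for key in ["ben", "bin", "josh"]:
    bloom.add(key)

print(bloom.might_contain("ben"))  # True: added keys are never reported absent
```

One filter per SSTable lets a read skip every file whose filter says NO, so a missing key no longer forces a scan of all files.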
What if a power failure happens before data is flushed to disk?
1. Persist each write in an append-only log file, the write-ahead log (WAL), before writing to the in-memory table.
2. Recreate the memtable from the last Log Sequence Number.
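The two recovery steps might look like this in miniature. The `put` and `recover` functions, the JSON-lines record format and the temporary `wal.log` path are assumptions for this sketch; it also replays the whole log rather than tracking a Log Sequence Number.

```python
import json
import os
import tempfile


def put(key, value, memtable, wal_path):
    # 1. Persist the write in the append-only log first.
    with open(wal_path, "a", encoding="utf-8") as wal:
        wal.write(json.dumps({"key": key, "value": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())  # ask the OS to push the bytes to the device
    # 2. Only then apply it to the in-memory table.
    memtable[key] = value


def recover(wal_path):
    """Rebuild a memtable by replaying the log from the start."""
    memtable = {}
    if os.path.exists(wal_path):
        with open(wal_path, encoding="utf-8") as wal:
            for line in wal:
                record = json.loads(line)
                memtable[record["key"]] = record["value"]
    return memtable


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
memtable = {}
put("ben", 177, memtable, wal_path)
put("eve", 173, memtable, wal_path)

memtable = {}             # simulate losing RAM in a power failure
print(recover(wal_path))  # {'ben': 177, 'eve': 173}
```

Because the log is only ever appended to, the extra write is sequential and cheap compared with a random in-place update.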
Where is the LSM tree storage engine used?
1. Apache Cassandra
2. WiredTiger
3. InfluxDB
4. Yugabyte DB
5. ScyllaDB
6. CockroachDB
7. Google’s BigTable
8. RocksDB
B-Trees
B-trees are page-oriented indexing structures.
https://carlosproal.com/ir/papers/p121-comer.pdf
Important notes on B-trees
1. Store key-value pairs (sorted by key)
2. Self-balancing
3. Often used for indexing
4. Mutable data structure (in-place updates)
5. Each node is a fixed-size block/page, e.g. 4KB
6. Data is read or written one page at a time
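Anatomy of a B-Tree
[Diagram: a three-level B-tree.
- Root page: keys 12, 30, 50, 120, splitting the key space into key < 12, [12, 30), [30, 50), [50, 120) and [120, inf).
- Second-level pages: 13 23 25 27 for [12, 30); 31 37 42 49 for [30, 50); 52 60 69 90 for [50, 120).
- Leaf page: 69 70 78 85 for [69, 90), each key stored with its value.]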
[Figure: a database page. Source: https://sqlbak.com/academy/database-page]
NOTE: A leaf page contains both the key and the value.
Branching factor = 5
Depth = 3
READ(78)
[Diagram: the search starts at the root page, follows the [50, 120) pointer to the page 52 60 69 90, then follows the [69, 90) pointer to the leaf page 69 70 78 85, where key 78 is found.]
Searching for a key is fast because we do not scan all keys, only the keys on the pages along one root-to-leaf path. It takes O(log n), where n is the total number of keys.
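The root-to-leaf descent can be sketched as below. The `Node` class, the `search` function and the small two-level tree are hypothetical and much smaller than the tree on the slides; each `Node` stands in for one page.

```python
import bisect


class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children  # None for a leaf page
        self.values = values      # only leaf pages hold values


def search(node, key):
    """Descend one page per level until a leaf page is reached."""
    while node.children is not None:
        # Child i covers the keys in [keys[i-1], keys[i]).
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None


leaf_a = Node([52, 60], values=["v52", "v60"])
leaf_b = Node([69, 70, 78, 85], values=["v69", "v70", "v78", "v85"])
leaf_c = Node([90, 95], values=["v90", "v95"])
root = Node([69, 90], children=[leaf_a, leaf_b, leaf_c])

print(search(root, 78))  # v78
print(search(root, 71))  # None
```

Each loop iteration corresponds to one page read, so the number of reads equals the depth of the tree rather than the number of keys.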
INSERT(87)
Branching factor = 5
[Diagram: the parent page holds keys 60 69 90; the leaf page for [69, 90) holds 69 70 78 85 86, each key with its value.]
1. Insert 87 into the leaf page: 69 70 78 85 86 87. Branching factor exceeded (> 5)!
2. Create a new page and split the leaf: 69 70 78 | 85 86 87.
3. Add 85 to the parent page: 60 69 85 90.
What if the parent page is full? Split it.
How does update work?
1. Find the leaf page with the key
2. Edit the row
3. Overwrite the page
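The leaf split from the INSERT(87) walkthrough can be sketched as follows. `MAX_KEYS` and `insert_into_leaf` are hypothetical names; pages are plain sorted lists of keys, values are omitted, and splitting a full parent is left out.

```python
import bisect

MAX_KEYS = 5  # assumed page capacity, standing in for the slide's branching factor


def insert_into_leaf(parent_keys, leaf, key):
    """Insert key into a sorted leaf page; split the page if it overflows.

    Returns the list of leaf pages that replace the original one.
    """
    bisect.insort(leaf, key)
    if len(leaf) <= MAX_KEYS:
        return [leaf]
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    bisect.insort(parent_keys, right[0])  # separator key goes to the parent
    return [left, right]


parent = [60, 69, 90]
leaf = [69, 70, 78, 85, 86]

print(insert_into_leaf(parent, leaf, 87))  # [[69, 70, 78], [85, 86, 87]]
print(parent)                              # [60, 69, 85, 90]
```

If the parent itself overflowed after receiving the separator, the same split would be repeated one level up.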
LSM trees vs B-trees storage engines
LSM Tree | B-Tree
Optimized for writes | Optimized for reads
Compresses better (no fragmentation) | Fragmentation wastes space
There can be duplicates before compaction | Each key exists in exactly one place
 | Strong transaction support
Spikes in writes can cause slow compaction due to many SSTable files, which can cause an Out of Memory (OOM) error |
Space optimization in B-tree
[Diagram: a primary index (primary key index) and a secondary index over the same data.]
If the leaf pages of both indexes contain the key and the value, the value is a DUPLICATE!
Instead, both indexes can store a value offset (smaller in size) that points into a heap file holding the values (val1, val2, val3, val4, val5, …).
The cost is extra disk I/O. So you can store important columns in the leaf page and less important columns in the heap file.
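A toy version of the heap file layout, assuming hypothetical names throughout: `heap_file` is a list whose positions play the role of offsets, and the `id`, `firstname` and `score` columns are invented for the example.

```python
heap_file = []  # stands in for the heap file; list position acts as the offset


def insert_row(row, primary_index, secondary_index):
    """Store the row once and record only its offset in both indexes."""
    offset = len(heap_file)
    heap_file.append(row)
    primary_index[row["id"]] = offset
    secondary_index.setdefault(row["score"], []).append(offset)


primary_index = {}    # id -> offset
secondary_index = {}  # score -> list of offsets

insert_row({"id": 1, "firstname": "ben", "score": 177}, primary_index, secondary_index)
insert_row({"id": 2, "firstname": "mia", "score": 220}, primary_index, secondary_index)

# Both lookups pay one extra hop: index -> offset -> heap file.
print(heap_file[primary_index[2]]["firstname"])                   # mia
print([heap_file[o]["firstname"] for o in secondary_index[177]])  # ['ben']
```

Each row is stored once no matter how many indexes point at it, at the price of the extra hop on every read.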
@gifted_dl
Adewumi Sunkanmi D.

More Related Content

Similar to Database Storage Engine Internals (20)

DOCX
My sql storage engines
Vasudeva Rao
 
PPT
The No SQL Principles and Basic Application Of Casandra Model
Rishikese MR
 
PDF
Types of Databases
kedar2310
 
PPTX
The last mile from db to disk
Bartosz Sypytkowski
 
PDF
Mysql database basic user guide
PoguttuezhiniVP
 
PDF
MySQL Storage Engines - which do you use? TokuDB? MyRocks? InnoDB?
Sveta Smirnova
 
PPT
9910559 jjjgjgjfs lke lwmerfml lew we.ppt
abduganiyevbekzod011
 
PPT
[Www.pkbulk.blogspot.com]file and indexing
AnusAhmad
 
PDF
Intro to column stores
Justin Swanhart
 
PPTX
HBase in Practice
larsgeorge
 
PPTX
HBase in Practice
DataWorks Summit/Hadoop Summit
 
PPT
Big data hbase
ANSHUL GUPTA
 
PPT
Database Management Systems full lecture
thiru12741550
 
PDF
Cassandra 2.1 boot camp, Read/Write path
Joshua McKenzie
 
PDF
Bender kuszmaul tutorial-xldb12
Atner Yegorov
 
PDF
Data Structures and Algorithms for Big Databases
omnidba
 
PPTX
Indexing
Dr. C.V. Suresh Babu
 
PPT
Unit 4 data storage and querying
Ravindran Kannan
 
PPTX
Why databases cry at night
Michael Yarichuk
 
PDF
Demystifying datastores
vishnu rao
 
My sql storage engines
Vasudeva Rao
 
The No SQL Principles and Basic Application Of Casandra Model
Rishikese MR
 
Types of Databases
kedar2310
 
The last mile from db to disk
Bartosz Sypytkowski
 
Mysql database basic user guide
PoguttuezhiniVP
 
MySQL Storage Engines - which do you use? TokuDB? MyRocks? InnoDB?
Sveta Smirnova
 
9910559 jjjgjgjfs lke lwmerfml lew we.ppt
abduganiyevbekzod011
 
[Www.pkbulk.blogspot.com]file and indexing
AnusAhmad
 
Intro to column stores
Justin Swanhart
 
HBase in Practice
larsgeorge
 
HBase in Practice
DataWorks Summit/Hadoop Summit
 
Big data hbase
ANSHUL GUPTA
 
Database Management Systems full lecture
thiru12741550
 
Cassandra 2.1 boot camp, Read/Write path
Joshua McKenzie
 
Bender kuszmaul tutorial-xldb12
Atner Yegorov
 
Data Structures and Algorithms for Big Databases
omnidba
 
Unit 4 data storage and querying
Ravindran Kannan
 
Why databases cry at night
Michael Yarichuk
 
Demystifying datastores
vishnu rao
 

Recently uploaded (20)

PPTX
Prompt Like a Pro. Leveraging Salesforce Data to Power AI Workflows.pptx
Dele Amefo
 
PPTX
prodad heroglyph crack 2.0.214.2 Full Free Download
cracked shares
 
PDF
Download Canva Pro 2025 PC Crack Full Latest Version
bashirkhan333g
 
PDF
[Solution] Why Choose the VeryPDF DRM Protector Custom-Built Solution for You...
Lingwen1998
 
PDF
Why is partnering with a SaaS development company crucial for enterprise succ...
Nextbrain Technologies
 
PPTX
iaas vs paas vs saas :choosing your cloud strategy
CloudlayaTechnology
 
PDF
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
PDF
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
PDF
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
PDF
Meet in the Middle: Solving the Low-Latency Challenge for Agentic AI
Alluxio, Inc.
 
PDF
NPD Software -Omnex systems
omnex systems
 
PDF
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
PDF
MiniTool Power Data Recovery 8.8 With Crack New Latest 2025
bashirkhan333g
 
PPTX
Function & Procedure: Function Vs Procedure in PL/SQL
Shani Tiwari
 
PPTX
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
PPTX
Smart Doctor Appointment Booking option in odoo.pptx
AxisTechnolabs
 
PPTX
Get Started with Maestro: Agent, Robot, and Human in Action – Session 5 of 5
klpathrudu
 
PPTX
Transforming Insights: How Generative AI is Revolutionizing Data Analytics
LetsAI Solutions
 
PDF
IObit Driver Booster Pro 12.4.0.585 Crack Free Download
henryc1122g
 
PPTX
Build a Custom Agent for Agentic Testing.pptx
klpathrudu
 
Prompt Like a Pro. Leveraging Salesforce Data to Power AI Workflows.pptx
Dele Amefo
 
prodad heroglyph crack 2.0.214.2 Full Free Download
cracked shares
 
Download Canva Pro 2025 PC Crack Full Latest Version
bashirkhan333g
 
[Solution] Why Choose the VeryPDF DRM Protector Custom-Built Solution for You...
Lingwen1998
 
Why is partnering with a SaaS development company crucial for enterprise succ...
Nextbrain Technologies
 
iaas vs paas vs saas :choosing your cloud strategy
CloudlayaTechnology
 
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
Meet in the Middle: Solving the Low-Latency Challenge for Agentic AI
Alluxio, Inc.
 
NPD Software -Omnex systems
omnex systems
 
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
MiniTool Power Data Recovery 8.8 With Crack New Latest 2025
bashirkhan333g
 
Function & Procedure: Function Vs Procedure in PL/SQL
Shani Tiwari
 
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
Smart Doctor Appointment Booking option in odoo.pptx
AxisTechnolabs
 
Get Started with Maestro: Agent, Robot, and Human in Action – Session 5 of 5
klpathrudu
 
Transforming Insights: How Generative AI is Revolutionizing Data Analytics
LetsAI Solutions
 
IObit Driver Booster Pro 12.4.0.585 Crack Free Download
henryc1122g
 
Build a Custom Agent for Agentic Testing.pptx
klpathrudu
 
Ad

Database Storage Engine Internals

  • 1. Demystifying data structures and algorithms adopted By database storage engine Adewumi Sunkanmi D.
  • 2. Demystifying data structures and algorithms used by database storage engine
  • 3. Adewumi Sunkanmi D. Senior Software Engineer at Acronis working on Advanced Automation, one of the cloud services offered by Acronis Cyber Cloud.
  • 4. Outline 1. Overview of a three-tier application 2. Criteria for selecting the best database for an application 3. Overview of database architecture 4. Types for database storage engines and their tradeoffs 5. Q/A
  • 15. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph?
  • 16. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database
  • 17. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling
  • 18. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes)
  • 19. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes) - Replication(Copies of data on multiple nodes)
  • 20. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes) - Replication(Copies of data on multiple nodes) 3. Support and familiarity of developers with database
  • 21. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes) - Replication(Copies of data on multiple nodes) 3. Support and familiarity of developers with database 4. Rate of write and read and how EXACTLY are these operations handled at the hardware level?
  • 23. SELECT COLS FROM WHERE COL_ID students > score 70 firstname lastname “SELECT firstname, lastname FROM students WHERE score > 70;”
  • 26. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 28. Log Structured Merge Tree Storage Engine The LMS tree is an immutable disk resident data structure and it is optimized for sequential writes while maintaining the acceptable read performance.
  • 29. Log Structured Merge Tree Storage Engine Three components 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table)
  • 30. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177
  • 31. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write ben: 300 Memtable e.g Red black tree in RAM
  • 32. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write ben: 300 Memtable e.g Red black tree in RAM josh: 500
  • 33. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red black tree in RAM ben: 300 josh: 500 Threshold reached!
  • 34. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 SSD/HDD file (SSTable file) T1 ben: 300 bin: 220 josh: 500
  • 35. Log Structured Merge Tree Storage Engine The SSTable file T1 lives on SSD/HDD. In this example an SSTable file is 40MB.
  • 36. Log Structured Merge Tree Storage Engine The 40MB SSTable file is divided into four 10MB blocks of sorted key-value pairs: [alexandar: 10, andreas: 50, …], [arik: 500, erling: 200, …], [jan: 11, johan: 300, …], [robert: 499, roy: 200, …].
  • 37. Log Structured Merge Tree Storage Engine In-memory index: a sparse index that maps the first key of each block to its byte offset (key → byte offset): alexandar → 0, arik → 303, jan → 400, robert → 500.
  • 38. Log Structured Merge Tree Storage Engine Find(apa): "apa" sorts between alexandar and arik in the in-memory index, so only the first block of the segment file needs to be scanned.
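A sparse index lookup such as Find(apa) is a binary search over the indexed keys. The sketch below uses the keys and byte offsets from the slide; the offset 400 for jan is read off the diagram residue and should be treated as an assumption, and `block_for` is a hypothetical helper name.

```python
import bisect

# Sparse index: first key of each block and the byte offset where it starts.
INDEX_KEYS = ["alexandar", "arik", "jan", "robert"]
INDEX_OFFSETS = [0, 303, 400, 500]  # 400 for "jan" is an assumption


def block_for(key):
    """Return (start, end) byte offsets of the only block that may hold `key`.

    `end` is None for the last block. The function returns None when `key`
    sorts before the first indexed key and so cannot be in this file.
    """
    i = bisect.bisect_right(INDEX_KEYS, key) - 1
    if i < 0:
        return None
    end = INDEX_OFFSETS[i + 1] if i + 1 < len(INDEX_OFFSETS) else None
    return INDEX_OFFSETS[i], end
```

`block_for("apa")` returns `(0, 303)`: the engine seeks to offset 0 and scans only until offset 303, instead of reading the whole file.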
  • 39. Log Structured Merge Tree Storage Engine The remaining writes (ben: 177, mia: 220, eve: 173) go to a fresh memtable, which holds them in key order: ben: 177, eve: 173, mia: 220.
  • 40. Log Structured Merge Tree Storage Engine Flush: the memtable contents are written in key order (ben: 177, eve: 173, mia: 220) to a second SSTable file T2 on SSD/HDD, next to T1.
  • 41. Log Structured Merge Tree Storage Engine Read(ben)
  • 44. Log Structured Merge Tree Storage Engine Read(ben) is traced on the diagram in three steps: Step 1, Step 2, Step 3.
  • 46. Log Structured Merge Tree Storage Engine How do we handle update? Since a read returns the value from the most recent memtable or segment file, we just insert the key with the new value. ben will be returned from T2, not T1.
  • 47. Log Structured Merge Tree Storage Engine How do we handle delete? Insert the key with a delete marker called a tombstone. Since this will be the most recent entry for the key, we can tell it has been deleted, e.g. ben -> null.
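The update and delete rules both follow from one read rule: look at the newest data first and stop at the first hit. A minimal sketch, assuming dicts stand in for the memtable and the SSTable files; `TOMBSTONE` and `read` are hypothetical names, and a sentinel object is used instead of null so that a deleted key and a missing key stay distinguishable inside the engine.

```python
TOMBSTONE = object()  # delete marker (the slide writes it as ben -> null)


def read(key, memtable, sstables_newest_first):
    """Check the memtable, then each SSTable from newest to oldest."""
    for table in [memtable] + sstables_newest_first:
        if key in table:
            value = table[key]
            # A tombstone hides every older value of the key.
            return None if value is TOMBSTONE else value
    return None  # key was never written


t1 = {"ben": 300, "bin": 220, "josh": 500}  # older SSTable
t2 = {"ben": 177, "eve": 173, "mia": 220}   # newer SSTable
sstables = [t2, t1]
memtable = {}
```

With this data `read("ben", memtable, sstables)` returns 177 from T2 and never reaches T1. After `memtable["ben"] = TOMBSTONE` the same call returns `None`, even though both files still contain ben.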
  • 48. Log Structured Merge Tree Storage Engine But now we have duplicates (ben is in both T1 and T2), which wastes space :(
  • 49. Log Structured Merge Tree Storage Engine Yes, but compaction will help.
  • 50. Log Structured Merge Tree Storage Engine Compaction merges the SSTable files T1 and T2 and keeps only the most recent value of each key.
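Because every SSTable is already sorted, compaction can be done as a streaming merge. The sketch below is one simple way to do it with `heapq.merge` from the standard library; real engines use leveled or size-tiered strategies that this does not model. `compact` and `TOMBSTONE` are hypothetical names, and dropping tombstones is only safe here because the example merges every file at once.

```python
import heapq

TOMBSTONE = object()  # delete marker


def compact(sstables_newest_first):
    """Merge sorted (key, value) lists, keeping the newest value per key."""
    # Tag each entry with the age of its table: 0 is the newest.
    tagged = [
        [(key, age, value) for key, value in table]
        for age, table in enumerate(sstables_newest_first)
    ]
    result = []
    last_key = object()  # sentinel that equals no real key
    # Entries come out ordered by key, and for equal keys newest first.
    for key, _age, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # older duplicate of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            result.append((key, value))
    return result


t1 = [("ben", 300), ("bin", 220), ("josh", 500)]
t2 = [("ben", 177), ("eve", 173), ("mia", 220)]
```

`compact([t2, t1])` yields one sorted file with five keys, in which ben appears once with the value 177.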
  • 51. Log Structured Merge Tree Storage Engine What if we don’t find the key? Do we search all the SSTable files?
  • 52. Log Structured Merge Tree Storage Engine Optimize reads with Bloom filters. Key present? The filter gives a strict NO if the key is not there; otherwise the answer is maybe or maybe not (99% accurate). https://brilliant.org/wiki/bloom-filter/
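A Bloom filter can be sketched as a bit array plus a few hash functions. The version below is illustrative only: the sizes (1024 bits, 3 hashes) are arbitrary assumptions rather than tuned values, SHA-256 with a seed prefix stands in for the faster hashes a real engine would use, and `BloomFilter` with its methods is a hypothetical name.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        """Derive `num_hashes` bit positions for `key`."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means only 'maybe present'."""
        return all(self.bits[pos] for pos in self._positions(key))


t2_filter = BloomFilter()
for k in ["ben", "eve", "mia"]:
    t2_filter.add(k)
```

One filter is kept per SSTable file. On a read, the engine asks the filter first and skips the file entirely when `might_contain` returns `False`, so a missing key no longer forces a search through every file.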
  • 53. Log Structured Merge Tree Storage Engine What if a power failure happens before data is flushed to disk? 1. Persist each write in an append-only log file, the write-ahead log (WAL), before writing it to the in-memory table. 2. After a restart, recreate the memtable by replaying the log from the last Log Sequence Number.
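The two recovery steps can be sketched as follows. This is a simplified illustration under stated assumptions: each log record is one JSON line of the form `[lsn, key, value]`, the log sequence number is a simple counter, and `WriteAheadLog`, `put` and the `after_lsn` parameter are hypothetical names.

```python
import json
import os
import tempfile


class WriteAheadLog:
    def __init__(self, path):
        self.path = path
        self.next_lsn = 1

    def append(self, key, value):
        """Append one record and force it to disk before returning its LSN."""
        lsn = self.next_lsn
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps([lsn, key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.next_lsn += 1
        return lsn

    def replay(self, after_lsn=0):
        """Rebuild a memtable from records newer than `after_lsn`."""
        memtable = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                lsn, key, value = json.loads(line)
                if lsn > after_lsn:
                    memtable[key] = value
        return memtable


def put(wal, memtable, key, value):
    wal.append(key, value)  # 1. make the write durable first
    memtable[key] = value   # 2. only then apply it in memory
```

If the last flush covered everything up to LSN 2, calling `replay(after_lsn=2)` after a crash restores only the writes that had not yet reached an SSTable file.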
  • 54. Log Structured Merge Tree Storage Engine Where is LSM tree Storage engine used? 1. Apache Cassandra 2. WiredTiger 3. InfluxDB 4. Yugabyte DB 5. ScyllaDB 6. CockroachDB 7. Google’s BigTable 8. RocksDB
  • 55. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 58. https://carlosproal.com/ir/papers/p121-comer.pdf B-Trees Important notes on B-tree 1. Stores key-value pairs (sorted by key) 2. Self-balancing 3. Often used for indexing 4. Mutable data structure (in-place update) 5. Each node is a fixed-size block/page (4KB) 6. Can only read or write one page at a time
  • 59. Anatomy of B-Tree (diagram) Root page keys: 12, 30, 50, 120, with child ranges Key < 10, key [12, 30), key [30, 50), key [50, 120), Key [120, inf). Child pages: 13 23 25 27 for [12, 30); 31 37 42 49 for [30, 50); 52 60 69 90 for [50, 120). Leaf page for key [69, 90): 69 70 78 85, each key stored with its val.
  • 61. Anatomy of B-Trees NOTE: Leaf page contains both the key and value.
  • 62. Anatomy of B-Tree Branching factor = 5, Depth = 3.
  • 63. Anatomy of B-Tree READ(78): starting from the root page, follow the range that contains 78 at each level: key [50, 120) in the root, then key [69, 90) in the page 52 60 69 90, down to the leaf page 69 70 78 85.
  • 66. Anatomy of B-Tree READ(78): found! Key 78 is in the leaf page, stored with its val.
  • 67. Anatomy of B-Tree Searching for a key is fast because we are not scanning all keys, only the keys within the matching range at each level. It takes O(log n), where n is the total number of keys.
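The READ(78) walk can be sketched with a small in-memory tree. The tree below is a hypothetical two-level example that borrows the leaf 69 70 78 85 and the range [69, 90) from the slides; it is not the full tree from the diagram. `Page` and `search` are illustrative names, and the values are placeholder strings.

```python
import bisect


class Page:
    """A B-tree page: internal pages have children, leaf pages have values."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children  # None for a leaf page
        self.values = values      # None for an internal page


def search(page, key):
    """Descend one page per level, then binary-search the leaf."""
    while page.children is not None:
        # children[i] covers keys in [keys[i-1], keys[i])
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None


leaf_a = Page([52, 60], values=["v52", "v60"])                   # key < 69
leaf_b = Page([69, 70, 78, 85], values=["v69", "v70", "v78", "v85"])  # [69, 90)
leaf_c = Page([90, 100], values=["v90", "v100"])                 # key >= 90
root = Page([69, 90], children=[leaf_a, leaf_b, leaf_c])
```

`search(root, 78)` touches two pages, the root and one leaf, and returns `"v78"`. Each level narrows the search to a single child, which is where the O(log n) bound comes from.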
  • 68. INSERT(87) Branching factor = 5. Parent page: 60 69 90. Target leaf page: 69 70 78 85 86, each key stored with its val.
  • 69. INSERT(87) With 87 added, the leaf page holds 69 70 78 85 86 87. Branching factor exceeded! > 5. Create new page.
  • 70. INSERT(87) The leaf page is split into two pages: 69 70 78 and 85 86 87.
  • 71. INSERT(87) Add 85 to parent page, which becomes 60 69 85 90.
  • 72. INSERT(87) What if the parent page is full? Split it.
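The INSERT(87) steps can be reproduced with a short sketch. It handles one leaf and its parent's key list only; values are omitted, and a full parent would need the same split applied one level up, which is not shown. `MAX_KEYS` treats the slide's branching factor of 5 as the page capacity, and `insert_into_leaf` is a hypothetical name.

```python
MAX_KEYS = 5  # the slide's branching factor, used here as the page capacity


def insert_into_leaf(parent_keys, leaf, key):
    """Insert `key` into a sorted leaf page, splitting it on overflow.

    Returns (new_parent_keys, list_of_leaf_pages). Inputs are not modified.
    """
    leaf = sorted(leaf + [key])
    if len(leaf) <= MAX_KEYS:
        return list(parent_keys), [leaf]
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    # The first key of the new right page becomes a separator in the parent.
    new_parent = sorted(parent_keys + [right[0]])
    return new_parent, [left, right]
```

`insert_into_leaf([60, 69, 90], [69, 70, 78, 85, 86], 87)` returns the parent `[60, 69, 85, 90]` and the two pages `[69, 70, 78]` and `[85, 86, 87]`, matching the slides.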
  • 73. How does update work? 1. Find the leaf page with the key. 2. Edit the row. 3. Overwrite the page.
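Step 3, overwriting the page, is what makes the B-tree a mutable structure: the new bytes replace the old ones at the same position in the file. A minimal sketch, assuming a file made of fixed 4KB pages addressed by page number; the helper names and the zero-padding convention are assumptions made for the example.

```python
import os
import tempfile

PAGE_SIZE = 4096  # 4KB pages


def write_page(path, page_no, payload):
    """Overwrite one whole page in place; the file does not grow."""
    if len(payload) > PAGE_SIZE:
        raise ValueError("payload does not fit in one page")
    with open(path, "r+b") as f:
        f.seek(page_no * PAGE_SIZE)
        f.write(payload.ljust(PAGE_SIZE, b"\x00"))


def read_page(path, page_no):
    """Read one page and strip the zero padding."""
    with open(path, "rb") as f:
        f.seek(page_no * PAGE_SIZE)
        return f.read(PAGE_SIZE).rstrip(b"\x00")


def new_page_file(num_pages):
    """Create a temporary file of empty pages (hypothetical helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x00" * (PAGE_SIZE * num_pages))
    return path
```

Writing `b"ben=300"` and then `b"ben=177"` to page 1 leaves a single copy of ben on disk and the file size unchanged, in contrast to the LSM tree, where the old value stays in T1 until compaction.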
  • 74. LSM trees vs B-Trees storage engine. LSM Tree: optimized for write; compresses better (no fragmentation); there can be duplicates before compaction; spikes in writes can cause slow compaction due to many SSTable files, which can cause an Out of Memory (OOM) error. B-Tree: optimized for read; fragmentation wastes space; each key exists in exactly one place; strong transaction support.
  • 75. Space optimization in B-tree Primary index(primary key index) Secondary index
  • 76. Space optimization in B-tree In both the primary index (primary key index) and the secondary index, the leaf page contains both key and value. DUPLICATE!
  • 77. Space optimization in B-tree Instead, the leaf pages of both the primary index and the secondary index store a value offset (smaller in size). The values themselves (val1, val2, val3, val4, val5, …) are stored once in a heap file.
  • 78. Space optimization in B-tree Following the value offset into the heap file costs an extra disk I/O. So you can store important columns in the leaf page and less important columns in the heap file.
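The heap-file layout can be sketched with plain Python containers. This is an illustration under assumptions: a list stands in for the heap file and a list position for a byte offset, dicts stand in for the two B-tree indexes, and the row fields (`id`, `firstname`, `lastname`, `score`) and function names are hypothetical.

```python
def insert_row(heap, primary, secondary, row):
    """Store the row once in the heap; both indexes keep only its offset."""
    offset = len(heap)
    heap.append(row)
    primary[row["id"]] = offset
    secondary.setdefault(row["lastname"], []).append(offset)
    return offset


def get_by_id(heap, primary, row_id):
    # Index lookup, then one extra hop into the heap file.
    return heap[primary[row_id]]


def get_by_lastname(heap, secondary, lastname):
    return [heap[offset] for offset in secondary.get(lastname, [])]


heap, primary, secondary = [], {}, {}
insert_row(heap, primary, secondary,
           {"id": 1, "firstname": "ben", "lastname": "stone", "score": 77})
insert_row(heap, primary, secondary,
           {"id": 2, "firstname": "mia", "lastname": "stone", "score": 90})
```

Both indexes resolve to the same stored row, so adding a secondary index costs only one small offset per row instead of a second copy of every value. The price is the extra hop in `get_by_id` and `get_by_lastname`, which on disk is the extra I/O mentioned on the slide.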