MyRocks introduction and
production deployment at Facebook
Yoshinori Matsunobu
Production Engineer, Facebook
Jun 2017
Who am I
▪ Was a MySQL consultant at MySQL (Sun, Oracle) for 4 years
▪ Joined Facebook in March 2012
▪ MySQL 5.1 -> 5.6 upgrade
▪ Fast master failover without losing data
▪ Partly joined HBase Production Engineering team in 2014.H1
▪ Started a research project to integrate RocksDB and MySQL in
2014.H2, with the MySQL Engineering Team and MariaDB
▪ Started MyRocks production deployment in 2016.H2
Agenda
▪ MySQL at Facebook
▪ Issues in InnoDB
▪ RocksDB and MyRocks overview
▪ Production Deployment
“Main MySQL Database” at Facebook
▪ Storing Social Graph
▪ Massively Sharded
▪ Petabytes scale
▪ Low latency
▪ Automated Operations
▪ Pure Flash Storage (Constrained by space, not by CPU/IOPS)
H/W trends and limitations
▪ SSD/Flash is getting affordable, but MLC Flash is still a bit expensive
▪ HDD: Large enough capacity but very limited IOPS
▪ Reducing read/write IOPS is very important -- Reducing write is harder
▪ SSD/Flash: Great read iops but limited space and write endurance
▪ Reducing space is higher priority
InnoDB Issue (1) -- Write Amplification
[Diagram: a page of rows goes through Read, Modify, Write]
- A 1-byte modification results in one page write (4~16KB)
- InnoDB “Doublewrite” doubles write volume
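The arithmetic behind this slide can be sketched as follows. This is a minimal illustration, assuming a 16KB page and counting only the page write and its doublewrite copy (redo log and other writes are ignored); the function name is hypothetical.

```python
def innodb_bytes_written(page_size_bytes: int, doublewrite: bool = True) -> int:
    """Bytes written to storage when a single row on one page is modified.

    Simplified model: the whole page is written once, and once more
    to the doublewrite buffer when doublewrite is enabled.
    """
    copies = 2 if doublewrite else 1
    return page_size_bytes * copies


modified_bytes = 1
page_size = 16 * 1024  # assumed 16KB page
written = innodb_bytes_written(page_size)
amplification = written / modified_bytes
print(f"{written} bytes written for {modified_bytes} modified byte "
      f"(amplification {amplification:.0f}x)")
```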
InnoDB Issue (2) -- B+Tree Fragmentation
INSERT INTO message_table (user_id) VALUES (31)
Before the insert:
Leaf Block 1
user_id RowID
1 10000
2 5
3 15321
…
60 431

After the insert (the leaf block is split):
Leaf Block 1
user_id RowID
1 10000
…
30 333

Leaf Block 2
user_id RowID
31 345
…
60 431
Empty Empty
InnoDB Issue (3) -- Compression
[Diagram: an uncompressed 16KB page of rows is compressed to 5KB, but uses 8KB of space on storage]
Compressed size => space used on storage:
0~4KB => 4KB
4~8KB => 8KB
8~16KB => 16KB
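The rounding rule above can be written as a small function. This is only a sketch of the mapping shown on the slide; the function name is hypothetical.

```python
def stored_size_kb(compressed_kb: float) -> int:
    """Round a compressed page size up to the 4KB, 8KB or 16KB slot it occupies."""
    for slot_kb in (4, 8, 16):
        if compressed_kb <= slot_kb:
            return slot_kb
    raise ValueError("a compressed page cannot exceed 16KB")


compressed_kb = 5
slot_kb = stored_size_kb(compressed_kb)
wasted_kb = slot_kb - compressed_kb
print(f"{compressed_kb}KB compressed -> {slot_kb}KB on storage, {wasted_kb}KB wasted")
```

With the slide's example, a page compressed to 5KB lands in the 8KB slot, so 3KB of the slot holds no data.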
RocksDB
▪ http://rocksdb.org/
▪ Forked from LevelDB
▪ Key-Value LSM (Log Structured Merge) persistent store
▪ Embedded
▪ Data stored locally
▪ Optimized for fast storage
▪ LevelDB was created by Google
▪ Facebook forked and developed RocksDB
▪ Used by many backend services at Facebook, and by many large external services
▪ Requires writing C++ or Java code to access RocksDB
RocksDB architecture overview
▪ Leveled LSM Structure
▪ MemTable
▪ WAL (Write Ahead Log)
▪ Compaction
▪ Column Family
RocksDB Architecture
[Diagram labels: Write Request; Memory: Active MemTable, MemTable, MemTable (Switch, Switch); Persistent Storage: WAL, WAL, WAL, Files (File1 to File6); Flush; Compaction]
Write Path (1) to (4)
[Diagram shown on four consecutive slides with identical labels: Write Request; Memory: Active MemTable, MemTable, MemTable (Switch, Switch); Persistent Storage: WAL, WAL, WAL, Files (File1 to File6); Flush; Compaction]
Read Path
[Diagram labels: Read Request; Memory: Active MemTable, MemTable, MemTable, Bloom Filter; Persistent Storage: WAL, WAL, WAL, Files (File1 to File6)]
Read Path
[Same diagram, with additional labels: Index and Bloom Filters cached in memory; File 1, File 3]
How LSM/RocksDB works
INSERT INTO message (user_id) VALUES (31);
INSERT INTO message (user_id) VALUES (99999);
INSERT INTO message (user_id) VALUES (10000);
[Diagram]
▪ WAL/MemTable: 10000, 99999, 31
▪ New SST: 31, 10000, 99999
▪ Existing SSTs: 1, 10, 150, …, 50000, 50001, 55000, 55700, …, 110000, …
▪ Compaction: writing new SSTs sequentially
▪ N random row modifications => a few sequential reads & writes
▪ SST after compaction: 1, 10, 31, …, 150, 10000, 50001, 55000, 55700, …, 99999
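The flow on this slide can be sketched in a few lines. This is a toy model that assumes keys only (no values, deletes, or levels); it is not RocksDB code, and the variable names are illustrative.

```python
import heapq

# Writes arrive in random key order and are buffered in memory.
memtable = []
for user_id in (31, 99999, 10000):
    memtable.append(user_id)

# Flush: the buffered keys are sorted and written out as one new SST.
new_sst = sorted(memtable)

# Existing SST on storage, already sorted by key (values taken from the slide).
existing_sst = [1, 10, 150, 50001, 55000, 55700]

# Compaction: merge the sorted runs with sequential reads and one sequential write.
compacted = list(heapq.merge(existing_sst, new_sst))
print(compacted)
```

`heapq.merge` reads each sorted input once, front to back, which mirrors why the random row modifications turn into a few sequential reads and writes.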
LSM handles compression better
[Diagram: a 16MB SST file, made up of blocks of rows, is compressed into a 5MB SST file]
=> Aligned to OS sector (4KB unit)
=> Negligible OS page alignment overhead
Reducing Space/Write Amplification
▪ Append Only
[Diagram: rows go from WAL/Memstore to sst]

▪ Prefix Key Encoding
Before:
id1 id2 id3
100 200 1
100 200 2
100 200 3
100 200 4
After:
id1 id2 id3
100 200 1
        2
        3
        4

▪ Zero-Filling metadata
Before:
key value seq id flag
k1 v1 1234561 W
k2 v2 1234562 W
k3 v3 1234563 W
k4 v4 1234564 W
Seq id is 7 bytes in RocksDB. After compression,
“0” uses very little space
After:
key value seq id flag
k1 v1 0 W
k2 v2 0 W
k3 v3 0 W
k4 v4 0 W
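The prefix key encoding idea can be illustrated as below. This sketch works on whole columns of a tuple key to match the slide's table; it is an assumption for illustration and not RocksDB's actual block format. The function names are hypothetical.

```python
def prefix_encode(sorted_keys):
    """Store each key as (number of leading columns shared with the previous key, remaining columns)."""
    encoded = []
    prev = ()
    for key in sorted_keys:
        shared = 0
        while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the full keys from the (shared, suffix) pairs."""
    keys = []
    prev = ()
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


keys = [(100, 200, 1), (100, 200, 2), (100, 200, 3), (100, 200, 4)]
encoded = prefix_encode(keys)
print(encoded)
```

Only the first key is stored in full; each later key keeps just the column that differs, which is what the "After" table on the slide shows.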
LSM Compaction Algorithm -- Level
▪ For each level, data is sorted by key
▪ Read Amplification: 1 ~ number of levels (depending on cache -- L0~L3 are usually cached)
▪ Write Amplification: 1 + 1 + fanout * (number of levels – 2) / 2
▪ Space Amplification: 1.11
▪ 11% is much smaller than B+Tree’s fragmentation
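The write amplification formula above can be evaluated directly. The fanout of 10 and the 6 levels used here are illustrative assumptions, not values stated on the slide. The space amplification function shows one possible reading of the 1.11 figure (each upper level holding 1/fanout of the level below); that reading is also an assumption.

```python
def write_amplification(fanout: int, num_levels: int) -> float:
    """Slide formula: 1 + 1 + fanout * (number of levels - 2) / 2."""
    return 1 + 1 + fanout * (num_levels - 2) / 2


def space_amplification(fanout: int, levels_counted: int = 3) -> float:
    """Assumed reading: the last level is 1, each level above adds 1/fanout of the one below."""
    return sum(1 / fanout**i for i in range(levels_counted))


print(write_amplification(fanout=10, num_levels=6))
print(round(space_amplification(fanout=10), 2))
```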
Read Penalty on LSM
SELECT id1, id2, time FROM t WHERE id1=100 AND id2=100 ORDER BY time DESC LIMIT 1000;
Index on (id1, id2, time)
▪ InnoDB [Diagram: Branch, Leaf]
Range scan with a covering index is done by just reading leaves sequentially,
and ordering is guaranteed (very efficient)
▪ RocksDB [Diagram: MemTable, L0, L1, L2, L3]
A merge is needed to do a range scan with ORDER BY
(L0-L2 are usually cached, but in total it needs
more CPU cycles than InnoDB)
Bloom Filter
[Diagram: KeyMayExist(id=31) ? is asked of MemTable, L0, L1, L2, L3; two of the checks return false]
Checks whether a key may exist without reading data,
and skips read I/O if the key definitely does not exist
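A Bloom filter with this "may exist / definitely does not exist" behaviour can be sketched as follows. This is a generic textbook filter, not RocksDB's implementation; the class name, bit count, and hash choice are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def key_may_exist(self, key) -> bool:
        # False means the key was definitely never added, so the read can be skipped.
        return all(self.bits[pos] for pos in self._positions(key))


sst_filter = BloomFilter()
for user_id in (1, 10, 150):
    sst_filter.add(user_id)

print(sst_filter.key_may_exist(10))  # always True for a key that was added
print(sst_filter.key_may_exist(31))  # False unless a false positive occurs
```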
Column Family
Query atomicity across different key spaces.
▪ Column families:
▪ Separate MemTables and SST files
▪ Share transactional logs
[Diagram: a shared WAL serves CF1 and CF2; each column family has its own Active MemTable, Read Only MemTable(s), Persistent Files, and Compaction]
What is MyRocks
▪ MySQL on top of RocksDB (RocksDB storage engine)
▪ Taking both LSM advantages and MySQL features
▪ LSM advantage: Smaller space and lower write amplification
▪ MySQL features: SQL, Replication, Connectors and many tools
▪ Open Source, also distributed by MariaDB and Percona
[Diagram: MySQL Clients; SQL/Connector; MySQL (Parser, Optimizer, Replication, etc); storage engines: InnoDB, RocksDB]
http://myrocks.io/
MyRocks features
▪ Clustered Index (same as InnoDB)
▪ Transactions, including consistency between binlog and RocksDB
▪ Faster data loading, deletes and replication
▪ Dynamic Options
▪ TTL
▪ Crash Safety
▪ Online logical and binary backup
Performance (LinkBench)
▪ Space Usage
▪ QPS
▪ Flash writes per query
▪ HDD
▪ http://smalldatum.blogspot.com/2016/01/myrocks-vs-innodb-with-linkbench-over-7.html
Database Size (Compression)
QPS
Flash writes per query
On HDD workloads
Getting Started
▪ Downloading MySQL with MyRocks support (MariaDB, Percona Server)
▪ Configuring my.cnf
▪ Installing MySQL and initializing data directory
▪ Starting mysqld
▪ Creating and manipulating some tables
▪ Shutting down mysqld
▪ https://github.com/facebook/mysql-5.6/wiki/Getting-Started-with-MyRocks
my.cnf (minimal configuration)
▪ Mixing multiple transactional storage engines within the same instance is generally not
recommended
▪ Not transactional across engines, and not tested enough
▪ Add “allow-multiple-engines” in my.cnf if you really want to mix InnoDB and MyRocks
[mysqld]
rocksdb
default-storage-engine=rocksdb
skip-innodb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin (or utf8_bin, binary)
log-bin
binlog-format=ROW
Creating tables
▪ Use “ENGINE=RocksDB” to create RocksDB tables
▪ Setting “default-storage-engine=RocksDB” in my.cnf is fine too
▪ It is generally recommended to have a PRIMARY KEY, though MyRocks allows tables without
PRIMARY KEYs
▪ Tables are automatically compressed, without any configuration in DDL. Compression
algorithm can be configured via my.cnf
CREATE TABLE t (
id INT PRIMARY KEY,
value1 INT,
value2 VARCHAR (100),
INDEX (value1)
) ENGINE=RocksDB COLLATE latin1_bin;
MyRocks in Production at Facebook
MyRocks Goals
[Chart: resource usage relative to the machine limit]
▪ InnoDB in main database: Space 90%, IO 15%, CPU 20%
▪ MyRocks in main database: Space 45%, IO 12%, CPU 18% (each value appears twice in the chart)
MyRocks goals (more details)
▪ Smaller space usage
▪ 50% compared to compressed InnoDB at FB
▪ Better write amplification
▪ Can use more affordable flash storage
▪ Fast, and small enough CPU usage with general purpose workloads
▪ Data large enough that it doesn’t fit in RAM
▪ Point lookup, range scan, full index/table scan
▪ Insert, update, delete by point/range, and occasional bulk inserts
▪ Same or even smaller CPU usage compared to InnoDB at FB
▪ Make it possible to consolidate 2x more instances per machine
Facebook MySQL related teams
▪ Production Engineering (SRE/PE)
▪ MySQL Production Engineering
▪ Data Performance
▪ Software Engineering (SWE)
▪ RocksDB Engineering
▪ MySQL Engineering
▪ MyRocks
▪ Others (Replication, Client, InnoDB, etc)
Relationship between MySQL PE and SWE
▪ SWE does MySQL server upgrade
▪ Upgrading MySQL revision with several FB patches
▪ Hot fixes
▪ SWE participates in PE oncall
▪ Investigating issues that might have been caused by MySQL server codebase
▪ Core files analysis
▪ Can submit diffs to each other
▪ “Hackaweek” to work on shorter term tasks for a week
MyRocks migration -- Technical Challenges
▪ Initial Migration
▪ Creating MyRocks instances without downtime
▪ Loading into MyRocks tables within reasonable time
▪ Verifying data consistency between InnoDB and MyRocks
▪ Continuous Monitoring
▪ Resource Usage like space, iops, cpu and memory
▪ Query plan outliers
▪ Stalls and crashes
MyRocks migration -- Technical Challenges (2)
▪ When running MyRocks on master
▪ RBR (Row based binary logging)
▪ Removing queries relying on InnoDB Gap Lock
▪ Robust XA support (binlog and RocksDB)
Creating first MyRocks instance without downtime
▪ Picking one of the InnoDB slave instances, then starting logical dump
and restore
▪ Stopping one slave does not affect services
Master (InnoDB)
Slave1 (InnoDB) Slave2 (InnoDB) Slave3 (InnoDB) Slave4 (MyRocks)
Stop & Dump & Load
Faster Data Loading
Normal Write Path in MyRocks/RocksDB
[Diagram: Write Requests -> MemTable, WAL -> Flush -> Level 0 SST -> Compaction -> Level 1 SST -> …. -> Compaction -> Level max SST]
Faster Write Path
[Diagram: Write Requests -> Level max SST]
“SET SESSION rocksdb_bulk_load=1;”
Original data must be sorted by primary key
Data migration steps
▪ Dst) Create table … ENGINE=ROCKSDB; (creating MyRocks tables with proper column families)
▪ Dst) ALTER TABLE DROP INDEX; (dropping secondary keys)
▪ Src) STOP SLAVE;
▪ mysqldump --host=innodb-host --order-by-primary | mysql --host=myrocks-host --init-command="set sql_log_bin=0; set rocksdb_bulk_load=1"
▪ Dst) ALTER TABLE ADD INDEX; (adding secondary keys)
▪ Src, Dst) START SLAVE;
Data Verification
▪ MyRocks/RocksDB is a relatively new database technology
▪ It might have more bugs than the robust InnoDB
▪ Ensuring data consistency helps avoid showing conflicting
results
Verification tests
▪ Index count check between primary key and secondary keys
▪ If any index is broken, it can be detected
▪ SELECT 'PRIMARY', COUNT(*) FROM t FORCE INDEX (PRIMARY)
UNION SELECT 'idx1', COUNT(*) FROM t FORCE INDEX (idx1)
▪ Can’t be used if there is no secondary key
▪ Index stats check
▪ Checking that “rows” shown by SHOW TABLE STATUS is not far from the actual row count
▪ Checksum tests w/ InnoDB
▪ Comparing between InnoDB instance and MyRocks instance
▪ Creating a transaction consistent snapshot at the same GTID position, scan, then compare
checksum
▪ Shadow correctness check
▪ Capturing read traffic
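The checksum comparison between an InnoDB instance and a MyRocks instance could look roughly like this. It is a hypothetical sketch, not the tool used at Facebook: rows are assumed to have been fetched already from consistent snapshots at the same GTID position, and `table_checksum` is an illustrative name.

```python
import zlib


def table_checksum(rows):
    """Return (row count, order-independent checksum) for an iterable of rows."""
    count = 0
    checksum = 0
    for row in rows:
        count += 1
        # XOR makes the result independent of scan order.
        checksum ^= zlib.crc32(repr(row).encode("utf-8"))
    return count, checksum


innodb_rows = [(1, "a"), (2, "b"), (3, "c")]
myrocks_rows = [(3, "c"), (1, "a"), (2, "b")]  # same data, different scan order

if table_checksum(innodb_rows) == table_checksum(myrocks_rows):
    print("consistent")
else:
    print("MISMATCH")
```

Returning the row count alongside the checksum also catches a missing row directly, before any hash comparison matters.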
Shadow traffic tests
▪ We have a shadow test framework
▪ MySQL audit plugin to capture read/write queries from production instances
▪ Replaying them into shadow master instances
▪ Shadow master tests
▪ Client errors
▪ Rewriting queries relying on Gap Lock
▪ gap_lock_raise_error=1, gap_lock_write_log=1
Creating second MyRocks instance without downtime
Master (InnoDB)
Slave1 (InnoDB) Slave2 (InnoDB) Slave3 (MyRocks) Slave4 (MyRocks)
myrocks_hotbackup
(Online binary backup)
Promoting MyRocks as a master
Master (MyRocks)
Slave1 (InnoDB) Slave2 (InnoDB) Slave3 (InnoDB) Slave4 (MyRocks)
Crash Safety
▪ Crash Safety makes operations much easier
▪ Just restarting failed instances can restart replication
▪ No need to rebuild entire instances, as long as data is there
▪ Crash Safe Slave
▪ Crash Safe Master
▪ 2PC (binlog and WAL)
Master crash safety settings
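▪ No data loss on unplanned machine reboot: sync-binlog=1 (default), rocksdb-flush-log-at-trx-commit=1 (default), rocksdb-enable-2pc=1 (default), rocksdb-wal-recovery-mode=1 (default)
▪ No data loss on mysqld crash & recovery: sync-binlog=0, rocksdb-flush-log-at-trx-commit=2, rocksdb-enable-2pc=1, rocksdb-wal-recovery-mode=2
▪ No data loss if always failover: sync-binlog=0, rocksdb-flush-log-at-trx-commit=2, rocksdb-enable-2pc=0, rocksdb-wal-recovery-mode=2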
Notable issues fixed during migration
▪ Lots of “Snapshot Conflict” errors
▪ Because of implementation differences in MyRocks (PostgreSQL Style snapshot isolation)
▪ Setting tx-isolation=READ-COMMITTED solved the problem
▪ Slaves stopped with I/O errors on reads
▪ We switched to make MyRocks abort on I/O errors for both reads and writes
▪ Index statistics bugs
▪ Inaccurate row counts/length on bulk loading -> fixed
▪ Cardinality was not updated on fast index creation -> fixed
▪ Crash safety bugs and some crash bugs in MyRocks/RocksDB
Preventing stalls
▪ Heavy writes cause lots of compactions, which may cause stalls
▪ Typical write stall cases
▪ Online schema change
▪ Massive data migration jobs that write lots of data
▪ Use fast data loading whenever possible
▪ InnoDB to MyRocks migration can utilize this technique
▪ Can be harder for data migration jobs that write into existing tables
When write stalls happen
▪ Estimated number of pending compaction bytes exceeded X bytes
▪ soft|hard_pending_compaction_bytes, default 64GB
▪ Number of L0 files exceeded level0_slowdown|stop_writes_trigger (default 10)
▪ Number of unflushed MemTables exceeded
max_write_buffer_number (default 4)
▪ All of these incidents are written to LOG as WARN level
▪ All of these options apply to each column family
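The conditions above can be summarised as a small classifier. This is a simplified sketch, not RocksDB's logic: the limit values are hypothetical, reaching a limit is treated as exceeding it, and mapping the MemTable condition to a hard stall is an assumption.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class StallLimits:
    soft_pending_compaction_bytes: int
    hard_pending_compaction_bytes: int
    level0_slowdown_writes_trigger: int
    level0_stop_writes_trigger: int
    max_write_buffer_number: int


def stall_state(limits, pending_compaction_bytes, l0_files, unflushed_memtables):
    """Classify one column family as 'hard', 'soft', or 'none'."""
    if (pending_compaction_bytes >= limits.hard_pending_compaction_bytes
            or l0_files >= limits.level0_stop_writes_trigger
            or unflushed_memtables >= limits.max_write_buffer_number):
        return "hard"
    if (pending_compaction_bytes >= limits.soft_pending_compaction_bytes
            or l0_files >= limits.level0_slowdown_writes_trigger):
        return "soft"
    return "none"


# Hypothetical limits for illustration only.
limits = StallLimits(
    soft_pending_compaction_bytes=64 * GIB,
    hard_pending_compaction_bytes=256 * GIB,
    level0_slowdown_writes_trigger=10,
    level0_stop_writes_trigger=20,
    max_write_buffer_number=4,
)
print(stall_state(limits, pending_compaction_bytes=70 * GIB, l0_files=3, unflushed_memtables=1))
```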
What happens on write stalls
▪ Soft stalls
▪ COMMIT takes longer time than usual
▪ Total estimated written bytes to MemTable is capped to
rocksdb_delayed_write_rate, until slowdown conditions are cleared
▪ Default is 16MB/s (previously 2MB/s)
▪ Hard stalls
▪ All writes are blocked at COMMIT, until stop conditions are cleared
Mitigating Write Stalls
▪ Speed up compactions
▪ Use faster compression algorithm (LZ4 for higher levels, ZSTD in the bottommost)
▪ Increase rocksdb_max_background_compactions
▪ Reduce total bytes written
▪ But avoid using too strong compression algorithm on upper levels
▪ Use more write efficient compaction algorithm
▪ compaction_pri=kMinOverlappingRatio
▪ Delete files slowly on Flash
▪ Deleting too many large files causes TRIM stalls on Flash
▪ MyRocks has an option to throttle sst file deletion speed
▪ Binlog file deletions should be slowed down as well
Monitoring
▪ MyRocks files
▪ SHOW ENGINE ROCKSDB STATUS
▪ SHOW ENGINE ROCKSDB TRANSACTION STATUS
▪ LOG files
▪ information_schema tables
▪ sst_dump tool
▪ ldb tool
SHOW ENGINE ROCKSDB STATUS
▪ Column Family Statistics, including size, read and write amp per
level
▪ Memory usage
*************************** 7. row ***************************
Type: CF_COMPACTION
Name: default
Status:
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 2/0 51.58 0.5 0.0 0.0 0.0 0.3 0.3 0.0 0.0 0.0 40.3 7 10 0.669 0 0
L3 6/0 109.36 0.9 0.7 0.7 0.0 0.6 0.6 0.0 0.9 43.8 40.7 16 3 5.172 7494K 297K
L4 61/0 1247.31 1.0 2.0 0.3 1.7 2.0 0.2 0.0 6.9 49.7 48.5 41 9 4.593 15M 176K
L5 989/0 12592.86 1.0 2.0 0.3 1.8 1.9 0.1 0.0 7.4 8.1 7.4 258 8 32.209 17M 726K
L6 4271/0 127363.51 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0
Sum 5329/0 141364.62 0.0 4.7 1.2 3.5 4.7 1.2 0.0 17.9 15.0 15.0 321 30 10.707 41M 1200K
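For monitoring, the per-level rows of this output can be parsed with plain string splitting. The sketch below assumes the 18-column layout shown above and copies three rows from it; the helper name is hypothetical, and other RocksDB versions may print different columns.

```python
COLUMNS = (
    "Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) "
    "Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop"
).split()

STATS = """\
L3 6/0 109.36 0.9 0.7 0.7 0.0 0.6 0.6 0.0 0.9 43.8 40.7 16 3 5.172 7494K 297K
L4 61/0 1247.31 1.0 2.0 0.3 1.7 2.0 0.2 0.0 6.9 49.7 48.5 41 9 4.593 15M 176K
L5 989/0 12592.86 1.0 2.0 0.3 1.8 1.9 0.1 0.0 7.4 8.1 7.4 258 8 32.209 17M 726K
"""


def parse_compaction_stats(text):
    """Turn each well-formed stats row into a dict keyed by column name."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == len(COLUMNS):
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


w_amp = {row["Level"]: float(row["W-Amp"]) for row in parse_compaction_stats(STATS)}
print(w_amp)
```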
SHOW GLOBAL STATUS
mysql> show global status like 'rocksdb%';
+---------------------------------------+-------------+
| Variable_name | Value |
+---------------------------------------+-------------+
| rocksdb_rows_deleted | 216223 |
| rocksdb_rows_inserted | 1318158 |
| rocksdb_rows_read | 7102838 |
| rocksdb_rows_updated | 1997116 |
....
| rocksdb_bloom_filter_prefix_checked | 773124 |
| rocksdb_bloom_filter_prefix_useful | 308445 |
| rocksdb_bloom_filter_useful | 10108448 |
....
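One simple use of these counters is a ratio of "useful" to "checked" prefix Bloom filter lookups. The interpretation here, that "useful" counts checks where the filter let RocksDB skip a read, is an assumption; the numbers are copied from the output above.

```python
status = {
    "rocksdb_bloom_filter_prefix_checked": 773124,
    "rocksdb_bloom_filter_prefix_useful": 308445,
}

checked = status["rocksdb_bloom_filter_prefix_checked"]
useful = status["rocksdb_bloom_filter_prefix_useful"]

# Guard against a freshly started instance where nothing has been checked yet.
useful_ratio = useful / checked if checked else 0.0
print(f"prefix bloom filter useful ratio: {useful_ratio:.1%}")
```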
Kernel VM allocation stalls
▪ RocksDB has limited support for O_DIRECT
▪ Too many buffered writes may cause “VM allocation stalls” on
older Linux kernels
▪ Some queries may take more than 1s because of this
▪ Using more %sys
▪ Linux Kernel 4.6 is much better for handling buffered writes
▪ For details: https://lwn.net/Articles/704739/
More information
▪ MyRocks home, documentation:
▪ http://myrocks.io
▪ https://github.com/facebook/mysql-5.6/wiki
▪ Source Repo: https://github.com/facebook/mysql-5.6
▪ Bug Reports: https://github.com/facebook/mysql-5.6/issues
▪ Forum: https://groups.google.com/forum/#!forum/myrocks-dev
Summary
▪ We started migrating from InnoDB to MyRocks in our main database
▪ We run both master and slaves in production
▪ Major motivation was to save space
▪ Online data correctness check tool helped to find lots of data integrity
bugs and prevented inconsistent instances from being deployed in production
▪ Bulk loading greatly reduced compaction stalls
▪ Transactions and crash safety are supported
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0

More Related Content

What's hot (20)

PDF
The Full MySQL and MariaDB Parallel Replication Tutorial
Jean-François Gagné
 
PPTX
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
PPTX
Hive+Tez: A performance deep dive
t3rmin4t0r
 
PDF
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
YugabyteDB
 
PDF
The Parquet Format and Performance Optimization Opportunities
Databricks
 
PDF
Patroni - HA PostgreSQL made easy
Alexander Kukushkin
 
PPTX
Maria db 이중화구성_고민하기
NeoClova
 
PDF
Hadoopのシステム設計・運用のポイント
Cloudera Japan
 
PPTX
When is MyRocks good?
Alkin Tezuysal
 
DOCX
Keepalived+MaxScale+MariaDB_운영매뉴얼_1.0.docx
NeoClova
 
PDF
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
PDF
Introduction to MongoDB
Mike Dirolf
 
PPTX
RocksDB compaction
MIJIN AN
 
PPTX
Bucket your partitions wisely - Cassandra summit 2016
Markus Höfer
 
PDF
Maxscale_메뉴얼
NeoClova
 
PPTX
Sizing MongoDB Clusters
MongoDB
 
PDF
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
PDF
Cassandra Introduction & Features
DataStax Academy
 
PPTX
Introduction to Storm
Chandler Huang
 
PDF
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
The Full MySQL and MariaDB Parallel Replication Tutorial
Jean-François Gagné
 
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
YugabyteDB
 
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Patroni - HA PostgreSQL made easy
Alexander Kukushkin
 
Maria db 이중화구성_고민하기
NeoClova
 
Hadoopのシステム設計・運用のポイント
Cloudera Japan
 
When is MyRocks good?
Alkin Tezuysal
 
Keepalived+MaxScale+MariaDB_운영매뉴얼_1.0.docx
NeoClova
 
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
Introduction to MongoDB
Mike Dirolf
 
RocksDB compaction
MIJIN AN
 
Bucket your partitions wisely - Cassandra summit 2016
Markus Höfer
 
Maxscale_메뉴얼
NeoClova
 
Sizing MongoDB Clusters
MongoDB
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
Cassandra Introduction & Features
DataStax Academy
 
Introduction to Storm
Chandler Huang
 
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 

Similar to MyRocks introduction and production deployment (20)

PPTX
Migrating from InnoDB and HBase to MyRocks at Facebook
MariaDB plc
 
PPTX
M|18 How Facebook Migrated to MyRocks
MariaDB plc
 
PPTX
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive
 
PPTX
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive
 
PPTX
001 hbase introduction
Scott Miao
 
PDF
Database as a Service on the Oracle Database Appliance Platform
Maris Elsins
 
PPTX
NoSQL in Real-time Architectures
Ronen Botzer
 
PPTX
Introduction to TokuDB v7.5 and Read Free Replication
Tim Callaghan
 
PPTX
NoSQL
dbulic
 
PDF
MySQL highav Availability
Baruch Osoveskiy
 
ODP
Vote NO for MySQL
Ulf Wendel
 
PPTX
ZFS for Databases
ahl0003
 
PPTX
In-memory Databases
Robert Friberg
 
PDF
Geospatial Big Data - Foss4gNA
normanbarker
 
ODP
Mysql 2007 Tech At Digg V3
epee
 
PPTX
Storage Engine Wars at Parse
MongoDB
 
PDF
Linux and H/W optimizations for MySQL
Yoshinori Matsunobu
 
PDF
Making the case for write-optimized database algorithms / Mark Callaghan (Fac...
Ontico
 
PDF
High-Performance Storage Services with HailDB and Java
sunnygleason
 
PPTX
Why databases cry at night
Michael Yarichuk
 
Migrating from InnoDB and HBase to MyRocks at Facebook
MariaDB plc
 
M|18 How Facebook Migrated to MyRocks
MariaDB plc
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive
 
001 hbase introduction
Scott Miao
 
Database as a Service on the Oracle Database Appliance Platform
Maris Elsins
 
NoSQL in Real-time Architectures
Ronen Botzer
 
Introduction to TokuDB v7.5 and Read Free Replication
Tim Callaghan
 
NoSQL
dbulic
 
MySQL highav Availability
Baruch Osoveskiy
 
Vote NO for MySQL
Ulf Wendel
 
ZFS for Databases
ahl0003
 
In-memory Databases
Robert Friberg
 
Geospatial Big Data - Foss4gNA
normanbarker
 
Mysql 2007 Tech At Digg V3
epee
 
Storage Engine Wars at Parse
MongoDB
 
Linux and H/W optimizations for MySQL
Yoshinori Matsunobu
 
Making the case for write-optimized database algorithms / Mark Callaghan (Fac...
Ontico
 
High-Performance Storage Services with HailDB and Java
sunnygleason
 
Why databases cry at night
Michael Yarichuk
 
Ad

More from Yoshinori Matsunobu (11)

PPTX
Consistency between Engine and Binlog under Reduced Durability
Yoshinori Matsunobu
 
PDF
データベース技術の羅針盤
Yoshinori Matsunobu
 
PDF
MHA for MySQLとDeNAのオープンソースの話
Yoshinori Matsunobu
 
PDF
Introducing MySQL MHA (JP/LT)
Yoshinori Matsunobu
 
PDF
MySQL for Large Scale Social Games
Yoshinori Matsunobu
 
PDF
Automated master failover
Yoshinori Matsunobu
 
PDF
ソーシャルゲームのためのデータベース設計
Yoshinori Matsunobu
 
PDF
More mastering the art of indexing
Yoshinori Matsunobu
 
PDF
SSD Deployment Strategies for MySQL
Yoshinori Matsunobu
 
PDF
Linux performance tuning & stabilization tips (mysqlconf2010)
Yoshinori Matsunobu
 
PPT
Linux/DB Tuning (DevSumi2010, Japanese)
Yoshinori Matsunobu
 
Consistency between Engine and Binlog under Reduced Durability
Yoshinori Matsunobu
 
データベース技術の羅針盤
Yoshinori Matsunobu
 
MHA for MySQLとDeNAのオープンソースの話
Yoshinori Matsunobu
 
Introducing MySQL MHA (JP/LT)
Yoshinori Matsunobu
 
MySQL for Large Scale Social Games
Yoshinori Matsunobu
 
Automated master failover
Yoshinori Matsunobu
 
ソーシャルゲームのためのデータベース設計
Yoshinori Matsunobu
 
More mastering the art of indexing
Yoshinori Matsunobu
 
SSD Deployment Strategies for MySQL
Yoshinori Matsunobu
 
Linux performance tuning & stabilization tips (mysqlconf2010)
Yoshinori Matsunobu
 
Linux/DB Tuning (DevSumi2010, Japanese)
Yoshinori Matsunobu
 
Ad

Recently uploaded (20)

PDF
IDM Crack with Internet Download Manager 6.42 Build 43 with Patch Latest 2025
bashirkhan333g
 
PDF
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
PPTX
Agentic Automation: Build & Deploy Your First UiPath Agent
klpathrudu
 
PPTX
Change Common Properties in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PPTX
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
PDF
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
PDF
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
PPTX
OpenChain @ OSS NA - In From the Cold: Open Source as Part of Mainstream Soft...
Shane Coughlan
 
PDF
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
PDF
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pdf
Varsha Nayak
 
PDF
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
PPTX
Tally software_Introduction_Presentation
AditiBansal54083
 
PDF
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
PDF
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
PPTX
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
PDF
4K Video Downloader Plus Pro Crack for MacOS New Download 2025
bashirkhan333g
 
PPTX
Homogeneity of Variance Test Options IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PDF
Digger Solo: Semantic search and maps for your local files
seanpedersen96
 
PDF
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
PDF
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
IDM Crack with Internet Download Manager 6.42 Build 43 with Patch Latest 2025
bashirkhan333g
 
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
Agentic Automation: Build & Deploy Your First UiPath Agent
klpathrudu
 
Change Common Properties in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
OpenChain @ OSS NA - In From the Cold: Open Source as Part of Mainstream Soft...
Shane Coughlan
 
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pdf
Varsha Nayak
 
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
Tally software_Introduction_Presentation
AditiBansal54083
 
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
4K Video Downloader Plus Pro Crack for MacOS New Download 2025
bashirkhan333g
 
Homogeneity of Variance Test Options IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
Digger Solo: Semantic search and maps for your local files
seanpedersen96
 
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 

MyRocks introduction and production deployment

  • 1. MyRocks introduction and production deployment at Facebook Yoshinori Matsunobu Production Engineer, Facebook Jun 2017
  • 2. Who am I ▪ Was a MySQL consultant at MySQL (Sun, Oracle) for 4 years ▪ Joined Facebook in March 2012 ▪ MySQL 5.1 -> 5.6 upgrade ▪ Fast master failover without losing data ▪ Partly joined HBase Production Engineering team in 2014.H1 ▪ Started a research project to integrate RocksDB and MySQL from 2014.H2, with MySQL Engineering Team and MariaDB ▪ Started MyRocks production deployment since 2016.H2
  • 3. Agenda ▪ MySQL at Facebook ▪ Issues in InnoDB ▪ RocksDB and MyRocks overview ▪ Production Deployment
  • 4. “Main MySQL Database” at Facebook ▪ Storing Social Graph ▪ Massively Sharded ▪ Petabytes scale ▪ Low latency ▪ Automated Operations ▪ Pure Flash Storage (Constrained by space, not by CPU/IOPS)
  • 5. H/W trends and limitations ▪ SSD/Flash is getting affordable, but MLC Flash is still a bit expensive ▪ HDD: Large enough capacity but very limited IOPS ▪ Reducing read/write IOPS is very important -- Reducing write is harder ▪ SSD/Flash: Great read iops but limited space and write endurance ▪ Reducing space is higher priority
  • 6. InnoDB Issue (1) -- Write Amplification Row Read Modify Write Row Row Row Row Row Row Row Row Row - 1 Byte Modification results in one page write (4~16KB) - InnoDB “Doublewrite” doubles write volume
  • 7. InnoDB Issue (2) -- B+Tree Fragmentation INSERT INTO message_table (user_id) VALUES (31) user_id RowID 1 10000 2 5 … 3 15321 60 431 Leaf Block 1 user_id RowID 1 10000 … 30 333 Leaf Block 1 user_id RowID 31 345 Leaf Block 2 60 431 … Empty Empty
  • 8. InnoDB Issue (3) -- Compression Uncompressed 16KB page Row Row Row Compressed to 5KB Row Row Row Using 8KB space on storage Row Row Row 0~4KB => 4KB 4~8KB => 8KB 8~16KB => 16KB
  • 9. RocksDB ▪ http://rocksdb.org/ ▪ Forked from LevelDB ▪ Key-Value LSM (Log Structured Merge) persistent store ▪ Embedded ▪ Data stored locally ▪ Optimized for fast storage ▪ LevelDB was created by Google ▪ Facebook forked and developed RocksDB ▪ Used at many backend services at Facebook, and many external large services ▪ Needs to write C++ or Java code to access RocksDB
  • 10. RocksDB architecture overview ▪ Leveled LSM Structure ▪ MemTable ▪ WAL (Write Ahead Log) ▪ Compaction ▪ Column Family
  • 11. RocksDB Architecture Write Request Memory Switch Switch Persistent Storage Flush Active MemTable MemTable MemTable WAL WAL WAL File1 File2 File3 File4 File5 File6 Files Compaction
  • 12. Write Path (1) Write Request Memory Switch Switch Persistent Storage Flush Active MemTable MemTable MemTable WAL WAL WAL File1 File2 File3 File4 File5 File6 Files Compaction
  • 13. Write Path (2) Write Request Memory Switch Switch Persistent Storage Flush Active MemTable MemTable MemTable WAL WAL WAL File1 File2 File3 File4 File5 File6 Files Compaction
  • 14. Write Path (3) Write Request Memory Switch Switch Persistent Storage Flush Active MemTable MemTable MemTable WAL WAL WAL File1 File2 File3 File4 File5 File6 Files Compaction
  • 15. Write Path (4) Write Request Memory Switch Switch Persistent Storage Flush Active MemTable MemTable MemTable WAL WAL WAL File1 File2 File3 File4 File5 File6 Files Compaction
  • 16. Read Path Read Request Memory Persistent Storage Active MemTable Bloom Filter MemTableMemTable Bloom Filter WAL WAL WAL File1 File2 File3 File4 File5 File6 Files
  • 17. Read Path Read Request Memory Persistent Storage Active MemTable Bloom Filter MemTableMemTable Bloom Filter WAL WAL WAL File1 File2 File3 File4 File5 File6 Files Index and Bloom Filters cached In Memory File 1 File 3
  • 18. How LSM/RocksDB works INSERT INTO message (user_id) VALUES (31); INSERT INTO message (user_id) VALUES (99999); INSERT INTO message (user_id) VALUES (10000); WAL/ MemTable 10000 99999 31 existing SSTs 10 1 150 50000 … 55000 50001 55700 110000 … …. 31 10000 99999 New SST Compaction Writing new SSTs sequentially N random row modifications => A few sequential reads & writes 10 1 31 … 150 10000 55000 50001 55700 … 99999
  • 19. LSM handles compression better 16MB SST File 5MB SST File Block Block Row Row Row Row Row Row Row Row => Aligned to OS sector (4KB unit) => Negligible OS page alignment overhead
  • 20. Reducing Space/Write Amplification Append Only Prefix Key Encoding Zero-Filling metadata WAL/ Memstore Row Row Row Row Row Row sst id1 id2 id3 100 200 1 100 200 2 100 200 3 100 200 4 id1 id2 id3 100 200 1 2 3 4 key value seq id flag k1 v1 1234561 W k2 v2 1234562 W k3 v3 1234563 W k4 v4 1234564 W Seq id is 7 bytes in RocksDB. After compression, “0” uses very little space key value seq id flag k1 v1 0 W k2 v2 0 W k3 v3 0 W k4 v4 0 W
  • 21. LSM Compaction Algorithm -- Level ▪ For each level, data is sorted by key ▪ Read Amplification: 1 ~ number of levels (depending on cache -- L0~L3 are usually cached) ▪ Write Amplification: 1 + 1 + fanout * (number of levels – 2) / 2 ▪ Space Amplification: 1.11 ▪ 11% is much smaller than B+Tree’s fragmentation
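The two amplification figures can be reproduced with a short calculation. The write amplification function is the slide's formula as written. The space amplification function assumes that each level is `fanout` times larger than the one above it, which is an assumption of this sketch; with a fanout of 10 it gives the 1.11 quoted on the slide.

```python
def write_amplification(fanout, num_levels):
    """The slide's estimate: 1 + 1 + fanout * (number of levels - 2) / 2."""
    return 1 + 1 + fanout * (num_levels - 2) / 2

def space_amplification(fanout, num_levels):
    """Total size of all levels relative to the last level.

    Assumes level sizes shrink by `fanout` going up:
    1 + 1/fanout + 1/fanout^2 + ...
    """
    return sum(fanout ** -i for i in range(num_levels))

wa = write_amplification(fanout=10, num_levels=4)
sa = space_amplification(fanout=10, num_levels=4)
```

With these example inputs the write amplification is 12 and the space amplification is about 1.11, that is, roughly 11% of extra space on top of the last level.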
  • 22. Read Penalty on LSM ▪ Example query: SELECT id1, id2, time FROM t WHERE id1=100 AND id2=100 ORDER BY time DESC LIMIT 1000; with an index on (id1, id2, time) ▪ InnoDB (Branch, Leaf): Range Scan with covering index is done by just reading leaves sequentially, and ordering is guaranteed (very efficient) ▪ RocksDB (MemTable, L0, L1, L2, L3): Merge is needed to do range scan with ORDER BY (L0-L2 are usually cached, but in total it needs more CPU cycles than InnoDB)
  • 23. Bloom Filter MemTable L0 L1 L2 L3 KeyMayExist(id=31) ? false false Checking key may exist or not without reading data, and skipping read i/o if it definitely does not exist
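A minimal Bloom filter sketch shows why a "false" answer lets the read path skip I/O: a key that was added always sets all of its bit positions, so any unset bit proves the key is absent. The class below is illustrative only; its sizes, its hashing scheme and the method name `key_may_exist` (borrowed from the slide's `KeyMayExist`) are assumptions, not RocksDB's implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def key_may_exist(self, key):
        """False means definitely absent; True means the file must be read."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )

bf = BloomFilter()
bf.add(31)
must_read = bf.key_may_exist(31)  # an added key is never reported as absent
```

A "true" answer can still be a false positive, so the filter only saves reads for keys that are not in the file.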
  • 24. Column Family ▪ Query atomicity across different key spaces ▪ Column families: ▪ Separate MemTables and SST files ▪ Share transactional logs ▪ [Diagram: one shared WAL; CF1 and CF2 each with their own Active MemTable, Read Only MemTable(s), Persistent Files and Compaction]
  • 25. What is MyRocks ▪ MySQL on top of RocksDB (RocksDB storage engine) ▪ Taking both LSM advantages and MySQL features ▪ LSM advantage: Smaller space and lower write amplification ▪ MySQL features: SQL, Replication, Connectors and many tools ▪ Open Source, distributed from MariaDB and Percona as well MySQL Clients InnoDB RocksDB Parser Optimizer Replication etc SQL/Connector MySQL http://myrocks.io/
  • 26. MyRocks features ▪ Clustered Index (same as InnoDB) ▪ Transactions, including consistency between binlog and RocksDB ▪ Faster data loading, deletes and replication ▪ Dynamic Options ▪ TTL ▪ Crash Safety ▪ Online logical and binary backup
  • 27. Performance (LinkBench) ▪ Space Usage ▪ QPS ▪ Flash writes per query ▪ HDD ▪ http://smalldatum.blogspot.com/2016/01/myrocks-vs-innodb-with-linkbench-over-7.html
  • 29. QPS
  • 32. Getting Started ▪ Downloading MySQL with MyRocks support (MariaDB, Percona Server) ▪ Configuring my.cnf ▪ Installing MySQL and initializing data directory ▪ Starting mysqld ▪ Creating and manipulating some tables ▪ Shutting down mysqld ▪ https://github.com/facebook/mysql-5.6/wiki/Getting-Started-with-MyRocks
  • 33. my.cnf (minimal configuration) ▪ It’s generally not recommended to mix multiple transactional storage engines within the same instance ▪ Not transactional across engines, not tested enough ▪ Add “allow-multiple-engines” in my.cnf if you really want to mix InnoDB and MyRocks
      [mysqld]
      rocksdb
      default-storage-engine=rocksdb
      skip-innodb
      default-tmp-storage-engine=MyISAM
      collation-server=latin1_bin (or utf8_bin, binary)
      log-bin
      binlog-format=ROW
  • 34. Creating tables ▪ “ENGINE=RocksDB” is the syntax to create RocksDB tables ▪ Setting “default-storage-engine=RocksDB” in my.cnf is fine too ▪ It is generally recommended to have a PRIMARY KEY, though MyRocks allows tables without PRIMARY KEYs ▪ Tables are automatically compressed, without any configuration in DDL. Compression algorithm can be configured via my.cnf
      CREATE TABLE t (
        id INT PRIMARY KEY,
        value1 INT,
        value2 VARCHAR (100),
        INDEX (value1)
      ) ENGINE=RocksDB COLLATE latin1_bin;
  • 35. MyRocks in Production at Facebook
  • 36. MyRocks Goals [Bar charts of Space, IO and CPU utilization against the machine limit. InnoDB in main database: 90%, 15%, 20%. MyRocks in main database: 45%, 12%, 18%, with a second stacked set of 18%, 12%, 45%]
  • 37. MyRocks goals (more details) ▪ Smaller space usage ▪ 50% compared to compressed InnoDB at FB ▪ Better write amplification ▪ Can use more affordable flash storage ▪ Fast, and small enough CPU usage with general purpose workloads ▪ Data large enough that it doesn’t fit in RAM ▪ Point lookup, range scan, full index/table scan ▪ Insert, update, delete by point/range, and occasional bulk inserts ▪ Same or even smaller CPU usage compared to InnoDB at FB ▪ Make it possible to consolidate 2x more instances per machine
  • 38. Facebook MySQL related teams ▪ Production Engineering (SRE/PE) ▪ MySQL Production Engineering ▪ Data Performance ▪ Software Engineering (SWE) ▪ RocksDB Engineering ▪ MySQL Engineering ▪ MyRocks ▪ Others (Replication, Client, InnoDB, etc)
  • 39. Relationship between MySQL PE and SWE ▪ SWE does MySQL server upgrade ▪ Upgrading MySQL revision with several FB patches ▪ Hot fixes ▪ SWE participates in PE oncall ▪ Investigating issues that might have been caused by MySQL server codebase ▪ Core files analysis ▪ Can submit diffs to each other ▪ “Hackaweek” to work on shorter term tasks for a week
  • 40. MyRocks migration -- Technical Challenges ▪ Initial Migration ▪ Creating MyRocks instances without downtime ▪ Loading into MyRocks tables within reasonable time ▪ Verifying data consistency between InnoDB and MyRocks ▪ Continuous Monitoring ▪ Resource Usage like space, iops, cpu and memory ▪ Query plan outliers ▪ Stalls and crashes
  • 41. MyRocks migration -- Technical Challenges (2) ▪ When running MyRocks on master ▪ RBR (Row based binary logging) ▪ Removing queries relying on InnoDB Gap Lock ▪ Robust XA support (binlog and RocksDB)
  • 42. Creating first MyRocks instance without downtime ▪ Picking one of the InnoDB slave instances, then starting logical dump and restore ▪ Stopping one slave does not affect services Master (InnoDB) Slave1 (InnoDB) Slave2 (InnoDB) Slave3 (InnoDB) Slave4 (MyRocks) Stop & Dump & Load
  • 43. Faster Data Loading ▪ Normal write path in MyRocks/RocksDB: Write Requests → MemTable and WAL → (Flush) Level 0 SST → (Compaction) Level 1 SST → … → (Compaction) Level max SST ▪ Faster write path: Write Requests → Level max SST, enabled with “SET SESSION rocksdb_bulk_load=1;” ▪ Original data must be sorted by primary key
  • 44. Data migration steps ▪ Dst) CREATE TABLE … ENGINE=ROCKSDB; (creating MyRocks tables with proper column families) ▪ Dst) ALTER TABLE DROP INDEX; (dropping secondary keys) ▪ Src) STOP SLAVE; ▪ mysqldump --host=innodb-host --order-by-primary | mysql --host=myrocks-host --init-command="set sql_log_bin=0; set rocksdb_bulk_load=1" ▪ Dst) ALTER TABLE ADD INDEX; (adding secondary keys) ▪ Src, Dst) START SLAVE;
  • 45. Data Verification ▪ MyRocks/RocksDB is relatively new database technology ▪ Might have more bugs than robust InnoDB ▪ Ensuring data consistency helps avoid showing conflicting results
  • 46. Verification tests ▪ Index count check between primary key and secondary keys ▪ If any index is broken, it can be detected ▪ SELECT 'PRIMARY', COUNT(*) FROM t FORCE INDEX (PRIMARY) UNION SELECT 'idx1', COUNT(*) FROM t FORCE INDEX (idx1) ▪ Can’t be used if there is no secondary key ▪ Index stats check ▪ Checking that “rows” shown by SHOW TABLE STATUS is not far from the actual row count ▪ Checksum tests w/ InnoDB ▪ Comparing between InnoDB instance and MyRocks instance ▪ Creating a transaction consistent snapshot at the same GTID position, scan, then compare checksum ▪ Shadow correctness check ▪ Capturing read traffic
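The index count check generalizes to any number of secondary keys. The sketch below builds the slide's UNION query for a table and flags indexes whose row count differs from the primary key's count. The helper names are hypothetical, the table and index names are assumed to be trusted identifiers, and running the query against a server is left out.

```python
def index_count_query(table, secondary_indexes):
    """Build one COUNT(*) per index, forcing each index in turn."""
    parts = [f"SELECT 'PRIMARY', COUNT(*) FROM {table} FORCE INDEX (PRIMARY)"]
    parts += [
        f"SELECT '{idx}', COUNT(*) FROM {table} FORCE INDEX ({idx})"
        for idx in secondary_indexes
    ]
    return " UNION ".join(parts)

def mismatched_indexes(rows):
    """Return the index names whose count differs from the primary key's."""
    counts = dict(rows)
    expected = counts["PRIMARY"]
    return sorted(name for name, count in counts.items() if count != expected)

query = index_count_query("t", ["idx1"])
# Hypothetical result rows, as (index name, row count) pairs.
broken = mismatched_indexes([("PRIMARY", 1000), ("idx1", 998)])
```

An empty list from `mismatched_indexes` means every index returned the same number of rows as the primary key.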
  • 47. Shadow traffic tests ▪ We have a shadow test framework ▪ MySQL audit plugin to capture read/write queries from production instances ▪ Replaying them into shadow master instances ▪ Shadow master tests ▪ Client errors ▪ Rewriting queries relying on Gap Lock ▪ gap_lock_raise_error=1, gap_lock_write_log=1
  • 48. Creating second MyRocks instance without downtime Master (InnoDB) Slave1 (InnoDB) Slave2 (InnoDB) Slave3 (MyRocks) Slave4 (MyRocks) myrocks_hotbackup (Online binary backup)
  • 49. Promoting MyRocks as a master Master (MyRocks) Slave1 (InnoDB) Slave2 (InnoDB) Slave3 (InnoDB) Slave4 (MyRocks)
  • 50. Crash Safety ▪ Crash Safety makes operations much easier ▪ Just restarting failed instances can restart replication ▪ No need to rebuild entire instances, as long as data is there ▪ Crash Safe Slave ▪ Crash Safe Master ▪ 2PC (binlog and WAL)
  • 51. Master crash safety settings
      Goal                                       sync-binlog   rocksdb-flush-log-at-trx-commit   rocksdb-enable-2pc   rocksdb-wal-recovery-mode
      No data loss on unplanned machine reboot   1 (default)   1 (default)                       1 (default)          1 (default)
      No data loss on mysqld crash & recovery    0             2                                 1                    2
      No data loss if always failover            0             2                                 0                    2
  • 52. Notable issues fixed during migration ▪ Lots of “Snapshot Conflict” errors ▪ Because of implementation differences in MyRocks (PostgreSQL Style snapshot isolation) ▪ Setting tx-isolation=READ-COMMITTED solved the problem ▪ Slaves stopped with I/O errors on reads ▪ We switched to make MyRocks abort on I/O errors for both reads and writes ▪ Index statistics bugs ▪ Inaccurate row counts/length on bulk loading -> fixed ▪ Cardinality was not updated on fast index creation -> fixed ▪ Crash safety bugs and some crash bugs in MyRocks/RocksDB
  • 53. Preventing stalls ▪ Heavy writes cause lots of compactions, which may cause stalls ▪ Typical write stall cases ▪ Online schema change ▪ Massive data migration jobs that write lots of data ▪ Use fast data loading whenever possible ▪ InnoDB to MyRocks migration can utilize this technique ▪ Can be harder for data migration jobs that write into existing tables
  • 54. When write stalls happen ▪ Estimated number of pending compaction bytes exceeded X bytes ▪ soft|hard_pending_compaction_bytes, default 64GB ▪ Number of L0 files exceeded level0_slowdown|stop_writes_trigger (default 10) ▪ Number of unflushed MemTables exceeded max_write_buffer_number (default 4) ▪ All of these incidents are written to LOG at WARN level ▪ All of these options apply to each column family
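The stall conditions above can be read as a small decision function per column family. The sketch below is a simplification under stated assumptions: the field names mirror the option names on the slide, the limit values are hypothetical example numbers rather than documented defaults, the mapping of each condition to a soft or hard stall follows the slowdown/stop naming, and "exceeded" is treated as "reached or exceeded".

```python
from dataclasses import dataclass

@dataclass
class StallLimits:
    soft_pending_compaction_bytes: int
    hard_pending_compaction_bytes: int
    level0_slowdown_writes_trigger: int
    level0_stop_writes_trigger: int
    max_write_buffer_number: int

def stall_state(limits, pending_compaction_bytes, l0_files, unflushed_memtables):
    """Classify one column family as 'hard' (stop), 'soft' (slowdown) or 'none'."""
    if (
        pending_compaction_bytes >= limits.hard_pending_compaction_bytes
        or l0_files >= limits.level0_stop_writes_trigger
        or unflushed_memtables >= limits.max_write_buffer_number
    ):
        return "hard"
    if (
        pending_compaction_bytes >= limits.soft_pending_compaction_bytes
        or l0_files >= limits.level0_slowdown_writes_trigger
    ):
        return "soft"
    return "none"

GB = 1024 ** 3
limits = StallLimits(  # hypothetical values for illustration only
    soft_pending_compaction_bytes=64 * GB,
    hard_pending_compaction_bytes=256 * GB,
    level0_slowdown_writes_trigger=10,
    level0_stop_writes_trigger=20,
    max_write_buffer_number=4,
)
state = stall_state(limits, pending_compaction_bytes=80 * GB, l0_files=3, unflushed_memtables=1)
```

Because the options apply per column family, a monitoring job would evaluate this once for each column family rather than once per instance.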
  • 55. What happens on write stalls ▪ Soft stalls ▪ COMMIT takes longer time than usual ▪ Total estimated written bytes to MemTable is capped to rocksdb_delayed_write_rate, until slowdown conditions are cleared ▪ Default is 16MB/s (previously 2MB/s) ▪ Hard stalls ▪ All writes are blocked at COMMIT, until stop conditions are cleared
  • 56. Mitigating Write Stalls ▪ Speed up compactions ▪ Use faster compression algorithm (LZ4 for higher levels, ZSTD in the bottommost) ▪ Increase rocksdb_max_background_compactions ▪ Reduce total bytes written ▪ But avoid using too strong a compression algorithm on upper levels ▪ Use more write efficient compaction algorithm ▪ compaction_pri=kMinOverlappingRatio ▪ Delete files slowly on Flash ▪ Deleting too many large files causes TRIM stalls on Flash ▪ MyRocks has an option to throttle sst file deletion speed ▪ Binlog file deletions should be slowed down as well
  • 57. Monitoring ▪ MyRocks files ▪ SHOW ENGINE ROCKSDB STATUS ▪ SHOW ENGINE ROCKSDB TRANSACTION STATUS ▪ LOG files ▪ information_schema tables ▪ sst_dump tool ▪ ldb tool
  • 58. SHOW ENGINE ROCKSDB STATUS ▪ Column Family Statistics, including size, read and write amp per level ▪ Memory usage
      *************************** 7. row ***************************
      Type: CF_COMPACTION
      Name: default
      Status:
      ** Compaction Stats [default] **
      Level  Files  Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
      L0       2/0     51.58   0.5      0.0    0.0      0.0       0.3      0.3       0.0   0.0      0.0     40.3         7        10    0.669     0       0
      L3       6/0    109.36   0.9      0.7    0.7      0.0       0.6      0.6       0.0   0.9     43.8     40.7        16         3    5.172 7494K    297K
      L4      61/0   1247.31   1.0      2.0    0.3      1.7       2.0      0.2       0.0   6.9     49.7     48.5        41         9    4.593   15M    176K
      L5     989/0  12592.86   1.0      2.0    0.3      1.8       1.9      0.1       0.0   7.4      8.1      7.4       258         8   32.209   17M    726K
      L6    4271/0 127363.51   0.0      0.0    0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000     0       0
      Sum   5329/0 141364.62   0.0      4.7    1.2      3.5       4.7      1.2       0.0  17.9     15.0     15.0       321        30   10.707   41M   1200K
  • 59. SHOW GLOBAL STATUS
      mysql> show global status like 'rocksdb%';
      +---------------------------------------+-------------+
      | Variable_name                         | Value       |
      +---------------------------------------+-------------+
      | rocksdb_rows_deleted                  | 216223      |
      | rocksdb_rows_inserted                 | 1318158     |
      | rocksdb_rows_read                     | 7102838     |
      | rocksdb_rows_updated                  | 1997116     |
      ....
      | rocksdb_bloom_filter_prefix_checked   | 773124      |
      | rocksdb_bloom_filter_prefix_useful    | 308445      |
      | rocksdb_bloom_filter_useful           | 10108448    |
      ....
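One way to use these counters is to track how often the prefix Bloom filter pays off. The sketch below divides `rocksdb_bloom_filter_prefix_useful` by `rocksdb_bloom_filter_prefix_checked`, using the values from the output above. Reading "useful" as "the check allowed a read to be skipped" is an assumption of this sketch, and the helper name is hypothetical.

```python
def bloom_prefix_hit_ratio(status):
    """Fraction of prefix Bloom filter checks that were counted as useful."""
    checked = int(status["rocksdb_bloom_filter_prefix_checked"])
    useful = int(status["rocksdb_bloom_filter_prefix_useful"])
    return useful / checked if checked else 0.0

# Values copied from the SHOW GLOBAL STATUS output on this slide.
status = {
    "rocksdb_bloom_filter_prefix_checked": "773124",
    "rocksdb_bloom_filter_prefix_useful": "308445",
}
ratio = bloom_prefix_hit_ratio(status)
```

For the sample output the ratio comes out at roughly 40%. Since these are cumulative counters, a monitoring job would normally compute the ratio on the difference between two samples.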
  • 60. Kernel VM allocation stalls ▪ RocksDB has limited support for O_DIRECT ▪ Too many buffered writes may cause “VM allocation stalls” on older Linux kernels ▪ Some queries may take more than 1s because of this ▪ Using more %sys ▪ Linux Kernel 4.6 is much better for handling buffered writes ▪ For details: https://lwn.net/Articles/704739/
  • 61. More information ▪ MyRocks home, documentation: ▪ http://myrocks.io ▪ https://github.com/facebook/mysql-5.6/wiki ▪ Source Repo: https://github.com/facebook/mysql-5.6 ▪ Bug Reports: https://github.com/facebook/mysql-5.6/issues ▪ Forum: https://groups.google.com/forum/#!forum/myrocks-dev
  • 62. Summary ▪ We started migrating from InnoDB to MyRocks in our main database ▪ We run both master and slaves in production ▪ Major motivation was to save space ▪ Online data correctness check tool helped to find lots of data integrity bugs and prevented us from deploying inconsistent instances in production ▪ Bulk loading greatly reduced compaction stalls ▪ Transactions and crash safety are supported
  • 63. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0