FlashCache



Mohan Srinivasan
April 2011
FlashCache at Facebook
▪   What
        ▪   We want to use some Flash storage on existing servers
        ▪   We want something that is simple to deploy and use
        ▪   Our IO access patterns benefit from a cache

▪   Who
    ▪   Mohan Srinivasan – design and implementation
    ▪   Paul Saab – platform and MySQL integration
    ▪   Michael Jiang – testing, performance and capacity planning
    ▪   Mark Callaghan - benchmarketing
Introduction
▪   Block cache for Linux - write back and write through modes
▪   Layered below the filesystem at the top of the storage stack
▪   Cache Disk Blocks on fast persistent storage (Flash, SSD)
▪   Loadable Linux Kernel module, built using the Device Mapper (DM)
▪   Primary use case InnoDB, but general purpose
▪   Based on dm-cache by Prof. Ming
Caching Modes
Write Back                              Write Through, Write Around
 ▪   Lazy writing to disk                ▪   Non-persistent
 ▪   Persistent across reboot            ▪   Are you a pessimist?
 ▪   Persistent across device removal
Cache Structure
▪   Set associative hash
▪   Hash with fixed sized buckets (sets) with linear probing within a set
▪   512-way set associative by default
▪   dbn: Disk Block Number, address of block on disk
▪   Set = (dbn / block size / set size) mod (number of sets)
▪   A sequential range of dbns maps onto a single set
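
A minimal sketch of the set computation above, in Python. It assumes that dbns and the block size are both counted in 512-byte sectors and that "set size" means the associativity (512 blocks per set). The function name and the example cache geometry are hypothetical, not taken from the FlashCache source.

```python
SECTOR_BYTES = 512  # assumption: dbns are counted in 512-byte sectors


def cache_set_for_dbn(dbn, block_sectors, set_size, num_sets):
    """Set = (dbn / block size / set size) mod (number of sets)."""
    block_index = dbn // block_sectors        # which block-sized unit on disk
    return (block_index // set_size) % num_sets


# Hypothetical cache: 4 KB blocks (8 sectors), 512-way sets, 1000 sets.
BLOCK_SECTORS = 4096 // SECTOR_BYTES
SET_SIZE = 512
NUM_SETS = 1000

# A run of 512 consecutive disk blocks lands in one set ...
first = cache_set_for_dbn(0, BLOCK_SECTORS, SET_SIZE, NUM_SETS)
last = cache_set_for_dbn(511 * BLOCK_SECTORS, BLOCK_SECTORS, SET_SIZE, NUM_SETS)
# ... and the next disk block starts the following set.
nxt = cache_set_for_dbn(512 * BLOCK_SECTORS, BLOCK_SECTORS, SET_SIZE, NUM_SETS)
```

Dividing by the set size before taking the modulus is what keeps a sequential range of dbns together in one set.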
Cache Structure
[Diagram: the cache is divided into N sets (Set 0 through Set N-1), N sets' worth of blocks in total. Each cache set holds 512 blocks, Block 0 through Block 511.]
Replacement and Memory Footprint
▪   Replacement policy is FIFO (default) or LRU within a set
▪   Switch on the fly between FIFO/LRU (sysctl)
▪   Metadata per cache block: 16 bytes in memory, 16 bytes on ssd
▪   On-ssd metadata per-slot:
     ▪   <dbn, block state>

▪   In memory metadata per-slot:
     ▪   <dbn, block state, LRU chain pointers, misc>
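
A toy model of FIFO versus LRU replacement within one set, in Python. The class and its fields are hypothetical; the real module keeps per-slot metadata and LRU chain pointers rather than an ordered dictionary. The sketch only shows how the two policies pick a different victim.

```python
from collections import OrderedDict


class CacheSet:
    """One fixed-size set; policy is 'fifo' or 'lru' (hypothetical model)."""

    def __init__(self, ways, policy="fifo"):
        self.ways = ways
        self.policy = policy          # can be flipped at any time
        self.blocks = OrderedDict()   # dbn -> block state, oldest entry first

    def lookup(self, dbn):
        hit = dbn in self.blocks
        if hit and self.policy == "lru":
            self.blocks.move_to_end(dbn)   # LRU: a hit refreshes the block
        return hit

    def insert(self, dbn):
        """Insert a new dbn, returning the evicted dbn (or None)."""
        victim = None
        if len(self.blocks) >= self.ways:
            victim, _ = self.blocks.popitem(last=False)  # evict oldest entry
        self.blocks[dbn] = "VALID"
        return victim


def victim_after_hit(policy):
    s = CacheSet(ways=2, policy=policy)
    s.insert(1)
    s.insert(2)
    s.lookup(1)          # touch block 1 before the set overflows
    return s.insert(3)
```

With FIFO the touched block 1 is still evicted first; with LRU the untouched block 2 goes instead.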
Reads
▪   Compute cache set for dbn
▪   Cache Hit
     ▪   Verify checksums if configured
     ▪   Serve read out of cache

▪   Cache Miss
     ▪   Find free block or reclaim block based on replacement policy
     ▪   Read block from disk and populate cache
     ▪   Update block checksum if configured
     ▪   Return data to user
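
The read path above can be sketched in Python as follows. This is a simplified model, not the kernel code: the cache, checksum table, and disk are plain dictionaries keyed by dbn, the set lookup and block reclaim steps are left out, and CRC32 stands in for whatever checksum FlashCache actually uses.

```python
import zlib


def read_block(dbn, cache, checksums, disk, use_checksums=True):
    """Return (data, hit) for one block-sized read."""
    if dbn in cache:                                    # cache hit
        data = cache[dbn]
        if use_checksums and zlib.crc32(data) != checksums[dbn]:
            raise IOError("checksum mismatch for cached block %d" % dbn)
        return data, True                               # serve from cache
    data = disk[dbn]                                    # cache miss: read disk
    cache[dbn] = data                                   # populate the cache
    if use_checksums:
        checksums[dbn] = zlib.crc32(data)               # update block checksum
    return data, False                                  # return data to user
```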
Write Through - writes
▪   Compute cache set for dbn
▪   Cache hit
      ▪   Get cached block

▪   Cache miss
      ▪   Find free block or reclaim block

▪   Write data block to disk
▪   Write data block to cache
▪   Update block checksum
Write Back - writes
▪   Compute cache set for dbn
▪   Cache Hit
     ▪   Write data block into cache
     ▪   If data block not DIRTY, synchronously update on-ssd cache metadata to
         mark block DIRTY

▪   Cache miss
     ▪   Find free block or reclaim block based on replacement policy
     ▪   Write data block to cache
     ▪   Synchronously update on-ssd cache metadata to mark block DIRTY
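
A toy Python model of the Write Back write path, counting the synchronous on-ssd metadata updates. The class and the `clean` helper are hypothetical and ignore sets, reclaim, and batching; they only illustrate when a metadata write is needed.

```python
class WriteBackCache:
    """Toy model that counts synchronous on-ssd metadata updates."""

    def __init__(self):
        self.data = {}             # dbn -> cached data block
        self.state = {}            # dbn -> "VALID" or "DIRTY"
        self.metadata_writes = 0   # synchronous on-ssd metadata updates

    def write(self, dbn, block):
        self.data[dbn] = block                  # write data block into cache
        if self.state.get(dbn) != "DIRTY":      # miss, or hit on a clean block
            self.state[dbn] = "DIRTY"
            self.metadata_writes += 1           # mark DIRTY on the ssd

    def clean(self, dbn, disk):
        """Lazily write a dirty block to disk and mark it VALID again."""
        if self.state.get(dbn) == "DIRTY":
            disk[dbn] = self.data[dbn]
            self.state[dbn] = "VALID"
            self.metadata_writes += 1           # DIRTY->VALID
```

Repeated writes to a block that is already DIRTY cost no extra metadata update, which is why the average overhead is lower than the worst case.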
Small or uncacheable requests
▪   First invalidate blocks that overlap the requests
      ▪   There are at most 2 such blocks
      ▪   For Write Back, if the overlapping blocks are DIRTY they are cleaned
          first then invalidated

▪   Uncacheable full block reads are served from cache in case of a cache
    hit.
▪   Perform disk IO
▪   Repeat invalidation to close races which might have caused the block
    to be cached while the disk IO was in progress
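
A short Python check of the "at most 2 such blocks" claim. It assumes cache blocks are aligned to the block size and that the request is smaller than one block; the function name is hypothetical.

```python
def overlapping_blocks(start_sector, num_sectors, block_sectors):
    """Return the starting dbns of the aligned blocks a request touches."""
    first = (start_sector // block_sectors) * block_sectors
    last = ((start_sector + num_sectors - 1) // block_sectors) * block_sectors
    return list(range(first, last + 1, block_sectors))


# With 8-sector (4 KB) blocks, a 4-sector request starting at sector 6
# straddles the boundary between the blocks at dbn 0 and dbn 8.
straddling = overlapping_blocks(6, 4, 8)

# Largest number of blocks touched by any request shorter than one block.
worst = max(
    len(overlapping_blocks(start, n, 8))
    for start in range(64)
    for n in range(1, 8)
)
```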
Write Back policy
▪   Dirty blocks not recently accessed
      ▪   A clock-like algorithm picks off Dirty blocks not accessed in the last
          15 minutes (configurable) for cleaning
▪   When the number of dirty blocks in a set exceeds a configurable threshold,
    clean some blocks
      ▪   Blocks selected for writeback based on replacement policy
      ▪   Default dirty threshold 20%. Set higher for write heavy workloads

▪   Sort the selected blocks and pick up any other blocks in the set that can be
    contiguously merged with them
▪   Writes merged by the IO scheduler
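
A rough Python sketch of two of the steps above: picking blocks once a set is over its dirty threshold, and sorting them into contiguous runs for writeback. Both helpers are hypothetical. The first assumes the dirty blocks are already listed in replacement order and that only the excess over the threshold is cleaned, which the slides do not state. Picking up extra neighbouring blocks is not modelled.

```python
def blocks_to_clean(dirty_in_order, set_size, dirty_thresh_pct=20):
    """Pick the dirty blocks that exceed the per-set threshold."""
    allowed = set_size * dirty_thresh_pct // 100
    excess = max(0, len(dirty_in_order) - allowed)
    return dirty_in_order[:excess]


def contiguous_runs(dbns, block_sectors):
    """Sort dbns and group neighbours that are exactly one block apart."""
    runs = []
    for dbn in sorted(dbns):
        if runs and dbn == runs[-1][-1] + block_sectors:
            runs[-1].append(dbn)     # extends the current run
        else:
            runs.append([dbn])       # starts a new run
    return runs


# 110 dirty blocks in a 512-way set with a 20% threshold (102 allowed).
dirty = list(range(0, 8 * 110, 8))
selected = blocks_to_clean(dirty, set_size=512)
runs = contiguous_runs([24, 0, 8, 40], block_sectors=8)
```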
Write Back – overheads
▪   In-Memory cache metadata memory footprint
     ▪   300GB/4KB cache -> ~1.2GB
     ▪   160GB/4KB cache -> ~640MB

▪   Cache metadata writes/file system write
     ▪   Worst case is 2 cache metadata updates per write
           ▪   (VALID->DIRTY, DIRTY->VALID)
     ▪   Average case is much lower because of cache write hits and batching of
         cache metadata updates
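
The footprint figures above follow from 16 bytes of in-memory metadata per cache block. A small Python check, assuming the sizes are binary (GiB and MiB); the helper name is hypothetical.

```python
GIB = 2 ** 30
MIB = 2 ** 20


def metadata_bytes(cache_bytes, block_bytes=4096, per_block_bytes=16):
    """In-memory metadata size for a cache with fixed-size blocks."""
    return (cache_bytes // block_bytes) * per_block_bytes


footprint_300 = metadata_bytes(300 * GIB) // MIB   # in MiB
footprint_160 = metadata_bytes(160 * GIB) // MIB   # in MiB
```

Under these assumptions the 300GB cache needs 1200 MiB (about 1.2GB) and the 160GB cache needs 640 MiB.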
Write Through/Around – cache overheads
▪   In-Memory Cache metadata footprint
     ▪   300GB/4KB cache -> ~1.2GB
     ▪   160GB/4KB cache -> ~640MB

▪   Cache metadata writes per file system write
     ▪   1 cache data write per file system write (Write Through)
     ▪   No overhead (for Write Around)
Write Back – metadata updates
▪   Cache (on-ssd) metadata only updated on writes and block cleanings
    (VALID->DIRTY or DIRTY->VALID)
▪   Cache (on-ssd) metadata not updated on cache population for reads
▪   Reload after an unclean shutdown only loads DIRTY blocks
▪   Fast and Slow cache shutdowns
     ▪   Only metadata is written on fast shutdown. Reload loads both dirty and
         clean blocks
     ▪   Slow shutdown writes all dirty blocks to disk first, then writes out
         metadata to the ssd. Reload only loads clean blocks.

▪   Metadata updates to multiple blocks in same sector are batched
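
A small Python model of which blocks come back after each kind of shutdown, as described above. The function and the shutdown labels are hypothetical, and the model takes the slide at face value: after an unclean shutdown only blocks that were DIRTY are reloaded.

```python
def blocks_after_reload(state_before, shutdown):
    """state_before maps dbn -> 'VALID' or 'DIRTY' just before shutdown."""
    if shutdown == "unclean":
        # Only DIRTY blocks are reloaded after an unclean shutdown.
        return {d: s for d, s in state_before.items() if s == "DIRTY"}
    if shutdown == "fast":
        # Metadata is written out as-is: dirty and clean blocks both return.
        return dict(state_before)
    if shutdown == "slow":
        # Dirty blocks are written to disk first, so everything is clean.
        return {d: "VALID" for d in state_before}
    raise ValueError("unknown shutdown type: %r" % shutdown)


before = {0: "VALID", 8: "DIRTY", 16: "VALID"}
```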
Torn Page Problem
▪   Handle partial block write caused by power failure or other causes
▪   Problem exists for Flashcache in Write Back mode
▪   Detected via block checksums
     ▪   Checksums are disabled by default
     ▪   Pages with bad checksums are not used

▪   Checksums increase cache metadata writes and memory footprint
     ▪   Update cache metadata checksums on DIRTY->DIRTY block transitions
         for Write Back
     ▪   Each per-cache slot grows by 8 bytes to hold the checksum (a 50%
         increase from 16 bytes to 24 bytes for the Write Back case).
Cache controls for Write Back
▪   Work best with O_DIRECT file access
▪   Global modes – Cache All or Cache Nothing
     ▪   Cache All has a blacklist of pids and tgids
     ▪   Cache Nothing has a whitelist of pids and tgids

▪   tgids can be used to tag all pthreads in the group as cacheable
▪   Exceptions for threads within a group are supported
▪   List changes done via FlashCache ioctls
▪   Cache can be read but is not written for non-cacheable tgids and pids
▪   We modified MySQL and scp to use this support
Cache Nothing policy
▪   If the thread id is whitelisted, cache all IOs for this thread
▪   If the tgid is whitelisted, cache all IOs for this thread
▪   If the thread id is blacklisted do not cache IOs
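
One way to read the three rules above as a single decision, sketched in Python. The precedence used here (a whitelisted thread id first, then a whitelisted tgid unless the thread is blacklisted) is an assumption based on the order of the bullets; the function name and the ids are hypothetical.

```python
def cacheable_in_cache_nothing(pid, tgid, whitelist, blacklist):
    """Decide whether IO from thread `pid` in group `tgid` is cached."""
    if pid in whitelist:                          # thread id is whitelisted
        return True
    if tgid in whitelist and pid not in blacklist:
        return True                               # group whitelisted, no exception
    return False                                  # Cache Nothing is the default


# Hypothetical ids: a server process with tgid 100 and worker threads 101, 102.
whitelist = {100}
blacklist = {102}   # e.g. a thread running a full table scan
```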
Cache control example
▪   We use Cache Nothing mode for MySQL servers
▪   The mysqld tgid is added to the whitelist
     ▪   All IO done by it is cacheable
     ▪   Writes done by other processes do not update the cache

▪   Full table scans done by mysqldump use a hint that directs mysqld to
    add the query’s thread id to the blacklist to avoid wiping FlashCache
     ▪   select /* SQL_NO_FCACHE */ pk, col1, col2 from foobar
Utilities
▪   flashcache_create
     ▪   flashcache_create -b 4k -s 10g mysql /dev/flash /dev/disk

▪   flashcache_destroy
     ▪   flashcache_destroy /dev/flash

▪   flashcache_load
sysctl -a | grep flash
dev.flashcache.cache_all = 0
dev.flashcache.fast_remove = 0
dev.flashcache.zero_stats = 1
dev.flashcache.write_merge = 1
dev.flashcache.reclaim_policy = 0
dev.flashcache.pid_expiry_secs = 60
dev.flashcache.max_pids = 100
dev.flashcache.do_pid_expiry = 0
dev.flashcache.max_clean_ios_set = 2
dev.flashcache.max_clean_ios_total = 4
dev.flashcache.debug = 0
dev.flashcache.dirty_thresh_pct = 20
dev.flashcache.stop_sync = 0
dev.flashcache.do_sync = 0
Removing FlashCache
▪   umount /data
▪   dmsetup remove mysql
▪   flashcache_destroy /dev/flash
cat /proc/flashcache_stats
reads=4 writes=0 read_hits=0 read_hit_percent=0 write_hits=0
 write_hit_percent=0 dirty_write_hits=0 dirty_write_hit_percent=0
 replacement=0 write_replacement=0 write_invalidates=0
 read_invalidates=0 pending_enqueues=0 pending_inval=0
 metadata_dirties=0 metadata_cleans=0 cleanings=0 no_room=0
 front_merge=0 back_merge=0 nc_pid_adds=0 nc_pid_dels=0
 nc_pid_drops=0 nc_expiry=0 disk_reads=0 disk_writes=0
 ssd_reads=0 ssd_writes=0 uncached_reads=169
 uncached_writes=128
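
The counters above are plain key=value pairs, so they are easy to post-process. A minimal Python sketch, using a shortened copy of the sample output as input; the helper names are hypothetical.

```python
def parse_flashcache_stats(text):
    """Turn 'key=value key=value ...' into a dict of integers."""
    stats = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:                       # skip anything that is not key=value
            stats[key] = int(value)
    return stats


def read_hit_percent(stats):
    """Recompute the read hit percentage, guarding against zero reads."""
    reads = stats.get("reads", 0)
    return 100 * stats.get("read_hits", 0) // reads if reads else 0


sample = ("reads=4 writes=0 read_hits=0 read_hit_percent=0 "
          "uncached_reads=169 uncached_writes=128")
stats = parse_flashcache_stats(sample)
```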
Future Work
▪   Cache mirroring
        ▪   SW RAID 1 block device as a cache

▪   Online cache resize
        ▪   No shutdown and recreate

▪   Support for ATA trim
    ▪   Discard blocks no longer in use
▪   Fix the torn page problem
    ▪   Use shadow pages
Resources
▪   GitHub : facebook/flashcache
▪   Mailing list : flashcache-dev@googlegroups.com
▪   http://facebook.com/MySQLatFacebook
▪   Email :
        ▪   mohan@fb.com (Mohan Srinivasan)
        ▪   ps@fb.com (Paul Saab)
        ▪   michael@fb.com (Michael Jiang)
        ▪   mcallaghan@fb.com (Mark Callaghan)
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
