Voldemort on Solid State Drives

Vinoth Chandar, Lei Gao, Cuong Tran
LinkedIn Corporation, Mountain View, CA

Abstract
Voldemort is LinkedIn’s open-source implementation of Amazon Dynamo, providing fast, scalable, fault-tolerant access to key-value data. Voldemort is widely used by applications at LinkedIn that demand a large number of IOPS. Solid State Drives (SSDs) are becoming an attractive option to speed up data access. In this paper, we describe our experience with garbage collection (GC) issues on Voldemort server nodes after migrating to SSDs. Based on these experiences, we provide an intuition for caching strategies with SSD storage.

1. Introduction
Voldemort [1] is a distributed key-value storage system based on Amazon Dynamo. It exposes a very simple get(k), put(k,v), delete(k) interface that allows for pluggable serialization, routing, and storage engines. Voldemort serves a substantial amount of site traffic at LinkedIn for applications like ‘Skills’, ‘People You May Know’, ‘Company Follow’, and ‘LinkedIn Share’, serving thousands of operations/sec over several terabytes of data. It is also widely adopted at companies such as Gilt Group, EHarmony, Nokia, Jive Software, WealthFront, and Mendeley.
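To make this interface concrete, the following sketch shows how a client might issue these operations through Voldemort's Java client library; the bootstrap URL, store name, and key/value contents are illustrative placeholders, and exact class names may vary slightly across Voldemort releases.

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortClientExample {
    public static void main(String[] args) {
        // Bootstrap URL and store name are placeholders for illustration only.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test-store");

        client.put("member:1234", "distributed systems");    // put(k, v)
        Versioned<String> value = client.get("member:1234");  // get(k), returns a vector-clocked value
        if (value != null) {
            System.out.println(value.getValue());
        }
        client.delete("member:1234");                          // delete(k)
        factory.close();
    }
}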

Due to the simple key-value access pattern, single-node Voldemort server performance is typically bound by IOPS, with plenty of CPU cycles to spare. Hence, Voldemort clusters at LinkedIn were migrated to SSDs to increase single-node capacity. The migration has proven fruitful, although it unearthed a set of interesting GC issues that led us to rethink our caching strategy with SSDs. The rest of the paper is organized as follows. Section 2 describes the software stack for a single Voldemort server. Section 3 describes the impact of the SSD migration on single-server performance and details ways to mitigate Java GC issues. Section 3 also explores leveraging SSDs to alleviate caching problems. Section 4 concludes.

2. Single Server Stack
The server uses an embedded, log-structured, Java-based storage engine: Oracle BerkeleyDB Java Edition (BDB JE) [2]. BDB employs an LRU cache on the JVM heap and relies on Java garbage collection for managing its memory. Loosely speaking, the cache is a set of references to index and data objects. Cache eviction happens simply by releasing those references for garbage collection. A single cluster serves a large number of applications, and hence objects of very different sizes share the same BDB cache. The server also has a background thread that enforces the data retention policy by periodically deleting stale entries.
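For illustration, the sketch below opens a BDB JE environment with a fixed cache budget, the way an embedded store like this might be set up; the directory, cache size, and store name are assumptions for the example, not the actual production settings.

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class SingleServerStack {
    static Environment openEnvironment(File dataDir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);
        // The LRU cache lives on the JVM heap; JE "evicts" an entry simply by
        // dropping its reference and letting the garbage collector reclaim it.
        envConfig.setCacheSize(10L * 1024 * 1024 * 1024); // e.g. a 10GB budget (illustrative)
        return new Environment(dataDir, envConfig);
    }

    static Database openStore(Environment env, String storeName) {
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        return env.openDatabase(null, storeName, dbConfig);
    }
}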

3. SSD Performance Implications
With plenty of IOPS at hand, the allocation rates went up causing very frequent GC pauses, moving the
bottleneck from IO to garbage collection. After migrating to SSD, the average latency greatly improved
from 20ms to 2ms. Speed of cluster expansion and data restoration has improved 10x. However, the 95th
and 99th percentile latencies shot up from 30ms to 130ms and 240ms to 380ms respectively, due to a host
of garbage collection issues, detailed below.

3.1 Need for End-to-End Correlation
By developing tools to correlate Linux paging statistics from SAR with GC pauses, we discovered that Linux was stealing pages from the JVM heap, resulting in 4-second minor pauses. Subsequent promotions into the old generation incur page scans, causing long pauses with a high system-time component. Hence, it is imperative to mlock() the server heap to prevent it from being swapped out. Also, we experienced higher system time in lab experiments, since not all of the virtual address space of the JVM heap had been mapped to physical pages. Thus, using the AlwaysPreTouch JVM option is imperative for any ‘Big Data’ benchmarking tool to reproduce the same memory conditions as in the real world. This exercise stressed the importance of developing performance tools that can identify interesting patterns by correlating performance data across the entire stack.
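As a sketch of what such cross-stack tooling can look like, the snippet below joins GC pauses parsed from the JVM GC log with paging samples parsed from sar, flagging pauses that coincide with heavy page scanning; the record formats, time window, and threshold are hypothetical and meant only to illustrate the correlation idea.

import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class GcPagingCorrelator {

    /** A GC pause parsed from the JVM GC log: timestamp and pause length in seconds. */
    record GcPause(Instant at, double seconds) {}

    /** A paging sample parsed from `sar -B` output: timestamp and pages scanned per second. */
    record PagingSample(Instant at, double pgscanPerSec) {}

    /** Print GC pauses that overlap a window of heavy page scanning by the kernel. */
    static void correlate(List<GcPause> pauses, List<PagingSample> samples,
                          Duration window, double pgscanThreshold) {
        for (GcPause pause : pauses) {
            for (PagingSample sample : samples) {
                boolean nearby = Duration.between(sample.at(), pause.at()).abs()
                                         .compareTo(window) <= 0;
                if (nearby && sample.pgscanPerSec() > pgscanThreshold) {
                    System.out.printf("pause of %.2fs at %s coincides with pgscan/s=%.0f at %s%n",
                            pause.seconds(), pause.at(), sample.pgscanPerSec(), sample.at());
                }
            }
        }
    }
}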

3.2 SSD-Aware Caching
Promotion failures with huge 25-second pauses during the retention job prompted us to rethink the caching strategy with SSDs. The retention job walks the entire BDB database without any throttling. With very fast SSDs, this translates into rapid 200MB allocations and promotions, while simultaneously kicking objects out of the LRU cache in the old generation. Since the server is multitenant, hosting objects of different sizes, this leads to heavy fragmentation. Real workloads almost always have ‘hot sets’ that live in the old generation, and any incoming traffic that drastically changes the hot set is likely to run into this issue. The issue was very difficult to reproduce since it depended heavily on the state of the old generation, highlighting the need for performance test infrastructures that can replay real-world traffic. We managed to reproduce the problem by roughly matching the cache miss rates seen in production. We solved the problem by forcing BDB to evict data objects brought in by the retention job right away, such that they are collected in the young generation and never promoted.
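A minimal sketch of that fix, assuming a reasonably recent BDB JE release: the retention cursor is switched to an evict-after-use cache mode so the data records it touches are dropped from the cache immediately and die in the young generation. The iteration and staleness check below are simplified placeholders, not the production retention logic.

import com.sleepycat.je.CacheMode;
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class RetentionJob {

    /** Walk the whole store and delete entries that have outlived the retention horizon. */
    static void purgeStaleEntries(Database db, long retentionHorizonMs) {
        Cursor cursor = db.openCursor(null, null);
        try {
            // Evict each data record (leaf node) from the BDB cache as soon as the
            // cursor moves past it, so retention-scan objects are never promoted.
            cursor.setCacheMode(CacheMode.EVICT_LN);

            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry value = new DatabaseEntry();
            while (cursor.getNext(key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                if (isStale(value, retentionHorizonMs)) { // hypothetical staleness check
                    cursor.delete();
                }
            }
        } finally {
            cursor.close();
        }
    }

    private static boolean isStale(DatabaseEntry value, long retentionHorizonMs) {
        // Placeholder: a real implementation would decode the stored timestamp.
        return false;
    }
}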

In fact, we plan to cache only the index nodes on the JVM heap even for regular traffic. This will help fight fragmentation and achieve predictable multitenant deployments. Lab results have shown that this approach can deliver comparable performance, owing to the speed of SSDs and the uniformly sized index objects. This approach also reduces the promotion rate, increasing the chances that the CMS initial mark is scheduled after a minor collection, which improves initial mark time as described in the next section. The approach is applicable even to systems that manage their own memory, since fragmentation is a general issue.
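A sketch of that index-only caching idea, again assuming BDB JE's per-database cache modes: leaf data records are evicted as soon as the operation that touched them completes, leaving only the uniformly sized Btree index nodes resident on the heap. The configuration shown is illustrative rather than the exact production setting.

import com.sleepycat.je.CacheMode;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;

public class IndexOnlyCaching {

    /** Open a store that keeps Btree index nodes cached but evicts data records eagerly. */
    static Database openIndexCachedStore(Environment env, String storeName) {
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        // EVICT_LN evicts each leaf data record (LN) after the operation completes,
        // relying on the SSD for data reads while the index stays on the heap;
        // uniformly sized index objects keep old-generation fragmentation low.
        dbConfig.setCacheMode(CacheMode.EVICT_LN);
        return env.openDatabase(null, storeName, dbConfig);
    }
}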

3.3 Reducing the Cost of CMS Initial Mark
Assuming we can control fragmentation, yielding control back to the JVM to schedule CMS adaptively based on promotion rate can help cut down initial mark times. Even when evicting data objects right away, the high SSD read rates can cause heavy promotion of index objects. Under such circumstances, the CMS initial mark might be scheduled when the young generation is not empty, resulting in a 1.2-second CMS initial mark pause on a 2GB young generation. We found that by increasing CMSInitiatingOccupancyFraction to a higher value (90), CMS was scheduled much closer to minor collections, when the young generation is empty or small, reducing the maximum initial mark time to 0.4 seconds.

4. Conclusion
With SSDs, we find that garbage collection becomes a very significant bottleneck, especially for systems that have little control over the storage layer and rely on Java memory management. Large heaps make garbage collection expensive, especially the single-threaded CMS initial mark. We believe that data systems must revisit their caching strategies with SSDs. In this regard, SSDs have provided an efficient solution for handling fragmentation and moving towards predictable multitenancy.

References
[1] http://project-voldemort.com/
[2] http://www.oracle.com/technetwork/database/berkeleydb/overview/index-093405.html	
  
