Dealing with JVM limitations
in Apache Cassandra
Jonathan Ellis / @spyced
Pain points for Java databases

✤   GC
✤   GC
✤   GC
Pain points for Java databases

✤   GC
✤   Platform-specific code
GC

✤   Concurrent and compacting: choose one
    ✤   G1
    ✤   Azul C4 / Zing?
Fragmentation

✤   Bloom filter arrays
✤   Compression offsets
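
Both bullets above are very large arrays, which need one contiguous block of heap each. One manual mitigation is to split such an array into fixed-size pages so that no single allocation is huge. The following is a minimal sketch of that idea, not Cassandra's actual code: the class name `PagedLongArray` and the page size are assumptions chosen for illustration.

```java
/** Hypothetical sketch: a long[] replacement backed by fixed-size pages. */
final class PagedLongArray {
    private static final int PAGE_SHIFT = 12;              // assumed page size: 4096 longs (32 KB)
    private static final int PAGE_SIZE = 1 << PAGE_SHIFT;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final long[][] pages;
    private final long length;

    PagedLongArray(long length) {
        if (length < 0) throw new IllegalArgumentException("negative length");
        this.length = length;
        int pageCount = (int) ((length + PAGE_SIZE - 1) >>> PAGE_SHIFT);
        // Each page is a separate small array; the last one may be partly unused.
        pages = new long[pageCount][PAGE_SIZE];
    }

    long get(long index) {
        checkIndex(index);
        return pages[(int) (index >>> PAGE_SHIFT)][(int) (index & PAGE_MASK)];
    }

    void set(long index, long value) {
        checkIndex(index);
        pages[(int) (index >>> PAGE_SHIFT)][(int) (index & PAGE_MASK)] = value;
    }

    long length() {
        return length;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

The trade-off is one extra indirection per access in exchange for allocations that fit into small free chunks of a fragmented heap.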
Automatic mitigation?

✤   http://www.research.ibm.com/people/d/dfb/papers/Bacon03Controlling.pdf
✤   http://researcher.ibm.com/files/us-hirzel/pldi10-arraylets.pdf
Fragmentation, 2

✤   Arena allocation for memtables
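
Arena allocation can be sketched as follows: instead of keeping each small value in its own heap object, values are copied into large fixed-size regions, so a memtable's data lives in a few big blocks that become garbage together. This is an illustrative, single-threaded sketch; the class name `SlabArena` and its method are hypothetical, not Cassandra's API.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a region-based (arena) allocator. Not thread-safe. */
final class SlabArena {
    private final int regionSize;
    private ByteBuffer current;   // region currently being filled

    SlabArena(int regionSize) {
        if (regionSize <= 0) throw new IllegalArgumentException("regionSize must be positive");
        this.regionSize = regionSize;
    }

    /** Copies value into the current region and returns a view of just those bytes. */
    ByteBuffer copyOf(byte[] value) {
        if (value.length > regionSize) {
            // Oversized values get their own array rather than a region.
            return ByteBuffer.wrap(value.clone());
        }
        if (current == null || current.remaining() < value.length) {
            current = ByteBuffer.allocate(regionSize);   // start a new region
        }
        ByteBuffer view = current.duplicate();
        view.limit(view.position() + value.length);
        ByteBuffer slice = view.slice();                 // shares the region's backing array
        current.put(value);                              // write the bytes, advance the region
        return slice;
    }
}
```

Dropping the arena (for example when a memtable is flushed) releases whole regions at once, which leaves larger contiguous holes than freeing many small objects would.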
(Memtables?)

✤   write( k1 , c1:v1 ) goes to both the Memtable (Memory) and the Commit log (Hard drive)

[Diagram sequence: memtable and commit log contents after each write]

✤   write( k1 , c1:v )
    ✤   Memtable: k1 c1:v
    ✤   Commit log: k1 c1:v
✤   write( k1 , c2:v )
    ✤   Memtable: k1 c1:v c2:v
    ✤   Commit log: k1 c1:v | k1 c2:v
✤   write( k2 , c1:v c2:v )
    ✤   Memtable: k1 c1:v c2:v | k2 c1:v c2:v
    ✤   Commit log: k1 c1:v | k1 c2:v | k2 c1:v c2:v
✤   write( k1 , c1:v c3:v )
    ✤   Memtable: k1 c1:v c2:v c3:v | k2 c1:v c2:v
    ✤   Commit log: k1 c1:v | k1 c2:v | k2 c1:v c2:v | k1 c1:v c3:v
✤   flush
    ✤   SSTable (Hard drive), with index: k1 c1:v c2:v c3:v | k2 c1:v c2:v
    ✤   cleanup
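
The write path in the diagrams can be sketched in a few lines: every write is appended to a log and merged into a sorted in-memory map, and a flush turns that map into an immutable table and discards the log entries it covered. This is a toy model under stated assumptions: `MiniStore` is a hypothetical name, and in-memory lists stand in for the on-disk commit log and SSTables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical toy model of the memtable / commit log / SSTable write path. */
final class MiniStore {
    // Stand-in for the append-only commit log on disk.
    final List<String> commitLog = new ArrayList<>();
    // Memtable: rows sorted by key, columns sorted by name.
    final TreeMap<String, TreeMap<String, String>> memtable = new TreeMap<>();
    // Stand-in for flushed SSTables on disk.
    final List<TreeMap<String, TreeMap<String, String>>> sstables = new ArrayList<>();

    void write(String key, Map<String, String> columns) {
        commitLog.add(key + " " + columns);                                   // 1. append to the log
        memtable.computeIfAbsent(key, k -> new TreeMap<>()).putAll(columns);  // 2. merge into the row
    }

    void flush() {
        sstables.add(new TreeMap<>(memtable));  // 3. write sorted rows out as one table
        memtable.clear();                       //    start a fresh memtable
        commitLog.clear();                      // 4. cleanup: the log entries are no longer needed
    }
}
```

Note how the log grows by one entry per write while the memtable holds only one merged row per key, which matches the k1 row in the diagrams.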
“Java is a memory hog”

✤   Large overhead for typical objects and collections
✤   How large?
✤   java.lang.instrument.Instrumentation

    ✤   JAMM: Java Agent for Memory Measurements
    ✤   https://github.com/jbellis/jamm
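
For context, this is roughly how a measurement agent gets at `java.lang.instrument.Instrumentation`: the JVM hands an instance to a `premain` method when started with `-javaagent`. The sketch below uses only the standard `premain` hook and `Instrumentation.getObjectSize`; the class name `ObjectSizer` is hypothetical and this is not JAMM's API. `getObjectSize` reports an implementation-specific, shallow size, so measuring a whole collection means walking its object graph.

```java
import java.lang.instrument.Instrumentation;

/**
 * Hypothetical sketch of a sizing agent. A real agent class would be declared
 * public and named in the agent jar's Premain-Class manifest attribute.
 */
final class ObjectSizer {
    private static volatile Instrumentation instrumentation;

    /** Called by the JVM before main() when started with -javaagent:<agent jar>. */
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    /** Shallow size of one object as estimated by the JVM; referenced objects are not included. */
    static long shallowSizeOf(Object o) {
        Instrumentation inst = instrumentation;
        if (inst == null) {
            throw new IllegalStateException("agent not loaded; start the JVM with -javaagent");
        }
        return inst.getObjectSize(o);
    }
}
```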
org.apache.cassandra.cache.SerializingCache


✤   Live objects are about 85% JVM bookkeeping
✤   org.apache.cassandra.cache.FreeableMemory using reference counting
✤   Considering doing reference-counted, off-heap memtables as well
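
Reference counting matters here because off-heap memory must be freed explicitly, yet a reader may still be using an entry when the cache evicts it. A minimal sketch of the protocol follows, assuming a hypothetical `RefCountedMemory` class; it is not the actual `FreeableMemory` code, and dropping the buffer reference stands in for a real native free.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of reference-counted off-heap memory. */
final class RefCountedMemory {
    private final AtomicInteger references = new AtomicInteger(1);  // the creator holds the first reference
    private volatile ByteBuffer memory;

    RefCountedMemory(int size) {
        memory = ByteBuffer.allocateDirect(size);
    }

    /** Takes a reference; returns false if the memory has already been freed. */
    boolean reference() {
        while (true) {
            int n = references.get();
            if (n <= 0) return false;                        // too late, already freed
            if (references.compareAndSet(n, n + 1)) return true;
        }
    }

    /** Releases one reference; the last release frees the memory. */
    void unreference() {
        if (references.decrementAndGet() == 0) {
            free();
        }
    }

    ByteBuffer buffer() {
        ByteBuffer b = memory;
        if (b == null) throw new IllegalStateException("memory already freed");
        return b;
    }

    boolean isFreed() {
        return memory == null;
    }

    private void free() {
        memory = null;   // stand-in: real code would release the native allocation here
    }
}
```

A reader calls `reference()` before touching the buffer and `unreference()` when done; eviction is just the cache dropping its own reference.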
Don’t forget about young gen

✤   Always stop-the-world for ~100ms
Platform-specific code

✤   OS
✤   JVM
m[un]map

✤   Log-structured storage wants to remove old files post-compaction;
    some platforms disallow deleting open files
✤   Old workaround (pre-1.0):
    ✤   use PhantomReference to tell when the mmap’d file has been GC’d
        (hence unmapped)
    ✤   Poor user experience and messy corner cases
✤   New workaround:
    ✤   Class.forName("sun.nio.ch.DirectBuffer").getMethod("cleaner")
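
Spelled out, the reflective workaround looks something like the sketch below. It relies on JDK-internal classes, so it is an illustration of the trick rather than a supported API: the helper name `Unmapper.tryClean` is hypothetical, and on newer JVMs that restrict reflective access to internals the call is expected to fail, which the sketch reports by returning false.

```java
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

/** Hypothetical sketch of the reflective "cleaner" workaround for unmapping buffers. */
final class Unmapper {
    /**
     * Tries to release a direct or mapped buffer immediately.
     * Returns true on success, false if the JVM does not allow it.
     * The buffer must never be touched again after a successful call.
     */
    static boolean tryClean(ByteBuffer buffer) {
        if (buffer == null || !buffer.isDirect()) {
            return false;   // heap buffers have nothing to unmap
        }
        try {
            Method cleanerMethod = Class.forName("sun.nio.ch.DirectBuffer").getMethod("cleaner");
            Object cleaner = cleanerMethod.invoke(buffer);
            if (cleaner == null) {
                return false;   // e.g. a slice or duplicate that does not own the mapping
            }
            cleaner.getClass().getMethod("clean").invoke(cleaner);
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return false;   // internal API missing or access denied on this JVM
        }
    }
}
```

Accessing a buffer after it has been cleaned can crash the JVM, which is exactly the segfault risk this approach trades for timely file deletion.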
mmap part 2

✤   2GB limit via ByteBuffer:
    public abstract byte get(int index)

✤   Workaround: MmappedSegmentedFile
    public Iterator<DataInput> iterator(long position)
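
The 2GB limit follows from ByteBuffer's `int` indexes, so addressing a larger file means splitting it into segments and translating a `long` position into a segment number plus an `int` offset. The sketch below shows only that arithmetic and is not the MmappedSegmentedFile implementation; `SegmentedBuffer` is a hypothetical name, and heap buffers stand in for the regions that `FileChannel.map` would return.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: long-addressed storage built from int-addressed segments. */
final class SegmentedBuffer {
    private final ByteBuffer[] segments;
    private final int segmentSize;
    private final long length;

    SegmentedBuffer(long length, int segmentSize) {
        if (length < 0 || segmentSize <= 0) throw new IllegalArgumentException("bad size");
        this.length = length;
        this.segmentSize = segmentSize;
        int count = (int) ((length + segmentSize - 1) / segmentSize);
        segments = new ByteBuffer[count];
        for (int i = 0; i < count; i++) {
            long remaining = length - (long) i * segmentSize;
            // With a real file, each segment would be one mapped region of at most 2GB.
            segments[i] = ByteBuffer.allocate((int) Math.min(segmentSize, remaining));
        }
    }

    byte get(long position) {
        checkPosition(position);
        return segments[(int) (position / segmentSize)].get((int) (position % segmentSize));
    }

    void put(long position, byte value) {
        checkPosition(position);
        segments[(int) (position / segmentSize)].put((int) (position % segmentSize), value);
    }

    int segmentCount() {
        return segments.length;
    }

    private void checkPosition(long position) {
        if (position < 0 || position >= length) {
            throw new IndexOutOfBoundsException("position " + position);
        }
    }
}
```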
link

✤   Used for snapshots
✤   Old workaround: JNA
✤   New workaround: supported directly by Java 7
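
With Java 7 the hard link is a single call to `java.nio.file.Files.createLink`. A minimal sketch of a link-based snapshot, assuming a hypothetical `SnapshotLinker` helper and directory layout: because the data file is never modified after it is written, the link is a cheap point-in-time copy that survives deletion of the original name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: snapshot an immutable data file by hard-linking it. */
final class SnapshotLinker {
    /** Creates snapshotDir if needed and hard-links dataFile into it under the same name. */
    static Path snapshot(Path dataFile, Path snapshotDir) throws IOException {
        Files.createDirectories(snapshotDir);
        Path link = snapshotDir.resolve(dataFile.getFileName());
        // Throws UnsupportedOperationException on file systems without hard links.
        return Files.createLink(link, dataFile);
    }
}
```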
mlockall

✤   swappiness: pissing off database developers since 2001 (?)
✤   mlockall(MCL_CURRENT)
Low-level i/o

✤   posix_fadvise
✤   mincore/fincore
✤   fcntl


✤   ... JNA
A plug for JNA

✤   https://github.com/twall/jna

    static {
        try {
            Native.register("c");
            ...

    private static native int mlockall(int flags)
        throws LastErrorException;
The fallacy of choosing portability over power

✤   Applets have been dead for years
✤   Python gets it right
    ✤   import readline
The fallacy of choosing safety over power

✤   Allowing munmap would expose developers to segfaults
✤   But, relying on the GC to clean up external resources is a
    well-known antipattern
    ✤   File.close
✤   We need munmap badly enough that we resort to
    unnatural and unportable code to get it
    ✤   You haven’t kept us from risking segfaults, you’ve just made us
        miserable
Compatibility through obscurity?

✤   sun.misc.Unsafe
✤   Used by high-profile libraries like high-scale-lib
... even public options




     http://blogs.oracle.com/dave/entry/false_sharing_induced_by_card
Too negative?
Still true

✤   "Many concurrent algorithms are very easy to write with a
    GC and totally hard (to down right impossible) using
    explicit free." -- Cliff Click

Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
