OS-caused Long JVM Pauses
- Deep Dive and Solutions
Zhenyun Zhuang
LinkedIn Corp., Mountain View, California, USA
https://www.linkedin.com/in/zhenyun
Zhenyun@gmail.com
Outline
 Introduction
 Background
 Scenario 1: startup state
 Scenario 2: steady state with memory pressure
 Scenario 3: steady state with heavy IO
 Lessons learned
Introduction
 Java + Linux
 Java is popular in production deployments
 Linux features interact with JVM operations
 Unique challenges caused by concurrent applications
 Long JVM pauses caused by Linux OS
 Production issues, in three scenarios
 Root causes
 Solutions
 References
 Ensuring High-performance of Mission-critical Java Applications in Multi-tenant Cloud Platforms, IEEE Cloud 2014
 Eliminating Large JVM GC Pauses Caused by Background IO Traffic, LinkedIn Engineering Blog, 2016 (Too many tweets bringing down a twitter server! :)
Background
 JVM and Heap
 Oracle HotSpot JVM
 Garbage collection
 Generations
 Garbage collectors
 Linux OS
 Paging (Regular page, Huge page)
 Swapping (Anonymous memory)
 Page cache writeback (Batched, Periodic)
Scenarios
 Three scenarios
 Startup state
 Steady state with memory pressure
 Steady state with heavy IO
 Workload
 Java application keeps allocating/de-allocating objects
 Background applications consuming memory or issuing disk IO
 Performance metrics
 Application throughput (K allocations/sec)
 Java GC pauses
Scenario 1: Startup State (App. Symptoms)
 When Java applications start
 Life is good in the beginning
 Then Java throughput drops sharply
 Java GC pauses spike during the same period
Scenario 1: Startup State (Investigations)
 Java heap is gradually allocated
 Without enough free memory, direct page scanning can happen
 Heap pages are swapped out and back in
 This causes long GC pauses
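One way to check for the two symptoms above (swapping and direct page scanning) is to read the kernel's cumulative counters in /proc/vmstat. The following is a minimal sketch, not the tooling used in the talk: the function name is hypothetical, and because counter names vary by kernel version (older kernels report pgscan_direct per memory zone), it sums every matching key.

```python
def memory_pressure_counters(vmstat_text):
    """Extract swap and direct-scan counters from the text of /proc/vmstat."""
    counters = {"pswpin": 0, "pswpout": 0, "pgscan_direct": 0}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip anything that is not "name value"
        key, value = parts[0], int(parts[1])
        if key in ("pswpin", "pswpout"):
            counters[key] = value
        elif key == "pgscan_direct" or (
            key.startswith("pgscan_direct_") and key != "pgscan_direct_throttle"
        ):
            # Older kernels expose one counter per zone (e.g. pgscan_direct_normal).
            counters["pgscan_direct"] += value
    return counters
```

The counters are cumulative since boot, so sampling them twice (for example, by passing in the contents of /proc/vmstat a few seconds apart) and subtracting shows whether pages were swapped or directly scanned during the interval.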
Solutions
 Pre-allocating JVM heap spaces
 JVM “-XX:+AlwaysPreTouch”
 Protecting JVM heap spaces from being swapped out
 swapoff command
 Swappiness
• =0 for kernel versions before 2.6.32-303
• =1 for kernel versions from 2.6.32-303 onward
 Cgroup
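The swappiness rule above depends on the kernel version, so it can be written as a small helper. This sketch assumes the "-303" suffix is the build number that follows the version in a kernel release string (as in "2.6.32-279.el6.x86_64"); the function names are hypothetical.

```python
import re


def parse_kernel_release(release):
    """Return (major, minor, patch, build) from e.g. '2.6.32-279.el6.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing build number is treated as 0 (an assumption of this sketch).
    return tuple(int(g) if g is not None else 0 for g in m.groups())


def recommended_swappiness(release):
    """Apply the rule from the slide: 0 before 2.6.32-303, 1 from then on."""
    return 0 if parse_kernel_release(release) < (2, 6, 32, 303) else 1
```

The release string can be obtained with `platform.release()` or `uname -r`; applying the value (for example through the `vm.swappiness` sysctl) requires root and is left out here.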
Evaluations (Pre-allocating Heap)
Evaluations (Protecting Heap)
Scenario 2: Steady State (App. Symptoms)
 During steady state of a Java application, system memory comes under pressure from other applications
 Java throughput drops sharply and remains poor
 Java GC pauses spike
Scenario 2: Steady State (Level-1 Investigations)
 During GC pauses, swapping activity persists
 Swapping in JVM pages causes GC pauses
 However, swapping alone does not explain the pauses
 Excessive GC pauses (e.g., 55 seconds)
 High sys-cpu usage (swapping is not sys-cpu intensive)
[Times: user=0.12 sys=54.67, real=54.83 secs]
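The clue in this investigation is the ratio of sys time to real time in the GC log. A minimal sketch of extracting it is shown below; it assumes the "[Times: ...]" format quoted above, and the function names are hypothetical.

```python
import re

# Matches the trailer HotSpot prints after a collection, as quoted on the slide.
TIMES_RE = re.compile(
    r"\[Times: user=(\d+\.\d+) sys=(\d+\.\d+), real=(\d+\.\d+) secs\]"
)


def parse_gc_times(line):
    """Return {'user', 'sys', 'real'} in seconds, or None if the line has no Times trailer."""
    m = TIMES_RE.search(line)
    if m is None:
        return None
    user, sys_time, real = (float(g) for g in m.groups())
    return {"user": user, "sys": sys_time, "real": real}


def sys_share(times):
    """Kernel CPU time relative to wall-clock pause time (0.0 if real is 0)."""
    return times["sys"] / times["real"] if times["real"] > 0 else 0.0
```

For the line on the slide, nearly all of the 54.83-second pause is kernel time. Because user and sys are summed across GC threads, the share can exceed 1.0 on multi-threaded collections.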
Scenario 2: Steady State (Level-2 Investigations)
 THP (Transparent Huge Pages)
 Improved TLB cache-hits
 Bi-directional operations
 THPs are allocated first, but split during memory pressure
 Regular pages are collapsed to make THPs
 CPU heavy, and thrashing!
[Figure: 4KB regular pages are collapsed into 2MB Transparent Huge Pages (THP); THPs are split back into 4KB regular pages]
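To see whether THP is active on a host, the kernel exposes the current mode in a sysfs file whose content looks like "always madvise [never]", with the active value in brackets. The sketch below parses that text; the function name is hypothetical, and the file path mentioned in the note is the mainline one (some distribution kernels use a different path).

```python
import re


def active_thp_mode(enabled_text):
    """Return the bracketed (active) value, e.g. 'never' from 'always madvise [never]'."""
    m = re.search(r"\[(\w+)\]", enabled_text)
    if m is None:
        raise ValueError(f"unexpected format: {enabled_text!r}")
    return m.group(1)
```

On mainline kernels the text would come from /sys/kernel/mm/transparent_hugepage/enabled.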
Solutions
 Dynamically adjusting THP
 Enable THP when no memory pressure
 Disable THP during memory pressure period
 Fine tuning of THP parameters
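The slides do not say how memory pressure is detected, so the following is only an illustration of the enable/disable decision. It assumes pressure means that available memory has fallen below a fraction of total memory; the 10% threshold and the function names are hypothetical.

```python
def parse_meminfo(meminfo_text):
    """Map field name -> numeric value (kB for most fields) from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def choose_thp_mode(meminfo_text, min_free_fraction=0.10):
    """Return 'never' under memory pressure, 'always' otherwise."""
    info = parse_meminfo(meminfo_text)
    # MemAvailable is absent on older kernels; fall back to MemFree.
    free_kb = info.get("MemAvailable", info["MemFree"])
    under_pressure = free_kb < min_free_fraction * info["MemTotal"]
    return "never" if under_pressure else "always"
```

A periodic job could write the returned value to the THP "enabled" sysfs file; that step needs root and is omitted from the sketch.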
Evaluations (Dynamic THP)
 Without memory pressure
 Dynamic THP delivers performance similar to THP-on

Mechanism                        THP Off   THP On   Dynamic THP
Throughput (K allocations/sec)   12        15       15

 With memory pressure
 Dynamic THP has some performance overhead
 Performance is lower than with THP off
 But better than with THP on

Mechanism                        THP Off   THP On   Dynamic THP
Throughput (K allocations/sec)   13        11       12
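The throughput numbers can be restated as relative changes against the THP-off baseline. The short calculation below assumes that the first row of results (12, 15, 15) is the case without memory pressure and the second (13, 11, 12) the case with memory pressure, which is what the bullets suggest.

```python
def relative_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


# Throughput in K allocations/sec, as reported on the slide.
no_pressure = {"THP Off": 12, "THP On": 15, "Dynamic THP": 15}
with_pressure = {"THP Off": 13, "THP On": 11, "Dynamic THP": 12}

for label, results in (("without pressure", no_pressure),
                       ("with pressure", with_pressure)):
    base = results["THP Off"]
    for mechanism in ("THP On", "Dynamic THP"):
        change = relative_change(results[mechanism], base)
        print(f"{label}: {mechanism} {change:+.1f}% vs THP Off")
```

Without pressure, both THP-on and dynamic THP gain 25% over THP-off; under pressure, THP-on loses about 15.4% while dynamic THP loses about 7.7%.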
Scenario 3: Steady State (Heavy IO)
 Production issue
 Online products
 Applications have light workload
 Both CMS and G1 garbage collectors
 Preliminary investigations
 Examined many layers/metrics
 The only suspect: disk IO is occasionally heavy
 But all application IO is asynchronous
Reproducing the problem
 Workload
 Simplified to avoid complex business logic
 https://github.com/zhenyun/JavaGCworkload
 Background IO
 Saturating HDD
Case I: Without background IO
No single pause longer than 200 ms
Case II: With background IO
Huge pause!
Investigations
Timeline
 At time 35.04 (line 2), a young GC starts and takes 0.12 seconds to complete.
 The young GC finishes at time 35.16, and the JVM tries to write the young GC statistics to the GC log file by issuing a write() system call (line 4).
 The write() call finishes at time 36.64 after being blocked for 1.47 seconds (line 5).
 When the write() call returns, the JVM records at time 36.64 an STW pause of 1.59 seconds (i.e., 0.12 + 1.47) (line 3).
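The timeline comes down to spotting a system call that took far longer than expected. Assuming the JVM's system calls were captured with a tracer that appends the time spent in each call in angle brackets (as `strace -T` does), a sketch like the following flags slow write() calls. The function name, the 0.2-second threshold, and the sample trace lines are hypothetical; they are not the trace shown in the talk.

```python
import re

# syscall(args) = return_value <seconds spent in the call>
SYSCALL_RE = re.compile(r"(\w+)\(.*\)\s+=\s+(-?\d+).*<(\d+\.\d+)>\s*$")


def slow_syscalls(trace_lines, name="write", threshold_secs=0.2):
    """Return (duration, line) for each call to `name` lasting at least the threshold."""
    slow = []
    for line in trace_lines:
        m = SYSCALL_RE.search(line)
        if m and m.group(1) == name and float(m.group(3)) >= threshold_secs:
            slow.append((float(m.group(3)), line.strip()))
    return slow


# Hypothetical trace lines in the assumed format.
sample_trace = [
    '35.040000 futex(0x7f00, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000012>',
    '35.160000 write(3, "GC stats", 8) = 8 <1.470000>',
    '36.640000 write(3, "more", 4) = 4 <0.000020>',
]
for duration, line in slow_syscalls(sample_trace):
    print(f"{duration:.2f}s  {line}")
```

Only the second sample line is reported, since it is the only write() that exceeds the threshold.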
Interaction between JVM and OS
Non-blocking IO can be blocked
 Stable page write
 For file-backed writes, the OS writes to the page cache first
 The OS has a write-back mechanism to persist dirty pages
 If a page is under write-back, the page is locked, so a write to it must wait
 Journal committing
 Journals are generated for journaling file systems
 When appending to GC log files requires new blocks, journals need to be committed
 The commit might need to wait
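A simple way to observe this effect is to time individual buffered appends to a file and look at the worst case. The sketch below does that for a temporary file; the function names, record size, and record count are hypothetical choices, and on an idle system the worst latency is expected to be small.

```python
import os
import tempfile
import time


def time_appends(path, count=100, payload=b"x" * 128):
    """Append `count` small records to `path`; return each write() latency in seconds."""
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for _ in range(count):
            start = time.perf_counter()
            # Buffered write: normally returns once the data is in the page cache.
            os.write(fd, payload)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def worst_append_latency(count=100):
    """Run the measurement against a throwaway file and return the slowest append."""
    with tempfile.TemporaryDirectory() as tmp:
        return max(time_appends(os.path.join(tmp, "gc.log"), count))
```

Running `time_appends` against a file on the disk that is being saturated by background IO, rather than a temporary directory, is what would expose the stalls described above.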
Background IO activities
 OS activity such as swapping
 Data is written to the underlying disks
 Administration and housekeeping software
 System-level software such as CFEngine also performs disk IO
 Other co-located applications
 Co-located applications that share the disk drives contend for IO
 IO of the same JVM instance
 The JVM instance itself may use disk IO in ways other than GC logging
Solutions
 Enhancing the JVM
 Doing GC logging in another thread
 Exposing JVM flags
 Reducing IO activities
 OS, other apps, same app
 Latency-sensitive applications
 Separate disk
 High-performing disks such as SSDs
 Tmpfs
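Before and after moving a GC log, it helps to confirm which file system the log path actually lives on. The sketch below resolves a path against the text of /proc/mounts by picking the longest matching mount point; the function name and the paths in the usage note are hypothetical, and mount points containing escaped spaces are not handled.

```python
import posixpath


def filesystem_for(path, mounts_text):
    """Return (mount_point, fs_type) of the longest mount point containing `path`."""
    path = posixpath.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # /proc/mounts columns: device, mount point, file system type, options, ...
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            # On ties, a later entry wins because it is mounted on top.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best
```

For example, a log at a hypothetical path such as /dev/shm/gc.log would typically resolve to a tmpfs mount, while a path under /var/log would resolve to whatever disk-backed file system holds it.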
Evaluation
 SSD as the disk
The good, the bad, and the ugly
 The good: low real time
 Low user time and low sys time
 [user=0.18 sys=0.01, real=0.04 secs]
 The bad: moderate (neither low nor high) real time
 High user time and low sys time
 [user=8.00 sys=0.02, real=0.50 secs]
 The ugly: high real time
 High sys time [user=0.02 sys=1.20, real=1.20 secs]
 Low sys time, low user time [Example? ]
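The three categories can be turned into a rough triage function over the user, sys, and real values of a GC log entry. The thresholds below are assumptions chosen for illustration (0.2 seconds echoes the 200 ms mark used earlier in the deck, 1.0 second is arbitrary), and the function name is hypothetical.

```python
def classify_pause(user, sys_time, real, low=0.2, high=1.0):
    """Label a GC pause using the good / bad / ugly scheme from the slide."""
    if real < low:
        return "good"
    if real < high:
        return "bad"
    # High real time: distinguish the two "ugly" sub-cases.
    if sys_time >= 0.5 * real:
        return "ugly: high sys time"
    if user + sys_time < 0.5 * real:
        return "ugly: low user and sys time"
    return "ugly"


# The three log entries quoted on the slide.
for user, sys_time, real in ((0.18, 0.01, 0.04), (8.00, 0.02, 0.50), (0.02, 1.20, 1.20)):
    print(user, sys_time, real, "->", classify_pause(user, sys_time, real))
```

A pause with high real time but little user or sys time means the GC threads spent most of the pause off the CPU, which is the pattern the blocked write() case produces.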
Lessons Learned (I)
 Be cautious about new features in Linux (and other OSes)
 Linux constantly incorporates new features to optimize performance
 Some features incur performance tradeoffs
 They may backfire in certain scenarios
Lessons Learned (II)
 Root causes can come from seemingly insignificant information
 Linux emits a significant amount of performance information
 Most of us, most of the time, examine only a small subset of it
 Don’t ignore the rest: understand the interactions of sub-components
Lessons Learned (III)
 Pay attention to multi-layer interactions
 Application protocol, JVM, OS, storage/networking
 Most people are familiar with only a few layers
 Optimizations done at one layer may adversely affect other layers
 Many performance problems are caused by cross-layer interactions
Zhenyun@gmail.com
