Even More Pigs and Snakes (Noch mehr Schweine und Schlangen)
New Tips for Performance Troubleshooting
Rainer Schuppe
AppDynamics GmbH
Rainer
Customer Support
System Support / Ops
Consultant / Dev
Solution Architect
Sales Engineer

+Rainer Schuppe
Reprise:
Why care about performance
Where to start? What to do? Who to blame?
Tooling
Usecases - Symptoms & Diagnostics
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen
failure everyday
Complexity increases
[Figure: architecture diagram of an application with the transactions Login, Search Flight, View Flight Status, and Make Reservation. Components: Browser Logic / AJAX / Web Frameworks, Tomcat, .NET, Weblogic, JBoss, ATG / Vignette / Sharepoint, an ESB (Mule, Tibco, AG), MQ, Memcached, Coherence, Oracle, SQL Server, Hadoop, Cassandra, and MongoDB, running on VMWare, Amazon EC2, and Windows Azure. Each tier has its own release sequence (Release 1.1 through Release 5.0). Labels: CLOUD, WEB 2.0, SOA, AGILE, BIG DATA.]
Generic Troubleshooting Process
• Alert / Detection
• Triage
• Diagnosis
• Rootcause Detection
• Solution Finding
• Fix
• Move on with life
Data / Information (shared input to the steps above)
Triage
• Determine who needs to fix it
• Starts with an overview and comparison to "normal" performance
• First level task (Operators)
• First indication of problem type
• Works best with transactional data
[Figure: the architecture diagram from "Complexity increases", annotated with response times per component ranging from 1 ms to 310 ms.]
[Figure: the architecture diagram from "Complexity increases" with one area labelled "Problem". The key marks components as bad or not bad.]
Diagnose
• Determine the root of the problem
• Uses first level information to narrow scope
• Needs specialists
• Needs lots of data / information, both real-time and historical
• Usually needs iterations
• More than 1 tool used in the process
Root cause detection
• Confirm the root cause after you have diagnosed it
• Document it
• Recreate it in test if possible
• Needs the same data as diagnostics
Solution finding
• Find a solution for the problem
• Architect a workaround or a fix
• Again needs the diagnostic data
• Run some tests with different options and check them in real time
• Confirm the idea for the fix
• May be a different team than the troubleshooters
How to get the data?
• Intuition
• Experience
• Tools
• Logfiles
• Communication
Tooling

© val-j - sxc.hu
3 Key Things Impact Performance & Availability
• Concurrency
• Data Volume
• Resource
Why do things crash and slow down?
[Figure: Concurrency, Data Volume, and Resource compared across Development, QA/Test, and Production.]
Technologies
Logging
ARM
Bytecode Instrumentation / Aspects
Sampling
JMX (Java Management Extensions)
PMI (IBM WebSphere specific)

Dev
Test
Prod
Logfiles
Dev / Test / Prod
Pros:
• Anything can be logged
• Easy to implement (if you have the source code)
Cons:
• Only what the developer thinks is needed
• I/O heavy
• No chance for change if you don't own the source code
• Lots of files, usually no TX context
• How to correlate in a distributed environment?
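One common answer to the correlation question is to tag every log line of a transaction with the same ID and pass that ID along to downstream tiers. The following is a minimal sketch under that assumption; the class `CorrelatedLog`, the `tx=` prefix, and the idea of receiving `incomingId` from the caller are illustrative and not taken from the slides.

```java
import java.util.UUID;
import java.util.logging.Logger;

/** Sketch: tag every log line of one transaction with the same ID. */
final class CorrelatedLog {
    private static final Logger LOG = Logger.getLogger("app");
    // Hypothetical per-thread transaction ID; not part of any logging API.
    private static final ThreadLocal<String> TX_ID = new ThreadLocal<>();

    /** Call at the entry point; reuse the caller's ID if one arrived with the request. */
    static String begin(String incomingId) {
        String id = (incomingId != null) ? incomingId : UUID.randomUUID().toString();
        TX_ID.set(id);
        return id;
    }

    /** Call when the transaction leaves this thread. */
    static void end() {
        TX_ID.remove();
    }

    /** Prefix the message so log lines from all tiers can be joined on the ID. */
    static String format(String message) {
        String id = TX_ID.get();
        return "[tx=" + (id == null ? "none" : id) + "] " + message;
    }

    static void info(String message) {
        LOG.info(format(message));
    }

    public static void main(String[] args) {
        begin("abc-123");
        info("Search Flight started");
        end();
    }
}
```

This only helps if every tier logs the ID and forwards it, which again requires access to the source code.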
Logfiles - 2
Logging can be the source of problems itself
e.g. Log4Net
• Synchronous local file system access
• The more you log the longer it takes
• Can only be diagnosed with another tool

Dev
Test
Prod
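To illustrate why synchronous file access in the request path hurts, here is a minimal sketch of the opposite design: the caller only enqueues the message and a background thread does the slow write. The class `AsyncLogSketch` is hypothetical, and an in-memory list stands in for the log file; it is not how Log4Net or any specific framework is implemented.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

final class AsyncLogSketch {
    private final BlockingQueue<String> queue;
    // Stands in for the log file in this sketch.
    private final List<String> sink = new CopyOnWriteArrayList<>();
    private final Thread writer;
    private volatile boolean running = true;

    AsyncLogSketch(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        writer = new Thread(this::drain, "log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    /** Never blocks the caller; returns false if the queue is full and the message was dropped. */
    boolean log(String message) {
        return queue.offer(message);
    }

    private void drain() {
        try {
            while (running || !queue.isEmpty()) {
                String message = queue.poll(50, TimeUnit.MILLISECONDS);
                if (message != null) {
                    sink.add(message); // the slow I/O would happen here
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops the writer after it has flushed the queue and returns what was written. */
    List<String> close() {
        running = false;
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sink;
    }
}
```

The trade-off is explicit: under overload this sketch drops messages instead of slowing down transactions.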
Bringing down production with Logging

Microsoft.Win32.Win32Native:SetFilePointerWin32
Logfiles - 4
[#|2013-04-16T16:04:44.319+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|
_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Store timer|#]
[#|2013-04-16T16:04:44.335+0200|INFO|sun-appserver2.1|com.appdynamics.TOP.SUMMARY.STATS.WRITE|
_ThreadID=14;_ThreadName=pool-1-thread-9;|START TIME for timer service(TopSummaryStatsWriterTimerTaskBean) will be: Tue
Apr 16 16:05:00 CEST 2013|#]
[#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|
_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Store timer|#]
[#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|
_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Purger timer|#]
[#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|
_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Purger timer|#]
[#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|
_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Detail String cache timer|#]
[#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|
_ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Detail String cache timer|#]
[#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean|
_ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats rollup timer|#]
Profiler
Dev / Test
Pros:
• No config needed
• Lots of data - lots of detail
Cons:
• Lots of data - not suitable for production
• Needs experience
• No transactional concept / context
JMX (and similar)
Dev / Test / Prod
Pros:
• Built into most application servers
• JConsole is part of the JDK
• Easy to implement MBeans
Cons:
• No transaction context
• Not available for 3rd party
• No historical data
• Usually one JVM only
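As an illustration of how little code a standard MBean needs, here is a minimal sketch that exposes two pool counters and reads one back through the platform MBean server. The names `PoolStats`, `PoolStatsMBean`, and the object name passed in are made up for this example; only the `javax.management` and `java.lang.management` calls are real JDK APIs.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class PoolMonitoringSketch {
    /** Standard MBean convention: the interface name is the class name plus "MBean". */
    public interface PoolStatsMBean {
        int getActive();
        int getMax();
    }

    /** Hypothetical pool counters; a real pool would update these itself. */
    public static final class PoolStats implements PoolStatsMBean {
        private final AtomicInteger active = new AtomicInteger();
        private final int max;

        PoolStats(int max) {
            this.max = max;
        }

        void borrow() {
            active.incrementAndGet();
        }

        @Override public int getActive() { return active.get(); }
        @Override public int getMax() { return max; }
    }

    /** Registers the bean, reads the "Active" attribute via JMX, then unregisters it. */
    static int registerAndRead(String name, PoolStats stats) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName objectName = new ObjectName(name);
            server.registerMBean(stats, objectName);
            try {
                return (Integer) server.getAttribute(objectName, "Active");
            } finally {
                server.unregisterMBean(objectName);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

While registered, the same attributes would be visible in JConsole. Note that the value is a point-in-time reading for one JVM, which matches the cons listed above.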
APM tools (free)
Dev / Test / Prod
Pros:
• They are free
• Transaction context (most of them)
• Quick setup (free editions of commercial tools)
Cons:
• Usually functionally constrained (free editions of commercial tools)
• Hard to configure (open source)
• Usually no history
APM tools (commercial)
Dev / Test / Prod
Pros:
• Transactions, historical data
• Distributed monitoring
• Deep dive diagnostics
• Production fit
Cons:
• Costly
• You have to choose the right one
Diagnosis
There are just 2 sorts of issues
© NLTeddy - sxc.hu
© ross666 - sxc.hu
50 shades of slow (approx.)
• Constantly slow (Turtle)
• Slowly, but constantly slower
• Exponentially slower
• Suddenly slower
• Sporadically slow
• Spontaneous crash
The wonderful world of errors
• Sudden outage
• Always erroneous
• Sporadic error messages
• Silent death / Bleed to death
• Increasing error rates
• Wrong / meaningless error messages
Diagnosis - Rough Flow
• Look at symptoms
• Eliminate definite non-causes
• Prioritize the suspicions
• Confirm suspicion / Eliminate suspicion
  • Compare with "normal"
  • Gather more information
  • Define root cause and confirm it
  • Redo from start
Possible Causes (in no particular order)
• Bad Coding
• Too much load
• Backend not reachable / slow
• Conflicting resources
• Memory Leak
• Resource Leak
• Network / Hardware Problem
Example
• Resource Contention
• Exceptions
• Load Issues
Symptoms
• User complaints - Slow performance
• Exceptions appeared in logfiles
• Alerts for ops triggered
Average Response Time (ms)
[Chart: average response time from 10:01 to 10:29, y-axis 0 to 11.000 ms, with a "Connection Timeout" annotation.]
Connection Pool vs. Errors
[Chart: connection pool usage vs. errors from 10:00 to 10:30, y-axis 0 to 15,00, annotated with the log message below.]
org.hibernate.util.JDBCExceptionReporter : Cannot get a connection, pool error Timeout waiting for idle object
1st Diagnosis
• OK - We do have a problem
• Database connection pool depleted
• Waiting times stacking
• 10 minutes until errors appear in logs
• But WHY?
Open Questions
• Which database?
• Which DB Pool?
• Transaction specific?
• Problem on DB?
• What is the load?
How to find data
• Check log for DB connection info
• Ask architect which TX are using this pool
• Use JMX to check pool metrics
• Check load info (if available)
Who else is using the DB?
Ask your architect!
Who else is using the DB?
Ask your tool!
What did we find out?
• There were other TX's using the DB
• This TX had a specific DB connection pool
• It was just a single transaction with the DB problem

OK - Now let's check the load
Load and DB connections
[Chart: Trx/min, Avg RT, Pool Limit, Pool Usage, and Trx Stalls over time.]

Why the sudden load increase?
Root Cause
• Loadbalancer was not working correctly
• Many different pools made this config necessary
• DB connection pool size was not appropriate for this load
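Whether a pool size is appropriate for a given load can be estimated with Little's Law: the average number of connections in use equals the arrival rate multiplied by the average time a transaction holds a connection. The sketch below applies that formula; the class name, the headroom factor, and all numbers are hypothetical and not taken from the incident described above.

```java
final class PoolSizingSketch {
    /**
     * Little's Law: connections in use = arrival rate x hold time.
     * The result is scaled by a headroom factor (an assumption) to absorb bursts.
     */
    static int connectionsNeeded(double txPerMinute, double holdMillis, double headroom) {
        double txPerSecond = txPerMinute / 60.0;
        double inUse = txPerSecond * (holdMillis / 1000.0);
        return (int) Math.ceil(inUse * headroom);
    }

    public static void main(String[] args) {
        // Hypothetical load: 1200 tx/min, each holding a connection for 500 ms.
        System.out.println(connectionsNeeded(1200, 500, 1.5));
        // The same transaction after the load doubles.
        System.out.println(connectionsNeeded(2400, 500, 1.5));
    }
}
```

With these example numbers the estimate doubles from 15 to 30 connections when the load doubles, which is why a pool sized for balanced traffic can run dry once a load balancer sends too much traffic to one node.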
The missing link
[Figure: the architecture diagram from "Complexity increases", extended on the client side with Browser(s), a native mobile app (MOBILE), and the network in front of the backend tiers. Transactions shown: Login, Search Flight, Flight Status, Purchase, Make Reservation.]
End-user monitoring helps complete the picture
A quick look at memory and resources
Linear Memory Leak
Symptoms:
• OOM (Out of memory error)
• Slow over time with spikes
• Sawtooth with upward trend
Causes:
• Objects added to linear structures without being removed (e.g., linked lists)
• Other API misuse (addListener() without corresponding removeListener(), etc.)
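The addListener() / removeListener() case can be sketched in a few lines. `Publisher` and `Listener` are hypothetical types invented for this example; the point is only that a long-lived object keeps every registered listener reachable until it is explicitly removed.

```java
import java.util.ArrayList;
import java.util.List;

final class ListenerLeakSketch {
    interface Listener {
        void onEvent(String event);
    }

    /** Long-lived publisher: everything in this list stays reachable. */
    static final class Publisher {
        private final List<Listener> listeners = new ArrayList<>();

        void addListener(Listener l) { listeners.add(l); }
        void removeListener(Listener l) { listeners.remove(l); }
        int listenerCount() { return listeners.size(); }
    }

    /** Leaky: registers a listener per request and never removes it. */
    static void handleRequestLeaky(Publisher p) {
        p.addListener(event -> { });
    }

    /** Fixed: the registration is paired with a removal in finally. */
    static void handleRequestFixed(Publisher p) {
        Listener l = event -> { };
        p.addListener(l);
        try {
            // ... do the work that needs the listener ...
        } finally {
            p.removeListener(l);
        }
    }
}
```

Calling the leaky variant once per request grows the list by one entry each time, which produces the linear heap growth described above.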
Linear Memory Leak
Aggregate detection:
• Linear growth in heap utilization
• GC time growth
Specific detection:
• Figure out object types being leaked
• Verbose GC
• Find related APIs and search code for misuse
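Linear growth in heap utilization can be checked numerically by fitting a straight line through periodic heap samples: a clearly positive slope over a long window suggests a leak even when the curve is a sawtooth. This is a minimal sketch; the class name and the sample values in the test data are made up, and in practice samples taken right after a garbage collection give a cleaner trend.

```java
import java.lang.management.ManagementFactory;

final class HeapTrendSketch {
    /** Least-squares slope of y over x, e.g. bytes per second if x is in seconds. */
    static double slope(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXX += x[i] * x[i];
            sumXY += x[i] * y[i];
        }
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    /** One sample of current heap usage from the JVM's own MXBean. */
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }
}
```

A scheduler would call `usedHeapBytes()` at a fixed interval and feed the collected samples into `slope`.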
Linear Memory Leak
Challenges:
• References - many small objects are referenced in one collection
• Death by 1000 cuts (paper cuts)

Specific detection
Heap Dump Comparison
• Needs at least 2 dumps
• Stops the JVM
• Can take several minutes each
• Creates tons of data
• Finds the object, not the code responsible for the leak
Profiler
• High overhead - not for production
• Lots of data
APM Solution
• Collection based algorithm - finds only collection leaks
• Instance counting
• Trade off between low overhead and usefulness of data
Exponential Memory Leak
Causes:
• Objects added to most data structures without being removed (e.g., vectors, hashtables)
• Other API misuse (as Linear Leak)
Aggregate detection:
• Exponential growth in heap
Specific detection:
• Same as Linear Leak
Resource Leak
Causes:
• API misuse of Java objects with a resource-style lifecycle (create -> use -> destroy)
Aggregate detection:
• Slow over time
• Growth in heap (if you're lucky)
Specific detection:
• Audit code for API misuses
• Object instance tracking
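A typical misuse of the create -> use -> destroy lifecycle is a destroy step that is skipped when the use step throws. The sketch below contrasts that with try-with-resources; `Handle` is a hypothetical resource and the `OPEN` counter exists only to make the leak visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ResourceLifecycleSketch {
    /** Number of handles created but not yet closed. */
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical resource with a create -> use -> destroy lifecycle. */
    static final class Handle implements AutoCloseable {
        Handle() {
            OPEN.incrementAndGet();
        }

        String read(boolean fail) {
            if (fail) throw new IllegalStateException("backend error");
            return "data";
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Leaky: close() is skipped whenever read() throws. */
    static String readLeaky(boolean fail) {
        Handle h = new Handle();
        String result = h.read(fail);
        h.close();
        return result;
    }

    /** Safe: try-with-resources closes the handle on every path. */
    static String readSafe(boolean fail) {
        try (Handle h = new Handle()) {
            return h.read(fail);
        }
    }
}
```

The leaky variant only loses a handle on the error path, which is one reason such leaks tend to show up slowly over time.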
Resource conflict (block / wait)
Resource conflict / blocking
Causes:
• Overcautious data integrity strategy
• Belief that "synchronising is always good"
Aggregate detection:
• Stalled threads
• High thread usage - low CPU usage
Specific detection:
• Thread dumps as needed
• Stack traces / graphs
• CPU block / wait timing measurement
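A thread dump does not have to be taken by hand. As a minimal sketch, the JDK's `ThreadMXBean` can list the threads that are currently blocked on a monitor together with the thread that holds it; the class name `BlockedThreadsSketch` and the output format are assumptions made for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

final class BlockedThreadsSketch {
    /** Threads currently BLOCKED on a monitor, formatted as "blocked <- lock owner". */
    static List<String> blockedThreads() {
        List<String> result = new ArrayList<>();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                result.add(info.getThreadName() + " <- " + info.getLockOwnerName());
            }
        }
        return result;
    }
}
```

Sampling this repeatedly and counting how often the same owner appears points at the lock that causes high thread usage with low CPU usage.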
Questions?
Bad Coding: Infinite Loop
Causes:
• Infinite loop in code
Aggregate detection:
• Stalled threads
• Permanently high usage of CPU / threads
Specific detection:
• Thread dumps as needed
• Stack traces / graphs
Bad Coding: CPU-Bound Component
Causes:
• Idiot with a "Learn Java in 24 Hours" book
Aggregate Detection:
• Response time measurement
• Aggregate CPU utilization
Specific Detection:
• Detailed CPU utilization
Typical Cure:
• Cache of data or of performed calculations
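The typical cure, caching performed calculations, can be sketched with a concurrent map. The sum-of-squares loop is only a stand-in for a genuinely expensive calculation, and the class name and the `computations` counter are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class CalculationCacheSketch {
    private final Map<Integer, Long> cache = new ConcurrentHashMap<>();
    /** Counts how often the expensive path actually ran. */
    final AtomicInteger computations = new AtomicInteger();

    /** Stand-in for an expensive, CPU-bound calculation. */
    private long expensive(int n) {
        computations.incrementAndGet();
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    /** Computes each distinct input once and serves repeats from the cache. */
    long sumOfSquares(int n) {
        return cache.computeIfAbsent(n, this::expensive);
    }
}
```

An unbounded cache like this one is itself a collection that only grows, so a real implementation would need a size limit or eviction policy to avoid trading a CPU problem for a memory leak.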
Layer-itis
Causes:
• Poorly implemented data bridge layer, or simply too many of them
• DB -> XML -> XSLT -> More XML -> "Custom Data Management Layer" -> Consumer
Aggregate Detection:
• Response time measurements
Specific Detection:
• Call graphs - call trace (stack trace not enough)
• Ask for a design or architecture document
O/R Mapper misuse
Causes:
• Belief that "Hibernate fixes everything"
• Massive SQL statements (length and amount)
• Wrong data strategy
Aggregate Detection:
• Response time measurements
• DB time measurements
Specific Detection:
• Call stacks / snapshots
• Caching issues
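A frequent source of a massive amount of SQL statements is loading related rows one at a time. The sketch below only counts simulated round trips and uses no O/R mapper at all; `FakeDb` and its methods are invented for illustration and do not correspond to any Hibernate API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class QueryCountSketch {
    /** Hypothetical data-access stub that only counts round trips. */
    static final class FakeDb {
        int queries = 0;

        List<Integer> loadOrderCustomerIds(int orders) {
            queries++;
            List<Integer> ids = new ArrayList<>();
            for (int i = 0; i < orders; i++) {
                ids.add(i % 10); // 10 distinct customers
            }
            return ids;
        }

        String loadCustomer(int id) {
            queries++;
            return "customer-" + id;
        }

        List<String> loadCustomers(Set<Integer> ids) {
            queries++;
            List<String> out = new ArrayList<>();
            for (int id : ids) {
                out.add("customer-" + id);
            }
            return out;
        }
    }

    /** One query for the orders, then one more per order for its customer. */
    static int lazyPerRow(FakeDb db, int orders) {
        for (int id : db.loadOrderCustomerIds(orders)) {
            db.loadCustomer(id);
        }
        return db.queries;
    }

    /** One query for the orders, one for all distinct customers. */
    static int batched(FakeDb db, int orders) {
        db.loadCustomers(new LinkedHashSet<>(db.loadOrderCustomerIds(orders)));
        return db.queries;
    }
}
```

For 100 orders the per-row variant issues 101 queries and the batched variant 2, which is the kind of difference that shows up in DB time measurements and call snapshots.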
The Unending Retry
Causes:
• Continual attempts to call backend + unavailable backend
Aggregate Detection / Specific Detection:
• Response time measurement
• Backend detection - measurement (time & # of calls)
• Stalled TX count
• Exceptions
• Busy thread count
Don't forget about thrown exceptions
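One way to avoid an unending retry is to bound the number of attempts and to wait longer between them. The following is a minimal sketch under that assumption; the class name, the doubling backoff, and the parameter names are illustrative choices rather than anything prescribed by the slides.

```java
import java.util.function.Supplier;

final class BoundedRetrySketch {
    /** Delay before retry n (1-based): base * 2^(n-1), capped at capMillis. */
    static long backoffMillis(int retry, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(retry - 1, 20);
        return Math.min(delay, capMillis);
    }

    /** Tries at most maxAttempts times, then gives up and rethrows the last failure. */
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long baseMillis, long capMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // keep the exception instead of swallowing it
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(backoffMillis(attempt, baseMillis, capMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last;
    }
}
```

Rethrowing the last failure keeps the exception visible to monitoring, in line with the reminder above about thrown exceptions.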
Threading: Deadlock / Livelock
Causes:
• Fundamental error in threading / lock acquisition strategy
Aggregate Detection:
• Stalled threads / permanently high concurrent usage
Specific Detection:
• Deadlock detection in JVM
• Thread dumps
• Busy thread count
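The JVM's built-in deadlock detection is also available programmatically. This minimal sketch asks the `ThreadMXBean` for deadlocked threads and returns their names; the class name is hypothetical, and `findDeadlockedThreads()` may throw `UnsupportedOperationException` on JVMs that do not support monitoring of ownable synchronizers.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

final class DeadlockCheckSketch {
    /** Names of threads the JVM reports as deadlocked; empty if none. */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock exists
        List<String> names = new ArrayList<>();
        if (ids == null) {
            return names;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) {
                names.add(info.getThreadName());
            }
        }
        return names;
    }
}
```

A monitoring thread could call this periodically and raise an alert when the list is not empty. Livelocks are not detected this way because the threads involved are not blocked.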
Threading: Deadlock
Found one Java-level deadlock:
=============================
"Thread-2":
  waiting to lock monitor 102054308 (object 7f3113800, a java.lang.Object),
  which is held by "Thread-1"
"Thread-1":
  waiting to lock monitor 1020348b8 (object 7f3113810, a java.lang.Object),
  which is held by "Thread-2"

Java stack information for the threads listed above:
===================================================
"Thread-2":
    at DeadlockTest$2.run(DeadlockTest.java:42)
    - waiting to lock <7f3113800> (a java.lang.Object)
    - locked <7f3113810> (a java.lang.Object)
    at java.lang.Thread.run(Thread.java:680)
"Thread-1":
    at DeadlockTest$1.run(DeadlockTest.java:26)
    - waiting to lock <7f3113810> (a java.lang.Object)
    - locked <7f3113800> (a java.lang.Object)
    at java.lang.Thread.run(Thread.java:680)
Threading: Chokepoint
Causes:
• Many threads bottlenecked waiting for one lock
Aggregate Detection:
• Stalled threads / high concurrent usage
• Exponential slowness
• Low CPU usage
Specific Detection:
• Request response time monitoring
• CPU block / wait timing
Internal Resource Bottleneck
Causes:
• Overusage of internal resource (threads, database connections, etc.)
• Underallocation of same
Aggregate Detection:
• Stalled threads / high concurrent usage
• Call rate and average response time of internal resource
Specific Detection:
• Also compare with methods from Resource Leak, External Bottleneck, and Overusage of External System
External Bottleneck
Causes:
• External system (database, authentication server) is slow
• Compare with Overusage of External System
Aggregate Detection:
• Response time on backend calls
• Exceptions
Specific Detection:
• Callgraphs
• Specific monitoring on those backends
Commit happy
Production ground to a halt for 2 hours, and again the next day
[Chart: Trx/min, Avg RT, Pool Limit, Pool Usage, and Trx Stalls over time.]
Overusage of External System
Causes:
• Poor design or tuning of interaction with backend system (e.g., join between two million-row tables for each user logon)
• O/R mapper misconfiguration
Aggregate Detection:
• Response time measurement
Specific Detection:
• Timing on backend systems
• Also need tools for those backend systems
excessive database access
query too much data
Applications will become more Complex and Change Faster
[Figure: evolution from monolithic to distributed applications. Browser(s) and a native mobile app (Purchase, Search Flight, Flight Status, Login) reach, via network and CDN, Apache and .NET, WebLogic, JBoss, Tomcat, and PHP services, 3rd party web services, and ESB/MQ, backed by Oracle, Sybase, MySQL, DB2, SQL Server, PostgreSQL, Cassandra, Hadoop, and Memcached, on VMWare (private) and Amazon EC2 (public). Labels: AGILE, WEB 2.0, SOA, NOSQL, MOBILE, CLOUD, BIG DATA.]

Copyright © 2013 AppDynamics. All rights reserved.
• One interesting problem occurs when the size of transactions with backend systems needs to be tuned
• Can be intertwined with / exacerbated by Layer-itis and Overusage of External System

Many small requests
• System constantly wastes resources dispatching / unmarshalling many transactions and results
• "Death by a thousand cuts"

"Just Right"

One HUGE request
• System periodically slows to a crawl as many resources get thrown at a large chunk of work
• "Pig in a Python"
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und Schlangen

  • 1. Noch mehr Schweine und Schlangen (Even More Pigs and Snakes): New Tips for Performance Troubleshooting. Rainer Schuppe, AppDynamics GmbH
  • 2. Rainer: Customer Support • System Support / Ops • Consultant / Dev • Solution Architect • Sales Engineer • +Rainer Schuppe
  • 3. Reprise: • Why care about performance • Where to start? What to do? Who to blame? • Tooling • Use cases - Symptoms & Diagnostics
  • 6. Complexity increases [Diagram: application landscape for the business transactions Login, Search Flight, View Flight Status and Make Reservation. Tiers shown: Browser Logic / AJAX / Web Frameworks, Tomcat, Weblogic, JBoss, .NET, ESB (Mule, Tibco, AG), MQ, Memcached, Coherence, Oracle, SQL Server, Hadoop, Cassandra, MongoDB, ATG / Vignette / Sharepoint, VMWare, Amazon EC2, Windows Azure. Each tier carries its own release history (Release 1.1 through Release 5.0). Trend labels: CLOUD, WEB 2.0, SOA, AGILE, BIG DATA]
  • 7. Generic Troubleshooting Process: Alert / Detection • Triage • Diagnosis • Rootcause Detection • Solution Finding • Fix • Move on with life (Data / Information feeds the process)
  • 8. Triage • Determine who needs to fix it • Starts with overview and comparison to "normal" performance • First level task (Operators) • First indication of problem type • Works best with transactional data
  • 9. [Diagram: the application landscape from slide 6, annotated with per-tier response times ranging from 1 ms to 310 ms]
  • 10. [Diagram: the application landscape from slide 6, with one tier marked "Problem"]
  • 12. Diagnose • Determine the root of the problem • Uses first level information to narrow scope • Needs specialists • Lots of data / information needed in real time and historical • Usually needs iterations • More than 1 tool used in the process
  • 13. Root cause detection • Confirm the root cause after you have diagnosed it • Document it • Recreate it in test if possible • Needs the same data as diagnostics
  • 14. Solution finding • Find a solution for the problem • Architect a workaround or a fix • Again needs the diagnostic data • Run tests with different options and check them in real time • Confirm the idea for the fix • May be a different team than the troubleshooters
  • 15. How to get the data? • Intuition • Experience • Tools • Logfiles • Communication
  • 17. 3 Key Things Impact Performance & Availability: • Concurrency • Data Volume • Resource
  • 18. Why do things crash and slow down? [Diagram: Concurrency, Data Volume and Resource shown for each stage: Development, QA/Test, Production]
  • 19. Technologies: • Logging • ARM • Bytecode Instrumentation / Aspects • Sampling • JMX (Java Management Extensions) • PMI (IBM WebSphere specific) (Stages: Dev, Test, Prod)
  • 20. Logfiles (Stages: Dev, Test, Prod)
      Pros: • Anything can be logged • Easy to implement (if you have the source code)
      Cons: • Only what the developer thinks is needed • I/O heavy • No chance for change if you don't own the source code • Lots of files - usually no TX context • How to correlate in a distributed environment?
  • 21. Logfiles - 2 (Stages: Dev, Test, Prod) Logging can be the source of problems itself, e.g. Log4Net • Synchronous local file system access • The more you log the longer it takes • Can only be diagnosed with another tool
  • 22. Bringing down production with Logging Microsoft.Win32.Win32Native:SetFilePointerWin32
  • 23. Logfiles - 4 [#|2013-04-16T16:04:44.319+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean| _ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Store timer|#] [#|2013-04-16T16:04:44.335+0200|INFO|sun-appserver2.1|com.appdynamics.TOP.SUMMARY.STATS.WRITE| _ThreadID=14;_ThreadName=pool-1-thread-9;|START TIME for timer service(TopSummaryStatsWriterTimerTaskBean) will be: Tue Apr 16 16:05:00 CEST 2013|#] [#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean| _ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Store timer|#] [#|2013-04-16T16:04:44.338+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean| _ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Data Purger timer|#] [#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean| _ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Data Purger timer|#] [#|2013-04-16T16:04:44.369+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean| _ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats Detail String cache timer|#] [#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean| _ThreadID=14;_ThreadName=pool-1-thread-9;|Successfully initialized the Top Summary Stats Detail String cache timer|#] [#|2013-04-16T16:04:44.376+0200|INFO|sun-appserver2.1|com.singularity.ee.controller.beans.ControllerManagerBean| _ThreadID=14;_ThreadName=pool-1-thread-9;|Starting to initialize the Top Summary Stats rollup timer|#]
  • 24. Profiler (Stages: Dev, Test)
      Pros: • No config needed • Lots of data - lots of detail
      Cons: • Lots of data - not suitable for production • Needs experience • No transactional concept / context
  • 26. JMX (and similar) (Stages: Dev, Test, Prod)
      Pros: • Built into most application servers • JConsole is part of the JDK • Easy to implement MBeans
      Cons: • No transaction context • Not available for 3rd party • No historical data • Usually one JVM only
  • 28. APM tools (free) (Stages: Dev, Test, Prod)
      Pros: • They are free • Transaction context (most of them) • Quick setup (the commercial ones)
      Cons: • Usually functionally constrained (commercial) • Hard to configure (open source) • Usually no history
  • 29. APM tools (commercial) (Stages: Dev, Test, Prod)
      Pros: • Transactions, Historical data • Distributed monitoring • Deep dive diagnostics • Production fit
      Cons: • Costly • Choose the right one
  • 30. Diagnosis There are just 2 sorts of issues
  • 31. © NLTeddy - sxc.hu
  • 32. © ross666 - sxc.hu
  • 33. 50 shades of slow (approx.) • Constantly slow (Turtle) • Slowly, but constantly slower • Exponentially slower • Suddenly slower • Sporadically slow • Spontaneous crash
  • 34. The wonderful world of errors • Sudden outage • Always erroneous • Sporadic error messages • Silent death / Bleed to death • Increasing error rates • Wrong / meaningless error messages
  • 35. Diagnosis - Rough Flow • Look at symptoms • Eliminate definite non-causes • Prioritize the suspicions • Confirm suspicion / Eliminate suspicion • Compare with "normal" • Gather more information • Define root cause and confirm it • Redo from Start
  • 36. Possible Causes (in no particular order) • Bad Coding • Too much load • Backend not reachable / slow • Conflicting resources • Memory Leak • Resource Leak • Network / Hardware Problem
  • 38. Symptoms • User complaints - Slow performance • Exceptions appeared in logfiles • Alerts for ops triggered
  • 39. [Chart: Average Response Time (ms) over time, 10:01 to 10:29, y-axis 0 to 11,000 ms, annotated "Connection Timeout"]
  • 40. Connection Pool vs. Errors [Chart: connection pool usage (y-axis 0 to 15) and errors over time, 10:00 to 10:30, annotated with the log message "org.hibernate.util.JDBCExceptionReporter : Cannot get a connection, pool error Timeout waiting for idle object"]
  • 41. 1st Diagnosis • OK - We do have a problem • Database connection pool depleted • Waiting times stacking • 10 minutes until errors appear in logs • But WHY?
  • 42. Open Questions • Which database? • Which DB Pool? • Transaction specific? • Problem on DB? • What is the load?
  • 43. How to find data • Check log for DB connection info • Ask architect which TX are using this pool • Use JMX to check pool metrics • Check load info (if available)
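The "Use JMX to check pool metrics" step can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the `DemoPool` MBean, its `NumActive` and `MaxTotal` attributes, and the `demo:type=ConnectionPool` object name are hypothetical stand-ins. A real connection pool publishes its own object name and attribute names, which you would look up in JConsole first.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class PoolMetricsCheck {

    // Hypothetical management interface; real pools expose their own attribute names.
    public interface DemoPoolMBean {
        int getNumActive();
        int getMaxTotal();
    }

    // Stand-in for a real connection pool so that the sketch is self-contained.
    public static class DemoPool implements DemoPoolMBean {
        private final int active;
        private final int max;

        public DemoPool(int active, int max) {
            this.active = active;
            this.max = max;
        }

        @Override public int getNumActive() { return active; }
        @Override public int getMaxTotal() { return max; }
    }

    /** Registers the stand-in pool on the platform MBean server under a hypothetical name. */
    public static ObjectName register(String poolName, DemoPool pool) {
        try {
            ObjectName name = new ObjectName("demo:type=ConnectionPool,name=" + poolName);
            ManagementFactory.getPlatformMBeanServer().registerMBean(pool, name);
            return name;
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the two attributes over JMX and returns pool utilization between 0.0 and 1.0. */
    public static double utilization(ObjectName name) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            int active = (Integer) server.getAttribute(name, "NumActive");
            int max = (Integer) server.getAttribute(name, "MaxTotal");
            return max == 0 ? 0.0 : (double) active / max;
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ObjectName name = register("orders", new DemoPool(14, 15));
        System.out.printf("pool utilization: %.0f%%%n", utilization(name) * 100);
    }
}
```

Polling such a value periodically gives the "Pool Usage" versus "Pool Limit" comparison that the later load chart relies on, although plain JMX keeps no history by itself.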
  • 46. Who else is using the DB? Ask your architect!
  • 47. Who else is using the DB? Ask your tool!
  • 48. What did we find out? • There were other TXs using the DB • This TX had a specific DB connection pool • It was just a single transaction with the DB problem • OK - Now let's check the load
  • 49. Load and DB connections [Chart series: Trx/min, Avg RT, Pool Limit, Pool Usage, Trx Stalls] Why the sudden load increase?
  • 50. Root Cause • Load balancer was not working correctly • Many different pools made this config necessary • DB connection pool size was not appropriate for this load
  • 51. The missing link [Diagram: the application landscape from slide 6, extended on the client side with Browser(s), Native Mobile App (MOBILE) and Network, and the business transactions Login, Search Flight, Flight Status, Purchase]
  • 54. A quick look at memory and resources
  • 55. Linear Memory Leak
      Symptoms: • OOM (Out of memory error) • Slow over time with spikes • Sawtooth with upward trend
      Causes: • Objects added to linear structures without being removed (e.g., linked lists) • Other API misuse (addListener() without corresponding removeListener(), etc.)
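The addListener() without removeListener() cause is easy to reproduce in a few lines. The following is a minimal sketch, not code from the talk: `EventSource`, `handleRequestLeaky` and `handleRequestFixed` are hypothetical names, and the 1 KB payload only stands in for whatever state a real listener would hold on to.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerLeakDemo {

    /** Hypothetical event source that keeps a strong reference to every listener. */
    public static class EventSource {
        private final List<Runnable> listeners = new ArrayList<>();

        public void addListener(Runnable listener) { listeners.add(listener); }
        public void removeListener(Runnable listener) { listeners.remove(listener); }
        public int listenerCount() { return listeners.size(); }
    }

    private static void consume(byte[] payload) {
        // placeholder for real listener work
    }

    /** Leaky: one listener (and its captured payload) is retained per request. */
    public static void handleRequestLeaky(EventSource source) {
        byte[] payload = new byte[1024];
        Runnable listener = () -> consume(payload);
        source.addListener(listener);
        // ... request work; the listener is never removed
    }

    /** Fixed: the listener is removed on every path, so nothing accumulates. */
    public static void handleRequestFixed(EventSource source) {
        byte[] payload = new byte[1024];
        Runnable listener = () -> consume(payload);
        source.addListener(listener);
        try {
            // ... request work
        } finally {
            source.removeListener(listener);
        }
    }

    public static void main(String[] args) {
        EventSource leaky = new EventSource();
        EventSource fixed = new EventSource();
        for (int i = 0; i < 1000; i++) {
            handleRequestLeaky(leaky);
            handleRequestFixed(fixed);
        }
        System.out.println("leaky listeners: " + leaky.listenerCount());
        System.out.println("fixed listeners: " + fixed.listenerCount());
    }
}
```

Because the retained amount grows by a constant per request, heap usage after each garbage collection rises roughly linearly with load, which matches the sawtooth-with-upward-trend symptom.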
  • 56. Linear Memory Leak
      Aggregate detection: • Linear growth in heap utilization • GC time growth
      Specific detection: • Figure out object types being leaked • Verbose GC • Find related APIs and search code for misuse
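One way to make "linear growth in heap utilization" measurable is to sample used heap at a fixed interval and fit a straight line through the samples. This is a minimal sketch under stated assumptions: the class name `HeapTrend`, the sample count and the interval are all arbitrary choices, and raw samples include the sawtooth noise of normal garbage collection, so in practice a longer window (or post-GC values from verbose GC logs) gives a more trustworthy trend.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapTrend {

    /** Least-squares slope of the samples over their index (units of y per sample). */
    public static double slope(long[] y) {
        int n = y.length;
        if (n < 2) {
            return 0.0;
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (long v : y) {
            meanY += v;
        }
        meanY /= n;
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (y[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    /** Current used heap in bytes, read from the standard memory MXBean. */
    public static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) throws InterruptedException {
        long[] samples = new long[5];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = usedHeapBytes();
            Thread.sleep(200); // arbitrary sampling interval for the demo
        }
        System.out.printf("heap trend: %.0f bytes per sample%n", slope(samples));
    }
}
```

A slope that stays positive across many collection cycles is only an indicator; the specific detection steps on the slide are still needed to find which objects are leaking.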
  • 57. Linear Memory Leak
      Challenges: • References - many small objects are referenced in one collection • Death by 1000 cuts (paper cuts)
      Specific detection: • Figure out object types being leaked • Verbose GC • Find related APIs and search code for misuse
  • 58. Specific detection
      Heap Dump Comparison: • Needs at least 2 dumps • Stops the JVM • Can take several minutes each • Creates tons of data • Finds the object, not the code responsible for the leak
      Profiler: • High overhead - not for production • Lots of data
      APM Solution: • Collection-based algorithm: finds only collection leaks • Instance counting • Trade-off between low overhead and usefulness of data
  • 63. Exponential Memory Leak
      Causes: • Objects added to most data structures without being removed (e.g., vectors, hashtables) • Other API misuse (as Linear Leak)
      Aggregate detection: • Exponential growth in heap
      Specific detection: • Same as Linear Leak
  • 64. Resource Leak
      Causes: • API misuse of Java objects with resource-style lifecycle (create -> use -> destroy)
      Aggregate detection: • Slow over time • Growth in heap (if you're lucky)
      Specific detection: • Audit code for API misuses • Object instance tracking
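The create -> use -> destroy lifecycle typically leaks when the destroy step is skipped on an error path. The sketch below contrasts that with try-with-resources. `Handle` is a hypothetical resource standing in for a connection, file handle or socket, and its open-instance counter is a crude form of the "object instance tracking" mentioned on the slide.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ResourceLifecycleDemo {

    /** Hypothetical resource that counts how many instances are currently open. */
    public static class Handle implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();

        public Handle() {
            OPEN.incrementAndGet();
        }

        public void use(boolean fail) {
            if (fail) {
                throw new IllegalStateException("backend error");
            }
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Leaky: close() is skipped whenever use() throws. */
    public static void leaky(boolean fail) {
        Handle handle = new Handle();
        handle.use(fail);
        handle.close();
    }

    /** Safe: try-with-resources closes the handle on every path. */
    public static void safe(boolean fail) {
        try (Handle handle = new Handle()) {
            handle.use(fail);
        }
    }

    public static int openHandles() {
        return Handle.OPEN.get();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            try {
                leaky(true);
            } catch (IllegalStateException expected) {
                // the handle created in leaky() is now lost
            }
            try {
                safe(true);
            } catch (IllegalStateException expected) {
                // the handle created in safe() was closed before the exception propagated
            }
        }
        System.out.println("handles still open: " + openHandles());
    }
}
```

When auditing code for this kind of misuse, any resource created outside a try-with-resources or try/finally block is a reasonable first suspect.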
  • 66. Resource conflict / blocking
      Causes: • Overcautious data integrity strategy • "Synchronising is always good"
      Aggregate detection: • Stalled threads • High thread usage - low CPU usage
      Specific detection: • Thread dumps as needed • Stack traces / graphs • CPU block / wait timing measurement
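A very rough in-process approximation of "high thread usage - low CPU usage" is to count threads by state: many BLOCKED or WAITING threads next to few RUNNABLE ones point towards lock contention. This is a minimal sketch using only the standard `Thread.getAllStackTraces()` call; the class name `ThreadStateSummary` is hypothetical, and a full thread dump (for example via jstack) carries far more detail.

```java
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSummary {

    /** Counts all live threads of this JVM by their current state. */
    public static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(thread.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countByState().forEach((state, count) ->
                System.out.println(state + ": " + count));
    }
}
```

Taking this snapshot several times a few seconds apart shows whether the same threads stay blocked, which is the point at which a full thread dump becomes worth the effort.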
  • 68. Bad Coding: Infinite Loop
      Causes: • Infinite loop in code
      Aggregate detection: • Stalled threads • Permanently high usage of CPU / threads
      Specific detection: • Thread dumps as needed • Stack traces / graphs
  • 69. Bad Coding: CPU-Bound Component
      Causes: • Idiot with a "Learn Java in 24 Hours" book
      Aggregate Detection: • Response time measurement • Aggregate CPU utilization
      Specific Detection: • Detailed CPU utilization
      Typical Cure: • Cache of data or of performed calculations
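The "cache of performed calculations" cure can be sketched with a small memoizing wrapper. This is an illustration under assumptions, not a recommendation from the talk: `CalculationCache` is a hypothetical name, the cache is deliberately unbounded for brevity, and an unbounded cache is itself a candidate for the memory leaks described earlier, so a real one would need a size limit or eviction policy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class CalculationCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> expensive;
    private final AtomicInteger computations = new AtomicInteger();

    public CalculationCache(Function<K, V> expensive) {
        this.expensive = expensive;
    }

    /** Returns the cached result, running the expensive function only on a cache miss. */
    public V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            computations.incrementAndGet();
            return expensive.apply(k);
        });
    }

    /** Number of times the expensive function actually ran. */
    public int computations() {
        return computations.get();
    }

    public static void main(String[] args) {
        // Stand-in for a CPU-bound calculation: sum of squares up to n.
        CalculationCache<Integer, Long> sums = new CalculationCache<>(n -> {
            long sum = 0;
            for (int i = 1; i <= n; i++) {
                sum += (long) i * i;
            }
            return sum;
        });
        for (int request = 0; request < 1000; request++) {
            sums.get(1_000_000);
        }
        System.out.println("expensive computations run: " + sums.computations());
    }
}
```

Whether caching pays off depends on how often the same inputs recur, which is something the response time and CPU measurements on the slide can help confirm.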
  • 70. Layer-itis
      Causes: • Poorly implemented data bridge layer, or simply too many of them • DB -> XML -> XSLT -> More XML -> "Custom Data Management Layer" -> Consumer
      Aggregate Detection: • Response time measurements
      Specific Detection: • Call graphs - Call trace (stack trace not enough) • Ask for a design or architecture document
  • 71. O/R Mapper misuse
      Causes: • "Hibernate fixes everything" • Massive SQL statements (length and amount) • Wrong data strategy
      Aggregate Detection: • Response time measurements • DB time measurements
      Specific Detection: • Call stacks / snapshots
  • 73. The Unending Retry
      Causes: • Continual attempts to call backend + unavailable backend
      Aggregate Detection / Specific Detection: • Response time measurement • Backend detection - measurement (time & # of calls) • Stalled TX count • Exceptions • Busy thread count
  • 74. Don’t forget about thrown exceptions
  • 75. Threading: Deadlock / Livelock
      Causes: • Fundamental error in threading / lock acquisition strategy
      Aggregate Detection: • Stalled threads / permanently high concurrent usage
      Specific Detection: • Deadlock detection in JVM • Thread dumps • Busy thread count
  • 76. Threading: Deadlock
      Found one Java-level deadlock:
      =============================
      "Thread-2":
        waiting to lock monitor 102054308 (object 7f3113800, a java.lang.Object),
        which is held by "Thread-1"
      "Thread-1":
        waiting to lock monitor 1020348b8 (object 7f3113810, a java.lang.Object),
        which is held by "Thread-2"

      Java stack information for the threads listed above:
      ===================================================
      "Thread-2":
          at DeadlockTest$2.run(DeadlockTest.java:42)
          - waiting to lock <7f3113800> (a java.lang.Object)
          - locked <7f3113810> (a java.lang.Object)
          at java.lang.Thread.run(Thread.java:680)
      "Thread-1":
          at DeadlockTest$1.run(DeadlockTest.java:26)
          - waiting to lock <7f3113810> (a java.lang.Object)
          - locked <7f3113800> (a java.lang.Object)
          at java.lang.Thread.run(Thread.java:680)
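The same deadlock information that appears in the thread dump above can also be queried programmatically through the standard `ThreadMXBean`. The following is a minimal sketch; the class name `DeadlockCheck` and the output format are assumptions, while `findDeadlockedThreads()` and `getThreadInfo(long[])` are part of `java.lang.management`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class DeadlockCheck {

    /** Returns one description per deadlocked thread, or an empty list if there is no deadlock. */
    public static List<String> deadlockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock exists
        List<String> descriptions = new ArrayList<>();
        if (ids == null) {
            return descriptions;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // a thread may have terminated in the meantime
                descriptions.add(info.getThreadName()
                        + " waiting for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return descriptions;
    }

    public static void main(String[] args) {
        List<String> deadlocks = deadlockedThreads();
        if (deadlocks.isEmpty()) {
            System.out.println("no deadlock detected");
        } else {
            deadlocks.forEach(System.out::println);
        }
    }
}
```

This only finds true deadlocks. A livelock or a chokepoint keeps threads technically alive, so those cases still need the stalled-thread and busy-thread metrics.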
  • 77. Threading: Chokepoint
      Causes: • Many threads bottlenecked waiting for one lock
      Aggregate Detection: • Stalled threads / high concurrent usage • Exponential slowness • Low CPU usage
      Specific Detection: • Request response time monitoring • CPU block / wait timing
  • 79. Internal Resource Bottleneck
      Causes: • Overusage of internal resource (threads, database connections, etc.) • Underallocation of same
      Aggregate Detection: • Stalled threads / high concurrent usage • Call rate and average response time of internal resource
      Specific Detection: • Also compare with methods from Resource Leak, External Bottleneck, and Overusage of External System
  • 80. External Bottleneck
      Causes: • External system (database, authentication server) is slow • Compare with Overusage of external system
      Aggregate Detection: • Response time on backend calls • Exceptions
      Specific Detection: • Callgraphs • Specific monitoring on those backends
  • 82. Production ground to a halt for 2 hours, and again the next day [Chart series: Trx/min, Avg RT, Pool Limit, Pool Usage, Trx Stalls]
  • 83. Overusage of External System
      Causes: • Poor design or tuning of interaction with backend system (e.g., join between two million-row tables for each user logon) • O/R mapper misconfiguration
      Aggregate Detection: • Response time measurement
      Specific Detection: • Timing on backend systems • Also need tools for those backend systems
  • 86. Applications will become more Complex and Change Faster [Diagram: evolution from a monolithic to a distributed application. Clients: Browser(s), Native Mobile App, CDN, Network. Services on WebLogic, JBoss, Tomcat, Apache, PHP, .NET, 3rd Party Web Service, ESB/MQ, Memcached. Data stores: Oracle, Sybase, MySQL, DB2, SQL Server, PostgreSQL, Cassandra, Hadoop. Infrastructure: VMWare (private), Amazon EC2 (public). Trend labels: AGILE, WEB 2.0, SOA, MOBILE, NOSQL, CLOUD, BIG DATA] Copyright © 2013 AppDynamics. All rights reserved.
  • 87. • One interesting problem occurs when the size of transactions with backend systems needs to be tuned • Can be intertwined with / exacerbated by Layer-itis and Overusage of External System
      Many small requests: System constantly wastes resources dispatching / unmarshalling many transactions and results ("Death by a thousand cuts")
      "Just Right"
      One HUGE request: System periodically slows to a crawl as many resources get thrown at large chunk of work ("Pig in a Python")
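The trade-off between many small requests and one huge request usually comes down to a tunable batch size. The sketch below shows only the partitioning step; `RequestBatcher`, `partition` and the batch size of 250 are hypothetical, and the "just right" value has to come from measurement against the real backend.

```java
import java.util.ArrayList;
import java.util.List;

public class RequestBatcher {

    /** Splits the items into consecutive batches of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add(i);
        }
        // Batch size 1 is "death by a thousand cuts", batch size 1000 is the "pig in a python".
        for (int batchSize : new int[] {1, 250, 1000}) {
            System.out.println("batch size " + batchSize + " -> "
                    + partition(ids, batchSize).size() + " backend calls");
        }
    }
}
```

Making the batch size a configuration value rather than a constant allows it to be retuned as data volume grows between test and production.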