Anatomy of a real-life incident
Alex Solomon
CTO & Co-Founder @ PagerDuty
THIS IS A TRUE STORY
The events in this presentation took place in
San Francisco and Toronto on January 6, 2017
In the interest of brevity, some details have
been omitted
The Services
• Web2Kafka Service: publishes change events from web monolith to Kafka for other services to consume
• Incident Log Entries Service: stores log entries for incidents
• Docker
• Mesos / marathon
• Linux Kernel
The People
Major incident response principal roles
• Eric: Incident Commander
• Peter: Scribe
• Ken: Deputy
• Luke: Communications Liaison
Subject Matter Experts (SMEs)
• David: Core on-call
• Cees: Core eng
• Evan: SRE on-call
• Renee: IM People on-call
• Zayna: Mobile on-call
• JD: IM Data on-call
• Priyam: EM on-call
The Incident
[3:21 PM] David:
SME
!ic page
Officer URL:
Chat BOT
🚨Paging Incident Commander(s)

✔ Eric has been paged.
✔ Ken has been paged.
✔ Peter has been paged.
Incident triggered in the following service:
https://pd.pagerduty.com/services/PERDDFI
David:
SME
web2kafka is down, and I'm not sure what's
going on
kicked off the major incident
process
[3:21 PM] Eric:
IC
Taking IC
Eric took the IC role (he was IC
primary on-call)
The Incident Commander
• The Wartime General: decision maker during a major incident
• GOAL: drive the incident to resolution quickly and effectively
• Gather data: ask subject matter experts to diagnose and
debug various aspects of the system
• Listen: collect proposed repair actions from SMEs
• Decide: decide on a course of action
• Act (via delegation): once a decision is made, ask the team
to act on it. IC should always delegate all diagnosis and
repair actions to the rest of the team.
Priyam:
SME
I’m here from EM
Evan:
SME
lmk if you need SRE
sounds like IHM might be down too
Ken:
DEPUTY
@renee, please join the call
[3:22 PM]
Ken took the deputy role
Other SMEs joined
The Deputy (backup IC)
• The Sidekick: right hand person for the IC
• Monitor the status of the incident
• Be prepared to page other people
• Provide regular updates to business and/or
exec stakeholders
Peter:
SCRIBE
I am now the scribe
Eric: Looking to find Mesos experts
Evan: Looking for logs & dashboards
Zayna:
SME
seeing a steady rise in crashes in
Android app around trigger incident log
entries
[3:24 PM]
JD:
SME
No ILEs will be generated due to LES
not being able to query web2kafka
[3:25 PM]
Eric: David, what have you looked at?
David: trolling logs, see errors
David: tried restarting, doesn’t help
[3:23 PM] Ken:
DEPUTY
Notifications are still going out, subject
lines are filled in but not email bodies
(they use ILEs)
Renee:
SME
Peter becomes the scribe
Discussing customer-visible
impact of the incident
Ken is both deputy and
scribe
The Scribe
• The Record-keeper
• Add notes to the chatroom when findings are
determined or significant actions are taken
• Add TODOs to the room that indicate follow-ups
for later (generally after the incident)
• Monitor tasks assigned by the IC to other
team members, remind the IC to follow up
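The scribe's duties above amount to keeping a small structured log. Below is a minimal sketch in Python, assuming hypothetical `Entry` and `ScribeLog` names (they are not part of any PagerDuty tool). It stores timestamped notes, flags TODOs for after the incident, and returns everything in chronological order, which is the shape a post-mortem timeline needs. The sample notes are taken from the incident chat; their timestamps are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class Entry:
    """One scribe note; ordering compares the timestamp only."""
    at: datetime
    text: str = field(compare=False)
    is_todo: bool = field(default=False, compare=False)


class ScribeLog:
    """Hypothetical in-memory log; a real scribe posts to the chatroom."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def note(self, clock: str, text: str) -> Entry:
        # A leading "TODO:" marks a follow-up for after the incident.
        at = datetime.strptime(clock, "%I:%M %p")
        entry = Entry(at, text, text.upper().startswith("TODO:"))
        self._entries.append(entry)
        return entry

    def timeline(self) -> list[Entry]:
        # Notes may be recorded out of order, so sort by timestamp.
        return sorted(self._entries)

    def todos(self) -> list[Entry]:
        return [e for e in self.timeline() if e.is_todo]


log = ScribeLog()
log.note("3:37 PM", "TODO: Create a runbook for mesos to stop the world and start again")
log.note("3:34 PM", "Eric: Is there a runbook for mesos?")
log.note("3:38 PM", "Evan: OOM killer has kicked in on slave01")

for e in log.timeline():
    print(e.at.strftime("%I:%M %p"), e.text)
```

Keeping TODOs in the same log as the notes means the follow-up list falls out of the timeline for free once the incident is over.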
Renee:
SME
Can’t expand incident details
Luke:
CUST LIAISON
suggested tweet: `There is currently an
issue affecting the incident log entries
component of our web application
causing the application to display
errors. We are actively investigating.`
[3:29 PM]
David: No ILEs can be created
Renee: no incident details, error msg in
the UI
[3:27 PM] Peter:
SCRIBE
Eric: Comms rep on the phone? Luke
Eric to Luke: Please compose a tweet
Peter:
SCRIBE
Eric: What’s the customer impact?
[3:26 PM] Peter:
SCRIBE
Luke to tweet
Peter:
SCRIBE
IC asked the customer liaison
to write a msg to customers
Msg was sent out to
customers
The Communications Liaison
• The link to the customer
• Monitor customer and business impact
• Provide regular updates to customers (and/or
to customer-facing folks in the business)
• (Optional) Provide regular updates to
stakeholders
Cees:
SME
I’m away from any laptops, just arrived
at a pub for dinner.
[3:36 PM]
@cees Would you join us on the bridge?
We have a few Mesos questions
Eric:
IC
Evan: might need to kick new hardware if
system is actually unreachable.
Evan: slave01 is reachable
David: slave02 is not reachable.
David: slave03 is not reachable.
David: only 3 slaves for mesos
Eric: We are down to only one host
Evan: Seeing some stuff. Memory
exhaustion.
[3:37 PM] Peter:
SCRIBE
TODO: Create a runbook for mesos to
stop the world and start again
Peter:
SCRIBE
David added Cees to the incident
Eric: Is there a runbook for mesos?
David: Yes, but not for this issue.
[3:34 PM] Peter:
SCRIBE
Scribe captured a TODO to record &
remember a follow-up that should
happen after the incident is resolved
We paged a Mesos expert
who is not on-call
The Mesos expert joined the
chat
David: Only 3 slaves in that cluster, we
have another cluster in us-west-1
Eric: Two options: kick more slaves or
restart marathon
[3:38 PM] Peter:
SCRIBE
Evan: OOM killer has kicked in on
slave01
[3:39 PM] Peter:
SCRIBE
Eric: Stop slaves in west2, startup
web2kafka in west1
Evan: slave02 is alive!
Eric: Waiting 2 minutes
[3:47 PM] Peter:
SCRIBE
David: Consider bringing up another cluster?
Cees: Should be trivial
[3:44 PM] Peter:
SCRIBE
Eric to evan: please reboot slave02 and
slave03
[3:41 PM] Peter:
SCRIBE
Restart slaves first
Cees:
SME
slave01 is now down
[3:42 PM] Evan:
SME
They are considering
bringing up another
Mesos cluster in west1
slave02 is back up after
reboot, so they hold off
on flipping to west1
Noticed that oom-killer
killed the docker
process on slave01
Evan: Slave02 is quiet.
Evan: Slave02 is trying to start, exiting with
code 137
[3:49 PM] Peter:
SCRIBE
Evan: Slave02 is quiet.
Evan: 137 means it’s being killed by OOM,
OOM is killing docker containers
continuously
Peter:
SCRIBE
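Evan's reading of code 137 follows the usual shell convention that a process killed by a signal reports 128 plus the signal number, and 137 - 128 = 9 is SIGKILL, the signal the Linux oom-killer sends. A minimal sketch of that decoding, using hypothetical helper names and assuming a POSIX system. Note that 137 only proves SIGKILL; the kernel log is what confirms an OOM kill.

```python
import signal
from typing import Optional


def signal_from_exit_code(code: int) -> Optional[signal.Signals]:
    """Return the signal implied by a shell-style exit code, if any."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128)
    except ValueError:
        # Above 128 but not a known signal number on this platform.
        return None


def describe(code: int) -> str:
    sig = signal_from_exit_code(code)
    if sig is None:
        return f"exit {code}: not a signal-style exit code"
    hint = ""
    if sig == signal.SIGKILL:
        hint = " (possible OOM kill; check the kernel log)"
    return f"exit {code}: terminated by {sig.name}{hint}"


print(describe(137))
print(describe(1))
```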
[3:53 PM] Proposed Action: David is going to
configure marathon to allow more memory
Peter:
SCRIBE
[3:54 PM] Proposed Action: Evan to force reboot
slave01
Peter:
SCRIBE
[3:56 PM] David: Web2kafka appears to be running
Eric: Looks like all things are running
Renee: Things are fine with notifications
JD: LES is seeing progress
Peter:
SCRIBE
[3:55 PM] Customer impact: there are 4 tickets so far
and 2 customers chatting with us, which is
another 2 tickets
Luke:
CUST LIAISON
They realized the
problem: oom-killer is
killing the docker
containers over and over
The resolution action was
to redeploy web2kafka
with a higher cgroup/Docker
memory limit: 2 GB
(vs 512 MB before)
The customer liaison
provided an update on
the customer impact
The system is recovering
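The resolution on this slide, raising the memory limit from 512 MB to 2 GB, can be sketched as a change to a Marathon-style app definition, where the `mem` field is expressed in MiB. This is an illustration only: the app id, image name, CPU and instance values are hypothetical, and only the two memory values come from the slides. The sketch builds the updated definition and prints it as JSON; it does not talk to a real cluster.

```python
import copy
import json

# Hypothetical app definition; only the memory values come from the slides.
WEB2KAFKA_APP = {
    "id": "/web2kafka",
    "cpus": 1.0,
    "mem": 512,  # MiB, the limit in place before the incident
    "instances": 1,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/web2kafka:latest"},
    },
}


def with_memory_limit(app: dict, mem_mib: int) -> dict:
    """Return a copy of the app definition with a new memory limit."""
    if mem_mib <= 0:
        raise ValueError("memory limit must be positive")
    updated = copy.deepcopy(app)  # leave the original definition untouched
    updated["mem"] = mem_mib
    return updated


new_app = with_memory_limit(WEB2KAFKA_APP, 2048)  # 2 GB, as in the resolution
print(json.dumps(new_app, indent=2))
```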
The Punchline
• Root cause
• Increase in traffic caused web2kafka to increase its memory usage
• This caused the Linux oom-killer to kill the process
• Then, mesos / marathon immediately restarted it, it ramped up memory
again, oom-killer killed it, and so on.
• After doing this restart-kill cycle multiple times, we hit a race-condition
bug in the Linux kernel causing a kernel panic and killing the host
• Other services running on the host were impacted, notably LES
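The restart-kill cycle described above can be illustrated with a toy model. This is a sketch under stated assumptions, not a reconstruction of the real workload: the peak demand of 800 MiB and the ramp of 200 MiB per tick are invented numbers, and only the 512 MB and 2 GB limits come from the slides. The model does not attempt to represent the kernel panic that the repeated kills eventually triggered.

```python
def count_oom_kills(limit_mib: int, peak_mib: int,
                    ramp_mib_per_tick: int, ticks: int) -> int:
    """Toy model of the loop: memory ramps up each tick; crossing the
    limit means an OOM kill followed by an immediate restart."""
    kills = 0
    used = 0
    for _ in range(ticks):
        used = min(used + ramp_mib_per_tick, peak_mib)
        if used > limit_mib:
            kills += 1  # oom-killer kills the container
            used = 0    # the scheduler restarts it right away
    return kills


# With the old 512 MiB limit the process is killed every third tick.
print(count_oom_kills(limit_mib=512, peak_mib=800, ramp_mib_per_tick=200, ticks=60))
# With the 2 GiB limit the assumed peak fits and the loop never starts.
print(count_oom_kills(limit_mib=2048, peak_mib=800, ramp_mib_per_tick=200, ticks=60))
```

Because the restart is immediate, the number of kills grows linearly with time for as long as demand stays above the limit, which is why the cycle repeated often enough to expose the kernel bug.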
Summary
• Incident Command
• The most important role, crucial to fast decision making and action!
• Takes practice and experience
• Deputy
• The right-hand person for the IC, can step in and take over Incident Command for long-running incidents
• Responsible for business & exec stakeholder communications, allowing the technical team to focus on incident
resolution
• Scribe
• Essential for providing context in the chatroom and tracking follow-ups & action items (for example, the IC
saying “Evan, do X, report back in 5 min”)
• Produces step-by-step documentation which is very helpful for constructing the timeline later (in the post-mortem)
• Communications liaison
• Essential for tracking customer impact and communicating status to customers
The End
Alex Solomon
CTO & Co-Founder @ PagerDuty
alex@pagerduty.com
The PagerDuty Incident Response process and training are open-source: https://response.pagerduty.com