How to Build a Healthy On-Call Culture
Hello!
I am smalltown
MaiCoin Group Lead SRE
Taipei HashiCorp UG Organizer
AWS UG Taiwan Staff
Organization Type
On Call
Incident Response
Root Cause Analysis
Monitoring
My First Week After Joining MaiCoin (End of 2017)
~$ ssh smalltown@172.0.10.1
~$ df -ah
~$ rm /var/log/nginx/*
~$ service nginx restart
Why Do We Need Monitoring?
● A Service Without Monitoring Is Like Flying Blind
○ No Knowledge of the Service's Usual Health Status
○ No Notice When Something Bad Is About to Happen or Has Happened
○ No Way to Investigate the Issue After an Incident
● No Monitoring -> No Measurement -> No Quality
Common Monitoring Types
● End-To-End
○ Functionality
○ Performance
● Error Tracking
● Networking
○ Latency
○ Connectivity
● Infrastructure
○ CPU
○ Memory
○ Disk
○ …
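The infrastructure checks listed above can be as small as a threshold test. The first-week story of clearing nginx logs by hand is the kind of situation a disk check would flag earlier. This is a minimal sketch, not the setup used in the talk: the 80% threshold and the function names are assumptions made for illustration.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def is_disk_alert(used_percent: float, threshold: float = 80.0) -> bool:
    """Hypothetical rule: alert once usage reaches the threshold."""
    return used_percent >= threshold


percent = disk_usage_percent(".")
state = "ALERT" if is_disk_alert(percent) else "ok"
print(f"current filesystem is {percent:.1f}% full: {state}")
```

In practice a check like this would usually live in the monitoring system rather than in a script; the point is only that the rule is explicit and can be reviewed.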
External Monitoring
● DNS
● Certificate
● Geolocation (Global Ping)
○ Routing
○ Submarine Cable
● Content Delivery Network
● Web Application Firewall
● …
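As one example of the external checks above, certificate expiry can be measured from outside the service. The sketch below uses only the Python standard library; the host, port, and any warning window are assumptions, and it is not the tooling described in the talk.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter time.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field.

    With the default context, an already expired certificate fails the
    handshake and raises ssl.SSLCertVerificationError; a caller should
    treat that exception as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A caller would combine the two, for example `days_until_expiry(fetch_not_after("example.com"), time.time())`, and warn when the result drops below a chosen number of days.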
Monitoring As Code - Internal
YAML -> Git Repository -> ArgoCD -> Prometheus
Monitoring As Code - External
HCL -> Git Repository -> Terraform -> …
Organization Scale
Organization Geographical Distribution
Organization Architecture - Traditional DC
Development Operations
Organization Architecture - DevOps
SRE SE
EM
SRE = Site Reliability Engineer
EM = Engineering Manager
SE = Software Engineer
Our Organization Architecture
SRE SE
EM
SRE SE
EM
EM
On Call Engineer Responsibilities
● Routine Operations Work
● Handle Incidents
● Write and Refine Runbooks
● Weekly On Call Report
On Call Model - Two Tracks
SE
SE
EM
SRE
SRE
EM
Escalation Path
Infrastructure
Application
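One way to read the two-track model is as two escalation chains, one per alert category, each ending at an engineering manager. The sketch below encodes that reading. The "primary" and "secondary" labels and the mapping of infrastructure alerts to the SRE track and application alerts to the SE track are assumptions, since the slide shows only the role names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    name: str
    escalation: tuple  # ordered: first responder, second responder, manager


# Role names come from the slide; the category-to-track mapping is assumed.
TRACKS = {
    "infrastructure": Track("SRE", ("SRE primary", "SRE secondary", "EM")),
    "application": Track("SE", ("SE primary", "SE secondary", "EM")),
}


def next_responder(category: str, unanswered_pages: int) -> str:
    """Return who to page after `unanswered_pages` failed attempts."""
    chain = TRACKS[category].escalation
    # Stay on the last step (the EM) once the chain is exhausted.
    return chain[min(unanswered_pages, len(chain) - 1)]


print(next_responder("infrastructure", 0))  # SRE primary
print(next_responder("application", 2))     # EM
```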
Notification Problem
Voice Call Notification User Experience
Communication Between Teams
Arrange On Call Engineer Schedule
Notification System Architecture
Slack
Google Calendar
AWS Lambda
Twilio
…
Notification System - Rotation
● On Call Model
● Rotation Frequency
● Failure to Answer a Page
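Rotation frequency can be expressed as a simple calculation over a roster. This is a minimal sketch under stated assumptions: the names, the start date, and the weekly shift length are hypothetical, and the talk keeps its schedule in Google Calendar rather than computing it this way.

```python
from datetime import date


def on_call_for(day: date, roster: list, start: date, days_per_shift: int = 7) -> str:
    """Return the roster member on call on `day`.

    The roster cycles every `days_per_shift` days, counting from `start`.
    """
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // days_per_shift
    return roster[shift_index % len(roster)]


roster = ["alice", "bob", "carol"]  # hypothetical engineers
start = date(2022, 5, 2)            # a Monday, chosen for illustration

print(on_call_for(date(2022, 5, 2), roster, start))   # alice
print(on_call_for(date(2022, 5, 9), roster, start))   # bob
print(on_call_for(date(2022, 5, 23), roster, start))  # alice
```

Changing `days_per_shift` is how a different rotation frequency would be modeled; what happens after an unanswered page is a separate escalation decision.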
Notification System - Communication
● Slack User Groups Close the Communication Gap
● Check Document -> Check Calendar -> Tag Group Name
Notification System - Effectiveness
● Contacts Phone Book -> Slack Command W/ User Group/Name
● Short Phone Call Notice -> Phone Call W/ Validation
● Notification Visibility -> Detailed Status in Slack
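A phone call with validation could be built on Twilio's TwiML, where a `Gather` verb waits for a key press and posts the result as `Digits` to an action URL. The sketch below only builds the TwiML document and interprets the reply. The wording, the `/ack` URL, and the choice of key 1 are assumptions, and the slides do not say this is how the talk's system does it.

```python
import xml.etree.ElementTree as ET


def build_ack_twiml(alert_summary: str, ack_url: str) -> str:
    """Build TwiML that reads an alert and asks the callee to press 1."""
    response = ET.Element("Response")
    gather = ET.SubElement(
        response, "Gather", {"numDigits": "1", "action": ack_url, "method": "POST"}
    )
    say = ET.SubElement(gather, "Say")
    say.text = f"{alert_summary}. Press 1 to acknowledge."
    # Reached only if no key was pressed before Gather timed out.
    fallback = ET.SubElement(response, "Say")
    fallback.text = "No input received. The page will be escalated."
    return ET.tostring(response, encoding="unicode")


def is_acknowledged(digits: str) -> bool:
    """Interpret the `Digits` value posted to the Gather action URL."""
    return digits == "1"


print(build_ack_twiml("Disk usage high on web-1", "/ack"))
```

Requiring a key press separates a call that was actually heard from one that went to voicemail, which is the validation step the slide refers to.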
Become Aware of Incidents Before Customers Do
Incident Handling Terminology
Role
● Incident Commander
● Tech Lead
● Communication Lead
● Engineering Manager
Incident Level
● S3
● S2
● S1
● S0
Incident Handling Process
Initiate Incident
Define Incident
Assign Roles
Start Maintenance
Initial Notification
Investigate & Fix
Verification
Maintenance End
Recovery Notification
Retrospective
Enhancements
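The process above can be sketched as an ordered checklist that records when each step was reached, which later feeds the incident timeline. The step names are lightly normalized from the slide, and treating the process as strictly linear is an assumption of this sketch.

```python
from datetime import datetime, timezone

PHASES = (
    "Initiate Incident",
    "Define Incident",
    "Assign Roles",
    "Start Maintenance",
    "Initial Notification",
    "Investigate & Fix",
    "Verification",
    "Maintenance End",
    "Recovery Notification",
    "Retrospective",
    "Enhancements",
)


class IncidentLog:
    """Records each phase in order together with the time it started."""

    def __init__(self):
        self.entries = []  # (phase, timestamp) pairs, in order

    def advance(self, when=None):
        """Record the next phase and return its name."""
        if len(self.entries) == len(PHASES):
            raise RuntimeError("all phases are already recorded")
        phase = PHASES[len(self.entries)]
        self.entries.append((phase, when or datetime.now(timezone.utc)))
        return phase


log = IncidentLog()
print(log.advance())  # Initiate Incident
print(log.advance())  # Define Incident
```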
Incident Visibility - Internal
Incident Visibility - External
● Service Health Page
● Mobile App Push Notification
● Facebook, Telegram, Twitter… Customer/Vendor Communication Channel
What is Root Cause Analysis (RCA)?
● A Systematic Process for Identifying “Root Causes” of Problems or Events and an Approach for Responding to Them
○ What Happened
○ How it Happened
○ Why it Happened…so
○ Actions for Preventing Reoccurrence are Developed
How to Prepare an RCA?
● Incident Timeline
○ 2022/04/29 16:00 Receive Alert from Prometheus
○ 2022/04/29 16:03 Start Incident Response
○ …
● Findings/Root Cause
○ The HTTP Status Code Returned Was 400
○ Found That the TLS Certificate Had Expired
● Follow-up/Corrective Action
○ Monitor All TLS Certificates [Ticket: ID-1234][Owner: smalltown][Status: On-Going]
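The three parts of the RCA above map naturally onto a small record type, which makes it easy to list follow-up actions that are still open. This is a sketch only: the class and field names are hypothetical, and "Done" as the closing status is an assumption. The sample values are the ones shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    ticket: str
    owner: str
    status: str


@dataclass
class RcaReport:
    timeline: list = field(default_factory=list)  # (timestamp, event) pairs
    findings: list = field(default_factory=list)
    actions: list = field(default_factory=list)

    def open_actions(self) -> list:
        """Follow-up actions that are not finished yet ("Done" is assumed)."""
        return [a for a in self.actions if a.status != "Done"]

    def render(self) -> str:
        """Render the report in the same three-part layout as the slide."""
        lines = ["Incident Timeline"]
        lines += [f"  {ts} {event}" for ts, event in self.timeline]
        lines.append("Findings/Root Cause")
        lines += [f"  {finding}" for finding in self.findings]
        lines.append("Follow-up/Corrective Action")
        lines += [
            f"  {a.description} [Ticket: {a.ticket}][Owner: {a.owner}][Status: {a.status}]"
            for a in self.actions
        ]
        return "\n".join(lines)


report = RcaReport(
    timeline=[
        ("2022/04/29 16:00", "Receive Alert from Prometheus"),
        ("2022/04/29 16:03", "Start Incident Response"),
    ],
    findings=["TLS Certificate Expired"],
    actions=[Action("Monitor All TLS Certificates", "ID-1234", "smalltown", "On-Going")],
)
print(report.render())
```

Keeping ticket, owner, and status on every action is what lets a weekly on-call report show which corrective actions are still outstanding.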
Blameless Culture
THANKS!
ANY QUESTIONS?
You can find me at my office:
● MicroService Engineer
● Backend Engineer
● Frontend Engineer
● ...

SRE Conference 2022 - How to Build a Healthy On-Call Culture