Service Levels and Error Budgets
Paweł Kucharski
Site Reliability Engineering
Automation
Monitoring
Embracing Risk
Service Level Objectives (SLOs)
Global Storage
Emergency Incident Response
Software Engineering
Systems Engineering
Blameless Failures
google.com/sre
"Hope Is Not A Strategy"
Planet-Scale Distributed Systems
Load Balancing
Availability
Site Reliability Engineering
● Principles, practices, & management
of Google’s production systems
○ ~70 contributors, 500+ years of experience
○ Available at fine bookstores near you or
read online at g.co/srebook
Agenda
1. How good is a service?
2. How good should it be?
3. When do we need to make it better?
4. How do we make it better?
Agenda
1. How good is a service?
2. How good should it be?
3. When do we need to make it better?
4. How do we make it better?
Service Level Indicators (SLIs)
● An indicator (SLI) is a quantitative measure of how good
some attribute of the service is.
● An attribute is a dimension the service’s users care about,
such as:
○ throughput, how much work the service can do
○ latency, how long the work takes
○ availability, how often the service can do work
○ correctness, whether the work is right
Choosing Indicators
1. Figure out what service properties users care about
2. Collect data about those properties
a. Start analyzing server logs
b. Instrument the service and export metrics to monitoring
3. Choose a few metrics and carefully define indicators
Defining Indicators
1. Start with a property: “latency”
2. Specify the property: “HTTP GET latency”
3. Choose how to measure it: “measured at the client”
4. Choose the universe of measurement: “all frontend servers”
Result: “HTTP GET latency measured at the client every 10
seconds, by a black-box prober, across all frontend servers”
Shorthand: “client-side frontend GET latency”
Understand Your Metric (1)
Understand Your Metric (2)
● Use histograms and percentiles instead of averages
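A tiny illustration with hypothetical latency samples (not from the talk) shows why: a handful of slow requests barely move the mean but dominate the p95.

```python
import statistics

# Hypothetical latency samples in milliseconds: mostly fast, with a slow tail.
samples = [20] * 95 + [900] * 5

mean = statistics.mean(samples)                  # looks healthy
p95 = statistics.quantiles(samples, n=100)[94]   # tail exposed

print(f"mean={mean:.1f} ms  p95={p95:.1f} ms")
# mean=64.0 ms  p95=856.0 ms
```

The average suggests a fast service; the 95th percentile reveals that one user in twenty waits nearly a second.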
Re-Defining Indicators
Previous Result: “HTTP GET latency measured at the client
every 10 seconds, by a black-box prober, across all frontend
servers”
New Result: “95th percentile of HTTP GET latency measured by
a black-box prober every second, aggregated every 10 seconds,
across all frontend servers”
Shorthand: “p95 client-side frontend GET latency”
Agenda
1. How good is a service?
2. How good should it be?
3. When do we need to make it better?
4. How do we make it better?
How Good Should A Service Be?
● How {fast, reliable, available, …} a service should be is
fundamentally a product question
● “100% is the wrong reliability target for (nearly) everything”
○ cost of marginal improvements grows ~exponentially
● Can always make service better on some dimension, but
involves tradeoffs with $, people, time, and other priorities
○ Product & dev management best placed for tradeoffs
Set Achievable Targets
● With a shared understanding of what users need from the service,
product management, dev management, and SRE management
jointly agree on targets
○ “p95 latency should be < 250 ms, 99.9% of the month”
○ “should respond to requests at least 99.9% of the time”
● Targets should be ambitious but achievable
Service Level Objectives
● An SLO is a mathematical relation like:
○ SLI ≤ target
○ lower bound ≤ SLI ≤ upper bound
● Publish SLO to users & try to be slightly better, just in case
○ defines what users can reasonably expect & design for
○ but don’t be too much better or users will depend on it
○ plan controlled outages to find implicit dependencies
when it’s convenient, not at random
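As a sketch, the relation can be checked mechanically; the function name and example values here are illustrative, not an API from the talk:

```python
def slo_met(sli_value, target, lower=None):
    """SLO as a relation over a measured SLI:
    lower <= SLI <= target (lower bound optional)."""
    if lower is not None and sli_value < lower:
        return False
    return sli_value <= target

# "p95 client-side frontend GET latency <= 250 ms"
print(slo_met(180.0, target=250.0))  # True: within objective
print(slo_met(310.0, target=250.0))  # False: objective violated
```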
Agenda
1. How good is a service?
2. How good should it be?
3. When do we need to make it better?
4. How do we make it better?
Error Budgets
● An SLO implies acceptable levels of errors
○ 99.9% availability ⇒ 0.1% unavailability
● Tolerable errors accommodate
○ rolling out new software versions (which might break)
○ releasing new features
○ inevitable failure in hardware, networks, etc.
○ redesigning for other priorities
● The budget tracks the SLO over a 30-day rolling window
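For example, the 99.9% objective above implies this much budget over the window (a back-of-the-envelope calculation, expressing the budget as minutes of full downtime):

```python
SLO = 0.999        # 99.9% availability objective
WINDOW_DAYS = 30   # rolling window from the slide

budget_fraction = 1 - SLO                            # 0.1% unavailability allowed
budget_minutes = budget_fraction * WINDOW_DAYS * 24 * 60

print(f"{budget_minutes:.1f} minutes of downtime per {WINDOW_DAYS}-day window")
# 43.2 minutes of downtime per 30-day window
```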
Balance Reliability and Velocity
● Error budgets balance reliability with feature velocity
○ SRE’s job is not “zero outages”, default answer isn’t “no”
○ instead, maximize velocity given reliability constraint
● Simple version:
○ release features until error budget exhausted, then
○ focus devs on reliability improvements until the budget refills
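The simple policy above might be sketched like this (the function name and the minutes-of-downtime units are illustrative assumptions):

```python
def release_allowed(budget_spent, budget_total):
    """Simple error-budget policy: ship features while budget remains;
    once it is exhausted, freeze launches and work on reliability."""
    return budget_spent < budget_total

print(release_allowed(12.0, 43.2))  # True: budget left, keep shipping
print(release_allowed(43.2, 43.2))  # False: budget exhausted, freeze
```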
Sophisticated Error Budgets
● change pace of feature releases given remaining budget
● keep a “rainy-day fund” for unexpected events
● use budget exhaustion rate to drive alerting
○ e.g. alert if recent errors > 1% of remaining budget
● a small number of “silver bullets” to allow true emergency launches
despite a budget-driven launch freeze
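The exhaustion-rate alert from the example bullet could look like this sketch; the 1% threshold comes from the slide, everything else (names, units) is assumed:

```python
def should_alert(recent_errors, budget_total, budget_spent, threshold=0.01):
    """Page when errors observed in the recent window consume more than
    `threshold` (1%, per the slide) of the remaining error budget."""
    remaining = budget_total - budget_spent
    return remaining <= 0 or recent_errors > threshold * remaining

# With 3.2 budget-minutes left, 0.1 minutes of fresh errors exceeds 1% of it.
print(should_alert(recent_errors=0.1, budget_total=43.2, budget_spent=40.0))   # True
print(should_alert(recent_errors=0.01, budget_total=43.2, budget_spent=0.0))   # False
```

Tying alerting to the budget's burn rate pages people in proportion to how much of the users' tolerance is actually being consumed, rather than on raw error counts.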
Agenda
1. How good is a service?
2. How good should it be?
3. When do we need to make it better?
4. How do we make it better?
What Should SREs Do?
● build monitoring systems to measure indicators
● provide input on feasibility of achieving targets
● work with devs to improve both reliability and velocity
○ standardize infrastructure
○ consulting on system design
○ build safe release & rollback systems
○ phase rollouts & use load balancing, etc. to minimize
budget damage a release can do
What Must SREs Have?
● SRE must have (and use) the authority to halt launches that
exceed the error budget
○ requires strong support from management
● SRE must have ability to return pager to devs
○ a strong SRE & dev partnership is essential; no code “thrown
over the wall”
● SRE must have similar experience to devs
○ ability to code
Things To Remember
1. Use SLIs to understand your service’s key metrics.
2. Use SLOs to specify how good the service needs to be.
3. Use error budgets to control release velocity.
Thank You
