Error Budgets Explained: How Google Does SRE

Error budgets are the core mechanism that makes Google’s Site Reliability Engineering (SRE) work. They answer the question every engineering team faces: “How much risk can we take when shipping changes?”

The Core Idea

An error budget is the maximum amount of unreliability your service can tolerate before you’ve violated your SLO. It’s calculated simply:

Error Budget = 1 - SLO target

If your SLO is 99.9% availability, your error budget is 0.1%. Over a 30-day month (43,200 minutes), that’s 43.2 minutes of allowed downtime.
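The arithmetic above can be sketched in a few lines of Python. The function names here (`error_budget`, `allowed_downtime_minutes`) are illustrative, not part of any library, and the sketch assumes a time-based availability SLO over a fixed-length window.

```python
def error_budget(slo_target: float) -> float:
    """Return the error budget as a fraction, e.g. 0.999 -> roughly 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Convert an availability SLO into minutes of allowed downtime per window."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return error_budget(slo_target) * window_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} SLO -> {minutes:.1f} minutes per 30 days")
```

Each extra nine divides the budget by ten, which is why the jump from 99.9% to 99.99% is so much more expensive than it looks.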

Use the error budget calculator to compute your specific budget and burn rate.

Why Error Budgets Work

Without error budgets, reliability and feature development are in constant tension. The product team wants to ship fast; the ops team wants to change nothing. Error budgets resolve this by making the trade-off explicit and data-driven.

When budget is available: Ship aggressively. Deploy frequently. Take calculated risks. The SLO gives you permission to be imperfect.

When budget is low or exhausted: Slow down. Focus on reliability. Fix flaky tests. Improve monitoring. Reduce deployment risk.

This isn’t a punishment. It’s a rational response to data. The budget is spent, so the investment shifts from features to reliability until it recovers.

How Google Uses Error Budgets

Google’s SRE teams operate with these principles:

1. SLOs are the source of truth

SLOs are not uptime targets pulled from thin air. They are chosen deliberately, based on user expectations and business needs.

2. The budget belongs to the product team

The product team “spends” the error budget by deploying changes that might cause instability. They own the budget, so they own the decision of when to spend it.

3. When the budget is exhausted, SRE takes over

If the error budget is consumed, SRE can freeze non-critical deployments, require more stringent testing, or mandate reliability improvements before new features ship.

4. Unspent budget is a signal

If you never consume your error budget, your SLO might be too loose, or you're not shipping fast enough. A healthy error budget is one that gets partially consumed regularly.

Calculating Burn Rate

Burn rate measures how fast you’re consuming your error budget relative to what’s sustainable.

Burn Rate = Actual error rate / Allowed error rate

Burn rate = 1x → Consuming budget at exactly the sustainable rate. You'll exhaust it at the end of the window.
Burn rate = 10x → Consuming budget 10 times faster than is sustainable. It will be exhausted in 1/10th of the window (3 days of a 30-day window).
Burn rate = 0.5x → Under budget. Half the budget will be left at the end of the window, so you have room for more risk.
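As a rough sketch of the formula and the three cases above, the following Python computes a burn rate and the time until the budget runs out. The names `burn_rate` and `hours_to_exhaustion` are hypothetical, and the sketch assumes the burn rate stays constant for the rest of the window.

```python
def burn_rate(actual_error_rate: float, slo_target: float) -> float:
    """Burn rate = actual error rate / allowed error rate (1 - SLO target)."""
    allowed_error_rate = 1.0 - slo_target
    return actual_error_rate / allowed_error_rate


def hours_to_exhaustion(rate: float, window_days: int = 30,
                        budget_remaining: float = 1.0) -> float:
    """Hours until the budget hits zero if the burn rate holds steady.

    budget_remaining is the fraction of the budget still unspent (1.0 = all).
    """
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never exhausted
    return budget_remaining * window_days * 24 / rate


if __name__ == "__main__":
    rate = burn_rate(actual_error_rate=0.01, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"hours to exhaustion: {hours_to_exhaustion(rate):.0f}")
```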

Multi-Window Alerting

Google's SRE Workbook recommends alerting on more than one burn-rate window at the same time. For a 30-day SLO window, two commonly used paging thresholds are:

Alert	Burn Rate	Window	Catches
Fast burn	14.4x	1 hour	Sudden spikes, major incidents
Slow burn	6x	6 hours	Gradual degradation, slow leaks

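This combination catches both sudden outages and slow-moving problems that never look alarming at any one moment but steadily drain your budget. Over a 30-day window, a 14.4x burn sustained for 1 hour uses 2% of the budget, and a 6x burn sustained for 6 hours uses 5%.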
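A minimal sketch of the two alerts in the table might look like the following. It assumes you can already query the observed error rate over the last 1 hour and the last 6 hours from your monitoring system; the `BurnAlert` class and `firing_alerts` function are hypothetical names for illustration, not an API from any monitoring tool.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class BurnAlert:
    name: str
    threshold: float   # burn rate at or above which the alert fires
    window_hours: int  # lookback window for the observed error rate


# Thresholds taken from the table above (30-day SLO window assumed).
ALERTS = (
    BurnAlert("fast burn", 14.4, 1),
    BurnAlert("slow burn", 6.0, 6),
)


def firing_alerts(error_rate_by_window: Mapping[int, float],
                  slo_target: float) -> List[str]:
    """Return the names of alerts whose burn-rate threshold is met.

    error_rate_by_window maps a window length in hours to the error rate
    observed over that window, e.g. {1: 0.02, 6: 0.004}.
    """
    allowed_error_rate = 1.0 - slo_target
    firing = []
    for alert in ALERTS:
        observed = error_rate_by_window[alert.window_hours]
        if observed / allowed_error_rate >= alert.threshold:
            firing.append(alert.name)
    return firing


if __name__ == "__main__":
    # A sharp spike: 2% errors in the last hour against a 99.9% SLO.
    print(firing_alerts({1: 0.02, 6: 0.004}, slo_target=0.999))
```

Evaluating each alert against its own window is what lets a short spike page immediately while a modest but persistent error rate still gets caught a few hours later.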

Setting Your SLO

Your SLO should balance user expectations with engineering cost:

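Measure current reliability: Track your actual availability for 2-4 weeks before setting a target
Understand user tolerance: For most SaaS products, 99.9% is acceptable. Users tend to notice 99% (about 7.2 hours of downtime per 30 days) but tolerate 99.9% (about 43 minutes)
Consider dependencies: A service generally can't be more reliable than the hard dependencies it calls on every request. If your database offers 99.9%, your service can't credibly promise 99.99% unless it can ride out database outages (for example, through caching or redundancy)
Start conservative: It's easier to tighten an SLO later than to loosen one that users have come to rely on. Start at 99.9% and raise the target only if users need it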
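The dependency point can be made concrete with a small calculation. This sketch assumes every dependency is required on each request and that failures are independent, which is a simplification; under those assumptions the best-case availability is the product of the individual availabilities. The function name `serial_availability` is illustrative.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Upper bound on availability when every component must be up.

    Assumes independent failures and no caching, retries, or fallbacks.
    """
    return math.prod(availabilities)


if __name__ == "__main__":
    # Hypothetical stack: your own code, a database, and an auth service.
    stack = [0.9995, 0.999, 0.999]
    print(f"best-case availability: {serial_availability(stack):.4%}")
```

Even when each component meets 99.9% or better, the combined figure lands below 99.9%, so the SLO has to account for the whole request path.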

Error Budget Policies

Document what happens at different budget levels:

Budget Remaining	Policy
>50%	Ship freely. Normal deployment pace
25-50%	Increased scrutiny on risky changes. Require rollback plans
10-25%	Slow deployments. Focus on reliability improvements
<10%	Freeze non-critical deployments. SRE-approved changes only
0% (exhausted)	All engineering effort on reliability until budget recovers

The specific thresholds and policies should be agreed upon by product and engineering leadership before they’re needed. Negotiating during a crisis leads to bad decisions.
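Writing the policy down as code is one way to keep it unambiguous. The sketch below encodes the table above as a lookup. The function name `policy_for` is hypothetical, and the handling of exact boundaries (for example, treating exactly 25% as the 25-50% tier) is an assumption, since the table does not specify it.

```python
def policy_for(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining (0.0 to 1.0) to a policy.

    Boundary handling is an assumption: exactly 50% and 25% fall in the
    25-50% tier, and exactly 10% falls in the 10-25% tier.
    """
    if budget_remaining <= 0.0:
        return "All engineering effort on reliability until budget recovers"
    if budget_remaining < 0.10:
        return "Freeze non-critical deployments. SRE-approved changes only"
    if budget_remaining < 0.25:
        return "Slow deployments. Focus on reliability improvements"
    if budget_remaining <= 0.50:
        return "Increased scrutiny on risky changes. Require rollback plans"
    return "Ship freely. Normal deployment pace"


if __name__ == "__main__":
    for remaining in (0.80, 0.40, 0.15, 0.05, 0.0):
        print(f"{remaining:>4.0%} remaining: {policy_for(remaining)}")
```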

Common Mistakes

Setting SLOs too high — A 99.99% SLO for a CRUD app is unrealistic and prevents any meaningful feature work
No consequences for exhaustion — An error budget without enforcement is just a number
Measuring the wrong SLI — Uptime alone misses latency degradation and partial failures
Monthly windows only — Consider rolling windows (last 30 days) instead of calendar months to avoid month-boundary gaming
Not reviewing SLOs — Review quarterly. Adjust based on user feedback and business needs

Getting Started

You don’t need to be Google to use error budgets:

Define one SLI (e.g., “homepage returns 200 in under 2 seconds”)
Set an SLO (e.g., “99.9% over 30 days”)
Calculate the budget (0.1% of 30 days is about 43 minutes)
Monitor it often enough that a single outage can't consume most of the budget before you notice
Track consumption and adjust deployment pace accordingly
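The steps above can be sketched end to end for a request-based SLI. This example uses the "homepage returns 200 in under 2 seconds" SLI and the 99.9% target from the list; the `Request` record and the `budget_consumed` function are hypothetical stand-ins for whatever your logs or metrics system provides.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    status: int
    latency_seconds: float


def is_good(request: Request) -> bool:
    """SLI from the example: the homepage returns 200 in under 2 seconds."""
    return request.status == 200 and request.latency_seconds < 2.0


def budget_consumed(requests: Iterable[Request], slo_target: float = 0.999) -> float:
    """Fraction of the error budget used by the given requests.

    1.0 means the budget is exactly spent; values above 1.0 mean the SLO
    has been violated for this set of requests.
    """
    reqs = list(requests)
    if not reqs:
        return 0.0  # no traffic, so nothing has been spent
    bad = sum(1 for r in reqs if not is_good(r))
    allowed_bad = (1.0 - slo_target) * len(reqs)
    return bad / allowed_bad


if __name__ == "__main__":
    # Synthetic traffic: 9,995 good requests, 3 server errors, 2 slow responses.
    traffic = (
        [Request(200, 0.3)] * 9_995
        + [Request(500, 0.1)] * 3
        + [Request(200, 4.2)] * 2
    )
    print(f"budget consumed: {budget_consumed(traffic):.0%}")
```

Note that the slow responses count against the budget even though they returned 200, which is the reason to define the SLI on more than status codes.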

The error budget calculator handles the math. The hard part is the organizational discipline to actually follow through.

Related tools:

Error Budget Calculator — Calculate budget, burn rate, and exhaustion time
Uptime Calculator — Convert SLO targets to downtime allowances
Downtime Cost Calculator — Quantify the business impact when budgets are exceeded