Error budgets are the core mechanism that makes Google’s Site Reliability Engineering (SRE) work. They answer the question every engineering team faces: “How much risk can we take when shipping changes?”
The Core Idea
An error budget is the maximum amount of unreliability your service can tolerate before you’ve violated your SLO. It’s calculated simply:
Error Budget = 1 - SLO target
If your SLO is 99.9% availability, your error budget is 0.1%. Over a 30-day month (43,200 minutes), that’s 43.2 minutes of allowed downtime.
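As a minimal sketch of that arithmetic, the following converts an SLO target into a budget fraction and a downtime allowance. The function names and the 30-day default are illustrative assumptions, not part of any particular tool.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of time (or requests) allowed to fail, e.g. 0.999 -> 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowance for a time-based availability SLO over the window."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return error_budget(slo_target) * window_minutes


print(round(allowed_downtime_minutes(0.999), 1))  # 43.2
```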
Use the error budget calculator to compute your specific budget and burn rate.
Why Error Budgets Work
Without error budgets, reliability and feature development are in constant tension. The product team wants to ship fast; the ops team wants to change nothing. Error budgets resolve this by making the trade-off explicit and data-driven.
When budget is available: Ship aggressively. Deploy frequently. Take calculated risks. The SLO gives you permission to be imperfect.
When budget is low or exhausted: Slow down. Focus on reliability. Fix flaky tests. Improve monitoring. Reduce deployment risk.
This isn’t a punishment. It’s a rational response to data. The budget is spent, so the investment shifts from features to reliability until it recovers.
How Google Uses Error Budgets
Google’s SRE teams operate with these principles:
1. SLOs are the source of truth
Not uptime targets pulled from thin air, but carefully chosen targets based on user expectations and business needs.
2. The budget belongs to the product team
The product team “spends” the error budget by deploying changes that might cause instability. They own the budget, so they own the decision of when to spend it.
3. When the budget is exhausted, SRE takes over
If the error budget is consumed, SRE can freeze non-critical deployments, require more stringent testing, or mandate reliability improvements before new features ship.
4. Unspent budget is a signal
If you never consume your error budget, your SLO may be too loose or you may be shipping too cautiously. A healthy error budget is one that gets partially consumed regularly.
Calculating Burn Rate
Burn rate measures how fast you’re consuming your error budget relative to what’s sustainable.
Burn Rate = Actual error rate / Allowed error rate
- Burn rate = 1x → Consuming budget at exactly the sustainable rate. You’ll exhaust it at the end of the window
- Burn rate = 10x → Consuming 10x too fast. Budget exhausted in 1/10th of the window
- Burn rate = 0.5x → Under budget. You have room for more risk
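The formula and the three cases above can be sketched as follows. This assumes the error rate stays constant and the window starts with a full budget; both function names are hypothetical.

```python
def burn_rate(actual_error_rate: float, slo_target: float) -> float:
    """Ratio of the observed error rate to the rate the SLO allows."""
    allowed_error_rate = 1.0 - slo_target
    return actual_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never exhausted
    return window_days / rate


rate = burn_rate(actual_error_rate=0.01, slo_target=0.999)
print(round(rate, 2))                      # 10.0
print(round(days_to_exhaustion(rate), 2))  # 3.0
```

A 1% error rate against a 99.9% SLO burns ten times too fast, so a 30-day budget lasts about three days.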
Multi-Window Alerting
Google recommends alerting on two burn rate windows simultaneously:
| Alert | Burn Rate | Window | Catches |
|---|---|---|---|
| Fast burn | 14.4x | 1 hour | Sudden spikes, major incidents |
| Slow burn | 6x | 6 hours | Gradual degradation, slow leaks |
This combination catches both sudden outages and slow-moving problems that individually don’t trigger alerts but collectively drain your budget.
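One way to read the thresholds in the table: for a 30-day SLO window, a 14.4x burn sustained for one hour spends 2% of the budget, and a 6x burn sustained for six hours spends 5%. The sketch below shows that arithmetic and a simple check of which alerts would fire. The function names and the input shape (a mapping from window length in hours to measured error rate) are assumptions for illustration, not a specific monitoring system's API.

```python
# Thresholds copied from the table above: alert window in hours -> burn rate.
THRESHOLDS = {1: 14.4, 6: 6.0}


def budget_fraction_spent(burn_rate, alert_window_hours, slo_window_days=30):
    """Share of the whole budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


def firing_windows(error_rate_by_window, slo_target):
    """Return the alert windows (in hours) whose burn rate meets its threshold."""
    allowed = 1.0 - slo_target
    return [
        hours
        for hours, threshold in THRESHOLDS.items()
        if error_rate_by_window.get(hours, 0.0) / allowed >= threshold
    ]


print(round(budget_fraction_spent(14.4, 1), 3))  # 0.02
print(round(budget_fraction_spent(6.0, 6), 3))   # 0.05
# 2% errors over the last hour (20x burn), 0.4% over the last six hours (4x burn)
print(firing_windows({1: 0.02, 6: 0.004}, slo_target=0.999))  # [1]
```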
Setting Your SLO
Your SLO should balance user expectations with engineering cost:
- Measure current reliability — Track your actual availability for 2-4 weeks before setting a target
- Understand user tolerance — For most SaaS products, 99.9% is acceptable. Users notice 99% but tolerate 99.9%
- Consider dependencies — Your SLO can’t exceed what your hard dependencies deliver. If your database offers 99.9%, your service can’t promise 99.99%
- Start conservative — It’s easier to tighten an SLO later than to loosen one that users have come to rely on. Start at 99.9% and tighten if needed
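To illustrate the dependency point, here is a rough sketch of how availability compounds across hard dependencies. It assumes every dependency is required for each request and that failures are independent, which real systems only approximate. The function name is hypothetical.

```python
from math import prod


def composite_availability(availabilities):
    """Upper bound on availability when every listed component must be up.

    Assumes independent failures and strictly serial (hard) dependencies.
    """
    return prod(availabilities)


# A service that is itself 99.95% available, backed by a 99.9% database
print(round(composite_availability([0.9995, 0.999]), 6))  # 0.998501
```

Under these assumptions the combined figure lands below the weakest component, which is why a 99.99% promise on top of a 99.9% database does not hold up.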
Error Budget Policies
Document what happens at different budget levels:
| Budget Remaining | Policy |
|---|---|
| >50% | Ship freely. Normal deployment pace |
| 25-50% | Increased scrutiny on risky changes. Require rollback plans |
| 10-25% | Slow deployments. Focus on reliability improvements |
| <10% | Freeze non-critical deployments. SRE-approved changes only |
| 0% (exhausted) | All engineering effort on reliability until budget recovers |
The specific thresholds and policies should be agreed upon by product and engineering leadership before they’re needed. Negotiating during a crisis leads to bad decisions.
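A policy table like this can be encoded so that dashboards or deploy tooling apply it consistently. The sketch below is one possible encoding. The table leaves the boundary values ambiguous, so assigning exactly 50% and 25% to the stricter-sounding or looser tier is an assumption made here, and the function name is hypothetical.

```python
def policy_for(budget_remaining_pct: float) -> str:
    """Map the remaining budget (as a percentage) to a policy from the table."""
    if budget_remaining_pct <= 0:
        return "All engineering effort on reliability until budget recovers"
    if budget_remaining_pct < 10:
        return "Freeze non-critical deployments. SRE-approved changes only"
    if budget_remaining_pct < 25:
        return "Slow deployments. Focus on reliability improvements"
    if budget_remaining_pct <= 50:
        # Assumption: exactly 25% and exactly 50% fall into this tier
        return "Increased scrutiny on risky changes. Require rollback plans"
    return "Ship freely. Normal deployment pace"


for pct in (80, 40, 15, 5, 0):
    print(pct, "->", policy_for(pct))
```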
Common Mistakes
- Setting SLOs too high — A 99.99% SLO leaves about 4.3 minutes of downtime per 30 days. For a CRUD app that is unrealistic and prevents any meaningful feature work
- No consequences for exhaustion — An error budget without enforcement is just a number
- Measuring the wrong SLI — Uptime alone misses latency degradation and partial failures
- Monthly windows only — Consider rolling windows (last 30 days) instead of calendar months to avoid month-boundary gaming
- Not reviewing SLOs — Review quarterly. Adjust based on user feedback and business needs
Getting Started
You don’t need to be Google to use error budgets:
- Define one SLI (e.g., “homepage returns 200 in under 2 seconds”)
- Set an SLO (e.g., “99.9% over 30 days”)
- Calculate the budget (0.1% ≈ 43 minutes per 30 days)
- Monitor it with appropriate frequency
- Track consumption and adjust deployment pace accordingly
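For a request-based SLI, the tracking step can be sketched from two counters: total requests and requests that met the SLI. The counts below are made-up illustration values, and the function name is hypothetical.

```python
def budget_remaining_pct(good_events: int, total_events: int, slo_target: float) -> float:
    """Percentage of the error budget still unspent in the current window.

    Goes negative once more bad events have occurred than the SLO allows.
    """
    if total_events == 0:
        return 100.0  # no traffic yet, nothing spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 100.0 * (1.0 - actual_bad / allowed_bad)


# 1,000,000 requests, 600 of them failed or too slow, against a 99.9% SLO
print(round(budget_remaining_pct(999_400, 1_000_000, 0.999), 1))  # 40.0
```

With 1,000 bad requests allowed and 600 already used, 40% of the budget remains, which the policy table above would treat as a reason for closer scrutiny of risky changes.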
The error budget calculator handles the math. The hard part is the organizational discipline to actually follow through.
Related tools:
- Error Budget Calculator — Calculate budget, burn rate, and exhaustion time
- Uptime Calculator — Convert SLO targets to downtime allowances
- Downtime Cost Calculator — Quantify the business impact when budgets are exceeded