What is an Error Budget?
An error budget is the maximum amount of unreliability your service can accumulate without violating its Service Level Objective (SLO). If your SLO is 99.9% availability, your error budget is 0.1%, which translates to roughly 43 minutes of downtime in a 30-day month (0.1% of 43,200 minutes is 43.2 minutes). The concept was popularized by Google's Site Reliability Engineering (SRE) practices as a way to balance reliability with development velocity.
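The downtime figure above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a 30-day measurement window; the function name `allowed_downtime` is made up for illustration and does not come from any monitoring library.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO permits over a window.

    `slo` is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # Error budget = 1 - SLO, applied to the length of the window.
    return window * (1.0 - slo)


# Assumption: a 30-day month is used as the measurement window.
monthly = allowed_downtime(0.999, timedelta(days=30))
print(f"{monthly.total_seconds() / 60:.1f} minutes")
```

Changing the SLO to 0.9999 shows how quickly the budget shrinks as another nine is added.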
How Error Budgets Work
The formula is simple: Error Budget = 100% - SLO target. For a 99.9% SLO, your error budget is 0.1% of total time in the measurement window (or 0.1% of requests, if your SLI counts requests rather than time). While budget remains, teams can ship features aggressively. When the budget is low or exhausted, the team shifts focus to reliability work, bug fixes, and operational improvements.
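To make "remaining capacity" concrete, here is a small sketch that computes how much of the budget is left. It assumes a request-based SLI (good requests divided by total requests) rather than the time-based view above; the function name and the request counts are hypothetical.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("a 100% SLO or an empty window leaves no budget to track")
    return 1.0 - failed_requests / allowed_failures


# Hypothetical numbers: 1,000,000 requests so far in the window, 250 failed.
# A 99.9% SLO allows about 1,000 failures, so roughly 75% of the budget is left.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```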
Error Budget Burn Rate
Burn rate measures how fast you're consuming your error budget relative to what's sustainable. A burn rate of 1x means you'll exactly exhaust your budget by the end of the window. A burn rate of 10x means you'll exhaust it in 1/10th of the window. Google's SRE Workbook recommends multi-window, multi-burn-rate alerting. For a 30-day SLO window: alert on a 14.4x burn rate over 1 hour (fast burn, which consumes 2% of the budget) and a 6x burn rate over 6 hours (slow burn, which consumes 5% of the budget).
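The arithmetic behind those thresholds can be sketched as follows. The assumptions here are a 30-day (720-hour) SLO window and an error rate expressed as a fraction of failed requests; the function names are illustrative, not part of any alerting tool.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo)


def budget_consumed(rate: float, alert_hours: float, window_hours: float = 30 * 24) -> float:
    """Fraction of the whole budget spent if `rate` is sustained for `alert_hours`."""
    return rate * alert_hours / window_hours


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return window_hours / rate


# With a 99.9% SLO, a 1.44% error rate is a burn rate of about 14.4x.
print(burn_rate(0.0144, 0.999))
# Fraction of the budget spent by each alert condition from the text.
print(budget_consumed(14.4, alert_hours=1))
print(budget_consumed(6.0, alert_hours=6))
# At 10x, a 720-hour budget lasts one tenth of the window.
print(hours_to_exhaustion(10.0))
```

Working backwards is equally useful: choosing how much budget an alert may consume before it fires, and over what period, determines the burn-rate threshold.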
SLO vs SLA vs SLI
These three concepts build on one another. SLI (Service Level Indicator) is what you measure (e.g., request success rate). SLO (Service Level Objective) is your internal target for that indicator (e.g., 99.9% success rate). SLA (Service Level Agreement) is the contractual commitment to customers, typically set looser than your SLO (e.g., 99.5%) so that you have a buffer before a breach has contractual consequences. Your error budget is derived from your SLO, not your SLA.
Implementing Error Budgets
Start by defining your SLI (usually availability or latency), set an SLO that balances user needs with engineering cost, then calculate the error budget. Track consumption in real time using monitoring tools. The key policy decision: what happens when the budget is exhausted? A common policy is to freeze non-critical deployments and redirect engineering effort to reliability improvements until the budget recovers.
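A budget policy is easiest to enforce when it is written down as an explicit rule. The sketch below is one possible shape for such a rule, not a standard: the `Action` names and the 25% "low budget" threshold are hypothetical and should be replaced by whatever your team agrees on.

```python
from enum import Enum


class Action(Enum):
    SHIP = "ship features normally"
    CAUTION = "slow down and prioritize reliability work"
    FREEZE = "freeze non-critical deployments"


# Hypothetical threshold: treat the budget as "low" below 25% remaining.
LOW_BUDGET_THRESHOLD = 0.25


def release_policy(remaining: float) -> Action:
    """Map the unspent fraction of the error budget to a release decision."""
    if remaining <= 0.0:
        return Action.FREEZE
    if remaining < LOW_BUDGET_THRESHOLD:
        return Action.CAUTION
    return Action.SHIP


for remaining in (0.75, 0.10, -0.05):
    print(f"{remaining:+.0%} of budget left: {release_policy(remaining).value}")
```

Keeping the rule in code, or in a config file checked by the deployment pipeline, means the freeze does not have to be renegotiated each time the budget runs out.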