The SRE guide to error budgeting that survives Monday.
Every team we audit has SLOs. About 60% have error budgets. Almost none have a budget anyone is paid to spend. That's the fault line.
An error budget is only useful if it can change behaviour. If the budget burns and nothing happens, the budget is decoration. We've built error-budget policies into 14 Foundational engagements; this post is the residue of what survives a year.
Why most error-budget policies fail
Three failure modes, in rough order of frequency:
- The SLO doesn't measure user experience. 99.9% on a CPU metric tells you nothing about whether checkout works.
- The owner can't spend the budget. If the platform team owns the SLO but the product team controls deploy velocity, no behaviour will change when the budget burns.
- The cadence is wrong. Quarterly review of a monthly budget is theatre. The budget has already been spent or rolled over by the time anyone looks.
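To make the first failure mode concrete: a journey-based SLI counts attempts at a customer journey and the share that succeeded, rather than sampling an infrastructure metric. The following is a minimal Python sketch, not a production pipeline. The `JourneyEvent` type, its fields, and the `journey_sli` function are hypothetical names introduced for illustration, and the event counts are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JourneyEvent:
    """One user attempt at a journey (hypothetical event shape)."""
    journey: str     # e.g. "checkout-complete"
    succeeded: bool  # True if the user finished the journey without an error


def journey_sli(events: List[JourneyEvent], journey: str) -> Optional[float]:
    """Return the fraction of attempts at `journey` that succeeded.

    Returns None when there were no attempts, so that an idle period
    is not reported as either 0% or 100%.
    """
    attempts = [e for e in events if e.journey == journey]
    if not attempts:
        return None
    good = sum(1 for e in attempts if e.succeeded)
    return good / len(attempts)


# Invented sample: 2000 checkout attempts, one of which failed.
events = [JourneyEvent("checkout-complete", True)] * 1999
events.append(JourneyEvent("checkout-complete", False))

sli = journey_sli(events, "checkout-complete")
print(f"{sli:.4%}")  # 99.9500%
```

The point of the shape is that the denominator is user attempts, so a healthy CPU graph cannot make the number look good while checkout is failing.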
Treat the budget as a contract
The shift that makes budgets stick is treating them as a contract between SRE and product engineering, not as an SRE-side measurement. The contract has three clauses:
- What the SLO measures (a customer journey, not infra).
- Who can spend the budget (the team that controls deploy velocity).
- What happens at 50%, 75%, and 100% burn (this is the part everyone skips).
The third clause is the one that determines whether the policy is real.
A working error-budget policy
The template we ship in Foundational engagements, abbreviated:
# Error-budget policy · checkout-svc
slo: 99.95% · 30d · journey=checkout-complete
budget: 21.6 minutes / 30d
owner: product-checkout team
escalation:
  50%: notify owner team
  75%: enforce launch-freeze on non-fix changes
  100%: postmortem before next deploy
review: weekly · 30 min · owner + SRE-on-call
The launch-freeze at 75% is the load-bearing clause. It's also the most controversial, and the one product leaders will try to negotiate away during the policy review. If they succeed, the budget is decoration again.
A budget you cannot spend is a budget you do not have.
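The arithmetic behind the template is small enough to check by hand: a 99.95% SLO over 30 days leaves 0.05% of 43,200 minutes, which is 21.6 minutes. A rough Python sketch of that calculation and of the escalation lookup follows. The `Policy` class and its method names are assumptions made for illustration, and the sketch assumes the budget is tracked as "bad minutes", as the template implies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Policy:
    """Hypothetical in-code mirror of the policy template."""
    slo: float        # target success ratio, e.g. 0.9995
    window_days: int  # rolling SLO window in days
    # (burn threshold, action) pairs copied from the escalation clause
    escalation: Tuple[Tuple[float, str], ...] = (
        (0.50, "notify owner team"),
        (0.75, "enforce launch-freeze on non-fix changes"),
        (1.00, "postmortem before next deploy"),
    )

    def budget_minutes(self) -> float:
        """Minutes of failure the SLO allows over the window."""
        return (1.0 - self.slo) * self.window_days * 24 * 60

    def actions_due(self, bad_minutes: float) -> List[str]:
        """Every escalation step whose threshold has been reached."""
        burned = bad_minutes / self.budget_minutes()
        return [action for threshold, action in self.escalation
                if burned >= threshold]


policy = Policy(slo=0.9995, window_days=30)
print(round(policy.budget_minutes(), 1))  # 21.6

# 17 bad minutes is roughly 79% of the budget: notify and freeze both apply.
for action in policy.actions_due(17.0):
    print(action)
```

Returning every step that has been reached, rather than only the highest one, is a design choice in this sketch: it keeps the 50% notification visible even once the freeze is in force.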
Burn-rate alerts: two windows, one page
The classic multi-window burn-rate pattern from Google's SRE Workbook still works: alert on fast burn (1h window, 14× rate) and slow burn (6h window, 6× rate), and page only on the fast burn. The slow burn opens a ticket that stays in the queue. A burn rate of 1× spends exactly the whole budget over the SLO window, so 14× sustained for an hour consumes about 2% of a 30-day budget.
In practice we adjust the multipliers per service tier. For a Tier-0 revenue path 14× is too lenient; for a Tier-2 internal tool it's too aggressive. The numbers below are typical defaults, but they should always be tuned against historical incident data, not picked from a blog post.
- Tier-0 · fast: 8× / 1h · slow: 4× / 6h · page on fast.
- Tier-1 · fast: 14× / 1h · slow: 6× / 6h · page on fast.
- Tier-2 · fast: 24× / 1h · slow: 12× / 6h · ticket only.
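The tier table above can be sketched as a small evaluator. This is an illustrative Python sketch under stated assumptions, not the alerting rules of any particular monitoring system: `TierRule`, `TIERS`, `burn_rate`, `evaluate`, and `budget_spent_at_threshold` are hypothetical names, the error ratios are invented, and a 30-day SLO window is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierRule:
    fast: float         # burn-rate multiplier for the 1h window
    slow: float         # burn-rate multiplier for the 6h window
    page_on_fast: bool  # False means the tier is ticket-only


# Defaults copied from the tier list; tune them against incident history.
TIERS = {
    "tier-0": TierRule(fast=8, slow=4, page_on_fast=True),
    "tier-1": TierRule(fast=14, slow=6, page_on_fast=True),
    "tier-2": TierRule(fast=24, slow=12, page_on_fast=False),
}


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo)


def evaluate(tier: str, slo: float, err_1h: float, err_6h: float) -> str:
    """Return "page", "ticket", or "ok" for one service."""
    rule = TIERS[tier]
    if burn_rate(err_1h, slo) >= rule.fast:
        return "page" if rule.page_on_fast else "ticket"
    if burn_rate(err_6h, slo) >= rule.slow:
        return "ticket"
    return "ok"


def budget_spent_at_threshold(multiplier: float, window_hours: float,
                              period_days: int = 30) -> float:
    """Fraction of the budget consumed if the rate holds for the window."""
    return multiplier * window_hours / (period_days * 24)


# Same invented traffic (about 10x burn over 1h, 7x over 6h), three tiers.
for tier in ("tier-0", "tier-1", "tier-2"):
    print(tier, evaluate(tier, slo=0.9995, err_1h=0.005, err_6h=0.0035))
# tier-0 page
# tier-1 ticket
# tier-2 ok

print(f"{budget_spent_at_threshold(14, 1):.1%}")  # 1.9%
```

Running `budget_spent_at_threshold` over candidate multipliers is one way to see what each threshold costs before it fires, which is a useful sanity check alongside the historical tuning.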
Review cadence
The review cadence is what keeps the policy alive. We default to:
- Weekly — owner team + SRE on-call. 30 min. What burned, what didn't.
- Monthly — engineering leadership. 60 min. Trend across all SLOs; any policy changes.
- Quarterly — exec review. 30 min. Budget envelope vs. business outcomes.
If you cannot defend the weekly meeting at week 12, the budget will not survive. The meeting is the policy.
Error budgeting is not a tooling problem. It's a contract-design problem with a tooling layer. We ship working policies as part of Foundational, and we'll happily review yours in a Diagnostic.