Field guide

The SRE guide to error budgeting that survives Monday.

Every team we audit has SLOs. About 60% have error budgets. Almost none have a budget anyone is paid to spend. That's the fault line.

17 March 2026 · Ken Tan · 8 min read

An error budget is only useful if it can change behaviour. If the budget burns and nothing happens, the budget is decoration. We've built error-budget policies into 14 Foundational engagements; this post is the residue of what survives a year.

Why most error-budget policies fail

Three failure modes, in rough order of frequency:

The SLO doesn't measure user experience. 99.9% on a CPU metric tells you nothing about whether checkout works.
The owner can't spend the budget. If the platform team owns the SLO but the product team controls deploy velocity, no behaviour will change when the budget burns.
The cadence is wrong. Quarterly review of a monthly budget is theatre. The budget has already been spent or rolled over by the time anyone looks.

Treat the budget as a contract

The shift that makes budgets stick is treating them as a contract between SRE and product engineering, not as an SRE-side measurement. The contract has three clauses:

What the SLO measures (a customer journey, not infra).
Who can spend the budget (the team that controls deploy velocity).
What happens at 50%, 75%, 100% burn (this is the part everyone skips).

The third clause is the one that determines whether the policy is real.

A working error-budget policy

The template we ship in Foundational engagements, abbreviated:

# Error-budget policy · checkout-svc
slo: 99.95% · 30d · journey=checkout-complete
budget: 21.6 minutes / 30d
owner: product-checkout team
escalation:
  50%: notify owner team
  75%: enforce launch-freeze on non-fix changes
  100%: postmortem before next deploy
review: weekly · 30 min · owner + SRE-on-call
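The 21.6-minute figure in the template follows directly from the SLO target and the window. A minimal sketch of that arithmetic, assuming a time-based reading of the budget (the function name is ours, for illustration only):

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of 'bad' time an SLO allows over its window (time-based reading)."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60            # 30d -> 43,200 minutes
    return window_minutes * (100 - slo_percent) / 100  # the share allowed to fail


budget = error_budget_minutes(99.95, 30)
print(f"{budget:.1f} minutes / 30d")  # 21.6 minutes / 30d
```

If the SLO is counted in requests rather than minutes, the same ratio applies to the request count instead of the window length.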

The launch-freeze at 75% is the load-bearing clause. It's also the most controversial, and the one product leaders will try to negotiate away during the policy review. If they succeed, the budget is decoration again.
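The escalation clauses are simple enough to encode as a deploy gate. The sketch below is one possible reading, not the tooling we ship: the thresholds and actions come from the template, while the function names, the `is_fix` flag, and the assumption that the clauses stack (the freeze still applies past 100%) are ours.

```python
# Thresholds and actions from the policy template, ordered highest first.
ESCALATION = [
    (1.00, "postmortem before next deploy"),
    (0.75, "enforce launch-freeze on non-fix changes"),
    (0.50, "notify owner team"),
]


def required_action(consumed: float):
    """Return the action for the highest threshold crossed, or None below 50%."""
    for threshold, action in ESCALATION:
        if consumed >= threshold:
            return action
    return None


def deploy_allowed(consumed: float, is_fix: bool, postmortem_done: bool = False) -> bool:
    """Gate a deploy on budget consumed (0.0 to 1.0+). Assumes the clauses stack."""
    if consumed >= 1.00 and not postmortem_done:
        return False  # 100%: nothing ships until the postmortem is done
    if consumed >= 0.75 and not is_fix:
        return False  # 75%: launch-freeze, only fixes ship
    return True


print(required_action(0.80))                # enforce launch-freeze on non-fix changes
print(deploy_allowed(0.80, is_fix=False))   # False
print(deploy_allowed(0.80, is_fix=True))    # True
```

Wiring a check like this into the deploy pipeline is what makes the 75% clause hard to negotiate away quietly: changing it requires a visible change to the policy.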

A budget you cannot spend is a budget you do not have.

Burn-rate alerts: two windows, one page

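The classic multiwindow pattern from the Google SRE Workbook still works, with one change: alert on fast burn (1h window, 14× rate; the Workbook uses 14.4×) and slow burn (6h window, 6× rate), but page only on the fast burn. The Workbook pages on both; we turn the slow burn into a ticket that stays in the queue.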
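It helps to see what those multipliers mean in budget terms. Burn rate is the observed error ratio divided by the ratio the SLO allows, so a burn rate of 1 spends exactly the whole budget over the SLO period. A small sketch, assuming a 30-day period and constant burn (the helper names are illustrative):

```python
SLO_PERIOD_HOURS = 30 * 24  # 720 hours in a 30-day SLO period


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio relative to the allowed ratio (1 - SLO target)."""
    return error_ratio / (1 - slo_target)


def budget_fraction_consumed(rate: float, window_hours: float,
                             period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Share of the period's budget spent by burning at `rate` for `window_hours`."""
    return rate * window_hours / period_hours


def hours_to_exhaustion(rate: float, period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return period_hours / rate


# Fast burn: 14x for 1h spends about 1.9% of the budget; gone in about 51 hours.
print(f"{budget_fraction_consumed(14, 1):.1%}, {hours_to_exhaustion(14):.0f}h")
# Slow burn: 6x for 6h spends 5% of the budget; gone in 120 hours (5 days).
print(f"{budget_fraction_consumed(6, 6):.1%}, {hours_to_exhaustion(6):.0f}h")
```

On a 99.95% SLO, a 14× burn corresponds to an error ratio of about 0.7%, which is why the fast window catches outages and the slow window catches degradations.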

In practice we adjust the multipliers per service tier. For a Tier-0 revenue path a 14× fast-burn threshold is too lenient; for a Tier-2 internal tool it's too aggressive. The numbers below are typical defaults, but they should always be tuned against historical incident data, not picked from a blog post.

Tier-0 · fast: 8× / 1h · slow: 4× / 6h · page on fast.
Tier-1 · fast: 14× / 1h · slow: 6× / 6h · page on fast.
Tier-2 · fast: 24× / 1h · slow: 12× / 6h · ticket only.
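The tier defaults above translate into a small lookup plus a decision function. This is a sketch under assumptions: the tier keys, the `classify` name, and the idea that burn rates for both windows arrive precomputed from your metrics system are all ours, and the thresholds are the untuned defaults from the list.

```python
# Default thresholds per tier, taken from the list above; tune before use.
TIERS = {
    "tier-0": {"fast": 8, "slow": 4, "page_on_fast": True},
    "tier-1": {"fast": 14, "slow": 6, "page_on_fast": True},
    "tier-2": {"fast": 24, "slow": 12, "page_on_fast": False},  # ticket only
}


def classify(tier: str, burn_1h: float, burn_6h: float) -> str:
    """Return 'page', 'ticket', or 'none' for the given 1h and 6h burn rates."""
    thresholds = TIERS[tier]
    if burn_1h >= thresholds["fast"]:
        return "page" if thresholds["page_on_fast"] else "ticket"
    if burn_6h >= thresholds["slow"]:
        return "ticket"
    return "none"


print(classify("tier-0", burn_1h=9, burn_6h=2))    # page
print(classify("tier-1", burn_1h=9, burn_6h=7))    # ticket
print(classify("tier-2", burn_1h=30, burn_6h=30))  # ticket
```

The same 9× fast burn pages for Tier-0 and stays silent on the fast window for Tier-1, which is the point of tiering the multipliers.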

Review cadence

The review cadence is what keeps the policy alive. We default to:

Weekly — owner team + SRE on-call. 30 min. What burned, what didn't.
Monthly — engineering leadership. 60 min. Trend across all SLOs; any policy changes.
Quarterly — exec review. 30 min. Budget envelope vs. business outcomes.

If you cannot defend the weekly meeting at week 12, the budget will not survive. The meeting is the policy.

Error budgeting is not a tooling problem. It's a contract-design problem with a tooling layer. We ship working policies as part of Foundational, and we'll happily review yours in a Diagnostic.
