Guide · SLO governance

An SLO without a policy is just a dashboard.

The error-budget policy is what turns reliability from an opinion into an operational currency. The five states, the decision-owners, and the template, written before the first P1, not during it.

10 min read · v1.0

The error budget is the amount of unreliability the SLO permits. It converts a reliability target into a concrete operational currency that engineering and product can reason about together. A 99.9% SLO over 30 days gives you 43.2 minutes of permitted downtime per month. That's the budget.
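The arithmetic behind that figure can be sketched in a few lines. This is a minimal illustration assuming a time-based availability SLO; the function names and parameters are hypothetical, not part of any Tracefox tooling.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of permitted unreliability for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining_percent(budget_minutes: float, downtime_minutes: float) -> float:
    """Share of the budget still unspent, floored at 0 once the SLO is breached."""
    if budget_minutes <= 0:
        raise ValueError("budget must be positive")
    return max(0.0, 100.0 * (budget_minutes - downtime_minutes) / budget_minutes)


budget = error_budget_minutes(99.9, 30)
print(round(budget, 1))  # 43.2
```

The second helper expresses downtime already spent as a percentage of budget remaining, which is the unit the policy thresholds are written in.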

The policy is what turns the budget into something operationally meaningful. It answers: when the budget is healthy, what posture is engineering in? When the budget is exhausted, what stops? Who decides? Without the policy, the budget is just a number on a dashboard nobody acts on.

A policy that exists only in engineering's head will not survive the first P1 where product is pressuring a release. Get it written down before that conversation, not during it.

The five budget states

The Tracefox standard policy uses five states, each with a defined posture, a required action, and a named decision-owner. This is the starting point we bring into engagements; adapt thresholds and owners, but keep the structure.

| State | Threshold | Posture | Required action | Owner |
|---|---|---|---|---|
| Healthy | > 50% remaining | Normal | Business as usual. Feature velocity unrestricted. | Engineering |
| Caution | 25–50% | Monitor | Increase alert sensitivity. Reliability risks reviewed in sprint planning. | Engineering Lead |
| Warning | 10–25% | Slow Down | Freeze non-critical feature work. Prioritise reliability fixes. | Eng Lead + Product |
| Critical | < 10% | Reliability Focus | All capacity to reliability. Production change freeze. Daily SLO review. | VP Engineering |
| Exhausted | 0% · breached | Incident | Treat as P1. Incident Commander engaged. Customer comms assessed. | IC + Leadership |
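The five states reduce to a small lookup that tooling can share. The sketch below is one possible encoding, not a Tracefox implementation. The names `BudgetState`, `POLICY`, and `classify` are hypothetical, and the handling of exact boundary values (50%, 25%, 10%) is an assumption, since the thresholds above do not say which side they fall on.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "Healthy"
    CAUTION = "Caution"
    WARNING = "Warning"
    CRITICAL = "Critical"
    EXHAUSTED = "Exhausted"


# Posture and decision-owner per state, copied from the policy table.
POLICY = {
    BudgetState.HEALTHY: ("Normal", "Engineering"),
    BudgetState.CAUTION: ("Monitor", "Engineering Lead"),
    BudgetState.WARNING: ("Slow Down", "Eng Lead + Product"),
    BudgetState.CRITICAL: ("Reliability Focus", "VP Engineering"),
    BudgetState.EXHAUSTED: ("Incident", "IC + Leadership"),
}


def classify(remaining_percent: float) -> BudgetState:
    """Map remaining error budget (0-100) to a policy state.

    Assumed boundary handling: exactly 50% and exactly 25% count as
    Caution, exactly 10% counts as Warning.
    """
    if remaining_percent <= 0:
        return BudgetState.EXHAUSTED
    if remaining_percent < 10:
        return BudgetState.CRITICAL
    if remaining_percent < 25:
        return BudgetState.WARNING
    if remaining_percent <= 50:
        return BudgetState.CAUTION
    return BudgetState.HEALTHY


state = classify(18.0)
posture, owner = POLICY[state]
print(state.value, posture, owner)  # Warning Slow Down Eng Lead + Product
```

Whichever way you resolve the boundaries, write the choice into the policy so the dashboard and the pipeline agree.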

The decision-owner is the load-bearing element

Most policies fail not because the thresholds are wrong, but because the decision-owner is unclear. When the budget hits 10%, who decides whether to slow down? When it hits 0%, who decides whether to halt the release that's already in canary?

The decision-owner field must name a role, not a team. "The platform team decides" is not a policy; it's an argument waiting to happen. "The VP of Engineering decides, in consultation with the CPO" is a policy, because disagreements now have a tie-breaker.

Ship-freeze mechanics that actually work

A policy that says "freeze non-critical feature work" without defining how the freeze is enforced will be ignored. The mechanics that survive real product pressure look like this:

  1. Budget state surfaced in the deploy pipeline. Every PR description shows the current budget state for the affected service. Every deploy to production for that service includes the state in the deploy log.
  2. Critical-state freezes block deploys at the pipeline. Not via Slack. Not via "engineering will be more careful." The CI job fails with "service in Critical reliability focus; override requires VP Eng sign-off."
  3. Override has friction. A manual VP-Eng-signed override is possible (it has to be; there are genuine emergencies), but every override is logged and reviewed in the next quarterly reliability review.

Friction is the feature, not the bug. A freeze that can be bypassed in two clicks isn't a freeze.
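The pipeline gate and logged override described above can be sketched as a single check. This is an illustration under stated assumptions, not a real CI plugin: `check_deploy`, `Override`, and the in-memory `audit_log` are hypothetical, and only the Critical state is treated as frozen because that is the only pipeline block the mechanics above specify.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple

FROZEN_STATES = {"Critical"}      # extend if your policy freezes other states
REQUIRED_ROLE = "VP Engineering"  # the role that may sign an override


@dataclass
class Override:
    signed_by_role: str
    reason: str


def check_deploy(service: str, state: str, override: Optional[Override],
                 audit_log: List[dict]) -> Tuple[bool, str]:
    """Return (allowed, message); every accepted override is appended to audit_log."""
    if state not in FROZEN_STATES:
        return True, f"{service}: budget state {state}; deploy allowed"
    if override is not None and override.signed_by_role == REQUIRED_ROLE:
        audit_log.append({
            "service": service,
            "state": state,
            "signed_by_role": override.signed_by_role,
            "reason": override.reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return True, f"{service}: deploy allowed under logged override"
    return False, (f"{service} in {state} reliability focus; "
                   "override requires VP Eng sign-off")
```

In a CI job, the wrapper would print the message and exit non-zero when `allowed` is false, and the audit log would live somewhere durable so the quarterly review can read it.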

A copy-paste template

Here's a starter you can drop into a runbook or wiki. Replace the bracketed fields with your service, owners, and threshold variations. The Blueprint (downloadable from /resources) includes the longer version with the multi-service pattern.

# Error Budget Policy — [service-name]

SLO: [99.9% availability over 30 days rolling]
Budget: [43.2 minutes / month]
Owner: [Eng Lead — name]
Product owner: [name]
Last reviewed: [YYYY-MM-DD]

## Budget states

### Healthy > 50% remaining
Posture: Normal
Action: Business as usual.
Decision: Engineering

### Caution 25–50%
Posture: Monitor
Action: Reliability risks raised in sprint planning.
        No new technical debt accepted.
Decision: Engineering Lead

### Warning 10–25%
Posture: Slow Down
Action: Non-critical feature freeze.
        Reliability fixes prioritised over feature work.
Decision: Engineering Lead + Product

### Critical < 10%
Posture: Reliability Focus
Action: Production deploy freeze (override = VP Eng sign-off).
        All eng capacity directed to reliability.
        Daily standup includes SLO trend.
Decision: VP Engineering

### Exhausted 0%
Posture: Incident
Action: Treat as P1. IC engaged.
        Customer-facing comms assessed within 1hr.
        Formal PIR within 5 working days.
Decision: IC + Leadership

## Override and review

- Override of any state action requires written sign-off
  by the named Decision owner.
- All overrides logged in [/incidents/budget-overrides].
- Reviewed quarterly. Threshold or owner changes go through
  the same sign-off as the original policy.

Signed: [Eng Lead] [VP Eng] [Product] [Date]

Common failure modes

The policy exists, but only engineering signed it

A budget policy that hasn't been signed by product (or whoever owns release pressure) won't hold during the first conflict. Get sign-off before the SLO goes live, not after.

The thresholds are too tight

A policy that triggers Warning state every two weeks loses meaning. If your actual reliability is materially worse than your SLO, fix that first: adjust the SLO or fix the service. Policies don't compensate for unrealistic targets.
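One way to tell whether the policy is firing too often is to count how many times a service entered a state over a period. The sketch below assumes you can export one state label per day; the function name and the input shape are hypothetical.

```python
from typing import Sequence


def count_entries(states: Sequence[str], target: str = "Warning") -> int:
    """Count transitions into `target` from any other state in a daily series."""
    entries = 0
    previous = None
    for state in states:
        if state == target and previous != target:
            entries += 1
        previous = state
    return entries


history = ["Healthy", "Warning", "Warning", "Caution", "Warning", "Healthy"]
print(count_entries(history))  # 2
```

A count that keeps climbing quarter after quarter is a prompt to revisit the SLO or the service, not the policy.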

The freeze never actually freezes

The first time the policy hits Critical and product still ships, the policy is dead. From that point on, it's theatre. Resist the temptation to make exceptions on the first incident; exceptions become precedent immediately.

Reviews don't happen

A policy must be reviewed at a regular cadence (quarterly is the default) so that threshold changes, ownership changes, and SLO retargeting are explicit decisions rather than silent drift. Calendar the review. Skip it once and you'll skip it again.
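A scheduled check against the template's "Last reviewed" field makes a missed review visible. This is a minimal sketch: the function name is hypothetical, and treating "quarterly" as 90 days is an assumption.

```python
from datetime import date, timedelta


def review_overdue(last_reviewed: str, today: date, cadence_days: int = 90) -> bool:
    """True when the policy's last review (YYYY-MM-DD) is older than the cadence."""
    reviewed = date.fromisoformat(last_reviewed)
    return today - reviewed > timedelta(days=cadence_days)


print(review_overdue("2024-01-01", date(2024, 6, 1)))  # True
```

Run it wherever the policy file lives, and have it notify the named owner rather than a shared channel.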

Where to start

Pick the service with the most reliability friction: usually the one where engineering and product have most often disagreed about velocity. Draft the policy from the template above. Get it signed before the SLO goes live. Iterate quarterly.

The full Blueprint at /resources includes the multi-service version of this policy plus the assessment criteria we use to score governance maturity.

Most policies fail because they were never agreed with product.

Tracefox engagements include facilitating the sign-off conversation between engineering and product leadership. The policy is the artefact; the agreement is the work.