Guide · Alerting

Alert on how fast the budget burns. Not on the instantaneous error rate.

The two-window, two-burn-rate pattern is what you reach for when threshold alerts have stopped earning trust. Here's the math, the trade-offs, and a working implementation.

9 min read · v1.0

Threshold alerts fire late and produce noise. A brief 2% error spike that resolves in ten minutes is largely harmless. A sustained 1% error rate over thirty days burns through a 99.9% SLO entirely. A static threshold can't tell those two situations apart.

Burn-rate alerting fixes that. Instead of alerting on the instantaneous error rate, you alert on the rate at which the error budget is being consumed, and you fire only when that rate is fast enough to exhaust the budget before the team can respond.

What "burn rate" actually means

A burn rate of 1 means consuming the error budget at exactly the SLO-permitted rate. At burn rate 1, your budget lasts exactly the measurement window, by design.

A burn rate of 2 means consuming the budget twice as fast as allowed. A burn rate of 14.4 means consuming it 14.4× faster than allowed, which exhausts a 30-day budget in roughly 50 hours.
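The arithmetic behind that figure, for a 30-day (720-hour) window:

time_to_exhaustion = window / burn_rate
                   = 720 h / 14.4
                   ≈ 50 h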

The point: burn rate is a velocity metric. It tells you not just whether something is wrong, but whether it's wrong fast enough to matter before someone notices.

The two-window, two-burn-rate pattern

The pattern Tracefox uses on every engagement (it's also the Google SRE Workbook default) combines two alerts:

Alert       Burn rate   Window    Severity              Time to budget exhaustion
Fast burn   14.4×       1 hour    P1 · page on-call     ~2 days
Slow burn   6×          6 hours   P2 · alert channel    ~5 days

Two alerts, two windows. The fast-burn alert pages the on-call when something is going wrong fast enough that you have hours, not days. The slow-burn alert nudges the team early enough to fix things before fast-burn fires.

Why these specific numbers?

14.4× over 1 hour exhausts a 30-day budget in roughly two days: fast enough to warrant a page, while the 1-hour window is long enough to separate a genuine incident from a 30-second blip. 6× over 6 hours exhausts it in ~5 days: slow enough that nobody needs to wake up, fast enough that it has to be fixed this sprint.
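Running the slow-burn row through the same formula confirms the table:

time_to_exhaustion = 720 h / 6 = 120 h = 5 days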

Computing the error-rate threshold for your SLO

Given an SLO target, the error rate at burn rate B is:

error_rate_threshold = (1 - SLO_target) × B

For a 99.9% SLO (allowed error rate 0.1%):

  • Fast burn (14.4×): error rate > 1.44% over 1hr → page
  • Slow burn (6×): error rate > 0.6% over 6hr → alert

For a 99.99% SLO (allowed error rate 0.01%):

  • Fast burn: error rate > 0.144% over 1hr → page
  • Slow burn: error rate > 0.06% over 6hr → alert

The SLO calculator works these out for any target and window.
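The formula also inverts neatly on a dashboard: dividing the observed error ratio by the allowance gives the current burn rate, which you can graph against the 14.4 and 6 thresholds. A one-line sketch in PromQL, assuming the error_ratio_1h recording rule used in the next section:

# Current burn rate against a 99.9% SLO — compare against 14.4 and 6
error_ratio_1h{service="checkout-api"} / (1 - 0.999)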

What it looks like in PromQL

Assuming you have recording rules that compute error_ratio_5m, error_ratio_30m, error_ratio_1h, and error_ratio_6h as the error ratio over 5-minute, 30-minute, 1-hour, and 6-hour windows respectively (a sketch of these rules follows the alerts), a 99.9% SLO translates to:

# Fast burn — pages on-call (P1)
- alert: HighErrorBudgetBurnFast
  expr: |
    error_ratio_5m{service="checkout-api"} > (14.4 * 0.001)
    and
    error_ratio_1h{service="checkout-api"} > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "checkout-api is burning error budget > 14.4×"
    runbook: "https://runbooks/checkout-api/error-budget-burn"

# Slow burn — alerts the channel (P2)
- alert: HighErrorBudgetBurnSlow
  expr: |
    error_ratio_30m{service="checkout-api"} > (6 * 0.001)
    and
    error_ratio_6h{service="checkout-api"} > (6 * 0.001)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "checkout-api is burning error budget > 6×"
    runbook: "https://runbooks/checkout-api/error-budget-burn"

Two windows on each alert (one short, one long). The short window catches the sudden onset; the long window confirms it's not a 90-second blip. Both must breach for the alert to fire.
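The error_ratio_* recording rules are assumed above rather than shown. Here is a minimal sketch of what they might look like, assuming a standard http_requests_total counter with service and code labels; adapt the selectors to your own metrics:

groups:
- name: error-ratio-recording
  rules:
  # 5xx responses as a fraction of all responses, per service.
  - record: error_ratio_5m
    expr: |
      sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m]))
  - record: error_ratio_30m
    expr: |
      sum by (service) (rate(http_requests_total{code=~"5.."}[30m]))
      / sum by (service) (rate(http_requests_total[30m]))
  - record: error_ratio_1h
    expr: |
      sum by (service) (rate(http_requests_total{code=~"5.."}[1h]))
      / sum by (service) (rate(http_requests_total[1h]))
  - record: error_ratio_6h
    expr: |
      sum by (service) (rate(http_requests_total{code=~"5.."}[6h]))
      / sum by (service) (rate(http_requests_total[6h]))

Recording the ratios once keeps the alert expressions short and guarantees dashboards and alerts agree on the same definition of "error rate".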

Common mistakes

Picking thresholds without a measurement window

"Alert when error rate > 1%" is meaningless without saying over what window. 1% over 30 seconds is noise; 1% over 6 hours is a P1. Burn-rate alerts always specify both the rate and the window.

Single-window alerts

A burn-rate alert with only one window will fire constantly during minor spikes. The two-window pattern (short + long) provides the noise filter.

Treating burn-rate alerts as informational

A fast-burn alert pages the on-call. If your fast-burn alert routes to Slack with no acknowledgement requirement, you don't have a fast-burn alert; you have a notification. They are not the same thing.

No runbook

Every burn-rate alert must have a linked runbook. The on-call woken up at 3am needs to know what "checkout-api is burning budget" means and what to check first. Alerts without runbooks get muted.

Where to start

Pick your most critical service. Define one SLO. Implement fast and slow burn alerts for that one SLO. Run them for two weeks before adding more. The hardest part of burn-rate alerting isn't the math; it's the policy that decides what happens when the alert fires. Build that next.

Engagement.start()

Most teams have alerts that fire late and alerts that fire on noise. Often the same alert.

The Tracefox assessment includes an alert hygiene audit on day one. We score severity, runbook coverage, and ownership across your active alert set, and identify which alerts to disable, which to retune, and which to replace with burn-rate alerts.