Guide · Alerting

Alert on how fast the budget burns. Not on the instantaneous error rate.

The two-window, two-burn-rate pattern is what you reach for when threshold alerts have stopped earning trust. Here's the math, the trade-offs, and a working implementation.

9 min read · v1.0

Threshold alerts fire late and produce noise. A brief 2% error spike that resolves in ten minutes is largely harmless. A sustained 1% error rate over thirty days burns through a 99.9% SLO entirely. A static threshold can't tell those two situations apart.

Burn-rate alerting fixes that. Instead of alerting on the instantaneous error rate, you alert on the rate at which the error budget is being consumed, and you fire only when that rate is fast enough to exhaust the budget before the team can respond.

What "burn rate" actually means

A burn rate of 1 means consuming the error budget at exactly the SLO-permitted rate. At burn rate 1, your budget lasts exactly the measurement window, by design.

A burn rate of 2 means consuming the budget twice as fast as allowed. A burn rate of 14.4 means consuming it 14.4× faster than allowed, which exhausts a 30-day budget in roughly 50 hours.
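The arithmetic behind that figure, for a 30-day (720-hour) window:

time_to_exhaustion = window / burn_rate
                   = 720 h / 14.4
                   ≈ 50 h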

The point: burn rate is a velocity metric. It tells you not just whether something is wrong, but whether it's wrong fast enough to matter before someone notices.

The two-window, two-burn-rate pattern

The pattern Tracefox uses on every engagement (it's also the Google SRE Workbook default) combines two alerts:

Alert       Burn rate   Window    Severity              Time to budget exhaustion
Fast burn   14.4×       1 hour    P1 · page on-call     ~2 days
Slow burn   6×          6 hours   P2 · alert channel    ~5 days

Two alerts, two windows. The fast-burn alert pages the on-call when something is going wrong fast enough that you have hours, not days. The slow-burn alert nudges the team early enough to fix things before fast-burn fires.

Why these specific numbers?

14.4× over 1 hour exhausts a 30-day budget in roughly two days: fast enough to warrant a page, while the 1-hour window is long enough to separate a genuine incident from a 30-second blip. 6× over 6 hours exhausts it in ~5 days: slow enough that nobody needs to wake up, fast enough that it has to be fixed this sprint.
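Running the slow-burn row through the same formula confirms the table:

time_to_exhaustion = 720 h / 6 = 120 h = 5 days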

Computing the error-rate threshold for your SLO

Given an SLO target, the error rate at burn rate B is:

error_rate_threshold = (1 - SLO_target) × B

For a 99.9% SLO (allowed error rate 0.1%):

  • Fast burn (14.4×): error rate > 1.44% over 1hr → page
  • Slow burn (6×): error rate > 0.6% over 6hr → alert

For a 99.99% SLO (allowed error rate 0.01%):

  • Fast burn: error rate > 0.144% over 1hr → page
  • Slow burn: error rate > 0.06% over 6hr → alert

The SLO calculator works these out for any target and window.
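The formula also inverts neatly on a dashboard: dividing the observed error ratio by the allowance gives the current burn rate, which you can graph against the 14.4 and 6 thresholds. A one-line sketch in PromQL, assuming the error_ratio_1h recording rule used in the next section:

# Current burn rate against a 99.9% SLO — compare against 14.4 and 6
error_ratio_1h{service="checkout-api"} / (1 - 0.999)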

What it looks like in PromQL

Assuming you have recording rules that compute error_ratio_5m, error_ratio_30m, error_ratio_1h, and error_ratio_6h as the error ratio over 5-minute, 30-minute, 1-hour, and 6-hour windows respectively (a sketch of these rules follows the alerts), a 99.9% SLO translates to:

# Fast burn — pages on-call (P1)
- alert: HighErrorBudgetBurnFast
  expr: |
    error_ratio_5m{service="checkout-api"} > (14.4 * 0.001)
    and
    error_ratio_1h{service="checkout-api"} > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "checkout-api is burning error budget > 14.4×"
    runbook: "https://runbooks/checkout-api/error-budget-burn"

# Slow burn — alerts the channel (P2)
- alert: HighErrorBudgetBurnSlow
  expr: |
    error_ratio_30m{service="checkout-api"} > (6 * 0.001)
    and
    error_ratio_6h{service="checkout-api"} > (6 * 0.001)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "checkout-api is burning error budget > 6×"
    runbook: "https://runbooks/checkout-api/error-budget-burn"

Two windows on each alert (one short, one long). The short window catches the sudden onset; the long window confirms it's not a 90-second blip. Both must breach for the alert to fire.
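The error_ratio_* recording rules are assumed above rather than shown. Here is a minimal sketch of what they might look like, assuming a standard http_requests_total counter with service and code labels; adapt the selectors to your own metrics:

groups:
- name: error-ratio-recording
  rules:
  # 5xx responses as a fraction of all responses, per service.
  - record: error_ratio_5m
    expr: |
      sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m]))
  - record: error_ratio_30m
    expr: |
      sum by (service) (rate(http_requests_total{code=~"5.."}[30m]))
      / sum by (service) (rate(http_requests_total[30m]))
  - record: error_ratio_1h
    expr: |
      sum by (service) (rate(http_requests_total{code=~"5.."}[1h]))
      / sum by (service) (rate(http_requests_total[1h]))
  - record: error_ratio_6h
    expr: |
      sum by (service) (rate(http_requests_total{code=~"5.."}[6h]))
      / sum by (service) (rate(http_requests_total[6h]))

Recording the ratios once keeps the alert expressions short and guarantees dashboards and alerts agree on the same definition of "error rate".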

Common mistakes

Picking thresholds without a measurement window

"Alert when error rate > 1%" is meaningless without saying over what window. 1% over 30 seconds is noise; 1% over 6 hours is a P1. Burn-rate alerts always specify both the rate and the window.

Single-window alerts

A burn-rate alert with only one window will fire constantly during minor spikes. The two-window pattern (short + long) provides the noise filter.

Treating burn-rate alerts as informational

A fast-burn alert pages the on-call. If your fast-burn alert routes to Slack with no acknowledgement requirement, you don't have a fast-burn alert; you have a notification. They are not the same thing.

No runbook

Every burn-rate alert must have a linked runbook. The on-call woken up at 3am needs to know what "checkout-api is burning budget" means and what to check first. Alerts without runbooks get muted.

Where to start

Pick your most critical service. Define one SLO. Implement fast and slow burn alerts for that one SLO. Run them for two weeks before adding more. The hardest part of burn-rate alerting isn't the math; it's the policy that decides what happens when the alert fires. Build that next.

Engagement.start()

Most teams have alerts that fire late and alerts that fire on noise. Often the same alert.

The Tracefox assessment includes an alert hygiene audit on day one. We score severity, runbook coverage, and ownership across your active alert set, and identify which alerts to disable, which to retune, and which to replace with burn-rate alerts.