Alert on how fast the budget burns. Not on the instantaneous error rate.
The two-window, two-burn-rate pattern is what you reach for when threshold alerts have stopped earning trust. Here's the math, the trade-offs, and a working implementation.
Threshold alerts fire late and produce noise. A brief 2% error spike that resolves in ten minutes is largely harmless. A sustained 1% error rate exhausts a 99.9% SLO's entire 30-day budget in three days. A static threshold can't tell those two situations apart.
Burn-rate alerting fixes that. Instead of alerting on the instantaneous error rate, you alert on the rate at which the error budget is being consumed, and you fire only when that rate is fast enough to exhaust the budget before the team can respond.
What "burn rate" actually means
A burn rate of 1 means consuming the error budget at exactly the SLO-permitted rate. At burn rate 1, your budget lasts exactly the measurement window, by design.
A burn rate of 2 means consuming the budget twice as fast as allowed. A burn rate of 14.4 means consuming it 14.4× faster than allowed, which exhausts a 30-day budget in roughly 50 hours.
The point: burn rate is a velocity metric. It tells you not just whether something is wrong, but whether it's wrong fast enough to matter before someone notices.
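The relationship between burn rate and time-to-exhaustion is simple enough to sanity-check directly. A minimal sketch (the helper name is ours, not a standard API):

```python
# Hypothetical helper: how long an error budget survives at a given burn rate.
def hours_to_exhaustion(window_days: float, burn_rate: float) -> float:
    """At burn rate 1 the budget lasts the whole window; at rate B, window / B."""
    return window_days * 24 / burn_rate

print(hours_to_exhaustion(30, 1))     # 720.0 — the full 30-day window
print(hours_to_exhaustion(30, 14.4))  # 50.0  — the "roughly 50 hours" above
print(hours_to_exhaustion(30, 6))     # 120.0 — 5 days, the slow-burn figure
```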
The two-window, two-burn-rate pattern
The pattern Tracefox uses on every engagement (it's also the Google SRE Workbook default) combines two alerts:
| Alert | Burn rate | Window | Severity | Time to budget exhaustion |
|---|---|---|---|---|
| Fast burn | 14.4× | 1 hour | P1 · page on-call | ~2 hours |
| Slow burn | 6× | 6 hours | P2 · alert channel | ~5 days |
Two alerts, two windows. The fast-burn alert pages the on-call when something is going wrong fast enough that you have hours, not days. The slow-burn alert nudges the team early enough to fix things before fast-burn fires.
Computing the error-rate threshold for your SLO
Given an SLO target, the error rate at burn rate B is:
error_rate_threshold = (1 - SLO_target) × B

For a 99.9% SLO (allowed error rate 0.1%):
- Fast burn (14.4×): error rate > 1.44% over 1hr → page
- Slow burn (6×): error rate > 0.6% over 6hr → alert
For a 99.99% SLO (allowed error rate 0.01%):
- Fast burn: error rate > 0.144% over 1hr → page
- Slow burn: error rate > 0.06% over 6hr → alert
The SLO calculator works these out for any target and window.
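The formula is small enough to encode directly. A throwaway helper (hypothetical, for illustration) reproduces the numbers above:

```python
def error_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Error rate at which the budget burns at burn_rate× the allowed pace."""
    return (1 - slo_target) * burn_rate

# 99.9% SLO
print(error_rate_threshold(0.999, 14.4))   # ≈ 0.0144 → page above 1.44%
print(error_rate_threshold(0.999, 6))      # ≈ 0.006  → alert above 0.6%
# 99.99% SLO
print(error_rate_threshold(0.9999, 14.4))  # ≈ 0.00144 → page above 0.144%
```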
What it looks like in PromQL
Assuming you have recording rules that compute error_ratio_5m, error_ratio_30m, error_ratio_1h, and error_ratio_6h as the error rate over those respective windows, an SLO of 99.9% translates to:
```yaml
# Fast burn — pages on-call (P1)
- alert: HighErrorBudgetBurnFast
  expr: |
    error_ratio_5m{service="checkout-api"} > (14.4 * 0.001)
    and
    error_ratio_1h{service="checkout-api"} > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "checkout-api is burning error budget > 14.4×"
    runbook: "https://runbooks/checkout-api/error-budget-burn"

# Slow burn — alerts the channel (P2)
- alert: HighErrorBudgetBurnSlow
  expr: |
    error_ratio_30m{service="checkout-api"} > (6 * 0.001)
    and
    error_ratio_6h{service="checkout-api"} > (6 * 0.001)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "checkout-api is burning error budget > 6×"
    runbook: "https://runbooks/checkout-api/error-budget-burn"
```

Two windows on each alert (one short, one long). The short window catches the sudden onset; the long window confirms it's not a 90-second blip. Both must breach for the alert to fire.
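The error_ratio_* series have to come from somewhere. One way to produce them is with Prometheus recording rules along these lines; note that the metric name http_requests_total and the code label are assumptions, so substitute whatever your instrumentation actually exports:

```yaml
# Sketch only — metric name and label scheme are assumed, not prescribed.
groups:
  - name: checkout-api-error-ratios
    rules:
      - record: error_ratio_5m
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
      # ...repeat with [30m], [1h], and [6h] for the other three series
```

Aggregating with `sum by (service)` keeps the service label, which the alert selectors above depend on.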
Common mistakes
Picking thresholds without a measurement window
"Alert when error rate > 1%" is meaningless without saying over what window. 1% over 30 seconds is noise; 1% over 6 hours is a P1. Burn-rate alerts always specify both the rate and the window.
Single-window alerts
A burn-rate alert with only one window will fire constantly during minor spikes. The two-window pattern (short + long) provides the noise filter.
Treating burn-rate alerts as informational
A fast-burn alert pages the on-call. If your fast-burn alert routes to Slack with no acknowledgement requirement, you don't have a fast-burn alert; you have a notification. They are not the same thing.
No runbook
Every burn-rate alert must have a linked runbook. The on-call woken up at 3am needs to know what "checkout-api is burning budget" means and what to check first. Alerts without runbooks get muted.
Where to start
Pick your most critical service. Define one SLO. Implement fast and slow burn alerts for that one SLO. Run them for two weeks before adding more. The hardest part of burn-rate alerting isn't the math; it's the policy that decides what happens when the alert fires. Build that next.