What the runbook should actually look like.
Five sections. Action-first. Written for the engineer at 02:47, not the engineer at 14:30. A copy-paste template, plus the sections most existing runbooks should delete.
The runbook template most teams use was written in a different decade. It has a section for "background context", a section for "architecture overview", a section for "stakeholders", and somewhere on page two, the actual diagnostic steps.
The runbook the engineer at 02:47 needs is shorter, sharper, and inverts that order. Here's the template Tracefox hands clients on engagement, along with the sections to delete from whatever you have today.
The five-section template
Copy this. Replace the bracketed values. Keep it under one screen if possible.
```markdown
# [service-name] — [alert-short-name]
## 1. Is this real?
- Check [link to live dashboard]
- Sanity-check: [link to traffic/error baseline]
- Known false-positive conditions:
  - [e.g. fires during weekly maintenance window 02:00–02:15 UTC]
  - [e.g. expected during deploy if rollout exceeds 5 min]
- If still unclear: page the secondary on-call.
## 2. First three actions
- [ ] Acknowledge the page.
- [ ] Open [link to incident dashboard].
- [ ] Check [link to deploy timeline] for changes in the last 60 min.
## 3. Most likely causes (ranked)
1. **Recent deploy.**
   Verify: [link to deploy log]
   Mitigate: [link to rollback procedure]
2. **Upstream dependency degradation.**
   Verify: [link to dependency status board]
   Mitigate: enable circuit-breaker via [feature flag link]
3. **Database connection-pool exhaustion.**
   Verify: [link to db saturation dashboard]
   Mitigate: scale read replicas via [runbook link]
## 4. Escalation
- If unresolved at +15 min: declare incident, page IC.
- Customer-facing impact > 5%: trigger customer comms via [link].
- Data-loss risk: page database team via [link].
- Ownership: @[team-handle] on Slack #[channel].
## 5. Once resolved
- File incident summary in [link].
- If novel cause: open follow-up to update this runbook.
- PIR scheduled within 5 working days.
---
Owner: [team]
Last reviewed: [YYYY-MM-DD]
Linked alert: [alert name + link]
```

What each section is doing
"Is this real?"
The first thing the engineer needs to know. Most P1s have a five-second "is this an actual incident or has the alert misfired" question, and most runbooks bury the answer. Putting the live-status link first lets the engineer eliminate the false-positive case in seconds.
"First three actions"
Three. Not eight. Not "review the architecture diagram." The minimum set that gets the engineer oriented and looking at live data within sixty seconds of opening the runbook. Acknowledge, open the dashboard, check the deploy timeline. Almost every P1 starts with these three steps.
"Most likely causes (ranked)"
Ranked by historical frequency, not by alphabet. For most services, the answer is "recent deploy" 60% of the time. List it first. Each cause has a verification step (how to confirm it) and a mitigation step (what to do if confirmed). No prose. Imperative, link-heavy.
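The ranking is worth deriving from data rather than from memory. A minimal sketch, assuming your incident tracker can export summaries as a CSV with a root_cause column; both the filename and the column name here are stand-ins for whatever your tracker produces:

```python
# Tally root causes from past incident summaries to order section 3.
# "incidents.csv" and the "root_cause" column are placeholders -- adapt
# to your incident tracker's export format.
import csv
from collections import Counter

def ranked_causes(path: str) -> list[tuple[str, float]]:
    """Return (cause, share-of-incidents) pairs, most frequent first."""
    with open(path, newline="") as f:
        causes = [row["root_cause"] for row in csv.DictReader(f)]
    return [(cause, count / len(causes))
            for cause, count in Counter(causes).most_common()]

for cause, share in ranked_causes("incidents.csv"):
    print(f"{share:4.0%}  {cause}")
```

If "recent deploy" doesn't come out on top, that's worth knowing before you write the runbook, not during the incident.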
"Escalation"
Specific triggers, named recipients, time-bounded. "If unresolved at +15 min" is a clear trigger. "If you're not making progress" is not. The escalation criteria are the parts of the runbook that get used at +20 minutes when the engineer is starting to question whether they should have escalated already. Having explicit triggers means the decision is already made.
"Once resolved"
The post-incident hygiene that doesn't get done if it's not in the runbook. File the summary. Schedule the PIR. Update the runbook if the cause was novel. Without these prompts, the runbook stays static while the system evolves.
The "delete the bullshit" rule
Existing runbooks tend to accumulate sections that feel responsible to include but aren't useful at 02:47. Strong candidates for deletion:
- "Background context." If the engineer needs background context to act, the runbook is for the wrong audience. Move it to onboarding documentation.
- "Architecture overview." Same. The engineer either knows the architecture or shouldn't be on this rotation. The runbook is not training material.
- "Stakeholders." Replace with the escalation section's named contacts. Stakeholders aren't actionable; escalation paths are.
- Screenshots of dashboards. They're stale within weeks. Use links.
- Long prose explanations of "why this alert exists." The alert exists because something user-facing is broken. The engineer doesn't need the philosophy at 02:47.
Common anti-patterns
The runbooks we rewrite most often share a few specific failure modes:
- Written as conversational suggestions. "You can also try checking..." Use the imperative instead: "Check..." Shorter, and easier to scan.
- Conditional logic without verification steps. "If the database is the problem, restart it." How does the engineer know if the database is the problem? Verification steps are non-negotiable.
- Copy-pasted across services. A runbook shared between five services isn't a runbook; it's a wiki article. Each alert deserves its own runbook with service-specific context, even if 80% of the content is templated. A small generator, sketched after this list, keeps that shared 80% in one place.
- No "last reviewed" date. Without one, every runbook is suspected of being out of date. Date them; review them quarterly.
Where to keep them
In the repo for the service that owns the alert. Reviewed via PR. Linked from the alert annotation directly. The on-call engineer should not be searching a wiki at 02:47. Markdown rendered in your monitoring UI of choice (PagerDuty, Grafana OnCall, Opsgenie all support this).
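The "linked from the alert annotation" rule is checkable in CI. A sketch, assuming Prometheus-style rule files where each alert carries a runbook_url annotation; the directory layout and annotation key follow common convention and may differ in your setup:

```python
# CI check: every alert rule must carry a runbook_url annotation, and the
# file it points to must exist in this repo. Assumes Prometheus-style
# rule files under alerts/ and runbooks under runbooks/ -- adjust paths.
import sys
from pathlib import Path

import yaml  # pip install pyyaml

failures = []
for rule_file in Path("alerts").glob("*.yml"):
    doc = yaml.safe_load(rule_file.read_text()) or {}
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:  # skip recording rules
                continue
            url = rule.get("annotations", {}).get("runbook_url", "")
            if not url:
                failures.append(f"{rule['alert']}: no runbook_url annotation")
            # Compare on the filename: the URL points at the rendered
            # runbook, the repo holds the markdown source.
            elif not (Path("runbooks") / Path(url).name).exists():
                failures.append(f"{rule['alert']}: runbook_url points at a missing file")

if failures:
    print("\n".join(failures))
    sys.exit(1)
```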
How to keep them current
Three forcing functions:
- Quarterly audit. Same calendar slot as the alert audit. Every active runbook reviewed; the unowned ones get owners or get retired.
- PIR-driven updates. Every post-incident review either confirms the runbook was useful or flags the section that wasn't. The flag generates a follow-up.
- "Last reviewed" date enforcement. Anything older than six months gets visibly flagged in the runbook viewer. Engineers reading a stale runbook see the warning before they trust the contents.
The full multi-service standard is in the Blueprint, and the alert-hygiene rules that this template plugs into are covered in the alert hygiene post. Together they're the operational unit: alert + runbook + owner. Each one fails without the others.
Related:
- "The first ten minutes of a P1 are about the runbook": the argument this template answers, on why the runbook matters more than the engineer in the early phase of an incident.
- "Burn-rate alerting": the alert that links to this runbook. Together they're the operational unit.
- "The Blueprint": includes the longer multi-service runbook standard and the alert-hygiene rules that go with it.