Opinion

The first ten minutes of a P1 are about the runbook, not the engineer.

When the page goes off at 02:47, the engineer who answers it isn't the variable. The runbook is. Training the engineer (rotation, shadowing, IC training) gets a budget. Writing the runbook rarely does.

Tracefox · 5 min read

02:47. The page goes off. The on-call engineer rolls over, opens the laptop, reads the alert summary, and clicks the runbook link.

Whatever happens in the next ten minutes is mostly determined before the engineer even logs in. It's determined by what's on the other end of that runbook link.

What the engineer is actually doing

In the first ten minutes of a P1, an engineer is not solving the problem. They are doing four things in sequence:

  1. Confirming the alert is real. Is this a known false-positive? Is something actually broken?
  2. Locating themselves. Which service. Which environment. Which dashboard. Which playbook.
  3. Triaging the blast radius. Customer-facing? How many users? Is data being lost?
  4. Deciding whether to escalate. Wake the IC? Page secondary? Declare incident?

None of these require deep technical insight in the first ten minutes. They require orientation. A good runbook gives the engineer orientation in seconds. A bad runbook, or no runbook, forces the engineer to derive the orientation from scratch. The difference between the two outcomes is the first thirty minutes of MTTR.

The two failure modes

No runbook

The alert exists. Someone added it during a previous incident. Nobody wrote the runbook. The engineer at 02:47 is starting cold: which dashboards do I open, which logs do I search, who else owns this service, what's normal traffic for this hour of night. They'll figure it out. They'll also lose twenty minutes doing it. Multiply by every P1 across the year.

The useless runbook

The runbook exists, has eight sections, four diagrams, and a paragraph of "background context" before the first actionable line. It was written eighteen months ago by an engineer who has since left. Three of the listed dashboards have been deleted. One of the linked tools requires SSO access the on-call engineer doesn't have. The runbook is worse than no runbook, because the engineer first has to realise it's worse than no runbook before they can start working.

Both failure modes are common. The second is more common at well-funded orgs that took runbooks "seriously" five years ago and haven't audited them since.

A runbook that hasn't been touched in twelve months is a runbook that's almost certainly wrong. Treat it as evidence, not authority.

What useful runbooks share

Looking across runbooks that on-call engineers actually rate as helpful, the list of shared traits is short. Useful runbooks are:

  • Above the fold. The first thing visible answers "is this real?" and "what do I do first?". Background context is at the bottom or removed.
  • Action-first. Imperative sentences. "Check the deploy timeline." "Run kubectl describe pod for affected workloads." "Page the database team if connection-pool exhaustion appears."
  • Linked, not copy-pasted. Direct links to the dashboard, the log query, the rollback button. Not screenshots that go stale.
  • Owned. A named team owns the runbook, has reviewed it in the last quarter, and is on the hook for keeping it accurate.
  • Versioned. Stored in source control next to the service that owns it. Reviewed via PR like any other artefact.

None of this is technically difficult. It's a habit: writing the runbook the way the engineer at 02:47 needs it, not the way it makes sense to the engineer who's writing it at 14:30.

Why runbooks rot

The asymmetry is the same as the alert one: writing a runbook is a local, visible action with a short-term feel-good. Updating an existing runbook is a global, invisible action that nobody schedules. Without a recurring practice, runbooks accumulate the same way alerts do, and decay the same way.

The fix is the same too: budget time for it, treat the audit as a practice, and tie runbook coverage to the alert that links it. An alert whose runbook hasn't been touched in twelve months is a candidate for review, automatically flagged, not manually remembered.
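One way to make "automatically flagged, not manually remembered" concrete: if runbooks live in source control, a scheduled job can compare each file's last commit date against a threshold. A minimal sketch, assuming a git-backed runbook directory and a twelve-month cutoff (both assumptions, tune to taste):

```python
import subprocess
import sys
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=365)  # assumed threshold: roughly twelve months

def last_touched(path: str) -> datetime:
    """Date of the last commit that touched a runbook file, from git history."""
    out = subprocess.check_output(
        ["git", "log", "-1", "--format=%cI", "--", path], text=True
    ).strip()
    return datetime.fromisoformat(out)

def is_stale(last: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """True if the runbook hasn't been touched within max_age."""
    return now - last > max_age

if __name__ == "__main__":
    # Usage: python audit.py runbooks/*.md  (run from inside the repo)
    now = datetime.now(timezone.utc)
    for path in sys.argv[1:]:
        if is_stale(last_touched(path), now):
            print(f"REVIEW: {path} last touched {last_touched(path):%Y-%m-%d}")
```

Wired into CI or a weekly cron, the flagged list becomes the review queue; nobody has to remember anything.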

The discipline

On engagements we run a runbook audit alongside the alert audit. Every active production alert is checked: is there a linked runbook, does the link work, when was it last touched, would an unfamiliar engineer actually use it at 02:47. The runbooks that fail get rewritten or the alerts get retired.

A typical first audit finds 30–50% of runbooks rotted. That's not a sign of a bad team. It's the structural decay nobody scheduled the work to prevent. The team that schedules the work stops having the conversation about why MTTR isn't improving.

The next post in this series, on what the runbook should actually look like, has the template we hand teams on engagement. Twelve sections become five. Most of what's there now can probably be deleted.

Engagement.start()

The runbook either earns the engineer ten minutes or costs them an hour. Find out which yours do.

Tracefox engagements include runbook coverage as part of the alert audit. Every active production alert gets reviewed: does the linked runbook exist, is it useful at 02:47, when was it last touched. The ones that fail get rewritten, or the alert gets retired.