Field notes

The escalation path that ends in 'just DM Raj'.

On paper, you have a tiered rotation. L1 acknowledges. L2 escalates within 15 minutes. L3 is the architect on a quarterly schedule. In practice, the L1 messages Raj. Always Raj. Raj is on holiday this week and the team is quietly nervous about whether anything will get fixed before Monday.

25 December 2026 · Ken Tan · 6 min read

Every team I've worked with claims to have a tiered escalation path. L1 triages. L2 takes the harder ones. L3 is the deeper expertise, paged only when the previous tiers have exhausted their options. The diagram is on the wiki. It looks orderly.

The diagram doesn't survive the first incident. What actually happens, in nearly every team I've reviewed, is that the L1 acknowledges the alert, opens a Slack DM to the same senior engineer they always open a DM to, and types "hey are you around, weird one on the orders pipeline." The senior engineer answers. They've been answering at 02:47 for three years. They know they're not on call. They answer anyway, because if they don't, the L1 will flounder for forty minutes and the postmortem will be uncomfortable.

The team has a rotation. The senior engineer is the rotation. Those are different things. The first one is on paper. The second one is the load-bearing reality.

Why this happens

The reach for the DM isn't a sign that the L1 is lazy or that the senior engineer is a glory-hound. It's a rational response to three underlying conditions, all of which are organisational:

The runbook doesn't get the L1 to a resolution. It gets them to "investigate further." The senior engineer is the shortcut to skipping the investigation, because they already know what the alert means.
The dashboards don't surface the right signal in the right shape. The senior engineer has built a mental model of the system that lets them read the existing dashboards correctly. The L1 has not. The DM is asking for an interpretation service.
The L1 has been burned before for guessing wrong. They reach for the senior because the cost of a wrong call (a wider outage, an awkward retro) is higher to them personally than the cost of waking someone up.

All three of those are fixable. None of them are about the L1's skill. They're about what's been built around the L1.

What the data usually shows

When I run a rotation review, the audit is uncomfortable. I trace each of the last quarter's pages (usually 80 to 200, depending on the team) to whoever ended up resolving it. The breakdown is almost always shaped like this:

One or two senior engineers resolved 60–80% of the pages.
The named L1s resolved 10–20%, mostly false positives that didn't need anyone.
The named L2s resolved a single-digit number, almost all of them in office hours.
Several pages were resolved by people not in the rotation at all.

Which means the rotation isn't a rotation. It's a queue with two names doing the work, and the rest of the names providing psychological cover for the org chart. When one of those two names leaves, or burns out, or takes a planned holiday, the team discovers the rotation hasn't been training anyone all year, and the next incident is in a much worse place than anyone expected.
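For teams that want to run this audit themselves, here is a minimal sketch of the tally in Python. It assumes you can export one record per page with the name of whoever actually resolved it and their role at the time. The record layout, the role labels, and the sample rows are invented for illustration. They are not real data, and every name other than Raj is a placeholder.

```python
from collections import Counter

# Hypothetical page records: (page_id, resolver, resolver_role).
# resolver_role is an assumed label for where the resolver sat at the time:
# "L1", "L2", "L3", "senior" (not on call), or "outside" (not in the rotation).
# These rows are made-up sample data, not results from a real audit.
pages = [
    ("p1", "raj", "senior"),
    ("p2", "raj", "senior"),
    ("p3", "raj", "senior"),
    ("p4", "mei", "L1"),
    ("p5", "raj", "senior"),
    ("p6", "ana", "senior"),
    ("p7", "raj", "senior"),
    ("p8", "tom", "L2"),
    ("p9", "ana", "senior"),
    ("p10", "lee", "outside"),
]


def resolver_shares(records):
    """Return {resolver: fraction of pages resolved}, largest share first."""
    counts = Counter(resolver for _, resolver, _ in records)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.most_common()}


def top_n_share(shares, n=2):
    """Fraction of all pages resolved by the n busiest resolvers."""
    return sum(sorted(shares.values(), reverse=True)[:n])


def role_shares(records):
    """Return {role: fraction of pages resolved by people in that role}."""
    counts = Counter(role for _, _, role in records)
    total = sum(counts.values())
    return {role: n / total for role, n in counts.items()}


shares = resolver_shares(pages)
print("share by resolver:", shares)
print("share held by the two busiest resolvers:", top_n_share(shares))
print("share by role:", role_shares(pages))
```

The number to look at first is the share held by the two busiest resolvers. If it sits in the 60–80% range described above, the rotation is a queue with two names on it.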

The senior engineer is paying the price

The other thing the audit surfaces is the cost to the senior engineer. They're not formally on call, so the org isn't paying them on-call comp. They are functionally on call, every night, for years. They answer DMs. They check Slack on weekends. They never quite go on holiday. By the time you ask them about it directly, they've rationalised it as "I'd rather just deal with it than have the conversation," and they're three months from quitting.

The cost of their leaving is the cost of discovering that nobody else can run the system. Which is the original problem, except now it's an emergency.

What to actually fix

The pattern that works isn't to scold the team into using the formal escalation path. The DM happens because the formal path doesn't work. To make the formal path work, you have to remove the reasons the L1 is reaching for the shortcut.

Audit the runbooks for the top ten alert types. For each, ask: can a competent L1 resolve this from the runbook alone, without DMing anyone? If the answer is no, the runbook is the artefact to invest in. The senior engineer is the right person to write it, ideally during a calm afternoon, not during the next incident.
Pair the L1 with the senior on real pages, not simulations. Shadowing during an actual incident, with the senior narrating what they're looking at, transfers more in twenty minutes than a quarter of training videos. The senior has to consent. Some won't. That's a different conversation.
Make the formal escalation path no slower than the DM. If the policy says "wait 15 minutes before escalating to L2," and the L1 thinks the senior will respond in 30 seconds via DM, the L1 is making a sensible time-to-resolution decision. The policy has to compete on speed, not just on form.
Recognise the senior engineer's load explicitly. Either pay them for the work they're doing, formally rotate them out of it, or both. The current arrangement is paying them in goodwill, and goodwill runs out.
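The runbook audit in the first fix can be sketched as a simple ranking. The Python below assumes each page has been labelled during the audit with its alert type and a yes/no judgment on whether the L1 resolved it from the runbook alone. The field layout, the alert names, and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical page records: (alert_type, resolved_from_runbook_alone).
# The boolean is a judgment made during the audit: did the on-call L1 resolve
# the page using only the runbook, without DMing anyone?
# Alert names and rows are made-up sample data.
pages = [
    ("orders-pipeline-lag", False),
    ("orders-pipeline-lag", False),
    ("orders-pipeline-lag", True),
    ("disk-usage-high", True),
    ("disk-usage-high", True),
    ("cert-expiry", False),
]


def runbook_audit(records, top_n=10):
    """Rank alert types by page count and report the runbook-only resolution rate.

    Returns a list of (alert_type, page_count, runbook_only_rate) tuples for
    the top_n most frequent alert types, most frequent first.
    """
    totals = defaultdict(int)
    solo = defaultdict(int)
    for alert_type, from_runbook in records:
        totals[alert_type] += 1
        if from_runbook:
            solo[alert_type] += 1
    ranked = sorted(totals, key=lambda a: totals[a], reverse=True)[:top_n]
    return [(a, totals[a], solo[a] / totals[a]) for a in ranked]


for alert_type, count, rate in runbook_audit(pages):
    print(f"{alert_type}: {count} pages, {rate:.0%} resolved from the runbook alone")
```

Alert types that page often and have a low runbook-only rate are the ones where a calm afternoon of the senior engineer's writing time is likely to pay back fastest.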

What it looks like when this is fixed

The signal that the rotation is real is an uneventful one. The pages get handled. The senior engineer's Slack DMs are no longer the de facto incident channel. Six months in, the audit shows the page-resolution distribution is broader, with five or six names doing meaningful work and no single name carrying the rotation. The senior engineer takes a holiday and nothing breaks. That's the bar.
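One way to turn that bar into a check you can rerun each quarter is sketched below in Python. The minimum of five names comes from the description above. The thresholds for what counts as "meaningful work" (10% of pages) and for one name "carrying" the rotation (more than 40%) are assumptions chosen for illustration, and the sample lists are made up.

```python
from collections import Counter


def rotation_health(resolvers, min_names=5, meaningful_share=0.10,
                    max_single_share=0.40):
    """Summarise how widely page resolution is spread across people.

    resolvers is a list with one entry per page: the name of whoever resolved
    it. meaningful_share and max_single_share are assumed thresholds.
    """
    counts = Counter(resolvers)
    if not counts:
        raise ValueError("no pages to audit")
    total = sum(counts.values())
    meaningful = [name for name, n in counts.items()
                  if n / total >= meaningful_share]
    top_name, top_count = counts.most_common(1)[0]
    top_share = top_count / total
    return {
        "meaningful_names": len(meaningful),
        "top_resolver": top_name,
        "top_share": top_share,
        "broad": len(meaningful) >= min_names and top_share <= max_single_share,
    }


# Made-up sample quarters: one name per resolved page.
before = ["raj"] * 7 + ["ana"] * 2 + ["mei"]
after = ["raj"] * 3 + ["ana"] * 2 + ["mei"] * 2 + ["tom"] * 2 + ["lee"]

print(rotation_health(before))
print(rotation_health(after))
```

The thresholds matter less than tracking the same numbers from one audit to the next and checking that the top share is falling.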

The line worth holding

A rotation isn't a rotation if everyone DMs the same person. It's a queue with a single backend, dressed up as a system. The fix is not to forbid the DMs. It's to remove the reasons the DMs are the obvious move. Once the runbooks get the L1 home, the DMs stop on their own.

