The on-call who can't get into prod at 02:00.
Every team I've audited has at least one access path that hasn't been exercised at 02:00 in months. SSO, VPN, jump-host, break-glass — they all work fine in office hours. Then the page fires, the laptop is half-asleep, the SSO redirect 500s, and the first twenty minutes of the incident are spent getting in.
The most expensive minutes of an incident, in my experience, aren't the ones spent diagnosing. They're the ones spent logging in.
The pattern goes like this. The page fires at 02:14. The responder opens the laptop, which has been asleep for six hours. The corporate SSO certificate rotated last Friday and their browser is still holding the old token. They click through three redirects, get a 500, click again, get a 500, restart Chrome. The VPN client auto-updated last week and now throws a first-time-setup prompt they don't remember the answers to. They find their phone for the MFA. Their MFA app updated at midnight and the saved entry no longer works without re-pairing. They wake their team lead to ask for a fallback. Twenty-three minutes have passed before they can run kubectl get pods.
The system was healthy that whole time, in the sense that nothing about the application was broken by the access failure. The system was also unhealthy, in the sense that no human could touch it. Those are not the same thing, and only one of them is on the dashboard.
What the access path is, and why it rots
The "access path" is the chain of moving pieces between the responder and a working production shell. In a typical mid-size estate it includes:
- The corporate identity provider and the SAML certificates inside it.
- The VPN client, the policy server, and the device-trust check.
- The MFA app, the device pairing, the recovery-code path.
- The bastion host or jump server, and its own SSH-key rotation.
- The kubeconfig, AWS profile, or equivalent, with its own session-token expiry.
- The break-glass account and the approver who has to release it.
Every one of those has a maintainer somewhere. Each maintainer operates on their own change cadence. The whole chain works only if every maintainer's change has been propagated to every responder's laptop and tested. That coordination almost never happens. The chain is reliable in office hours because it gets exercised in office hours. It is not reliable at 02:14 because nobody has tested it at 02:14, on a cold laptop, since the last quarterly drill — assuming there was a quarterly drill, which usually there wasn't.
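None of that rot needs a 02:14 page to find it. A scheduled job can exercise a few links of the chain from something that looks like a responder's laptop and fail loudly on a weekday instead. A minimal sketch, with every hostname a placeholder; swap in your own endpoints:

```bash
#!/usr/bin/env bash
# Nightly access-path smoke test (a sketch; every hostname is a placeholder).
# Run it from a machine that looks like a responder's laptop, not from
# inside the cluster, so it fails the same way a responder would.
set -u
fail=0

# 1. Identity provider: is the SAML metadata endpoint even serving?
curl -fsS --max-time 10 https://sso.example.com/saml/metadata >/dev/null \
  || { echo "FAIL: SSO metadata unreachable"; fail=1; }

# 2. Network path: is the bastion listening on its SSH port?
timeout 10 bash -c 'exec 3<>/dev/tcp/bastion.example.com/22' \
  || { echo "FAIL: bastion port 22 unreachable"; fail=1; }

# 3. Cluster credentials: does the current kubeconfig still authenticate?
kubectl auth can-i get pods --request-timeout=10s >/dev/null 2>&1 \
  || { echo "FAIL: kubeconfig no longer authenticates"; fail=1; }

exit "$fail"
```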
The access failures that come up most
From the rotations I've reviewed, four access failures account for most of the lost minutes:
- Expired SSO sessions with no clean re-auth path. The redirect chain breaks in a way the responder can't easily diagnose. Most often a misconfigured SAML response or a clock-skew issue on the responder's laptop.
- VPN clients that auto-updated and now demand configuration. The configuration was distributed by IT on a calm afternoon four months ago. The responder didn't read the email, and now their laptop is asking for parameters they don't have at 02:14.
- Stale kubeconfigs and AWS profiles. The cluster endpoint rotated. The IAM role was renamed in a security review. The responder's local config still points at the old one. The kubectl command times out and the error message is not helpful. (A triage sketch follows this list.)
- Break-glass paths gated on approvers who are unavailable. The break-glass policy requires a director to approve. The director is in a different time zone, on a flight, or simply asleep. The responder can't escalate the access problem because the access problem requires the very person they can't reach.
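For the stale-kubeconfig case, the fastest move is to stop trusting the local config and interrogate it directly. A sketch assuming an AWS-backed cluster, with placeholder names; each command answers one question before anyone starts debugging the cluster itself:

```bash
# Do the local cloud credentials still work, independent of the cluster?
aws sts get-caller-identity

# Which API endpoint is the current kubeconfig context actually pointing at?
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'

# If either answer looks stale, regenerate the kubeconfig from the cloud CLI
# rather than hand-editing it; the runbook sketch further down embeds the
# exact command.
```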
None of these are exotic. All of them have happened to teams I've worked with in the last twelve months.
Why it's a leadership problem
Access is owned, in most orgs, by a team that is not the SRE team. Identity sits with IT or platform security. The VPN sits with networking. The MFA app sits with corporate IT. The bastion sits with platform engineering. The kubeconfig sits with whoever onboarded the cluster. The responder is downstream of all of them.
When access breaks at 02:14, none of those owners are paged. The SRE is paged. The SRE has no authority to fix the SSO certificate or re-issue the VPN policy. They have to wake someone in the upstream team, who is also not on call, and who has no incentive to be fast, because the incident isn't recorded against their tooling.
The fix is not technical. It's organisational. The access path needs a single owner who is accountable for it being exercisable at any hour, and that ownership has to be visible to leadership. Without that, the access failures keep happening, because they don't show up in any one team's metrics.
The drill that finds them
The intervention I recommend is unfashionably analog. Once a quarter, pick a random hour outside business time. Send the on-call a notification: "this is a drill. Time how long it takes you to reach a production shell from a cold laptop." Don't tell them in advance. Don't pre-warm the session. Just measure.
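The measurement needs nothing fancier than a timestamp. A sketch of a drill timer, assuming a hypothetical kubectl context name: the responder starts it the moment the notification lands, then goes and fights the access path; it polls until a read-only production command succeeds and prints the elapsed time, so the number includes every SSO, VPN, MFA, and kubeconfig step in between.

```bash
#!/usr/bin/env bash
# Time-to-shell drill timer (a sketch; the context name is a placeholder).
# Start it when the drill notification lands. It polls until a read-only
# production command succeeds, then prints the elapsed minutes.
start=$(date +%s)
until kubectl --context prod-example get pods --request-timeout=10s >/dev/null 2>&1; do
  sleep 15
done
printf 'time-to-shell: %d minutes\n' $(( ($(date +%s) - start) / 60 ))
```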
The first drill is always uncomfortable. The median time-to-shell I see is between fifteen and forty-five minutes. The responder is embarrassed, but the embarrassment is the wrong response — the failure is in the access path, not the responder. The drill output is a list of specific failures, each with a specific owner. Most of them get fixed within a sprint, because they're individually small. The next quarter's drill is fast.
What the runbook should say
The runbook for any production alert should include, before any diagnostic step, the access checklist (a concrete sketch follows the list):
- Link to the SSO health page and the fallback SAML endpoint.
- The current VPN client version and where to download it.
- The kubeconfig refresh command, copy-pasteable, with the cluster name embedded.
- The break-glass procedure, the approver list, and an explicit fallback if no approver answers in five minutes.
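A sketch of what that access block might look like. Every name, URL, and version number below is a placeholder; the point is the shape, where each line is either a link to click or a command to paste unmodified.

```bash
# --- Access (do this before any diagnostic step) ---

# SSO: health page, and the fallback IdP endpoint if the primary redirect fails.
#   https://status.sso.example.com
#   https://sso-fallback.example.com/saml/login

# VPN: required client version and where to get it.
#   4.2.x or newer, from https://it.example.com/vpn

# Kubeconfig refresh, cluster name and region embedded (AWS shown as an example):
aws eks update-kubeconfig --name prod-cluster --region eu-west-1

# Break-glass: procedure, approver list, and the five-minute fallback.
#   https://wiki.example.com/break-glass
#   Approvers: on-call engineering manager, then any SRE lead.
#   If nobody approves within five minutes, use the sealed break-glass
#   credentials and page security afterwards.
```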
Most runbooks I read skip this section because the assumption is that access "just works." It doesn't. The access section is the first thing the responder needs and the most likely thing to be out of date.
The line worth holding
Your incident MTTR has a hidden floor, and that floor is the time it takes your responder to reach a shell from a cold laptop. If you've never measured it, the number will surprise you. Once you have measured it, the access path becomes a system you can improve, and the floor drops fast.