The status page that lags the incident by 40 minutes.
The incident is detected at 14:13. The team's first instinct is to investigate, not to communicate. The status page is updated at 14:53, after the team has a working theory. By that time the customer escalation has already happened, and the trust damage from the silence is bigger than the trust damage from the outage.
A failure mode I see in nearly every team I work with, and one that almost nobody pre-mortems for: the status page lags the incident by anywhere from 30 to 90 minutes, and the team treats this as normal.
The pattern goes like this. The page fires at 14:13. The on-call investigates. By 14:30 they've confirmed the issue is real, customer-facing, and not going to be resolved in the next ten minutes. They start drafting a status update. The draft goes for review. Marketing weighs in on the wording. Legal asks for a softer phrase. The status page is updated at 14:53.
During those 40 minutes, customers had no idea whether the company knew. They tweeted. They opened tickets. They emailed their account managers. By the time the status page changed, the company had absorbed an additional layer of trust damage that was entirely about the silence, not the underlying outage.
What customers actually need from the status page
The mistake teams make is thinking the status page is a place to publish a diagnosis. It isn't. It's a place to publish acknowledgement. The customer doesn't need to know the root cause in the first ten minutes. They need to know:
- That the company has noticed.
- That someone is working on it.
- When the next update will arrive.
That's three sentences. The first update doesn't need to say what broke or why. It needs to say "we are aware of an issue affecting [feature], engineering is investigating, next update in 15 minutes." That update can ship in three minutes. Most teams take 40, because they're trying to ship a polished diagnosis instead of a quick acknowledgement.
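To make the ask concrete, here is a minimal sketch of that first update as a fill-in-the-blanks function. The wording and the `first_update` helper are illustrative assumptions, not a prescribed implementation; the point is how little the on-call has to compose.

```python
from datetime import datetime, timedelta, timezone

# Pre-approved template: the sentence shape is fixed, only the blanks change.
# Template wording and names here are illustrative.
FIRST_UPDATE_TEMPLATE = (
    "We are aware of an issue affecting {feature}. "
    "Engineering is investigating. Next update at {next_update} UTC."
)

def first_update(feature: str, cadence_minutes: int = 15) -> str:
    """Render the acknowledgement with a concrete next-update time."""
    next_update = datetime.now(timezone.utc) + timedelta(minutes=cadence_minutes)
    return FIRST_UPDATE_TEMPLATE.format(
        feature=feature,
        next_update=next_update.strftime("%H:%M"),
    )

print(first_update("checkout"))
# e.g. "We are aware of an issue affecting checkout. Engineering is
# investigating. Next update at 14:28 UTC."
```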
Why the lag exists
The 40-minute lag isn't a tooling problem. It's an organisational one. The status page typically requires sign-off from two or three roles: the incident commander, a marketing lead, and sometimes legal. None of these roles are on call. Each one wants to read the draft. Each review iteration takes fifteen minutes.
The other reason is the team's reluctance to commit. They don't want to publish "we're investigating" because they're worried the investigation will conclude this isn't actually a customer-facing issue, and now they've alarmed customers for nothing. The fear is of the false positive. The cost of the false negative — silence during a real outage — is much higher, but it's harder to feel, because nobody complains in real time about silence; they just quietly lose trust.
The protocol that closes the gap
The intervention that works, in my experience, is a written comms protocol that's executable in the first ten minutes without escalation. The structure I recommend has four pieces.
- A pre-approved sentence template. The first status update doesn't need legal review every time, because it's the same sentence shape every time. Marketing and legal sign off on the template once, on a calm afternoon. The on-call fills in the blanks: "We are aware of an issue affecting [feature]. Engineering is investigating. Next update at [time]." No further approval needed.
- A single named approver, with a backup. The incident commander is empowered to publish the first update. They don't need to wake marketing. The second update can involve marketing if needed, but the clock starts immediately.
- A refresh cadence. Every 15 minutes during an active incident, regardless of whether there's news. "Engineering continues to investigate. We will update again at [time]." The cadence reassures the customer that the page is being maintained.
- A clear severity threshold for triggering the protocol. Not every alert needs a status update. The protocol triggers when the issue is customer-facing and lasts more than five minutes. The threshold is written down so the on-call doesn't have to guess.
The protocol fits on a single page. It lives in the runbook. It's executable cold, at 02:14, by an L1 with no marketing background. That's the bar.
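One way to make "executable cold" concrete is to encode the four pieces as data that a runbook script or chat bot can act on. This is a sketch under assumed names (`CommsProtocol`, `should_publish`) and assumed thresholds; a printed checklist clears the same bar.

```python
from dataclasses import dataclass

# A sketch of the one-page protocol as data. Field names and values
# are illustrative, not a standard.
@dataclass
class CommsProtocol:
    template: str               # pre-approved first-update sentence shape
    approver: str               # single named approver for the first update
    backup_approver: str
    cadence_minutes: int        # refresh interval during an active incident
    trigger_after_minutes: int  # severity threshold for starting comms

PROTOCOL = CommsProtocol(
    template="We are aware of an issue affecting {feature}. "
             "Engineering is investigating. Next update at {next_update}.",
    approver="incident-commander",
    backup_approver="on-call-secondary",
    cadence_minutes=15,
    trigger_after_minutes=5,
)

def should_publish(customer_facing: bool, minutes_elapsed: float) -> bool:
    """The written threshold: customer-facing and lasting more than five minutes."""
    return customer_facing and minutes_elapsed > PROTOCOL.trigger_after_minutes
```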
The internal comms problem
The status page is the external comms artefact. Most incidents also have an internal comms problem. Customer support doesn't know what's happening. Sales doesn't know whether to defer customer calls. Account managers are getting questions they can't answer.
The same protocol applies internally. A pre-approved Slack template. A named channel for incident updates. A 15-minute refresh cadence even when there's nothing to add. The customer support team needs to know what the customer is being told, in the same language, before the customer reaches out.
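A sketch of the internal half, assuming Slack and an incoming webhook for the named incident channel (the URL and helper name are placeholders). The design choice worth copying is that the internal post is the rendered external text, not a separately drafted message.

```python
import requests

# Placeholder webhook URL for the named internal incident channel.
INCIDENT_CHANNEL_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def post_internal_update(message: str) -> None:
    """Mirror the external status text into the internal channel, word for word,
    so support and sales read exactly what the customer reads."""
    requests.post(INCIDENT_CHANNEL_WEBHOOK, json={"text": message}, timeout=5)

# Usage: the same rendered acknowledgement goes to both audiences, e.g.
# post_internal_update("We are aware of an issue affecting checkout. "
#                      "Engineering is investigating. Next update at 14:28.")
```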
The internal comms gap, when it goes unaddressed, produces the secondary failure mode where customer support tells customers something different from what the status page says. Now the company is contradicting itself in public, and the trust damage compounds.
The metric to track
The metric I recommend tracking — and almost no team does — is "time to first external acknowledgement," measured from the first internal incident detection. Pull the last twenty incidents. For each, compute that latency. Plot the histogram.
Median under ten minutes is healthy. Median over thirty minutes is a comms problem the team should fix before the next major outage. The number doesn't need to be public. It just needs to be visible to leadership, who are the only people with authority to change the protocol that produces it.
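The arithmetic is trivial once both timestamps are captured; capturing them is the real work. A sketch with made-up example timestamps, just to show the shape of the calculation:

```python
from datetime import datetime
from statistics import median

# Each incident: (detected internally, first external acknowledgement).
# Timestamps are made-up examples, not real incident data.
incidents = [
    (datetime(2024, 3, 1, 14, 13), datetime(2024, 3, 1, 14, 53)),
    (datetime(2024, 3, 9, 2, 14),  datetime(2024, 3, 9, 2, 21)),
    # ... the last twenty incidents
]

latencies = [
    (acked - detected).total_seconds() / 60 for detected, acked in incidents
]

print(f"median time to first acknowledgement: {median(latencies):.0f} min")

# A crude text histogram in 10-minute buckets up to an hour.
for lo in range(0, 60, 10):
    count = sum(1 for m in latencies if lo <= m < lo + 10)
    print(f"{lo:>2}-{lo + 9} min: {'#' * count}")
```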
The line worth holding
The status page is the company's voice during an outage. Silence is louder than anything you'd put on it. The customer can forgive an outage; they have a much harder time forgiving the impression that nobody noticed or cared. Acknowledge fast. Refresh on schedule. Diagnose later. The order matters, and the protocol is the thing that lets the team get the order right under pressure.
- The customer told us before our monitoring did. The companion failure: detection latency. The status page latency is the same problem at a different layer, the comms layer.
- The synthetic check that lies to you. A common cause of status page latency: the page is driven by a synthetic that didn't fail.
- The first ten minutes of a P1 are about the runbook, not the engineer. If the runbook doesn't include the comms step, the comms don't happen on time. Communication is part of the incident, not a thing that happens after.