Field notes

Hitting the landing page is not a triage.

A CPU alert fires at 02:14. The L1 responder opens the homepage, clicks pricing, clicks login, sees fast paint times, marks the alert as no-impact and goes back to sleep. Six hours later the support inbox tells a different story. I see this pattern more than almost any other in the rotations I review.

· Ken Tan · 6 min read

The story I want to tell first is one I've watched play out, with very minor variations, in every MSP and L1 rotation I've reviewed in the last few years.

It's 02:14 in Singapore. The pager goes off. CPU above 90% on prod-web-03 for five minutes. The L1 responder opens the laptop, types the production URL, lands on the marketing homepage. It paints in 400ms. They click "Pricing." Snappy. They click "Login." Fast. They acknowledge the alert with a comment that reads something like "site loading normally, no user impact, monitoring," close the laptop, and go back to bed.

Six hours later the support inbox has fourteen tickets. Checkout has been timing out. The CSV export endpoint has been returning 504s. One tenant's dashboard has been taking twenty-two seconds to render. A managing director somewhere is composing an email that begins "why was this not caught."

The alert was right. The triage was wrong. And the responder did nothing they hadn't been implicitly trained to do.

Why this is the default reflex

The reach for the homepage isn't laziness. In the moment, it's the most rational thing the responder has access to. They've been woken up. They don't have the cognitive headroom for a multi-tab investigation. The homepage is one click. It either loads or it doesn't.

The problem is that the question being answered ("does the homepage load for me, right now, from my network?") is not even a distant cousin of the question that should be answered ("is anyone, anywhere, having a materially worse experience than they should be?"). The two get conflated because the second question doesn't have an obvious one-click answer in most stacks, and the first one does.

So the test that gets used is the test that's available, not the test that's correct. That's a tooling-and-runbook problem, not a responder problem.

What the landing page actually exercises

The marketing homepage is the single page in your product least likely to reflect what users are experiencing. In most stacks I see, it is some combination of:

  • Statically rendered or aggressively cached at the edge.
  • Served from a CDN PoP nearest to the responder, which is usually the same city as the office.
  • Anonymous, so it touches none of the per-tenant code paths, none of the auth tier, none of the database.
  • Cached in the responder's own browser from yesterday.

Whatever pressure the alerted host is under, the homepage path does not typically traverse it. The L1 has confirmed that the front door of the building is still standing. They have not confirmed that the elevators work, that the fourth-floor tenant's office is on fire, or that the back-office workflow that 80% of revenue depends on has stopped moving.

What's actually happening on the box

A sustained CPU alert is a symptom, not a verdict. The diagnostic question is never "is the site up." It's "what is the CPU doing, and who is paying for it." From the engagements I've worked on, the most common answers have been:

  • A runaway query triggered by one tenant's data shape — invisible at the aggregate latency layer, devastating for that tenant.
  • A retry storm against a degraded downstream — the service is technically up, burning cycles re-attempting calls that will keep failing.
  • A poison message in a queue worker — the web tier looks fine because it isn't the one in trouble; an async processor on the same host is starving the request handlers.
  • GC pressure from a slow leak — latency is climbing gradually before the eventual restart.
  • Two cron jobs overlapping — the previous run hadn't finished when the next one started, and now there are two of them fighting for the same connection pool.

None of those are detectable by clicking around the marketing site. All of them are visible in the right place: real user monitoring, per-endpoint latency percentiles, per-tenant breakdowns, queue depths, error budgets. The triage fails because the responder is reading the wrong dashboard, or, in many cases, because the right dashboard doesn't exist.

The five questions a real triage answers

When the page fires, the responder should be able to answer the following, from a single dashboard, in under five minutes, without having to think about it:

  1. Is p95 or p99 latency degraded on any user-facing endpoint? Not the homepage. The endpoints that carry revenue: login, search, checkout, the public API, the tenant dashboard.
  2. Are error rates elevated, and where? A 1% error rate on the checkout endpoint is a fire even when the homepage is green.
  3. Which tenants or user segments are affected? "Everyone" and "your largest customer" are different incidents with different escalation paths.
  4. What is the CPU actually doing? A flame graph, a top view, the slow query log. Not vibes.
  5. Did anything change? A deploy, a feature flag, a config push, an upstream incident, a traffic shift. Without this, root cause is guesswork.

If your L1 runbook does not lead to those answers in a few clicks, the runbook is the bug, not the responder. Fix the runbook, and the "site loads fine, dismissed" reflex disappears within a rotation or two — because there is now something more useful to look at.
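
For teams on a Prometheus-style stack, a minimal sketch of what such a view might be built from looks roughly like the block below. The metric and label names (http_request_duration_seconds_bucket, http_requests_total, endpoint, tenant) are assumptions about instrumentation rather than a prescription; substitute whatever your services actually emit.

```python
# Illustrative PromQL behind a one-page triage view (questions 1-3).
# Metric and label names are assumptions; rename to match your own stack.

TRIAGE_QUERIES = {
    # 1. Is p95 latency degraded on any user-facing endpoint?
    "p95_by_endpoint": (
        "histogram_quantile(0.95, sum by (le, endpoint) ("
        'rate(http_request_duration_seconds_bucket{endpoint=~"login|search|checkout|api|dashboard"}[5m])))'
    ),
    # 2. Are error rates elevated, and where?
    "error_rate_by_endpoint": (
        'sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))'
        " / sum by (endpoint) (rate(http_requests_total[5m]))"
    ),
    # 3. Which tenants are affected? Here: tenants whose p95 is above two seconds.
    "degraded_tenants": (
        "count(histogram_quantile(0.95, sum by (le, tenant) ("
        "rate(http_request_duration_seconds_bucket[5m]))) > 2)"
    ),
}

# Questions 4 and 5 (what the CPU is doing, what changed) come from a profiler
# or flame graph and from the deploy / feature-flag audit log, not from metrics.
```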

Why I keep finding this pattern

When I dig into the rotations where this happens repeatedly, the cause is almost never the responder. It's one or more of three organisational failures:

  • The dashboards don't exist, or are too noisy to trust. The responder reaches for the homepage because it's the only signal they're certain how to read at 02:14.
  • The alert has cried wolf too many times. When the same CPU page has fired forty times for nothing, the forty-first one is dismissed in ninety seconds. (This is the deeper argument I've made separately about why utilisation alerts shouldn't be paging in the first place.)
  • There is no shared definition of "user impact." Without one, the responder defaults to the proxy they can verify in one click — their own browser — because it's the only signal they feel safe reporting to leadership in a comment.

Fixing the triage isn't a training problem. It's a runbook, dashboard, and definition problem. Give the L1 a one-page incident view that surfaces the five questions above. Tighten the alert thresholds so they correlate with real user pain. Define user impact in numbers the responder can read in under a minute — RUM p95, error rate on the top three endpoints, count of tenants with degraded latency. Once those exist, the homepage stops being the test of last resort.
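
One way to pin that definition down, with thresholds that are purely illustrative and meant to be tuned per product, is a short checklist the responder can hold against the dashboard:

```python
# An illustrative, written-down definition of "user impact" for the L1 runbook.
# The numbers are examples, not recommendations; agree on your own and put them
# where the 02:14 responder will actually see them.

USER_IMPACT_THRESHOLDS = {
    "rum_p95_ms": 1500,               # real-user p95 above this on any top endpoint
    "top_endpoint_error_rate": 0.01,  # more than 1% errors on login, checkout, or the API
    "degraded_tenants": 1,            # one or more tenants with materially degraded latency
}

def has_user_impact(rum_p95_ms: float, error_rate: float, degraded_tenants: int) -> bool:
    """True if any threshold is breached; the alert comment should say which one."""
    return (
        rum_p95_ms > USER_IMPACT_THRESHOLDS["rum_p95_ms"]
        or error_rate > USER_IMPACT_THRESHOLDS["top_endpoint_error_rate"]
        or degraded_tenants >= USER_IMPACT_THRESHOLDS["degraded_tenants"]
    )
```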

What I tell teams to do this week

If you can't get to the full alert audit immediately, there are two moves I recommend that take less than a day each and meaningfully reduce this failure mode:

  1. Add user-facing signals to the alert payload itself. When the CPU alert fires, the notification should already include current p95 on the top three endpoints, current error rate, and a link to the per-tenant latency view. The responder should not have to go and find these. The alert should arrive carrying them (a sketch of what this can look like follows this list).
  2. Rewrite the runbook's first line. If it currently says "check the site is loading", replace that with "open the user-impact dashboard at <link> and confirm p95, error rate, and affected tenants before deciding impact." The runbook is what the responder reads at 02:14. Make it lead them to the right place.
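
A minimal sketch of the first move, assuming the alert passes through a webhook handler you control and that a Prometheus-style HTTP API is reachable from it. The URL, metric names, and dashboard link are placeholders, and the queries are the same illustrative ones as in the triage-view sketch above:

```python
import requests

PROMETHEUS_QUERY_URL = "http://prometheus.internal:9090/api/v1/query"  # placeholder address
TENANT_LATENCY_DASHBOARD = "https://grafana.internal/d/tenant-latency"  # placeholder link

# Illustrative queries; reuse whatever backs your triage dashboard.
USER_FACING_QUERIES = {
    "p95_by_endpoint": (
        "histogram_quantile(0.95, sum by (le, endpoint) ("
        "rate(http_request_duration_seconds_bucket[5m])))"
    ),
    "error_rate_by_endpoint": (
        'sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))'
        " / sum by (endpoint) (rate(http_requests_total[5m]))"
    ),
}

def enrich_alert(alert: dict) -> dict:
    """Attach current user-facing signals to the alert payload before it is paged out."""
    signals = {}
    for name, query in USER_FACING_QUERIES.items():
        resp = requests.get(PROMETHEUS_QUERY_URL, params={"query": query}, timeout=5)
        resp.raise_for_status()
        # Instant-query results: a list of {"metric": {...labels...}, "value": [ts, value]}.
        signals[name] = resp.json()["data"]["result"]
    annotations = alert.setdefault("annotations", {})
    annotations["user_facing_signals"] = signals
    annotations["tenant_latency_view"] = TENANT_LATENCY_DASHBOARD
    return alert
```

The particular shape doesn't matter; what matters is that the page the responder reads at 02:14 already contains the numbers they would otherwise have to go hunting for.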

Both of those changes can land on a Friday afternoon. Neither requires new tooling. Both will catch incidents that the current setup is quietly dismissing.

The line worth holding

A 200 OK on the homepage is not evidence of anything except that the homepage is up. Treating it as evidence of user health is the most expensive triage shortcut I see in this industry, and it almost always ends with a postmortem that begins "the alert fired and was dismissed."

The teams that handle these incidents well aren't smarter or faster. They've just stopped letting a fast-loading marketing page end the conversation.

Engagement.start()

If your L1 runbook for a CPU page is 'check the site loads', the runbook is the bug.

A Tracefox alert and runbook audit looks at every active production alert and asks two questions: what user-facing signal does it correspond to, and what does the linked runbook tell the 02:14 responder to look at. The ones that fail get rewritten, or the alert gets retired. The rotations that adopt this stop dismissing real incidents as quiet nights.