Opinion

High CPU is not an incident.

Resource utilisation is a diagnostic signal, not a paging signal. The teams burning out their on-call rotations on CPU thresholds are paying for the wrong instinct, taught by a generation of monitoring tools that never finished evolving. Page on impact. Investigate with utilisation.

Tracefox · 5 min read

The on-call engineer's pager goes off at 03:14. The alert says CPU on orders-api-prod-7 is above 85% for ten minutes. They open the laptop, log in to the platform, check the latency dashboard. P99 is at 180ms, same as it always is. Error rate is zero. Throughput is normal. They stare at the screen for two minutes, acknowledge the alert, and go back to bed annoyed.

This happens four times that week. Across the org, it happens about three hundred times that week. Every one of those pages is the team paying for a bad instinct that the monitoring tools have been teaching them for fifteen years: that high resource utilisation means something is wrong.

It doesn't. It just means the resource is being used.

The argument in one paragraph

Resource utilisation (CPU, memory, disk I/O, queue depth) is a diagnostic signal. It helps you explain why something is happening, once you already know something is happening. It is almost never a useful paging signal in its own right. The thing you should page on is the user-facing impact: latency, error rate, availability, the business outcomes those defend. If you have those signals and they're healthy, the CPU graph is a curiosity. If you don't have those signals, the CPU graph isn't filling the gap; it's hiding it.

Why this instinct exists

The alerting patterns most teams inherit were designed in an era when:

  • Servers were long-lived, named, and sometimes pets.
  • Capacity was provisioned manually, weeks ahead.
  • If a box ran out of CPU, the box ran out of capacity, and you had no way to add more without ordering hardware.
  • There was no application-level latency telemetry to alert on instead.

In that world, "CPU above 85%" was a useful proxy for "something bad is about to happen and you need a human." That world has not existed for a decade. The auto-scaling group adds another instance. The Kubernetes deployment spins up another pod. Latency telemetry is available in every serious monitoring stack. The proxy is not just unnecessary; it's now actively misleading, because it fires constantly in systems that are behaving exactly as designed.

Modern systems are supposed to drive utilisation up. That's the point. A service idling at 30% CPU around the clock is over-provisioned by roughly 2x against a sensible steady-state target of around 60%. The CFO is paying for headroom you don't need to catch incidents that aren't happening.

The two failure modes of CPU alerting

Both of these show up in nearly every alert audit we run:

Mode one: pages with no impact

The alert fires. The engineer investigates. Nothing is wrong. They acknowledge and go back to whatever they were doing. This happens often enough that the team has unconsciously trained itself to treat utilisation pages as low-credibility. The rare time the page genuinely correlates with user impact, the response is slower because nobody really believes the alert until they've checked latency themselves.

Mode two: missed incidents the CPU graph hid

The latency is degrading. The error rate is climbing. CPU is at 60% and looks fine. No alert fires, because the alerting strategy was built around resource thresholds. The customer-facing problem persists for an hour before someone notices, and the postmortem records "we didn't have an alert for this", because the alerts that would have caught it were never written, in favour of the dozens that were watching CPU.

Both modes share the same root cause: the team is alerting on the wrong thing. Mode one is the cost of false positives. Mode two is the cost of false negatives. They're related: every minute spent on noise is a minute not spent building the alert that would have caught the real issue.

What to alert on instead

The alerts that should wake people up are the ones tied to user impact, not infrastructure state. A working alert set, after an audit, looks like this:

  • SLO burn-rate alerts on the four or five user-facing services that matter: checkout, search, login, the API tier the customer integrates with. The guide covers the maths; a minimal sketch of the calculation follows this list.
  • Tail-latency anomaly alerts on the same services. P99 drift is the leading indicator most teams underweight; a rough drift check is sketched at the end of this section.
  • Saturation alerts on a small number of resources where saturation genuinely is the failure mode that can't be detected at the user-facing layer: typically queue depth on async pipelines, connection pool exhaustion, message broker lag.
  • External availability checks against the user-facing surfaces, from at least two regions, every minute.
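
The burn-rate arithmetic behind the first bullet is compact enough to sketch. A minimal illustration in Python, assuming a 99.9% availability SLO measured over 30 days and the common fast-burn multi-window pattern; the window lengths, the 14.4 factor, and the names here are illustrative choices, not a prescription:

    # Sketch of a multi-window SLO burn-rate check.
    # Assumes a 99.9% availability SLO over 30 days; the 14.4 factor and the
    # 5-minute / 1-hour window pairing are common defaults, not fixed rules.
    SLO_TARGET = 0.999
    ERROR_BUDGET = 1 - SLO_TARGET        # 0.1% of requests may fail

    def burn_rate(bad_requests: int, total_requests: int) -> float:
        """How fast the error budget is being spent: 1.0 = exactly on budget."""
        if total_requests == 0:
            return 0.0
        return (bad_requests / total_requests) / ERROR_BUDGET

    def should_page(short_window: tuple[int, int], long_window: tuple[int, int]) -> bool:
        """Page only when both a short and a long window are burning fast.

        The windows are (bad_requests, total_requests) over, say, the last
        5 minutes and the last hour. Requiring both filters out brief blips
        while still catching a genuinely fast burn.
        """
        FAST_BURN = 14.4   # would exhaust a 30-day budget in roughly two days
        return (burn_rate(*short_window) >= FAST_BURN
                and burn_rate(*long_window) >= FAST_BURN)

    # 1.8% of requests failing in both windows burns budget ~18x too fast: page.
    print(should_page((180, 10_000), (2_100, 120_000)))   # True

The short/long pairing is what keeps this from paging on a thirty-second blip; choosing the windows and factors well is the part the guide goes into properly.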

What's not on that list: CPU thresholds, memory thresholds, disk I/O thresholds, generic "5xx count > 100" rules. Those become dashboards. Dashboards are fine. They are not alerts.
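
"Drift" in the second bullet is a judgement call, but the shape of the check is simple: compare the recent tail against a trailing baseline. A hedged sketch, where the window choices and the 1.5x threshold are placeholders to tune, not recommendations:

    # Sketch of a P99-drift check: recent tail latency vs a trailing baseline.
    # Window sizes and the drift factor are illustrative only.
    from statistics import quantiles

    def p99(latencies_ms: list[float]) -> float:
        """99th percentile of a sample of request latencies."""
        return quantiles(latencies_ms, n=100)[98]

    def p99_drifting(recent: list[float], baseline: list[float],
                     drift_factor: float = 1.5) -> bool:
        """True when the recent P99 sits well above the baseline P99.

        `recent` might be the last 15 minutes of latencies; `baseline` the
        same window yesterday, or a trailing 24-hour sample.
        """
        return p99(recent) > drift_factor * p99(baseline)

A service whose P99 has crept from 180ms to 300ms trips this check while its error rate, throughput, and CPU graphs can all still look perfectly ordinary.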

The kitchen analogy, retired

There is a tempting analogy where CPU is the kitchen, users are the diners, and a busy kitchen is fine if the food keeps coming out. We've used it too. It's not wrong, but it understates the case.

The accurate version is: CPU isn't even in the dining room. It's the electricity bill for the kitchen. Whether the bill is high tells you something about the kitchen's economics. It tells you nothing about whether the diners are happy. Pages that wake humans up should be about the diners.

The team that retires its CPU alerts on Monday gets a quieter on-call rotation by Friday. They also discover, on Tuesday, that there are real customer-facing failure modes nothing was alerting on, and now they have the engineering attention to fix that, because it isn't being spent on noise.

The transition, in practice

Teams resist this because the resource alerts feel like the safety net. Removing them feels reckless. The order of operations that makes it not reckless:

  1. Define SLOs on the user-facing services first. Without these, you genuinely don't have a substitute for the resource alerts, and removing them is a worse decision.
  2. Wire up SLO burn-rate alerts and let them run in parallel with the existing CPU set for two weeks. Watch which fires first when something genuinely goes wrong; a rough way to keep score is sketched after this list. The answer is, with high reliability, the SLO alert.
  3. Retire the CPU alerts. Keep the dashboards. Keep the data. Stop paging on it.
  4. Audit the result. The on-call rotation should be quieter: fewer pages per week, more of them being real. If it isn't, the SLO alerts are tuned wrong, not the resource alerts.
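
The step-two comparison is easy to eyeball and easy to fool yourself about, so keep an actual tally. A sketch of one way to score the parallel run, assuming you can export alert firing times and incident start times from your tooling; the record shapes and names below are made up for illustration:

    # Score the two-week parallel run: for each real incident, which rule set
    # fired first? Firing lists and incident times are assumed exports from
    # whatever alerting and incident tooling is in place.
    from datetime import datetime, timedelta

    def first_firing(firings: list[datetime], incident_start: datetime,
                     window: timedelta = timedelta(hours=1)) -> datetime | None:
        """Earliest firing within an hour either side of the incident start."""
        nearby = [t for t in firings if abs(t - incident_start) <= window]
        return min(nearby, default=None)

    def score_parallel_run(incidents: list[datetime],
                           slo_firings: list[datetime],
                           cpu_firings: list[datetime]) -> dict[str, int]:
        """Count, per incident, which rule set would have paged first."""
        tally = {"slo_first": 0, "cpu_first": 0, "neither": 0}
        for start in incidents:
            slo = first_firing(slo_firings, start)
            cpu = first_firing(cpu_firings, start)
            if slo is not None and (cpu is None or slo <= cpu):
                tally["slo_first"] += 1
            elif cpu is not None:
                tally["cpu_first"] += 1
            else:
                tally["neither"] += 1
        return tally

Firings in either list that map to no incident at all are the other half of the audit: that is the noise each rule set would have paged a human for.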

The leadership angle on alert hygiene is the conversation that has to happen first if this is going to stick. Without organisational permission to remove alerts, the audit doesn't survive its own first incident.

The line worth holding

Page on impact. Investigate with utilisation. Don't confuse the two.

The CPU graph is a useful debugging surface. It is not a useful paging surface. The teams that internalise the difference get a quieter rotation, a sharper investigation toolkit, and an alerting set that actually represents what the business cares about. The teams that don't internalise it keep paying for the wrong instinct, one 03:14 page at a time.

Engagement.start()

Day one of an alert audit usually cuts 30–50% of volume by retiring CPU and memory thresholds. Nothing real gets missed.

The Tracefox alert audit classifies every active alert against the SLO it claims to defend. Resource alerts that don't map to a user-facing objective get retired. The on-call rotation gets quieter that week. Nobody has yet asked us to put the CPU alerts back.