Opinion

The leading indicator you're not watching.

Most incidents are preceded by 5–15 minutes of degradation that nobody alerts on. The signal is in the data. The cardinality to see it and the alerting strategy to act on it are usually not. The gap between proactive and reactive operations is narrower than the conferences make it sound, and more concrete.

Tracefox · 6 min read

Every "we should be more proactive" conversation in engineering leadership eventually arrives at the same place: a vague aspiration that the team should somehow predict failures rather than react to them, with no concrete plan for how. The conversation moves on. Six weeks later, the next incident, the same conversation. The cycle is durable because the framing is wrong.

Proactive operations isn't a culture change. It's a small number of specific instrumentation and alerting decisions, taken on a small number of specific signals, on a small number of specific services. The teams that do this well aren't more disciplined than the ones that don't. They've just done the boring work of identifying which signals lead their incidents, and then alerting on them.

What the data actually looks like before an incident

Pull the last six P1 or P2 incidents from any production system that has been instrumented honestly. Walk backwards from the moment the page fired. What you find, with high reliability, is that for 5 to 15 minutes before the alert, one or more of the following had already started to drift:

  • P95 or P99 latency on a downstream service creeping up from baseline, not breaching threshold, just trending wrong. By the time the SLO burn-rate alert triggers, the trend has been in motion for twelve minutes.
  • Retry rate on the calling service elevated. The downstream is responding, just slowly enough that timeout-and-retry is becoming the path of least resistance. Retries are a leading indicator of cascade because they're literally the cascade beginning.
  • Queue depth on an async pipeline growing linearly. The consumer is almost keeping up, but not quite. The queue isn't full yet. It will be in eight minutes.
  • Connection pool utilisation climbing past its usual range. One service is holding connections longer than it should and hasn't released them. The pool will exhaust before the next deployment.
  • Error rate on a single tenant or region elevated by 3x against its own baseline, but invisible at the aggregate, because the affected segment is small. The aggregate error rate is fine. The blast radius is expanding inside it.

Every one of those is observable. Every one of those is, in our experience, sitting in the team's metrics or trace data already. None of them is being alerted on, because the alerting strategy was built around the moment of failure rather than the run-up to it.
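
To make just one of those concrete: the connection-pool signal reduces to comparing current utilisation against a quantile of its own recent history. A minimal PromQL sketch, with hypothetical gauge names standing in for whatever your client library actually exports:

```promql
# Fires when pool utilisation climbs above the 95th percentile of its
# own last six hours. Gauge names are illustrative.
(db_pool_connections_in_use / db_pool_connections_max)
  > quantile_over_time(
      0.95,
      (db_pool_connections_in_use / db_pool_connections_max)[6h:1m]
    )
```

The same shape covers most of the list: the current value on the left, a function of the value's own history on the right.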

Why the run-up isn't being watched

The reasons are unglamorous and consistent across teams:

1. The metrics are aggregated past the signal

P50 latency on the service is fine. P99 has been climbing for ten minutes. The dashboard shows the average, because that's what the metrics pipeline was wired to emit. The team is watching a number that, by construction, cannot show them what's happening. We've made the broader case against this elsewhere. Averages hide every distribution problem that matters.
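
If the service already exports a latency histogram, the fix is two queries, not a pipeline rebuild. A PromQL sketch, assuming a conventional `http_request_duration_seconds` histogram (the metric name is illustrative):

```promql
# What the dashboard shows: the mean, which flattens the tail.
sum(rate(http_request_duration_seconds_sum[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# Where the run-up actually lives: the P99, computed from the buckets.
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```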

2. The alert is on the threshold, not the trend

Retry rate alerts are typically configured as "fire when retries per second exceed N." The threshold is set high enough that it doesn't false-positive on normal jitter. By the time it fires, the cascade is in motion. The alert worth having is the one that fires when the retry rate has tripled against its own thirty-minute baseline. That's a qualitatively different alert, and not the one most teams have configured.
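
In query terms, the distance between those two alerts is one subquery. A sketch, with a hypothetical `http_client_retries_total` counter:

```promql
# The threshold alert most teams have: fires after the cascade starts.
sum(rate(http_client_retries_total[5m])) > 50

# The trend alert worth having: fires when the retry rate triples
# against its own trailing 30-minute baseline, whatever the absolute
# volume. The 3x multiplier is illustrative; tune it to your jitter.
sum(rate(http_client_retries_total[5m]))
  > 3 * avg_over_time(sum(rate(http_client_retries_total[5m]))[30m:1m])
```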

3. The signal exists at high cardinality, but not in the alerting layer

The single-tenant error spike is visible in the trace data, where every request is tagged with a tenant ID. It is invisible in the metrics data, where the tenant dimension was dropped because it would have multiplied the cardinality bill by 50,000. The team's alerts run against the metrics data. The signal lives somewhere they aren't looking.
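
There is a middle path that doesn't require a 50,000x metrics bill: errors are a tiny fraction of traffic, so keeping the tenant label on the error counter alone is usually affordable. A hedged sketch, with illustrative metric names:

```promql
# Assumes app_errors_total keeps the tenant label while the much
# larger request counters do not. Fires for any tenant running at 3x
# its own trailing one-hour error-rate baseline; clamp_min keeps a
# zero baseline from firing on the first stray error.
sum by (tenant) (rate(app_errors_total[5m]))
  > 3 * clamp_min(
      avg_over_time(sum by (tenant) (rate(app_errors_total[5m]))[1h:5m]),
      0.01)
```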

4. The alerting team and the SRE team don't share a model

Whoever wrote the alert set wrote it without sitting down with the engineers who run the on-call rotation and walking through the actual failure modes the system has historically had. So the alerts exist against textbook failure modes (CPU exhaustion, OOM, 5xx cliffs) and not against the failure modes this system actually exhibits, which are usually weirder, more specific, and more predictable than the textbook ones.

Lagging vs leading, made concrete

Lagging indicators are the moment the incident is real. Leading indicators are the run-up. The pairs are reasonably stable across systems:

  • Lagging: SLO burn-rate breach. Leading: P99 latency drift against rolling baseline, on the same service.
  • Lagging: error rate cliff. Leading: retry rate elevation at the calling layer.
  • Lagging: queue full / consumer lag SLO breach. Leading: queue depth growth rate exceeding its usual range over the last 30 minutes.
  • Lagging: connection pool exhaustion errors. Leading: pool utilisation above the 95th percentile of normal.
  • Lagging: aggregate error rate alert. Leading: any tenant, region, or endpoint at 3x its own baseline with low absolute volume.

Each of these can be a real alert. Each requires the cardinality to slice by the relevant dimension and the alerting tooling to compare against a rolling baseline, not a fixed threshold. The implementation isn't exotic. Prometheus with proper recording rules, or any of the modern backends, will do it. What's missing isn't capability. It's the work to identify, for your specific system, which leading indicator pairs to actually wire up.
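
The first pair, for instance, wires up as one recording rule plus one baseline-relative alert. A sketch of what that could look like in a Prometheus rules file, with names and multipliers that are illustrative rather than prescriptive:

```yaml
groups:
  - name: leading-indicators
    rules:
      # Record the 5m P99 so the baseline comparison stays cheap.
      - record: service:request_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
      # Compare against the rolling baseline, not a fixed number.
      - alert: P99DriftAgainstBaseline
        expr: |
          service:request_latency_p99:5m
            > 1.5 * avg_over_time(service:request_latency_p99:5m[30m])
        for: 5m
        labels:
          severity: low
        annotations:
          summary: "P99 drifting against its 30m baseline; investigate before the SLO burns"
```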

In retrospect, the leading indicator is usually obvious. The practice is moving from "we should have seen that coming" to "we have an alert for that now", and doing it within a week of the incident, not a quarter.

The minimum viable practice

A team can move from reactive to meaningfully proactive in a single quarter, on a single service, with the following sequence:

  1. Pick one tier-0 service. Pull its last six incidents from the postmortem archive.
  2. For each incident, walk backwards through the telemetry. Identify the earliest moment any signal in the system was visibly different from its baseline. Write it down.
  3. Cluster the leading indicators across the six incidents. Typically 2–4 indicators account for the run-up to all of them.
  4. Configure pre-incident alerts on those 2–4 indicators, against rolling baselines, with low severity and a runbook that says "investigate now; page if it doesn't resolve in 10 minutes." (A sketch follows this list.)
  5. Run for a quarter. Track which pre-incident alerts fired ahead of an actual incident, which fired without one, and which incidents the pre-incident alerts missed. Tune.
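
Step 4, concretely, is a handful of rules shaped like this: baseline-relative, low severity, with the escalation policy living in the annotation rather than the pager. Names and numbers are illustrative:

```yaml
- alert: QueueDepthGrowingAgainstBaseline
  # deriv() is the per-second slope of the gauge. Compare the growth
  # rate to its own trailing 30m baseline rather than to a fixed
  # depth; clamp_min stops near-zero baselines from firing on jitter.
  expr: |
    deriv(queue_depth[10m])
      > 3 * clamp_min(avg_over_time(deriv(queue_depth[10m])[30m:1m]), 1)
  for: 5m
  labels:
    severity: low   # a notification, not a page
  annotations:
    runbook: "Investigate now; page the rotation if it doesn't resolve in 10 minutes."
```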

By the end of the second quarter, on a service that previously had only SLO burn alerts, the rotation will be catching incidents 5–15 minutes earlier with high reliability. The cumulative effect on the user-facing severity of incidents is significant. The difference between an incident that customers noticed and one they didn't is, very often, exactly that 10 minutes.

What this isn't

It isn't anomaly detection. It isn't ML on telemetry. It isn't a platform purchase. The teams selling those things will tell you proactive operations requires their product. It doesn't. It requires reading your own incident history, identifying the specific signals that lead your incidents, and writing a small number of alerts that wouldn't have existed otherwise.

The work is unglamorous. It's also the part nobody outside the team can do for you, because the leading indicators are specific to your system, not generic to the industry. Every off-the-shelf "predictive operations" product is approximating, badly, the work this post is asking you to do yourself.

The frame that makes it real

Stop asking "are we proactive enough?" It's not a question that resolves. Ask instead: "what signal would have caught the last incident ten minutes earlier, and is there an alert on it now?" That question has a yes-or-no answer. If the answer is no, you have a piece of work. If the answer is yes, do it for the second-most-recent incident. Then the third.

Six iterations of that loop is the difference. There isn't a shortcut, and there isn't a vendor for it. There is just the leading indicator you're not watching, sitting in your data, waiting for someone to wire up the alert.

Engagement.start()

The leading indicator is almost always already in your data. The team just isn't alerting on it because nobody has looked.

The Tracefox engagement includes a leading-indicator review on the top three services: pulling the last six incidents and walking back through the telemetry to find where the signal was visible before the page fired. The output is a small set of pre-incident alerts that turn what was a 02:47 page into a 14:30 investigation.