Field notes

The customer told us before our monitoring did.

A support ticket lands at 14:22 with a screenshot of a 504. The SRE team checks the dashboards. Everything is green. Twenty minutes of digging later, they confirm a partial outage that started at 14:13. Nine minutes during which the only people who knew were the paying users.

Ken Tan · 6 min read

Some of the most expensive numbers in observability are the ones that never get measured. Detection latency — the time between when a user starts having a worse experience and when someone inside your company notices — is the most expensive of those, in my experience.

I've sat in postmortems where the timeline reads:

  • 14:13: latency on the checkout endpoint begins to climb.
  • 14:14: error rate on checkout reaches 12%.
  • 14:18: first customer support ticket lodged.
  • 14:22: support agent flags the ticket to engineering.
  • 14:24: SRE on-call pulled in.
  • 14:31: incident declared.

Eighteen minutes from "users are getting hurt" to "we have an incident open." During those eighteen minutes the system kept breaking, and the only reason it eventually stopped breaking was a customer who had the patience to fill out a support form. The customers who didn't bother are still out there, just buying less.

Why the gap is so often that big

The gap comes from some combination of three causes, and almost every team I audit has at least two of them:

  • The SLI doesn't measure the right thing. The latency SLI is on a path the failure didn't touch. The error rate SLI counts HTTP 5xx but the failure mode is a 200 with an empty body. The SLI is technically green during a real outage.
  • The aggregation hides the segment. One tenant is in pain. Their experience is averaged into a global SLI that moves by 0.1%, which doesn't trip the alert. The SLI is built for the median user; the failure mode is in the long tail.
  • The alert threshold is set against the wrong baseline. The error rate climbed from 0.2% to 1.8%. The threshold fires at 2%. The trend was clear from minute one but the gate caught it eight minutes later. The gate was technically correct and operationally late.

None of these are unfixable. All of them require treating detection latency as a first-class metric, which most teams don't.
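
The second cause, aggregation hiding the segment, is worth seeing as arithmetic. The sketch below uses invented numbers and a plain dict where a real SLI would live in a metrics backend, but the effect is the same: a small tenant failing at a 40% error rate barely moves a global rate computed over everyone's traffic.

```python
# Invented numbers: three tenants over the same five-minute window.
requests = {"tenant-a": 120_000, "tenant-b": 95_000, "tenant-c": 400}
errors   = {"tenant-a": 240,     "tenant-b": 190,     "tenant-c": 160}

global_error_rate = sum(errors.values()) / sum(requests.values())
print(f"global error rate: {global_error_rate:.2%}")   # ~0.27%, comfortably under a 2% threshold

for tenant, total in requests.items():
    print(f"{tenant}: {errors[tenant] / total:.2%}")    # tenant-c: 40.00%
```

The fix is usually not a lower global threshold; it's an SLI keyed by tenant (or region, or plan tier) so the segment in pain is its own time series.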

The number you should be tracking

I ask every client to compute, for each incident in the last quarter, the gap between two timestamps:

  1. The first moment a user-visible signal could have been raised. (In practice: when the SLI data shows the system was already degraded.)
  2. The first internal acknowledgement. (When someone marked the incident, or paged the on-call, or wrote in the channel.)

The gap is detection latency. Plot it as a histogram across the last twenty incidents. The shape of the histogram will tell you something about the team that no other metric will. A team with median detection latency under two minutes has a working observability stack. A team with median over ten minutes has gaps that need to be named. A team with bimodal detection — some incidents at one minute, some at thirty — usually has different SLI coverage for different services, and the slow ones are the services with the worst telemetry.
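
A minimal sketch of that computation, assuming only that you can export the two timestamps per incident from wherever you track incidents; the timestamps and layout below are invented stand-ins for whatever your tracker produces:

```python
# Detection latency per incident: first user-visible degradation vs. first internal acknowledgement.
from datetime import datetime
from statistics import median

incidents = [
    # (first user-visible degradation, first internal acknowledgement)
    ("2024-05-02T14:13:00", "2024-05-02T14:31:00"),
    ("2024-05-09T03:41:00", "2024-05-09T03:42:00"),
    ("2024-05-17T11:02:00", "2024-05-17T11:29:00"),
]

def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

latencies = [minutes_between(start, ack) for start, ack in incidents]
print(f"median detection latency: {median(latencies):.1f} min")

# Crude histogram: five-minute buckets are enough to make bimodality visible.
for lo in range(0, 35, 5):
    count = sum(lo <= m < lo + 5 for m in latencies)
    print(f"{lo:>2}-{lo + 5:<2} min | {'#' * count}")
```

The buckets are deliberately coarse; the point is the shape of the distribution, one-minute detections sitting next to thirty-minute ones, not the precision.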

What "the customer told us first" looks like as a class

When the customer beats the monitoring, the failure mode is almost always one of these patterns:

  • A new feature shipped without instrumentation. The endpoint exists in production but doesn't appear in any SLI. The first symptom anyone can detect is a support ticket.
  • A third-party dependency that the system relies on but doesn't measure. The user experiences "the page won't load." The internal signal is fine because the third party is outside the trace boundary.
  • A client-side error. The browser logs a JavaScript exception that prevents the form from submitting. The server is fine. The user is not.
  • A regional outage at the CDN or DNS layer. Internal regions look healthy. Users in one country are experiencing 60-second timeouts. There's no synthetic running from the affected region.

All four are detectable with the right instrumentation. The first two are SLI gaps. The third is a RUM gap. The fourth is a synthetic-coverage gap. Each gap, once named, gets fixed in a sprint or two.
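
For the fourth gap, the probe itself is the easy part; what matters is where it runs from. A sketch, with a placeholder URL, that leaves the per-region scheduling to whatever hosts it (a synthetics platform, a cron job on a VM in the affected region):

```python
import time
import urllib.error
import urllib.request

def probe(url: str, timeout_s: float = 10.0) -> dict:
    """One synthetic check: fetch the URL, record latency, status, and body size."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status, body_bytes = resp.status, len(resp.read())
    except OSError as exc:  # URLError, HTTPError (a 5xx), and timeouts all subclass OSError
        return {"ok": False, "error": str(exc), "elapsed_s": round(time.monotonic() - start, 3)}
    elapsed = round(time.monotonic() - start, 3)
    # "ok" requires a 200 *and* a non-empty body -- the silent failure mode from earlier.
    return {"ok": status == 200 and body_bytes > 0,
            "status": status, "bytes": body_bytes, "elapsed_s": elapsed}

if __name__ == "__main__":
    # Placeholder URL; point it at the user-facing path you actually care about.
    print(probe("https://example.com/checkout"))
```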

The intervention

The intervention I recommend has three parts. None of them are technically sophisticated. They're operational discipline.

  1. Compute and post detection latency for every incident. Put the number in the postmortem template. Make it a section, not a footnote. The act of writing it down changes what the team optimises for.
  2. For every "customer told us first" incident, write a single sentence about which signal would have caught it. One sentence. Then add that signal to the backlog. Most teams have a backlog of five or six of these by the end of the quarter, and the cumulative effect of shipping them is a detection-latency curve that drops noticeably.
  3. Add user-reported incidents as a category in your tracking, distinct from internally detected (a sketch of the ratio computation follows this list). The ratio of one to the other is a leading indicator of observability maturity. A team where 30% of incidents start with a customer ticket has a different problem from a team where it's 5%.
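
The third part is the cheapest to automate. A sketch, assuming each incident record carries a detection-source field; the field name and values here are placeholders for whatever your tracker actually stores:

```python
# Ratio of customer-reported to internally detected incidents.
from collections import Counter

incidents = [
    {"id": "INC-101", "detected_by": "alert"},
    {"id": "INC-102", "detected_by": "customer-ticket"},
    {"id": "INC-103", "detected_by": "alert"},
    {"id": "INC-104", "detected_by": "on-call-noticed"},
    {"id": "INC-105", "detected_by": "customer-ticket"},
]

counts = Counter(i["detected_by"] for i in incidents)
customer_first = counts["customer-ticket"]
print(f"customer-told-us-first: {customer_first}/{len(incidents)} ({customer_first / len(incidents):.0%})")
```

Recompute it each quarter; the direction it moves is the signal.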

The line worth holding

The customer is allowed to find your bugs. The customer is not supposed to find your outages. The gap between user pain and internal awareness is a measurable number. The teams who measure it shrink it. The teams who don't keep apologising in support tickets and calling it operational reality.

Engagement.start()

If your customer beats your monitoring, the gap is a number you can shrink. Most teams have never written it down.

A Tracefox detection-latency review pulls the last quarter's incidents and computes, for each, the gap between the first user-visible symptom and the first internal signal. The output is a histogram and a list of specific gaps to close — usually two or three SLI definitions, one cardinality fix, and a synthetic check that should have existed.