Opinion

AI SRE without good telemetry is theatre.

Every vendor at the conferences is shipping an AI agent for incident response. Almost none of them are talking about the data those agents are reading. They should be, because the data is the work.

Tracefox · 5 min read

Every vendor stand at every observability conference this year had an AI SRE agent in the demo. The agent pulls a trace. It summarises three Slack channels. It correlates the deploy timeline with the metric anomaly. It proposes a rollback. The room nods, slightly nervously.

The demos are genuinely impressive. The agents are doing real work: correlating signals, surfacing context, narrowing down likely root causes faster than a human on the same data. Some of them will be good enough to ship.

What none of the demos show is where the data came from.

The data the demo runs on doesn't exist in your production

In the demo, the trace is propagated end-to-end. The logs are structured, with trace IDs injected. The metrics have consistent labels (service, env, region, team) that the agent can pivot on. The error budget is calculated and current. Every alert has a runbook the agent can read.

That data is curated. It's the synthetic environment the vendor's product manager set up for the demo. It's not what's emitting from the company watching the demo, where the trace context drops at the message-broker hop, the logs are half-structured at best, and three of the metric labels haven't been consistent across services for two years.

The agent is going to read the same telemetry your humans are reading. If your humans are taking two hours to find the cause of a P1, the agent will too, only with confident summaries this time.

Garbage in, agentic garbage out

The fundamental thing nobody is selling is the layer underneath. AI SRE doesn't transcend the data; it depends on it. An agent reasoning across incomplete traces and inconsistent log shapes is going to produce reasoning that's incomplete and inconsistent. It will look authoritative (these models always do) and it will be wrong with conviction.

The teams getting genuine value from AI-assisted incident response in 2026 are not the teams that bought the agent first. They're the teams that did the boring work first, and bolted the agent onto a stack that was already producing the right data.

What "the right data" actually means

Before any AI agent earns its keep on incident response, the telemetry beneath it has to be four things:

Complete

The four Golden Signals (latency, traffic, errors, saturation) emitting for every service you run. Histograms, not averages. Successful and failed paths separated. No services with "we'll instrument that one later." The primer is here if you need the long version.
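
A minimal sketch of what that looks like at the instrumentation layer, assuming the OpenTelemetry Python API; the meter, instrument, and attribute names are illustrative, not prescriptive:

    from opentelemetry import metrics

    meter = metrics.get_meter("checkout-api")  # illustrative service name

    # A histogram keeps the latency distribution; an average would hide the tail.
    request_duration = meter.create_histogram(
        "http.server.duration",
        unit="ms",
        description="Server-side request duration",
    )

    def record_request(route: str, duration_ms: float, succeeded: bool) -> None:
        # Successful and failed paths recorded separately, so failure latency
        # can't be averaged away under the healthy traffic.
        request_duration.record(
            duration_ms,
            attributes={
                "http.route": route,
                "outcome": "success" if succeeded else "failure",
            },
        )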

Correlated

Trace context propagated through every service hop, including the ones that silently drop it (message brokers are the usual suspects). Trace IDs injected into every structured log line. The agent can pivot from a metric anomaly to the request that caused it without crossing tools. If the agent has to cross tools, it can't reason across the gap.
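　
Both halves of that are mechanical once you commit to them. A hedged sketch using the OpenTelemetry Python propagation API; the broker client, message shape, and service name are hypothetical stand-ins:

    import json
    import logging

    from opentelemetry import trace
    from opentelemetry.propagate import extract, inject

    tracer = trace.get_tracer("orders-service")  # illustrative service name
    log = logging.getLogger("orders-service")

    def publish(body: dict, broker) -> None:
        # Producer side: write W3C traceparent/tracestate into the message
        # headers, because the broker won't carry trace context for you.
        headers: dict = {}
        inject(headers)
        broker.publish({"headers": headers, "body": body})  # hypothetical client

    def consume(message: dict) -> None:
        # Consumer side: restore the context so this span joins the original
        # trace instead of starting an orphaned one.
        ctx = extract(message.get("headers", {}))
        with tracer.start_as_current_span("process-order", context=ctx) as span:
            trace_id = format(span.get_span_context().trace_id, "032x")
            # Trace ID injected into the structured log line: the pivot from
            # log to request happens without crossing tools.
            log.info(json.dumps({"event": "order_processed", "trace_id": trace_id}))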

Bounded in cardinality

Labels small, consistent, and query-cheap. An agent that can't get a fast answer to "what's the error rate on checkout-api right now" can't reason about anything downstream of that. High-cardinality user IDs and request IDs belong on traces, not on metrics; every backend penalises the alternative.
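
One way to hold that boundary, sketched with the same OpenTelemetry Python API; the metric, attribute, and service names here are assumptions for illustration:

    from opentelemetry import metrics, trace

    meter = metrics.get_meter("checkout-api")
    tracer = trace.get_tracer("checkout-api")

    # Metric labels stay small and bounded: service, env, outcome. Nothing else.
    checkout_requests = meter.create_counter(
        "checkout.requests",
        description="Checkout attempts by outcome",
    )

    def handle_checkout(user_id: str, request_id: str) -> None:
        with tracer.start_as_current_span("checkout") as span:
            # High-cardinality identifiers ride on the trace, where per-request
            # detail belongs, not on a label that multiplies the series count.
            span.set_attribute("user.id", user_id)
            span.set_attribute("request.id", request_id)

            checkout_requests.add(
                1,
                attributes={"service": "checkout-api", "env": "prod", "outcome": "success"},
            )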

Governed

SLOs agreed with the business. Error-budget policy written down. Runbooks linked from every active alert. The agent's outputs are only useful if the team has already decided what's worth acting on. Otherwise the agent is going to surface "errors elevated" alerts that the team has been ignoring for six months, only now with a chart attached.
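
The written-down part can be smaller than teams expect. A sketch in Python under assumed numbers; the target, window, freeze threshold, and runbook URL are placeholders, not recommendations:

    from dataclasses import dataclass

    @dataclass
    class SLO:
        service: str
        sli: str          # what "healthy" means for this service
        target: float     # e.g. 0.999
        window_days: int  # e.g. 28
        runbook_url: str  # linked from the alert that pages on this SLO

    def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
        """Fraction of the window's error budget still unspent (1.0 = untouched)."""
        if total_events == 0:
            return 1.0
        allowed_bad = (1.0 - slo.target) * total_events
        actual_bad = total_events - good_events
        if allowed_bad == 0:
            return 0.0 if actual_bad > 0 else 1.0
        return max(0.0, 1.0 - actual_bad / allowed_bad)

    def release_allowed(budget_remaining: float, freeze_below: float = 0.10) -> bool:
        # The written-down policy: releases freeze when less than 10% of the
        # budget is left. The agreement with product is the policy, not this code.
        return budget_remaining >= freeze_below

    checkout_slo = SLO(
        service="checkout-api",
        sli="availability",
        target=0.999,
        window_days=28,
        runbook_url="https://runbooks.example.com/checkout-api/availability",
    )

The point isn't the code. It's that the target, the window, and the freeze rule exist somewhere an agent, or a new engineer, can read them.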

The order of operations

The order matters. Genuinely. Skipping ahead to the agent because it's the interesting bit is the most expensive mistake a platform team can make this year, because the agent is going to amplify whatever's underneath. If what's underneath is a partially instrumented service mesh with inconsistent labels, the agent will amplify that into confidently wrong recommendations delivered at machine speed.

The teams getting it right are doing the unsexy work first:

  1. Pick an instrumentation standard (almost always OpenTelemetry) and apply it consistently.
  2. Get the four Golden Signals emitting reliably for every service.
  3. Standardise log structure with trace context injected.
  4. Write down what "healthy" means per service: SLO, target, window.
  5. Agree an error-budget policy with product. Sign it.
  6. Then bolt on the agent.

The agent is leverage on top of work. It is not the work.

What we see on engagements

Every Tracefox assessment in the last six months has had at least one conversation about AI SRE. The pattern is consistent: the teams asking when the agents will fix their incident response are scoring at maturity band 1 or 2. That's exactly where the data isn't yet good enough for the agent to be useful, and exactly where they would benefit most from doing the data work first.

The honest answer we give: the agents won't fix this. Not until your data is complete, correlated, bounded, and governed. Otherwise you're paying for a slightly faster path to the same wrong conclusion.

The unsexy work (Golden Signals, trace propagation, naming conventions, recording rules, SLO sign-off) is what makes the AI work. Skip it and the demos won't translate to your production. Build it and the agents will eventually accelerate teams that already know what they're doing.

Data first. Then context. Then AI.

Engagement.start()

When the question is "when will AI SRE save us", the maturity score is usually 1 or 2.

The agent is leverage on top of work. The work is Golden Signals, trace propagation, SLOs, and a budget policy. Tracefox runs the assessment that tells you which of those you actually have, and which ones the agent is going to embarrass you on.