Stop calling it observability if you don't have traces.
The vendor rebrand of 'observability' as logs + metrics + (optional) traces has done real damage. If you think you have an observability platform and you don't have traces, what you have is well-funded monitoring. The third pillar isn't a feature; it's the load-bearing one.
The bait-and-switch is everywhere. Open any vendor's "Observability Platform" pricing page in 2026. Three pillars repeated like a mantra: logs, metrics, traces. Now check what your production systems are actually emitting. Logs: yes. Metrics: yes. Traces: a half-instrumented attempt that drops context at the message broker and hasn't been touched in eighteen months.
You bought an observability platform. You deployed monitoring. They are not the same thing.
The original definition is more useful than the rebranded one
Charity Majors' framing, the one the term was built on, is worth restating: observability is the ability to ask arbitrary, high-cardinality, high-dimensional questions of your system without redeploying it. Why is request 4bf92f3577b34da6 slow? Why is checkout failing for users on Android in Sydney since 14:42 UTC? What's different about the one tenant whose latency tripled?
Logs and metrics are necessary for that. They aren't sufficient. Logs without trace context can answer some of those questions, but only after you've already narrowed down which user, which time, which tenant. Metrics aggregate by definition; they don't preserve the per-request dimensionality you need for the question to make sense in the first place.
Traces are how you answer the questions you didn't anticipate. Without them, "observability" collapses into the older, narrower discipline: answering the questions you thought to ask yesterday. Which is also fine. Just call it monitoring.
Why traces don't get done
Tracing is the hardest of the three pillars to deploy properly. The instrumentation has to land in every service. Context has to propagate through every hop, including the message broker, the worker pool, and the third-party SDK that strips headers by default. Then you need a backend that can store and query the volume without bankrupting you.
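The broker hop is where most deployments quietly break, and it's worth seeing how small the actual fix is. Below is a minimal Python sketch using OpenTelemetry's propagation API; the broker client (`broker.publish`) and message shape are hypothetical stand-ins for whatever you actually run.

```python
# A minimal sketch of carrying W3C trace context across an async hop.
# `broker.publish` and the `message` object are hypothetical stand-ins.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout-worker")

def publish_order(broker, order: dict) -> None:
    headers: dict = {}
    # Serialise the current span context into W3C `traceparent` headers
    # and ship them alongside the payload.
    inject(headers)
    broker.publish("orders", payload=order, headers=headers)

def handle_order(message) -> None:
    # Rebuild the remote context from the message headers, so the consumer
    # span joins the producer's trace instead of starting a new one.
    ctx = extract(message.headers)
    with tracer.start_as_current_span("process-order", context=ctx):
        ...  # business logic
```

The same inject/extract pair works for HTTP headers, queue metadata, anything dict-shaped. The hard part isn't the code; it's making sure nothing between the two ends strips the `traceparent` key.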
That's a lot of work. Logs and metrics, by comparison, mostly Just Work: turn on an agent and they appear. Vendors capitalised on this asymmetry. They sold "observability" as a category that included logs and metrics platforms with traces tacked on as a feature you'll get to eventually. Eventually keeps not happening.
What "having traces" actually looks like
Functioning distributed tracing isn't "we deployed Jaeger and looked at it once." It looks like:
- W3C TraceContext propagated through every service-to-service call, including async hops via message brokers.
- `trace_id` injected into every structured log line your applications emit, so a log search produces a one-click pivot to the trace (a sketch of the mechanism follows below).
- A backend (Tempo, Honeycomb, X-Ray, Lightstep, take your pick) ingesting at production volume with sub-second query latency.
- Engineers reaching for the trace view first when debugging, not last.
- Sampling strategy understood and documented, not "we sample at 1% because that's the default."
If you're missing any of those, what you have is an unfinished trace deployment. Which is most teams, including teams that have a budget line item for "observability" running into seven figures. The platform was bought; the discipline wasn't.
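For the log pivot specifically, the mechanism is small enough to show. Here is a minimal sketch using Python's stdlib logging plus the OpenTelemetry API; OTel's logging instrumentation can do this automatically, this just makes the mechanics visible:

```python
# A minimal sketch of injecting trace_id into structured log lines
# via a stdlib logging filter.
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        # format_trace_id renders the 128-bit id as 32 hex chars,
        # matching what your trace backend displays.
        record.trace_id = (
            trace.format_trace_id(ctx.trace_id) if ctx.is_valid else "-"
        )
        return True

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceContextFilter())
logging.getLogger().addHandler(handler)
```

Every log line now carries the same trace id your backend displays, which is exactly what turns a log search into a one-click pivot.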
Vendors will keep telling you that monitoring with extra steps is observability. It isn't. The work to close the gap is yours, not theirs, and they have no commercial incentive to mention this.
Why this matters more in 2026
The vendors are now selling AI SRE on top of the same architecture. The agents need traces to do anything useful. Without them, they're correlating logs with metrics, which is exactly the partial-information problem humans have already failed at. We wrote about that here: the data is the work, and the agent is leverage on top of the work.
If you only invest in two pillars, the AI gets the same partial picture you have today, only faster and more confidently wrong. The leverage is in the third pillar, and the third pillar is the one nobody is talking about because it's also the one that takes real engineering to land.
What to do about it
The order of operations is unsurprisingly the same as the rest of the methodology:
- Pick an instrumentation standard. OpenTelemetry, almost always.
- Deploy the OTel Collector as the central pipeline (a minimal SDK-to-Collector wiring sketch follows this list).
- Get trace context propagating end-to-end, including the awkward hops that always silently break it.
- Inject `trace_id` into every structured log line.
- Validate by debugging a real incident with the trace view as the first stop. If you can't, you're not done.
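To make the first two steps concrete, here is a minimal sketch of wiring a Python service's OTel SDK to a Collector over OTLP, with the sampling decision made explicitly rather than inherited. The endpoint and service name are placeholders for your environment; the packages are the standard `opentelemetry-sdk` and OTLP gRPC exporter.

```python
# A minimal sketch of SDK-to-Collector wiring with an explicit,
# documented sampling decision. Endpoint and service name are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout"}),
    # Sample 10% of new traces, but always honour the parent's decision
    # so traces stay complete across services. The 0.1 is a deliberate,
    # reviewable choice, not an inherited default.
    sampler=ParentBased(TraceIdRatioBased(0.1)),
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
```

From here, every service exports to the same Collector, and a sampling change becomes a one-line diff instead of a scavenger hunt through per-service defaults.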
The deeper version of all of this is in the guide on OTel Collector vs vendor agents and the Golden Signals primer. The Blueprint at /resources covers the full instrumentation checklist.
Until you've done this work, you have monitoring. Useful, valuable, necessary monitoring. But not observability. The vendors will keep telling you it's the same thing. It isn't.
OTel Collector vs vendor agents
The instrumentation choice that decides whether your traces will ever actually exist.
The Golden Signals
What you measure once the trace pillar is finally pulling its weight.
AI SRE without good telemetry is theatre
The next post in this argument: agents need the third pillar more than the humans do.