
You don't need an observability platform. You need definitions.

A client came to us with a six-figure observability bill, three vendors, and engineers who still couldn't answer 'why is checkout slow today' in under an hour. The platform wasn't broken. The definitions were.

· Tracefox · 6 min read

A fintech client came to us last quarter with a US$500k/year observability bill, three concurrent vendors, and a platform team that had been "consolidating" for two years. The new VP of Engineering asked us to come in because, after all that spend, his engineers still couldn't answer "why is checkout slow today" in under an hour.

The budget wasn't the problem. The definitions were.

The conversation we keep having

We assess a lot of teams. The pattern at the senior level is consistent: leadership thinks they have an observability problem, and that the answer is going to be a vendor decision. They want to consolidate Datadog, replace New Relic, evaluate Grafana Cloud, look at Honeycomb. They've already drafted the RFP.

Once we run the assessment, the picture is almost never about the platform. The platform is fine. What's missing is everything upstream of it: the agreement about what to measure, what "healthy" means per service, which user journey matters, what number triggers which conversation. The data is being collected. Nobody decided what it meant.

What "definitions" actually covers

The ten things we end up writing down on every engagement, in every sector, regardless of vendor:

  1. Golden Signals per service. Latency, traffic, errors, saturation. Histograms, not averages. Successful and failed paths separated.
  2. The user journeys that matter. Specific. "Authenticate", "complete a purchase", "load the dashboard". Three to five per product.
  3. The SLI for each journey. The number that says whether the journey worked.
  4. The SLO target for each SLI. Agreed with the business. Tiered by service criticality, not picked uniformly.
  5. The error-budget policy. What happens at 50%, 25%, 10%, 0% remaining. Who decides. Signed off.
  6. The burn-rate alerts. Two-window pattern (sketched after this list). Severity, owner, runbook, escalation path.
  7. Naming conventions. Same labels across metrics, logs, traces. service, env, region, team.
  8. The instrumentation standard. OTel-first. Collector as the abstraction layer.
  9. Ownership at the alert level. Every alert traceable to a team within thirty seconds.
  10. The review cadence. Monthly SLO reviews. Quarterly tier re-assessment. Annual policy review.
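
To make items 3 to 6 concrete: below is a minimal Python sketch of how an SLI, an SLO target, and the two-window burn-rate check compose for one journey. It assumes a request-success-ratio SLI and the common 14.4x paging threshold; the journey name, target, and traffic numbers are illustrative, not prescriptions.

```python
# A minimal sketch of how items 3-6 compose for one journey, assuming a
# request-success-ratio SLI. All names and numbers here are illustrative.

JOURNEY = "complete a purchase"   # item 2: the journey, in a sentence
SLO_TARGET = 0.999                # item 4: 99.9% of attempts succeed

def sli(good: int, total: int) -> float:
    """Item 3: the number that says whether the journey worked."""
    return good / total if total else 1.0

def burn_rate(good: int, total: int) -> float:
    """How fast the error budget is being spent, relative to what the SLO allows.
    1.0 means exactly on budget; 14.4 means 2% of a 30-day budget gone in an hour."""
    return (1.0 - sli(good, total)) / (1.0 - SLO_TARGET)

# Item 6: the two-window pattern. Page only when BOTH the long and the short
# window are burning fast, so an already-recovered incident stops paging.
PAGE_THRESHOLD = 14.4

def should_page(long_window: tuple[int, int], short_window: tuple[int, int]) -> bool:
    return (burn_rate(*long_window) >= PAGE_THRESHOLD
            and burn_rate(*short_window) >= PAGE_THRESHOLD)

# e.g. 1h window: 9,850 good of 10,000; 5m window: 820 good of 835
print(should_page((9_850, 10_000), (820, 835)))   # True -> page the owning team
```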

None of this is a vendor capability. None of it changes if you swap Datadog for Grafana. The vendor is the substrate on which these definitions run, and switching substrate doesn't write the definitions for you.

Teams over-invest in the platform and under-invest in the definitions because the platform is something you can buy. The definitions are something you have to do.

The price of skipping this work

The fintech client above wasn't unusual. The cost of operating without these definitions tends to look like:

  • A platform spend that grows year-on-year without measurable improvement in MTTR.
  • Alert noise that the team has stopped acting on, because the alerts were never tied to user-facing outcomes.
  • SLO conversations that get re-litigated in every P1 because nobody wrote down what the team agreed.
  • Vendor sprawl driven by "the new platform will fix it", when the new platform won't, because the gap was never in the platform.
  • Leadership questions about reliability that engineering can't answer, because nothing has been defined precisely enough to measure.

The cheap version

If you don't have budget for an engagement and don't want one, here's the week-one version of this work, in roughly the order it should happen:

  1. List your top ten production services.
  2. For each, name the one user journey that matters most. Write it down in a sentence.
  3. For each journey, pick one SLI. Use the starter catalogue.
  4. Pick a target. Use the tier model. Be conservative.
  5. Compute the error budget. Calculator here; the arithmetic is sketched after this list.
  6. Write down what you'll do at 25% remaining and at 0%. Get the product lead to sign.
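
Step 5 is plain arithmetic, and worth seeing once. A minimal sketch of what the calculator does, assuming a 30-day rolling window; the targets shown are examples, not recommendations:

```python
# Step 5, sketched: error-budget arithmetic over an assumed 30-day window.

WINDOW_MINUTES = 30 * 24 * 60   # 43,200 minutes in a 30-day window

def error_budget_minutes(slo_target: float) -> float:
    """Minutes of full unavailability the SLO target permits per window."""
    return (1.0 - slo_target) * WINDOW_MINUTES

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):,.1f} min of downtime / 30 days")

# 99.00% -> 432.0 min of downtime / 30 days
# 99.90% -> 43.2 min of downtime / 30 days
# 99.99% -> 4.3 min of downtime / 30 days
```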

That's a week of work for a small team. The result outperforms most vendor migrations, because most vendor migrations weren't going to do any of this either.

What to do this week

If you're considering a vendor change, run the test first: can your team answer "why is X slow today" in five minutes on your current stack? If yes, the platform is fine and the migration is probably a waste. If no, the new platform won't change the answer either; it'll just make the question more expensive.

The fintech client we opened with cancelled the migration. They're still on their original stack, now answering checkout-slow questions in minutes, on the platform they already owned. The bill went down the following quarter.

The work was the definitions. It almost always is.

Engagement.start()

You probably don't need a new vendor. You need to write down the ten things you've never agreed on.

The Tracefox assessment is the structured conversation that gets those ten things written down. Signals, journeys, SLOs, budgets, ownership: by the end of the engagement, the team can answer 'why is checkout slow' in minutes, on the platform they already own.