The vendor demo that solved the wrong problem.
The slick demo runs on a curated dataset against a clean estate. Six months later your team is on the same platform, the bill is six figures higher, and the same incidents are taking the same length of time to resolve. The tool was never the constraint. The demo lied because it was selling the answer to a different team's problem.
I've sat through enough vendor demos to know what's coming. The salesperson opens the platform. The dashboards are gorgeous. There's a sample dataset that lights up beautifully. Anomalies appear in real time, helpfully labelled. The platform suggests a probable root cause. The pricing slide arrives. Someone in the room nods. Procurement starts.
Six months later I'm working with the team. They're on the new platform. Their incident MTTR is unchanged. Their alert volume is the same. Their on-call is just as exhausted. The bill is significantly larger. The team is quietly disappointed.
The story is consistent enough that I now ask, on the first day of every engagement: "Have you bought a new observability tool in the last twelve months, and if so, has it changed any metric?" The answer is almost always "we bought one, and not really."
What the demo is actually selling
The demo isn't lying about the platform. The platform usually does the things the demo shows. The demo is lying about your team. The demo runs on a curated dataset, with clean instrumentation, with well-defined services, with a meaningful SLI catalogue, with properly-labelled alerts. Your team's data has none of those properties.
When the platform lands in your environment, it does its best with what's there. What's there is unlabelled metrics, inconsistent service names, alerts that fire on resource utilisation, dashboards that haven't been audited in two years. The platform's beautiful features depend on inputs the team hasn't built. The features degrade gracefully into "another place to look at the same data," which is what the team has now paid extra for.
The vendor isn't dishonest. They sold a tool that does what they showed, given clean inputs. The team's mistake was assuming the inputs would arrive with the tool. They don't.
The four problems the tool can't solve
Most teams I work with come into a procurement conversation with a perceived tooling problem and an actual upstream problem. The upstream problems are remarkably consistent:
- The team has no SLI catalogue. Without a catalogue, no platform can compute meaningful burn rates, because there's no agreement on what "good" means (a sketch of what a burn-rate alert actually consumes follows this list). The platform will let you build dashboards, but the dashboards won't answer "is the system healthy?" because the team hasn't answered that question.
- The instrumentation is incomplete. The traces end at a service boundary. The metrics are missing the labels needed to slice by tenant. The logs aren't structured. No platform compensates for this; every platform surfaces what's there, and what's there isn't enough.
- The alerts are wrong. The team has hundreds of alerts that fire on resource thresholds. The new platform can ingest them, route them, deduplicate them — and they're still wrong. The volume is the same. The fatigue is the same.
- The runbooks aren't useful. The platform can attach a runbook to every alert. The runbooks are still the same documents that didn't help anyone at 02:14. Better packaging doesn't make a poor runbook good.
Each of these is a definition-and-discipline problem, not a tooling problem. None of them are solved by switching platforms. All of them have to be solved before the new platform can deliver what the demo showed.
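To make the first of those concrete, here is roughly what a burn-rate alert consumes before any platform can evaluate it. This is a minimal sketch, not any vendor's implementation; the 99.9% target is an assumed number, and the 14.4 threshold is the standard multi-window figure from the Google SRE workbook. The point is what the code can't contain: the decision about which requests count as errors.

```python
# Sketch: the inputs a burn-rate alert needs. Every value here is a
# team decision, shown with assumed numbers; no platform can infer
# them from your telemetry.

SLO_TARGET = 0.999              # "good" means 99.9% of requests succeed
ERROR_BUDGET = 1 - SLO_TARGET   # so 0.1% of requests may fail

def burn_rate(errors: int, requests: int) -> float:
    """How fast the budget is being spent: 1.0 is exactly on budget;
    14.4 means a 30-day budget gone in roughly two days."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

def should_page(err_1h: int, req_1h: int,
                err_5m: int, req_5m: int,
                threshold: float = 14.4) -> bool:
    # Multi-window check: the long and short windows must both agree,
    # so a transient blip doesn't page anyone at 02:14.
    return (burn_rate(err_1h, req_1h) > threshold
            and burn_rate(err_5m, req_5m) > threshold)
```

The platform's contribution is the arithmetic, which is the easy part. The SLI, the target, and the windows are the catalogue, and the catalogue doesn't ship in the box.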
The demo to ask for
The demo I now ask for, when a client is procuring tooling, is the opposite of the curated one. "Show me the platform on a real customer's data. A messy one. Show me what an unfinished SLI catalogue looks like. Show me what happens when the metric labels are inconsistent. Show me the failure modes."
Few vendors agree to this. The ones who do are usually the ones worth buying from, because they're confident the platform is valuable even on imperfect inputs. The ones who can't show the messy version are usually selling the curated version, which is not the version your team will ever have.
What to do before procurement
The pattern that produces good tooling decisions, in my experience, is to do the upstream work first. The order of operations:
- Audit the current tool against a written set of questions you want it to answer. List the questions. Try to answer each on the existing platform. The questions that fail are the ones the next tool needs to address. The questions that pass are the ones for which switching tools is premature.
- Build a minimum SLI catalogue on the existing tool. Three or four user-facing SLIs, with burn-rate alerts. If you can't do this on the current tool, the next tool won't help — but in nearly every estate, you can. The pain is in the discipline, not the tooling.
- Audit the alerts. Retire the resource-based ones. Replace them with SLO burn-rate alerts. Measure alert volume and false-positive rate (a sketch of that measurement follows this list). The metrics will improve regardless of platform. If they don't improve enough, you now have a written case for tooling.
- Then talk to vendors. The conversation is different when you arrive with a written diagnosis. Vendors sell tools; they can't sell you definitions. If you don't have definitions, the tool conversation is premature.
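The alert audit in that third step is small enough to sketch as well. The input shape is an assumption: one record per firing, with an "actionable" flag the responder sets at triage. The flag is a human judgment, which is rather the point; no ingestion pipeline fills it in for you.

```python
# Sketch: the alert audit as a measurement rather than an opinion.
# The Firing record is a hypothetical shape; the flag is set by a
# human during triage, recording whether the alert needed one.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Firing:
    alert_name: str
    actionable: bool   # set by the on-call, not by the platform

def audit(firings: list[Firing]) -> None:
    total = len(firings)
    if total == 0:
        print("no firings in the window")
        return
    noise = sum(1 for f in firings if not f.actionable)
    print(f"volume: {total} firings")
    print(f"false-positive rate: {noise / total:.0%}")
    print("noisiest:", Counter(f.alert_name for f in firings).most_common(5))
```

Run it before the rework and again after. The before-and-after pair is the written case for, or against, new tooling, and it costs nothing but the discipline of labelling firings.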
The honest tool conversation
Sometimes, after the upstream work is done, the team genuinely needs a new tool. The old tool can't handle the cardinality. The pricing model has become punitive. The query language has bottlenecked the team's ability to investigate. Those are real procurement triggers, and the new tool will deliver against them, because the team's diagnosis is now specific.
The procurement that works is the one driven by a specific gap. The procurement that doesn't is the one driven by a beautiful demo. The first usually arrives quietly; the second arrives via a board meeting.
The line worth holding
Tooling is downstream of definitions. The vendor demo is selling the platform's behaviour on definitions you don't have yet. Build the definitions. Then buy the tool. In that order, or you're paying twice — once for the tool, once for the work that should have happened first.
- You don't need an observability platform. You need definitions. The companion: tools without definitions don't solve the problem, and the vendor demo is the most expensive way to discover this.
- Observability is on the wrong line item. Sibling argument: tooling spend is the visible part of an iceberg whose larger volume is incident hours and over-provisioning.
- CloudWatch was never going to be enough. Where the vendor-demo conversation usually starts: a cloud-native tool the team has outgrown and a procurement process that's about to over-correct.