
Observability is on the wrong line item.

The CFO already pays for it. It just doesn't show up as a tooling cost. It shows up as incident hours, support tickets, and over-provisioning. The conversation that unblocks observability budget isn't about spending more. It's about moving the spend.

Tracefox · 6 min read

The conversation always opens the same way. The platform lead has been trying to get observability budget approved for two quarters. Finance keeps asking what the ROI is. Engineering keeps producing decks full of MTTR charts and "industry best practices." The conversation goes nowhere because everyone is looking at the wrong line item.

The CFO is not refusing to fund observability. The CFO is already funding it, at roughly three to five times the proposed budget, just not under the name "observability."

Where the spend actually lives

Walk through the last twelve months of any production engineering org and list the costs that exist because the telemetry isn't good enough. They are never on the observability line. They are scattered across the P&L:

  • Incident hours. Engineers, SREs, and tech leads pulled into bridge calls that lasted four hours instead of forty minutes, because nobody could see why checkout was slow.
  • Customer credits and refunds. The contractually guaranteed compensation paid out for SLA breaches the team didn't catch until customers complained.
  • Support ticket volume. The CS team doing the work the monitoring system should have done: being the detection mechanism for production issues, one ticket at a time.
  • Over-provisioning. The 40% headroom carried on every service "to be safe," because nobody is confident enough in the utilisation data to run anything closer to its actual capacity.
  • Engineer attrition. The two senior people who left after a year of on-call rotations made unsurvivable by alert noise. Replacement cost: six months of recruitment plus a year of onboarding for each.
  • Slowed delivery. The release cadence dropped from weekly to fortnightly to monthly, because nobody trusts the rollback signal enough to ship fast.

None of those line items is labelled "observability." They are labelled "engineering payroll," "customer success," "AWS bill," "recruitment," "delivery." The CFO is paying all of them. The CFO has never seen a summary that connects them.

The number worth producing

The number that moves a budget conversation isn't an industry benchmark. It's your last three significant incidents, costed end to end.

Pick the three biggest incidents from the last six months. For each one, add up:

  1. Engineering hours on the bridge call (number of people × hours × loaded hourly rate).
  2. Customer credits or refunds issued.
  3. Support ticket volume during and after.
  4. Revenue lost during degradation (drop in conversion rate × duration × baseline traffic × average order value).
  5. Any infrastructure damage: re-runs, queue replays, manual data fixes.
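The five components above can be added up in a few lines. What follows is a minimal sketch, not a Tracefox tool: the `Incident` class, its field names, the per-ticket handling cost, and every figure in the example are hypothetical placeholders, and lost revenue is taken as an input you estimate separately from your analytics data.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One incident, costed end to end. All field names are illustrative."""
    name: str
    people_on_bridge: int
    bridge_hours: float
    loaded_hourly_rate: float   # salary plus overheads, per person-hour
    credits_and_refunds: float
    support_tickets: int
    cost_per_ticket: float      # assumed handling cost per ticket
    lost_revenue: float         # estimated separately from analytics data
    remediation_cost: float     # re-runs, queue replays, manual data fixes

    def engineering_cost(self) -> float:
        # people x duration x loaded rate
        return self.people_on_bridge * self.bridge_hours * self.loaded_hourly_rate

    def support_cost(self) -> float:
        return self.support_tickets * self.cost_per_ticket

    def total_cost(self) -> float:
        return (
            self.engineering_cost()
            + self.credits_and_refunds
            + self.support_cost()
            + self.lost_revenue
            + self.remediation_cost
        )


# Placeholder figures for illustration only; substitute your own.
checkout_slowdown = Incident(
    name="checkout-slowdown",
    people_on_bridge=8,
    bridge_hours=4.0,
    loaded_hourly_rate=150.0,
    credits_and_refunds=20_000.0,
    support_tickets=300,
    cost_per_ticket=15.0,
    lost_revenue=60_000.0,
    remediation_cost=5_000.0,
)

print(f"{checkout_slowdown.name}: US${checkout_slowdown.total_cost():,.0f}")
```

Keeping each component as its own method makes it easy to show finance which existing line item each part of the total currently sits on.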

For most mid-size orgs the per-incident number lands somewhere between US$50k and US$500k. Three of those in a year adds up to between US$150k and US$1.5m. The platform invoice the two sides were arguing about was US$230k.
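The comparison in that paragraph is simple arithmetic, and it is worth running with your own figures rather than the ranges quoted here. A minimal sketch, using the article's numbers as inputs; the function and variable names are hypothetical:

```python
def annual_incident_cost(per_incident_low, per_incident_high, incidents_per_year):
    """Return the (low, high) annual range for a given per-incident range."""
    return (
        per_incident_low * incidents_per_year,
        per_incident_high * incidents_per_year,
    )


# Figures quoted in the article: US$50k to US$500k per incident, three a year.
low, high = annual_incident_cost(50_000, 500_000, 3)
platform_invoice = 230_000

print(f"Annual incident cost: US${low:,} to US${high:,}")
print(
    f"Multiple of platform invoice: "
    f"{low / platform_invoice:.2f}x to {high / platform_invoice:.2f}x"
)
```

At the low end of the range the incident spend is smaller than the invoice, so the case rests on your own costed incidents, not on the benchmark range.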

The conversation isn't "spend more." It's "the spend already exists. Where would you rather it sit: on a controllable line item, or on five uncontrollable ones?"

Finance is not anti-observability. Finance is anti-decks. Bring the incident-cost spreadsheet. The conversation changes in fifteen minutes.

Why this conversation rarely happens

Engineering doesn't produce the cost-of-incidents number for the same reason finance doesn't ask for it: nobody owns the analysis. The engineering team frames the problem in MTTR. Finance frames it in licence spend. The number that connects them (incidents costed in money) sits in the gap between two functions and gets calculated by neither.

The org chart tells you why. SRE leadership reports up through the CTO. The licensing budget sits with Engineering Operations or sometimes Finance Business Partners. The customer credit ledger lives in Customer Success. The conversion-rate-during-incidents number lives in Analytics. No single function holds all four pieces. The connecting work is project work, and nobody has been asked to do it.

That's the leadership opening. Whoever does assemble the number wins the budget argument, not because the number is enormous, but because no competing argument is grounded in money at all.

What the spend should be moved to

This is where the conversation usually goes wrong even after the cost case is made. Engineering proposes a vendor. Finance approves a budget. The org buys a platform. Eighteen months later, the incidents are still happening at the same rate, because the gap was never tooling.

The spend should move toward the four things that actually reduce incident cost:

  1. Trace coverage on the critical path. Until requests can be followed end-to-end, every incident triage starts with "where do we look first." We've written about why this is the load-bearing pillar.
  2. SLOs on the user-facing surfaces. Not on infrastructure. On the things customers actually pay for: checkout completion, search response, dashboard load. Without these, the team is alerting on noise and missing the signal.
  3. An alert audit. Half the on-call burnout is not caused by real incidents. It's caused by alerts that should have been retired three years ago. We cover the leadership angle in a separate piece.
  4. Runbook coverage. The runbook is the reason the next incident is forty minutes instead of four hours. The investment here is usually an order of magnitude below the leverage it returns.

None of those is a vendor purchase. All of them require the budget to be real, allocated, and protected from being reabsorbed when the next project deadline lands.
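Item 2 in the list, SLOs on user-facing surfaces, is easier to fund when it comes with a number. One common formulation, which the article itself does not describe, is the error budget: the share of requests allowed to fail under the SLO target. The sketch below assumes a request-based SLO on checkout completion; the function name, the 99.5% target, and the request counts are all hypothetical.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget left in the window (negative = overspent).

    slo_target is a fraction, e.g. 0.995 for a 99.5% success objective.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical 30-day window for checkout completion.
remaining = error_budget_remaining(
    slo_target=0.995,
    total_requests=1_000_000,
    failed_requests=2_000,
)
print(f"Checkout error budget remaining: {remaining:.0%}")
```

Alerting when the budget burns faster than expected, instead of alerting on every infrastructure metric, is one way to get the signal without the noise.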

The version that lands

Take the cost-of-incidents number to your CFO without the deck. Show it against the proposed budget. Frame it as "we are already spending this. The question is whether to spend it on the consequence or the cause."

In our experience, the conversation lasts under twenty minutes. The budget gets approved within two weeks. The reason it had been stuck for six months wasn't disagreement. It was that the right number had never been put in front of the right person.

The CFO is on your side. The line item just needs moving.

Engagement.start()

The version of this conversation that lands with finance is one specific incident, costed end to end.

The Tracefox assessment includes the cost-of-not-having-it analysis: the last three significant incidents, broken down by engineering hours, customer credits, support load, and the over-provisioning that exists because nobody trusts the dashboards. The number is usually larger than the platform invoice. That's the conversation the CFO will have.