Field notes

The cost spike that turned out to be a logging loop.

The CFO's office spotted it before the SRE team did. A 40% jump in observability spend month-on-month. Engineering's first reaction was to question the bill. The bill was right. A retry loop in one service was generating eight billion log lines a day, and nobody had alerted on it because the system was doing exactly what it had been told to do.

Ken Tan · 6 min read

The story starts with a finance ticket, not an incident page. A FinOps analyst at a client noticed that observability spend on the November invoice was 40% higher than October. They flagged it to engineering with a polite question: "is this expected?"

Engineering's first reaction was to question the bill. Vendors sometimes get the meter wrong. Sometimes a usage tier flips. The finance team was asked to wait while the SRE lead opened a case.

Three days later, the bill turned out to be correct. There was no metering error. The platform was ingesting forty per cent more log data than the month before, and almost all of that increase was coming from a single service. That service had no open incident against it. No alert had fired. The dashboards were green. The only signal that anything was wrong was the invoice.

What had actually happened

The service in question was a payment-reconciliation worker. It made an outbound call to a partner API on every reconciliation event. Three weeks before the bill spike, the partner had quietly deprecated a TLS configuration. The service couldn't establish the connection. It was retrying with no backoff and no circuit breaker. Each failed attempt logged a four-kilobyte stack trace at WARN level.
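
To make the shape of the failure concrete, here is a minimal sketch of the two loops side by side. It is not the client's code; the partner URL, the function names, and the use of Python with the requests library are all illustrative. The point is how cheap the anti-pattern is to write and how little the fix costs.

    import logging
    import time

    import requests  # illustrative HTTP client; the real worker's stack differs

    PARTNER_URL = "https://partner.example.com/reconcile"  # placeholder
    log = logging.getLogger("reconciliation.worker")

    # The failure mode, roughly: retry forever, immediately, and log a full
    # stack trace on every attempt. One degraded dependency becomes a firehose.
    def reconcile_unbounded(event):
        while True:
            try:
                return requests.post(PARTNER_URL, json=event, timeout=5)
            except requests.RequestException:
                # ~4 KB of stack trace at WARN, once per failed attempt
                log.warning("partner call failed, retrying", exc_info=True)
                # no backoff, no cap: the loop spins as fast as the handshake fails

    # The boring fix: bounded attempts, exponential backoff, and one log line
    # per failed event instead of one per attempt.
    def reconcile_bounded(event, max_attempts=5, base_delay=0.5):
        for attempt in range(1, max_attempts + 1):
            try:
                return requests.post(PARTNER_URL, json=event, timeout=5)
            except requests.RequestException:
                if attempt == max_attempts:
                    log.error("partner call failed after %d attempts",
                              max_attempts, exc_info=True)
                    raise
                time.sleep(base_delay * 2 ** (attempt - 1))

A circuit breaker belongs on top of this, but even the backoff and the attempt cap alone collapse the log volume by orders of magnitude.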

The service was processing about two hundred reconciliation events per second. With the retry loop, that became roughly eight billion log lines a day, all of them functionally identical. None of them surfaced as an alert, because the alerting strategy was built around error rate rather than error volume, and a 100% error rate on a worker without an SLO doesn't trigger anything in most stacks.
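
The arithmetic behind those figures is worth a sanity check, because it explains why the invoice moved before anything else did. Taking the 4 KB trace and the eight-billion-line figure at face value:

    # Back-of-envelope only; the two inputs are the figures quoted above.
    lines_per_day = 8e9          # retry-driven log lines per day
    bytes_per_line = 4 * 1024    # ~4 KB stack trace per WARN line

    print(f"{lines_per_day * bytes_per_line / 1e12:.0f} TB/day of near-identical WARN lines")  # ~33
    print(f"{lines_per_day / 86_400:,.0f} log lines per second, sustained")                    # ~92,593

Roughly 33 TB a day of identical stack traces is invisible to an error-rate alert and extremely visible to a metered ingest pipeline.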

The customer-facing impact was real but slow. Reconciliations were falling behind. Settlements were lagging by hours. Nobody had been paged, because the lag was being absorbed by a buffer sized for weeks, not days. The bill caught it first.

Why the bill is often the leading indicator

The lesson I take from this engagement, and from a handful of similar ones, is that the cloud bill is one of the most underrated diagnostic signals in an observability stack. It catches things alerts miss because it has no concept of "expected behaviour." It just measures. A retry loop, a runaway export job, a misconfigured log forwarder, a leaked debug flag in production — all of them show up on the invoice before they show up anywhere else, because all of them consume resources at a rate that's outside the system's normal distribution.

The reason most teams don't read the bill that way is operational. The bill arrives monthly. By the time it lands, the loop has been burning for two to three weeks. Engineering looks at the bill, finds it large, asks finance to explain it, and the loop continues for another week while the conversation happens. The mechanism is correct; the latency is wrong.

The four loops the bill catches earliest

From engagements where this has come up, the patterns repeat:

  • Retry storms against a degraded dependency. The service is "up" by every alerting definition. It's just spending thousands of dollars an hour failing.
  • Log-level flags accidentally left at DEBUG. Usually after a deploy where someone was tracing a specific issue and forgot to unset the level. The signature is a clean step function in log volume coinciding with a deploy timestamp.
  • Cardinality explosions in metrics. A new label with high uniqueness — usually a user ID, a request ID, or something equivalent — gets added to a metric and the time-series count multiplies by three or four orders of magnitude. The sketch after this list shows the shape of the mistake.
  • Misrouted traffic. A load balancer or service mesh config sending traffic through an unintended path, typically generating egress charges that look like the cost of a moderately-sized startup.
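
The cardinality case is the easiest of the four to show in code. A minimal sketch, using Python's prometheus_client with made-up metric and service names: the only difference between the safe counter and the expensive one is a single label.

    from prometheus_client import Counter

    # Bounded cardinality: one time series per (service, status) pair.
    RECON_EVENTS = Counter(
        "reconciliation_events_total",
        "Reconciliation events processed",
        ["service", "status"],
    )

    # The explosion: the same counter with a per-user label. Every distinct
    # user_id mints a new time series, and ingest cost scales with series count.
    RECON_EVENTS_BY_USER = Counter(
        "reconciliation_events_by_user_total",
        "Reconciliation events processed, per user",
        ["service", "status", "user_id"],  # unbounded label -> unbounded series
    )

    def record(service: str, status: str, user_id: str) -> None:
        RECON_EVENTS.labels(service=service, status=status).inc()
        RECON_EVENTS_BY_USER.labels(service=service, status=status, user_id=user_id).inc()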

All four of these are diagnosable from the bill long before they become incidents. The detection just needs to happen on the right timeframe.

What to wire up

The intervention I recommend in engagements where this has bitten is small but unfashionable. It's not a tool. It's three habits.

  1. Daily, not monthly, cost telemetry. Most cloud providers expose daily cost APIs. Pull the data into the same dashboard your SRE team already looks at. A cost panel next to a latency panel is one of the highest-leverage views in the platform, and most teams don't have it. A minimal pull is sketched after this list.
  2. Alert on per-service log volume deltas. A 5x day-over-day jump in log volume from a single service is almost always a real signal. The alert fires before the bill arrives. The second sketch after this list shows the check.
  3. Make finance a stakeholder in the on-call review. Not as a headcount gate. As a second pair of eyes. A FinOps analyst reading the cost trend weekly will catch things engineering will not, because engineering is reading the system as "is it up" and finance is reading it as "is it weird."
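
For the first habit, here is a minimal sketch of what the daily pull can look like, assuming AWS and its Cost Explorer API via boto3. GCP and Azure expose equivalent daily-cost endpoints, and the function name and output shape here are illustrative, not a prescription.

    from datetime import date, timedelta

    import boto3

    def daily_cost_by_service(days: int = 14):
        """Daily unblended cost per AWS service for the trailing window."""
        ce = boto3.client("ce")
        end = date.today()
        start = end - timedelta(days=days)
        # Pagination via NextPageToken is omitted for brevity.
        resp = ce.get_cost_and_usage(
            TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
            Granularity="DAILY",
            Metrics=["UnblendedCost"],
            GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
        )
        rows = []
        for day in resp["ResultsByTime"]:
            for group in day["Groups"]:
                rows.append({
                    "date": day["TimePeriod"]["Start"],
                    "service": group["Keys"][0],
                    "usd": float(group["Metrics"]["UnblendedCost"]["Amount"]),
                })
        return rows  # feed this into the dashboard the SRE team already watches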
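
For the second habit, the check itself is small enough to sketch in a few lines. Where the per-service counts come from depends on the stack (Loki, CloudWatch Logs, the vendor's usage API), so the inputs here are plain dictionaries and the service names and thresholds are illustrative.

    def log_volume_anomalies(today, yesterday, ratio=5.0, floor=1_000_000):
        """Flag services whose daily log-line count jumped by `ratio`x day-over-day.

        `today` and `yesterday` map service name -> line count. The floor keeps
        low-traffic services from alerting on noise.
        """
        flagged = []
        for service, count in today.items():
            baseline = yesterday.get(service, 0)
            if count < floor:
                continue
            if baseline == 0 or count / baseline >= ratio:
                flagged.append((service, baseline, count))
        return flagged

    # The reconciliation worker's jump would have fired weeks before the invoice:
    print(log_volume_anomalies(
        {"payment-reconciliation": 8_000_000_000, "checkout-api": 40_000_000},
        {"payment-reconciliation": 18_000_000, "checkout-api": 38_000_000},
    ))
    # [('payment-reconciliation', 18000000, 8000000000)]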

The line worth holding

Bills are signals. The teams that read them as diagnostics catch incidents the alert set was never going to catch, because the alert set was tuned for failure modes the system architects had thought of. The bill measures everything, including the failure modes nobody had imagined yet. That's the whole point of treating it as telemetry.

Engagement.start()

Your cloud bill is a leading indicator. Most teams treat it as a trailing receipt.

A Tracefox cost-and-telemetry review reads the bill as a diagnostic surface. Logging volume per service, ingest spend per environment, retry-driven amplification — every line item gets traced back to an owner. The deliverable isn't a savings number; it's a list of services whose telemetry profile is telling you something is wrong before anyone has paged.