Cutting a $2.1M Datadog bill by two-thirds without losing fidelity.
A regional logistics platform on a runaway cardinality curve. We saved more than the engagement cost in the first quarter.
- Industry: Logistics · APAC
- Vendor: Datadog
- Engagement: Foundational · 12 wk
- Stack: Go · k8s · OTel
Six months from a budget veto. Three vendors all proposing "an enterprise tier."
Over 18 months, the platform team had quietly added per-user dimensions to half of their hot metrics, and cardinality followed. The bill went from $40k/mo to $175k/mo, roughly $2.1M annualized. The reaction from finance was not subtle.
- Time-series active: 2.4M
- Used in alerts/dashboards: ≈460k (19%)
- Vendor proposal: Upgrade to "Pro Plus" + DPM addon
- Goal: −60% bill, no fidelity loss
- Timeline: 12 weeks
- Out of scope: Vendor migration
- Success metric: $/active-series
- Reviewer: VP Eng + CFO
Three moves. In order.
Inventory the metrics that earn their keep.
We mapped every series to an alert, dashboard, or SLO. Anything unreferenced for 90 days was flagged as a deletion candidate. 71% of cardinality fell into that bucket.
Move noisy dimensions to logs.
Per-user IDs and order IDs were carrying the cardinality. They moved to structured logs (queryable, cheaper). Metrics kept the dimensions that drive on-call decisions.
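In a Go service, the split might look like the sketch below. It uses the standard `log/slog` package for the structured log line; the allowlist contents, the `MetricLabels` helper, and the sample event are assumptions made for illustration, and the metric emission itself is left as a comment rather than tied to a specific client library.

```go
package main

import (
	"fmt"
	"log/slog"
	"os"
	"sort"
)

// metricAllowlist is a hypothetical set of low-cardinality dimensions
// that stay on metrics. Everything else is carried by logs only.
var metricAllowlist = map[string]bool{"region": true, "endpoint": true}

// MetricLabels returns only the attributes that are allowed on a metric.
func MetricLabels(event map[string]string) map[string]string {
	labels := make(map[string]string)
	for k, v := range event {
		if metricAllowlist[k] {
			labels[k] = v
		}
	}
	return labels
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	event := map[string]string{
		"region":   "apac-se",
		"endpoint": "/v1/orders",
		"user_id":  "u-1842",
		"order_id": "o-99120",
	}

	// A real service would increment its request counter here,
	// tagged with these labels and nothing else.
	fmt.Println("metric labels:", MetricLabels(event))

	// The structured log keeps the full event, including the IDs.
	keys := make([]string, 0, len(event))
	for k := range event {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable field order in the log line
	args := make([]any, 0, len(keys))
	for _, k := range keys {
		args = append(args, slog.String(k, event[k]))
	}
	logger.Info("order processed", args...)
}
```

The IDs stay queryable in the log line, while the metric's series count is bounded by the allowlisted dimensions.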
Re-route through an OTel collector we own.
We inserted a self-hosted OTel collector before the vendor agent. Tail-sampling, drop rules, and label-set enforcement now live in our pipeline, not theirs.
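The logic behind label-set enforcement is simple enough to sketch in plain Go. This is not collector configuration and does not use any OTel API: `Point`, `Policy`, and the metric name are hypothetical, and dropping metrics that have no policy entry is a design choice assumed here for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Point is a hypothetical, simplified metric data point.
type Point struct {
	Name   string
	Labels map[string]string
}

// Policy maps a metric name to the label keys it is allowed to carry.
type Policy map[string][]string

// Enforce returns a copy of p restricted to its allowed labels, plus the
// sorted list of label keys that were stripped. Metrics without a policy
// entry are rejected (ok == false), which acts as a drop rule.
func (pol Policy) Enforce(p Point) (out Point, dropped []string, ok bool) {
	allowed, known := pol[p.Name]
	if !known {
		return Point{}, nil, false
	}
	keep := make(map[string]bool, len(allowed))
	for _, k := range allowed {
		keep[k] = true
	}
	out = Point{Name: p.Name, Labels: make(map[string]string)}
	for k, v := range p.Labels {
		if keep[k] {
			out.Labels[k] = v
		} else {
			dropped = append(dropped, k)
		}
	}
	sort.Strings(dropped)
	return out, dropped, true
}

func main() {
	pol := Policy{"http.server.requests": {"region", "endpoint"}}
	p := Point{
		Name: "http.server.requests",
		Labels: map[string]string{
			"region":   "apac-se",
			"endpoint": "/v1/orders",
			"user_id":  "u-1842",
		},
	}
	out, dropped, ok := pol.Enforce(p)
	fmt.Println("forwarded:", ok, "labels:", out.Labels, "stripped:", dropped)
}
```

Keeping this policy in a pipeline the team controls means a new high-cardinality label is stripped before it becomes a billable series.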
Figure: Cardinality flow diagram. Sources lane on the left emits five high-cardinality streams; the middle lane shows a Tracefox-owned OTel collector applying drop rules, tail-sampling, and label-set enforcement; the Datadog lane on the right receives 690k billable series instead of the original 2.4M.
Numbers, end of quarter 4.
Engagement fee was USD 96k. Platform cost reduction in the first quarter post-handover was USD 350k. The next year is gravy.
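A quick check of how these figures relate to the headline, as a small Go sketch. It assumes the pre-engagement run rate of $175k/mo would have stayed flat for the quarter; the function names are illustrative.

```go
package main

import "fmt"

// ReductionShare is the fraction of a quarter's baseline spend that was
// saved, assuming the monthly bill would otherwise have stayed flat.
func ReductionShare(monthlyBefore, quarterSavings float64) float64 {
	return quarterSavings / (3 * monthlyBefore)
}

// PaybackMultiple is first-quarter savings divided by the engagement fee.
func PaybackMultiple(quarterSavings, fee float64) float64 {
	return quarterSavings / fee
}

func main() {
	const (
		monthlyBefore  = 175_000.0 // USD per month, from the bill before the engagement
		quarterSavings = 350_000.0 // USD, first quarter post-handover
		engagementFee  = 96_000.0  // USD
	)
	fmt.Printf("bill reduction: %.1f%%\n", 100*ReductionShare(monthlyBefore, quarterSavings))
	fmt.Printf("savings vs fee: %.1fx\n", PaybackMultiple(quarterSavings, engagementFee))
}
```

Under that assumption the quarter's baseline is $525k, so $350k saved is the two-thirds cut in the headline, and the fee is covered a little over three and a half times.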
"Tracefox didn't sell us a tool. They handed back our pipeline and our budget. The new collector is the cleanest piece of infra we own."
- Cardinality inventory · 2.4M series
- OTel collector Terraform module
- Tail-sampling policy v1
- Label-set governance policy
- Runbook · cardinality regression
- ADR-014 · self-hosted collector
Datadog/New Relic bill out of control?
The Diagnostic ($18k) tells you whether a refactor is worth it. If the answer is no, you've still got a useful audit.