
The Tier 0 service that wasn't.

Field notes from a recent assessment. Names withheld; the pattern won't surprise you. Over-tiering is the more common mistake (the one nobody calls out) and the operational cost is real.


A fintech client we'd just started working with had fourteen production services. Five of them were tagged Tier 0, the highest reliability tier in their internal model, the one that means outage equals direct revenue loss or regulatory breach.

I asked the head of platform to walk me through why those five.

What they listed

  • Payment processing. Yes. Outage = transactions failing live = lost revenue and merchant complaints within minutes. Tier 0.
  • Authentication. Yes. Outage = no one can log in = effectively 100% of customer-facing functionality stops. Tier 0.
  • Settlement engine. Hmm. Outage = batch processing delayed by minutes-to-hours. Customers don't notice in real time; finance notices the next day.
  • Internal admin portal. "It's how the ops team manages everything." Used by twelve internal users.
  • Reporting service. Generates weekly compliance exports. If it failed, the report would be late by hours. Compliance has SLAs measured in days.

We worked through it. Two of those were Tier 0. One was Tier 1, one was Tier 2, and one (the admin portal) was Tier 3. The team had been operating the portal at 99.95% standards for two years.

What over-tiering actually costs

Operating a service at Tier 0 means certain things: deeper redundancy, faster alert response, change-control gates, possibly 24/7 paging, capacity overprovisioning. None of that is free.
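
To make that cost visible, here is a minimal sketch of what a tier-to-standard mapping can look like once it's written down. The field names, availability targets, and paging policies are illustrative assumptions, not the client's actual policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierStandard:
    """Operational obligations attached to a reliability tier (illustrative values)."""
    availability_slo: float   # target availability, e.g. 0.9995 for "99.95%"
    paging: str               # who gets woken up, and when
    change_gates: bool        # formal change control required before release?
    redundancy: str           # expected deployment topology

# Hypothetical mapping; the real targets belong in the signed-off tier list.
TIER_STANDARDS = {
    0: TierStandard(0.9995, "24/7 page",              True,  "multi-region, active-active"),
    1: TierStandard(0.999,  "24/7 page",              True,  "multi-zone"),
    2: TierStandard(0.995,  "business-hours page",    False, "multi-zone"),
    3: TierStandard(0.99,   "Slack channel, no page", False, "single region"),
}
```

Writing the obligations next to each tier makes the cost of a tier visible before a service gets assigned to it.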

The admin portal at Tier 0 was paging an on-call engineer roughly twice a week. The alerts were things like "page load exceeded 300ms" at 2am, for an internal tool nobody used at 2am. Multiply that by twelve internal users who genuinely don't care and two years of accumulated pages, and you have an on-call burnout problem caused by an over-tiering decision made when the original tier list was written and never revisited.

The settlement engine was the more interesting case. The team had listed it as Tier 0 because it processed money. The heuristic was "if it touches money, it's Tier 0." Reasonable instinct. But settlement is a batch process with a four-hour grace window before any external commitment is breached. A two-hour outage is a P2 incident with a clean recovery path. Tier 0 was forcing the team to treat a routine, retryable workload as if it were live transaction processing. The cost was the same alert noise and the same 24/7 expectation, applied to a workload that didn't need either.

The fix is a conversation, not a downgrade

Re-tiering isn't an engineering decision; it's a business one. We ran a 90-minute workshop with platform and product leadership in the room together. For each service, we asked the same question: name the consequence of a one-hour outage, specifically. Not "it would be bad." Specifically.

For payments: ~US$230k in lost transaction revenue per hour, plus merchant complaints likely within thirty minutes, plus probable SLA breach with the largest accounts. Tier 0.

For the admin portal: twelve internal users mildly annoyed. They have email. They could wait. Tier 3.

The product lead asked: "Are we sure we want to drop the admin portal to Tier 3? It feels like a downgrade." It is a downgrade, and that's fine. A Tier 3 service can have a 99% SLO, ship at higher velocity, and have a five-person Slack channel as its alerting destination. The "downgrade" is operational sanity.

Tier on the consequence, not the heuristic. "Touches money" is a heuristic. "Outage = direct revenue loss within 30 minutes" is a consequence. Tier on the second one.
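
A minimal sketch of what "tier on the consequence" can look like when encoded as a rule. The thresholds, field names, and cut-offs below are illustrative assumptions, not Tracefox's actual model.

```python
from dataclasses import dataclass

@dataclass
class OutageConsequence:
    """The answer to the workshop question: what actually happens in a one-hour outage?"""
    revenue_loss_per_hour: float         # direct, measurable revenue impact
    minutes_until_customer_impact: int   # how long before anyone outside notices
    regulatory_breach: bool              # does the outage itself breach a commitment?
    affected_users: int                  # how many people notice, internal or external

def assign_tier(c: OutageConsequence) -> int:
    """Illustrative rule: the tier follows the consequence, not 'it touches money'."""
    if c.regulatory_breach or (c.revenue_loss_per_hour > 0 and c.minutes_until_customer_impact <= 30):
        return 0
    if c.revenue_loss_per_hour > 0 or c.minutes_until_customer_impact <= 60:
        return 1
    if c.affected_users > 100:
        return 2
    return 3

# The admin portal from this post: twelve internal users, nothing else on the line.
admin_portal = OutageConsequence(revenue_loss_per_hour=0.0,
                                 minutes_until_customer_impact=8 * 60,
                                 regulatory_breach=False,
                                 affected_users=12)
assert assign_tier(admin_portal) == 3
```

Payments, under the same rule, lands in Tier 0 on both the revenue and the thirty-minute conditions.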

The discipline

Honest tiering means accepting that not every service deserves the highest standard, and that pretending otherwise is the more expensive choice. Over-tiering causes alert fatigue, on-call burnout, and friction between engineering and product because the operational standard exceeds what the business actually needs.

Three things make this stick:

  1. Tier on consequence, not on heuristic. Make the consequence concrete and measurable.
  2. Sign-off in writing, both sides. Engineering and product leadership both sign the tier list. The signature is the load-bearing element. It commits product to respecting the looser SLO when it's looser, and to funding the operational standard when it's tighter.
  3. Re-tier on a cadence. Quarterly is the default. Annually is the minimum. Service criticality changes; the tier list has to keep up. Otherwise it becomes archaeology. A sketch of that check follows this list.
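
Under the same illustrative assumptions as above, here is a short sketch of what the signed-off record and the cadence check could look like; the 90-day default maps to "quarterly", and every field name is hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class TierEntry:
    """One row of the tier list, including the sign-off that makes it binding."""
    service: str
    tier: int
    signed_off_by: tuple[str, str]   # (engineering lead, product lead)
    last_reviewed: date

def stale_entries(tier_list: list[TierEntry],
                  max_age: timedelta = timedelta(days=90),
                  today: date | None = None) -> list[TierEntry]:
    """Return the services whose tier hasn't been revisited within the cadence."""
    today = today or date.today()
    return [e for e in tier_list if today - e.last_reviewed > max_age]
```

Run it quarterly and the tier list stays a living document instead of archaeology.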

What that client looks like today

Two Tier 0 services. Three Tier 1. Four Tier 2. Five Tier 3. Same fourteen services, accurate operational standards. On-call rotations stopped paging on the admin portal. The settlement engine moved to Tier 1, which gave the team room to ship features instead of guarding a workload that didn't need guarding.

The admin portal still works fine.

The full tier model is in the guide on tiered SLO targets, and the workshop format is part of every Tracefox engagement. If you've not re-tiered your services in over a year, the conversation is overdue.

Engagement.start()

The tier list you wrote three years ago is not the tier list you should be operating against.

Tracefox engagements include the tier-assignment workshop: engineering and product leadership in the room together, one consequence question per service, signed-off output. It takes 90 minutes and reliably saves the on-call rotation months of pointless paging.