The cargo-culted SLO target.
An engineer read the SRE book. The team adopted 99.9% on every service. The internal admin tool, the public API, the batch job that runs at 03:00, the customer-facing checkout — all on the same target. The result is a budget set that maps to nobody's actual experience. The team is investing reliability effort in the wrong places.
Three numbers appear in almost every SLO document I read: 99.9, 99.95, and 99.99. The numbers themselves are reasonable. The problem is how they're applied: uniformly, across every service in the estate, regardless of what each service is for.
The result is a target set that doesn't match the business. The customer-facing checkout, which absolutely cannot tolerate two minutes of downtime in the middle of a campaign, has the same target as the internal admin tool, which is used by twelve employees, none of whom have ever logged in at 03:14 on a Sunday. The reliability effort is distributed evenly. The business impact is not.
How the cargo-cult happens
The 99.9% target is not arrived at through analysis. It's arrived at through pattern-matching on what other teams have written. The chain is usually:
- An engineer reads the SRE book or a major company's blog post. The post mentions 99.9% as a typical target.
- The engineer drafts the company's first SLO document. They pick 99.9% because it's the number they've seen.
- The document is reviewed. Nobody pushes back, because nobody in the review has thought about it any harder than the author has.
- The number propagates. New services adopt it because that's what existing services use. Within a year, every service in the estate has the same target.
The number was never wrong, exactly. It was just never matched to the system it was applied to. The author copied a target from a company whose constraints were different, and the team inherited the assumption without inheriting the analysis.
What a tiered SLO catalogue looks like
The pattern that produces SLO targets that survive contact with the business is tiering. Not every service is the same. Tiering forces the team to articulate which services matter, and in which direction. The model I use with most clients has four tiers.
- Tier 0 — revenue critical. Customer-facing services where downtime maps directly to lost revenue or regulatory penalty. Targets typically 99.95–99.99%, with genuine error budgets that drive ship-freezes.
- Tier 1 — customer impact. Customer-visible services where degradation is noticed but not catastrophic. Targets typically 99.9%. The dashboards exist; the budget is tracked; freezes are rare.
- Tier 2 — internal impact. Internal tooling that affects employee productivity. Targets typically 99.5%. No paging on burn rate; tracked monthly.
- Tier 3 — best effort. Batch jobs, demo environments, internal experiments. No SLO. Health is monitored but not enforced.
The tiering is not about which services deserve attention. It's about which services deserve which kind of attention. Reliability effort is finite. Tiering is how you spend it where it matters.
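The tier-to-target mapping above implies concrete downtime budgets, and making those visible is often what convinces a sceptical stakeholder. A minimal sketch (the tier names and targets follow the catalogue above; the helper name is mine):

```python
# Allowed downtime per tier, derived from the availability target.
# Tier names and targets follow the four-tier catalogue above.
TIER_TARGETS = {
    "tier-0": 0.9995,  # revenue critical
    "tier-1": 0.999,   # customer impact
    "tier-2": 0.995,   # internal impact
    # tier-3 is best effort: no SLO, so no budget to compute
}

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200

def downtime_budget_minutes(target: float,
                            window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of allowed unavailability in the window for a given target."""
    return (1 - target) * window_minutes

for tier, target in TIER_TARGETS.items():
    print(f"{tier}: {target:.2%} -> "
          f"{downtime_budget_minutes(target):.0f} min/month")
```

The spread is the point: a Tier 0 service gets roughly 22 minutes of budget a month, a Tier 2 service gets 216. That difference is where the reliability effort should go.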
The most common tiering mistake
The mistake I see most often isn't picking the wrong number. It's over-tiering — pushing too many services into Tier 0 because the team is reluctant to call any service less critical than another. Every service team's leader argues their service is Tier 0. The org has fifteen Tier 0 services, which is functionally the same as having no tiering at all, because the team can't focus attention on fifteen things.
The discipline of tiering is the discipline of saying no. Most services are Tier 1 or Tier 2. The Tier 0 list should be small — five services or fewer in most estates — and every line on that list should be defensible to a non-technical audience. "This service is Tier 0 because if it's down for ten minutes during market hours, we lose seven figures in trade volume." If the line can't survive that articulation, the service isn't Tier 0.
The workshop that produces the catalogue
The exercise I run with clients takes a day. Engineering leads, product managers, and at least one business stakeholder in the same room. The agenda:
- List every production service. One row per service. Pre-populated by the platform team.
- For each, answer three questions. Who uses it. What breaks if it's unavailable for an hour. What breaks if it's unavailable for a day. The answers go in three columns.
- Assign a tier based on the answers. Not based on the team's affection for the service. Based on the business impact described.
- Set an SLO target appropriate to the tier. Use the standard tier-to-target mapping unless there's a specific reason to deviate.
- Get sign-off, in writing, from a director. The director's name on the document is what makes it authoritative. Without it, the catalogue is engineering's opinion, and engineering's opinion is the cargo-cult that got you here.
The output is a one-page tiering memo. The memo is the artefact that the next budget conversation refers to. It's also the artefact that absorbs pressure when a service team wants to upgrade their tier later — the conversation goes to the director who signed off, not to the engineering manager who would otherwise have to defend the line alone.
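The memo's rules are mechanical enough to check automatically. A sketch of the catalogue as data, with the two failure modes described above flagged; the row fields (`hour_impact`, `signed_off_by`, and so on) are hypothetical names for the workshop's columns, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class CatalogueRow:
    service: str
    tier: int            # 0-3, per the four-tier model
    users: str           # who uses it
    hour_impact: str     # what breaks if it's down for an hour
    day_impact: str      # what breaks if it's down for a day
    signed_off_by: str   # the director whose name makes it authoritative

def check_catalogue(rows: list[CatalogueRow]) -> list[str]:
    """Flag over-tiering and missing sign-off."""
    problems = []
    tier0 = [r for r in rows if r.tier == 0]
    if len(tier0) > 5:
        problems.append(
            f"{len(tier0)} Tier 0 services; the list should be five or fewer")
    for r in rows:
        if not r.signed_off_by:
            problems.append(f"{r.service}: no director sign-off")
    return problems
```

Running the check in CI against the memo's source file keeps the catalogue honest between workshops: a tier upgrade that breaches the Tier 0 limit fails the build and routes the conversation back to the director.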
SLO is not SLA
The other piece I keep having to clarify, because the cargo-cult blurs it: the SLO is the internal commitment. The SLA is the external one. The SLO should be tighter than the SLA, by a margin that gives the team time to react. A 99.9% SLA means an SLO of 99.95% or higher. The other way round — SLO equal to or weaker than the SLA — means the team will breach the SLA before they realise they have a problem.
Half the teams I audit have SLOs that are equal to their SLAs. This is mechanically broken, regardless of which numbers they chose.
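The rule is simple enough to state as a predicate. A sketch (the function name is mine):

```python
def slo_is_valid(slo: float, sla: float) -> bool:
    """The internal SLO must be strictly tighter than the external SLA,
    so the team hits its own alarm before the contract is breached."""
    return slo > sla

# The mechanically broken case half the audited teams exhibit:
assert not slo_is_valid(slo=0.999, sla=0.999)  # SLO == SLA: no reaction margin
assert slo_is_valid(slo=0.9995, sla=0.999)     # 99.95% SLO behind a 99.9% SLA
```

The margin between the two numbers is the team's reaction time: with a 99.95% SLO inside a 99.9% SLA, burning through the internal budget still leaves roughly half the contractual budget to detect, respond, and recover.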
The line worth holding
The number on the SLO is downstream of the conversation about what the service is for. If the team hasn't had that conversation, no number is the right number. The cargo-culted target is the sign that the conversation hasn't happened yet. Run the workshop. Sign the memo. Then the targets mean something, and the budget can do its job.
Stop applying 99.95% to everything.
The companion guide: the four-tier model, the SLO-vs-SLA distinction, and the workshop that gets sign-off.
Your error budget exists. It just isn't being used.
Sibling failure: the SLO is wrong, so the budget is performative, so it doesn't drive decisions.
The Tier 0 service that wasn't.
The over-tiering problem: the same instinct that picks 99.95% for everything also pushes the wrong services into Tier 0.