Reference · SLO design

Stop applying 99.95% to everything.

Setting the same SLO target for every service is the most common (and the most expensive) SLO mistake. Different services warrant different targets. Here's the four-tier model that keeps the conversation grounded.

6 min read · v1.0

Applying a 99.95% availability SLO to an internal admin portal creates toil and alert noise for no return. Applying a 99.5% SLO to your payments service quietly licenses 3.6 hours of monthly downtime, which is not what your business agreed to. One size never fits all.

The Tracefox tiered model gives you four discrete targets to choose from, each appropriate for a class of service. Tier assignment is a leadership decision, not an engineering one, because it's a statement about how much unreliability the business will tolerate for that service.

The four tiers

| Tier | Definition | Availability | p99 latency | Error rate | 30d budget | Typical examples |
|---|---|---|---|---|---|---|
| Tier 0 · Mission Critical | Outage = direct revenue loss or regulatory breach | 99.95% | < 300ms | < 0.05% | 21.6 min | Payment processing, authentication, core API |
| Tier 1 · Business Critical | Degradation materially impacts user experience | 99.9% | < 500ms | < 0.1% | 43.2 min | Product catalogue, checkout flow, primary dashboards |
| Tier 2 · Standard | Noticed but not immediately blocking | 99.5% | < 1s | < 0.5% | 3.6 hr | Search, recommendations, secondary APIs |
| Tier 3 · Internal / Best Effort | Internal tooling, non-customer-facing | 99.0% | < 2s | < 1% | 7.2 hr | Admin portals, internal reporting, tooling |

The numbers are starter targets. Tighten them where the business demands it; relax them where the engineering cost of the next decimal place exceeds the business value. The point of having four tiers (rather than continuous targets set per service) is operational sanity. Every additional unique target is another set of alert thresholds, dashboards, and runbooks to maintain.
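
If you encode the table in tooling, it helps to derive the budget column rather than hand-maintain it, so the numbers can't drift from the availability targets. A minimal Python sketch; the `Tier` class and its field names are illustrative, not a Tracefox artefact:

```python
# Minimal sketch: the tier table as data, with the 30-day error budget
# derived from the availability target instead of maintained by hand.
from dataclasses import dataclass

MINUTES_PER_30D = 30 * 24 * 60  # 43,200 minutes in a 30-day window

@dataclass(frozen=True)
class Tier:
    name: str
    availability: float    # e.g. 0.9995 for 99.95%
    p99_latency_ms: int
    max_error_rate: float  # e.g. 0.0005 for 0.05%

    @property
    def budget_minutes(self) -> float:
        """Allowed downtime per 30 days implied by the availability target."""
        return (1 - self.availability) * MINUTES_PER_30D

TIERS = [
    Tier("Tier 0 · Mission Critical", 0.9995, 300, 0.0005),
    Tier("Tier 1 · Business Critical", 0.999, 500, 0.001),
    Tier("Tier 2 · Standard", 0.995, 1000, 0.005),
    Tier("Tier 3 · Internal / Best Effort", 0.99, 2000, 0.01),
]

for t in TIERS:
    print(f"{t.name}: {t.budget_minutes:.1f} min / 30d")
# Prints 21.6, 43.2, 216.0 (3.6 hr) and 432.0 (7.2 hr): the table's budget column.
```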

The internal SLO must be tighter than the external SLA

If the SLA (your contractual commitment to customers) is 99.9%, the internal SLO should be 99.95% or higher. Never set the SLO equal to the SLA.

The reason: by the time you've breached an SLO equal to the SLA, you're simultaneously breaching the contract. The internal SLO needs headroom so the team has signal that the SLA is at risk, with time to respond, before the breach is contractual.

The 0.05% rule: A reasonable starting buffer between SLO and SLA is one full step on the tier table. If the SLA is 99.9%, set the SLO at 99.95%. The team operates against the tighter number; the customer-visible commitment is the looser one.
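
The rule is mechanical enough to encode as a guardrail. A minimal sketch, assuming the four-tier ladder above; `internal_slo_for` is a hypothetical helper, not a Tracefox API:

```python
# Minimal sketch of the "one step tighter" rule: given a contractual SLA,
# the internal SLO is the next rung up the tier ladder.
TIER_LADDER = [0.99, 0.995, 0.999, 0.9995]  # loosest to tightest

def internal_slo_for(sla: float) -> float:
    """Return the first tier target strictly tighter than the SLA."""
    for target in TIER_LADDER:
        if target > sla:
            return target
    raise ValueError(f"SLA {sla} is at the top of the ladder; set a custom SLO above it")

assert internal_slo_for(0.999) == 0.9995  # SLA 99.9% -> internal SLO 99.95%
assert internal_slo_for(0.995) == 0.999   # SLA 99.5% -> internal SLO 99.9%
```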

The tier assignment process

Tier assignment is a 90-minute conversation, not an engineering decision. The structure that works:

  1. List every production service, including the ones that "don't really matter". The unwillingness to tier those is usually how a service ends up held to Tier 0 standards while delivering Tier 3 value.
  2. For each service, name the consequence of a one-hour outage, specifically. "Payments down for 1 hour = approx. US$230k lost transaction revenue + brand damage + likely SLA breach with merchants" is a Tier 0 conversation. "Internal expense reporting down for 1 hour = mild irritation" is a Tier 3 conversation. Make the consequence concrete.
  3. Pick the tier where the consequence sits. Tier 0 is reserved for services where an outage means direct revenue loss or regulatory exposure. Tier 1 is for services where degradation is materially user-facing. Most services live at Tier 2. A surprising number of services that teams want to label Tier 0 actually belong at Tier 1.
  4. Get sign-off from engineering and product leadership. The tier is a commitment in both directions: engineering commits to maintaining the standard; product commits to respecting the error-budget policy that flows from it. Both signatures, or the tier doesn't take effect (a minimal record sketch follows this list).
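
One way to capture the workshop output is a record per service. The sketch below is an assumption about shape, not the Tracefox worksheet schema, and the names are invented:

```python
# Minimal sketch of a tier-assignment record: the consequence statement
# from step 2 plus the two sign-offs from step 4.
from dataclasses import dataclass

@dataclass
class TierAssignment:
    service: str
    one_hour_outage_consequence: str  # step 2: make it concrete
    tier: int                         # 0-3, per the tier table
    eng_signoff: str = ""             # engineering leadership
    product_signoff: str = ""         # product leadership

    @property
    def effective(self) -> bool:
        """Step 4: both signatures, or the tier doesn't take effect."""
        return bool(self.eng_signoff and self.product_signoff)

payments = TierAssignment(
    service="payments",
    one_hour_outage_consequence="approx. US$230k lost revenue + merchant SLA breach",
    tier=0,
    eng_signoff="eng-director",
)
assert not payments.effective  # product hasn't signed, so the tier is not in effect
```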

Common mistakes

Tier 0 inflation

Every team thinks their service is critical. The standard to hold is that everything Tier 0 must operate at Tier 0 standards, which means budget, headcount, capacity overprovisioning, and change control. If the service isn't getting that investment, it isn't actually Tier 0; it's Tier 1 with aspirations. Be honest about the tier you're actually resourcing.

Setting tighter targets without operational change

Moving a service from Tier 1 to Tier 0 is not a target change; it's a delivery model change. Faster alerting, deeper redundancy, change-freeze windows around high-risk launches, possibly 24/7 on-call. If you're not willing to make the operational changes, don't move the tier.

Letting "tiering" become a one-time exercise

Service criticality changes. A new product launch elevates a service from Tier 2 to Tier 1. A retired feature drops one from Tier 1 to Tier 3. Re-tier on a cadence: annually at minimum, quarterly for fast-moving products. Otherwise the tier table becomes archaeology.

Forgetting downstream services

A Tier 0 service that depends on a Tier 2 service is, in practice, a Tier 2 service. Reliability is set by the weakest dependency. Tier dependencies alongside the services that call them, and either upgrade the dependency or downgrade your stated tier to match reality.
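
The weakest-link rule can be checked mechanically once dependencies are tiered. A minimal sketch over a hypothetical dependency map; the service names and the `effective_tier` helper are invented for illustration:

```python
# Minimal sketch of the weakest-dependency rule: a service's effective tier
# is the loosest (highest-numbered) tier anywhere in its dependency closure.
DECLARED_TIER = {"payments": 0, "auth": 0, "fraud-scoring": 2}
DEPENDS_ON = {"payments": ["auth", "fraud-scoring"], "auth": [], "fraud-scoring": []}

def effective_tier(service: str, seen: set[str] | None = None) -> int:
    """Walk the dependency graph; the weakest link sets the tier."""
    seen = seen if seen is not None else {service}
    tier = DECLARED_TIER[service]
    for dep in DEPENDS_ON[service]:
        if dep not in seen:  # guard against dependency cycles
            seen.add(dep)
            tier = max(tier, effective_tier(dep, seen))
    return tier

assert effective_tier("payments") == 2  # Tier 0 on paper, Tier 2 in practice
```

Run as a check in CI, a failing assertion like this turns "upgrade the dependency or downgrade the tier" from a forgettable intention into a forced decision.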

Where to start

List your top ten production services. Tier them individually in 30 minutes, on gut feel. Then run the structured workshop above and compare. The discrepancies between gut tier and decided tier are the most useful conversations you'll have all quarter. The downloadable Blueprint includes the full tier model plus the SLO worksheet template per service.

Engagement.start()

The right SLO is the one engineering and product can both defend.

Tracefox engagements include the tier-assignment workshop: usually a 90-minute session per service group with engineering and product leadership. The output is signed off, not just decided.