Stop applying 99.95% to everything.
Setting the same SLO target for every service is the most common (and the most expensive) SLO mistake. Different services warrant different targets. Here's the four-tier model that keeps the conversation grounded.
Applying a 99.95% availability SLO to an internal admin portal creates toil and alert noise for no return. Applying a 99.5% SLO to your payments service quietly sets the floor of acceptable failure at roughly 3.6 hours of downtime a month, which is not what your business agreed to. One size never fits.
The Tracefox tiered model gives you four discrete targets to choose from, each appropriate for a class of service. Tier assignment is a leadership decision, not an engineering one, because it's a statement about how much unreliability the business will tolerate for that service.
The four tiers
| Tier | When it applies | Availability | p99 latency | Error rate | 30d budget | Typical examples |
|---|---|---|---|---|---|---|
| Tier 0 · Mission Critical | Outage = direct revenue loss or regulatory breach | 99.95% | < 300ms | < 0.05% | 21.6 min | Payment processing, authentication, core API |
| Tier 1 · Business Critical | Degradation materially impacts user experience | 99.9% | < 500ms | < 0.1% | 43.2 min | Product catalogue, checkout flow, primary dashboards |
| Tier 2 · Standard | Noticed but not immediately blocking | 99.5% | < 1s | < 0.5% | 3.6 hr | Search, recommendations, secondary APIs |
| Tier 3 · Internal / Best Effort | Internal tooling, non-customer-facing | 99.0% | < 2s | < 1% | 7.2 hr | Admin portals, internal reporting, tooling |
The numbers are starter targets. Tighten them where the business demands it; relax them where the engineering cost of the next decimal place exceeds the business value. The point of having four tiers (rather than continuous targets set per service) is operational sanity. Every additional unique target is another set of alert thresholds, dashboards, and runbooks to maintain.
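The budget column is just arithmetic on the availability target over a 30-day window. Here's a minimal sketch of that arithmetic, reproducing the table above; it's illustrative, not the Tracefox SLO calculator:

```python
# Minimal sketch: convert an availability target into a 30-day error budget.
# Tier names and targets mirror the table above; everything else is illustrative.

TIER_AVAILABILITY = {
    "Tier 0": 99.95,
    "Tier 1": 99.9,
    "Tier 2": 99.5,
    "Tier 3": 99.0,
}

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def error_budget_minutes(availability_pct: float,
                         window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Allowed downtime (in minutes) over the window for a given availability target."""
    return (1 - availability_pct / 100) * window_minutes


for tier, target in TIER_AVAILABILITY.items():
    print(f"{tier} ({target}%): {error_budget_minutes(target):.1f} min / 30 days")

# Tier 0 (99.95%): 21.6 min / 30 days
# Tier 1 (99.9%):  43.2 min / 30 days
# Tier 2 (99.5%):  216.0 min / 30 days  (3.6 hr)
# Tier 3 (99.0%):  432.0 min / 30 days  (7.2 hr)
```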
The internal SLO must be tighter than the external SLA
If the SLA (your contractual commitment to customers) is 99.9%, the internal SLO should be 99.95% or higher. Never set the SLO equal to the SLA.
The reason: by the time you've breached an SLO set equal to the SLA, you've also breached the contract. The internal SLO needs headroom so the team gets a signal that the SLA is at risk, with time to respond, before the breach becomes contractual.
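A quick sketch of that headroom, using a 99.9% SLA and a 99.95% internal SLO as illustrative numbers rather than a prescribed gap:

```python
# Illustrative sketch: the buffer an internal SLO leaves before a contractual SLA breach.
# 99.9% SLA and 99.95% SLO are example numbers, not a prescribed gap.

WINDOW_MIN = 30 * 24 * 60  # 43,200 minutes in 30 days


def budget_minutes(target_pct: float) -> float:
    return (1 - target_pct / 100) * WINDOW_MIN


sla_budget = budget_minutes(99.9)    # 43.2 min the contract allows
slo_budget = budget_minutes(99.95)   # 21.6 min before the internal SLO is breached
buffer = sla_budget - slo_budget     # 21.6 min of contractual budget left to respond

print(f"SLO breached with {buffer:.1f} min of contractual budget still unspent")
```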
The tier assignment process
Tier assignment is a 90-minute conversation, not an engineering decision. The structure that works:
- List every production service, including the ones that "don't really matter". Refusing to tier those is usually how a service ends up held to Tier 0 standards while delivering Tier 3 value.
- For each service, name the consequence of a one-hour outage. Specifically. "Payments down for 1 hour = approx. US$230k lost transaction revenue + brand damage + likely SLA breach with merchants" is a Tier 0 conversation. "Internal expense reporting down for 1 hour = mild irritation" is a Tier 3 conversation. Make the consequence concrete.
- Pick the tier where the consequence sits. Tier 0 is reserved for services where outage means direct revenue loss or regulatory exposure. Tier 1 is for services where degradation is materially user-facing. Most services live at Tier 2. A surprising number of services teams want to label Tier 0 actually belong at Tier 1.
- Get sign-off from engineering and product leadership. The tier is a commitment in both directions: engineering commits to maintaining the standard; product commits to respecting the error-budget policy that flows from it. Both signatures, or the tier doesn't take effect. A minimal way to record the outcome is sketched after this list.
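One way to make the sign-off durable is to keep each decision as a record in version control. A minimal sketch of what that record could hold; the field names and values are illustrative, not a Tracefox schema:

```python
# Minimal sketch of a tier-decision record (illustrative fields, not a Tracefox schema).
# Keeping it in version control makes the commitment and its sign-offs auditable.

from dataclasses import dataclass
from datetime import date


@dataclass
class TierDecision:
    service: str
    tier: int                        # 0-3, per the tier table above
    one_hour_outage_consequence: str
    engineering_signoff: str         # who committed to operating at the standard
    product_signoff: str             # who committed to the error-budget policy
    decided_on: date
    review_by: date                  # re-tiering cadence: annually at minimum


payments = TierDecision(
    service="payments",
    tier=0,
    one_hour_outage_consequence="~US$230k lost transaction revenue + merchant SLA breach",
    engineering_signoff="VP Engineering",
    product_signoff="VP Product",
    decided_on=date(2024, 1, 15),
    review_by=date(2025, 1, 15),
)
```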
Common mistakes
Tier 0 inflation
Every team thinks their service is critical. The standard to hold: anything labelled Tier 0 must be operated to Tier 0 standards, which means budget, headcount, capacity overprovisioning, and change control. If the service isn't getting that investment, it isn't actually Tier 0; it's Tier 1 with aspirations. Be honest about the tier you're actually resourcing.
Setting tighter targets without operational change
Moving a service from Tier 1 to Tier 0 is not a target change; it's a delivery model change. Faster alerting, deeper redundancy, change-freeze windows around high-risk launches, possibly 24/7 on-call. If you're not willing to make the operational changes, don't move the tier.
Letting "tiering" become a one-time exercise
Service criticality changes. A new product launch elevates a service from Tier 2 to Tier 1. A retired feature drops one from Tier 1 to Tier 3. Re-tier on a cadence: annually at minimum, quarterly for fast-moving products. Otherwise the tier table becomes archaeology.
Forgetting downstream services
A Tier 0 service that depends on a Tier 2 service is, in practice, a Tier 2 service. Reliability is capped by the weakest dependency. Tier your dependencies along with the services that rely on them, then either upgrade the dependency or downgrade your stated tier to match reality.
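One way to surface the mismatch is to compute each service's effective tier as the loosest tier anywhere in its dependency chain. A minimal sketch, with an illustrative dependency graph:

```python
# Minimal sketch: a service's effective tier is capped by its weakest (highest-numbered)
# dependency. The services and dependency graph here are illustrative.

DECLARED_TIER = {"payments": 0, "auth": 0, "feature-flags": 2, "search": 2}

DEPENDS_ON = {
    "payments": ["auth", "feature-flags"],
    "auth": [],
    "feature-flags": [],
    "search": [],
}


def effective_tier(service: str, seen: frozenset = frozenset()) -> int:
    """Loosest (largest) tier number among the service and everything it depends on."""
    if service in seen:  # guard against dependency cycles
        return DECLARED_TIER[service]
    seen = seen | {service}
    return max(
        [DECLARED_TIER[service]]
        + [effective_tier(dep, seen) for dep in DEPENDS_ON[service]]
    )


for svc, declared in DECLARED_TIER.items():
    actual = effective_tier(svc)
    if actual > declared:
        print(f"{svc}: declared Tier {declared}, effectively Tier {actual} "
              "(weakest dependency wins)")
# payments: declared Tier 0, effectively Tier 2 (weakest dependency wins)
```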
Where to start
List your top ten production services. Tier them by gut feel in 30 minutes, one service at a time. Then run the structured workshop above and compare. The discrepancies between gut tier and decided tier are the most useful conversations you'll have all quarter. The downloadable Blueprint includes the full tier model plus the SLO worksheet template per service.
Starter SLI catalogue
Pick the indicator first; then map it to a tier and target with this guide.
Error-budget policy
Tier determines the target. Target determines the budget. Policy decides what happens when it's spent.
SLO calculator
Compute the budget and burn-rate thresholds for the tier you've picked.