A starter SLI catalogue.
Use this catalogue as a starting point, not a prescription: adjust formulas and targets to match your architecture, workload patterns, and business requirements; the categories should travel even where the numbers don't.
The principle on every Tracefox engagement: choose SLIs that are measurable today, meaningful to users, and actionable when they breach. Every SLI must trace to a user-facing outcome, because an SLI you cannot act on is a dashboard metric, not an SLI.
The catalogue below is the starter set: the indicators we deploy first on most engagements, organised by Golden Signal.
Availability SLIs
| Indicator | Formula | Window | Starter target | Tier fit |
|---|---|---|---|---|
| HTTP 5xx error rate | error_requests / total_requests | 5 min rolling | < 0.1% | All |
| Successful health-check rate | passing_health_checks / total_health_checks | 1 min | > 99.9% | All |
The HTTP 5xx error ratio is the most common availability SLI. Health-check success is the cheapest signal, and the fastest to alert on, but it's a coarse indicator: useful for catching service-down conditions, less useful for catching partial degradation.
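As a concrete measurement, if health checks are probed by something like blackbox_exporter (an assumption; substitute your own prober's success gauge), the one-minute success rate is a single expression:

```promql
# Fraction of probes that passed over the last minute (1.0 = all passing).
# probe_success is blackbox_exporter's 0/1 gauge; the job label is illustrative.
avg_over_time(probe_success{job="health-checks"}[1m])
```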
Latency SLIs
| Indicator | Formula | Window | Starter target | Tier fit |
|---|---|---|---|---|
| p99 API response latency | histogram_quantile(0.99, http_request_duration) | 5 min | < 500ms | API / Web |
| p95 database query latency | histogram_quantile(0.95, db_query_duration) | 5 min | < 100ms | Data tier |
| p50 page load (RUM) | real_user_measurement_p50 | 10 min | < 2s | Web / CDN |
Always histograms, never averages. Averages hide tail latency, and tail latency is where outages hide. Build SLOs on the percentile that represents the experience the worst-case user actually has.
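To make the contrast concrete, here is a sketch of both queries against a conventional Prometheus histogram (the http_request_duration_seconds metric name is an assumption): the average can look healthy while the p99 exposes the tail.

```promql
# Mean latency: smooths over the tail and can look fine during a partial outage.
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])

# p99 latency: the experience of the worst 1% of requests.
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```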
Throughput SLIs
| Indicator | Formula | Window | Starter target | Tier fit |
|---|---|---|---|---|
| Message processing lag | consumer_lag / nominal_throughput | 1 min | < 30s lag | Async / Queue |
| Successful job completion rate | completed / (completed + failed) | 15 min | > 99.5% | Batch / Workers |
For asynchronous and batch workloads, raw request-rate metrics are the wrong indicator. What matters is whether work is being completed in acceptable time relative to the rate it's arriving. Lag and completion rate are how you measure that.
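A sketch using kafka_exporter-style metrics (kafka_consumergroup_lag and kafka_consumergroup_current_offset are that exporter's names; substitute your broker's equivalents): lag in seconds is approximated by dividing outstanding messages by the observed consume rate.

```promql
# Outstanding messages per consumer group and topic...
sum by (consumergroup, topic) (kafka_consumergroup_lag)
/
# ...divided by the rate the group is committing offsets (messages/sec).
# Offsets only increase, so rate() behaves as it would on a counter.
sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]))
```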
Saturation SLIs
| Indicator | Formula | Window | Starter target | Tier fit |
|---|---|---|---|---|
| Application CPU utilisation | avg(container_cpu_usage / cpu_limit) | 5 min | < 80% | Compute |
| Memory utilisation | container_memory_usage / memory_limit | 5 min | < 85% | Compute |
| Connection pool exhaustion rate | pool_exhaustion / total_requests | 5 min | < 0.01% | Database |
Saturation indicators are leading; they predict failure before it occurs. The bottleneck differs per service: a compute-bound service is rate-limited by CPU; a database-bound service by connection pool depth; a queue-bound service by consumer lag. Pick the indicator for the bottleneck that actually applies.
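A sketch for the compute rows, assuming cAdvisor and kube-state-metrics are both scraped (container_cpu_usage_seconds_total and kube_pod_container_resource_limits are those exporters' metric names):

```promql
# CPU used as a fraction of the container's CPU limit (1.0 = fully saturated).
sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total[5m]))
/
sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="cpu"})
```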
Quality / business SLIs
| Indicator | Formula | Window | Starter target | Tier fit |
|---|---|---|---|---|
| Business transaction success rate | successful_orders / total_order_attempts | 5 min | > 99.5% | E-commerce |
| Data pipeline completeness | records_delivered / records_expected | 15 min | > 99.9% | Data / ETL |
These are the indicators that catch failures invisible to the technical signals. A service can have 100% HTTP 200s and still be silently failing 5% of business transactions because of a payload-validation bug, a missing integration, or a downstream service that returns "success" but didn't do the work. Always include at least one business-quality SLI for any user-facing journey that matters.
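A minimal sketch as a recording rule, assuming the checkout service emits a counter like orders_total with an outcome label (both names are hypothetical; instrument whatever marks the business transaction as genuinely complete):

```yaml
# orders_total{outcome="success"|"failure"} is a hypothetical counter;
# the point is to count business outcomes, not HTTP statuses.
- record: service:order_success_ratio_5m
  expr: |
    sum(rate(orders_total{outcome="success"}[5m]))
    /
    sum(rate(orders_total[5m]))
```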
Cardinality discipline
The biggest preventable mistake in SLI design is high-cardinality labels on metrics. Every distinct combination of label values is a new time series. Add a user_id label to a counter that fires on every request and you have one time series per user, which scales catastrophically in cost and query latency on every backend.
Reserve high-cardinality data for traces and structured logs. Keep label sets on metrics small and consistent. Standard labels for SLI metrics:
- service (e.g. checkout-api)
- endpoint (e.g. /v1/orders)
- method (GET, POST, PUT, DELETE)
- status_class (2xx, 3xx, 4xx, 5xx)
- env (prod, staging, dev)
- region (eu-west-2, us-east-1)
- team (payments, search, growth)

Anything more granular goes on the trace, not the metric.
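A quick way to audit where cardinality is coming from is to count active series per metric name directly in PromQL. Note the audit query is itself expensive on a large installation; run it off-peak or against a staging Prometheus.

```promql
# Ten worst offenders by active series count.
topk(10, count by (__name__)({__name__=~".+"}))
```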
Recording rules: make SLI queries cheap
Most SLI calculations involve quotients (error rate = errors/total) and histograms (latency p99 from request_duration). Computing these on every alert evaluation is expensive. Pre-compute them with recording rules:
```yaml
groups:
  - name: sli_recording_rules
    interval: 30s
    rules:
      # Pre-compute the 5xx error ratio per service every 30s
      - record: service:error_ratio_5m
        expr: |
          sum by (service) (rate(http_requests_total{status_class="5xx"}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
      # Pre-compute p99 latency per service
      - record: service:request_duration_p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

Alerts then evaluate against the recording rule, not the raw query: orders of magnitude faster and cheaper.
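An alert built on the rule above then reduces to a threshold comparison. The alert name, threshold, and severity label here are illustrative:

```yaml
- alert: ServiceErrorRatioHigh
  expr: service:error_ratio_5m > 0.001  # the 0.1% starter target from the table above
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.service }}: 5xx ratio above 0.1% for 5 minutes"
```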
Where to start
Pick three SLIs. One availability, one latency, one quality. One service. Get them deployed, get them stable, get them reviewed against actual user behaviour for two weeks. Then expand. The full Blueprint at /resources includes the per-tier target table and the SLO worksheet template we use to capture each SLI definition with sign-off.