Reference · SLIs

A starter SLI catalogue.

Use this as the starting point. Adjust formulas and targets to match your architecture and business requirements. Every SLI selected must trace to a user-facing outcome. An SLI you cannot act on is a dashboard metric, not an SLI.

v1.0

The principle on every Tracefox engagement: choose SLIs that are measurable today, meaningful to users, and actionable when they breach.

The catalogue below is the starter set: the indicators we deploy first on most engagements, organised by Golden Signal. Adjust formulas and targets to your workload patterns and tier; the categories should travel.

Availability SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| HTTP 5xx error rate | error_requests / total_requests | 5 min rolling | < 0.1% | All |
| Successful health-check rate | passing_health_checks / total_health_checks | 1 min | > 99.9% | All |

The HTTP 5xx error ratio is the most common availability SLI. Health-check success is the cheapest signal, and the fastest to alert on, but it's a coarse indicator: useful for catching service-down conditions, less useful for catching partial degradation.
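The health-check rate can be expressed directly in PromQL. A sketch, assuming a blackbox-exporter-style `probe_success` gauge (1 on a passing probe, 0 otherwise) — substitute whatever your probing setup actually exposes:

```promql
# Fraction of passing probes over the last minute, per target
avg by (instance) (avg_over_time(probe_success[1m]))
```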

Latency SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| p99 API response latency | histogram_quantile(0.99, http_request_duration) | 5 min | < 500ms | API / Web |
| p95 database query latency | histogram_quantile(0.95, db_query_duration) | 5 min | < 100ms | Data tier |
| p50 page load (RUM) | real_user_measurement_p50 | 10 min | < 2s | Web / CDN |

Always histograms, never averages. Averages hide tail latency, and tail latency is where outages hide. Build SLOs on the percentile that represents the experience the worst-case user actually has.
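The difference is visible in the queries themselves. A sketch assuming a standard `http_request_duration_seconds` histogram:

```promql
# Mean latency: smooths the tail away entirely
sum(rate(http_request_duration_seconds_sum[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# p99 from histogram buckets: what the slowest 1% of requests actually experience
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

Two services can report identical means while one of them serves its p99 ten times slower; only the second query would show it.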

Throughput SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| Message processing lag | consumer_lag / nominal_throughput | 1 min | < 30s lag | Async / Queue |
| Successful job completion rate | completed / (completed + failed) | 15 min | > 99.5% | Batch / Workers |

For asynchronous and batch workloads, raw request-rate metrics are the wrong indicator. What matters is whether work is being completed in acceptable time relative to the rate it's arriving. Lag and completion rate are how you measure that.
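Expressing lag in time rather than message count is a single ratio query. A sketch assuming kafka-exporter-style metrics — the names `kafka_consumergroup_lag` and `kafka_consumergroup_current_offset` are assumptions about your exporter, so verify them against what your cluster exposes:

```promql
# Seconds of lag = outstanding messages / current consumption rate
sum by (consumergroup) (kafka_consumergroup_lag)
/
sum by (consumergroup) (rate(kafka_consumergroup_current_offset[5m]))
```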

Saturation SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| Application CPU utilisation | avg(container_cpu_usage / cpu_limit) | 5 min | < 80% | Compute |
| Memory utilisation | container_memory_usage / memory_limit | 5 min | < 85% | Compute |
| Connection pool exhaustion rate | pool_exhaustion / total_requests | 5 min | < 0.01% | Database |

Saturation indicators are leading; they predict failure before it occurs. The bottleneck differs per service: a compute-bound service is rate-limited by CPU; a database-bound service by connection pool depth; a queue-bound service by consumer lag. Pick the indicator for the bottleneck that actually applies.
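For the compute indicators, a sketch using cAdvisor-style container metrics (names as exposed by cAdvisor/kubelet; check them against your environment):

```promql
# Memory utilisation against the container limit
# (containers with no limit expose a 0 limit; the != 0 filter excludes them)
container_memory_working_set_bytes{container!=""}
/
(container_spec_memory_limit_bytes{container!=""} != 0)
```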

Quality / business SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| Business transaction success rate | successful_orders / total_order_attempts | 5 min | > 99.5% | E-commerce |
| Data pipeline completeness | records_delivered / records_expected | 15 min | > 99.9% | Data / ETL |

These are the indicators that catch failures invisible to the technical signals. A service can have 100% HTTP 200s and still be silently failing 5% of business transactions because of a payload-validation bug, a missing integration, or a downstream service that returns "success" but didn't do the work. Always include at least one business-quality SLI for any user-facing journey that matters.
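Instrumented as a counter, the transaction-success SLI is one more ratio query. Here `orders_total` with an `outcome` label is a hypothetical metric your checkout service would need to emit:

```promql
# Business transaction success rate
sum(rate(orders_total{outcome="success"}[5m]))
/
sum(rate(orders_total[5m]))
```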

Cardinality discipline

The biggest preventable mistake in SLI design is high-cardinality labels on metrics. Every distinct combination of label values is a new time series. Add a user_id label to a counter that fires on every request and you have one time series per user, which scales catastrophically in cost and query latency on every backend.

Reserve high-cardinality data for traces and structured logs. Keep label sets on metrics small and consistent. Standard labels for SLI metrics:

service       (e.g. checkout-api)
endpoint      (e.g. /v1/orders)
method        (GET, POST, PUT, DELETE)
status_class  (2xx, 3xx, 4xx, 5xx)
env           (prod, staging, dev)
region        (eu-west-2, us-east-1)
team          (payments, search, growth)

Anything more granular goes on the trace, not the metric.
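If a high-cardinality label has already leaked into a scrape, it can be stripped at ingestion with `metric_relabel_configs`. A sketch for a Prometheus scrape config (job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: checkout-api
    static_configs:
      - targets: ["checkout-api:9090"]
    metric_relabel_configs:
      # Drop the per-user label before the sample is stored
      - action: labeldrop
        regex: user_id
```

Note that dropping a label collapses series that differed only in that label, which can produce duplicate-sample errors; the durable fix is not to emit the label at all.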

Recording rules: make SLI queries cheap

Most SLI calculations involve quotients (error rate = errors/total) and histograms (latency p99 from request_duration). Computing these on every alert evaluation is expensive. Pre-compute them with recording rules:

# Pre-compute error rates per service every 30s
- record: service:error_ratio_5m
  expr: |
    sum by (service) (rate(http_requests_total{status_class="5xx"}[5m]))
    /
    sum by (service) (rate(http_requests_total[5m]))

# Pre-compute p99 latency per service
- record: service:request_duration_p99_5m
  expr: |
    histogram_quantile(0.99,
      sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
    )

Alerts then evaluate against the recording rule, not the raw query: orders of magnitude faster and cheaper.
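An alert built on the recorded series then reduces to a threshold check against the 0.1% starter target from the availability table:

```yaml
- alert: HighErrorRatio
  expr: service:error_ratio_5m > 0.001
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.service }} 5xx ratio above 0.1% for 5 minutes"
```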

Where to start

Pick three SLIs. One availability, one latency, one quality. One service. Get them deployed, get them stable, get them reviewed against actual user behaviour for two weeks. Then expand. The full Blueprint at /resources includes the per-tier target table and the SLO worksheet template we use to capture each SLI definition with sign-off.

Engagement.start()

The hard part isn't picking the SLI. It's agreeing what 'the user journey' means.

The Tracefox assessment runs the SLI selection conversation per critical service, with engineering and the business in the room together, and produces signed-off worksheets your team takes forward.