Reference · SLIs

A starter SLI catalogue.

Use this as the starting point. Adjust formulas and targets to match your architecture and business requirements. Every SLI selected must trace to a user-facing outcome. An SLI you cannot act on is a dashboard metric, not an SLI.

v1.0

The principle on every Tracefox engagement: choose SLIs that are measurable today, meaningful to users, and actionable when they breach.

The catalogue below is the starter set: the indicators we deploy first on most engagements, organised by Golden Signal. Adjust formulas and targets to your workload patterns and tier; the categories should travel.

Availability SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| HTTP 5xx error rate | error_requests / total_requests | 5 min rolling | < 0.1% | All |
| Successful health-check rate | passing_health_checks / total_health_checks | 1 min | > 99.9% | All |

The HTTP 5xx error ratio is the most common availability SLI. Health-check success is the cheapest signal, and the fastest to alert on, but it's a coarse indicator: useful for catching service-down conditions, less useful for catching partial degradation.
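The health-check rate can be expressed directly in PromQL. A sketch, assuming a blackbox-exporter-style `probe_success` gauge (1 on a passing probe, 0 otherwise) — substitute whatever your probing setup actually exposes:

```promql
# Fraction of passing probes over the last minute, per target
avg by (instance) (avg_over_time(probe_success[1m]))
```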

Latency SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| p99 API response latency | histogram_quantile(0.99, http_request_duration) | 5 min | < 500ms | API / Web |
| p95 database query latency | histogram_quantile(0.95, db_query_duration) | 5 min | < 100ms | Data tier |
| p50 page load (RUM) | real_user_measurement_p50 | 10 min | < 2s | Web / CDN |

Always histograms, never averages. Averages hide tail latency, and tail latency is where outages hide. Build SLOs on the percentile that represents the experience the worst-case user actually has.
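The difference is visible in the queries themselves. A sketch assuming a standard `http_request_duration_seconds` histogram:

```promql
# Mean latency: smooths the tail away entirely
sum(rate(http_request_duration_seconds_sum[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# p99 from histogram buckets: what the slowest 1% of requests actually experience
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

Two services can report identical means while one of them serves its p99 ten times slower; only the second query would show it.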

Throughput SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| Message processing lag | consumer_lag / nominal_throughput | 1 min | < 30s lag | Async / Queue |
| Successful job completion rate | completed / (completed + failed) | 15 min | > 99.5% | Batch / Workers |

For asynchronous and batch workloads, raw request-rate metrics are the wrong indicator. What matters is whether work is being completed in acceptable time relative to the rate it's arriving. Lag and completion rate are how you measure that.
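Expressing lag in time rather than message count is a single ratio query. A sketch assuming kafka-exporter-style metrics — the names `kafka_consumergroup_lag` and `kafka_consumergroup_current_offset` are assumptions about your exporter, so verify them against what your cluster exposes:

```promql
# Seconds of lag = outstanding messages / current consumption rate
sum by (consumergroup) (kafka_consumergroup_lag)
/
sum by (consumergroup) (rate(kafka_consumergroup_current_offset[5m]))
```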

Saturation SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| Application CPU utilisation | avg(container_cpu_usage / cpu_limit) | 5 min | < 80% | Compute |
| Memory utilisation | container_memory_usage / memory_limit | 5 min | < 85% | Compute |
| Connection pool exhaustion rate | pool_exhaustion / total_requests | 5 min | < 0.01% | Database |

Saturation indicators are leading; they predict failure before it occurs. The bottleneck differs per service: a compute-bound service is rate-limited by CPU; a database-bound service by connection pool depth; a queue-bound service by consumer lag. Pick the indicator for the bottleneck that actually applies.
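For the compute indicators, a sketch using cAdvisor-style container metrics (names as exposed by cAdvisor/kubelet; check them against your environment):

```promql
# Memory utilisation against the container limit
# (containers with no limit expose a 0 limit; the != 0 filter excludes them)
container_memory_working_set_bytes{container!=""}
/
(container_spec_memory_limit_bytes{container!=""} != 0)
```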

Quality / business SLIs

| Indicator | Formula | Window | Starter target | Tier fit |
| --- | --- | --- | --- | --- |
| Business transaction success rate | successful_orders / total_order_attempts | 5 min | > 99.5% | E-commerce |
| Data pipeline completeness | records_delivered / records_expected | 15 min | > 99.9% | Data / ETL |

These are the indicators that catch failures invisible to the technical signals. A service can have 100% HTTP 200s and still be silently failing 5% of business transactions because of a payload-validation bug, a missing integration, or a downstream service that returns "success" but didn't do the work. Always include at least one business-quality SLI for any user-facing journey that matters.
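Instrumented as a counter, the transaction-success SLI is one more ratio query. Here `orders_total` with an `outcome` label is a hypothetical metric your checkout service would need to emit:

```promql
# Business transaction success rate
sum(rate(orders_total{outcome="success"}[5m]))
/
sum(rate(orders_total[5m]))
```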

Cardinality discipline

The biggest preventable mistake in SLI design is high-cardinality labels on metrics. Every distinct combination of label values is a new time series. Add a user_id label to a counter that fires on every request and you have one time series per user, which scales catastrophically in cost and query latency on every backend.

Reserve high-cardinality data for traces and structured logs. Keep label sets on metrics small and consistent. Standard labels for SLI metrics:

service       (e.g. checkout-api)
endpoint      (e.g. /v1/orders)
method        (GET, POST, PUT, DELETE)
status_class  (2xx, 3xx, 4xx, 5xx)
env           (prod, staging, dev)
region        (eu-west-2, us-east-1)
team          (payments, search, growth)

Anything more granular goes on the trace, not the metric.
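If a high-cardinality label has already leaked into a scrape, it can be stripped at ingestion with `metric_relabel_configs`. A sketch for a Prometheus scrape config (job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: checkout-api
    static_configs:
      - targets: ["checkout-api:9090"]
    metric_relabel_configs:
      # Drop the per-user label before the sample is stored
      - action: labeldrop
        regex: user_id
```

Note that dropping a label collapses series that differed only in that label, which can produce duplicate-sample errors; the durable fix is not to emit the label at all.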

Recording rules: make SLI queries cheap

Most SLI calculations involve quotients (error rate = errors/total) and histograms (latency p99 from request_duration). Computing these on every alert evaluation is expensive. Pre-compute them with recording rules:

# Pre-compute error rates per service every 30s
- record: service:error_ratio_5m
  expr: |
    sum by (service) (rate(http_requests_total{status_class="5xx"}[5m]))
    /
    sum by (service) (rate(http_requests_total[5m]))

# Pre-compute p99 latency per service
- record: service:request_duration_p99_5m
  expr: |
    histogram_quantile(0.99,
      sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
    )

Alerts then evaluate against the recording rule, not the raw query: orders of magnitude faster and cheaper.
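An alert built on the recorded series then reduces to a threshold check against the 0.1% starter target from the availability table:

```yaml
- alert: HighErrorRatio
  expr: service:error_ratio_5m > 0.001
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.service }} 5xx ratio above 0.1% for 5 minutes"
```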

Where to start

Pick three SLIs. One availability, one latency, one quality. One service. Get them deployed, get them stable, get them reviewed against actual user behaviour for two weeks. Then expand. The full Blueprint at /resources includes the per-tier target table and the SLO worksheet template we use to capture each SLI definition with sign-off.

Engagement.start()

The hard part isn't picking the SLI. It's agreeing what 'the user journey' means.

The Tracefox assessment runs the SLI selection conversation per critical service, with engineering and the business in the room together, and produces signed-off worksheets your team takes forward.