The standard library we bring into every engagement.
Adopted as-is for new environments. Adapted for existing clients based on their assessment score. The same vocabulary applied consistently, so any engineer on the team can walk into any client and know where things stand without renegotiating first principles.
Four signals are sufficient to characterise the health of any service.
The mandatory baseline. If a team can instrument only one thing, it should be these four signals, on every user-facing service. CPU and memory are implementation details; the Golden Signals measure what users actually experience. A minimal instrumentation sketch follows the four signals below.
Latency
Time to serve a request, measured for successful and failed paths separately. Histograms (p50, p90, p95, p99, p999). Never averages.
Pitfall. Averaging hides tail latency. Failures that return in 1ms drag the distribution down and make an unhealthy service look fine, which is why successful and failed paths are measured separately.
Traffic
Demand on the system: requests per second, messages consumed, active sessions. Counters, rate over time.
Pitfall. A traffic drop is not improved performance. It's often an upstream failure that didn't page.
Errors
Rate of failed requests: explicit (5xx), implicit (wrong content), or by policy (SLO breach). Separate 4xx from 5xx.
Pitfall. A 200 OK with an error payload is invisible without application-level instrumentation. Add it.
Saturation
How full the service is: CPU, memory, disk, thread pool, queue depth, connection pool. Utilisation %.
Pitfall. Not all saturation is equal. A queue at 90% is more urgent than CPU at 90%. Know your bottleneck.
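Instrumenting the four signals is mostly a matter of choosing the right metric types. The sketch below uses Python's prometheus_client as one possible backend; the metric names, buckets, and the handle()/report_saturation() helpers are illustrative assumptions, not a prescribed scheme.

```python
import time

from prometheus_client import Counter, Gauge, Histogram

# Latency: a histogram labelled by outcome, so fast failures cannot mask slow successes.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time to serve a request",
    ["route", "outcome"],                                 # bounded label values only
    buckets=(0.05, 0.1, 0.3, 0.5, 1.0, 2.0, 5.0),
)

# Traffic and errors: counters, rated over time by the metrics backend.
REQUESTS_TOTAL = Counter(
    "http_requests_total", "Requests received", ["route", "status_class"]
)

# Saturation: utilisation of the known bottleneck, as a fraction of capacity.
QUEUE_DEPTH = Gauge("worker_queue_utilisation", "Queue depth as a fraction of capacity")


def handle(route: str, do_work) -> None:
    """Wrap a request handler so latency, traffic, and errors are always recorded."""
    outcome, status_class = "error", "5xx"
    start = time.monotonic()
    try:
        do_work()
        outcome, status_class = "success", "2xx"
    finally:
        REQUEST_LATENCY.labels(route=route, outcome=outcome).observe(time.monotonic() - start)
        REQUESTS_TOTAL.labels(route=route, status_class=status_class).inc()


def report_saturation(depth: int, capacity: int) -> None:
    """Export saturation of the bottleneck resource, here a worker queue."""
    QUEUE_DEPTH.set(depth / capacity)
```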
Choose indicators that are measurable today, meaningful to users, and actionable when they breach.
An SLI you cannot act on is a dashboard metric, not an SLI. Every selected indicator must trace to a user-facing outcome. We bring a starter catalogue (availability, latency, throughput, saturation, business-quality) and adapt it per client.
On cardinality: keep label sets small and consistent. High-cardinality labels (user_id, request_id, IP) on metrics create cost and performance problems in every backend. Reserve that data for traces and structured logs.
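One way to keep that split honest, sketched in Python with prometheus_client and the standard logging module; the checkout metric and field names are hypothetical.

```python
import json
import logging

from prometheus_client import Counter

log = logging.getLogger("checkout")

# Metric labels stay bounded: a handful of routes and status classes, nothing per-user.
CHECKOUT_FAILURES = Counter(
    "checkout_failures_total", "Failed checkout attempts", ["route", "status_class"]
)


def record_failure(route: str, status_class: str, user_id: str, request_id: str) -> None:
    # Aggregate view: cheap, low-cardinality, safe to alert on.
    CHECKOUT_FAILURES.labels(route=route, status_class=status_class).inc()
    # Per-request detail: user_id and request_id belong in the structured log (and the
    # trace), where high cardinality is expected, never in metric labels.
    log.error(json.dumps({
        "event": "checkout_failed",
        "route": route,
        "status_class": status_class,
        "user_id": user_id,
        "request_id": request_id,
    }))
```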
An SLO is a commitment, not a dashboard.
Defining an SLO requires agreement between engineering, product, and the business. It's not a threshold set by whoever happened to write the alert. Four steps, one worksheet per service per SLO.
Identify the user journey.
What is the user trying to accomplish? "Complete a purchase within ten seconds." "Receive a notification within sixty." Specific, observable, owned.
Select the SLI.
Which standard indicator best represents whether that journey is succeeding? Error rate and latency are the most common starting points.
Set the target and window.
What percentage of the time must the SLI be satisfied? Over what window? Start conservatively. You can always tighten.
Calculate the budget. Agree the policy.
What happens when the budget is consumed? This must be agreed with the business before the SLO is active. A policy in engineering's head will not survive product pressure on the first call.
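The budget in step four is simple arithmetic: the allowed unreliability fraction multiplied by the window. A worked sketch in Python, using the 99.95% / 30-day example from the chain below; the other three targets match the tier table.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime (or bad-event time) allowed over the window for a given target."""
    return window * (1.0 - slo_target)


window = timedelta(days=30)

print(error_budget(0.9995, window))    # 0:21:36 -> 21.6 minutes per 30 days
print(error_budget(0.999, window))     # 0:43:12 -> 43.2 minutes per 30 days
print(error_budget(0.995, window))     # 3:36:00
print(error_budget(0.99, window))      # 7:12:00
```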
Asset pending
Five-stage chain showing how a user journey maps through to an alert: USER JOURNEY → SLI → SLO → ERROR BUDGET → BURN-RATE ALERT, with worked example values at each stage.
Horizontal five-stage chain diagram on paper-white #f7f9fb. Five boxes connected by 1px obsidian arrows, left to right: 'USER JOURNEY' → 'SLI' → 'SLO' → 'ERROR BUDGET' → 'BURN-RATE ALERT'. Each box rendered as a 1px-bordered card with the stage name in JetBrains Mono caps at the top, plus a worked example below in mono: 'checkout.place_order', 'success_rate', '99.95% / 30d', '21.6 min / 30d', '14.4× over 1h'. Above the chain: a single thin electric-blue #0066FF trace line connecting the entire flow. Below: a small 'POLICY' callout linked to the ERROR BUDGET box showing the four budget states (Healthy / Caution / Warning / Critical) as small status pips. 16:9, generous whitespace, no fills, schematic style.
/img/methodology/slo-chain.png

Not every service warrants the same target.
Applying a 99.95% availability SLO to an internal admin portal creates toil and alert noise for no return. Tier assignment is agreed with the client's engineering and product leadership.
| Tier | Availability | Latency p99 | Error rate | 30d budget | Examples |
|---|---|---|---|---|---|
| Tier 0 · Mission Critical | 99.95% | < 300ms | < 0.05% | 21.6 min | Payments, auth, core API |
| Tier 1 · Business Critical | 99.9% | < 500ms | < 0.1% | 43.2 min | Catalogue, checkout, primary dashboards |
| Tier 2 · Standard | 99.5% | < 1s | < 0.5% | 3.6 hr | Search, recommendations, secondary APIs |
| Tier 3 · Internal | 99.0% | < 2s | < 1% | 7.2 hr | Admin portals, internal reporting |
Rule: never set the SLO equal to the SLA. The internal target must be tighter than the external commitment, so the team sees risk building before the contractual commitment is breached.
A policy that decides what happens before the error budget runs out.
Every SLO must have a written, signed-off policy defining behaviour at each budget state. The standard policy below is our starting point. It adapts per client, but it always exists in writing before the SLO goes live.
| State | Threshold | Posture | Required action | Owner |
|---|---|---|---|---|
| Healthy | > 50% remaining | Normal | Business as usual. Feature velocity unrestricted. | Engineering |
| Caution | 25–50% | Monitor | Increase alert sensitivity. Reliability risks reviewed in sprint planning. No new tech debt. | Eng Lead |
| Warning | 10–25% | Slow Down | Freeze non-critical feature work. Prioritise reliability. Notify product of risk. | Eng Lead + Product |
| Critical | < 10% | Reliability Focus | All capacity to reliability. Production change freeze. Daily SLO review. | VP Engineering |
| Exhausted | 0% (breached) | Incident | Treat as P1. Incident Commander engaged. Customer comms assessed. Formal PIR. | IC + Leadership |
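The thresholds are mechanical, so the current posture should be computed, not eyeballed. A minimal sketch, assuming the budget tracker exposes the remaining fraction; the state names follow the table above, everything else is an assumption.

```python
def budget_state(remaining_fraction: float) -> str:
    """Map remaining error budget (1.0 = untouched, 0.0 = gone) to a policy state."""
    if remaining_fraction <= 0.0:
        return "Exhausted"
    if remaining_fraction < 0.10:
        return "Critical"
    if remaining_fraction < 0.25:
        return "Warning"
    if remaining_fraction <= 0.50:
        return "Caution"
    return "Healthy"


assert budget_state(0.80) == "Healthy"
assert budget_state(0.40) == "Caution"
assert budget_state(0.15) == "Warning"
assert budget_state(0.05) == "Critical"
assert budget_state(0.0) == "Exhausted"
```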
Burn-Rate Alerts
Alert on how fast the budget is burning, not on the instantaneous error rate.
Threshold alerts fire late and produce noise. A brief 2% spike that resolves in ten minutes is largely harmless; a sustained 1% over thirty days burns through a 99.9% SLO. Burn-rate alerts measure the rate of consumption, and page only when it's fast enough to exhaust the budget before the team can respond. We use two-window, two-burn-rate by default.
Burn rate > 14.4× over 1hr
Pages on-call. Incident Commander engaged. At that rate the full 30-day budget is exhausted in roughly two days (about 50 hours) if unresolved. Declares an incident if not resolved in fifteen minutes.
Burn rate > 6× over 6hr
Alerts the incidents channel. Owner assigned within thirty minutes. Exhausts the 30-day budget in under five days. Early warning, not a page.
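The arithmetic behind both alerts, sketched in Python for a 30-day SLO. In practice the evaluation lives in the alerting backend's query language; the windowed error rates and the evaluate() helper here are assumptions made to show the pairing.

```python
from dataclasses import dataclass


@dataclass
class BurnRateAlert:
    window_hours: float
    threshold: float            # multiple of the SLO's allowed error rate
    action: str


# Default two-window, two-burn-rate policy.
ALERTS = [
    BurnRateAlert(window_hours=1, threshold=14.4, action="page on-call"),
    BurnRateAlert(window_hours=6, threshold=6.0, action="notify incidents channel"),
]


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the budget is being consumed."""
    return observed_error_rate / (1.0 - slo_target)     # e.g. 1 - 0.9995 = 0.0005


def evaluate(error_rate_by_window: dict[float, float], slo_target: float) -> list[str]:
    """Return the actions whose windowed burn rate exceeds its threshold."""
    return [
        alert.action
        for alert in ALERTS
        if burn_rate(error_rate_by_window[alert.window_hours], slo_target) > alert.threshold
    ]


# A 1% error rate over the last hour against a 99.95% SLO is a 20x burn:
# fast enough to exhaust the 30-day budget in 36 hours, so it pages.
print(evaluate({1: 0.010, 6: 0.002}, slo_target=0.9995))    # ['page on-call']
```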
Alert hygiene · non-negotiable
Every alert in production must have a severity, a linked runbook, a named team owner, and a clear escalation path. Anything missing one of the four is disabled or downgraded to informational until it has them. We audit this on engagement start.
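The audit itself can be a small script over exported alert definitions. A sketch, assuming each alert can be represented with the four required fields; the Alert shape and field names are hypothetical, not a real exporter.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_FIELDS = ("severity", "runbook_url", "owner_team", "escalation_path")


@dataclass
class Alert:
    name: str
    severity: Optional[str] = None
    runbook_url: Optional[str] = None
    owner_team: Optional[str] = None
    escalation_path: Optional[str] = None


def audit(alerts: list[Alert]) -> dict[str, list[str]]:
    """Per alert, list which of the four required fields are missing."""
    findings = {}
    for alert in alerts:
        missing = [field for field in REQUIRED_FIELDS if not getattr(alert, field)]
        if missing:
            findings[alert.name] = missing      # candidates to disable or downgrade
    return findings


print(audit([Alert("checkout-error-rate", severity="P1")]))
# {'checkout-error-rate': ['runbook_url', 'owner_team', 'escalation_path']}
```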
The methodology lands fastest when paired with the assessment.
The methodology is open and free to read. The assessment is what tells us which parts apply to you, in what order, and on what timeline.