The standard library we bring into every engagement.
Adopted as-is for new environments. Adapted for existing clients based on their assessment score. The same vocabulary applied consistently, so any engineer on the team can walk into any client and know where things stand without renegotiating first principles.
Four signals are sufficient to characterise the health of any service.
The mandatory baseline. If a team can instrument only one thing, it should be these four signals, on every user-facing service. CPU and memory are implementation details; the Golden Signals measure what users actually experience. A minimal instrumentation sketch follows the four signals below.
Latency
Time to serve a request, measured for successful and failed paths separately. Histograms (p50, p90, p95, p99, p999). Never averages.
Pitfall. Averaging hides tail latency. Failures that return in 1ms drag the distribution down and make an unhealthy service look fine, which is why successful and failed paths are measured separately.
Traffic
Demand on the system: requests per second, messages consumed, active sessions. Counters, rate over time.
Pitfall. A traffic drop is not improved performance. It's often an upstream failure that didn't page.
Errors
Rate of failed requests: explicit (5xx), implicit (wrong content), or by policy (SLO breach). Separate 4xx from 5xx.
Pitfall. A 200 OK with an error payload is invisible without application-level instrumentation. Add it.
Saturation
How full the service is: CPU, memory, disk, thread pool, queue depth, connection pool. Utilisation %.
Pitfall. Not all saturation is equal. A queue at 90% is more urgent than CPU at 90%. Know your bottleneck.
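Instrumenting the four signals is mostly a matter of choosing the right metric types. The sketch below uses Python's prometheus_client as one possible backend; the metric names, buckets, and the handle()/report_saturation() helpers are illustrative assumptions, not a prescribed scheme.

```python
import time

from prometheus_client import Counter, Gauge, Histogram

# Latency: a histogram labelled by outcome, so fast failures cannot mask slow successes.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time to serve a request",
    ["route", "outcome"],                                 # bounded label values only
    buckets=(0.05, 0.1, 0.3, 0.5, 1.0, 2.0, 5.0),
)

# Traffic and errors: counters, rated over time by the metrics backend.
REQUESTS_TOTAL = Counter(
    "http_requests_total", "Requests received", ["route", "status_class"]
)

# Saturation: utilisation of the known bottleneck, as a fraction of capacity.
QUEUE_DEPTH = Gauge("worker_queue_utilisation", "Queue depth as a fraction of capacity")


def handle(route: str, do_work) -> None:
    """Wrap a request handler so latency, traffic, and errors are always recorded."""
    outcome, status_class = "error", "5xx"
    start = time.monotonic()
    try:
        do_work()
        outcome, status_class = "success", "2xx"
    finally:
        REQUEST_LATENCY.labels(route=route, outcome=outcome).observe(time.monotonic() - start)
        REQUESTS_TOTAL.labels(route=route, status_class=status_class).inc()


def report_saturation(depth: int, capacity: int) -> None:
    """Export saturation of the bottleneck resource, here a worker queue."""
    QUEUE_DEPTH.set(depth / capacity)
```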
Choose indicators that are measurable today, meaningful to users, and actionable when they breach.
An SLI you cannot act on is a dashboard metric, not an SLI. Every selected indicator must trace to a user-facing outcome. We bring a starter catalogue (availability, latency, throughput, saturation, business-quality) and adapt it per client.
On cardinality: keep label sets small and consistent. High-cardinality labels (user_id, request_id, IP) on metrics create cost and performance problems in every backend. Reserve that data for traces and structured logs.
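One way to keep that split honest, sketched in Python with prometheus_client and the standard logging module; the checkout metric and field names are hypothetical.

```python
import json
import logging

from prometheus_client import Counter

log = logging.getLogger("checkout")

# Metric labels stay bounded: a handful of routes and status classes, nothing per-user.
CHECKOUT_FAILURES = Counter(
    "checkout_failures_total", "Failed checkout attempts", ["route", "status_class"]
)


def record_failure(route: str, status_class: str, user_id: str, request_id: str) -> None:
    # Aggregate view: cheap, low-cardinality, safe to alert on.
    CHECKOUT_FAILURES.labels(route=route, status_class=status_class).inc()
    # Per-request detail: user_id and request_id belong in the structured log (and the
    # trace), where high cardinality is expected, never in metric labels.
    log.error(json.dumps({
        "event": "checkout_failed",
        "route": route,
        "status_class": status_class,
        "user_id": user_id,
        "request_id": request_id,
    }))
```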
An SLO is a commitment, not a dashboard.
Defining an SLO requires agreement between engineering, product, and the business. It's not a threshold set by whoever happened to write the alert. Four steps, one worksheet per service per SLO.
Identify the user journey.
What is the user trying to accomplish? "Complete a purchase within ten seconds." "Receive a notification within sixty." Specific, observable, owned.
Select the SLI.
Which standard indicator best represents whether that journey is succeeding? Error rate and latency are the most common starting points.
Set the target and window.
What percentage of the time must the SLI be satisfied? Over what window? Start conservatively. You can always tighten.
Calculate the budget. Agree the policy.
What happens when the budget is consumed? This must be agreed with the business before the SLO is active. A policy in engineering's head will not survive product pressure on the first call.
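The budget in step four is simple arithmetic: the allowed unreliability fraction multiplied by the window. A worked sketch in Python, using the 99.95% / 30-day example from the chain below; the other three targets match the tier table.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime (or bad-event time) allowed over the window for a given target."""
    return window * (1.0 - slo_target)


window = timedelta(days=30)

print(error_budget(0.9995, window))    # 0:21:36 -> 21.6 minutes per 30 days
print(error_budget(0.999, window))     # 0:43:12 -> 43.2 minutes per 30 days
print(error_budget(0.995, window))     # 3:36:00
print(error_budget(0.99, window))      # 7:12:00
```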
Asset pending
Five-stage chain showing how a user journey maps through to an alert: USER JOURNEY → SLI → SLO → ERROR BUDGET → BURN-RATE ALERT, with worked example values at each stage.
Horizontal five-stage chain diagram on paper-white #f7f9fb. Five boxes connected by 1px obsidian arrows, left to right: 'USER JOURNEY' → 'SLI' → 'SLO' → 'ERROR BUDGET' → 'BURN-RATE ALERT'. Each box rendered as a 1px-bordered card with the stage name in JetBrains Mono caps at the top, plus a worked example below in mono: 'checkout.place_order', 'success_rate', '99.95% / 30d', '21.6 min / 30d', '14.4× over 1h'. Above the chain: a single thin electric-blue #0066FF trace line connecting the entire flow. Below: a small 'POLICY' callout linked to the ERROR BUDGET box showing the four budget states (Healthy / Caution / Warning / Critical) as small status pips. 16:9, generous whitespace, no fills, schematic style.
/img/methodology/slo-chain.png

Not every service warrants the same target.
Applying a 99.95% availability SLO to an internal admin portal creates toil and alert noise for no return. Tier assignment is agreed with the client's engineering and product leadership.
| Tier | Availability | Latency p99 | Error rate | 30d budget | Examples |
|---|---|---|---|---|---|
| Tier 0 · Mission Critical | 99.95% | < 300ms | < 0.05% | 21.6 min | Payments, auth, core API |
| Tier 1 · Business Critical | 99.9% | < 500ms | < 0.1% | 43.2 min | Catalogue, checkout, primary dashboards |
| Tier 2 · Standard | 99.5% | < 1s | < 0.5% | 3.6 hr | Search, recommendations, secondary APIs |
| Tier 3 · Internal | 99.0% | < 2s | < 1% | 7.2 hr | Admin portals, internal reporting |
Rule: never set the SLO equal to the SLA. The internal target must be tighter than the external commitment, so the team sees risk building before the contractual commitment is breached.
A policy that decides what happens before the error budget runs out.
Every SLO must have a written, signed-off policy defining behaviour at each budget state. The standard policy below is our starting point. It adapts per client, but it always exists in writing before the SLO goes live.
| State | Threshold | Posture | Required action | Owner |
|---|---|---|---|---|
| Healthy | > 50% remaining | Normal | Business as usual. Feature velocity unrestricted. | Engineering |
| Caution | 25–50% | Monitor | Increase alert sensitivity. Reliability risks reviewed in sprint planning. No new tech debt. | Eng Lead |
| Warning | 10–25% | Slow Down | Freeze non-critical feature work. Prioritise reliability. Notify product of risk. | Eng Lead + Product |
| Critical | < 10% | Reliability Focus | All capacity to reliability. Production change freeze. Daily SLO review. | VP Engineering |
| Exhausted | 0% (breached) | Incident | Treat as P1. Incident Commander engaged. Customer comms assessed. Formal PIR. | IC + Leadership |
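The thresholds are mechanical, so the current posture should be computed, not eyeballed. A minimal sketch, assuming the budget tracker exposes the remaining fraction; the state names follow the table above, everything else is an assumption.

```python
def budget_state(remaining_fraction: float) -> str:
    """Map remaining error budget (1.0 = untouched, 0.0 = gone) to a policy state."""
    if remaining_fraction <= 0.0:
        return "Exhausted"
    if remaining_fraction < 0.10:
        return "Critical"
    if remaining_fraction < 0.25:
        return "Warning"
    if remaining_fraction <= 0.50:
        return "Caution"
    return "Healthy"


assert budget_state(0.80) == "Healthy"
assert budget_state(0.40) == "Caution"
assert budget_state(0.15) == "Warning"
assert budget_state(0.05) == "Critical"
assert budget_state(0.0) == "Exhausted"
```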
Burn-Rate Alerts
Alert on how fast the budget is burning, not on the instantaneous error rate.
Threshold alerts fire late and produce noise. A brief 2% spike that resolves in ten minutes is largely harmless; a sustained 1% over thirty days burns through a 99.9% SLO. Burn-rate alerts measure the rate of consumption, and page only when it's fast enough to exhaust the budget before the team can respond. We use two-window, two-burn-rate by default.
Burn rate > 14.4× over 1hr
Pages on-call. Incident Commander engaged. At that rate the full 30-day budget is exhausted in roughly two days (about 50 hours) if unresolved. Declares an incident if not resolved in fifteen minutes.
Burn rate > 6× over 6hr
Alerts the incidents channel. Owner assigned within thirty minutes. Exhausts the 30-day budget in under five days. Early warning, not a page.
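The arithmetic behind both alerts, sketched in Python for a 30-day SLO. In practice the evaluation lives in the alerting backend's query language; the windowed error rates and the evaluate() helper here are assumptions made to show the pairing.

```python
from dataclasses import dataclass


@dataclass
class BurnRateAlert:
    window_hours: float
    threshold: float            # multiple of the SLO's allowed error rate
    action: str


# Default two-window, two-burn-rate policy.
ALERTS = [
    BurnRateAlert(window_hours=1, threshold=14.4, action="page on-call"),
    BurnRateAlert(window_hours=6, threshold=6.0, action="notify incidents channel"),
]


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the budget is being consumed."""
    return observed_error_rate / (1.0 - slo_target)     # e.g. 1 - 0.9995 = 0.0005


def evaluate(error_rate_by_window: dict[float, float], slo_target: float) -> list[str]:
    """Return the actions whose windowed burn rate exceeds its threshold."""
    return [
        alert.action
        for alert in ALERTS
        if burn_rate(error_rate_by_window[alert.window_hours], slo_target) > alert.threshold
    ]


# A 1% error rate over the last hour against a 99.95% SLO is a 20x burn:
# fast enough to exhaust the 30-day budget in 36 hours, so it pages.
print(evaluate({1: 0.010, 6: 0.002}, slo_target=0.9995))    # ['page on-call']
```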
Alert hygiene · non-negotiable
Every alert in production must have a severity, a linked runbook, a named team owner, and a clear escalation path. Anything missing one of the four is disabled or downgraded to informational until it has them. We audit this on engagement start.
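The audit itself can be a small script over exported alert definitions. A sketch, assuming each alert can be represented with the four required fields; the Alert shape and field names are hypothetical, not a real exporter.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_FIELDS = ("severity", "runbook_url", "owner_team", "escalation_path")


@dataclass
class Alert:
    name: str
    severity: Optional[str] = None
    runbook_url: Optional[str] = None
    owner_team: Optional[str] = None
    escalation_path: Optional[str] = None


def audit(alerts: list[Alert]) -> dict[str, list[str]]:
    """Per alert, list which of the four required fields are missing."""
    findings = {}
    for alert in alerts:
        missing = [field for field in REQUIRED_FIELDS if not getattr(alert, field)]
        if missing:
            findings[alert.name] = missing      # candidates to disable or downgrade
    return findings


print(audit([Alert("checkout-error-rate", severity="P1")]))
# {'checkout-error-rate': ['runbook_url', 'owner_team', 'escalation_path']}
```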
The methodology lands fastest when paired with the assessment.
The methodology is open and free to read. The assessment is what tells us which parts apply to you, in what order, and on what timeline.