Methodology · v1.0 · open

The standard library we bring into every engagement.

Adopted as-is for new environments. Adapted for existing clients based on their assessment score. The same vocabulary applied consistently, so any engineer on the team can walk into any client and know where things stand without renegotiating first principles.

01 The Golden Signals

Four signals are sufficient to characterise the health of any service.

The mandatory baseline. If a team can only instrument one thing, it should be these four, for every user-facing service. CPU and memory are implementation details; the Golden Signals measure what users actually experience.

[LAT]

Latency

Time to serve a request, measured for successful and failed paths separately. Histograms (p50, p90, p95, p99, p999). Never averages.

Pitfall. Averaging hides tail latency, and mixing paths hides failures: failed requests returning in 1ms drag the distribution down and make an unsegmented histogram look healthy.

[TRF]

Traffic

Demand on the system: requests per second, messages consumed, active sessions. Counters, rate over time.

Pitfall. A traffic drop is not improved performance. It's often an upstream failure that didn't page.
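
One way to encode the pitfall as an alert condition, sketched here with illustrative numbers and a hypothetical `traffic_drop_alert` helper: page on a demand drop relative to a recent baseline, not only on error spikes.

```python
def traffic_drop_alert(current_rps: float, baseline_rps: float,
                       threshold: float = 0.5) -> bool:
    """Fire when current demand falls below a fraction of the baseline.

    A silent upstream failure often shows up as missing traffic, not errors.
    """
    return current_rps < baseline_rps * threshold

print(traffic_drop_alert(120.0, 1000.0))  # upstream may have failed silently
print(traffic_drop_alert(900.0, 1000.0))  # normal fluctuation
```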

[ERR]

Errors

Rate of failed requests: explicit (5xx), implicit (wrong content), or by policy (SLO breach). Separate 4xx from 5xx.

Pitfall. A 200 OK with an error payload is invisible without application-level instrumentation. Add it.
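
A sketch of what "add it" can look like, assuming nothing about any particular client's stack: the function and label names (`record_request`, `is_logical_error`) are illustrative, and a real setup would emit to a metrics backend rather than a `Counter`.

```python
from collections import Counter

metrics = Counter()

def is_logical_error(body: dict) -> bool:
    # An HTTP 200 whose payload signals failure is still an error.
    return body.get("status") == "error"

def record_request(http_code: int, body: dict) -> None:
    if http_code >= 500:
        metrics["errors_explicit"] += 1   # 5xx: server failure
    elif 400 <= http_code < 500:
        metrics["errors_client"] += 1     # 4xx: tracked separately from 5xx
    elif is_logical_error(body):
        metrics["errors_implicit"] += 1   # wrong content behind a 200 OK
    else:
        metrics["success"] += 1

record_request(200, {"status": "ok"})
record_request(200, {"status": "error", "reason": "empty cart"})
record_request(503, {})
```

Without the `is_logical_error` branch, the second request above counts as a success.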

[SAT]

Saturation

How full the service is: CPU, memory, disk, thread pool, queue depth, connection pool. Utilisation %.

Pitfall. Not all saturation is equal. A queue at 90% is more urgent than CPU at 90%. Know your bottleneck.
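
The "know your bottleneck" point, sketched with illustrative resource names and capacities: report utilisation per resource and surface the most saturated one, rather than watching a single CPU number.

```python
# Illustrative capacities and current usage for one service.
capacity = {"cpu_cores": 8, "worker_threads": 200, "queue_slots": 1000}
in_use   = {"cpu_cores": 6.4, "worker_threads": 150, "queue_slots": 900}

utilisation = {r: in_use[r] / capacity[r] for r in capacity}
bottleneck = max(utilisation, key=utilisation.get)

print({r: f"{u:.0%}" for r, u in utilisation.items()})
print(f"bottleneck: {bottleneck}")  # the queue, not the CPU
```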

02 SLI Selection

Choose indicators that are measurable today, meaningful to users, and actionable when they breach.

An SLI you cannot act on is a dashboard metric, not an SLI. Every selected indicator must trace to a user-facing outcome. We bring a starter catalogue (availability, latency, throughput, saturation, business-quality) and adapt it per client.

On cardinality: keep label sets small and consistent. High-cardinality labels (user_id, request_id, IP) on metrics create cost and performance problems in every backend. Reserve that data for traces and structured logs.
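
One way to enforce that rule at the instrumentation boundary, as a sketch: an allow-list of bounded labels, rejecting anything unbounded before it reaches the metrics backend. `ALLOWED_LABELS` and `validate_labels` are illustrative names, not any particular client library's API.

```python
# Bounded labels only: each takes a small, known set of values.
ALLOWED_LABELS = {"service", "endpoint", "method", "status_class"}

def validate_labels(labels: dict) -> dict:
    """Reject unknown (and therefore potentially unbounded) metric labels."""
    unknown = set(labels) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"high-cardinality or unknown labels: {sorted(unknown)}")
    return labels

validate_labels({"service": "checkout", "status_class": "5xx"})  # fine
# validate_labels({"user_id": "u-123"})  # raises: belongs in a trace, not a metric
```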

03 SLO Definition Process

An SLO is a commitment, not a dashboard.

Defining an SLO requires agreement between engineering, product, and the business. It's not a threshold set by whoever happened to write the alert. Four steps, one worksheet per service per SLO.

Step 01

Identify the user journey.

What is the user trying to accomplish? "Complete a purchase within ten seconds." "Receive a notification within sixty." Specific, observable, owned.

Step 02

Select the SLI.

Which standard indicator best represents whether that journey is succeeding? Error rate and latency are the most common starting points.

Step 03

Set the target and window.

What percentage of the time must the SLI be satisfied? Over what window? Start conservatively. You can always tighten.

Step 04

Calculate the budget. Agree the policy.

What happens when the budget is consumed? This must be agreed with the business before the SLO is active. A policy in engineering's head will not survive product pressure on the first call.
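
The arithmetic behind Step 04 is small enough to show in full. A minimal sketch, assuming a 30-day window: the budget is simply the fraction of the window the target leaves uncovered.

```python
def error_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given target and window."""
    window_minutes = window_days * 24 * 60          # 43,200 for 30 days
    return (100.0 - target_pct) / 100.0 * window_minutes

print(f"99.9%  over 30d -> {error_budget_minutes(99.9):.1f} min")
print(f"99.95% over 30d -> {error_budget_minutes(99.95):.1f} min")
```

A 99.9% target over 30 days leaves 43.2 minutes; 99.95% leaves 21.6, matching the tier table below.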

04 Tiered SLO Targets

Not every service warrants the same target.

Applying a 99.95% availability SLO to an internal admin portal creates toil and alert noise for no return. Tier assignment is agreed with the client's engineering and product leadership.

Tier | Availability | Latency p99 | Error rate | 30d budget | Examples
Tier 0 · Mission Critical | 99.95% | < 300ms | < 0.05% | 21.6 min | Payments, auth, core API
Tier 1 · Business Critical | 99.9% | < 500ms | < 0.1% | 43.2 min | Catalogue, checkout, primary dashboards
Tier 2 · Standard | 99.5% | < 1s | < 0.5% | 3.6 hr | Search, recommendations, secondary APIs
Tier 3 · Internal | 99.0% | < 2s | < 1% | 7.2 hr | Admin portals, internal reporting

Rule: never set the SLO equal to the SLA. The internal target must be tighter than the external commitment, so the team knows about risk before the contract breaches.

05 Error Budget Policy

A policy that decides what happens before reliability runs out.

Every SLO must have a written, signed-off policy defining behaviour at each budget state. The standard policy below is our starting point. It adapts per client, but it always exists in writing before the SLO goes live.

State | Threshold | Posture | Required action | Owner
Healthy | > 50% remaining | Normal | Business as usual. Feature velocity unrestricted. | Engineering
Caution | 25–50% | Monitor | Increase alert sensitivity. Reliability risks reviewed in sprint planning. No new tech debt. | Eng Lead
Warning | 10–25% | Slow Down | Freeze non-critical feature work. Prioritise reliability. Notify product of risk. | Eng Lead + Product
Critical | < 10% | Reliability Focus | All capacity to reliability. Production change freeze. Daily SLO review. | VP Engineering
Exhausted | 0% (breached) | Incident | Treat as P1. Incident Commander engaged. Customer comms assessed. Formal PIR. | IC + Leadership
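
The state table above is mechanical enough to encode directly. A sketch, with the thresholds mirroring the table (boundary values are assigned to the more cautious state; `budget_state` is an illustrative name):

```python
def budget_state(remaining_pct: float) -> tuple[str, str]:
    """Map remaining error budget (%) to (state, posture) per the policy."""
    if remaining_pct <= 0:
        return ("Exhausted", "Incident")
    if remaining_pct < 10:
        return ("Critical", "Reliability Focus")
    if remaining_pct < 25:
        return ("Warning", "Slow Down")
    if remaining_pct <= 50:
        return ("Caution", "Monitor")
    return ("Healthy", "Normal")

print(budget_state(53))  # business as usual
print(budget_state(8))   # change freeze, daily SLO review
```

Encoding the policy as code also makes it testable: the thresholds live in one place instead of in an engineer's head.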

Burn-Rate Alerts

Alert on how fast the budget is burning, not on the instantaneous error rate.

Threshold alerts fire late and produce noise. A brief 2% spike that resolves in ten minutes is largely harmless; a sustained 1% over thirty days burns through a 99.9% SLO. Burn-rate alerts measure the rate of consumption, and page only when it's fast enough to exhaust the budget before the team can respond. We use two-window, two-burn-rate by default.
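
The core calculation, as a sketch for a 99.9% SLO (the window sizes and error ratios are illustrative; a production version reads them from the metrics backend): burn rate is the observed error ratio divided by the budgeted one, and the short window gates the page so it stops firing once the spike resolves.

```python
SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET        # 0.1% allowed error ratio

def burn_rate(error_ratio: float) -> float:
    """How many times faster than budgeted the error budget is burning."""
    return error_ratio / BUDGET

def should_page(err_1h: float, err_5m: float) -> bool:
    # Two windows, one burn rate: the 1h window gives significance,
    # the 5m window stops the page once the condition has recovered.
    return burn_rate(err_1h) > 14.4 and burn_rate(err_5m) > 14.4

print(burn_rate(0.02))           # 2% errors -> roughly 20x burn
print(should_page(0.02, 0.02))   # sustained: page
print(should_page(0.02, 0.0))    # recovered: no page
```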

Budget consumption · 30-day window (illustrative): Healthy at 12% burned · Caution at 47% · Warning at 78% · Critical at 95% · Exhausted at 112%.

Fast burn · P1

Burn rate > 14.4× over 1hr

Pages on-call. Incident Commander engaged. Burns roughly 2% of the 30-day budget per hour and exhausts it in about two days if unresolved. Declares incident if not resolved in fifteen minutes.

Slow burn · P2

Burn rate > 6× over 6hr

Alerts the incidents channel. Owner assigned within thirty minutes. Exhausts the 30-day budget in under five days. Early warning, not a page.

Alert hygiene · non-negotiable

Every alert in production must have a severity, a linked runbook, a named team owner, and a clear escalation path. Anything missing one of the four is disabled or downgraded to informational until it has them. We audit this on engagement start.
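
The audit itself can be a one-screen script. A sketch, with illustrative field names (`runbook_url`, `owner_team`, `escalation_path` stand in for whatever the client's alerting config actually calls them):

```python
# The four non-negotiable fields for any production alert.
REQUIRED = ("severity", "runbook_url", "owner_team", "escalation_path")

def audit(alert: dict) -> dict:
    """Downgrade any alert missing a required field to informational."""
    missing = [f for f in REQUIRED if not alert.get(f)]
    if missing:
        return {**alert, "severity": "informational", "audit_missing": missing}
    return alert

complete = audit({"severity": "page", "runbook_url": "runbooks/payments-5xx",
                  "owner_team": "payments", "escalation_path": "oncall -> lead"})
incomplete = audit({"severity": "page", "owner_team": "payments"})

print(complete["severity"])    # unchanged
print(incomplete["severity"])  # downgraded until the gaps are filled
```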

Methodology · applied

The methodology lands fastest when paired with the assessment.

The methodology is open and free to read. The assessment is what tells us which parts apply to you, in what order, and on what timeline.