Framework · v4.2 · open

The observability maturity framework.

Four tiers, twelve axes. We use this in every paid Diagnostic and we keep the rubric public so you can self-score before you call us.

01 Tiers

Four states a system can be in.

Tier · T0 Ad-hoc

Telemetry exists because someone enabled an agent. Nobody can describe the data hierarchy. Alerts are mostly noise.

  • Top-10 dashboards are 18+ months old
  • Bill grew 40%+ year on year
  • Most alerts are silenced or auto-resolved
  • MTTR is unpredictable

Tier · T1 Reactive

Pages get answered, dashboards exist, but the architecture is whatever the previous SRE left behind. Cardinality grows with revenue.

  • SLOs exist on infra metrics
  • Cardinality control by exception, not policy
  • Trace coverage spotty across service boundaries
  • Vendor lock-in growing as a risk-register line item

Tier · T2 Operational

The data hierarchy is intentional. SLOs reflect customer journeys. Cardinality has a budget. Cost-per-signal is a tracked metric.

  • Journey-keyed SLOs in place
  • Cardinality budget enforced at collector
  • OTel-portable trace surface
  • Quarterly observability reviews

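The cost-per-signal tracking T2 calls for can start as small as the sketch below. A minimal sketch, assuming a flat per-series rate and monthly granularity; the rate and the signal names are placeholders, not any vendor's actual pricing.

```python
# Hypothetical sketch: cost-per-signal as a tracked metric.
# RATE_PER_ACTIVE_SERIES is a placeholder, not real vendor pricing.
RATE_PER_ACTIVE_SERIES = 0.05  # assumed USD per active series per month

def monthly_cost_per_signal(active_series: dict[str, int]) -> dict[str, float]:
    """USD per month for each signal, given its active series count."""
    return {signal: count * RATE_PER_ACTIVE_SERIES
            for signal, count in active_series.items()}

print(monthly_cost_per_signal({"checkout_latency": 12_000, "pod_cpu": 480_000}))
# {'checkout_latency': 600.0, 'pod_cpu': 24000.0}
```

The number itself matters less than the habit: a regression in cost-per-signal shows up in review the same way a latency regression would.
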
Tier · T3 Proactive

Telemetry is product. Engineers consult it before shipping. The team's instinct is to delete dashboards rather than add them.

  • Pre-deploy SLO impact reviews
  • Self-service runbook + alert library
  • Cardinality regressions caught in CI (sketched below)
  • Vendor-portable, multi-vendor by choice

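A minimal sketch of that CI gate, assuming a committed JSON budget per signal and a build step that exports observed series counts; the file names, format, and 10% headroom are our assumptions, not a standard.

```python
# Hypothetical CI gate: fail the build when any signal exceeds its
# committed series budget. File names, format, and the headroom
# factor are assumptions.
import json
import sys

HEADROOM = 1.10  # assumed: up to 10% over budget still passes

def check(budget_path: str, observed_path: str) -> int:
    with open(budget_path) as f:
        budget = json.load(f)        # {"signal": budgeted_series}
    with open(observed_path) as f:
        observed = json.load(f)      # {"signal": series_in_this_build}
    failures = [
        f"{signal}: {seen} series vs budget {budget[signal]}"
        for signal, seen in observed.items()
        if signal in budget and seen > budget[signal] * HEADROOM
    ]
    for line in failures:
        print("cardinality regression:", line, file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check("cardinality_budget.json", "observed_series.json"))
```
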
02 Axes

The twelve we score against.

Each axis is rated 0–4, and your tier is set by your lowest-scoring axis: observability is bottlenecked by the weakest signal, not averaged across the group. A scoring sketch follows the list.

01 Telemetry hierarchy · Are signals organised by business meaning or by tool default?
02 Trace coverage · Does context propagate across the service contracts you actually care about?
03 SLO discipline · Are SLOs keyed to customer outcomes, with a real burn-rate policy?
04 Cardinality control · Is $/active-series tracked? Is there a budget?
05 Alert quality · What share of pages are actionable in the first 5 minutes?
06 On-call ergonomics · Can a fresh on-caller resolve a tier-1 incident solo by week 2?
07 Retro depth · Do retros end in code changes or just Confluence pages?
08 Runbook freshness · Have your tier-0 runbooks been edited this quarter?
09 Vendor portability · How long would a vendor migration take?
10 Compliance posture · Can your telemetry stack survive an audit without scrambling?
11 Telemetry literacy · Can engineers outside SRE write a useful query?
12 Operational ownership · Who is paid to make tier decisions when budgets collide?
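
The scoring sketch promised above: a minimal version of the worst-of rule, assuming Python, snake_case axis keys of our own, and a mapping where a perfect 4 still caps at T3 (scores run 0–4 but tiers stop at T3). None of this is a published schema.

```python
# Minimal sketch of the worst-of rule. The axis keys and the
# score-to-tier mapping (a 4 still caps at T3) are assumptions.

AXES = [
    "telemetry_hierarchy", "trace_coverage", "slo_discipline",
    "cardinality_control", "alert_quality", "oncall_ergonomics",
    "retro_depth", "runbook_freshness", "vendor_portability",
    "compliance_posture", "telemetry_literacy", "operational_ownership",
]

def tier(scores: dict[str, int]) -> str:
    """Tier is the weakest axis, never the average."""
    missing = set(AXES) - scores.keys()
    if missing:
        raise ValueError(f"unscored axes: {sorted(missing)}")
    worst = min(scores[axis] for axis in AXES)
    return f"T{min(worst, 3)}"  # scores run 0-4; tiers stop at T3
```

A team scoring 4 on eleven axes and 1 on cardinality control is a T1 shop: one weak axis sets the tier.
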
Engagement.start()

Want this scored properly on your stack?

The Diagnostic engagement does exactly this, with read access to your telemetry. Two weeks, USD 18k, and you leave with a roadmap regardless of what we find.