Framework · v4.2 · open

The observability maturity framework.

Four tiers, twelve axes. We use this in every paid Diagnostic and we keep the rubric public so you can self-score before you call us.

01 Tiers

Four states a system can be in.

Tier · T0 Ad-hoc

Telemetry exists because someone enabled an agent. Nobody can describe the data hierarchy. Alerts are mostly noise.

  • Top-10 dashboards are 18+ months old
  • Bill grew 40%+ year on year
  • Most alerts are silenced or auto-resolved
  • MTTR is unpredictable

Tier · T1 Reactive

Pages get answered, dashboards exist, but the architecture is whatever the previous SRE left behind. Cardinality grows with revenue.

  • SLOs exist on infra metrics
  • Cardinality control by exception, not policy
  • Trace coverage spotty across service boundaries
  • Vendor lock-in growing as a risk-register line item

Tier · T2 Operational

The data hierarchy is intentional. SLOs reflect customer journeys. Cardinality has a budget. Cost-per-signal is a tracked metric.

  • Journey-keyed SLOs in place
  • Cardinality budget enforced at collector
  • OTel-portable trace surface
  • Quarterly observability reviews

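The cost-per-signal tracking T2 calls for can start as small as the sketch below. A minimal sketch, assuming a flat per-series rate and monthly granularity; the rate and the signal names are placeholders, not any vendor's actual pricing.

```python
# Hypothetical sketch: cost-per-signal as a tracked metric.
# RATE_PER_ACTIVE_SERIES is a placeholder, not real vendor pricing.
RATE_PER_ACTIVE_SERIES = 0.05  # assumed USD per active series per month

def monthly_cost_per_signal(active_series: dict[str, int]) -> dict[str, float]:
    """USD per month for each signal, given its active series count."""
    return {signal: count * RATE_PER_ACTIVE_SERIES
            for signal, count in active_series.items()}

print(monthly_cost_per_signal({"checkout_latency": 12_000, "pod_cpu": 480_000}))
# {'checkout_latency': 600.0, 'pod_cpu': 24000.0}
```

The number itself matters less than the habit: a regression in cost-per-signal shows up in review the same way a latency regression would.
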
Tier · T3 Proactive

Telemetry is product. Engineers consult it before shipping. The team's instinct is to delete dashboards rather than add them.

  • Pre-deploy SLO impact reviews
  • Self-service runbook + alert library
  • Cardinality regressions caught in CI (sketched below)
  • Vendor-portable, multi-vendor by choice

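A minimal sketch of that CI gate, assuming a committed JSON budget per signal and a build step that exports observed series counts; the file names, format, and 10% headroom are our assumptions, not a standard.

```python
# Hypothetical CI gate: fail the build when any signal exceeds its
# committed series budget. File names, format, and the headroom
# factor are assumptions.
import json
import sys

HEADROOM = 1.10  # assumed: up to 10% over budget still passes

def check(budget_path: str, observed_path: str) -> int:
    with open(budget_path) as f:
        budget = json.load(f)        # {"signal": budgeted_series}
    with open(observed_path) as f:
        observed = json.load(f)      # {"signal": series_in_this_build}
    failures = [
        f"{signal}: {seen} series vs budget {budget[signal]}"
        for signal, seen in observed.items()
        if signal in budget and seen > budget[signal] * HEADROOM
    ]
    for line in failures:
        print("cardinality regression:", line, file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check("cardinality_budget.json", "observed_series.json"))
```
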
02 Axes

The twelve we score against.

Each axis is rated 0–4, and your tier is set by your lowest-scoring axis: observability is bottlenecked by the weakest signal, not averaged across the group. A scoring sketch follows the list.

01 Telemetry hierarchy · Are signals organised by business meaning or by tool default?
02 Trace coverage · Does context propagate across the service contracts you actually care about?
03 SLO discipline · Are SLOs keyed to customer outcomes, with a real burn-rate policy?
04 Cardinality control · Is $/active-series tracked? Is there a budget?
05 Alert quality · What share of pages are actionable in the first 5 minutes?
06 On-call ergonomics · Can a fresh on-caller resolve a tier-1 incident solo by week 2?
07 Retro depth · Do retros end in code changes or just Confluence pages?
08 Runbook freshness · Have your tier-0 runbooks been edited this quarter?
09 Vendor portability · How long would a vendor migration take?
10 Compliance posture · Can your telemetry stack survive an audit without scrambling?
11 Telemetry literacy · Can engineers outside SRE write a useful query?
12 Operational ownership · Who is paid to make tier decisions when budgets collide?
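
The scoring sketch promised above: a minimal version of the worst-of rule, assuming Python, snake_case axis keys of our own, and a mapping where a perfect 4 still caps at T3 (scores run 0–4 but tiers stop at T3). None of this is a published schema.

```python
# Minimal sketch of the worst-of rule. The axis keys and the
# score-to-tier mapping (a 4 still caps at T3) are assumptions.

AXES = [
    "telemetry_hierarchy", "trace_coverage", "slo_discipline",
    "cardinality_control", "alert_quality", "oncall_ergonomics",
    "retro_depth", "runbook_freshness", "vendor_portability",
    "compliance_posture", "telemetry_literacy", "operational_ownership",
]

def tier(scores: dict[str, int]) -> str:
    """Tier is the weakest axis, never the average."""
    missing = set(AXES) - scores.keys()
    if missing:
        raise ValueError(f"unscored axes: {sorted(missing)}")
    worst = min(scores[axis] for axis in AXES)
    return f"T{min(worst, 3)}"  # scores run 0-4; tiers stop at T3
```

A team scoring 4 on eleven axes and 1 on cardinality control is a T1 shop: one weak axis sets the tier.
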
Engagement.start()

Want this scored properly on your stack?

The Diagnostic engagement does exactly this, with read access to your telemetry. Two weeks, USD 18k, and you leave with a roadmap regardless of what we find.