Method · 12 weeks · 4 phases

Diagnose. Define. Instrument. Operate.

A repeatable engineering protocol — not a deck. We treat observability like the production system it actually is: governed, versioned, and tested under load before handover.

01 Operating principles

Four constraints we apply to every engagement.

Most observability disasters trace back to the same handful of architectural shortcuts. We refuse them.

filter_alt

Signal before tooling.

Every probe must answer a question someone is paid to ask. We delete dashboards more often than we add them.

schema

Topology before metrics.

If you don't know your service graph, no SLO will save you. We map the boundary before we instrument the box.

savings

Cardinality is a budget.

Six-figure observability bills are a design failure, not a vendor problem. We treat $/series as a first-class metric.

engineering

Operate, don't just deploy.

The handover only counts if your team can run the system through a real P1 — so we run two before we leave.

02 Protocol

The four phases of an engagement.

Linear, fixed-fee, with named deliverables at each gate. You can fire us at the end of any phase.

01
Week 1

Diagnose

We baseline your telemetry hierarchy, ingest pipeline, alert load, and SLO discipline. No tools deployed yet — we're listening for the structural problem under the symptoms.

description Maturity profile
description Cardinality audit
description Alert noise heatmap
description Stakeholder interview log
02
Week 2–3

Define

We rewrite the data hierarchy to mirror your business logic. Services map to user journeys; metrics map to outcomes. SLOs are defined against revenue-bearing flows, not infra vanity.

description Telemetry schema (v1)
description SLO catalogue
description Trace coverage plan
description Cost target sheet
03
Week 4–8

Instrument

We deploy OpenTelemetry where it scales, eBPF where it matters, and turn off what's not earning its keep. Your engineers pair on every change — there is no black-box rollout.

description OTel collector config
description eBPF probe set
description Dashboard inventory
description On-call runbooks
04
Week 9–12

Operate

We run two real incidents with your team. We tune. We hand over the keys. Your team owns the system — we stay on retainer for the questions that come up at 3am.

description Incident retros (×2)
description 90-day roadmap
description Capacity review
description Handover doc set
03 Boundaries

What we won't do.

The cleanest way to be useful is to be honest about scope. These are the things you should hire someone else for.

  • block Long-term staff augmentation — we leave after handover.
  • block Vendor-led implementations where the tool is pre-decided.
  • block Compliance theatre — checkbox observability for audit only.
  • block Greenfield platform builds with no production traffic to learn from.
Engagement.start()

Want to see the protocol on your stack?

A 30-minute call. We'll tell you whether you need us — half the time the answer is no.