Library · 43 resources · open access

The methodology, fully open.

Tools, guides, and field notes — everything we'd want a client to read before the first call. Nothing gated, no email required.

Filter All · 43 Guides · 06 Essays · 32 Tools · 05

Section · 01 · Tools & frameworks

Run them yourself.

5 resources

Methodology

The Tracefox methodology.

Golden Signals, SLI selection, SLO definition, tiered targets, error-budget policy, burn-rate alerting. The standard library applied across every Tracefox engagement.

Web · open arrow_forward

Self-assessment

Maturity self-assessment.

Fourteen evidence-based questions across the seven dimensions of the formal Tracefox assessment. Free, no email.

Web · ~3 min arrow_forward

Calculator

SLO & error-budget calculator.

Plug in monthly volume and tier. Get a defensible SLO, an error budget in minutes, and burn-rate alert thresholds.

Web · runs in browser arrow_forward

Playground

A working observability lab.

Free in-browser tools: cardinality budgeter, alert heatmap, OTel collector recipes. Some live, more shipping monthly.

Web · multiple arrow_forward

Framework

The maturity framework, fully open.

Four tiers, twelve axes. The rubric we score against in every paid Diagnostic, published so you can self-score before you call us.

Web · v4.2 arrow_forward

Section · 02 · Guides

The standard library.

6 guides

Fundamentals 8 min read

The Golden Signals: a practical primer.

Latency, Traffic, Errors, Saturation: the four signals that characterise the health of any service. What they mean, how to measure them, and the pitfalls that catch most teams.

Read guide arrow_forward

Alerting 9 min read

Burn-rate alerting, properly.

Threshold alerts fire late. Burn-rate alerts fire when the budget is being consumed faster than the SLO allows. The two-window pattern, the math, and a working PromQL implementation.

Read guide arrow_forward

Governance 10 min read

Error-budget policy that survives the first P1.

An SLO without a policy is just a dashboard. The five budget states, the decision-owner per state, the ship-freeze mechanics that work in practice, plus a copy-paste template.

Read guide arrow_forward

Architecture 8 min read

OTel Collector vs vendor agents.

Where each fits, the lock-in cost of getting it wrong, and the migration path away from vendor-only instrumentation. The pragmatic recommendation.

Read guide arrow_forward

SLIs 7 min read

A starter SLI catalogue.

Availability, latency, throughput, saturation, quality. The indicators we deploy first on every engagement, organised by Golden Signal, with cardinality discipline and recording rules.

Read guide arrow_forward

SLO design 6 min read

Stop applying 99.95% to everything.

One-size-fits-all SLOs cause toil. The four-tier model, the SLO-vs-SLA rule that's easy to get wrong, and the tier-assignment workshop that gets sign-off.

Read guide arrow_forward

Section · 03 · Field notes & essays

What we keep finding.

32 pieces

Engagement.start()

Want this lived experience on your stack?

The library is the methodology, fully open. The Diagnostic engagement is what tells you which parts apply to you, in what order, and on what timeline.

Book Diagnostic Run self-assessment

The methodology, fully open.

Run them yourself.

The Tracefox methodology.

Maturity self-assessment.

SLO & error-budget calculator.

A working observability lab.

The maturity framework, fully open.

The standard library.

The Golden Signals: a practical primer.

Burn-rate alerting, properly.

Error-budget policy that survives the first P1.

OTel Collector vs vendor agents.

A starter SLI catalogue.

Stop applying 99.95% to everything.

What we keep finding.

eBPF probing at scale: 500k TPS throughput.

Error budgeting that survives Monday.

The status page that lags the incident by 40 minutes.

The cargo-culted SLO target.

The vendor demo that solved the wrong problem.

The handover that didn't survive contact with reality.

The dependency you didn't know you had.

The 'temporary' workaround that's now load-bearing.

The dashboard that aged into uselessness.

Your error budget exists. It just isn't being used.

The retro action items nobody did.

The synthetic check that lies to you.

The customer told us before our monitoring did.

The on-call who can't get into prod at 02:00.

The escalation path that ends in 'just DM Raj'.

The cost spike that turned out to be a logging loop.

The service nobody owns.

The dashboard nobody opens during the incident.

Hitting the landing page is not a triage.

The tier 0 service that wasn't.

The backlog is two years long. Start with one service this Friday.

The leading indicator you're not watching.

High CPU is not an incident.

CloudWatch was never going to be enough.

Observability is on the wrong line item.

Centralised observability or squad ownership? You probably want both.

What the runbook should actually look like.

The first ten minutes of a P1 are about the runbook, not the engineer.

You don't need an observability platform. You need definitions.

Your alert hygiene is a leadership problem.

Stop calling it observability if you don't have traces.

AI SRE without good telemetry is theatre.

Want this lived experience on your stack?