The methodology, fully open.
Tools, guides, and field notes — everything we'd want a client to read before the first call. Nothing gated, no email required.
Run them yourself.
The Tracefox methodology.
Golden Signals, SLI selection, SLO definition, tiered targets, error-budget policy, burn-rate alerting. The standard library applied across every Tracefox engagement.
Maturity self-assessment.
Fourteen evidence-based questions across the seven dimensions of the formal Tracefox assessment. Free, no email.
SLO & error-budget calculator.
Plug in monthly request volume and service tier. Get a defensible SLO target, the error budget it implies in minutes of downtime, and the burn-rate thresholds to alert on.
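A minimal sketch of the arithmetic behind the calculator, assuming a 30-day rolling window; the figures in the example are illustrative inputs, not recommended targets.

```python
def error_budget(slo_target, monthly_requests=None, window_days=30):
    """Turn an SLO target into the error budget it implies over a rolling window."""
    budget_fraction = 1.0 - slo_target  # e.g. a 99.9% target leaves 0.1% to spend
    budget = {
        "budget_fraction": budget_fraction,
        "budget_minutes": budget_fraction * window_days * 24 * 60,  # allowed 'bad' minutes
    }
    if monthly_requests is not None:
        budget["budget_requests"] = budget_fraction * monthly_requests  # allowed failed requests
    return budget

# A 99.9% target over 30 days allows ~43.2 bad minutes,
# or ~10,000 failed requests at 10M requests/month.
print(error_budget(0.999, monthly_requests=10_000_000))
```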
A working observability lab.
Free in-browser tools: cardinality budgeter, alert heatmap, OTel collector recipes. Some live, more shipping monthly.
The maturity framework, fully open.
Four tiers, twelve axes. The rubric we score against in every paid Diagnostic, published so you can self-score before you call us.
The standard library.
The Golden Signals: a practical primer.
Latency, Traffic, Errors, Saturation: the four signals that characterise the health of any service. What they mean, how to measure them, and the pitfalls that catch most teams.
Burn-rate alerting, properly.
Threshold alerts fire late. Burn-rate alerts fire when the budget is being consumed faster than the SLO allows. The two-window pattern, the math, and a working PromQL implementation.
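A sketch of the maths only (not the PromQL from the guide), assuming the commonly used fast-burn defaults of a 14.4x threshold over paired 1-hour and 5-minute windows:

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than budget-neutral the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)  # 1 - SLO is the error ratio the budget allows on average

def should_page(err_ratio_1h, err_ratio_5m, slo_target, threshold=14.4):
    """Two-window check: the long window proves the burn is sustained, the short
    window proves it is still happening, so the alert also clears quickly."""
    return (burn_rate(err_ratio_1h, slo_target) >= threshold and
            burn_rate(err_ratio_5m, slo_target) >= threshold)

# For a 99.9% SLO, a sustained 1.44% error ratio is a 14.4x burn:
# roughly 2% of a 30-day budget consumed in a single hour.
print(should_page(err_ratio_1h=0.0144, err_ratio_5m=0.02, slo_target=0.999))  # True
```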
Error-budget policy that survives the first P1.
An SLO without a policy is just a dashboard. The five budget states, the decision-owner per state, the ship-freeze mechanics that work in practice, plus a copy-paste template.
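To show the shape such a policy takes (the state names, thresholds, and owners below are placeholders, not the five states from the template):

```python
# Placeholder states: illustrative thresholds and owners, not the Tracefox template.
POLICY = [
    # (minimum budget remaining, state, decision owner, what is allowed)
    (0.50, "healthy",   "squad lead",   "ship normally"),
    (0.25, "watch",     "squad lead",   "ship, but pull reliability work forward"),
    (0.10, "at risk",   "eng manager",  "risky changes need explicit sign-off"),
    (0.00, "exhausted", "eng director", "feature freeze; reliability work only"),
    (None, "overspent", "eng director", "freeze holds until the budget recovers"),
]

def budget_state(remaining):
    """Map remaining error budget (1.0 = untouched, negative = overspent) to a policy row."""
    for floor, state, owner, action in POLICY:
        if floor is None or remaining >= floor:
            return state, owner, action

print(budget_state(0.18))  # ('at risk', 'eng manager', 'risky changes need explicit sign-off')
```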
OTel Collector vs vendor agents.
Where each fits, the lock-in cost of getting it wrong, and the migration path away from vendor-only instrumentation. The pragmatic recommendation.
A starter SLI catalogue.
Availability, latency, throughput, saturation, quality. The indicators we deploy first on every engagement, organised by Golden Signal, with cardinality discipline and recording rules.
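A hedged sketch of how such a catalogue can be laid out; the definitions below are illustrative, not the catalogue the guide ships:

```python
# Illustrative starter SLIs, keyed by indicator and mapped to the nearest Golden Signal.
STARTER_SLIS = {
    "availability": ("Errors",     "good responses / all responses, excluding caller-caused 4xx"),
    "latency":      ("Latency",    "share of requests served under a per-tier threshold"),
    "throughput":   ("Traffic",    "requests per second, pre-aggregated so trends stay cheap to query"),
    "saturation":   ("Saturation", "utilisation of the scarcest resource: queue depth, pool, disk"),
    "quality":      ("Errors",     "responses that were degraded rather than failed, e.g. stale cache"),
}

for name, (signal, definition) in STARTER_SLIS.items():
    print(f"{signal:<11} {name:<13} {definition}")
```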
Stop applying 99.95% to everything.
One-size-fits-all SLOs cause toil. The four-tier model, the SLO-vs-SLA rule that's easy to get wrong, and the tier-assignment workshop that gets sign-off.
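A quick illustration of why the tiers matter; the ladder below is a placeholder, not the workshop's recommendation, and the closing comment states the commonly cited SLO-vs-SLA rule rather than quoting the guide:

```python
# Placeholder tier ladder -- illustrative targets only.
TIERS = {"tier 0": 0.9995, "tier 1": 0.999, "tier 2": 0.995, "tier 3": 0.99}

for tier, slo in TIERS.items():
    budget_minutes = (1 - slo) * 30 * 24 * 60  # allowed bad minutes per 30-day window
    print(f"{tier}: SLO {slo:.2%} -> {budget_minutes:.1f} min of error budget a month")

# tier 0: 21.6, tier 1: 43.2, tier 2: 216.0, tier 3: 432.0 -- a blanket 99.95%
# forces tier-3 services onto a tier-0 budget. The related rule that is easy to
# get wrong: the internal SLO should be stricter than any contractual SLA, so
# the budget runs out before the penalty clause kicks in.
```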
What we keep finding.
eBPF probing at scale: 500k TPS.
Architectural constraints and practical optimisations for keeping the latency added by observability sub-millisecond on a high-throughput trading workload.
Field notes · 14 min read
Error budgeting that survives Monday.
Most error-budget policies don't outlive their first quarter. The fix isn't a better dashboard — it's a budget your engineering org is allowed to spend.
Field guide · 8 min read
The status page that lags the incident by 40 minutes.
The team's first instinct is to investigate, not to communicate. The status page updates after the team has a working theory.
Field notes · 6 min read
The cargo-culted SLO target.
The team set 99.95% on every service because that's what Google's blog said. The reliability effort is distributed evenly. The business impact is not.
Field notes · 6 min read
The vendor demo that solved the wrong problem.
The slick demo runs on a curated dataset. Six months later the bill is six figures higher and the same incidents take the same length of time to resolve.
Field notes · 6 min read
The handover that didn't survive contact with reality.
A new team takes over. The wiki has eighteen months of stale facts. The first incident under new ownership is the moment they discover what the documentation was actually worth.
Field notes · 6 min read
The dependency you didn't know you had.
NTP. Internal DNS. The package mirror. The CA. Dependencies that aren't on the architecture diagram are the ones that take you down for an afternoon.
Field notes · 6 min read
The 'temporary' workaround that's now load-bearing.
The cron job from 2019. The shell script on the bastion. The hardcoded IP added during an incident. They were never meant to last. They're now infrastructure.
Field notes · 6 min read
The dashboard that aged into uselessness.
Dashboards rot. Metrics get renamed. Services get retired. The panel still shows 'OK' because Grafana treats missing data as healthy.
Field notes · 6 min read
Your error budget exists. It just isn't being used.
The team has SLOs. They have an error-budget calculation. The number is on a dashboard. Nobody changed any plans because of it.
Field notes · 6 min read
The retro action items nobody did.
Pull the last twenty postmortems. Count the action items. Count the ones that shipped. The ratio is almost always 15–30%.
Field notes · 6 min read
The synthetic check that lies to you.
Every status board has a green tick from a synthetic check that hits the same curated path every minute. The check passes during real outages.
Field notes · 6 min read
The customer told us before our monitoring did.
The most damning sentence in any postmortem. The system was hurting users for nine minutes before anyone internal noticed.
Field notes · 6 min read
The on-call who can't get into prod at 02:00.
The page fires. SSO is failing. The VPN has crashed. Twenty minutes pass before they can run a single command. The system was up. The responder was not.
Field notes · 6 min read
The escalation path that ends in 'just DM Raj'.
On paper, you have a tiered rotation. In practice, the L1 messages the same senior engineer they always do. The rotation is a fiction.
Field notes · 6 min read
The cost spike that turned out to be a logging loop.
FinOps flagged a 40% jump in observability spend. By Friday we'd traced it to a single service retrying 200 times a second and logging a 4KB stack trace each time.
Field notes · 6 min read
The service nobody owns.
In every microservices estate we've audited, there are two or three services nobody has put their name to. The day they break, the bridge call has eight people and zero answers.
Field notes · 6 min read
The dashboard nobody opens during the incident.
The team built it in a calm afternoon, populated it with everything, and pinned it on the wiki. Then the page went off and nobody opened it.
Field notes · 6 min read
Hitting the landing page is not triage.
A CPU alert fires at 02:14. The L1 responder opens the homepage, sees fast paint times, and marks it no-impact. Six hours later the support inbox tells a different story.
Field notes · 6 min read
The tier 0 service that wasn't.
A fintech client had five Tier 0 services. Two of them were. Over-tiering is the more common mistake, and the more expensive one.
Field notes · 5 min read
The backlog is two years long. Start with one service this Friday.
When the observability problem is the size of the whole estate, the team stops shipping. The forty-page roadmap is a symptom of paralysis, not progress.
Opinion · 6 min read
The leading indicator you're not watching.
Most incidents are preceded by 5–15 minutes of degradation that nobody alerts on. The signal is already in the data. The cardinality budget and alerting strategy to surface it usually aren't.
Opinion · 6 min read
High CPU is not an incident.
Resource utilisation is a diagnostic signal, not a paging signal. The teams burning out their on-call rotations on CPU thresholds are paying for the wrong instinct.
Opinion · 5 min read
CloudWatch was never going to be enough.
The cloud-native tools were built to monitor the cloud's resources, not your application. The realisation usually arrives the third or fourth time an incident outlasts what the dashboard can explain.
Opinion · 6 min read
Observability is on the wrong line item.
The CFO already pays for it. It just doesn't show up as a tooling cost; it shows up as incident hours, customer credits, and over-provisioning.
Opinion · 6 min read
Centralised observability or squad ownership? You probably want both.
Central ownership leaves squads blocked on a central team; full squad ownership creates inconsistency. The model that works is observability run as an internal platform service, and it's harder to land than either extreme.
Opinion · 6 min read
What the runbook should actually look like.
Five sections, action-first, written for the engineer at 02:47 rather than the engineer at 14:30. A copy-paste template, plus the sections most existing runbooks should delete.
Reference · 6 min read
The first ten minutes of a P1 are about the runbook, not the engineer.
When the page goes off at 02:47, the engineer who answers it isn't the variable you can control. The runbook is.
Opinion · 5 min read
You don't need an observability platform. You need definitions.
A team came to us with a US$500k/year observability budget, three vendors, and engineers who couldn't answer 'why is checkout slow today?' in under an hour.
Opinion · 6 min read
Your alert hygiene is a leadership problem.
In an alert audit, the engineering team usually apologises. They shouldn't. Bad alerts aren't an engineering problem; they're a leadership symptom.
Opinion · 6 min read
Stop calling it observability if you don't have traces.
If you think you have an observability platform and you don't have traces, what you have is well-funded monitoring.
Opinion · 5 min read
AI SRE without good telemetry is theatre.
Every vendor at the conferences is shipping an AI agent for incident response. Almost none of them are talking about the data those agents are reading.
Opinion · 5 min read
Want this experience applied to your stack?
The library is the methodology, fully open. The Diagnostic engagement is what tells you which parts apply to you, in what order, and on what timeline.