
The backlog is two years long. Start with one service this Friday.

When the observability problem is the size of the whole estate, the team stops shipping. The forty-page roadmap is a symptom of paralysis, not progress. The way out is one service, one SLO, one dashboard, one runbook, one alert retired — by end of week. Then do it again next week, on a different service.


The Slack message lands on a Tuesday. "We need an observability strategy by end of quarter." The engineer it lands on owns an everything-shaped problem: 180 services, four AWS accounts, a Datadog bill that nobody can fully explain, 47 alerts that fired overnight last week, and a Confluence page from 2024 titled "Observability Roadmap (DRAFT)" that has been a draft for two years.

They open a new doc. They write a heading. "Observability Strategy — Q4." They stare at it for ten minutes. They close the laptop. Two weeks later the doc is still one heading.

This is not a productivity problem. It's a sizing problem. The work in front of them is genuinely two years of work, and they know it. The instinct to plan it all before shipping any of it is what kills the quarter. Then the next quarter. Then the year.

Why the big plan keeps failing

The forty-page roadmap looks like progress. It isn't. It is the artefact a team produces when the real work feels too big to start. Every section of the roadmap is a different team's dependency, a different vendor decision, a different naming convention to bikeshed. The roadmap is finished only when every cross-cutting concern has been resolved, which is to say: never.

Meanwhile the on-call rotation is still firing on CPU thresholds nobody trusts. The Tier-0 service still doesn't have an SLO. The runbook for the checkout API is still the eight-line bullet list someone pasted in during the last incident. The plan is getting longer. The reality is not getting better.

A two-year roadmap that ships nothing in the first quarter is not a two-year roadmap. It's an excuse, written down.

The honest premise

You are not going to fix observability across the whole estate this quarter. You are not going to fix it next quarter either. Stop trying to write the plan that does. Write the plan that fixes one service, end-to-end, by Friday. Then write the same plan again next Friday for a different service. After eight Fridays you have a pattern, eight services covered, and a team that has shipped instead of planned.

Almost everyone agrees this is the right shape when it's described abstractly. Almost nobody does it, because the second the work starts, the platform-shaped questions show up. What about the naming convention? What about the vendor decision? What about the squads who'll have to follow the same pattern? Park them. None of them block Friday.

What ships by Friday

On the one service you've picked, the smallest end-to-end loop has five things in it. None of them are individually difficult. The discipline is in stopping after them.

  1. One SLO. One indicator, one target, one window. Not five. Pick the indicator the customer feels most directly: latency on the main read path, or error rate on the main write path. Set a target slightly looser than what the service currently achieves, not an aspirational one, so there is real error budget from day one. The starter SLI catalogue has the shortlist.
  2. One dashboard. Five panels max. The four golden signals plus the SLO burn-rate. Linked from the runbook. Bookmarked by the on-call. If a panel doesn't earn its place at 02:47, delete it.
  3. One burn-rate alert. Wired to the SLO. Two-window, 14.4× and 6× (the arithmetic behind those multipliers is sketched just after this list). Pages a real human. Replaces nothing yet; it runs alongside whatever's already there.
  4. One runbook. Action-first, above-the-fold, five sections. Linked from the alert. Owned by the team that owns the service. The template is a copy-paste.
  5. One alert retired. The single noisiest legacy alert on this service. The CPU threshold that nobody trusts. The "5xx > 100" rule that fires every deploy. Delete it. Not silence — delete. If the SLO alert catches what matters, the legacy one was never doing useful work.

That's the week. It's small enough that an experienced engineer can finish it inside three working days, and small enough that a team that's never done it before can finish it in two weeks. Either way, something real ships. The roadmap doc gets one line crossed off, not added.

Picking the service

The picking is where teams stall, so make it mechanical. Sort the services by how many pages they generated last month. Take the top one that's also customer-facing. That's it. Don't run a workshop. Don't convene the architecture review. The service that woke the team up most is the service that has the most to gain from being instrumented properly. It is also, almost always, the service that the on-call most wants to fix.

The runner-up criterion, if the noisiest service is genuinely mis-tiered (worth reading the field note on that), is the service the revenue number most directly depends on. Checkout. Search. The login path. Whichever one is still on the page if you remove everything else.

What you do not do this week

The single biggest predictor of whether the first service ships is whether the team can keep the scope to that service. The temptations that derail it are predictable.

  • You do not pick a vendor this week. The SLO and the burn-rate alert can be implemented in whatever you already have. Vendor selection is a year-one decision and it can wait until you have three services' worth of pattern.
  • You do not write the org-wide naming convention this week. You name the SLO after the service. The convention emerges after eight services, retroactively, in an afternoon.
  • You do not run a tagging audit this week. You instrument what the one service needs. Cardinality discipline matters; it doesn't matter enough to block Friday.
  • You do not present this to leadership this week. You ship it. Then you present the working example. The order matters. A demo on a real service is worth thirty pages of strategy doc.

The 90-day arc

Friday-by-Friday, the shape becomes:

  • Week 1. Service one. Five-thing checklist. Ship it.
  • Week 2. Service two. Same checklist. Notice what was painful in week one and only fix the thing that recurs in week two.
  • Weeks 3–6. Services three, four, five, six. As the pattern hardens the pace creeps toward two services a week, which is how the count reaches eight by week 8. The retired-alerts list becomes the most-shared Slack message of the month.
  • Week 8. Stop and write the convention. By now you have eight working examples; the convention writes itself in an afternoon and survives contact with reality, because it's reverse-engineered from things that work.
  • Weeks 9–12. Hand the pattern to the squads. They run it on their own services with the standard you've already proven. Platform team moves to the next layer (the cost story, the central collector, the SLO governance) — which is now a real platform conversation, not a strategic-looking stalling tactic.

Twelve weeks in, you have somewhere between fifteen and thirty services covered, a runbook coverage number that's actually true, an alert volume that's measurably lower, and a leadership conversation that is about expanding what's working rather than defending what isn't. The two-year roadmap is no longer the document the team is judged against. The queue of services is.

The conversation with leadership

The hardest part of this whole approach is not technical. It's explaining to a director or VP, who asked for the strategy, that you're not going to write them a strategy. You're going to ship them a working example, then ten more, then the strategy will be obvious. Some leaders accept this immediately. Some don't, and ask again for the forty-page document.

The line that has worked, in real conversations: "We can hand you a plan in six weeks, or a working tier-0 service in two weeks and a pattern that scales. The plan won't have shipped anything by the time the second one has fixed eight services. Pick one." When framed that way, most leaders pick the second. The ones who don't are optimising for something other than reliability, and that's a different conversation, worth having explicitly.

The line worth holding

Observability is not a programme. It's a habit applied one service at a time. The teams that treat it as a programme write roadmaps. The teams that treat it as a habit ship Fridays. After a year, the habit teams have transformed their on-call rotation. The programme teams are on version four of the roadmap.

Pick a service. Pick this week. Ship the five things. Tell us how it went; we read every reply.

Engagement.start()

A working SLO, dashboard, and runbook on a single service is a fortnight's work. The reason it doesn't happen is that the plan keeps trying to cover everything.

The Tracefox starter engagement runs the first service end-to-end with the team in two weeks: one tier-0 surface, one SLO, one burn-rate alert, one runbook, one retirement of the noisiest legacy alert. The pattern that comes out is what gets copied to the next service, and the one after that. The roadmap stops being a document and becomes a queue.