Centralised observability or squad ownership? You probably want both.

The two extremes are well-trodden. Centralised ownership freezes the squads; full squad ownership leaves every team inventing its own version of the Golden Signals. The model that works at scale is the boring one in the middle, and it's harder to land than either extreme.

Tracefox · 6 min read

Every observability conversation at scale eventually becomes an organisational one. Who owns the dashboards. Who writes the SLOs. Who decides when to alert. The question gets framed as a binary: either the central platform team owns it all, or the squads own their own observability. Both framings are wrong, and both fail in predictable ways.

Why "central platform owns it" fails

The pattern: a platform team is given a remit to "deliver observability across the organisation". They build dashboards for the squads. They define the SLOs. They configure the alerts. The squads receive the output and treat it as someone else's product.

Within twelve months: the squads can't change anything without filing a ticket. The platform team has become a queue. The dashboards reflect what the platform team thinks the squads need, not what the squads use. SLOs drift from actual user-journey importance because the platform team isn't in the product conversations. Alerts page squads about things the squads have no agency to fix.

The squads stop trusting the signal. Centralised observability is bureaucratically tidy and operationally inert.

Why "squads own everything" fails

The opposite pattern: the platform team is dissolved or never created. Each squad owns its own observability stack, picks its own backend, and instruments however it wants. This is local autonomy, the ideal that platform engineering thinkpieces tend to push.

Within twelve months: every team has reinvented Golden Signals differently. There are seven competing conventions for log structure. Nobody agrees on what "service" means as a label. Cross-team incidents take an hour to triage because the data shapes don't align. Cost per signal balloons because nobody is consolidating volume across the org. The new joiner who moves between two teams has to learn a new observability stack on day one of each rotation.

Squad-only observability is locally autonomous and globally incoherent.

What works: platform-as-a-service

The model that holds up at scale is the one neither extreme advocates for: the platform team builds the rails; the squads own the signals. This is the standard "platform engineering" pattern, applied to observability specifically.

The platform team's actual job:

  • The OTel Collector deployment: one config, one upgrade path, one place to debug ingest issues.
  • Naming and tagging conventions, written down, enforced via lint rules in the dashboard repo.
  • Recording rules for common SLI calculations (error ratios, latency quantiles), so squads don't recompute them.
  • A starter dashboard template per service that any squad can fork and adapt.
  • The instrumentation library (Golden Signals helper, structured-log helper) wrapped over the OTel SDK so squads don't reimplement.
  • Cost governance: per-team visibility, per-signal retention defaults, the conversation about high-cardinality labels.
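To make "conventions enforced via lint rules" concrete, here is a minimal sketch of what such a check might look like. The required label set (service, team, env) and the snake_case rule are assumptions chosen for illustration, not a convention this article prescribes; a real platform team would substitute its own written-down rules.

```python
import re

# Hypothetical convention: every metric carries these labels, in snake_case.
REQUIRED_LABELS = {"service", "team", "env"}
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")


def lint_labels(metric_name: str, labels: set[str]) -> list[str]:
    """Return human-readable convention violations (empty list if clean)."""
    problems = []
    # Report any required label the metric definition does not declare.
    for missing in sorted(REQUIRED_LABELS - labels):
        problems.append(f"{metric_name}: missing required label '{missing}'")
    # Report any declared label that breaks the naming rule.
    for label in sorted(labels):
        if not SNAKE_CASE.match(label):
            problems.append(f"{metric_name}: label '{label}' is not snake_case")
    return problems
```

Run in CI against the dashboard or metrics repo, a check like this fails the build with a readable message instead of relying on a reviewer to spot a stray label.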

The squad's actual job:

  • Owning their service's signals: what gets emitted, with what labels, at what rate.
  • Defining their SLOs in conversation with their product partner. Tier-appropriate. Signed off.
  • Owning their own runbooks, alerts, and on-call rotation.
  • Adapting the starter dashboard template to the questions their service raises.
  • Reviewing their SLO against actual user behaviour monthly, not letting it drift.
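The monthly SLO review can be anchored on one number: how much of the error budget the service has left. The sketch below assumes a simple request-based SLI (good requests over total requests in the window); the function name and the event counts are hypothetical, and a squad with a latency or freshness SLO would swap in its own definition of "good".

```python
def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left in the window (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic in the window, so no budget was spent
    allowed_bad = (1.0 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good                 # failures actually observed
    if allowed_bad == 0:
        # A 100% target tolerates nothing: any failure overspends the budget.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

For example, 500 failed requests out of 1,000,000 against a 99.9% target leaves roughly half the budget. A value that sits near 1.0 every month suggests the target is too loose for the journey it protects; a value that is routinely negative suggests the target, or the service, needs a conversation with the product partner.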

The interface between them is a paved road: the platform team makes the right thing easy; the squad picks up the platform's defaults and only deviates where their service genuinely demands it.
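One way to picture the paved road in code is a set of platform defaults that a squad overrides field by field. This is a sketch under stated assumptions: the TelemetryConfig fields and their default values are invented for illustration and do not correspond to any particular backend's settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TelemetryConfig:
    # Hypothetical platform defaults; real values are a platform-team decision.
    retention_days: int = 30
    sample_rate: float = 0.1
    latency_buckets_ms: tuple[float, ...] = (50, 100, 250, 500, 1000)


PLATFORM_DEFAULTS = TelemetryConfig()


def squad_config(**overrides) -> TelemetryConfig:
    """Start from the platform defaults and change only the named fields."""
    # dataclasses.replace raises TypeError for a field that does not exist,
    # so a squad cannot silently invent a setting the platform doesn't support.
    return replace(PLATFORM_DEFAULTS, **overrides)


# A squad with a low-traffic, high-value service keeps every trace
# and inherits everything else unchanged.
payments = squad_config(sample_rate=1.0)
```

The design choice is that deviation is explicit and small: the diff between a squad's config and the defaults is exactly the list of places where its service genuinely demanded something different.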

The platform team's job is to make the squad's observability work cheaper. Not to do it for them. Centralisation that doesn't serve squad velocity is an anti-pattern.

What this looks like at week 1, month 6, year 2

Week 1. The platform team's deliverable is the OTel Collector and a single template repository: dashboards as code, an alert pattern, a recording-rules starter. Squads pick it up and instrument their highest-priority service against it.

Month 6. Most production services emit Golden Signals via the helper library. Three or four squads have deviated locally where needed (bespoke dashboards, custom recording rules), but the labels are consistent. The platform team has shifted from building rails to refining them based on squad feedback.

Year 2. The platform team is small, has a tight remit, and runs the OTel pipeline plus the cost governance function. SLO definitions live with squads. Cross-service incidents are triaged in minutes because the data shapes align. New services onboard against the template and reach production telemetry coverage in days, not sprints.

The org-design implication

The model has a specific organisational form: a small platform team (often 4–8 engineers in a mid-sized org) with a service-team mandate, not a project-team mandate. They build and maintain the rails as long-running products. They have a roadmap, a backlog, and customer feedback channels, where the customers are the internal squads.

If your "platform team" is currently a queue of tickets from squads with no roadmap of their own, the model has collapsed back into Centralised mode. If your platform team doesn't exist and every squad does its own thing, you're paying the Squad-only tax. Both are common; both are fixable; both are leadership decisions, not engineering ones.

The right answer is in the middle, and the middle is harder to land than either extreme. The methodology we bring into engagements assumes the platform-as-a-service model and provides the rails the platform team would otherwise have to invent. Most clients keep the rails. The point of the engagement is to leave them in place.

Engagement.start()

Tracefox engagements with platform-shaped clients design the interface explicitly: what the platform team owns, what the squads own, what the paved road looks like. The output is a model both sides can defend in writing.