Field notes

The service nobody owns.

The auth-token-validator. The legacy notification gateway. The user-preferences API. Every estate has them. Three teams call into them, and none of them says "mine". They run on a host that hasn't been patched in eighteen months because the patch script needs an owner-of-record. Then one Tuesday they break.

· Ken Tan · 6 min read

A pattern I see in almost every estate audit, regardless of company size: there are services that nobody owns.

They have names. They have repos. They appear in the architecture diagram, usually drawn as a small box near the centre. Three or four other services call into them. The traffic is real. The dependency is real. But when you ask "whose team is this?" the room goes quiet, and then someone offers "I think Priya's team had it originally, but she's in platform now," and someone else says "didn't we move that to the integrations squad?", and the answer that emerges is that nobody, currently, owns it.

The day that service breaks is one of the most expensive incidents the business will run. Not because the service is hard to fix, but because nobody on the bridge has the authority to fix it, and nobody has the context to know what fixing it means.

How services get orphaned

Nobody sets out to create an unowned service. Orphaning is a slow organisational process that usually goes:

  • A team builds the service to solve a specific need. They own it properly for a year.
  • The team is reorganised. Half the team moves to a new platform initiative. The remaining half inherits a wider remit and quietly stops loving the original service.
  • A second team starts using the service. Then a third. The original builders are no longer the only consumers, but they're still the only oncall.
  • Someone proposes moving ownership to the platform team. The platform team says "not without funding." The conversation never quite resolves. The service falls off the platform team's roadmap and the original team's runbook.
  • A senior engineer who knew the service leaves. Nobody backfills the knowledge. The wiki page hasn't been edited in eight months.

By the time you're auditing the estate, the service has been in this state for a year or more. It hasn't broken yet, which is the only reason no one has noticed. The first incident is also the discovery that there's no owner.

What it looks like during the incident

The bridge call has eight people. Three are on-call engineers from the teams that depend on the service. Two are platform engineers who got pulled in because someone said "platform owns shared services." One is the incident commander. Two are senior engineers who joined because they might know something.

Every question gets bounced. "Who owns deploys here?" Silence. "Whose runbook is this?" Someone shares a Confluence link that 404s. "Can we roll back?" Nobody knows the rollback process. "Who has prod access?" Three people raise their hands; none of them are sure they should be running prod commands on something they don't own.

The MTTR on this incident will be three to five times what it would be on a properly owned service. The postmortem will be polite and won't name the underlying issue, because naming the underlying issue means naming a leadership-level decision that hasn't been made.

What ownership actually means

"Owned" is one of those words that sounds firm but is usually under-defined. The bar I use, and the one that holds up in incidents, has four columns:

  1. A team name, not a person. People leave; teams have a process for backfilling.
  2. An on-call rotation that includes this service. Not "we'll find someone if it breaks." A rotation, with names, with a schedule.
  3. A runbook that has been touched in the last quarter. If nobody has updated the runbook in six months, the service is effectively abandoned regardless of what the wiki says.
  4. A roadmap entry, even if the entry is "no planned work." Services without a line on someone's roadmap rot. The line itself is a forcing function for the owning team to think about it once a quarter.

A service that fails any one of those four is provisionally orphaned. A service that fails two or more is the one that's going to be on the next bridge call.
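The four-column bar and the one-fail/two-fail rule can be sketched as a small script. This is an illustration, not a real Tracefox tool; the `Service` record and its field names are hypothetical stand-ins for one row of the audit sheet.

```python
from dataclasses import dataclass

# Hypothetical record for one row of the audit sheet.
# Field names are illustrative, not a real schema.
@dataclass
class Service:
    name: str
    has_owning_team: bool     # a team name, not a person
    in_oncall_rotation: bool  # appears in a real rotation, with a schedule
    runbook_fresh: bool       # runbook touched in the last quarter
    on_a_roadmap: bool        # a roadmap line, even if it is "no planned work"

    def failed_columns(self) -> int:
        checks = (self.has_owning_team, self.in_oncall_rotation,
                  self.runbook_fresh, self.on_a_roadmap)
        return sum(1 for ok in checks if not ok)

    def verdict(self) -> str:
        failures = self.failed_columns()
        if failures == 0:
            return "owned"
        if failures == 1:
            return "provisionally orphaned"
        return "next bridge call"  # fails two or more columns

# Example: stale runbook, no roadmap line.
svc = Service("auth-token-validator", True, True, False, False)
print(svc.name, "->", svc.verdict())  # -> next bridge call
```

The useful property is that the verdict is mechanical: once the four booleans are filled in honestly, there is no room to argue a service into "owned".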

The audit that finds them

The fastest way I've found to surface orphan services is a one-page sheet. Every production service gets a row. The columns are the four above. The exercise takes a few hours if the team is honest, two days if there's an organisational reluctance to admit gaps. Either timeline is short compared to the next incident.

The output is almost always uncomfortable. There will be services nobody can name an owner for. There will be services where the named owner says "we haven't looked at that in a year." There will be one or two services that everybody assumed someone else was watching, and nobody was. That output is the actual deliverable. Once it exists in writing, in front of a director who can assign owners, the re-homing happens fast. Without the writing, the conversation stays informal and the orphans stay orphaned.

Re-homing without theatre

When the audit identifies an orphan, the temptation is to dump it on the platform team. Don't, unless platform is genuinely the right home. The pattern that works:

  • Map the service to the team that depends on it most. The team with the largest call volume is usually the right new owner. Their incentives are aligned with keeping it healthy.
  • Give them a runway, not just a transfer. Two or three sprints to land a runbook, an on-call inclusion, and at least one round of housekeeping. Without a runway, the transfer is ceremonial.
  • Document the decision in writing. A one-paragraph ownership memo with the service name, the new owner, and the date. This is the thing that will be looked up the next time someone tries to orphan it again.
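One way to keep the memo consistent is to generate it from the three facts that matter. A minimal sketch, assuming nothing beyond the fields named above; the function name, the runway default, and the wording are all hypothetical.

```python
from datetime import date

# Hypothetical one-paragraph ownership memo generator.
# Fields are the ones the decision needs: service, new owner, date.
def ownership_memo(service: str, new_owner: str, decided_on: date,
                   runway_sprints: int = 3) -> str:
    return (
        f"As of {decided_on.isoformat()}, {service} is owned by {new_owner}. "
        f"The team has a {runway_sprints}-sprint runway to land a runbook, "
        f"on-call inclusion, and one round of housekeeping. "
        f"Future ownership changes require a memo superseding this one."
    )

print(ownership_memo("auth-token-validator", "identity-squad", date(2024, 3, 5)))
```

The point of generating it is the same as the point of writing it: the memo exists in one findable shape, so the next attempt to orphan the service runs into a dated decision rather than a hallway recollection.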

The line worth holding

Every estate has orphans. The question is whether you find yours in a workshop or on a bridge call. The workshop costs a half day and surfaces them in writing. The bridge call costs the rest of the quarter, and surfaces them in front of a customer. Pick the cheaper discovery mechanism.

Engagement.start()

Service-ownership audit, run as a half-day workshop. Most estates surface two to three orphans they didn't know they had.

The Tracefox tiering and ownership workshop walks every production service against a single sheet — name, owner team, on-call rotation, runbook link, last deploy. Services that fail any of the four ownership columns after the name get flagged. The output is a tier list with names against every line. Most teams come out of the workshop with at least one service that turns out to be Tier 0 with no owner.