Field notes

The handover that didn't survive contact with reality.

A new team takes over the estate. The wiki has eighteen months of stale facts. The architecture diagram shows three services that have been retired and omits two that were added. The first incident under new ownership is the moment the new team discovers what the documentation actually was: an artefact written for one moment in 2024 and never updated.

Ken Tan · 6 min read

There's a specific moment in every transition I've watched when the new team realises the documentation they inherited is fiction. It usually happens in week two or three, during the first real investigation. They open the wiki, read the architecture page, and find that a service it describes was retired eighteen months ago. They check the runbook, and the database it tells them to query no longer exists.

The new team is now in a difficult position. They've been brought in to operate something. The documentation describing what they're supposed to operate is wrong. The previous team has left, moved on, or is too busy to answer questions. The runbook they're meant to follow at 02:14 is a story about a system that isn't there.

Why handover documentation rots

The handover document is almost always written at the wrong moment. It's written in one of three ways:

  • Once, by the team that built the system, in the year they built it. The document describes the original design. Eighteen months of small changes have accumulated since, and none of them are reflected.
  • Once, by the outgoing team, the week before they leave. The document is hurried. It describes the system at the level of detail one person can write in a week. It catches the headlines and misses the texture.
  • Continuously, by everyone, with no canonical source. Confluence has thirty pages, written at different times, by different people, contradicting each other. There's no version that's authoritative. The new team has to choose which page to believe and they usually choose wrong.

The document is presented to the new team as a complete artefact. It is treated as canonical. It is not.

The discoveries the audit surfaces

The 90-day audit I run when a new team takes over an estate is, in my experience, the most productive piece of work in the first quarter. The categories of discovery are remarkably consistent:

  1. Services that exist and aren't documented. Two or three per estate. Usually small services that a previous engineer set up to solve a specific problem and never wrote up.
  2. Documented services that don't exist. Usually services that were retired or absorbed into something else, but whose pages were never updated. The wiki still describes them in the present tense.
  3. Dependencies the documentation doesn't show. The architecture diagram has fifteen arrows. The traffic logs show forty distinct service-to-service call patterns. The missing twenty-five are the silent dependencies that will surprise the new team eventually.
  4. Configuration that drifts from version control. The repo says one thing. Production runs another. The drift happened over months, in small manual edits during incidents, and never got reconciled.
  5. Runbooks that reference systems that no longer exist. The page says "check the dashboard at this URL." The URL 404s. The dashboard was migrated to a different tool eighteen months ago.
  6. Permissions and access that nobody can re-grant. The previous team had access via someone's personal IAM role. That role's owner has left. The access still works because the role still exists, but nobody knows how to extend it to the new team.

Each of these is a small discovery individually. Cumulatively they're the difference between "we operate this estate" and "we pretend to operate this estate while the senior engineer who knows what's going on quietly answers our questions."
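
The gap between the diagram's arrows and the call patterns in the traffic logs (discovery 3) is, at its core, a set difference. A minimal sketch, assuming both sources have already been reduced to (caller, callee) pairs; the service names and the find_dependency_gaps helper are hypothetical, and parsing real diagrams or logs is out of scope here:

```python
from typing import NamedTuple

Edge = tuple[str, str]  # (caller, callee)


class DependencyGaps(NamedTuple):
    undocumented: set[Edge]  # seen in traffic, missing from the diagram
    unobserved: set[Edge]    # drawn on the diagram, never seen in traffic


def find_dependency_gaps(documented: set[Edge], observed: set[Edge]) -> DependencyGaps:
    """Compare the diagram's arrows with the call patterns in the logs."""
    return DependencyGaps(
        undocumented=observed - documented,
        unobserved=documented - observed,
    )


# Hypothetical sample data, for illustration only.
documented = {("web", "orders"), ("orders", "billing"), ("web", "reports")}
observed = {("web", "orders"), ("orders", "billing"), ("orders", "legacy-mailer")}

gaps = find_dependency_gaps(documented, observed)
for caller, callee in sorted(gaps.undocumented):
    print(f"silent dependency: {caller} -> {callee}")
for caller, callee in sorted(gaps.unobserved):
    print(f"possibly retired: {caller} -> {callee}")
```

An arrow that never shows up in traffic is only a candidate for retirement: the log window may simply be too short to have caught a monthly job.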

The audit, structured

The 90-day handover audit I run has three phases. Each is bounded and produces a written deliverable.

  • Weeks 1–4: Inventory. List everything. Every service, every host, every database, every queue, every dependency. Cross-reference against the documentation. Mark every line as "matches docs," "missing from docs," or "in docs but not present." The output is a single sheet that becomes the new canonical source.
  • Weeks 5–8: Verification. For every service, run the runbook. Confirm the steps work. Confirm the dashboards open. Confirm the alerts fire. The runbooks that fail get flagged for rewrite. The dashboards that 404 get rebuilt or removed. The alerts that don't fire get fixed or retired.
  • Weeks 9–12: Establishment. Adopt a documentation hygiene practice that prevents the next round of rot. Docs-as-code. Docs review as part of every PR. Quarterly canonical review. The cadence is what stops the same problem recurring two years from now.
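
The three-way marking from the inventory phase can be sketched as follows. This assumes the running estate and the documentation have each been reduced to a set of names; classify_inventory and the sample names are hypothetical. "Matches docs" here only means the name appears in both places, not that the page's details are correct:

```python
def classify_inventory(running: set[str], documented: set[str]) -> dict[str, str]:
    """Label every known item with one of the three inventory statuses."""
    statuses: dict[str, str] = {}
    for name in sorted(running | documented):
        if name in running and name in documented:
            statuses[name] = "matches docs"
        elif name in running:
            statuses[name] = "missing from docs"
        else:
            statuses[name] = "in docs but not present"
    return statuses


# Hypothetical sample data, for illustration only.
running = {"orders", "billing", "cron-fixer"}
documented = {"orders", "billing", "legacy-reports"}

sheet = classify_inventory(running, documented)
for name, status in sheet.items():
    print(f"{name:16} {status}")
```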

Twelve weeks is not a long time. It's the budget that closes the handover gap before the inevitable first major incident. Teams that skip it usually run into the discoveries the audit would have surfaced, but during a P1 instead of during a calm afternoon.
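
Part of the verification phase can be automated cheaply: pull every URL out of a runbook and see whether it still answers. A rough sketch using only the Python standard library; the example hostnames are made up, and dashboards behind single sign-on will typically return a 401 or 403 rather than a 404, so a failing status is a prompt to look, not a verdict:

```python
import re
import urllib.error
import urllib.request
from typing import Callable

URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def extract_urls(runbook_text: str) -> list[str]:
    """Return the distinct URLs in a runbook, in order of first appearance."""
    seen: dict[str, None] = {}
    for match in URL_PATTERN.finditer(runbook_text):
        # Drop sentence punctuation that trails a URL in prose.
        seen.setdefault(match.group(0).rstrip(".,;"), None)
    return list(seen)


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a URL, or 0 if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def find_dead_links(
    runbook_text: str, fetch_status: Callable[[str], int] = http_status
) -> list[tuple[str, int]]:
    """List (url, status) for every runbook URL that does not return 2xx."""
    dead = []
    for url in extract_urls(runbook_text):
        status = fetch_status(url)
        if not 200 <= status < 300:
            dead.append((url, status))
    return dead


# Hypothetical runbook and canned responses, so the example needs no network.
runbook = """
1. Check the dashboard at https://dash.example.internal/orders.
2. Search the logs at https://logs.example.internal/search
"""
canned = {
    "https://dash.example.internal/orders": 404,
    "https://logs.example.internal/search": 200,
}
dead_links = find_dead_links(runbook, fetch_status=lambda url: canned[url])
print(dead_links)
```

Passing the fetcher in as a parameter keeps the check testable offline; in real use the default http_status makes the actual requests.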

What good documentation hygiene looks like

The teams whose documentation stays current share a few practices. None require new tools.

  • The architecture diagram lives in the same repository as the code. Updates to the system require updates to the diagram. The PR review catches drift.
  • Runbooks are tested. Once a quarter, an engineer who didn't write the runbook follows it end-to-end. The steps that fail get rewritten. The runbook that nobody can follow doesn't survive the test.
  • The canonical source is named and singular. One place. Not Confluence and a wiki and a Notion and a shared Google Drive. One. Everything else is explicitly marked "non-canonical."
  • Stale pages are deleted, not archived. A page that's no longer current is misleading. Removing it is better than letting it rot in place.
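
The first practice can be backed by a small check in the PR pipeline. A minimal sketch, assuming a repository layout that is purely illustrative (the services/ and infra/ directories and the docs/architecture.md path are hypothetical), and assuming the list of changed files comes from whatever the CI system provides:

```python
# Hypothetical repository layout; adjust to the real one.
SYSTEM_DIRS = ("services/", "infra/")
DIAGRAM_PATH = "docs/architecture.md"


def diagram_needs_review(changed_files: list[str]) -> bool:
    """True when a PR changes the system but leaves the diagram untouched."""
    touches_system = any(path.startswith(SYSTEM_DIRS) for path in changed_files)
    touches_diagram = DIAGRAM_PATH in changed_files
    return touches_system and not touches_diagram


if diagram_needs_review(["services/orders/api.py", "README.md"]):
    print(f"Reminder: does this change affect {DIAGRAM_PATH}?")
```

Not every code change alters the architecture, so this works better as a reminder for the reviewer than as a hard gate.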

The line worth holding

Inherited documentation is a hypothesis, not a description. The new team's first job is to verify the hypothesis against the running estate. The work is unglamorous and uncomfortable, and it's the only thing that prevents the first quarter under new ownership from being a series of expensive surprises. The 90-day window is short. Use it.

Engagement.start()

The first 90 days of taking over an estate is the only window where ignorance is institutionally acceptable. Use it.

A Tracefox 90-day handover audit walks the inherited documentation against the running estate. Every service, every dependency, every runbook gets a 'matches reality' check. The output is a corrected canonical source, an inventory of discoveries (orphans, workarounds, hidden dependencies), and a documentation hygiene plan that prevents the same drift from recurring.