Field notes

The dependency you didn't know you had.

The architecture diagram has fifteen boxes and twenty arrows. The real production system has a hundred and forty boxes and a thousand arrows, and the difference is the dependencies nobody drew. NTP. Internal DNS. The package mirror. The CA. The certificate the build pipeline trusts. They aren't on the diagram. They will be the next outage.

Ken Tan

Every team has an architecture diagram. The diagram is wrong. Not in the sense that it's poorly drawn — usually it's been thoughtfully maintained — but in the sense that it shows the dependencies the team is aware of, which is a strict subset of the dependencies they actually have.

The dependencies that aren't on the diagram are the ones that take you down. They aren't malicious omissions; they're a category of infrastructure so foundational that nobody thinks of it as a dependency. Until it fails. At which point the diagram becomes irrelevant and the team is debugging in territory the runbook doesn't cover.

The dependencies most teams have and don't realise

From the engagements I've worked on, the omissions cluster into a set of usual suspects. Most of these are present in every estate I audit, and few are documented:

  • NTP and time synchronisation. Almost everything breaks subtly when clocks drift. Certificates start failing. Distributed locks misbehave. Audit logs become useless. Most teams haven't checked their NTP source in three years.
  • Internal DNS. The AWS Route 53 resolver, the private hosted zone, the VPC's DNS forwarding. When this layer stutters, every service-to-service call fails in confusing ways. Few teams have alerts on DNS resolution latency.
  • Certificate authorities. The internal CA that signs your service-to-service mTLS. The public CA your customers trust. The build-pipeline CA that signs your container images. Each of these has a renewal cadence; few teams know what it is.
  • Package and artefact mirrors. The npm registry, the Docker Hub mirror, the internal Artifactory. Builds depend on these silently. When the mirror is slow or the auth token expires, builds fail in ways nobody has a runbook for.
  • Identity providers. The corporate SSO, the vendor SSO, the customer SSO. Each of these is a single point of failure for whatever it gates. Outages here cascade in directions the diagram doesn't show.
  • Email and notification gateways. The system that delivers password resets, alert notifications, and transactional emails. Every team has one. Few have alerts on it. When it breaks, it usually breaks silently.
  • Configuration stores. SSM parameter store, Consul, Vault, etcd. Every running service reads from one of these on startup and often during operation. Their availability is assumed.
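Of the layers above, DNS is the easiest to probe synthetically from inside the estate. A minimal sketch, assuming only the Python standard library; the name list and the 200 ms threshold are placeholders to be replaced with real internal names and a threshold tuned to your resolver:

```python
import socket
import time

# Names to probe; "localhost" is a safe default, the rest are placeholders
# for your own internal and external names.
NAMES = ["localhost"]
THRESHOLD_MS = 200.0  # assumed alert threshold, not a recommendation

def resolve_latency_ms(name: str) -> float:
    """Time a single getaddrinfo call, in milliseconds."""
    start = time.monotonic()
    socket.getaddrinfo(name, None)
    return (time.monotonic() - start) * 1000.0

def check(names=NAMES, threshold_ms=THRESHOLD_MS):
    """Resolve each name once; return (name, latency_ms, breached) tuples."""
    results = []
    for name in names:
        latency = resolve_latency_ms(name)
        results.append((name, latency, latency > threshold_ms))
    return results

if __name__ == "__main__":
    for name, latency, breached in check():
        print(f"{name}: {latency:.1f} ms [{'ALERT' if breached else 'ok'}]")
```

Run it from a cron job or a synthetic-monitoring agent every minute; the point is that the measurement happens from where your services actually resolve names, not from a vendor's probe network.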

Each of these has been the primary cause of an incident at one or more clients I've worked with in the last two years.

Why these omissions happen

The architecture diagram answers the question "how does our system work?" The dependencies above answer the question "what does our system assume?" Those are different questions, and most diagrams are only built to answer the first.

The omitted dependencies have three properties in common:

  • They were chosen by someone in a different team. Networking picked the DNS provider. Security picked the CA. Platform picked the package mirror. The application teams consume them but don't think of them as part of "their" architecture.
  • They've been reliable for years. The reliability has trained the team to ignore them. The diagram leaves them off because they've never been a problem.
  • Their failure modes are subtle. When DNS breaks, the symptoms are confusing application errors, not a clean "DNS down" signal. The team chases the application layer, because that's what the diagram and the alerts point at.

How to find yours

The exercise I run with clients is a structured "what does this service assume" walkthrough. For each tier-zero service in the estate, the team writes down every external thing it expects to work. Not just the services in the diagram. Every external thing. Time, DNS, certificates, package mirrors, identity, secrets, artefact stores, network paths, the list goes on.

The first list is uncomfortable. It runs to thirty or forty items per service. Most teams have never written it down. The discomfort is the deliverable. Once the list exists, the team has named every dependency they have, and can ask the next question: "do we have a contingency for each?"

For most items, the answer is no. The contingency design happens over the following quarter, in increments. The point isn't to have a hot failover for every dependency — that would be absurdly expensive — but to have, for each, a written answer to "what do we do if this is broken for an hour?" The written answer is the artefact that turns a hidden dependency into a managed one.
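The walkthrough artefact doesn't need tooling, but it helps to keep it in a structure you can query. A minimal sketch of one service's list, assuming nothing beyond the standard library; the dependency names, owners, and contingency text are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dependency:
    name: str                   # the external thing the service assumes works
    owner: str                  # the team that operates it
    contingency: Optional[str]  # written answer to "broken for an hour?"

# An illustrative fragment of one service's inventory; all names hypothetical.
inventory = [
    Dependency("NTP (pool.internal)", "platform", None),
    Dependency("internal DNS resolver", "networking", None),
    Dependency("build-pipeline CA", "security",
               "fall back to last signed image; page security on-call"),
]

def unmanaged(deps):
    """Dependencies with no written contingency -- the next quarter's work."""
    return [d.name for d in deps if d.contingency is None]
```

The `unmanaged` list is the backlog: each entry moves off it not when a failover exists, but when the one-hour question has a written answer.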

The cheap interventions

Some hidden-dependency interventions are expensive. Many are not. The ones I see ship most often within the engagement window:

  1. Alerts on DNS resolution latency from inside the estate. A simple synthetic check that resolves a handful of internal and external names every minute. Catches DNS issues before they become application incidents.
  2. A certificate-expiry dashboard. Every cert in the estate, with days-to-expiry and owner. Sorted by least time remaining. The list is read in a weekly platform meeting. At the teams that run this meeting, no certificate has expired in production in twelve months.
  3. NTP drift alerts on every host. Off-the-shelf, cheap, almost never deployed. The first time it fires you'll learn something about your time infrastructure you didn't know.
  4. A monthly "dependency review" calendar item. Thirty minutes, platform team, a single agenda: walk the hidden-dependency list, ask if anything has changed. The meeting is boring. The boredom is the goal.
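The certificate-expiry dashboard in particular is a sorting exercise, not a product. A minimal sketch of the core of it; the certificate names, owners, and dates are hypothetical sample rows standing in for whatever your inventory export produces:

```python
from datetime import datetime

# Illustrative inventory rows: (cert name, owner, expiry). Hypothetical data;
# in practice these rows come from your cert inventory or a TLS scan.
CERTS = [
    ("api.example.internal", "platform", datetime(2025, 9, 1)),
    ("mtls-service-ca", "security", datetime(2025, 3, 15)),
    ("build-signing", "platform", datetime(2026, 1, 10)),
]

def dashboard(certs, now):
    """Every cert with days-to-expiry, sorted by least time remaining."""
    rows = []
    for name, owner, expiry in certs:
        days_left = (expiry - now).days
        rows.append((days_left, name, owner))
    return sorted(rows)

if __name__ == "__main__":
    for days_left, name, owner in dashboard(CERTS, datetime.now()):
        print(f"{days_left:5d} days  {name}  ({owner})")
```

The weekly meeting reads the top of the sorted list; everything else about the intervention is process, not code.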

The line worth holding

The dependencies on your architecture diagram are the ones you've chosen to manage. The dependencies not on the diagram are the ones managing you. The exercise of listing them isn't glamorous and won't ship a feature, but it's one of the few investments that pays off in the form of incidents that didn't happen, and those are the cheapest incidents of all.

Engagement.start()

The architecture diagram is a story your team tells itself. The real dependency graph is the one that takes you down.

A Tracefox dependency-mapping engagement walks the production estate against the canonical diagram and lists every dependency the diagram doesn't show. NTP, DNS, certificate authorities, package mirrors, identity providers, internal artefact stores — the boring layers that are invisible until they fail. The deliverable is a real dependency graph, with owners and contingencies for each.