The 'temporary' workaround that's now load-bearing.
The cron job was a stop-gap during a 2019 outage. The shell script was a debugging aid the SRE used once and forgot to remove. The hardcoded IP was the right call at 03:47 on a Tuesday. None of them were meant to last. They're now in the critical path of three services, undocumented, owned by no one, and impossible to turn off without breaking something.
The most expensive sentence in any postmortem is "we put that in as a temporary fix three years ago."
The pattern is so consistent I now ask about it on the first day of every engagement. "What's running in your estate that you'd consider temporary?" The answer is always the same shape. A silence. A laugh. Then an inventory that goes:
- A cron job on a host nobody remembers provisioning.
- A shell script in someone's home directory that processes data nightly.
- A hardcoded IP address in three config files that points at a service that should have moved to DNS.
- A "test" Lambda that's been in production for two years.
- A CSV file in S3 that one of the services treats as a configuration source.
- An SSH tunnel from one VPC to another that someone set up "for the demo."
Each item was put in for a real reason. Each was meant to be removed once a longer-term fix was in place. None of them were. All of them are now critical, and the team only knows it because the original owner mentioned it once before they left.
Why workarounds become permanent
The dynamics that ossify a workaround are simple:
- The workaround solved an immediate problem. Once solved, the urgency to do the proper fix disappeared. The "proper fix" ticket got created, lived in the backlog for a sprint, and was deprioritised in favour of feature work. After a year, it was archived as stale.
- Other systems started depending on the workaround without knowing it was a workaround. The shell script was reliable for six months. A new service was built that relied on its output. The new service has no idea it's depending on a hack; the hack now looks like infrastructure.
- The original author left, and the institutional memory went with them. The next generation of engineers treats the workaround as a feature of the system, because that's how it appears to them. Removing it would feel destructive.
- Nothing has broken yet. The workaround has been running for three years without an incident. That's the strongest possible argument against touching it, regardless of how fragile it actually is.
The fourth dynamic is the most dangerous. The workaround's survival up to now is being read as evidence of robustness, when it's just evidence that nobody has stress-tested it. The first time it does break — usually because the host it lives on is decommissioned, or the script's interpreter is upgraded, or the hardcoded IP changes — the failure is sudden, undocumented, and expensive.
The audit that catches them
The fastest way I've found to surface the load-bearing workarounds in an estate is a structured first-90-days audit. Three checks, run against every host and every service.
- Crontab and systemd timer audit. Pull every scheduled task from every production host. For each, ask: who owns this, what does it do, and what breaks if it stops? If any of those has no answer, the task is effectively unowned and needs investigation. A minimal collection sketch follows this list.
- Bastion home-directory audit. Walk the bastions. Look for shell scripts in user home directories. The ones that haven't been touched in a year are the ones to worry about — they were written for a one-time job and nobody removed them, but the data they produce may have been wired into downstream systems.
- Configuration drift audit. Diff the running configuration against the configuration in version control. Hardcoded IPs, manual edits, environment-specific patches that never made it back to the repo — these are the seams where workarounds hide. A minimal drift-diff sketch closes this section.
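The first check is the easiest to mechanise. Here is a minimal sketch of the collection step, assuming SSH access from the audit machine, root on each host (needed to read other users' crontabs), and a hypothetical host inventory; none of this is standard tooling, just one way to get the raw list in one place:

```python
#!/usr/bin/env python3
"""Check 1: pull every cron entry and systemd timer from every production host."""
import subprocess

# Hypothetical inventory; in practice this comes from your CMDB or cloud API.
HOSTS = ["app-01.internal", "app-02.internal", "batch-01.internal"]


def run_remote(host: str, command: str) -> str:
    """Run a command over ssh and return stdout, or '' if it fails."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout if result.returncode == 0 else ""


def collect(host: str) -> list[tuple[str, str]]:
    """Return (source, entry) pairs for every cron line and timer on a host."""
    entries = []
    # Every user's crontab, with comments and blank lines stripped.
    crontabs = run_remote(
        host,
        "for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u $u 2>/dev/null; done",
    )
    entries += [("cron", line) for line in crontabs.splitlines()
                if line.strip() and not line.lstrip().startswith("#")]
    # All systemd timers, including inactive ones.
    timers = run_remote(host, "systemctl list-timers --all --no-pager --no-legend")
    entries += [("timer", line.strip()) for line in timers.splitlines() if line.strip()]
    return entries


if __name__ == "__main__":
    for host in HOSTS:
        for source, entry in collect(host):
            print(f"{host}\t{source}\t{entry}")
    # The output is only the raw inventory; the "who owns this, what breaks
    # if it stops" questions still have to be answered per entry by a person.
```

The script produces the inventory, not the answers. Its value is that nothing scheduled on any host can hide from the follow-up questions.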
The audit takes about a week. Most teams find more than they expect. The discovery isn't the deliverable — the deliverable is the per-item remediation plan.
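The drift check itself is just a diff, repeated per tracked file per host. A minimal sketch, assuming SSH access and a hypothetical configs/<host>/ layout for the committed copies; the host name and paths are placeholders:

```python
#!/usr/bin/env python3
"""Check 3: diff a running config against the copy in version control."""
import difflib
import pathlib
import subprocess

HOST = "app-01.internal"                              # hypothetical host
REMOTE_PATH = "/etc/myservice/app.conf"               # hypothetical config path
REPO_COPY = pathlib.Path("configs/app-01/app.conf")   # hypothetical repo layout


def fetch_running_config(host: str, path: str) -> list[str]:
    """Read the config as it exists on the host right now."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, f"cat {path}"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return result.stdout.splitlines(keepends=True)


if __name__ == "__main__":
    running = fetch_running_config(HOST, REMOTE_PATH)
    committed = REPO_COPY.read_text().splitlines(keepends=True)
    diff = list(difflib.unified_diff(
        committed, running,
        fromfile=f"repo:{REPO_COPY}", tofile=f"{HOST}:{REMOTE_PATH}",
    ))
    if diff:
        # Every hunk is drift: a manual edit, a hardcoded IP, a patch that
        # never made it back to the repo. Not necessarily a problem, but
        # always something that needs an owner and an explanation.
        print("".join(diff))
    else:
        print("No drift detected for this file.")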
The remediation pattern
Once the workaround is identified, it falls into one of three categories, and the remediation is different for each:
- The workaround should become a real feature. The need it serves is genuine. Promote it: give it a proper home in version control, an owner, tests, a runbook. The workaround stops being a workaround and becomes a service.
- The workaround can be retired. The original need has been solved another way, or no longer exists. The work is to confirm nothing depends on it, then remove it. The "confirm nothing depends on it" step is the dangerous one and shouldn't be rushed; a minimal dependency-search sketch follows this list.
- The workaround should be replaced. The need is genuine, but the implementation is the wrong shape. The work is to build the right shape, migrate the dependents, then remove the original. This is the most expensive of the three, and the one that gets stuck in backlogs longest.
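For category two, the static half of "confirm nothing depends on it" can at least be automated: search every repo in the estate for references to the workaround's path, output, or address. A minimal sketch, with placeholder search terms and a hypothetical checkout directory; it says nothing about runtime dependencies, which still need log and traffic evidence:

```python
#!/usr/bin/env python3
"""Search local clones of the estate's repos for references to a workaround."""
import pathlib

REPO_ROOT = pathlib.Path("~/estate-repos").expanduser()   # hypothetical checkout dir
# Identifiers of the workaround being retired: its path, its output, its IP.
NEEDLES = ["/home/sre/rollup.sh", "s3://legacy-bucket/config.csv", "10.0.3.17"]

hits = []
for path in REPO_ROOT.rglob("*"):
    # Skip git internals and anything that isn't a regular file.
    if ".git" in path.parts or not path.is_file():
        continue
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue
    for needle in NEEDLES:
        if needle in text:
            hits.append((needle, path))

if hits:
    print("References found -- do not retire yet:")
    for needle, path in hits:
        print(f"  {needle}  referenced in  {path}")
else:
    print("No static references found; check logs and traffic before removal.")
```

A clean result here is necessary but not sufficient. The dependents that hurt are the ones nobody wrote down, which is why the removal still gets staged and monitored rather than done in one pass.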
The trap is treating every workaround as a category-three problem when most are actually category-one. Promoting an existing working hack to a real feature is cheap. The work is mostly documentation and ownership, not engineering.
The line worth holding
Every long-lived estate has temporary workarounds. The estates that have audited them have a list. The estates that haven't, don't. The ones with a list are managing risk; the ones without are accumulating it. The cheapest moment to make the list is when a new team takes over, before familiarity makes the hacks look like architecture. After that, the cost compounds in incidents nobody could have predicted.
Related patterns
- The service nobody owns. The companion failure: an artefact ages out of ownership; the 'temporary' workaround is the same dynamic at the script level.
- The handover that didn't survive contact with reality. Where most of these workarounds get discovered: the first 90 days of a new team taking over the estate.
- The dependency you didn't know you had. The sibling pattern: workarounds become dependencies, and the system has been quietly reorganised around them.