Opinion

Your alert hygiene is a leadership problem.

In an alert audit, the engineering team usually apologises. They shouldn't. Bad alerts are a leadership symptom, not an engineering problem. The fix is policy and budget, not effort.

Tracefox · 6 min read

Run an alert audit on a new client and the engineering team usually opens with an apology. The platform lead says some version of "I know it's bad. We just haven't had time."

They shouldn't apologise. Bad alerts aren't an engineering problem. They're a leadership symptom.

The pattern

Walk into any production engineering org with more than two years of accumulated alerts and the picture is the same: hundreds of alerts in the catalogue, dozens that fire weekly, and a handful that the team has muted in the routing config because they fired so often that on-call stopped acting on them.

The engineering team will frame this as "we need to do an alert audit." They've been saying that for eighteen months. They will keep saying it for the next eighteen unless something changes.

What needs to change isn't engineering's commitment to hygiene. It's leadership's willingness to allocate the time to do it.

Why alerts accumulate

Alerts are added during incidents. Something failed; a postmortem action item said "add an alert"; someone added an alert; the alert exists forever.

Alerts are not removed during incidents. There's no urgency to. The cost of a noisy alert is distributed across every page that wakes someone up, every Slack notification ignored, every minute of investigation that turned out to be unnecessary. The cost of a silent failure (because you removed the alert that would have caught it) lands in one place: the next incident postmortem, with the team being asked why this wasn't caught.

The asymmetry is structural. Adding alerts is local and visible. Removing alerts is global and risky. Without leadership intervention, alerts accumulate monotonically.

The reason your alert hygiene is bad is not that your engineers don't care. It's that the org has never written down that removing alerts is part of their job.

The four things only leadership can do

Engineering can't fix this from inside the team. The fixes require organisational mandate.

1. Budget time for hygiene as a line item

Not "we'll get to it." Allocate it. One sprint per quarter (minimum) dedicated to alert audit, runbook coverage, and ownership review. If reliability work is funded as a category (the way tech debt is in some orgs), alert hygiene gets funded inside it. If not, it doesn't happen. "We'll find time" is a polite way of saying we won't.

2. Mandate ownership at the alert level

Every active alert in production must have a named team owner. A team, not a person. If an alert can't be traced to an owning team within thirty seconds, it gets disabled until it can. This is leadership's call to make, because it forces an organisational design decision: who owns what, and what happens when the answer is "nobody."

The version that gets quietly resisted: alerts that nobody owns are usually the alerts that fire the most, because they were inherited from a team that no longer exists. Disabling them feels reckless. Doing so anyway is the act of leadership that makes the next ninety days possible.
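
The ownership rule is mechanical to enforce once the mandate exists. A minimal sketch, assuming Prometheus-style rule files where ownership lives in a `team` label and the runbook in a `runbook_url` annotation (the field names and layout are assumptions; adapt them to your stack):

```python
# Sketch of the "no owner, no runbook" check, assuming Prometheus-style rule
# files with a `team` label and a `runbook_url` annotation on each alert.
import sys
from pathlib import Path

import yaml  # pip install pyyaml


def unowned_alerts(rules_dir: str) -> list[str]:
    """Return the names of alerting rules missing an owner or a runbook."""
    offenders: list[str] = []
    for rule_file in Path(rules_dir).rglob("*.yml"):
        doc = yaml.safe_load(rule_file.read_text()) or {}
        for group in doc.get("groups", []):
            for rule in group.get("rules", []):
                if "alert" not in rule:
                    continue  # recording rules don't page anyone
                has_owner = bool(rule.get("labels", {}).get("team"))
                has_runbook = bool(rule.get("annotations", {}).get("runbook_url"))
                if not (has_owner and has_runbook):
                    offenders.append(rule["alert"])
    return offenders


if __name__ == "__main__":
    missing = unowned_alerts(sys.argv[1] if len(sys.argv) > 1 else "alerts/")
    for name in missing:
        print(f"disable until owned: {name}")
    sys.exit(1 if missing else 0)  # fail CI so unowned alerts can't land silently
```

Run something like this in CI and the thirty-second traceability test becomes a build failure rather than an argument.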

3. Make removal as legitimate as addition

The cultural permission to remove alerts has to come from above. Engineers don't remove alerts because they're scared of being the engineer who removed the alert that would have caught the next outage. Leadership can change that calculus by stating the position explicitly:

"We'd rather miss one signal than accumulate fifty noisy ones, and the team that removes a bad alert is doing their job."

Said in writing. Said in the engineering all-hands. Said again the first time a removed alert is involved in an incident, and meant.

4. Treat the audit as a recurring practice

Once a year is too rare. Quarterly is the floor. The audit is part of the operational rhythm: same calendar slot, same format, same budget. The first audit is the hardest because the backlog is enormous. The fourth audit is routine. Getting from one to four is just a matter of whether someone scheduled the second one.

What the audit looks like

The Tracefox version, which you can run yourself or have us run with you (a sketch of the classification step follows the list):

  • Every active alert listed with its current fire rate, ownership, runbook URL, and severity.
  • Each alert classified as: keep, retune, or disable.
  • Disabled alerts logged with the reasoning, so the next audit can review whether to reinstate.
  • The "no owner, no runbook" rule enforced: alerts without both get disabled until they have both.

A typical first audit removes 40–60% of the active alert set. Engineers describe the on-call rotation as feeling different within two weeks. The teams that stick with it report the on-call burnout conversation quietly disappearing within a quarter.

The conversation engineering can't have alone

If you're an engineering leader reading this and your alert hygiene has been "on the list" for over a year, the conversation isn't about whether your team can do better. It's about whether your organisation has structurally enabled them to.

When we have that conversation on engagements, the answer is usually no. The fix is policy and budget, not effort. The engineers were never the problem.

If you want a starting point, the alert audit is the second thing we run on every engagement (the assessment is the first), and the burn-rate alerting guide covers what the post-audit alert set should look like. Or if you'd rather start with the cheaper version, the Blueprint is downloadable.

Either way: this isn't an engineering apology. It's a leadership decision that hasn't been made yet.

Engagement.start()

The first audit is the hardest. The fourth is routine. Getting from one to four is a matter of whether anyone has made it part of the work.

The Tracefox engagement runs the audit on day one: the alert inventory, the ownership review, the keep/retune/disable classification. We bring the calendar slot and the leadership air-cover; your engineers bring the context. The output is a healthier rotation by the end of week two.