The dashboard that aged into uselessness.
The dashboard was beautiful in week one. By month nine, the team had renamed three services, retired two metrics, and added a new database without instrumenting the old views. Half the panels are dead. The team can't tell, because Grafana renders missing data the same colour as healthy data, and nobody has audited the board since the original author left.
A subtle observability failure I keep finding in audits: the team has dashboards that look fine and aren't. The panels render. The colours are mostly green. The queries don't error visibly. But the data feeding half the panels has been gone for months. The dashboard has aged into a piece of decoration, and the team hasn't noticed because the decoration is the same colour as health.
This is dashboard rot. It's caused by the same forces that cause code rot (small, individually reasonable changes accumulating against an artefact that nobody owns) and it ends in the same place: an incident where the dashboard says "OK" while the system is on fire, and the team blames the wrong thing for ten minutes because the panels they were trusting were lying.
How dashboards rot
Rot doesn't arrive all at once. It's a slow accumulation of small discrepancies, each one too small on its own to be worth fixing. The most common modes:
- Metric renames. A service team refactors. The metric `http_requests_total` becomes `http_server_requests_total`. The new name flows everywhere except the dashboard, which keeps querying the old name. The query returns no series. The panel renders empty. Empty looks like flat. Flat looks like healthy. (A sketch of this failure follows the list.)
- Service retirements. A service is decommissioned. Its metrics stop being emitted. The dashboard panel for that service shows "no data", but the panels above and below it look normal, and nobody flags it.
- Label cardinality changes. A label is added to a metric but never to the dashboard's filters. The dashboard now aggregates across a label dimension it shouldn't, and the numbers it shows are subtly wrong. The shape looks plausible. The values are not the values the panel claims to be showing.
- Threshold drift. The "good" and "bad" colour thresholds were set when the system handled 50 requests per second. It now handles 500. The thresholds were never updated, and the panel is permanently showing "in the green" because the absolute numbers are now in a different range entirely.
- Time-window mismatches. The dashboard defaults to "last 6 hours." The team is investigating something that happened in the last fifteen minutes. The dashboard's smoothing makes the spike invisible at the default zoom. The panel renders, and renders something, but not the thing the eye is looking for.
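To make the rename mode concrete, here's a minimal sketch, assuming a Prometheus-compatible backend and the hypothetical metric names from the first bullet. The point is that nothing fails loudly: the old query is still valid PromQL, it just matches nothing.

```python
# A minimal sketch of the rename failure, assuming a Prometheus-compatible
# backend at PROM_URL. The metric names are the hypothetical pair from the
# rename example above.
import requests

PROM_URL = "http://prometheus:9090"  # placeholder

def series_count(query: str) -> int:
    """Run an instant query and return how many series came back."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    return len(resp.json()["data"]["result"])

old = series_count("http_requests_total")         # what the dashboard queries
new = series_count("http_server_requests_total")  # what the service now emits

# old comes back 0, new does not. Nothing errored, and a panel fed by
# the old query renders empty, which the eye reads as a healthy flat line.
print(f"old name: {old} series, new name: {new} series")
```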
Why nobody catches it
Dashboards are usually owned by the engineer who built them. That engineer rotates teams, gets promoted, leaves. The dashboard inherits the same fate as any unowned artefact: it persists, but nobody is responsible for it being correct. The team uses it because it's there, and the longer it's been there, the more it's trusted, despite no one having audited the queries in a year.
The other reason is that "no data" looks like "no problem" in most visualisation tools. A panel with zero series rendered is visually identical to a panel with all-green metrics. The brain reads both as healthy. The tool isn't surfacing the difference.
The audit, in practice
The audit I run on dashboards has three checks per panel. They're simple enough that a junior engineer can do the audit in a few days with a spreadsheet and a query tool.
- Is the underlying metric still being emitted? Run the query. Confirm there's data in the last hour. If the panel returns "no data," it's either rotted or genuinely dead. Either way, the panel needs a decision: fix the query, or remove the panel. (This check scripts well; there's a sketch after the list.)
- Does the query still mean what it claims to mean? Read the query. Compare it to the metric definition in the service. Look for label drift, aggregation drift, naming drift. A surprising number of queries are aggregating the wrong way after a downstream change.
- Is anyone using this panel during incidents? Check the access logs of the dashboarding tool, or just ask the on-call. Panels that nobody opens during incidents are not load-bearing in any decision. They can be moved to a separate "deep dive" board or deleted.
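Here is a sketch of the first check run at scale, assuming Grafana with a Prometheus datasource. The URLs and token are placeholders; templated queries and panels inside collapsed rows are skipped rather than judged.

```python
# A sketch of check one against live systems. Assumptions: Grafana with a
# Prometheus datasource, a service-account token, and PromQL targets that
# carry their query in "expr".
import time

import requests

GRAFANA_URL = "http://grafana:3000"   # placeholder
PROM_URL = "http://prometheus:9090"   # placeholder
HEADERS = {"Authorization": "Bearer <service-account-token>"}  # placeholder

def has_recent_data(expr: str) -> bool:
    """True if the query returned any series over the last hour."""
    now = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": expr, "start": now - 3600, "end": now, "step": "60"},
        timeout=30,
    )
    resp.raise_for_status()
    return bool(resp.json()["data"]["result"])

# Enumerate every dashboard, then every panel's queries.
dashboards = requests.get(
    f"{GRAFANA_URL}/api/search",
    params={"type": "dash-db"},
    headers=HEADERS,
    timeout=30,
).json()

for entry in dashboards:
    dash = requests.get(
        f"{GRAFANA_URL}/api/dashboards/uid/{entry['uid']}",
        headers=HEADERS,
        timeout=30,
    ).json()["dashboard"]
    for panel in dash.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if not expr or "$" in expr:
                continue  # templated queries need variable expansion first
            if not has_recent_data(expr):
                print(f"{dash['title']} / {panel.get('title', '?')}: no data")
```

The output is the first column of the audit spreadsheet: every panel that has gone dark, waiting for a fix-or-delete decision.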
Most audits I've run produce a 30–50% reduction in panels. The remaining panels are smaller in number and higher in value. The dashboard becomes legible again.
What to do about "no data"
The single most important visualisation change I recommend after an audit is making "no data" look different from "data showing healthy." In Grafana, that's a thresholds-and-mappings change: explicitly map missing data to a distinctive colour (I use a yellow-grey diagonal stripe pattern) so the eye knows the panel is broken, not happy.
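For dashboards kept as JSON, the mapping can be retrofitted mechanically. A sketch, assuming the Grafana 8+ field-config schema; the colour and "NO DATA" label stand in for whatever pattern your theme supports.

```python
# A sketch that retrofits a "no data" mapping onto every panel in a
# dashboard JSON file. The mapping schema assumes Grafana 8+ field config.
import json
import sys

NO_DATA_MAPPING = {
    "type": "special",
    "options": {
        "match": "null+nan",
        "result": {"text": "NO DATA", "color": "yellow", "index": 0},
    },
}

path = sys.argv[1]  # e.g. dashboards/service-overview.json
with open(path) as f:
    dash = json.load(f)

for panel in dash.get("panels", []):
    defaults = panel.setdefault("fieldConfig", {}).setdefault("defaults", {})
    mappings = defaults.setdefault("mappings", [])
    if NO_DATA_MAPPING not in mappings:
        mappings.append(NO_DATA_MAPPING)
    defaults.setdefault("noValue", "NO DATA")  # text shown when no series return

with open(path, "w") as f:
    json.dump(dash, f, indent=2)
```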
The change is small. The effect is large. A team that can see at a glance which panels are dark stops trusting them by accident. The audit then becomes continuous instead of quarterly.
Dashboards as code
The other intervention that prevents future rot is treating dashboards as code. Version them. Review changes. Tie them to the service they describe, in the same repository, with the same review process. When the metric is renamed, the dashboard PR is part of the same change, and the rot doesn't accumulate.
The teams who do this still need to audit periodically — code reviews catch most renames but miss the long tail — but the background rate of rot drops by an order of magnitude.
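That periodic audit can also live in CI once the dashboards sit in the repo. A sketch, assuming dashboard JSON is committed under dashboards/; the metric-name regex is a deliberately rough heuristic, and PROM_URL is a placeholder.

```python
# A sketch of the audit as a CI gate: pull every metric name out of the
# committed dashboard queries and fail the build if Prometheus no longer
# knows the name.
import json
import re
import sys
from pathlib import Path

import requests

PROM_URL = "http://prometheus:9090"  # placeholder

# Matches snake_case names with common metric suffixes; it will miss
# unconventionally named gauges, hence "catches most, not all".
METRIC_RE = re.compile(r"\b[a-z_][a-z0-9_]*_(?:total|seconds|bytes|count|sum|bucket)\b")

# Every metric name the backend currently knows about.
live = set(
    requests.get(f"{PROM_URL}/api/v1/label/__name__/values", timeout=30)
    .json()["data"]
)

missing = []
for path in Path("dashboards").glob("*.json"):
    dash = json.loads(path.read_text())
    for panel in dash.get("panels", []):
        for target in panel.get("targets", []):
            for name in METRIC_RE.findall(target.get("expr", "")):
                if name not in live:
                    missing.append((path.name, panel.get("title", "?"), name))

for file, panel, name in missing:
    print(f"{file} / {panel}: {name} is not being emitted")
sys.exit(1 if missing else 0)
```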
The line worth holding
A dashboard is not a static artefact. It's a service. It has inputs, outputs, an implicit contract, and an owner. The dashboards that age into uselessness are the ones nobody promoted to first-class status. Treat them as code, audit them quarterly, make missing data visually obvious, and the rot stops being the silent failure mode it currently is.
The dashboard nobody opens during the incident.
The companion failure: the dashboard that's too noisy to read at 02:14. Where that piece is about overload, this one is about the dashboard silently lying because the data feeding it has rotted.
The service nobody owns.
Sibling failure: artefacts go untouched until they break. Dashboards are services too, and they rot the same way.
A starter SLI catalogue.
What metric definitions look like when they're versioned, owned, and reviewed — instead of left to rot under their original creators' names.