Field notes

The dashboard nobody opens during the incident.

The team built it in a calm afternoon, populated it with everything, demoed it to leadership, and pinned it on the wiki. Then the page went off and nobody opened it. They fell back to Slack and gut feel. The dashboard wasn't broken. It was just built for a calmer reader than the one a P1 produces.

Ken Tan · 6 min read

I've sat through enough incident reviews to know what to listen for. There's always a moment, ten or fifteen minutes in, when someone says "I checked the dashboard and it looked fine, so I went to look at the logs." The person says it casually. Nobody flinches. The reviewer moves on.

That sentence is a complete description of the failure. The dashboard that's supposed to anchor the investigation didn't. The responder bypassed it within the first quarter of the incident and started reasoning from raw signals. By the time someone with more context arrived, the dashboard was still open in nobody's browser.

This is one of the most common observability failures I see, and it's almost never a tooling problem. The data is in the platform. The team has paid for it. They just built the wrong dashboard for the wrong reader.

Who the dashboard was built for

The dashboards I see in most engagements were built in a calm afternoon by an engineer with full coffee, full context, and a brief that read something like "put everything important on one page so leadership can see it." What gets shipped is a wall of panels — request rates, error rates, latency histograms, queue depths, Kafka lag, GC pause time, JVM heap, CPU, memory, disk, every service, every environment. Forty panels. Sometimes eighty.

That dashboard is fine for the engineer who built it. It's fine for the architect doing capacity planning on a Tuesday. It is the wrong tool for a sleep-deprived responder at 02:14 trying to answer the question "is this real, and how bad." The cognitive load is too high. The signal-to-noise ratio is wrong. The eye doesn't know where to look first, and the panels aren't ordered by what matters.

What happens during the actual incident

The responder opens the dashboard out of habit. They scan it for two seconds. Nothing jumps out. Two of the panels are red but they're red every Tuesday morning, so they discount them. The latency panel uses a log scale and the spike doesn't read as a spike. They close the tab, open Slack, type "anyone seeing weirdness on orders-api?", and start running ad-hoc queries from memory.

Now they're investigating without an anchor. Every query is invented in the moment. The senior engineer who joins the bridge ten minutes later asks "what does the dashboard say" and the answer is "we didn't really look, it's hard to read." That sentence is the one that should have been caught a quarter earlier, in a calm review, before the dashboard became scenery.

What a dashboard that survives a P1 looks like

The dashboards that work in real incidents share a few traits. None of them are about which tool you're using. They're about how the dashboard was framed.

  • One screen, no scrolling. If the responder has to scroll, the dashboard has lost. The whole board has to fit on a 13-inch laptop at 02:14, read by someone whose hands are still half-asleep.
  • Built backwards from the questions. Start with the five questions the responder will ask in the first five minutes (latency on the top endpoints, error rate, affected tenants, what changed, downstream health). Build a panel for each. Stop.
  • Panels ordered by reading flow. Top-left to bottom-right, mapped to the order the responder needs them. User impact first. Internal causes second. Resource utilisation last, and only if it's diagnostic.
  • Annotations for deploys, feature flags, config pushes. The single highest-leverage panel feature, and the one most teams skip. A vertical line on a latency graph saying "deploy at 02:11" cuts ten minutes off the average investigation. (A sketch of wiring this up from the deploy pipeline follows this list.)
  • Linear scales by default. Log scales hide the spikes you're trying to see. They're for the analyst, not the responder.
  • Per-tenant or per-segment slicing. One tenant in pain is a different incident from everybody in pain. The dashboard should show both shapes at a glance.
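
The annotations bullet is the easiest of these to automate. As a rough illustration — assuming Grafana as the dashboard tool and a CI step that knows which service and version just shipped — the deploy job can post an annotation through Grafana's annotations HTTP API. The environment variable names, tag convention, and token handling here are placeholders, not a prescription:

    import os
    import time

    import requests  # assumes the requests library is available in the CI image

    GRAFANA_URL = os.environ["GRAFANA_URL"]      # e.g. https://grafana.internal (placeholder)
    GRAFANA_TOKEN = os.environ["GRAFANA_TOKEN"]  # a token with permission to create annotations

    def annotate_deploy(service: str, version: str) -> None:
        """Drop a 'deploy' annotation at the current time so it appears as a
        vertical line on any panel configured to show the matching tags."""
        payload = {
            "time": int(time.time() * 1000),     # Grafana expects epoch milliseconds
            "tags": ["deploy", service],         # tag names are a team convention, not fixed
            "text": f"deploy {service} {version}",
        }
        resp = requests.post(
            f"{GRAFANA_URL}/api/annotations",
            json=payload,
            headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
            timeout=5,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        # Typically called from the deploy pipeline, right after the rollout step.
        annotate_deploy(os.environ["SERVICE_NAME"], os.environ["GIT_SHA"])

The same call works from a feature-flag webhook or a config-push hook. The point is that the line lands on the graph automatically, without anyone having to remember it at 02:11.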

Everything else — the 80-panel forensic view, the per-host CPU breakdown, the JVM internals — moves to a separate "deep dive" board the responder navigates to when the first board has narrowed the question. The two boards do different jobs and shouldn't be the same board.

Why teams resist this

The pushback I hear when I propose stripping a dashboard back is consistent: "but we might need that panel." The panel that might be needed once a quarter is on the same page as the panel that's needed every incident. The cost of keeping it isn't visible until the incident, when the eye fails to pick out the urgent panel because of everything around it. The dashboard's job isn't to hold every panel that might ever be useful. Its job is to make the most useful panels impossible to miss.

A useful exercise: time how long it takes a responder, cold, to answer "is anything materially worse for users right now" using the current dashboard. If it's more than thirty seconds, the dashboard is the bug.

The two boards a real team needs

From the engagements I run, the working pattern is almost always two dashboards, not one:

  1. The incident board. One screen, six to ten panels, built backwards from the first five minutes of triage. Linked from every alert. Owned by whoever owns the on-call rotation, reviewed every quarter.
  2. The deep-dive board. Forty panels, optional, used after the incident board has narrowed the search. Owned by the service team, lives or dies by whether anyone uses it.

The incident board is what the runbook links to. The deep-dive board is what the senior engineer pivots to once the responder has called them in. Conflating the two is what produces the dashboard nobody opens.
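
If the team keeps dashboards in version control, the incident board is small enough to express as code. The sketch below is illustrative only — the panel titles and PromQL-style queries are stand-ins for whatever the team's own five triage questions are, and the JSON shape is a simplified Grafana-style layout rather than a drop-in import — but it shows the constraint: a fixed 24-column grid, user impact in the top row, everything sized to fit one screen.

    import json

    def panel(title, query, x, y, w=8, h=7):
        """One panel; gridPos is {h, w, x, y} on a 24-column grid."""
        return {
            "title": title,
            "type": "timeseries",
            "gridPos": {"h": h, "w": w, "x": x, "y": y},
            "targets": [{"expr": query}],
        }

    # Titles and queries are placeholders for the team's own first-five-minutes questions.
    incident_board = {
        "title": "orders-api / incident board",
        "panels": [
            # Row 1: user impact first.
            panel("p99 latency, top endpoints",
                  'histogram_quantile(0.99, sum by (le, route) '
                  '(rate(request_duration_seconds_bucket{service="orders-api"}[5m])))',
                  x=0, y=0),
            panel("Error rate",
                  'sum(rate(requests_total{service="orders-api", code=~"5.."}[5m]))',
                  x=8, y=0),
            panel("Tenants seeing errors",
                  'count(sum by (tenant) '
                  '(rate(requests_total{service="orders-api", code=~"5.."}[5m])) > 0)',
                  x=16, y=0),
            # Row 2: what changed, then downstream health.
            panel("Traffic (deploy and flag annotations land here)",
                  'sum(rate(requests_total{service="orders-api"}[5m]))',
                  x=0, y=7, w=12),
            panel("Downstream dependency errors",
                  'sum by (dependency) (rate(dependency_errors_total{caller="orders-api"}[5m]))',
                  x=12, y=7, w=12),
        ],
    }

    if __name__ == "__main__":
        # The printed JSON lives in version control: the board changes by pull request,
        # reviewed each quarter, not by panels drifting in quietly.
        print(json.dumps(incident_board, indent=2))

Which tool renders it matters less than the constraint the file encodes: a handful of panels, one screen, ordered by the questions the responder will actually ask.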

The line worth holding

The test of a good dashboard is not whether it shows everything. It's whether the responder reaches for it instead of away from it. If the rotation is bypassing the dashboard within the first quarter of every incident, that's not laziness. The dashboard wasn't built for them. Build them one that is.

Engagement.start()

A dashboard that survives a P1 looks nothing like the one most teams build.

A Tracefox engagement strips the incident dashboard back to a single screen. One row per Golden Signal, one tile per top-tier service, every panel earning its place by answering a question the responder will actually ask in the first five minutes. Everything else moves to a deep-dive view. The on-call rotation gets a tool they reach for instead of avoiding.