Tracefox / Library / Guide · Fundamentals

The Golden Signals, properly understood.

Four signals are sufficient to characterise the health of any service. The reason most teams still alert on the wrong things is that they treat the signals as a checklist instead of a vocabulary.

8 min read · v1.0

The Golden Signals come from Google's Site Reliability Engineering book: four signals that together describe whether a service is healthy from the user's perspective. Latency, Traffic, Errors, Saturation. If you can instrument only one set of measurements on every service you operate, these four are it.

What gets people stuck isn't the list. It's that the list looks deceptively simple. Teams check the box ("yes, we measure latency") without asking whether they're measuring it the way the signal demands: separately for successful and failed requests, as a histogram, with bounded cardinality. The signal is only as good as the instrumentation underneath it.

[LAT] Latency

Time to serve a request. The operative word is serve: the user's experience of waiting, not your service's internal processing time. Track latency as a histogram, not as an average. Averages hide tail latency, and tail latency is where outages live.

Track successful and failed requests separately. A request that returns a 500 in 1ms looks great in the latency histogram, but the user experienced an error, not low latency. Failed-request latency is interesting for debugging; successful-request latency is what your SLO should be built on.

Pitfall: Averaging latency hides tail latency. p99 is the floor; p99.9 catches the whales. Two services with identical p50 can have wildly different user experiences if one has a long tail.
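The histogram discipline above can be sketched in a few lines of plain Python. The bucket edges and the two-histogram split are illustrative assumptions, not a prescription; a real system would use your metrics library's histogram type.

```python
from bisect import bisect_left

# Illustrative bucket edges in seconds; tune them to your service's latency profile.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]

class LatencyHistogram:
    """Fixed-bucket latency histogram; keep one per (endpoint, outcome) pair."""
    def __init__(self):
        self.counts = [0] * len(BUCKETS)
        self.total = 0

    def observe(self, seconds):
        # Each observation lands in the first bucket whose bound covers it.
        self.counts[bisect_left(BUCKETS, seconds)] += 1
        self.total += 1

    def quantile(self, q):
        """Upper bound of the bucket holding the q-th quantile (approximate)."""
        target = q * self.total
        running = 0
        for bound, count in zip(BUCKETS, self.counts):
            running += count
            if running >= target:
                return bound
        return BUCKETS[-1]

# Successful and failed requests go into separate histograms.
ok, failed = LatencyHistogram(), LatencyHistogram()
for _ in range(99):
    ok.observe(0.02)        # healthy requests around 20 ms
ok.observe(1.8)             # one tail outlier: invisible to the average
failed.observe(0.001)       # a fast 500 must not flatter the SLO histogram
```

With this data the average sits under 40 ms while p99.9 reports the 1.8 s whale, which is exactly why the signal demands a histogram, not a mean.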

[TRF] Traffic

Demand on the system. Requests per second. Messages consumed. Active sessions. The "load" in load testing. Tracked as a counter, displayed as a rate over time, broken down by endpoint, region, and customer tier where useful.

Traffic establishes the baseline that anomaly detection sits on top of. Teams worry about spikes; drops rarely get the same scrutiny, and that's a mistake. A traffic drop rarely means improved performance. It's almost always an upstream failure that didn't page, a load balancer misconfiguration, or a CDN cache stampede in the wrong direction.
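Counter-plus-rate tracking can be sketched minimally in plain Python. `RateWindow`, its window size, and the simulated traffic are hypothetical; in practice the counter lives in your metrics backend and the rate is computed at query time.

```python
from collections import deque

class RateWindow:
    """Traffic as a counter, displayed as a rate: requests/second over a sliding window."""
    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = deque()          # timestamps of counted requests

    def hit(self, now):
        self.events.append(now)

    def rate(self, now):
        # Evict anything older than the window, then average over it.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window

w = RateWindow(window_seconds=60.0)
for t in range(120):
    w.hit(t * 0.5)                     # steady 2 requests/second for a minute
baseline = w.rate(59.5)                # 2.0 rps
# The alarming case is the drop, not the spike: it usually means an
# upstream failure, not improved performance.
after_outage = w.rate(120.0)           # 0.0 rps
```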

[ERR] Errors

Rate of requests that fail. The most actionable early-warning signal you have. Three flavours, all worth tracking:

  • Explicit: 5xx responses, exception traces, transport-layer failures.
  • Implicit: 200 OK with the wrong content. The login that returns "success" but didn't actually log anyone in. Invisible without application-level instrumentation. Add it.
  • By policy: requests that succeeded but breached a quality gate (latency target missed, response too large, schema invalid).

Separate 4xx (client error, usually not your fault) from 5xx (server error, almost always your fault). Combining them in a single error metric pollutes the signal and produces alerts you'll learn to ignore.
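The three flavours plus the 4xx/5xx split reduce to one classification function. The response shape (an `error` field in the payload) and the latency budget below are illustrative assumptions, not a real schema.

```python
def classify(status, body, duration_s, latency_budget_s=0.5):
    """Sort one response into an error flavour ('ok' and 'client_error' included)."""
    if status >= 500:
        return "explicit"            # server fault: counts toward the error rate
    if 400 <= status < 500:
        return "client_error"        # usually not your fault; track separately
    if body.get("error"):
        return "implicit"            # 200 OK with the wrong content
    if duration_s > latency_budget_s:
        return "policy"              # succeeded, but breached a quality gate
    return "ok"
```

Feeding each class into its own counter keeps the 5xx alert clean while still making implicit and policy failures visible on a dashboard.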

[SAT] Saturation

How full the service is. CPU, memory, disk, thread pool, queue depth, connection pool. Saturation is the leading indicator: systems degrade before they fail, and saturation tells you the failure is approaching.

Crucially: not all saturation is equal. A queue at 90% depth is more urgent than CPU at 90%. CPU has elastic responses (autoscale, throttle); a full queue does not. Know which resource is your bottleneck for each service, and measure that one as the saturation signal, not whichever metric your dashboard shipped with by default.
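The elasticity point above can be made concrete with a small sketch. The class name and the threshold values are invented for illustration; the point is only that inelastic resources deserve a lower paging threshold.

```python
class PoolSaturation:
    """Saturation as an in_use/capacity ratio with an elasticity-aware threshold."""
    def __init__(self, capacity, elastic):
        self.capacity = capacity
        self.in_use = 0
        # Illustrative thresholds: an inelastic resource (a queue, a connection
        # pool) pages earlier than one you can autoscale or throttle.
        self.page_at = 0.9 if elastic else 0.7

    def acquire(self):
        self.in_use += 1

    def release(self):
        self.in_use -= 1

    def ratio(self):
        return self.in_use / self.capacity

    def should_page(self):
        return self.ratio() >= self.page_at

queue = PoolSaturation(capacity=10, elastic=False)
cpu = PoolSaturation(capacity=100, elastic=True)
for _ in range(8):
    queue.acquire()                 # queue at 80%: inelastic, page now
for _ in range(80):
    cpu.acquire()                   # CPU at 80%: elastic, keep watching
```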

Why CPU and memory aren't Golden Signals

They're implementation details. A service can run at 10% CPU and fail every request. A service can run at 90% CPU and serve users perfectly well. CPU correlates so poorly with user-visible health that paging on it is a coin-flip.

Saturation is the only resource-flavoured Golden Signal, and it's measured as a leading indicator for the user-facing failure that's about to happen, not as a primary health signal.

What this looks like applied: a checkout API

A worked example of what "instrumenting Golden Signals" actually means in practice, for a single service:

  • Latency: request duration histogram with endpoint, method, status_class labels. p50/p95/p99 computed at query time. Successful and failed paths separated.
  • Traffic: request count counter with the same labels. rate() over 1m and 5m windows.
  • Errors: error count counter (5xx only, plus a separate business_error_total for 200-with-error-payload cases). Error ratio computed as a recording rule.
  • Saturation: in this case the bottleneck is the database connection pool. Track db_connections_in_use / db_connections_max as a gauge.

All four signals on one service. From here you can build SLOs, burn-rate alerts, and dashboards. Without these four, everything downstream is built on sand.
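The error-ratio recording rule from the checkout example reduces to simple arithmetic. The counts below are invented for illustration, one 5-minute window of a hypothetical checkout API.

```python
# Invented counts for one 5-minute window of the checkout API.
http_requests_total = {"2xx": 9915, "4xx": 40, "5xx": 45}
business_error_total = 25           # 200-with-error-payload cases

total = sum(http_requests_total.values())
# Recording-rule equivalent of the Errors bullet: 5xx only.
error_ratio = http_requests_total["5xx"] / total
# Broader user-facing failure ratio, folding in business errors
# but still excluding 4xx client errors.
failure_ratio = (http_requests_total["5xx"] + business_error_total) / total
```

Precomputing these ratios is what makes burn-rate alerts cheap: the alert compares a stored ratio against a threshold instead of re-aggregating raw counters on every evaluation.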

What to do next

If your services don't yet emit the four signals consistently, that's the first job: before SLOs, before dashboards, before alert tuning. Once they do, the next move is choosing the right SLIs to build on top of them. We've published a starter SLI catalogue that maps each signal to the indicator patterns we use most often, plus the full Blueprint if you want the long version.

Engagement.start()

The four signals are the easy part. Agreeing what they mean is the work.

The Tracefox assessment scores Golden Signal coverage on evidence: what's actually emitting in production, not what's planned. Book a discovery if you want to know where you actually stand.