The Golden Signals, properly understood.
Four signals are sufficient to characterise the health of any service. The reason most teams still alert on the wrong things is that they treat the signals as a checklist instead of a vocabulary.
The Golden Signals come from Google's Site Reliability Engineering book: four signals that together describe whether a service is healthy from the user's perspective. Latency, Traffic, Errors, Saturation. If you can only instrument one set of measurements on every service you operate, make it these four.
What gets people stuck isn't the list. It's that the list looks deceptively simple. Teams check the box ("yes, we measure latency") without asking whether they're measuring it the way the signal demands: separately for successful and failed requests, as a histogram, with bounded cardinality. The signal is only as good as the instrumentation underneath it.
[LAT] Latency
Time to serve a request. The operative word is serve: the user's experience of waiting, not your service's internal processing time. Track latency as a histogram, not as an average. Averages hide tail latency, and tail latency is where outages live.
Track successful and failed requests separately. A request that returns a 500 in 1ms looks great in the latency histogram, but the user experienced an error, not low latency. Failed-request latency is interesting for debugging; successful-request latency is what your SLO should be built on.
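A minimal sketch of what that looks like, assuming a Prometheus-style Python client (prometheus_client); the metric name, labels, and buckets here are illustrative, not prescriptive:

```python
from prometheus_client import Histogram

# Request duration in seconds, as a histogram. The status_class label
# ("2xx", "5xx", ...) keeps successful and failed latency separable at query time.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time to serve a request, as experienced by the caller",
    ["endpoint", "method", "status_class"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def record_request(endpoint: str, method: str, status_code: int, duration_s: float) -> None:
    status_class = f"{status_code // 100}xx"
    REQUEST_LATENCY.labels(
        endpoint=endpoint, method=method, status_class=status_class
    ).observe(duration_s)
```

Percentiles (p50/p95/p99) are then computed at query time from the buckets, rather than pre-aggregated in the service.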
[TRF] Traffic
Demand on the system. Requests per second. Messages consumed. Active sessions. The "load" in load testing. Tracked as a counter, displayed as a rate over time, broken down by endpoint, region, and customer tier where useful.
Traffic establishes the baseline that anomaly detection sits on top of. Spikes attract attention; drops rarely do, and that's a mistake. A traffic drop rarely means improved performance. It's almost always an upstream failure that didn't page, a load balancer misconfiguration, or a CDN cache stampede in the wrong direction.
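A matching traffic counter might look like the sketch below (same assumed Prometheus-style client, illustrative names). If you already export the latency histogram above, its _count series can serve the same purpose; a dedicated counter is shown for clarity:

```python
from prometheus_client import Counter

# Demand on the service: one increment per request, labelled for breakdowns.
# Displayed as a rate over a window (e.g. rate(http_requests_total[5m])),
# never as a raw cumulative count.
REQUESTS_TOTAL = Counter(
    "http_requests_total",
    "Requests received, by endpoint, method and status class",
    ["endpoint", "method", "status_class"],
)

def record_traffic(endpoint: str, method: str, status_code: int) -> None:
    REQUESTS_TOTAL.labels(
        endpoint=endpoint, method=method, status_class=f"{status_code // 100}xx"
    ).inc()
```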
[ERR] Errors
Rate of requests that fail. The most actionable early-warning signal you have. Three flavours, all worth tracking:
- Explicit: 5xx responses, exception traces, transport-layer failures.
- Implicit: 200 OK with the wrong content. The login that returns "success" but didn't actually log anyone in. Invisible without application-level instrumentation. Add it.
- By policy: requests that succeeded but breached a quality gate (latency target missed, response too large, schema invalid).
Separate 4xx (client error, usually not your fault) from 5xx (server error, almost always your fault). Combining them in a single error metric pollutes the signal and produces alerts you'll learn to ignore.
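One way to keep the three flavours and the 4xx/5xx split separate, sketched with the same assumed client; business_error_total matches the name used in the worked example below, the other names are illustrative:

```python
from prometheus_client import Counter

# Explicit failures: server-side errors only. Client errors (4xx) are tracked
# in their own counter so they don't pollute the server error signal.
SERVER_ERRORS = Counter(
    "http_server_errors_total", "Requests that returned 5xx", ["endpoint", "method"]
)
CLIENT_ERRORS = Counter(
    "http_client_errors_total", "Requests that returned 4xx", ["endpoint", "method"]
)

# Implicit failures: responses that said 200 OK but carried an error payload.
# Only application code can detect these, hence a dedicated counter.
BUSINESS_ERRORS = Counter(
    "business_error_total",
    "200 responses whose payload indicates failure",
    ["endpoint", "reason"],
)

def record_errors(endpoint: str, method: str, status_code: int, payload_ok: bool) -> None:
    if status_code >= 500:
        SERVER_ERRORS.labels(endpoint=endpoint, method=method).inc()
    elif status_code >= 400:
        CLIENT_ERRORS.labels(endpoint=endpoint, method=method).inc()
    elif not payload_ok:
        BUSINESS_ERRORS.labels(endpoint=endpoint, reason="error_payload").inc()
```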
[SAT] Saturation
How full the service is. CPU, memory, disk, thread pool, queue depth, connection pool. Saturation is the leading indicator: systems degrade before they fail, and saturation tells you the failure is approaching.
Crucially: not all saturation is equal. A queue at 90% depth is more urgent than CPU at 90%. CPU has elastic responses (autoscale, throttle); a full queue does not. Know which resource is your bottleneck for each service, and measure that one as the saturation signal, not whichever metric your dashboard shipped with by default.
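For a queue-bound service, the saturation gauge can be as simple as the following sketch (hypothetical metric name, same assumed client); the point is to export utilisation of the bottleneck resource, not every resource:

```python
from prometheus_client import Gauge

# Saturation of the resource that actually limits this service: queue depth
# relative to the depth at which producers start being rejected or delayed.
QUEUE_SATURATION = Gauge(
    "work_queue_saturation_ratio",
    "Current queue depth divided by the depth considered full",
)

def update_queue_saturation(current_depth: int, max_depth: int) -> None:
    QUEUE_SATURATION.set(current_depth / max_depth)
```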
Why CPU and memory aren't Golden Signals
They're implementation details. A service can run at 10% CPU and fail every request. A service can run at 90% CPU and serve users perfectly well. CPU correlates with health badly enough that paging on it is a coin-flip.
Saturation is the only resource-flavoured Golden Signal, and it's measured as a leading indicator for the user-facing failure that's about to happen, not as a primary health signal.
What this looks like applied: a checkout API
A worked example of what "instrumenting Golden Signals" actually means in practice, for a single service:
- Latency: request duration histogram with `endpoint`, `method`, `status_class` labels. p50/p95/p99 computed at query time. Successful and failed paths separated.
- Traffic: request count counter with the same labels. `rate()` over 1m and 5m windows.
- Errors: error count counter (5xx only, plus a separate `business_error_total` for 200-with-error-payload cases). Error ratio computed as a recording rule.
- Saturation: in this case the bottleneck is the database connection pool. Track `db_connections_in_use / db_connections_max` as a gauge.
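Tying it together, a hedged sketch of the per-request recording path for this service, reusing the helper functions from the sketches above; the pool object and its in_use / max_size attributes are hypothetical stand-ins for whatever your database driver exposes:

```python
import time
from prometheus_client import Gauge

# Saturation for this service is the database connection pool, exported as
# in-use / max so dashboards and alerts read it as a 0..1 ratio.
DB_POOL_SATURATION = Gauge(
    "db_connection_pool_saturation_ratio",
    "db_connections_in_use / db_connections_max",
)

# Error ratio as a recording rule (PromQL, illustrative):
#   sum(rate(http_server_errors_total[5m])) / sum(rate(http_requests_total[5m]))

def observe_checkout_request(endpoint, method, status_code, payload_ok, started_at, pool):
    """Record all four signals for one handled request."""
    duration_s = time.monotonic() - started_at
    record_request(endpoint, method, status_code, duration_s)  # latency
    record_traffic(endpoint, method, status_code)              # traffic
    record_errors(endpoint, method, status_code, payload_ok)   # errors
    DB_POOL_SATURATION.set(pool.in_use / pool.max_size)        # saturation
```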
All four signals on one service. From here you can build SLOs, burn-rate alerts, and dashboards. Without these four, everything downstream is built on sand.
What to do next
If your services don't yet emit the four signals consistently, that's the first job: before SLOs, before dashboards, before alert tuning. Once they do, the next move is choosing the right SLIs to build on top of them. We've published a starter SLI catalogue that maps each signal to the indicator patterns we use most often, plus the full Blueprint if you want the long version.