Field notes

The synthetic check that lies to you.

A vendor sells you a synthetic monitor. You point it at one URL, every minute, from three regions. It passes. It always passes. During a real outage last quarter, it kept passing for forty-three minutes while the actually-broken endpoints — the ones doing the work — were timing out. The board was green. The customers were not.

Ken Tan · 6 min read

I have a personal hierarchy of monitoring artefacts I trust. Real user monitoring sits at the top. Tracing comes second. Properly structured logs come third. Synthetic checks, as commonly deployed, sit near the bottom. Not because synthetics are useless — well-built synthetics are valuable — but because the synthetics most teams actually run are barely informative, and the green ticks they produce do active harm.

The pattern I see in audits: a team has bought a synthetic monitoring product. They've configured a check that hits the homepage every minute from three geographies. The check passes with high reliability. The status board is mostly green. Leadership looks at the board and concludes the system is healthy. The customers, meanwhile, are submitting tickets about a feature the synthetic doesn't touch.

Why the synthetic passes during outages

The synthetic check that lies to you usually has three properties, all of which were chosen for reasons that seemed sensible at setup time:

  • It hits a single URL. Usually the homepage or a designated "health" endpoint. The endpoint was chosen because it was easy to set up and unlikely to flap.
  • It only checks for HTTP 200. Not response time, not body content, not asset completeness. A 200 with an empty shell counts as healthy.
  • It's anonymous. No login. No tenant context. No authenticated state. The path it exercises is the smallest possible surface.

Each of those choices made the check easy to maintain. Together, they made the check incapable of detecting anything users care about. The homepage is fast and cached. The 200 will return whether the database is up or not. The anonymous path doesn't touch auth, so the check stays green during an auth outage that's blocking every actual user.
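For concreteness, here is roughly what that check amounts to, sketched in Python. The URL and function name are hypothetical; the point is that the whole "monitor" is one anonymous GET and one status comparison:

```python
import requests

def naive_synthetic() -> bool:
    # One anonymous GET against one cached URL, once a minute, from three regions.
    resp = requests.get("https://example.com/", timeout=10)
    # The only assertion: any 200 counts as healthy. An empty shell, a cached
    # page, or an error page served with a 200 all pass this check.
    return resp.status_code == 200
```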

What the synthetic should be testing

The synthetic checks that earn their keep look very different. They're shaped by what the user journey actually is, not by what's cheapest to set up. The pattern that works (a sketch of one such journey follows the list):

  1. Authenticated journeys. Log in as a test user. Hit the dashboard. Hit the search. Submit something. Log out. The check exercises the same code paths real users do, including the auth tier and the database.
  2. Body assertions, not status assertions. The response should contain expected content. An empty 200 is a failure. A 200 with the wrong body is a failure. Status code alone is the floor, not the test.
  3. Latency budgets per journey. The journey passes if it completes within an SLO-relevant time. A successful but slow journey is a failure, because users experience it as a failure.
  4. Coverage of the top revenue paths, not the easiest paths. Checkout. Search. The API your largest customer integrates with. The exports that finish the quarterly cycle. Every one of those needs a synthetic. The marketing homepage does not.
  5. Realistic geographic distribution. Run from the regions your users actually live in, not just from the same cloud region your origin sits in. A check from the same datacentre as the origin tests almost nothing about user experience.
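Here is a minimal sketch of points 1 through 3, in Python with `requests`. The endpoints, expected body strings, environment variable, and the two-second budget are illustrative assumptions, not a real API; a production journey would also log out and report per-step timings:

```python
import os
import time

import requests

BASE = "https://app.example.com"  # hypothetical app origin
JOURNEY_BUDGET_SECONDS = 2.0      # illustrative SLO budget for the whole journey

def journey_check() -> bool:
    start = time.monotonic()
    with requests.Session() as session:
        # 1. Authenticated journey: log in as a dedicated synthetic test user.
        r = session.post(
            f"{BASE}/login",
            data={"user": "synthetic@example.com",
                  "password": os.environ["SYNTHETIC_PASSWORD"]},
            timeout=10,
        )
        if r.status_code != 200:
            return False

        # 2. Body assertion, not status assertion: the dashboard must contain
        #    content only a logged-in user would see. An empty 200 fails here.
        r = session.get(f"{BASE}/dashboard", timeout=10)
        if r.status_code != 200 or "Recent activity" not in r.text:
            return False

        # Exercise search through the same code paths real users hit.
        r = session.get(f"{BASE}/search", params={"q": "invoice"}, timeout=10)
        if r.status_code != 200 or "results" not in r.text:
            return False

    # 3. Latency budget: a successful but slow journey still fails the check.
    return (time.monotonic() - start) <= JOURNEY_BUDGET_SECONDS
```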

The cost of the lying check

The most insidious property of the synthetic-that-passes-anyway is that it's worse than no check at all. No check would mean the team knows it doesn't have coverage and would build alternative detection — RUM, customer support hooks, error budgets. A check that always passes creates the impression of coverage. The team stops investing in the alternatives. When the real outage arrives, they discover the check was theatre, but they've spent a year behaving as though it was monitoring.

I've seen this play out in two specific shapes:

  • A team's status page is driven by their synthetic. The synthetic stays green. Customers report a degradation. The status page doesn't budge for forty minutes. The customers escalate to social media. The team is now managing two crises: the technical one and the public one.
  • A leadership review uses synthetic uptime as a KPI. The KPI improves quarter over quarter. The actual user-experience metrics degrade in the same window. Nobody notices because the KPI is the one being read.

The audit, shaped as a question

When I audit synthetics, I ask one question: "If every active synthetic is passing right now, can the system still be hurting a paying user?"

For most teams, the answer is yes, and the list of ways is long. The audit deliverable is that list, mapped to the synthetics that should exist but don't. Most teams come out of the audit with two or three new authenticated journeys to wire up, and a handful of existing checks to upgrade with body assertions and latency budgets. The cost is low. The impact on detection latency is significant.
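The heart of that mapping fits in a few lines. The journey and coverage names below are invented for illustration; the shape of the question is what matters:

```python
# Hypothetical coverage data: revenue-critical journeys vs. what the
# active synthetics actually touch.
JOURNEYS = {"checkout", "search", "partner-api", "quarterly-export", "login"}
COVERED = {"homepage", "login"}

uncovered = sorted(JOURNEYS - COVERED)
print(f"{len(uncovered)}/{len(JOURNEYS)} journeys have no synthetic: {uncovered}")
# -> 4/5 journeys have no synthetic:
#    ['checkout', 'partner-api', 'quarterly-export', 'search']
```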

The line worth holding

A synthetic that always passes is not a monitor. It's a vendor's marketing artefact. The synthetics worth running are the ones that fail when users would fail, and pass when users are succeeding. Build for that, and the green tick becomes evidence again instead of decoration.

Engagement.start()

A synthetic that hits one cached URL is not monitoring; it's a marketing tool for the vendor's status page.

The Tracefox synthetic-coverage audit maps every active synthetic check against the user journeys they claim to cover. Most teams discover within an hour that 70% of real user paths have no synthetic touching them at all, and that the existing checks cluster on the easiest routes. The deliverable is a coverage matrix and a remediation plan that costs less than the vendor seat.