Opinion

CloudWatch was never going to be enough.

The cloud-native tools were built to monitor the cloud's resources, not your application. The realisation usually lands the third or fourth time an incident outlasts what the dashboard can explain; the typical response is to buy more dashboards rather than admit the architecture was wrong.

Tracefox · 6 min read

The observability story usually starts with whatever the cloud provider includes. AWS gives you CloudWatch. Azure gives you Azure Monitor and Log Analytics. GCP gives you Cloud Logging and Cloud Monitoring. They are turned on by default. They produce graphs immediately. They feel like coverage.

They aren't. Not because the engineering is bad (it isn't) but because the tools were designed to answer a different question than the one incidents actually pose.

What the cloud-native tools were built for

CloudWatch exists primarily because AWS needs you to be able to size your instances, configure your auto-scaling groups, and decide whether to keep paying for a resource. It is, structurally, the billing-and-capacity instrumentation surface for the cloud, repurposed as a monitoring product. The same is broadly true of the other two.

That heritage shows up in three places that matter the moment you have an incident:

  • The unit of observation is the resource, not the request.
  • The default cardinality is low: by instance, by region, by service.
  • The query language is built around metrics aggregation, not exploration.

None of which is a flaw if your job is to size an EC2 fleet. All of which is a flaw if your job is to find out why a single tenant's checkout has been failing intermittently since 14:42 UTC.

The five places it breaks

1. The unit is the resource

CloudWatch shows you the EC2 instance, the RDS database, the Lambda function, the load balancer. Your customer doesn't experience any of those. They experience a request, which traverses six of those resources in turn, hits two third-party APIs, returns through a CDN, and is rendered by a frontend running outside your cloud account entirely.

The cloud-native dashboard cannot show you that request. It can show you seven separate dashboards, each containing a fragment of the journey. The correlation work (the part that actually answers "why is this slow") is yours, manually, every time.

2. Cardinality is low by design

Cloud-native metrics are pre-aggregated. CloudWatch bills every unique combination of dimension values as a separate custom metric, and the pricing structure pushes you, hard, toward a handful of low-cardinality labels. Which means the questions you can ask are limited to the dimensions you committed to in advance.

"Is this affecting users on the new mobile app version?" requires a dimension you didn't add. Adding it now means deploying a metrics change, waiting for it to take effect, and hoping the incident is still happening when the data starts arriving. By the time you can ask the question, the incident is over and you've answered it through customer support escalations.

3. Logs are unstructured by default

CloudWatch Logs and Azure Log Analytics will both happily accept whatever string your application emits. There is no enforced schema, no required correlation ID, no propagation contract. Most application logs in these systems are still single-line text or JSON-ish strings with inconsistent field names across services.

The result is that during an incident, the on-call engineer is doing full-text search across gigabytes of free text, because the structure that would have turned the search into a filter was never emitted in the first place. The phrase "let me grep through the logs" in 2026 is a tell that the logging pipeline never finished being designed.
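
For illustration (field names are hypothetical, not a prescribed schema), the gap between what usually lands in these systems and what an on-call engineer could actually filter on looks roughly like this:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")

# What typically lands in CloudWatch Logs: a free-text line with ad-hoc
# wording that only full-text search can find.
logger.error("payment failed for user 8841, gateway timed out again")

# The same event as a structured record with consistent field names,
# so it can be filtered by tenant_id or error_kind instead of grepped.
logger.error(json.dumps({
    "ts": time.time(),
    "event": "payment_failed",
    "tenant_id": "8841",
    "error_kind": "gateway_timeout",
    "correlation_id": "req-7f3a",  # hypothetical request ID propagated upstream
}))
```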

4. Distributed tracing is a separate, half-finished product

AWS X-Ray exists. Azure Application Insights has tracing. GCP has Cloud Trace. They are all bolt-ons to the metrics-and-logs core, sold separately, instrumented separately, and not propagated automatically across the boundaries that matter most: message brokers, third-party SDKs, the front-end SPA.

Cloud-native tracing typically lands at a 30–60% end-to-end propagation rate, with the gaps clustered on the hops where incidents tend to originate. We've written separately about why partial tracing is worse than no tracing. It gives you the confidence of a complete picture without the picture being complete.
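
The missing piece on those hops is usually just context propagation. A hedged sketch with the OpenTelemetry Python API, assuming a hypothetical broker client, shows what it takes to carry a trace across a queue boundary when nothing does it for you:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout")

def publish(queue, body: bytes) -> None:
    # Producer side: copy the current trace context into the message headers
    # so the consumer's span joins the same trace instead of starting a new one.
    headers = {}
    inject(headers)  # writes the W3C traceparent/tracestate into the carrier
    queue.send(body=body, headers=headers)  # hypothetical broker client

def handle(message) -> None:
    # Consumer side: rebuild the context from the headers before starting a span.
    ctx = extract(message.headers)
    with tracer.start_as_current_span("process-order", context=ctx):
        ...  # actual work
```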

5. The pricing model punishes the questions you most need to ask

Long retention, high cardinality, full sampling: each of these is a multiplier on the cloud-native bill. So teams reduce sampling, drop dimensions, shorten retention. The data is always thinnest exactly where you need it most: at the moment of the incident, on the unusual axis, three weeks after the deployment that introduced the regression.

The pricing model isn't dishonest. It's just optimised for steady-state monitoring, not for the long-tail debugging that observability requires. Those are different products, and the bill structure gives the game away.
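
The way the data goes thin is usually a single line of sampler configuration. A sketch with the OpenTelemetry Python SDK (the vendor agents expose equivalent settings under different names) makes the trade-off concrete:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling at 5% keeps the bill flat, but the decision is made before
# anyone knows whether the request will fail: the one broken checkout is
# dropped with the same 95% probability as every healthy request.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.05)))
```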

The lock-in problem nobody warned you about

The other gap is one teams discover only when they try to leave. Each cloud's monitoring stack is its own data model, its own query language, its own dashboard format. CloudWatch Metrics Insights is not Kusto. Kusto is not LogQL. Moving telemetry off the native stack means rewriting every alert, every dashboard, every saved query.

Multi-cloud teams already know this. They have three of everything and no single source of truth. Single-cloud teams discover it the moment a business decision pushes them to add a second cloud, or when a vendor consolidation conversation suggests collapsing onto one observability backend across regions.

The cloud-native tools quietly assume you'll never want to leave them. OpenTelemetry exists because that assumption is no longer safe.

The cloud-native tools are good at what they were designed for: telling you whether to scale up or scale out. They were never designed to debug distributed systems. The vendor pages don't say so out loud.

What you can keep, what you have to add

None of this means turn off CloudWatch. The cloud-native stack is genuinely good at infrastructure-level signals: host health, scaling decisions, capacity planning. Keep it for that. The mistake is treating it as the application observability layer too.

What needs to be added on top:

  1. OpenTelemetry instrumentation in the application, emitting traces and metrics in a standard format. The guide on collector vs agents covers the deployment choice.
  2. An OTel Collector as the central pipeline, routing data to whatever backend you choose, including, fine, CloudWatch for the metrics that belong there.
  3. A backend optimised for high-cardinality query. Honeycomb, Tempo, ClickHouse-backed setups, Grafana with the right datasources. Whichever one you pick, the criteria are sub-second query latency and arbitrary dimension filters at full sample rate.
  4. Structured logs with trace correlation. Every log line carries the trace_id from the active span. This is what turns logs back into a useful diagnostic pillar; a sketch follows this list.
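
A minimal sketch of point 4, using the OpenTelemetry Python API and the standard library logger. The filter and field names are illustrative; OTel's logging instrumentation can inject the same fields automatically.

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    # Copies the active span's IDs onto every record so the formatter below
    # can emit them alongside the message.
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "trace_id": "%(trace_id)s", '
    '"span_id": "%(span_id)s", "message": "%(message)s"}'
))
logging.getLogger().addHandler(handler)
```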

The end state is: cloud-native for infrastructure, OpenTelemetry for the application layer, structured logs with trace correlation across both. The cloud-native tools become one of several backends, not the only one.
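
From inside the application, that end state is small. A sketch with the OpenTelemetry Python SDK (service name and endpoint are placeholders): the app emits OTLP to a local collector, and everything downstream of that, including any CloudWatch export, lives in collector configuration rather than application code.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The application knows only one destination: the local collector.
# Swapping CloudWatch for Honeycomb, Tempo, or anything else is a change
# to the collector's exporters, not to this code.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```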

The honest framing

CloudWatch was never the wrong tool. It was the wrong choice of only tool. The teams that get observability working in 2026 are the ones who stopped expecting the cloud's billing instrumentation to also be their debugging instrumentation. Those are different jobs. They need different products.

If your team is still trying to debug distributed systems through CloudWatch Insights, the next incident is going to make the case more eloquently than this post can. The work is unsurprising; the budget conversation is the harder one.


The teams that move first off cloud-native-only do so after one incident they couldn't debug from the console.

The Tracefox assessment scores telemetry on five dimensions: pillar coverage, cardinality, propagation, query flexibility, and SLO alignment. CloudWatch and its peers score well on coverage and badly on the other four. The gap is the gap between monitoring and observability, and it's where every long incident lives.