Observability & SRE · Singapore

Observability is a definition problem, not a tools problem.

A Singapore observability and SRE practice. We rebuild telemetry, SLOs, and incident response on systems that have outgrown the dashboards they started with.

01 Diagnosis

The maturity profile & roadmap

A seven-axis assessment that scores the structural dimensions of your observability stack on evidence. The panel below is the actual artefact from a recent anonymised engagement; read it the way a client reads it.

Dimension scores · 0-4 scale
  • 01 Telemetry coverage · 3/4
  • 02 Alerting quality · 1/4
  • 03 Incident response · 1/4
  • 04 SLO maturity · 1/4
  • 05 Correlation & RCA · 2/4
  • 06 Tooling & IaC · 2/4
  • 07 Platform & culture · 2/4
Tier verdict · worst-of axes
T0 · Ad-hoc

Bottlenecked by alerting quality, incident response, and SLO maturity · all scoring 1/4.

Maturity ladder

What the score actually means.

Each dimension above scores 0–4. Your tier is the worst-of axes — observability is bottlenecked by the weakest signal, not averaged across the group. The four states below are where systems land in practice.
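
To make the worst-of rule concrete, here is a minimal Python sketch using the scores from the sample panel. The function name `bottlenecks` is ours, for illustration only. The mapping from a worst score to a tier label is not spelled out on this page, so the sketch stops at the worst score and the dimensions that share it.

```python
from statistics import mean

# Scores from the sample panel (0-4 scale).
scores = {
    "Telemetry coverage": 3,
    "Alerting quality": 1,
    "Incident response": 1,
    "SLO maturity": 1,
    "Correlation & RCA": 2,
    "Tooling & IaC": 2,
    "Platform & culture": 2,
}


def bottlenecks(scores):
    """Return the worst score and every dimension that shares it."""
    worst = min(scores.values())
    weakest = [name for name, score in scores.items() if score == worst]
    return worst, weakest


worst, weakest = bottlenecks(scores)
average = mean(scores.values())

print(f"worst-of: {worst}/4, bottlenecked by {', '.join(weakest)}")
print(f"average:  {average:.2f}/4 (hides the three weakest axes)")
```

An average of roughly 1.7 would suggest a middling stack; the worst-of reading points directly at the three dimensions that hold the rest back.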

Tier · T0

Ad-hoc

Telemetry exists because someone enabled an agent. Nobody can describe the data hierarchy. Pages are mostly noise; MTTR is unpredictable.

  • Top-10 dashboards 18+ months old
  • Bill grew 40%+ year-on-year
  • Most alerts silenced or auto-resolved
Tier · T1

Reactive

Pages get answered. Dashboards exist. But the design is inherited rather than chosen, and cardinality grows in step with revenue.

  • SLOs exist on infra metrics
  • Cardinality controlled by exception
  • Trace coverage spotty across boundaries
Tier · T2

Operational

Data hierarchy is intentional. SLOs reflect customer journeys. Cardinality has a budget. Cost-per-signal is a tracked metric.

  • Journey-keyed SLOs in place
  • Cardinality budget at the collector
  • OTel-portable trace surface
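
As a rough illustration of what a cardinality budget checks, the Python sketch below counts distinct label sets per metric and reports any metric that exceeds its allowance. Everything here is hypothetical (the function names, the metric names, the budget numbers); a real budget would be enforced in collector configuration rather than in application code.

```python
from collections import defaultdict


def series_per_metric(samples):
    """Count distinct label sets (time series) seen for each metric name."""
    seen = defaultdict(set)
    for metric, labels in samples:
        # A series is identified by its metric name plus its full label set.
        seen[metric].add(frozenset(labels.items()))
    return {metric: len(label_sets) for metric, label_sets in seen.items()}


def over_budget(samples, budgets, default_budget=100):
    """Return {metric: series_count} for metrics above their budget."""
    counts = series_per_metric(samples)
    return {m: n for m, n in counts.items() if n > budgets.get(m, default_budget)}


# Hypothetical input: a user_id label creates one series per user.
samples = [("http_requests", {"route": "/pay", "user_id": str(i)}) for i in range(50)]
samples.append(("queue_depth", {"queue": "emails"}))

budgets = {"http_requests": 10, "queue_depth": 5}
violations = over_budget(samples, budgets)
print(violations)
```

In this made-up input the unbounded `user_id` label is what pushes `http_requests` past its budget, which is the usual shape of a cardinality problem.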
Tier · T3

Proactive

Telemetry is treated as a first-class surface. Engineers check it before shipping, and the dashboard count trends down rather than up.

  • Pre-deploy SLO impact reviews
  • Cardinality regressions caught in CI
  • Multi-vendor by choice, not lock-in

Most teams we audit start at T1. Foundational moves you to T2 in 12 weeks.

From the founder

I'm Ken. I started Tracefox after enough years inside platform teams watching telemetry budgets compound while incidents stayed slow. Almost every observability problem I see is upstream of the tools — missing definitions, no shared vocabulary, dashboards nobody owns. That's what we fix first. Everything else is configuration.

Ken Tan · Founder, Tracefox
House rule

Every service we manage emits all four Golden Signals. Every user-facing journey has at least one SLO. Every SLO has a burn-rate alert. Anything beyond that is improvement, not baseline.
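
For readers who have not met burn-rate alerting, here is a minimal Python sketch of the arithmetic. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The two-window check and the 14.4 threshold follow a widely used convention (a rate that would spend 2% of a 30-day budget in one hour) and are assumptions for illustration, not a description of any specific client setup.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(long_window_errors, short_window_errors, slo_target, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    The long window shows the burn is sustained; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_errors, slo_target) >= threshold
        and burn_rate(short_window_errors, slo_target) >= threshold
    )


# Hypothetical 99.9% SLO: a 2% error ratio burns budget about 20x too fast.
print(burn_rate(0.02, 0.999))
print(should_page(0.02, 0.02, 0.999))   # sustained and ongoing
print(should_page(0.02, 0.001, 0.999))  # already recovering
```

Requiring both windows is a design choice that trades a little detection delay for far fewer pages on spikes that have already ended.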

Diagnostic · 2 wk · USD 18,000 fixed

Maturity profile, telemetry redesign, and a 90-day roadmap. Same price in Singapore, San Francisco, or Sydney.