Case study · 042 · Fintech

From 98.4% to five-nines on a Tier-1 clearing engine.

A structural overhaul of a regulated core-banking workload. Names redacted; numbers are not.

Industry: Fintech · regulated
Region: APAC · 3 dc
Engagement: Foundational · 14 wk
Stack: JVM · k8s · OTel
Problem statement

A single latency spike in the clearing sub-module would saturate the global connection pool, cascading into a system-wide freeze. Twice per quarter. Always at month-end.

  • MTTR: 4.2 hr · regulator-reportable
  • Error rate at peak: 1.2%
  • Hidden overhead: 400 ms in mesh discovery
  • Symptom: cascading degradation
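Why a single latency spike is enough to exhaust a shared pool can be shown with Little's law (in-flight requests = arrival rate × time in system). The sketch below is in Python for brevity rather than the JVM stack. The two latencies are the p99 values from the before and after traces, used here as a stand-in for mean latency; the arrival rate and pool size are hypothetical, not figures from the engagement.

```python
def connections_in_flight(arrival_rps: float, latency_ms: float) -> float:
    """Little's law: average in-flight requests = arrival rate x time in system."""
    return arrival_rps * (latency_ms / 1000.0)


def pool_saturated(arrival_rps: float, latency_ms: float, pool_size: int) -> bool:
    """True when steady-state demand meets or exceeds the pool's capacity."""
    return connections_in_flight(arrival_rps, latency_ms) >= pool_size


# Hypothetical load and pool size (assumptions for illustration only).
ARRIVAL_RPS = 500.0
POOL_SIZE = 200

# Latencies taken from the case-study traces: 88 ms (after) and 4,210 ms (before).
for label, latency_ms in [("healthy", 88.0), ("spike", 4210.0)]:
    needed = connections_in_flight(ARRIVAL_RPS, latency_ms)
    saturated = pool_saturated(ARRIVAL_RPS, latency_ms, POOL_SIZE)
    print(f"{label}: ~{needed:.0f} connections needed, saturated={saturated}")
```

Under these assumptions the same traffic needs roughly 48 times more connections during the spike, which is why every caller sharing the pool stalls at once.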
Diagnostic verdict
"The architecture wasn't failing. It was suffocating. We found 400ms of hidden overhead in service-mesh discovery alone — and a connection pool sized for 2019 traffic."
— Lead engineer · Tracefox
01 Intervention

What we changed.

Phase 01 · Topology refactor

Bulkhead the blast radius.

Implemented the Bulkhead pattern at the VPC level: the transaction engine is now separated from peripheral reporting, so a single failure no longer propagates.
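The engagement applied the bulkhead at the network (VPC) level. The same idea can be sketched in-process: give each dependency its own bounded compartment so one slow caller cannot consume capacity meant for another. This is a minimal illustration, not the engagement's implementation; the `Bulkhead` class, compartment names, and sizes are all hypothetical.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(RuntimeError):
    """Raised when a compartment has no free slots; callers fail fast."""


class Bulkhead:
    """Caps concurrent work for one dependency so it cannot starve the others."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject immediately instead of queueing,
        # so a slow dependency cannot pile up waiting callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical compartment names and sizes.
clearing = Bulkhead("clearing", max_concurrent=2)
reporting = Bulkhead("reporting", max_concurrent=1)

reporting_rejected = False
clearing_served = False

with reporting.slot():  # a slow reporting call holds the only reporting slot
    try:
        with reporting.slot():
            pass
    except BulkheadFullError:
        reporting_rejected = True  # reporting sheds load...
    with clearing.slot():
        clearing_served = True  # ...while clearing still has capacity

print(reporting_rejected, clearing_served)
```

Rejecting rather than queueing is a design choice: it trades a fast, visible error in the peripheral path for guaranteed headroom in the critical one.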

  • Logical sharding
    Monolith state moved to a 4-cluster distributed topology.
  • Adaptive throttling
    Rate limiting keyed to p99 latency feedback loops.
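One way a rate limit can be keyed to p99 latency is an additive-increase, multiplicative-decrease loop: cut the admitted rate when the windowed p99 exceeds a target, and raise it slowly otherwise. The case study does not say which control law was used, so treat this as an assumed sketch; the class name, target, window, and step sizes are hypothetical.

```python
import math
from collections import deque


class AdaptiveThrottle:
    """AIMD rate limit driven by the p99 of a sliding window of latencies."""

    def __init__(self, target_p99_ms: float, max_rps: float, min_rps: float = 1.0,
                 window: int = 1000, backoff: float = 0.8, step_rps: float = 5.0) -> None:
        self.target_p99_ms = target_p99_ms
        self.max_rps = max_rps
        self.min_rps = min_rps
        self.backoff = backoff      # multiplicative decrease factor
        self.step_rps = step_rps    # additive increase per adjustment
        self.limit_rps = max_rps
        self._samples = deque(maxlen=window)

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def p99_ms(self) -> float:
        """Nearest-rank 99th percentile of the current window (0.0 if empty)."""
        if not self._samples:
            return 0.0
        ordered = sorted(self._samples)
        rank = math.ceil(0.99 * len(ordered))
        return ordered[rank - 1]

    def adjust(self) -> float:
        """Run one feedback step and return the new admitted rate."""
        if self.p99_ms() > self.target_p99_ms:
            self.limit_rps = max(self.min_rps, self.limit_rps * self.backoff)
        else:
            self.limit_rps = min(self.max_rps, self.limit_rps + self.step_rps)
        return self.limit_rps


# Hypothetical target and ceiling; latencies echo the case-study traces.
throttle = AdaptiveThrottle(target_p99_ms=250.0, max_rps=1000.0)
for _ in range(100):
    throttle.record(88.0)
healthy_limit = throttle.adjust()   # stays at the ceiling
for _ in range(100):
    throttle.record(4210.0)
spike_limit = throttle.adjust()     # backs off by 20%
print(healthy_limit, spike_limit)
```

The difference between the old and new limit is the load to shed in that interval; a production controller would also need hysteresis and a minimum sample count before acting.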
Phase 02 · Telemetry redesign

Trace the boundary, not the box.

Re-instrumented around contract boundaries instead of process boundaries. Traces now span the regulatory pathway end-to-end — from order ingest to clearing acknowledgement.

  • Boundary-keyed traces
    OTel context propagated across 31 service boundaries.
  • Hot-path eBPF
    Probes on the 7 endpoints carrying 92% of revenue.
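Context propagation across service boundaries can be illustrated with the W3C Trace Context `traceparent` header, which OTel SDKs commonly use by default. The sketch below hand-rolls the header purely to show the mechanics; it assumes that propagator is in use, and a real service would rely on the OTel SDK's propagators instead. The function names are hypothetical.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the first boundary (for example, order ingest)."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Header to send downstream: same trace id, fresh span id, same flags."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Malformed or missing context: start a new trace rather than fail.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Walk one request across 31 boundaries, as in the case study.
header = new_traceparent()
origin_trace_id = header.split("-")[1]
for _ in range(31):
    header = child_traceparent(header)

print(header.split("-")[1] == origin_trace_id)
```

Because every hop keeps the trace id and only replaces the span id, the whole pathway from ingest to acknowledgement lands in one trace.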
02 Same incident · same workload

Before & after, told as a trace.

Before
incident.0xC4.before
tail -f
[16:04:11] WARN ▸ p99 = 4,210ms (clearing-svc)
[16:04:12] ERR ▸ pool exhausted upstream=core-banking
[16:04:14] ALERT ▸ cascading saturation across 11 services
[16:04:15] OPS ▸ paging primary on-call (3rd this quarter)
[16:04:18] — ▸ system frozen · 4.2 hr MTTR forecast
After · same pre-conditions
incident.0xC4.after
tail -f
[16:04:11] info ▸ p99 = 88ms (clearing-svc)
[16:04:11] info ▸ deadline budget healthy (78%)
[16:04:11] info ▸ bulkhead boundary holding · isolated to ledger-2
[16:04:12] auto ▸ adaptive throttle engaged · 62 RPS shed
[16:04:14] info ▸ degradation contained · no paging needed
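The "deadline budget" line in the trace suggests each request carries a time allowance that shrinks as it moves through the pathway. A minimal sketch of that idea follows. The `Deadline` class is hypothetical, and the 400 ms total budget is an assumption chosen only so the arithmetic lines up with the 88 ms and 78% shown in the log; it is not the engagement's configuration.

```python
import time


def _monotonic_ms() -> float:
    """Default clock: monotonic time in milliseconds."""
    return time.monotonic() * 1000.0


class Deadline:
    """Tracks how much of a request's time budget is left."""

    def __init__(self, budget_ms: float, clock_ms=_monotonic_ms) -> None:
        self._clock_ms = clock_ms
        self.budget_ms = budget_ms
        self._expires_at_ms = clock_ms() + budget_ms

    def remaining_ms(self) -> float:
        return max(0.0, self._expires_at_ms - self._clock_ms())

    def remaining_fraction(self) -> float:
        return self.remaining_ms() / self.budget_ms

    def expired(self) -> bool:
        return self.remaining_ms() <= 0.0


class FakeClock:
    """Deterministic clock for the example; advance() moves time forward."""

    def __init__(self) -> None:
        self.now_ms = 0

    def __call__(self) -> int:
        return self.now_ms

    def advance(self, ms: int) -> None:
        self.now_ms += ms


clock = FakeClock()
deadline = Deadline(budget_ms=400, clock_ms=clock)  # assumed 400 ms budget
clock.advance(88)                                   # clearing-svc spends 88 ms
print(f"deadline budget remaining: {deadline.remaining_fraction():.0%}")
```

Passing the remaining budget downstream lets each service refuse work it cannot finish in time instead of holding a connection until a fixed timeout fires.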
Uptime: 99.999% ↑
MTTR: 14 min ↓
Cost reduction: 58% ↓
Unplanned outages: 0 in 9 months ↓
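To put the headline availability figures in concrete terms, the percentages can be converted to allowed downtime per year. This is plain arithmetic on the 98.4% and 99.999% figures quoted in the case study, assuming a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Yearly downtime implied by an availability percentage."""
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_YEAR


before = downtime_minutes_per_year(98.4)    # about 8,409.6 min, roughly 140 hours
after = downtime_minutes_per_year(99.999)   # about 5.26 min

print(f"98.4%   -> {before / 60:.1f} hours of downtime per year")
print(f"99.999% -> {after:.2f} minutes of downtime per year")
```

At a 14-minute MTTR, a single outage would use more than a full year of the five-nines allowance, so the uptime figure depends on unplanned outages staying at zero.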
Engagement.start()

A clearing engine like this?

We've done six. Tell us about yours — first call costs nothing, second one is the Diagnostic.