From 98.4% to five-nines on a Tier-1 clearing engine.
A structural overhaul of a regulated core-banking workload. Names redacted; numbers are not.
- Industry: Fintech · regulated
- Region: APAC · 3 DCs
- Engagement: Foundational · 14 wk
- Stack: JVM · k8s · OTel
A single latency spike in the clearing sub-module would saturate the global connection pool, cascading into a system-wide freeze. Twice per quarter. Always at month-end.
- MTTR: 4.2 hr · regulator-reportable
- Error rate at peak: 1.2%
- Hidden overhead: 400 ms in mesh discovery
- Symptom: cascading degradation
"The architecture wasn't failing. It was suffocating. We found 400ms of hidden overhead in service-mesh discovery alone — and a connection pool sized for 2019 traffic."
What we changed.
Bulkhead the blast radius.
Implemented the Bulkhead pattern at the VPC level, separating the transaction engine from peripheral reporting. Single failures stopped propagating.
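The isolation here was done at the VPC level, but the same principle applies in-process: give each dependency its own bounded permit pool so a slow downstream exhausts only its own slots, never a shared one. A minimal stdlib sketch, with hypothetical class names and limits:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Minimal bulkhead: each dependency gets its own permit pool, so a slow
// dependency saturates only its own slots, not a shared connection pool.
final class Bulkhead {
    private final Semaphore permits;
    private final long maxWaitMs;

    Bulkhead(int maxConcurrent, long maxWaitMs) {
        this.permits = new Semaphore(maxConcurrent);
        this.maxWaitMs = maxWaitMs;
    }

    <T> T call(Supplier<T> task, Supplier<T> fallback) throws InterruptedException {
        // Shed load fast instead of queueing forever behind a stalled dependency.
        if (!permits.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS)) {
            return fallback.get();
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }
}

public class BulkheadDemo {
    public static void main(String[] args) throws InterruptedException {
        // Separate pools: reporting cannot starve the clearing path.
        Bulkhead clearing = new Bulkhead(150, 50);
        Bulkhead reporting = new Bulkhead(20, 5);

        System.out.println(clearing.call(() -> "cleared", () -> "rejected"));
    }
}
```

The key design choice is the bounded wait: a caller that cannot get a permit within its budget takes the fallback rather than joining an unbounded queue, which is what turned one slow sub-module into a system-wide freeze.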
- Logical sharding: monolith state to a 4-cluster distributed topology.
- Adaptive throttling: rate limiting keyed to p99 latency feedback loops.
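A throttle keyed to p99 feedback is a control loop: track recent latencies, compare the observed p99 against a budget, and tighten or relax the admit ratio accordingly. A sketch under assumed parameters (window size, budget, and back-off factors are all hypothetical, not the production tuning):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Adaptive throttle: admit requests while observed p99 stays under budget;
// shed an increasing fraction when it spikes, relax slowly when healthy.
final class AdaptiveThrottle {
    private final Deque<Long> windowMs = new ArrayDeque<>();
    private final int windowSize;
    private final long p99BudgetMs;
    private double admitRatio = 1.0;   // 1.0 = admit everything

    AdaptiveThrottle(int windowSize, long p99BudgetMs) {
        this.windowSize = windowSize;
        this.p99BudgetMs = p99BudgetMs;
    }

    void record(long latencyMs) {
        if (windowMs.size() == windowSize) windowMs.removeFirst();
        windowMs.addLast(latencyMs);
        // Feedback loop: multiplicative back-off over budget, additive recovery under it.
        admitRatio = (percentile(99) > p99BudgetMs)
                ? Math.max(0.1, admitRatio * 0.8)
                : Math.min(1.0, admitRatio + 0.02);
    }

    boolean admit() { return Math.random() < admitRatio; }

    double admitRatio() { return admitRatio; }

    long percentile(int p) {
        long[] sorted = windowMs.stream().mapToLong(Long::longValue).sorted().toArray();
        if (sorted.length == 0) return 0;
        int idx = Math.min(sorted.length - 1, (int) Math.ceil(p / 100.0 * sorted.length) - 1);
        return sorted[idx];
    }
}
```

The asymmetry (back off fast, recover slow) mirrors AIMD-style congestion control: a latency spike sheds load within a few requests, while recovery probes gently so the loop does not oscillate.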
Trace the boundary, not the box.
Re-instrumented around contract boundaries instead of process boundaries. Traces now span the regulatory pathway end-to-end — from order ingest to clearing acknowledgement.
- Boundary-keyed traces: OTel context propagated across 31 service boundaries.
- Hot-path eBPF: probes on the 7 endpoints carrying 92% of revenue.
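What "context propagated across a boundary" means on the wire: OTel's default propagation format is W3C Trace Context, a `traceparent` header carrying a trace id that stays constant across every hop and a span id minted fresh at each one. A stdlib-only sketch of that handoff (class names hypothetical; real services would use the OpenTelemetry SDK's propagators instead):

```java
import java.security.SecureRandom;

// Sketch of W3C Trace Context propagation:
// traceparent = <version>-<trace-id>-<span-id>-<flags>
final class TraceContext {
    final String traceId;   // 16 bytes hex: constant across the whole request path
    final String spanId;    // 8 bytes hex: new at every service boundary

    TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    static TraceContext newRoot() {
        return new TraceContext(randomHex(16), randomHex(8));
    }

    // Header value to send with the outgoing call.
    String toTraceparent() {
        return "00-" + traceId + "-" + spanId + "-01";
    }

    // Parse an incoming header: keep the trace id, mint a child span id.
    static TraceContext childOf(String traceparent) {
        String[] parts = traceparent.split("-");
        return new TraceContext(parts[1], randomHex(8));
    }

    private static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        new SecureRandom().nextBytes(buf);
        StringBuilder sb = new StringBuilder();
        for (byte b : buf) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

Because the trace id survives every hop, a backend can stitch the 31 per-service spans into one end-to-end trace of the regulatory pathway, from order ingest to clearing acknowledgement.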
[Asset pending] Before/after architecture diagram. Left: monolithic clearing engine with a single shared connection pool, cascading saturation. Right: the same workload split into four isolated bulkheads (Ingest, Match, Clear, Report), each with its own pool and an adaptive throttle between Match and Clear.
Before & after, told as a trace.
Before:

```
[16:04:11] WARN  ▸ p99 = 4,210ms (clearing-svc)
[16:04:12] ERR   ▸ pool exhausted upstream=core-banking
[16:04:14] ALERT ▸ cascading saturation across 11 services
[16:04:15] OPS   ▸ paging primary on-call (3rd this quarter)
[16:04:18] —     ▸ system frozen · 4.2 hr MTTR forecast
```

After:

```
[16:04:11] info ▸ p99 = 88ms (clearing-svc)
[16:04:11] info ▸ deadline budget healthy (78%)
[16:04:11] info ▸ bulkhead boundary holding · isolated to ledger-2
[16:04:12] auto ▸ adaptive throttle engaged · 62 RPS shed
[16:04:14] info ▸ degradation contained · no paging needed
```

A clearing engine like this?
We've done six. Tell us about yours — first call costs nothing, second one is the Diagnostic.