Case study · 042 · Fintech

From 98.4% to five-nines on a Tier-1 clearing engine.

A structural overhaul of a regulated core-banking workload. Names redacted; numbers are not.

Industry: Fintech · regulated
Region: APAC · 3 dc
Engagement: Foundational · 14 wk
Stack: JVM · k8s · OTel

Problem statement

A single latency spike in the clearing sub-module would saturate the global connection pool, cascading into a system-wide freeze. Twice per quarter. Always at month-end.

MTTR: 4.2 hr · regulator-reportable
Error rate at peak: 1.2%
Hidden overhead: 400 ms in mesh discovery
Symptom: Cascading degradation

Diagnostic verdict

"The architecture wasn't failing. It was suffocating. We found 400ms of hidden overhead in service-mesh discovery alone — and a connection pool sized for 2019 traffic."

— Lead engineer · Tracefox
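Why a latency spike saturates a fixed pool can be sketched with Little's law: connections in use equal request rate times the time each request holds a connection. The snippet below is an illustration, not the engagement's tooling. It assumes the 200-connection pool shown on the architecture diagram (POOL · 200), treats the p99 figures from the incident trace as if every request took that long (a worst-case simplification), and uses a hypothetical 500 rps month-end load.

```java
// Sketch: Little's law applied to a fixed-size connection pool.
// Assumptions: pool size 200 (from the diagram label), p99 used as a
// worst-case stand-in for mean latency, 500 rps is a hypothetical load.
final class PoolCapacitySketch {

    /** Max sustainable requests/second when each request holds one connection for latencyMs. */
    static double maxThroughputRps(int poolSize, double latencyMs) {
        if (poolSize <= 0 || latencyMs <= 0) {
            throw new IllegalArgumentException("poolSize and latencyMs must be positive");
        }
        return poolSize / (latencyMs / 1000.0);
    }

    /** Connections needed to sustain rps at the given latency, rounded up. */
    static int connectionsNeeded(double rps, double latencyMs) {
        return (int) Math.ceil(rps * latencyMs / 1000.0);
    }

    public static void main(String[] args) {
        int pool = 200;
        System.out.printf("ceiling at 88 ms:    %.0f rps%n", maxThroughputRps(pool, 88));
        System.out.printf("ceiling at 4210 ms:  %.1f rps%n", maxThroughputRps(pool, 4210));
        System.out.println("connections for 500 rps at 4210 ms: "
                + connectionsNeeded(500, 4210));
    }
}
```

Under these assumptions the same pool that comfortably serves thousands of requests per second at 88 ms drops to roughly 47 rps at 4,210 ms, so any realistic month-end load queues behind it.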

01 Intervention

What we changed.

Phase 01 · Topology refactor

Bulkhead the blast radius.

Implemented the Bulkhead pattern at VPC level — separated the transaction engine from peripheral reporting. Single failures stopped propagating.
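The engagement applied the bulkhead at VPC level. The same idea at process level, shown here only as an analogy, is a separate concurrency budget per workload. This is a minimal sketch using `java.util.concurrent.Semaphore`; the class names, the permit counts, and the `report` and `clear` labels are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of an in-process bulkhead: each workload gets its own permit budget,
// so a saturated workload can only exhaust its own permits.
final class Bulkhead {
    private final String name;
    private final Semaphore permits;
    private final long maxWaitMillis;

    Bulkhead(String name, int maxConcurrent, long maxWaitMillis) {
        this.name = name;
        this.permits = new Semaphore(maxConcurrent);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Runs the action if a permit frees up within maxWaitMillis, otherwise fails fast. */
    <T> T call(Supplier<T> action) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for bulkhead " + name, e);
        }
        if (!acquired) {
            throw new IllegalStateException("bulkhead full: " + name);
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always hand the permit back, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}

final class BulkheadDemo {
    public static void main(String[] args) {
        Bulkhead report = new Bulkhead("report", 2, 0);
        Bulkhead clear = new Bulkhead("clear", 2, 0);

        // Runs while both report permits are held by the enclosing calls.
        Supplier<String> whileSaturated = () -> {
            try {
                report.call(() -> "third report call");
            } catch (IllegalStateException e) {
                System.out.println("rejected: " + e.getMessage());
            }
            return clear.call(() -> "clearing still served");
        };
        Supplier<String> holdSecondPermit = () -> report.call(whileSaturated);
        System.out.println(report.call(holdSecondPermit));
    }
}
```

The demo nests calls on one thread to hold both reporting permits deterministically: the third reporting call is rejected while the clearing bulkhead keeps serving.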

Logical sharding

Monolithic state split into a 4-cluster distributed topology.

Adaptive throttling

Rate limiting keyed to p99 latency feedback loops.
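One way to key a rate limit to a p99 feedback loop is an AIMD controller: cut the limit multiplicatively when the observed p99 exceeds a target, raise it additively when it does not. The sketch below is an assumption about how such a loop could look, not the throttle that was deployed. The 250 ms target, the 0.7 back-off factor, and the 10 rps recovery step are all hypothetical.

```java
import java.util.Arrays;

// Sketch: AIMD (additive-increase, multiplicative-decrease) throttle driven
// by the p99 of each latency window. All tuning constants are hypothetical.
final class AdaptiveThrottle {
    private static final double BACKOFF_FACTOR = 0.7;  // hypothetical
    private static final double RECOVERY_STEP_RPS = 10.0; // hypothetical

    private final double targetP99Ms;
    private final double minRps;
    private final double maxRps;
    private double limitRps;

    AdaptiveThrottle(double targetP99Ms, double minRps, double maxRps) {
        this.targetP99Ms = targetP99Ms;
        this.minRps = minRps;
        this.maxRps = maxRps;
        this.limitRps = maxRps; // start wide open
    }

    /** Nearest-rank p99 of a window of latency samples in milliseconds. */
    static double p99(double[] samplesMs) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("empty window");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (99 * sorted.length + 99) / 100; // ceil(0.99 * n), 1-based
        return sorted[rank - 1];
    }

    /** Feeds one window back into the limit and returns the new limit in rps. */
    double onWindow(double[] samplesMs) {
        if (p99(samplesMs) > targetP99Ms) {
            limitRps = Math.max(minRps, limitRps * BACKOFF_FACTOR); // shed quickly
        } else {
            limitRps = Math.min(maxRps, limitRps + RECOVERY_STEP_RPS); // recover slowly
        }
        return limitRps;
    }

    double limitRps() {
        return limitRps;
    }

    public static void main(String[] args) {
        AdaptiveThrottle throttle = new AdaptiveThrottle(250.0, 50.0, 500.0);
        double[] healthy = new double[100];
        Arrays.fill(healthy, 88.0);
        double[] spiking = healthy.clone();
        spiking[0] = 4210.0;
        spiking[1] = 4210.0; // two slow samples are enough to move p99
        System.out.println("after spike:    " + throttle.onWindow(spiking));
        System.out.println("after recovery: " + throttle.onWindow(healthy));
    }
}
```

Backing off faster than recovering is the usual design choice here: it sheds load before a pool saturates and avoids oscillating straight back into the spike.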

Phase 02 · Telemetry redesign

Trace the boundary, not the box.

Re-instrumented around contract boundaries instead of process boundaries. Traces now span the regulatory pathway end-to-end — from order ingest to clearing acknowledgement.

Boundary-keyed traces

OTel context propagated across 31 service boundaries.

Hot-path eBPF

Probes on the 7 endpoints carrying 92% of revenue.
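Carrying trace context across a service boundary comes down to extracting a header on the way in and injecting it on the way out. The sketch below hand-rolls the W3C Trace Context `traceparent` header using only the JDK, to show what crosses each boundary. It is not the OpenTelemetry SDK and not the instrumentation used in the engagement; the `TraceContext` class, its methods, and the lowercase-keyed header map are illustrative assumptions.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: hand-rolled W3C Trace Context handling, not the OpenTelemetry SDK.
final class TraceContext {
    // version 00, 32-hex trace-id, 16-hex parent span-id, 2-hex flags
    private static final Pattern TRACEPARENT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId;
    final String spanId;
    final String flags;

    private TraceContext(String traceId, String spanId, String flags) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.flags = flags;
    }

    private static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (byte b : buf) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }

    /** Starts a new sampled trace. */
    static TraceContext newRoot() {
        return new TraceContext(randomHex(16), randomHex(8), "01");
    }

    /** Continues the caller's trace if a valid traceparent is present, else starts a new one. */
    static TraceContext extract(Map<String, String> headers) {
        String header = headers.get("traceparent"); // assumes lowercase header keys
        if (header != null) {
            Matcher m = TRACEPARENT.matcher(header.trim());
            if (m.matches() && !isAllZero(m.group(1)) && !isAllZero(m.group(2))) {
                return new TraceContext(m.group(1), m.group(2), m.group(3));
            }
        }
        return newRoot();
    }

    /** New span in the same trace: the trace-id is kept, the span-id is fresh. */
    TraceContext childSpan() {
        return new TraceContext(traceId, randomHex(8), flags);
    }

    /** Writes this context into the outgoing headers for the next service. */
    void inject(Map<String, String> headers) {
        headers.put("traceparent", "00-" + traceId + "-" + spanId + "-" + flags);
    }

    public static void main(String[] args) {
        Map<String, String> incoming = new HashMap<>();
        incoming.put("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        Map<String, String> outgoing = new HashMap<>();
        TraceContext.extract(incoming).childSpan().inject(outgoing);
        System.out.println(outgoing.get("traceparent"));
    }
}
```

Because the trace-id survives every hop while the span-id changes, a single trace can follow an order from ingest to clearing acknowledgement regardless of how many processes sit in between.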

Figure (asset pending): Before/after architecture diagram. Left: monolithic clearing engine with Ingest, Match, and Report all routing through a single shared connection pool (labeled POOL · 200), with saturation cascading back through every upstream. Right: the same workload split into four isolated bulkheads (Ingest, Match, Clear, Report), each with its own pool, and the adaptive throttle sitting between Match and Clear.

02 Same incident · same workload

Before & after, told as a trace.

Before

incident.0xC4.before

tail -f

[16:04:11] WARN  ▸ p99 = 4,210ms (clearing-svc)
[16:04:12] ERR   ▸ pool exhausted upstream=core-banking
[16:04:14] ALERT ▸ cascading saturation across 11 services
[16:04:15] OPS   ▸ paging primary on-call (3rd this quarter)
[16:04:18] —     ▸ system frozen · 4.2 hr MTTR forecast

After · same pre-conditions

incident.0xC4.after

tail -f

[16:04:11] info  ▸ p99 = 88ms (clearing-svc)
[16:04:11] info  ▸ deadline budget healthy (78%)
[16:04:11] info  ▸ bulkhead boundary holding · isolated to ledger-2
[16:04:12] auto  ▸ adaptive throttle engaged · 62 RPS shed
[16:04:14] info  ▸ degradation contained · no paging needed

Uptime: 99.999%
MTTR: 14 min
Cost reduction: 58%
Unplanned outages: 0 in 9 months
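For scale, the headline availability figures translate into a yearly downtime budget with simple arithmetic. The helper below is an illustrative sketch that assumes a 365-day year; the class and method names are made up for this example.

```java
// Sketch: converts an availability percentage into allowed downtime per year.
// Assumes a 365-day year (525,600 minutes).
final class DowntimeBudget {
    private static final double MINUTES_PER_YEAR = 365.0 * 24.0 * 60.0;

    static double minutesPerYear(double availabilityPercent) {
        if (availabilityPercent < 0.0 || availabilityPercent > 100.0) {
            throw new IllegalArgumentException("availability must be within 0..100");
        }
        double unavailableFraction = 1.0 - availabilityPercent / 100.0;
        return unavailableFraction * MINUTES_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.printf("98.4%%   -> %.1f hours/year%n", minutesPerYear(98.4) / 60.0);
        System.out.printf("99.999%% -> %.2f minutes/year%n", minutesPerYear(99.999));
    }
}
```

On that basis 98.4% allows about 140 hours of downtime a year, while five-nines allows a little over 5 minutes.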

Engagement.start()

A clearing engine like this?

We've done six. Tell us about yours — first call costs nothing, second one is the Diagnostic.