eBPF probing at scale: instrumenting a 500k TPS clearing engine.
Most teams reach for eBPF after they've tried everything else. By the time we showed up, this client had already attached BCC scripts in production, set them on fire, and forgotten to detach them. Probes were chained. Probes were unbounded. Latency on hot paths had gained 14ms and nobody could explain why.
What follows is sanitised — names redacted per contract — but the numbers are real and the architectural mistakes are extremely typical.
The problem statement
The workload was a Tier-1 clearing engine. Five-nines availability was a regulatory line item. Throughput peaked around 500,000 transactions per second across 31 services, with a p99 budget of 80ms on the regulatory pathway.
The team had three problems converging:
- Vendor agents couldn't see inside JNI calls — 40% of latency was invisible.
- BCC scripts attached for incident triage had been left running for months.
- Cardinality was growing because every probe attached a per-connection label.
Topology comes first
Before adding a single probe, we mapped the failure domains. Most teams skip this step because it feels boring; it's the step that determines whether your eBPF effort survives Q3.
For this workload the topology decomposed into four logical clusters: ingest, match, clear, and report. Critical-path latency lived almost entirely in match → clear, but the team's existing dashboards averaged across all four — the noise from report was masking what should've been an alert.
Trace the boundary, not the box. A probe attached to a process tells you about that process; a probe attached to a contract tells you about your business.
The probe surface we shipped
We deployed exactly seven kprobes and three uprobes across the regulatory pathway. Not seven hundred. Seven. The temptation when working with eBPF is to instrument everything because the cost is low — but cardinality is the cost, and cognitive load on the on-caller is a much higher cost than CPU.
Hot-path kprobe set
SEC("kprobe/tcp_sendmsg")
int kprobe__tcp_sendmsg(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    if (!is_traced(pid)) return 0;
    // record sk + size, key by 5-tuple hash, not pid
    struct ev_key key = derive_key(ctx);
    struct ev *e = bpf_map_lookup_elem(&active, &key);
    if (e) e->sent_bytes += PT_REGS_PARM3(ctx);
    return 0;
}
[Figure: cross-section of the Linux userspace/kernel boundary. Four production processes (JVM, Go, Python, Rust) sit above the kernel line; seven kprobes attach at specific kernel call sites below it; a ring buffer pulls those events into a Go reader that batches every 5ms or 128 events before handing them to the OTel collector. Asset: /img/blog/ebpf-kernel-cross-section.png]
Two things to notice. First, we key by 5-tuple hash, not by PID — this is what kept cardinality bounded when one of the JVM workers churned PIDs. Second, the is_traced() guard runs against a small BPF map that the userspace controller updates: only services in the regulatory pathway emit events. Everything else gets short-circuited in the kernel.
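The controller half of that guard is plain userspace code. Here's a minimal sketch of the shape, assuming the filter map is pinned under bpffs and managed with the cilium/ebpf library; the pin path, map layout, and the way PIDs arrive are illustrative stand-ins, not the client's redacted specifics.
PID-filter controller (sketch)
package main

import (
	"log"
	"os"
	"strconv"

	"github.com/cilium/ebpf"
)

// Hypothetical pin path for the PID-filter map that the kernel-side
// is_traced() helper reads. Keys are PIDs (u32), values a single byte.
const tracedPidsPin = "/sys/fs/bpf/clearing/traced_pids"

// markTraced flags a PID so its events pass the is_traced() guard.
// Deleting the entry stops emission without detaching any probe.
func markTraced(m *ebpf.Map, pid uint32) error {
	var one uint8 = 1
	return m.Put(pid, one)
}

func main() {
	m, err := ebpf.LoadPinnedMap(tracedPidsPin, nil)
	if err != nil {
		log.Fatalf("open pinned map: %v", err)
	}
	defer m.Close()

	// PIDs of regulatory-pathway services, taken from the command line
	// purely for illustration.
	for _, arg := range os.Args[1:] {
		pid, err := strconv.ParseUint(arg, 10, 32)
		if err != nil {
			log.Fatalf("bad pid %q: %v", arg, err)
		}
		if err := markTraced(m, uint32(pid)); err != nil {
			log.Fatalf("mark pid %d: %v", pid, err)
		}
	}
}
Because the lookup happens inside the kprobe, a PID missing from the map costs one hash lookup and an early return; nothing crosses the kernel boundary.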
Userspace ring-buffer consumer
We used the BPF ring buffer (not the older perf buffer) and consumed in a Go-side reader that batched into the OTel SDK. Three things matter here for sub-millisecond observability latency:
- Drop on overflow — if the reader falls behind, events are dropped; the kernel side never blocks or waits on userspace.
- Batch flushes on either 128 events or 5ms, whichever comes first.
- Tail-sampling at the OTel collector, not in the agent — keeps trace decisions consistent across replicas.
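Putting those three points together, the reader's hot loop stays small. The sketch below is a minimal version assuming the cilium/ebpf ringbuf package; the event layout and the flush callback (which would hand batches to the OTel SDK) are illustrative, not the engagement's actual code.
Batching ring-buffer reader (sketch)
package reader

import (
	"bytes"
	"encoding/binary"
	"errors"
	"time"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/ringbuf"
)

// Event mirrors the kernel-side struct ev; the fields here are illustrative.
type Event struct {
	TupleHash uint64 // 5-tuple hash, the cardinality key
	SentBytes uint64
	LatencyNs uint64
}

const (
	maxBatch   = 128
	flushEvery = 5 * time.Millisecond
)

// Consume drains the BPF ring buffer and calls flush whenever 128 events
// accumulate or 5ms elapse, whichever comes first. flush is expected to
// hand the batch to the OTel SDK synchronously and return.
func Consume(ringMap *ebpf.Map, flush func([]Event)) error {
	rd, err := ringbuf.NewReader(ringMap)
	if err != nil {
		return err
	}
	defer rd.Close()

	records := make(chan Event, maxBatch)
	go func() {
		defer close(records)
		for {
			rec, err := rd.Read()
			if err != nil {
				if errors.Is(err, ringbuf.ErrClosed) {
					return
				}
				continue // transient read error; skip rather than stall
			}
			var ev Event
			if binary.Read(bytes.NewReader(rec.RawSample), binary.LittleEndian, &ev) == nil {
				records <- ev
			}
		}
	}()

	batch := make([]Event, 0, maxBatch)
	ticker := time.NewTicker(flushEvery)
	defer ticker.Stop()
	for {
		select {
		case ev, ok := <-records:
			if !ok { // reader closed: flush what's left and exit
				flush(batch)
				return nil
			}
			batch = append(batch, ev)
			if len(batch) >= maxBatch {
				flush(batch)
				batch = batch[:0]
			}
		case <-ticker.C:
			if len(batch) > 0 {
				flush(batch)
				batch = batch[:0]
			}
		}
	}
}
Note that the overflow drop itself happens kernel-side: when bpf_ringbuf_reserve() fails under pressure the event is never written, so the producer never waits on userspace and the reader simply sees fewer records.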
Cost & cardinality
The temptation when reaching for eBPF is to feel that observability cost has been "solved" because you're not paying a vendor agent any more. This is almost always wrong. The cost shifts to cardinality at the storage tier, and to cognitive load on the on-caller.
Numbers from this engagement, week 1 vs week 12:
- Active time-series: 2.4M → 690k
- Storage cost: $172k/mo → $51k/mo
- Probe-attached CPU overhead: 4.1% → 0.6%
- p99 observability latency: 14ms → 0.8ms
Takeaways
- Topology before probes. If you can't draw the failure domains, you can't decide what to observe.
- Seven, not seven hundred. Probe surface area is a budget, not a feature.
- Kernel-side filtering. Don't ship events to userspace and filter — filter in the kernel and ship signals.
- Cardinality is the cost. Vendor lock-in is solvable; cardinality is a design problem you re-solve every quarter.
This post traces back to a Foundational engagement completed in Q1 2026. Sanitised numbers, founder-approved write-up. If you're seeing a similar pattern — vendor agents that can't see your hot path — talk to us.