eBPF probing at scale: instrumenting a 500k TPS clearing engine.
Most teams reach for eBPF after they've tried everything else. By the time we showed up, this client had already attached BCC scripts in production, set them on fire, and forgotten to detach them. Probes were chained. Probes were unbounded. Latency on hot paths had gained 14ms and nobody could explain why.
What follows is sanitised — names redacted per contract — but the numbers are real and the architectural mistakes are extremely typical.
The problem statement
The workload was a Tier-1 clearing engine. Five-nines availability was a regulatory line item. Throughput peaked around 500,000 transactions per second across 31 services, with a p99 budget of 80ms on the regulatory pathway.
The team had three problems converging:
- Vendor agents couldn't see inside JNI calls — 40% of latency was invisible.
- BCC scripts attached for incident triage had been left running for months.
- Cardinality was growing because every probe attached a per-connection label.
Topology comes first
Before adding a single probe, we mapped the failure domains. Most teams skip this step because it feels boring; it's the step that determines whether your eBPF effort survives Q3.
For this workload the topology decomposed into four logical clusters: ingest, match, clear, and report. Critical-path latency lived almost entirely in match → clear, but the team's existing dashboards averaged across all four — the noise from report was masking what should've been an alert.
Trace the boundary, not the box. A probe attached to a process tells you about that process; a probe attached to a contract tells you about your business.
The probe surface we shipped
We deployed exactly seven kprobes and three uprobes across the regulatory pathway. Not seven hundred. Seven. The temptation when working with eBPF is to instrument everything because the cost is low — but cardinality is the cost, and cognitive load on the on-caller is a much higher cost than CPU.
Hot-path kprobe set
SEC("kprobe/tcp_sendmsg")
int kprobe__tcp_sendmsg(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    if (!is_traced(pid)) return 0;
    // record sk + size, key by 5-tuple hash, not pid
    struct ev_key key = derive_key(ctx);
    struct ev *e = bpf_map_lookup_elem(&active, &key);
    if (e) e->sent_bytes += PT_REGS_PARM3(ctx);
    return 0;
}
[Figure: cross-section of the Linux userspace/kernel boundary. Four production processes (JVM, Go, Python, Rust) sit above the kernel line; seven kprobes attach at specific kernel call sites below it; a ring buffer pulls those events into a Go reader that batches every 5ms or 128 events before handing them to the OTel collector. Asset: /img/blog/ebpf-kernel-cross-section.png]
Two things to notice. First, we key by 5-tuple hash, not by PID — this is what kept cardinality bounded when one of the JVM workers churned PIDs. Second, the is_traced() guard runs against a small BPF map that the userspace controller updates: only services in the regulatory pathway emit events. Everything else gets short-circuited in the kernel.
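The controller half of that guard is plain userspace code. Here's a minimal sketch of the shape, assuming the filter map is pinned under bpffs and managed with the cilium/ebpf library; the pin path, map layout, and the way PIDs arrive are illustrative stand-ins, not the client's redacted specifics.
PID-filter controller (sketch)
package main

import (
	"log"
	"os"
	"strconv"

	"github.com/cilium/ebpf"
)

// Hypothetical pin path for the PID-filter map that the kernel-side
// is_traced() helper reads. Keys are PIDs (u32), values a single byte.
const tracedPidsPin = "/sys/fs/bpf/clearing/traced_pids"

// markTraced flags a PID so its events pass the is_traced() guard.
// Deleting the entry stops emission without detaching any probe.
func markTraced(m *ebpf.Map, pid uint32) error {
	var one uint8 = 1
	return m.Put(pid, one)
}

func main() {
	m, err := ebpf.LoadPinnedMap(tracedPidsPin, nil)
	if err != nil {
		log.Fatalf("open pinned map: %v", err)
	}
	defer m.Close()

	// PIDs of regulatory-pathway services, taken from the command line
	// purely for illustration.
	for _, arg := range os.Args[1:] {
		pid, err := strconv.ParseUint(arg, 10, 32)
		if err != nil {
			log.Fatalf("bad pid %q: %v", arg, err)
		}
		if err := markTraced(m, uint32(pid)); err != nil {
			log.Fatalf("mark pid %d: %v", pid, err)
		}
	}
}
Because the lookup happens inside the kprobe, a PID missing from the map costs one hash lookup and an early return; nothing crosses the kernel boundary.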
Userspace ring-buffer consumer
We used the BPF ring buffer (not the older perf buffer) and consumed in a Go-side reader that batched into the OTel SDK. Three things matter here for sub-millisecond observability latency:
- Drop on overflow — if the reader falls behind, events are dropped; the kernel side never blocks or waits on userspace.
- Batch flushes on either 128 events or 5ms, whichever comes first.
- Tail-sampling at the OTel collector, not in the agent — keeps trace decisions consistent across replicas.
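Putting those three points together, the reader's hot loop stays small. The sketch below is a minimal version assuming the cilium/ebpf ringbuf package; the event layout and the flush callback (which would hand batches to the OTel SDK) are illustrative, not the engagement's actual code.
Batching ring-buffer reader (sketch)
package reader

import (
	"bytes"
	"encoding/binary"
	"errors"
	"time"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/ringbuf"
)

// Event mirrors the kernel-side struct ev; the fields here are illustrative.
type Event struct {
	TupleHash uint64 // 5-tuple hash, the cardinality key
	SentBytes uint64
	LatencyNs uint64
}

const (
	maxBatch   = 128
	flushEvery = 5 * time.Millisecond
)

// Consume drains the BPF ring buffer and calls flush whenever 128 events
// accumulate or 5ms elapse, whichever comes first. flush is expected to
// hand the batch to the OTel SDK synchronously and return.
func Consume(ringMap *ebpf.Map, flush func([]Event)) error {
	rd, err := ringbuf.NewReader(ringMap)
	if err != nil {
		return err
	}
	defer rd.Close()

	records := make(chan Event, maxBatch)
	go func() {
		defer close(records)
		for {
			rec, err := rd.Read()
			if err != nil {
				if errors.Is(err, ringbuf.ErrClosed) {
					return
				}
				continue // transient read error; skip rather than stall
			}
			var ev Event
			if binary.Read(bytes.NewReader(rec.RawSample), binary.LittleEndian, &ev) == nil {
				records <- ev
			}
		}
	}()

	batch := make([]Event, 0, maxBatch)
	ticker := time.NewTicker(flushEvery)
	defer ticker.Stop()
	for {
		select {
		case ev, ok := <-records:
			if !ok { // reader closed: flush what's left and exit
				flush(batch)
				return nil
			}
			batch = append(batch, ev)
			if len(batch) >= maxBatch {
				flush(batch)
				batch = batch[:0]
			}
		case <-ticker.C:
			if len(batch) > 0 {
				flush(batch)
				batch = batch[:0]
			}
		}
	}
}
Note that the overflow drop itself happens kernel-side: when bpf_ringbuf_reserve() fails under pressure the event is never written, so the producer never waits on userspace and the reader simply sees fewer records.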
Cost & cardinality
The temptation when reaching for eBPF is to feel that observability cost has been "solved" because you're not paying a vendor agent any more. This is almost always wrong. The cost shifts to cardinality at the storage tier, and to cognitive load on the on-caller.
Numbers from this engagement, week 1 vs week 12:
- Active time-series: 2.4M → 690k
- Storage cost: $172k/mo → $51k/mo
- Probe-attached CPU overhead: 4.1% → 0.6%
- p99 observability latency: 14ms → 0.8ms
Takeaways
- Topology before probes. If you can't draw the failure domains, you can't decide what to observe.
- Seven, not seven hundred. Probe surface area is a budget, not a feature.
- Kernel-side filtering. Don't ship events to userspace and filter — filter in the kernel and ship signals.
- Cardinality is the cost. Vendor lock-in is solvable; cardinality is a design problem you re-solve every quarter.
This post traces back to a Foundational engagement completed in Q1 2026. Sanitised numbers, founder-approved write-up. If you're seeing a similar pattern — vendor agents that can't see your hot path — talk to us.