v0.2.5: split BPF datapath into fast_path + finalize via bpf_tail_call#45
Merged
Fixes the v0.2.4 regression on UniFi 5.15.72-ui-cn9670 (aarch64) where
the kernel rejected fast_path with "combined stack size of 3 calls is
544. Too large" — same bytecode that loaded cleanly on CI's qemu 5.15
(stack 0+360+0+0). UniFi's BPF patches plus aarch64 JIT account stack
~120 bytes higher than vanilla 5.15 on x86_64.
Architecture: two XDP programs in one ELF, chained by bpf_tail_call.
Each gets its own 512-byte stack budget.
fast_path (XDP, attached per-iface):
  * classification (allow-prefix, block-prefix, dry-run)
  * FIB lookup (kernel-fib | custom-fib | compare)
  * devmap pre-check
  * TTL decrement (in-place)
  * L2 rewrite (in-place)
  * write per-CPU MUTATION_CTX
  * bpf_tail_call(MUTATION_PROGS, 0) ──► finalize

finalize (XDP, tail-called):
  * read MUTATION_CTX
  * mss-clamp lookup + mutation
  * VLAN choreography
  * bpf_redirect_map
mss-clamp + VLAN + redirect move from forward_success into the new
finalize program; per-prefix LPM keys + TCP-options walk live in
finalize's fresh stack budget. fast_path's responsibilities shrink to
classification + L2/TTL mutation, which fits comfortably under any
kernel's accounting.
This is NOT the multi-module dispatcher (SPEC §3.4 / §5.0). Tail-call
is one-way control transfer between cooperating stages of one logical
pipeline; the dispatcher is for chaining independent modules at the
same hook (ddos, sampler). Both will eventually exist; v0.2.5 ships
only the former.
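As a host-side illustration of the control flow (not the BPF code itself — the names and the jump-table stand-in here are hypothetical), the tail-call boundary behaves like a one-way jump through a slot table, with an empty slot falling back to XDP_PASS:

```rust
// Host-side sketch of the two-stage chain. In the kernel, MUTATION_PROGS is a
// ProgramArray and the jump is bpf_tail_call (one-way; control never returns
// to the caller).
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,          // XDP_PASS fallback
    Redirect(u32), // XDP_REDIRECT via devmap, carrying the egress ifindex
}

// Stand-in for the per-CPU MUTATION_CTX scratch slot.
#[derive(Clone, Copy)]
struct MutationCtx {
    egress_ifindex: u32,
}

fn finalize(mctx: &MutationCtx) -> Verdict {
    // mss-clamp + VLAN work elided; the stage ends with the redirect.
    Verdict::Redirect(mctx.egress_ifindex)
}

fn fast_path(slots: &[Option<fn(&MutationCtx) -> Verdict>]) -> Verdict {
    let mctx = MutationCtx { egress_ifindex: 3 };
    // bpf_tail_call(MUTATION_PROGS, 0): if slot 0 is empty the call errors
    // and fast_path falls through to XDP_PASS (counter err_tail_call).
    match slots.first().copied().flatten() {
        Some(stage) => stage(&mctx),
        None => Verdict::Pass,
    }
}

fn main() {
    let populated: [Option<fn(&MutationCtx) -> Verdict>; 8] =
        [Some(finalize), None, None, None, None, None, None, None];
    assert_eq!(fast_path(&populated), Verdict::Redirect(3));

    let empty: [Option<fn(&MutationCtx) -> Verdict>; 8] = [None; 8];
    assert_eq!(fast_path(&empty), Verdict::Pass); // fail-open: traffic still flows
    println!("ok");
}
```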
New BPF maps:
* MUTATION_CTX (PerCpuArray<MutationCtx>): per-CPU scratch carrying
egress_ifindex, egress_vid, ingress_vid, ip_offset, is_v4 across the
tail-call boundary. fast_path writes, finalize reads.
* MUTATION_PROGS (ProgramArray, 8 slots): jump table. Slot 0 holds
finalize today; slots 1-7 reserved for future stages.
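A minimal layout sketch for the scratch struct — the field set and the 16-byte total come from the PR, but the exact widths, ordering, and padding shown here are assumptions:

```rust
use std::mem::{align_of, size_of};

// Sketch of a 16-byte, C-layout MutationCtx. Field widths and ordering are
// assumptions; the PR only fixes the field set and the 16-byte size.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct MutationCtx {
    egress_ifindex: u32, // devmap target
    ip_offset: u32,      // start of the IP header relative to packet data
    egress_vid: u16,     // VLAN to push/rewrite on egress (0 = none)
    ingress_vid: u16,    // VLAN seen on ingress (0 = untagged)
    is_v4: u8,           // 1 = IPv4, 0 = IPv6
    _pad: [u8; 3],       // explicit padding keeps writer and reader layouts identical
}

fn main() {
    // A fixed layout matters because fast_path writes and finalize reads the
    // same bytes through the per-CPU array map.
    assert_eq!(size_of::<MutationCtx>(), 16);
    assert_eq!(align_of::<MutationCtx>(), 4);
    println!("ok");
}
```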
New StatIdx counters (append-only):
* 35: err_tail_call — fast_path's tail_call returned an error (slot
empty). fast_path falls through to XDP_PASS so traffic still flows.
* 36: err_mutation_ctx — finalize couldn't read MUTATION_CTX. Should
be 0 in steady state.
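The append-only rule means the two new counters pin their numeric slots forever; a hypothetical shape of the index (only the 35/36 values are from the PR):

```rust
// Hypothetical sketch of the append-only StatIdx counter indices; existing
// entries 0..=34 are elided. Values 35/36 are fixed by this PR.
#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum StatIdx {
    ErrTailCall = 35,    // tail_call failed (slot empty); packet fell through to XDP_PASS
    ErrMutationCtx = 36, // finalize could not read MUTATION_CTX; should stay 0
}

fn main() {
    assert_eq!(StatIdx::ErrTailCall as u32, 35);
    assert_eq!(StatIdx::ErrMutationCtx as u32, 36);
    println!("ok");
}
```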
Userspace lifecycle changes:
* attach() loads finalize first, populates MUTATION_PROGS[0], then
loads + attaches fast_path. Order matters: fast_path's first packet
must find a populated slot.
* pin_program_and_maps walks PROGRAM_NAMES (fast_path + finalize); both
pins survive SIGTERM per SPEC §8.5.
* MAP_NAMES grows to include MUTATION_CTX, MUTATION_PROGS, and the
v0.2.4 mss-clamp maps that were missing from the previous list.
* Status command reports tail-call chain occupancy ("MUTATION_PROGS[0]:
populated (finalize)") so operators can confirm wiring.
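The load-order invariant can be sketched host-side (the `Loader` type and methods here are stand-ins, not the real aya-based code): populate MUTATION_PROGS[0] before fast_path can see its first packet.

```rust
// Host-side sketch of the attach() ordering invariant. Types and method names
// are hypothetical stand-ins for the real userspace loader.
#[derive(Default)]
struct MutationProgs {
    slot0_populated: bool, // MUTATION_PROGS[0] holds finalize's fd
}

#[derive(Default)]
struct Loader {
    progs: MutationProgs,
    fast_path_attached: bool,
}

impl Loader {
    fn load_finalize(&mut self) { /* load the tail-call target first */ }

    fn populate_mutation_progs(&mut self) {
        self.progs.slot0_populated = true;
    }

    fn attach_fast_path(&mut self) -> Result<(), &'static str> {
        // Refuse the unsafe order: once attached, the very first packet may
        // tail-call into slot 0, so the slot must already be populated.
        if !self.progs.slot0_populated {
            return Err("MUTATION_PROGS[0] empty; first packet would fall to XDP_PASS");
        }
        self.fast_path_attached = true;
        Ok(())
    }
}

fn main() {
    let mut l = Loader::default();
    assert!(l.attach_fast_path().is_err()); // wrong order is rejected
    l.load_finalize();
    l.populate_mutation_progs();
    assert!(l.attach_fast_path().is_ok()); // finalize -> slot 0 -> attach
    println!("ok");
}
```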
Test harness:
* Harness::new() now loads both programs and populates MUTATION_PROGS[0]
before returning, so existing bpf_prog_test_run-based tests follow
the chain transparently. Kernel's BPF_PROG_TEST_RUN handles
bpf_tail_call by re-entering its dispatcher for the target program;
tests see the verdict + mutations from the full chain.
Version bumped 0.2.4 → 0.2.5. README Status table grows a "Two-stage
BPF datapath" row. New runbook at docs/runbooks/tail-call-architecture.md
documents the chain, MutationCtx wire format, debug commands, and
how future stages slot in.
Netns end-to-end integration test (real veth + SYN + capture, asserts
MSS clamped on the wire) is deferred to a follow-up PR. Existing
attach-roundtrip + bpf_prog_test_run fixtures in qemu-verifier validate
that both programs load and attach, and that the tail-call wires
correctly, on kernels 5.15 + 6.6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The newer-kernel verifier on the GitHub Actions runner rejected the
single-entry mss_clamp_inline at the proto-byte read with
`R9 offset is outside of the packet`. The bound check used a
runtime-conditional size (`if is_v4 { 20 } else { 40 }`), which the
verifier could not connect to the subsequent typed cast through
`*const Ipv4Hdr` — so the read at offset 9 (proto field) appeared
unbounded.
Splitting the dispatch upfront lets each path bound-check with a
compile-time constant (`Ipv4Hdr::LEN` / `Ipv6Hdr::LEN`) immediately
followed by the cast and field reads — the same `ptr_at` pattern
main.rs already uses. The qemu kernels (5.15, 6.6) accepted the old
form; the newer runner kernel did not.
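A safe host-side analogue of the fix (slice bounds stand in for the data/data_end check; the constants mirror Ipv4Hdr::LEN / Ipv6Hdr::LEN, and the function name is illustrative):

```rust
// Host-side analogue of the split-dispatch fix: bound-check with a
// compile-time constant, then read fields immediately after. In the BPF
// program this is the data/data_end ptr_at check; here a slice stands in
// for the packet.
const IPV4_HDR_LEN: usize = 20; // Ipv4Hdr::LEN
const IPV6_HDR_LEN: usize = 40; // Ipv6Hdr::LEN

fn l4_proto(pkt: &[u8], ip_off: usize, is_v4: bool) -> Option<u8> {
    if is_v4 {
        // Constant-size bound check directly before the read: the form the
        // verifier can connect to the subsequent typed access.
        let hdr = pkt.get(ip_off..ip_off + IPV4_HDR_LEN)?;
        Some(hdr[9]) // IPv4 protocol field at offset 9
    } else {
        let hdr = pkt.get(ip_off..ip_off + IPV6_HDR_LEN)?;
        Some(hdr[6]) // IPv6 next-header field at offset 6
    }
}

fn main() {
    let mut pkt = [0u8; 64];
    pkt[14 + 9] = 6; // TCP, assuming a 14-byte Ethernet header
    assert_eq!(l4_proto(&pkt, 14, true), Some(6));
    assert_eq!(l4_proto(&pkt[..20], 14, true), None); // too short: fail-safe
    println!("ok");
}
```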
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The verifier's `find_good_pkt_pointers` refuses to propagate readable-range info through packet-pointer arithmetic when the scalar offset's umax_value exceeds MAX_PACKET_OFF (0xffff). `mctx.ip_offset` is read from a per-CPU map, so the verifier sees its full u32 range and skips range propagation — leaving the post-bound-check pkt pointer with range=0 and rejecting the subsequent header field read.

Capping `ip_offset` at MAX_IP_OFFSET (64) right after the MUTATION_CTX read gives the verifier a tight umax it can reason about. fast_path writes 14 or 18 in practice; 64 leaves headroom for a future second VLAN tag. Out-of-range is fail-safe XDP_PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
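The cap can be sketched host-side (the function name is illustrative; the 64-byte MAX_IP_OFFSET and the fail-safe behavior are from the commit):

```rust
// Sketch of the cap applied right after the MUTATION_CTX read. The early
// return is the fail-safe XDP_PASS path from the commit.
const MAX_IP_OFFSET: u32 = 64;

fn bounded_ip_offset(raw_from_map: u32) -> Option<u32> {
    // The comparison gives the verifier umax = 64 on the surviving branch,
    // well under MAX_PACKET_OFF (0xffff), so range propagation proceeds.
    if raw_from_map > MAX_IP_OFFSET {
        return None; // out-of-range -> XDP_PASS
    }
    Some(raw_from_map)
}

fn main() {
    assert_eq!(bounded_ip_offset(14), Some(14)); // untagged Ethernet
    assert_eq!(bounded_ip_offset(18), Some(18)); // single VLAN tag
    assert_eq!(bounded_ip_offset(u32::MAX), None); // full-range map value
    println!("ok");
}
```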
Capping ip_offset at 64 (previous commit) got the verifier past the IP header reads, but the TCP csum patch still hit "R6 offset is outside of the packet" at byte 17 of the TCP header. The bound check on `start + csum_off + 2 > end` did not propagate readable-range back to the actual read site because LLVM emitted a fresh packet-pointer arithmetic chain (new id) for the read.

v0.2.4's working pattern derived `ip_offset = (ip as usize) - start` inside mss_clamp_inline, where the verifier tracks the result as a `pkt - pkt` subtraction with `umax = MAX_PACKET_OFF (0xffff)` — a pkt-derived bound that range propagation honors. Pulling the same pattern into finalize: pass the typed `ip` pointer (already bounds-checked) into `mss_clamp_tcp` and recover ip_offset there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
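A host-side sketch of that pkt - pkt pattern (a byte buffer stands in for the packet; only the offset recovery is shown, not the clamp itself):

```rust
// Sketch: derive ip_offset from the already bounds-checked ip pointer instead
// of hauling a map-sourced scalar to the read site.
fn recover_ip_offset(pkt: &[u8], ip: *const u8) -> usize {
    let start = pkt.as_ptr() as usize;
    // In BPF, subtracting two packet pointers yields a scalar whose umax the
    // verifier bounds by MAX_PACKET_OFF, so subsequent bound checks built on
    // it propagate readable-range to the reads.
    (ip as usize) - start
}

fn main() {
    let pkt = [0u8; 64];
    // In the real code this is the typed Ipv4Hdr/Ipv6Hdr pointer passed into
    // mss_clamp_tcp; here, just a pointer 14 bytes into the buffer.
    let ip = unsafe { pkt.as_ptr().add(14) };
    assert_eq!(recover_ip_offset(&pkt, ip), 14);
    println!("ok");
}
```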
Summary
Fixes the v0.2.4 stack-budget regression on UniFi 5.15 kernels by splitting the BPF datapath into two programs connected by bpf_tail_call. Each program gets its own 512-byte stack budget. The same architecture establishes the pattern for future fast-path-internal stages (additional packet transforms, more sophisticated FIB logic) without re-bisecting stack bytes every release.

The proximate failure (reported on edge1-mci1-net): UniFi's 5.15.72-ui-cn9670 aarch64 kernel rejected v0.2.4's fast_path BPF program at "combined stack size of 3 calls is 544. Too large" — the same bytecode loaded cleanly on CI's qemu 5.15 vanilla x86_64 (stack depth 0+360+0+0). UniFi's BPF patches plus the aarch64 JIT account stack ~120 bytes higher than vanilla.

Architecture
mss-clamp + VLAN + redirect move from forward_success into finalize. Per-prefix LPM keys + TCP-options walk live in finalize's fresh stack budget. fast_path's responsibilities shrink to classification + L2/TTL, which fits comfortably under any kernel's accounting.

This is not the multi-module dispatcher (SPEC §3.4 / §5.0) — that's for chaining independent modules at the same hook (ddos in front of fast-path, sampler behind it). Tail-call is for splitting one logical pipeline. Both will eventually exist; v0.2.5 ships only the former.
What's in the PR
* #[xdp] pub fn finalize — reads MUTATION_CTX, runs mss-clamp + VLAN + redirect. ~280 LOC; mss-clamp + VLAN choreography moved here verbatim from main.rs.
* MutationCtx struct (16 bytes), MUTATION_CTX (PerCpuArray, single-element scratch), MUTATION_PROGS (ProgramArray, 8 slots). New StatIdx 35/36 (err_tail_call, err_mutation_ctx).
* forward_success writes MutationCtx and tail-calls into MUTATION_PROGS[0] instead of doing mss-clamp + VLAN + redirect inline. ~440 LOC of mss-clamp + VLAN choreography moved out (now in finalize.rs).
* attach() loads finalize first → populates MUTATION_PROGS[0] with finalize's FD → loads + attaches fast_path. Order matters. New populate_mutation_progs helper. New tail_call_chain_from_pin for status reporting.
* FINALIZE_PROGRAM_NAME constant + PROGRAM_NAMES array. MAP_NAMES grows to 19 (added MSS_CLAMP_V4/V6/BY_IFACE, which were missing from v0.2.4, plus MUTATION_CTX/PROGS). pin_program_and_maps walks both program names.
* Harness::new now loads both programs and populates MUTATION_PROGS[0] before returning. bpf_prog_test_run follows tail-calls (the kernel re-enters its dispatcher for the target program), so existing tests transparently see the full chain's verdict + mutations.
* New runbook at docs/runbooks/tail-call-architecture.md documents the chain, debug commands (bpftool prog show, bpftool map dump MUTATION_PROGS), and how future stages slot in.

What's deliberately NOT in this PR

* Netns end-to-end integration test (tests/tail_call.rs was in the plan). The kernel BPF_PROG_TEST_RUN harness already exercises the tail-call via the existing fixtures (now updated to populate MUTATION_PROGS[0]). A real-veth + AF_PACKET capture test is good additional coverage but ~150 LOC of test infra; deferring it keeps this PR focused on the architecture change. Will land as a follow-up.
* The bpf_fib_lookup per-CPU map move (mentioned as "alternative B" in earlier discussion). Tail-call obviates it; we have plenty of stack headroom now.

CI expectations
Existing CI matrix should pass:
The qemu jobs are the meaningful test for "does the verifier accept this on real kernels." If they pass, vanilla 5.15 / 6.6 are fine. UniFi-style stricter accounting will be confirmed via post-merge deployment on the same router that hit the v0.2.4 regression.
Test plan
Pre-merge (CI):
* cargo fmt --all --check clean
* cargo clippy --workspace --all-targets --all-features -- -D warnings clean
* cargo test --workspace --lib — 94 + 40 pass on macOS dev host

Post-merge on the deployed UniFi router:
* apt install ./packetframe_0.2.5_arm64.deb
* sudo systemctl restart packetframe
* sudo packetframe feasibility --config /etc/packetframe/packetframe.conf --human — all xdp.attach.ethN now PASS (no more 544/512 rejection)
* sudo packetframe status — the new "tail-call chain" section reports MUTATION_PROGS[0]: populated (finalize)
* Add mss-clamp 23.191.200.0/24 1360; sudo packetframe reconfigure
* tcpdump -i eth2 -n 'tcp[tcpflags] & tcp-syn != 0' -vv confirms wire MSS=1360 on outbound SYNs
* sudo packetframe status | grep mss_clamp_applied shows the counter climbing
* err_tail_call and err_mutation_ctx stay at 0

Tag flow after merge
(The version bump is in this PR — same pattern as v0.2.4 — so just tag and push.)
🤖 Generated with Claude Code