Skip to content

v0.2.4: mss-clamp directive + reconfigure CLI + ExecReload#44

Merged
lunarthegrey merged 7 commits into
mainfrom
v0.2.4-mss-clamp
May 5, 2026
Merged

v0.2.4: mss-clamp directive + reconfigure CLI + ExecReload#44
lunarthegrey merged 7 commits into
mainfrom
v0.2.4-mss-clamp

Conversation

@lunarthegrey
Copy link
Copy Markdown
Contributor

@lunarthegrey lunarthegrey commented May 5, 2026

Summary

Closes the SPEC §11.4 MSS-clamping gap and lands the long-deferred reconfigure CLI / systemctl reload path in one bundled PR.

  • mss-clamp directive (fast-path). Inline mutation of the TCP MSS option in matched SYN/SYN-ACK packets before redirect. Fixes the iptables-bypass gap on real edge deployments where allow-prefix overlaps a -j TCPMSS FORWARD rule. Four grammars (global / per-iface / per-prefix / per-prefix+iface), SYN+SYN-ACK both directions, lower-if-higher policy, IPv4 + IPv6 parity, RFC 1624 incremental checksum. See docs/runbooks/mss-clamp.md.
  • packetframe reconfigure subcommand. SIGHUP plumbing was already wired (loader.rs:reconfigure_from_signal, reconcile.rs) — what was missing was the CLI side. Now uses PIDFile + /proc/<pid>/comm validation, sends SIGHUP, and polls a last-reconfigure.timestamp ack marker for up to 5s. Exit code distinguishes parse errors from per-module reconcile failures. See docs/runbooks/reconfigure.md.
  • ExecReload= + PIDFile= in the systemd unit (crates/cli/debian/packetframe.service). The README's aspirational systemctl reload packetframe finally works.
  • block-prefix reconcile path (small adjacent fix). v0.2.1 shipped BLOCK_V4/V6 maps but no reconcile, so adding/removing block-prefix lines required restart. ~50-LOC mirror of the existing allow-prefix reconcile.
  • build.rs improvement. Forward BPF rustc JSON diagnostics on failure so future BPF compile errors are visible — without this, cargo::warning=[bpf stderr] only carried "Compiling X" progress lines and the final summary, hiding actual rustc diagnostics. Found while debugging this PR's BPF stack/verifier issues.

Why bundle

  • The runbook for mss-clamp explains how to test changes via reconfigure (flip MSS values without re-attaching XDP); they're naturally documented together.
  • mss-clamp creates a real test surface for reconfigure (flip values, watch counters).
  • ExecReload only makes sense after the CLI subcommand exists.
  • block-prefix reconcile was a known gap noticed during the reconfigure exploration; bundling beats a follow-up PR.

What's in the PR

Area Files Notes
BPF maps maps.rs New MssClampValue, MSS_CLAMP_V4/V6/BY_IFACE. FpCfg layout V1 → V2 (the formerly-_reserved slot is now mss_clamp_global: u16; struct size unchanged). StatIdx 33/34 = mss_clamp_applied/skipped (append-only per CLAUDE.md).
BPF program main.rs New mss_clamp_inline (and helpers) called from forward_success between devmap check and TTL decrement. All #[inline(always)] because BPF subprograms can't take packet-pointer args (see "BPF lessons" below). Verifier-bounded TCP-options walk capped at 8 iterations with sequential if branches (state-space-friendly). RFC 1624 incremental TCP checksum update.
Config parser config.rs New ModuleDirective::MssClamp + MssClampPrefix enum. Parser disambiguates the four grammars by token shape (CIDR contains /, via is the iface keyword). Bounds check 88 ≤ mss ≤ 65495.
Userspace fast-path linux_impl.rs, reconcile.rs populate_mss_clamp (load), reconcile_mss_clamp (SIGHUP delta), reconcile_block_v4/v6. The mss reconcile uses re-insert-on-key semantics so a value change for an existing prefix is picked up.
CLI main.rs, loader.rs Reconfigure handler replaces the stub. New loader::reconfigure() + ReconfigureError. PIDFile lifecycle in run_linux. Marker file written by reconfigure_from_signal.
Systemd packetframe.service ExecReload=/bin/kill -HUP $MAINPID, PIDFile=....
Build harness build.rs Forward BPF rustc JSON diagnostics on failure (parse compiler-message records, emit rendered field as cargo warnings).
Docs README.md, conf/example.conf, 2 new runbooks Status table expanded to enumerate sketched future modules (ddos / sampler / randomizer / dispatcher) with SPEC §-pointers + priority buckets. mss-clamp + reconfigure marked Production (v0.2.4+).
SPEC.md (untracked) local file §3.2 priority taxonomy note, §4.6 counter rows 32-34, §4.10 closeout, new §4.12 MSS clamping, §5.0 dispatcher prerequisite, §8.4 reconfigure UX + hot-reloadable matrix, §11.4 closeout.

BPF lessons learned (worth recording for future XDP work)

The BPF side took five iterations after the initial design:

  1. #[inline(never)] to escape stack budget — failed. BPF kernel verifier rejects bpf2bpf calls with packet-pointer arguments: R1 pointer arithmetic with <<= operator prohibited. LLVM emits a 32→64 bit lift via shift-left for the pkt_ptr arg to satisfy the BPF calling convention; the verifier prohibits arithmetic shifts on packet pointers.
  2. Pass ip_offset: usize (scalar) instead — failed. LLVM's SROA decomposes &XdpContext into (data, data_end) packet pointers across the call boundary. Same verifier rejection on the data_end arg.
  3. Revert to #[inline(always)], trim stack via block scopes — partially worked. Keeps everything in the parent's frame; reading src/dst inside LPM-lookup blocks lets the compiler reuse stack slots between src and dst keys.
  4. 0..40 TCP-options walk — verifier hit 1M-instruction processing limit (BPF program is too large). Combinatorial state explosion across the inner match-based kind dispatch.
  5. Cap at 0..8 iterations + flatten match to sequential if — final fix. Real SYN packets put MSS in option positions 1-4 (Linux's tcp_options_write emits MSS very early); 8 is comfortable headroom. SYNs that bury MSS deeper get counted as mss_clamp_skipped (pathological in practice).

Takeaway captured in crates/modules/fast-path/bpf/src/main.rs doc-comments and SPEC §4.12: for BPF code that touches packet pointers, #[inline(always)] is essentially mandatory; stack-budget pressure has to be solved by tightening locals, not by extracting subprograms.

Tests

14 new parser tests in config.rs covering all four grammars (IPv4 + IPv6), boundary values (88, 65495), out-of-range rejection, missing-arg errors, and multi-line accumulation.

test result: ok. 94 passed; 0 failed; ...   # packetframe-common
test result: ok. 40 passed; 0 failed; ...   # other workspace tests

cargo fmt --check clean. cargo clippy --workspace --all-targets --all-features -- -D warnings clean.

CI

Check Result
fmt + clippy + test
cross-build amd64 / arm64 (gnu + musl) ✅ all 4
qemu kernel v5.15 (BPF verifier + attach)
qemu kernel v6.6 (BPF verifier + attach)
CodeQL / Analyze (actions, rust)

The qemu jobs run the real BPF program through the kernel verifier on attach, so this confirms the byte-level mutation logic is verifier-clean on both 5.15 and 6.6.

Test plan (post-merge, on the deployed router)

  • Apply v0.2.4 to the router. apt install ./packetframe_0.2.4_amd64.deb.
  • Add mss-clamp 23.191.201.0/24 via eth2 1360 (replacing the equivalent iptables -j TCPMSS rule). sudo packetframe reconfigure.
  • Verify packetframe status shows mss_clamp_applied climbing on customer SYN traffic. tcpdump -i eth2 -n 'tcp[tcpflags] & tcp-syn != 0' -c 5 -vv confirms the wire MSS is 1360.
  • Retire the iptables -j TCPMSS rule once mss-clamp is confirmed firing.
  • block-prefix add/remove via reconfigure → confirm BLOCK_V4 deltas applied without restart.
  • Bad config edit → packetframe reconfigure exits non-zero with parse-error detail; daemon stays up with old config.

Out of scope (explicitly)

  • Separate mss-clamp module (requires the dispatcher first; SPEC §5.0).
  • Hot-reload of attach iface set, route-source, circuit-breaker thresholds, local-prefix.
  • packetframe reconfigure --check (dry-run validation).
  • Type=notify migration of the systemd unit.
  • AppArmor/SELinux profiles, apt-repo hosting, rpm packaging (carry-over from packaging: ship .deb packages with systemd service #39).

Tag flow after merge

git checkout main && git pull
# Bump workspace Cargo.toml + VERSION + README install snippets to 0.2.4
# in a small follow-up PR (see release-v0.2.3 pattern), then:
git tag -a v0.2.4 -m "v0.2.4"
git push origin v0.2.4

🤖 Generated with Claude Code

lunarthegrey and others added 7 commits May 4, 2026 19:58
Closes the SPEC §11.4 MSS-clamping gap: standard iptables `TCPMSS
--set-mss` rules don't fire on fast-pathed flows because XDP redirect
via bpf_redirect_map skips netfilter entirely. This was hitting real
operators on edge routers where customer subnets are in `allow-prefix`
and the MSS clamp on the WAN-egress FORWARD rule is required for TCP
to behave across MTU-limited tunnels.

mss-clamp ships as a directive inside the existing fast-path module,
not a separate module — bpf_redirect_map bypasses TC egress, and the
multi-module dispatcher isn't built yet, so a separate-program clamp
has nowhere to hook on fast-pathed traffic. Inline mutation in the
existing XDP program before redirect is the only correct approach
today and remains correct after the dispatcher lands (mss-clamp would
just be one of fast-path's policies).

Four grammars accepted, looked up in order of specificity:

    mss-clamp <cidr> via <iface> <mtu>   # most specific
    mss-clamp <cidr> <mtu>                # prefix-scoped, any egress
    mss-clamp via <iface> <mtu>           # egress-iface, any prefix
    mss-clamp <mtu>                       # global default

Prefix matches src OR dst (mirrors allow-prefix); clamp applies to
SYN AND SYN-ACK (per-direction MSS is independent in TCP); lower-if-
higher policy (matches iptables `TCPMSS --set-mss` semantics).

Bundled with two related items the operator runbook ties together:

* `packetframe reconfigure` CLI subcommand. The SIGHUP plumbing in
  the daemon was already wired (loader::reconfigure_from_signal,
  fast-path's reconcile delta paths) — what was missing was the CLI
  side. Now sends SIGHUP via PIDFile + /proc/<pid>/comm validation
  and polls a `last-reconfigure.timestamp` ack marker for up to 5s.
  Distinguishes parse errors from per-module reconcile failures via
  exit code, so scripted operator flows can react.

* ExecReload + PIDFile in the systemd unit. Was deliberately omitted
  in #39 pending this work; the README's `systemctl reload
  packetframe` instruction now resolves.

* block-prefix reconcile path (v0.2.1 BLOCK_V4/V6 maps existed but
  had no reconcile path — adding/removing block-prefix lines required
  full restart). Mirrors allow-prefix line for line.

Two new counters at SPEC §4.6 indices 33-34 (mss_clamp_applied,
mss_clamp_skipped) — append-only per CLAUDE.md guardrail. FpCfg
layout bumped V1 -> V2 (the formerly-_reserved [u8; 2] slot is now
mss_clamp_global: u16; struct size unchanged).

Userspace + BPF map additions: MSS_CLAMP_V{4,6} (LPM tries with
MssClampValue { mss, iface_filter }) + MSS_CLAMP_BY_IFACE (HashMap
keyed on egress ifindex). Lookup precedence in BPF: src-prefix LPM
-> dst-prefix LPM -> per-iface HashMap -> CFG.mss_clamp_global.

TCP options walk is verifier-bounded (40-iteration cap; max one MSS
option in 40 bytes); RFC 1624 incremental checksum update mirrors
the existing decrement_ipv4_ttl pattern.

Docs: new runbooks docs/runbooks/{mss-clamp,reconfigure}.md, README
Status table expanded to enumerate the sketched future modules
(ddos / sampler / randomizer / dispatcher) with SPEC §-pointers and
priority buckets, mss-clamp + reconfigure rows added as v0.2.4+
production. conf/example.conf grew an mss-clamp section.

Tests: 14 new parser tests covering all four grammars, IPv4 + IPv6,
boundary checks (88 / 65495), duplicate-arg rejection, missing-args
errors. All 94 + 40 lib tests pass; cargo fmt + clippy clean.

Out of scope (deferred):
* Separate `mss-clamp` module with the multi-module dispatcher.
* Hot-reload of attach iface set / route-source / circuit-breaker /
  local-prefix.
* `packetframe reconfigure --check` (dry-run validation).
* Type=notify migration of the systemd unit.

SPEC.md update summary (private spec — apply manually):
* §3.2 priority taxonomy: mss-clamp lives at priority 1500
  (1000-1999 forwarding/transformation bucket).
* §4.x new "MSS clamping" section: lookup precedence, SYN+SYN-ACK
  semantics, lower-if-higher rule, IPv4/IPv6 parity, skipped-counter
  conditions.
* §4.6 counter table: append rows 33 (mss_clamp_applied) and 34
  (mss_clamp_skipped).
* §8.4 reconfigure UX: clarify CLI subcommand mechanics (SIGHUP via
  PIDFile + ack-timestamp poll), explicit list of what's hot-
  reloadable as of v0.2.4 vs what isn't.
* §5.x dispatcher placeholder: record that any second module on the
  same hook needs the dispatcher first; mss-clamp avoiding-this by
  living inside fast-path validates the design.
* §11.4: close out — "Resolved in v0.2.4 via fast-path mss-clamp
  directive (not a separate module — bpf_redirect_map bypasses TC
  egress, dispatcher prerequisite for separate-program approach)."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The nested BPF build runs cargo with `--message-format=json` so
build.rs can extract the artifact path. JSON messages go to stdout,
human-readable progress + summary lines go to stderr. The previous
forward_output(&[], &out.stderr) call only forwarded the latter,
which meant rustc's actual diagnostic content (errors + warnings)
was completely silent — operators saw "BPF build failed (exit 101)
— due to 1 previous error; 10 warnings emitted" with no clue what
the error was.

Add forward_rendered_diagnostics() that walks the JSON stream and
extracts each compiler-message's `rendered` field (the same text
rustc would print to stderr without --message-format=json). Wire
it into the failure path so operators actually see what broke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI surfaced "Looks like the BPF stack limit is exceeded" — the
mss_clamp_inline + lookup_mss_clamp_v{4,6} functions, all marked
#[inline(always)], were being inlined into forward_success which
itself is inlined into the already-large fast_path XDP entry. The
cumulative stack frame blew past BPF's 512-byte limit.

Switch all three to #[inline(never)] so LLVM emits them as
separate BPF subprograms, each with its own 512-byte budget.
Block-scope the LPM Key locals inside the lookup functions so LLVM
can reuse the same stack slot for src vs dst keys rather than
carrying both live.

The error was invisible until the previous build.rs commit started
forwarding rustc JSON diagnostics; without that, "due to 1 previous
error; 10 warnings emitted" was all operators saw.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bprogram

The previous fix split mss_clamp_inline into a separate BPF
subprogram to escape the 512-byte stack budget. CI then surfaced
a kernel verifier rejection at attach time:

    4738: (67) r1 <<= 32
    R1 pointer arithmetic with <<= operator prohibited

BPF subprograms can't accept packet pointers as arguments — LLVM
emits a 32->64 bit lift via shift-left for the pkt_ptr arg, and
the verifier prohibits arithmetic shifts on packet pointers. Pass
a scalar `ip_offset` instead and reconstruct the IP-header pointer
inside the subprogram via `ctx.data() + ip_offset`, which the
verifier tracks as a fresh packet-derived pointer with the bounds
check immediately following. Same pattern the existing `ptr_at`
helper uses.

Also drop the `let _ = proto;` workaround — proto is no longer
destructured into the outer scope, the early-return guards inside
each branch are sufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverting to #[inline(always)] for mss_clamp_inline + lookup
helpers because BPF subprograms can't take packet-pointer args.
Two earlier attempts ran into the verifier:

  - Pass `ip: *mut u8` directly: rejected
    "R1 pointer arithmetic with <<= operator prohibited"
  - Pass `ip_offset: usize` via &XdpContext: rejected
    "R2 pointer arithmetic on pkt_end prohibited" — LLVM SROA
    decomposed &XdpContext into (data, data_end) packet pointers
    behind our backs.

Inlining is the only verifier-friendly option for code that
touches the packet. The stack overflow that prompted the subprogram
attempt is mitigated by:

  - Reading src/dst addresses inside block scopes within the
    lookup helpers, not as locals in the parent frame. The Key
    locals + addr arrays are released when each block exits.
  - Letting the lookup helpers take a pointer-to-IP-header rather
    than addresses by value, so the caller doesn't pre-materialize
    them.
  - Reading the proto byte first and bailing on non-TCP, the
    common case for fast-pathed traffic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous run hit "BPF program is too large. Processed 1000001
insn" — the 0..40 loop with the match-based kind dispatch caused
combinatorial state explosion in the BPF verifier (max_states_per_insn
was 23, total_states 29302). Stack was fine (0+360+0+0).

Reduce to 0..8 iterations: real SYN packets put MSS in the first
1-4 options (Linux's tcp_options_write emits MSS very early), and
8 is comfortable headroom while keeping verifier state-space
bounded. Worst-case behavior change: SYNs that bury MSS deeper
than option-position 8 won't get clamped — that's pathological
in practice and we'd bump mss_clamp_skipped on those.

Also flatten the inner `match` to sequential `if`s (kind == 0,
kind == 1, length-prefixed) — fewer branches per iteration cuts
the state-space split factor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundled into the v0.2.4-mss-clamp PR so the release tag flow is a
single merge + tag rather than a separate release-prep PR.

Bumped: workspace Cargo.toml (0.2.3 -> 0.2.4), VERSION, both README
install snippets (v0.2.3 -> v0.2.4 for the .deb and tarball blocks).
Cargo.lock cascaded via `cargo check`. BPF crates at
crates/modules/*/bpf/ stay at 0.0.1 -- independent versioning, never
tracked the workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lunarthegrey lunarthegrey merged commit 5e41c5a into main May 5, 2026
10 checks passed
@lunarthegrey lunarthegrey deleted the v0.2.4-mss-clamp branch May 5, 2026 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant