xdp: move init steps that require caps early in validator process start up #9133
base: master
Conversation
Codecov Report

@@            Coverage Diff            @@
##            master    #9133    +/- ##
=========================================
- Coverage     82.6%    82.6%    -0.1%
=========================================
  Files          852      852
  Lines       318162   318341    +179
=========================================
+ Hits        262963   262991     +28
- Misses       55199    55350    +151
Force-pushed from b5b9f5d to af2baa0
ok this is ready so far as i'm concerned. definitely recommend per-commit review. feel free to suggest a different batching for PRing the commits

done

please stop with the irrelevant pedantic shit. look at the pr title. that's what i'm trying to do here. if you can't stay focused i don't want your review

there is no more pedantic shit to point out so i'll focus on the code now, but it shouldn't hurt to fix typos etc at the final point 👍
gregcusack
left a comment
lgtm from a technical point of view. a few nits. but looks good. alessandro may have some comments on keeping the xdp/ generic enough for aya. thanks for doing this!
this is what we get for merging features without cleaning up our mess first
i'll give @alessandrod 'til eod tomorrow to take a look, then merge |
Pull request overview
This pull request moves XDP initialization steps that require elevated Linux capabilities to happen early in the validator process startup, before any threads are spawned. This prevents capability inheritance by child threads, reducing the attack surface and security risk.
Changes:
- Refactored XDP initialization into a builder pattern that separates capability-requiring operations from runtime operations (see the sketch after this list)
- Moved XDP initialization to happen immediately after node creation and before validator thread spawning
- Added proper capability management including clearing inheritable and permitted capability sets after initialization
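To make that split concrete, here is an illustrative-only sketch of the builder shape described in the first bullet above. The names echo the PR's XdpRetransmitBuilder/XdpRetransmitter, but the fields, methods, and the PrivilegedSocket stand-in are invented for illustration and are not the actual code:

```rust
// Illustrative sketch only: the builder does the work that needs raised
// capabilities and runs early (before any threads exist); the runtime type
// is constructed from it later, after caps have been dropped.
struct PrivilegedSocket; // stand-in for a resource that needs raised caps to create

struct XdpRetransmitBuilder {
    sockets: Vec<PrivilegedSocket>,
}

struct XdpRetransmitter {
    _threads: Vec<std::thread::JoinHandle<()>>,
}

impl XdpRetransmitBuilder {
    /// Caps phase: runs during early startup, before any other threads exist.
    fn new(num_queues: usize) -> Self {
        let sockets = (0..num_queues).map(|_| PrivilegedSocket).collect();
        Self { sockets }
    }

    /// Runtime phase: runs after capabilities have been dropped; only spawns threads.
    fn start(self) -> XdpRetransmitter {
        let _threads = self
            .sockets
            .into_iter()
            .map(|_socket| std::thread::spawn(move || { /* run the tx loop with _socket */ }))
            .collect();
        XdpRetransmitter { _threads }
    }
}
```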
Reviewed changes
Copilot reviewed 11 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| xdp/src/umem.rs | Added OwnedUmem type for memory ownership and moved methods to traits for better abstraction |
| xdp/src/tx_loop.rs | Refactored tx_loop into builder pattern (TxLoopBuilder) and runtime execution (TxLoop) |
| xdp/src/device.rs | Added Send trait implementations for thread-safe types with safety documentation |
| xdp/Cargo.toml | Removed caps dependency (moved to turbine) |
| validator/src/commands/run/execute.rs | Moved XDP initialization to early startup before thread spawning |
| validator/src/admin_rpc_service.rs | Updated test to pass new None parameter to Validator::new |
| turbine/src/xdp.rs | Split XdpRetransmitter into XdpRetransmitBuilder (caps phase) and XdpRetransmitter (runtime) |
| turbine/Cargo.toml | Added aya dependency |
| test-validator/src/lib.rs | Updated test to pass new None parameter |
| local-cluster/src/local_cluster.rs | Updated tests to pass new None parameter |
| core/src/validator.rs | Added maybe_partial_xdp_retransmitter parameter to Validator::new |
| Cargo.lock, dev-bins/Cargo.lock, programs/sbf/Cargo.lock | Updated dependency graph reflecting caps movement |
alessandrod
left a comment
Thanks for doing this work!
It was SO EASY to review since the changes are so nicely split into atomic commits. We really need to start doing this more; it saves so much time.
alexpyattaev
left a comment
We probably want to drop the caps whether or not XDP retransmit is enabled. Operators are likely to mark the binary with extra capabilities unconditionally in their build/startup scripts, and if they then don't enable XDP retransmit for whatever reason, the capabilities will never get dropped.
As a secondary layer of protection, we may also set PR_SET_NO_NEW_PRIVS to be certain (should probably be another PR though).
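A minimal sketch of what unconditional dropping could look like, assuming the caps crate and libc; this is not the PR's code, just one way to express the idea (clear every set early, then set PR_SET_NO_NEW_PRIVS as the secondary layer):

```rust
// Sketch only: unconditionally drop all capability sets for the calling
// thread early in startup, then forbid regaining privileges via execve.
use caps::CapSet;

fn drop_all_caps_and_lock() -> Result<(), Box<dyn std::error::Error>> {
    // Ambient first (no privileges required), then inheritable/effective,
    // and finally permitted, which makes the drop irreversible.
    caps::clear(None, CapSet::Ambient)?;
    caps::clear(None, CapSet::Inheritable)?;
    caps::clear(None, CapSet::Effective)?;
    caps::clear(None, CapSet::Permitted)?;

    // Secondary layer: with no_new_privs set, execve can never grant new privileges.
    // SAFETY: prctl with these arguments only touches process state.
    let ret = unsafe { libc::prctl(libc::PR_SET_NO_NEW_PRIVS, 1u64, 0u64, 0u64, 0u64) };
    if ret != 0 {
        return Err(std::io::Error::last_os_error().into());
    }
    Ok(())
}
```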
Force-pushed from 75f8b96 to 0822766
ok addressed all the relevant review comments. ci should be good to go
xdp/src/tx_loop.rs
Outdated
```rust
// since we aren't necessarily allocating from the thread that we intend to run on,
// temporarily switch to the target cpu to ensure that the region is allocated to the
// correct numa node
set_cpu_affinity([cpu_id]).unwrap();
```
we can possibly numa(7) our way out of this later. i only skimmed, but i think you can induce migration after the fact
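For reference, "inducing migration after the fact" would roughly mean mbind(2) with MPOL_MF_MOVE on the already-mapped region. This is a rough, untested sketch via the raw syscall (constants copied from numaif.h), not something this PR does:

```rust
// Rough sketch: ask the kernel to bind an existing mapping to `node` and
// migrate its resident pages there. Multi-word nodemasks are omitted.
use std::io;

const MPOL_BIND: libc::c_long = 2; // from <numaif.h>
const MPOL_MF_MOVE: libc::c_ulong = 1 << 1;

unsafe fn migrate_region_to_node(
    addr: *mut libc::c_void,
    len: usize,
    node: usize,
) -> io::Result<()> {
    let nodemask: libc::c_ulong = 1 << node;
    let ret = libc::syscall(
        libc::SYS_mbind,
        addr,
        len as libc::c_ulong,
        MPOL_BIND,
        &nodemask as *const libc::c_ulong,
        (node + 2) as libc::c_ulong, // maxnode: at least the highest node bit + 1
        MPOL_MF_MOVE,                // move pages that are already resident
    );
    if ret != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}
```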
alessandrod
left a comment
oops, one last round trip
```rust
    .cloned()
    .collect::<HashSet<_>>();
if !reserved.is_empty() {
    let available = core_affinity::get_core_ids()
```
it turns out... this returns the current affinity mask, so we have to call it before the block above, or this ends up doing set_difference([i], [..., i]) = [] and the validator doesn't start.
I've confirmed that if I move this call above the block, the validator starts and XDP seems to work correctly.
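In other words, the snapshot of available cores has to be taken before the main thread's affinity is narrowed. A hypothetical ordering sketch using the core_affinity crate (the reserved set and function name are illustrative):

```rust
// Hypothetical sketch: snapshot the affinity mask *before* any pinning, since
// core_affinity::get_core_ids() reflects the current mask, not all cores.
use std::collections::HashSet;

fn non_reserved_cores(reserved: &HashSet<usize>) -> Vec<usize> {
    // Must run before any set_cpu_affinity()-style call narrows this thread,
    // otherwise available \ reserved can collapse to the empty set.
    let available: HashSet<usize> = core_affinity::get_core_ids()
        .unwrap_or_default()
        .into_iter()
        .map(|core| core.id)
        .collect();
    available.difference(reserved).copied().collect()
}
```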
t-nelson
left a comment
ok, took care of moving the main thread back to its initial cpu
```rust
// try to allocate huge pages first, then fall back to regular pages
const HUGE_2MB: usize = 2 * 1024 * 1024;
set_cpu_affinity([cpu_id]).unwrap();
```
for the first refactor, we're still constructing the TxLoopBuilder on the target thread. set cpu affinity here, even though we'll do it again below in TxLoop::new
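For context on the hunk above, a minimal sketch of the huge-pages-then-regular-pages allocation it refers to, using a plain anonymous mmap; the real Umem allocation in the PR has additional constraints:

```rust
// Sketch only: try a 2MB-huge-page anonymous mapping first, then fall back to
// regular pages if the system has no huge pages reserved.
use std::{io, ptr};

const HUGE_2MB: usize = 2 * 1024 * 1024;

fn alloc_region(len: usize) -> io::Result<*mut libc::c_void> {
    let prot = libc::PROT_READ | libc::PROT_WRITE;
    let base = libc::MAP_PRIVATE | libc::MAP_ANONYMOUS;

    // Huge-page mappings require the length to be a multiple of the huge page size.
    let huge_len = len.next_multiple_of(HUGE_2MB);
    let ptr = unsafe { libc::mmap(ptr::null_mut(), huge_len, prot, base | libc::MAP_HUGETLB, -1, 0) };
    if ptr != libc::MAP_FAILED {
        return Ok(ptr);
    }

    // Fall back to regular pages.
    let ptr = unsafe { libc::mmap(ptr::null_mut(), len, prot, base, -1, 0) };
    if ptr == libc::MAP_FAILED {
        return Err(io::Error::last_os_error());
    }
    Ok(ptr)
}
```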
turbine/src/xdp.rs
Outdated
```rust
// since we aren't necessarily allocating from the thread that we intend to run on,
// temporarily switch to the target cpu for each TxLoop to ensure that the Umem region
// is allocated to the correct numa node
let this_cpu = get_cpu().expect("linux implements sched_getcpu");
```
next we pull TxLoopBuilder init out to the main thread: stash the initial main-thread cpu and restore it once all of the tx loops have been started
```rust
// since we aren't necessarily allocating from the thread that we intend to run on,
// temporarily switch to the target cpu for each TxLoop to ensure that the Umem region
// is allocated to the correct numa node
let this_cpu = get_cpu().expect("linux implements sched_getcpu");
let tx_loop_builders = cpus
    .into_iter()
    .zip(std::iter::repeat_with(|| tx_loop_config.clone()))
    .enumerate()
    .map(|(i, (cpu_id, config))| {
        TxLoopBuilder::new(cpu_id, QueueId(i as u64), config, &dev)
    })
    .collect::<Vec<_>>();
// migrate main thread back off of the last xdp reserved cpu
set_cpu_affinity([this_cpu]).unwrap();
```
retain the main thread cpu restoration after isolating the logic that requires raised caps
set_cpu_affinity([this_cpu]) seems incorrect? this cpu != all the possible cpus in the original mask
we haven't applied the mask yet. i moved that back
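One way to sidestep the single-cpu-vs-full-mask question would be to stash the complete original affinity mask and restore exactly that once the tx loops are up. A hypothetical libc-level sketch, not what the PR ended up doing:

```rust
// Hypothetical alternative: capture the full affinity mask before any pinning
// and restore that exact mask afterwards, instead of pinning to a single CPU.
use std::{io, mem};

fn current_affinity() -> io::Result<libc::cpu_set_t> {
    // SAFETY: cpu_set_t is a plain bitmask; all-zero is a valid initial value.
    let mut set: libc::cpu_set_t = unsafe { mem::zeroed() };
    // pid 0 == the calling thread
    if unsafe { libc::sched_getaffinity(0, mem::size_of::<libc::cpu_set_t>(), &mut set) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(set)
}

fn restore_affinity(set: &libc::cpu_set_t) -> io::Result<()> {
    if unsafe { libc::sched_setaffinity(0, mem::size_of::<libc::cpu_set_t>(), set) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}
```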
Problem
we do the xdp init steps that require capabilities quite late in validator process startup. by that point we've already spawned many threads, which will have inherited caps they do not need, posing a security risk
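As a quick illustration of the risk (assuming the caps crate; this is not code from the PR): a thread spawned while capabilities are still raised carries the same raised sets for its whole lifetime.

```rust
// Sketch: capability sets are copied from the creating thread at spawn time,
// so anything spawned before the drop keeps the raised caps until it exits.
use caps::CapSet;

fn main() {
    std::thread::spawn(|| {
        // If the main thread still held e.g. CAP_NET_RAW here, so does this thread.
        let permitted = caps::read(None, CapSet::Permitted).expect("read caps");
        println!("spawned thread permitted caps: {permitted:?}");
    })
    .join()
    .unwrap();
}
```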
Summary of Changes
Fixes: #6087
i know the new type names suck. suggestions welcome