Bugfixkptrv2 #12003

Open
RazeLighter777 wants to merge 2 commits into kernel-patches:bpf-next_base from
RazeLighter777:bugfixkptrv2

Conversation

@RazeLighter777

No description provided.

@RazeLighter777 RazeLighter777 force-pushed the bugfixkptrv2 branch 6 times, most recently from b7f1b70 to 5dfd5e2 Compare May 5, 2026 18:48
A BPF program attached to tp_btf/nmi_handler can delete map entries or
swap out referenced kptrs from NMI context. Today that runs the kptr
destructor inline. Destructors such as bpf_cpumask_release() can take
RCU-related locks, so running them from NMI can deadlock the system.

Preallocate offload jobs from the global BPF memory allocator, track the
number of live destructor-backed references so the pool stays ahead of
NMI frees, and let the worker invoke the destructor after NMI exits.

The preallocation algorithm is simple. The invariant is total >=
refs + active, where refs is the number of destructor-backed kptrs
installed in maps, active is the number of jobs currently being executed
by the irq_work worker, and total is the number of job structures
allocated. To avoid excessive preallocation calls while maintaining the
invariant, allocate the needed slots plus a small amount of extra
headroom, min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM), where
BPF_DTOR_KPTR_RESERVE_HEADROOM is 64 in this patch.
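As a rough userspace model of that computation (the standalone function
and its parameters are illustrative, not the patch's actual code, which
reads atomics and calls the global BPF memory allocator):

```c
#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64

/* Illustrative model: how many job slots to allocate so that
 * total >= refs + active still holds after installing `incoming`
 * new destructor-backed kptrs. */
static unsigned int reserve_slots(unsigned int total, unsigned int refs,
				  unsigned int active, unsigned int incoming)
{
	unsigned int required = refs + active + incoming;
	unsigned int needed = required > total ? required - total : 0;
	unsigned int headroom;

	if (!needed)
		return 0;
	/* Allocate the shortfall plus bounded headroom to amortize
	 * future reservation calls. */
	headroom = needed < BPF_DTOR_KPTR_RESERVE_HEADROOM ?
		   needed : BPF_DTOR_KPTR_RESERVE_HEADROOM;
	return needed + headroom;
}
```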

A small but harmless ordering subtlety: the active atomic is read before
refs. This can result in slight over-allocation, but the extra slots are
not leaked; they are carried into the trim stage.

The trim stage is simple. It uses a CAS loop to free excess idle job
slots: it snapshots total, refs, and active, pops an idle job if the
pool is oversized, and attempts a cmpxchg to decrement total atomically.
On failure it pushes the job back onto the idle list and retries.
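A userspace sketch of that loop, modeling the idle list and the counters
with C11 atomics (all names are illustrative; the kernel version pops
real job structures from a lock-free list and frees them):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TRIM_SLACK 64	/* assumed slack kept above refs + active */

static atomic_uint total, refs, active, idle_jobs;

/* Pop one job from the idle pool, if any. */
static bool pop_idle_job(void)
{
	unsigned int n = atomic_load(&idle_jobs);

	while (n && !atomic_compare_exchange_weak(&idle_jobs, &n, n - 1))
		;
	return n != 0;
}

static void push_idle_job(void)
{
	atomic_fetch_add(&idle_jobs, 1);
}

static void trim_pool(void)
{
	for (;;) {
		unsigned int t = atomic_load(&total);
		unsigned int keep = atomic_load(&refs) +
				    atomic_load(&active) + TRIM_SLACK;

		if (t <= keep || !pop_idle_job())
			return;
		/* Decrement total only if it still matches the
		 * snapshot; otherwise return the job and retry. */
		if (!atomic_compare_exchange_strong(&total, &t, t - 1))
			push_idle_job();
	}
}
```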

Several best-effort mitigations address the memory pressure problem,
preserving integrity in this unlikely scenario.

If reserving another offload slot fails while installing a new
destructor-backed kptr through bpf_kptr_xchg(), leave the destination
unchanged and return the incoming pointer so the caller keeps ownership.

This is superior to leaking the pointer and should only happen if the
accounting is incorrect. It is also a condition the caller can check for
and recover from.
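The contract can be sketched as a userspace model (the helper names and
the toggle flag are hypothetical; in the kernel the exchange happens on
a map value slot):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Models whether an offload job could be reserved for the new kptr;
 * flips to false under memory pressure. */
static bool reserve_ok = true;

static bool reserve_offload_slot(void)
{
	return reserve_ok;
}

/* Model of bpf_kptr_xchg() for destructor-backed kptrs: on reserve
 * failure the destination is left unchanged and the incoming pointer
 * is handed back, so the caller keeps ownership. */
static void *kptr_xchg(_Atomic(void *) *dst, void *incoming)
{
	if (incoming && !reserve_offload_slot())
		return incoming;
	return atomic_exchange(dst, incoming);
}
```

A caller can detect the failure by comparing the returned pointer with
the one it passed in, and then release the object itself.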

If NMI teardown still fails to grab an idle offload job despite the
reserve accounting, warn once and run the destructor inline rather than
leak the object permanently. In that case, repair the counter safely
with another CAS loop, preserving concurrent increments.
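That counter repair can be sketched with a CAS loop over a C11 atomic
(a model with an illustrative counter name; the kernel uses its own
atomic helpers):

```c
#include <stdatomic.h>

static atomic_uint dtor_refs;	/* models the live-reference counter */

/* Undo one reservation after an inline-dtor fallback. A plain
 * fetch_sub could underflow if the counter already hit zero; this
 * loop decrements only nonzero snapshots, so concurrent increments
 * are preserved and the counter never wraps. */
static void repair_counter_dec(void)
{
	unsigned int v = atomic_load(&dtor_refs);

	while (v && !atomic_compare_exchange_weak(&dtor_refs, &v, v - 1))
		;
}
```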

This fix comes with a small performance tradeoff for safety:
bpf_kptr_xchg() can no longer be inlined for referenced kptrs with a
destructor, as inlining would break the reference accounting. Inlining
is preserved for kptrs with no destructor defined.

This keeps refcounted kptr teardown out of NMI context without slowing
down raw kptr exchanges that never need destructor handling.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Justin Suess <utilityemal77@gmail.com>
Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
Signed-off-by: Justin Suess <utilityemal77@gmail.com>

Programs attached to tp_btf/nmi_handler can drop refcounted kptrs from
NMI context by deleting map entries or clearing map values.  Add a
dedicated BPF-side selftest program that populates hash and array maps
with cpumask kptrs and clears them again from the NMI handler.

This test fails on upstream with a lockdep warning, but passes when NMI
dtors are properly offloaded by the previous commit.

The test asserts that every object queued from NMI for destruction in
hardirq context had its dtor called on it. The irq_work, which carries
the IRQ_WORK_HARD_IRQ flag, is drained with kern_sync_rcu() to ensure
consistency.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Justin Suess <utilityemal77@gmail.com>