
bpf: Fix deadlock in kptr dtor in nmi #12023

Open
kernel-patches-daemon-bpf[bot] wants to merge 2 commits into bpf-next_base from series/1089971=>bpf-next

Conversation

@kernel-patches-daemon-bpf

Pull request for series with
subject: bpf: Fix deadlock in kptr dtor in nmi
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1089971

A BPF program attached to tp_btf/nmi_handler can delete map entries or
swap out referenced kptrs from NMI context. Today that runs the kptr
destructor inline. Destructors such as bpf_cpumask_release() can take
RCU-related locks, so running them from NMI can deadlock the system.

Preallocate offload jobs from the global BPF memory allocator, track the
number of live destructor-backed references so the pool stays ahead of
NMI frees, and let the worker invoke the destructor after NMI exits.

The preallocation algorithm is simple. The invariant is total >=
refs + active, where refs is the number of referenced kptrs installed
in maps, active is the number of jobs being executed by the irq_work
worker, and total is the number of job structures allocated. To avoid
excessive preallocation calls while maintaining the invariant, allocate
the needed slots plus a small amount of headroom,
min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM), where
BPF_DTOR_KPTR_RESERVE_HEADROOM is 64 in this patch.

A small but harmless ordering subtlety: the active atomic is read
before refs. This can cause slight over-allocation, but the excess is
not leaked and is properly carried into the trim stage.
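
For illustration, the reserve path described above might be shaped
roughly like the sketch below. It reuses the counters and helpers that
appear in the diff further down (bpf_dtor_kptr_refs/active/total,
bpf_dtor_kptr_push_idle(), bpf_global_ma) and the bpf_kptr_offload_inc()
name referenced in the patch comments, but it is an illustration of the
invariant, not the literal patch code.

  static int bpf_kptr_offload_inc(void)
  {
          struct bpf_dtor_kptr_work *job;
          long needed, extra;

          atomic_long_inc(&bpf_dtor_kptr_refs);

          /* active is read before refs; a stale read only over-allocates. */
          needed = atomic_long_read(&bpf_dtor_kptr_active) +
                   atomic_long_read(&bpf_dtor_kptr_refs) -
                   atomic_long_read(&bpf_dtor_kptr_total);
          if (needed <= 0)
                  return 0;

          /* Grow by the shortfall plus bounded headroom. */
          extra = min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM);
          for (needed += extra; needed > 0; needed--) {
                  job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
                  if (!job)
                          break;
                  atomic_long_inc(&bpf_dtor_kptr_total);
                  bpf_dtor_kptr_push_idle(job);
          }

          /* Fail only if the pool still cannot cover the new reference. */
          if (atomic_long_read(&bpf_dtor_kptr_total) <
              atomic_long_read(&bpf_dtor_kptr_refs) +
              atomic_long_read(&bpf_dtor_kptr_active)) {
                  atomic_long_dec(&bpf_dtor_kptr_refs);
                  return -ENOMEM;
          }
          return 0;
  }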

The trim stage is simple. It uses a CAS loop to free excess idle job
slots: it snapshots total, refs, and active, pops an idle job if the
pool is too large, and attempts a cmpxchg to decrement total
atomically. On failure, it pushes the job back onto the idle list and
retries.

Several best-effort mitigations address memory pressure while
preserving integrity under this unlikely scenario.

If reserving another offload slot fails while installing a new
destructor-backed kptr through bpf_kptr_xchg(), leave the destination
unchanged and return the incoming pointer so the caller keeps ownership.

This is superior to leaking the pointer, and should only happen if the
accounting is incorrect. Moreover, this is a condition the caller can
check for and recover from.
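
Concretely, the destination-unchanged behavior might look like the
sketch below. The bpf_ref_kptr_xchg() and bpf_kptr_offload_inc() names
are the ones the diff's comments refer to; the body is illustrative
only and omits the accounting for the pointer being swapped out.

  void *bpf_ref_kptr_xchg(void **ptr, void *new_ptr)
  {
          /*
           * Installing a destructor-backed kptr must reserve an offload
           * slot first. If that fails, leave *ptr untouched and hand
           * new_ptr back so the caller keeps ownership and can recover.
           */
          if (new_ptr && bpf_kptr_offload_inc())
                  return new_ptr;

          return (void *)xchg((unsigned long *)ptr, (unsigned long)new_ptr);
  }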

If NMI teardown still fails to grab an idle offload job despite that
reserve accounting, warn once and run the destructor inline rather than
leak the object permanently. Attempt to repair the counter safely with
another CAS loop in that case, preserving concurrent increments.
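
The counter repair could be a CAS loop along these lines (hypothetical
helper name; the patch may structure it differently), dropping one
stale reservation without driving the counter negative while still
observing concurrent increments:

  static void bpf_dtor_kptr_refs_repair(void)
  {
          long old = atomic_long_read(&bpf_dtor_kptr_refs);

          /*
           * try_cmpxchg() reloads 'old' on failure, so concurrent
           * increments are observed rather than overwritten.
           */
          do {
                  if (old <= 0)
                          return; /* nothing to repair */
          } while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &old, old - 1));
  }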

This fix comes with a small performance tradeoff for safety: xchg can
no longer be inlined for referenced kptrs, as inlining would break the
reference counting. Inlining is preserved for kptrs with no destructor
defined.

This keeps refcounted kptr teardown out of NMI context without slowing
down raw kptr exchanges that never need destructor handling.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Justin Suess <utilityemal77@gmail.com>
Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
Programs attached to tp_btf/nmi_handler can drop refcounted kptrs from
NMI context by deleting map entries or clearing map values.  Add a
dedicated BPF-side selftest program that populates hash and array maps
with cpumask kptrs and clears them again from the NMI handler.

This test fails on upstream with a lockdep warning, but passes once NMI
dtors are properly offloaded by the previous commit.

The test asserts that every object queued for destruction in hardirq
from NMI had its dtor called. The irq_work, which carries the
IRQ_WORK_HARD_IRQ flag, is drained with kern_sync_rcu to ensure
consistency.
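
The BPF side of such a test could look roughly like the following (map
layout, section name, and identifiers are illustrative, not the literal
selftest):

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct map_value {
          struct bpf_cpumask __kptr *mask;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 16);
          __type(key, u32);
          __type(value, struct map_value);
  } nmi_hash SEC(".maps");

  SEC("tp_btf/nmi_handler")
  int BPF_PROG(nmi_drop_kptrs)
  {
          u32 key = 0;

          /*
           * Deleting the entry frees its fields, including the cpumask
           * kptr, while still in NMI context: the path being exercised.
           */
          bpf_map_delete_elem(&nmi_hash, &key);
          return 0;
  }

  char _license[] SEC("license") = "GPL";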

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
@kernel-patches-daemon-bpf
Author

Upstream branch: 2ca6723
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1089971
version: 2

@kernel-patches-review-bot

commit d43c387f1e2c5580e6109169fe1a51bec39bce67
Author: Justin Suess <utilityemal77@gmail.com>

bpf: Offload kptr destructors that run from NMI

A BPF program attached to tp_btf/nmi_handler can delete map entries or
swap out referenced kptrs from NMI context. Today that runs the kptr
destructor inline. Destructors such as bpf_cpumask_release() can take
RCU-related locks, so running them from NMI can deadlock the system.

Preallocate offload jobs from the global BPF memory allocator, track the
number of live destructor-backed references so the pool stays ahead of
NMI frees, and let the worker invoke the destructor after NMI exits.

Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/

This commit fixes a bug but lacks a Fixes: tag. Suggested:

Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3b1f0ba02f61..162bfd4796ea 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c

[ ... ]

> @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
>  static DEFINE_IDR(link_idr);
>  static DEFINE_SPINLOCK(link_idr_lock);
>
> +struct bpf_dtor_kptr_work {
> +	struct llist_node node;
> +	void *obj;
> +	btf_dtor_kfunc_t dtor;
> +};
> +
> +/* Queue pending dtors per CPU; the idle pool stays global. */
> +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
> +static LLIST_HEAD(bpf_dtor_kptr_idle);
> +/* Keep total >= refs + active so NMI frees never need to allocate. */
> +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
> +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
> +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
> +
> +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
> +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
> +
> +static void bpf_dtor_kptr_worker(struct irq_work *work);
> +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
> +	IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
> +
> +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
> +{
> +	llist_add(&job->node, &bpf_dtor_kptr_idle);
> +}
> +
> +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> +{
> +	struct llist_node *node;
> +
> +	node = llist_del_first(&bpf_dtor_kptr_idle);
> +	if (!node)
> +		return NULL;
> +
> +	return llist_entry(node, struct bpf_dtor_kptr_work, node);
> +}

Does bpf_dtor_kptr_pop_idle() handle multiple concurrent consumers
safely?

The documentation for llist_del_first() in lib/llist.c explicitly
states:

  Only one llist_del_first user can be used simultaneously with multiple
  llist_add users without lock. Because otherwise llist_del_first,
  llist_add, llist_add (or llist_del_all, llist_add, llist_add)
  sequence in another user may change @head->first->next, but keep
  @head->first.

This is the classic ABA race that llist_del_first cannot defend against.

In this patch, bpf_dtor_kptr_pop_idle() has multiple concurrent
consumers:

  1. bpf_dtor_kptr_offload() called from NMI context on any CPU
     (the function installed below, called from bpf_obj_free_fields
     when in_nmi()).

  2. bpf_dtor_kptr_trim() called from the per-CPU irq_work worker on
     any CPU (called from bpf_dtor_kptr_worker below).

Both paths call bpf_dtor_kptr_pop_idle() which calls
llist_del_first(&bpf_dtor_kptr_idle) on the global idle list. Two or
more CPUs can be popping simultaneously with no external lock and no
single-consumer guarantee.

Can the following race timeline produce list corruption or
double-ownership?

Initial state: idle list = A -> B -> C -> NULL

CPU0 (NMI): enters bpf_dtor_kptr_pop_idle
  entry = head->first = A
  next  = A->next     = B
  (stalls before cmpxchg)

CPU1 (NMI/worker): runs bpf_dtor_kptr_pop_idle
  entry = A, next = B, cmpxchg(&head->first, A, B) succeeds.
  Now idle = B -> C -> NULL. CPU1 uses A (for offload or mem_free).

CPU1: pops again, takes B. idle = C -> NULL.

CPU1 eventually runs worker -> push_idle(A). Now idle = A -> C -> NULL.
   A->next = C (not B anymore).

CPU0 resumes:
  Its cached next variable is still B, cmpxchg(&head->first, A, B)
  finds head->first is currently A, so the cmpxchg succeeds.
  Result: head->first = B, but B has been popped and is in CPU1's
  per-CPU bpf_dtor_kptr_jobs list already (not free).

Node B is now simultaneously:
  (a) considered removed from idle (CPU1 popped it and is using it)
  (b) referenced as idle->first by the head pointer

Another pop_idle() may grab B again as if it were idle, while it is
active in a per-CPU jobs list. Then bpf_dtor_kptr_worker will execute
job->dtor() on already-freed/reused job->obj, or the same job is pushed
into two lists, corrupting llist_node chains.

Concurrency is not hypothetical here. A BPF tp_btf/nmi_handler program
calling bpf_map_delete_elem can fire on multiple CPUs simultaneously.
The irq_work worker also runs on any CPU that queued work. Thus both
bpf_dtor_kptr_offload and bpf_dtor_kptr_trim can run concurrently on
different CPUs, giving at least 2 concurrent llist_del_first consumers.

Typical fixes require either:
  - using a locked list (spinlock + list_head), or
  - using llist_del_all() with batch processing, or
  - a cmpxchg-based stack that properly handles ABA (e.g., using a
    tagged pointer).
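
For illustration, the first option could look like the sketch below,
assuming the llist_node in struct bpf_dtor_kptr_work is replaced with a
list_head member. Because the pop also runs in NMI context, the NMI
path would have to use trylock and treat contention like pool
exhaustion (falling back to the WARN + inline-dtor path):

  static LIST_HEAD(bpf_dtor_kptr_idle_list);
  static DEFINE_RAW_SPINLOCK(bpf_dtor_kptr_idle_lock);

  static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle_locked(void)
  {
          struct bpf_dtor_kptr_work *job = NULL;
          unsigned long flags;

          /* Never spin in NMI; contention looks like an empty pool. */
          if (!raw_spin_trylock_irqsave(&bpf_dtor_kptr_idle_lock, flags))
                  return NULL;

          if (!list_empty(&bpf_dtor_kptr_idle_list)) {
                  job = list_first_entry(&bpf_dtor_kptr_idle_list,
                                         struct bpf_dtor_kptr_work, list);
                  list_del(&job->list);
          }
          raw_spin_unlock_irqrestore(&bpf_dtor_kptr_idle_lock, flags);
          return job;
  }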

> +
> +static void bpf_dtor_kptr_trim(void)
> +{
> +	struct bpf_dtor_kptr_work *job;
> +	long total;
> +	long needed;
> +
> +	for (;;) {
> +		total = atomic_long_read(&bpf_dtor_kptr_total);
> +		needed = atomic_long_read(&bpf_dtor_kptr_refs) +
> +			 atomic_long_read(&bpf_dtor_kptr_active);
> +		if (total <= needed)
> +			return;
> +
> +		job = bpf_dtor_kptr_pop_idle();
> +		if (!job)
> +			return;
> +
> +		if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
> +			bpf_dtor_kptr_push_idle(job);
> +			continue;
> +		}
> +
> +		bpf_mem_free(&bpf_global_ma, job);
> +	}
> +}

[ ... ]

> +static void bpf_dtor_kptr_worker(struct irq_work *work)
> +{
> +	struct llist_node *jobs, *node, *next;
> +
> +	jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
> +	llist_for_each_safe(node, next, jobs) {
> +		struct bpf_dtor_kptr_work *job;
> +
> +		job = llist_entry(node, struct bpf_dtor_kptr_work, node);
> +		job->dtor(job->obj);
> +		atomic_long_dec(&bpf_dtor_kptr_active);
> +		bpf_dtor_kptr_push_idle(job);
> +	}
> +
> +	bpf_dtor_kptr_trim();
> +}
> +
> +static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
> +{
> +	struct bpf_dtor_kptr_work *job;
> +
> +	atomic_long_inc(&bpf_dtor_kptr_active);
> +	job = bpf_dtor_kptr_pop_idle();
> +	if (WARN_ON_ONCE(!job)) {
> +		atomic_long_dec(&bpf_dtor_kptr_active);
> +		/*
> +		 * This should stay unreachable if reserve accounting is correct. If it
> +		 * ever breaks, running the destructor unsafely is still better than
> +		 * leaking the object permanently.
> +		 */
> +		dtor(obj);
> +		return;
> +	}
> +
> +	job->obj = obj;
> +	job->dtor = dtor;
> +	if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
> +		irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
> +}
> +
>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  {
>  	const struct btf_field *fields;
> @@ -842,6 +1010,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  		xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>  		if (!xchgd_field)
>  			break;
> +		if (in_nmi() && field->kptr.dtor) {
> +			bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
> +			bpf_kptr_offload_dec();
> +			break;
> +		}
> +		if (field->kptr.dtor)
> +			/*
> +			 * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
> +			 * pairs installation with bpf_kptr_offload_inc(). Drop that
> +			 * reservation on non-NMI teardown once no active transition is
> +			 * needed.
> +			 */
> +			bpf_kptr_offload_dec();
>
>  		if (!btf_is_kernel(field->kptr.btf)) {
>  			pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,



AI reviewed your patch. Please fix the bug or reply by email explaining why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: Offload kptr destructors that run from NMI
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25385531271


