bpf: Add support for sleepable tracepoint programs#11912

Closed
kernel-patches-daemon-bpf[bot] wants to merge 6 commits into bpf-next_base from series/1084341=>bpf-next

@kernel-patches-daemon-bpf

Pull request for series with
subject: bpf: Add support for sleepable tracepoint programs
version: 12
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1084341

mykyta5 added 6 commits April 22, 2026 08:38
Rework __bpf_trace_run() to support sleepable BPF programs by using
explicit RCU flavor selection, following the uprobe_prog_run() pattern.

For sleepable programs, use rcu_read_lock_tasks_trace() for lifetime
protection with migrate_disable(). For non-sleepable programs, use the
regular rcu_read_lock_dont_migrate().

Remove the preempt_disable_notrace/preempt_enable_notrace pair from
the faultable tracepoint BPF probe wrapper in bpf_probe.h, since
migration protection and RCU locking are now handled per-program
inside __bpf_trace_run().

Adapt bpf_prog_test_run_raw_tp() for sleepable programs: reject
BPF_F_TEST_RUN_ON_CPU since sleepable programs cannot run in hardirq
or preempt-disabled context, and call __bpf_prog_test_run_raw_tp()
directly instead of via smp_call_function_single(). Rework
__bpf_prog_test_run_raw_tp() to select RCU flavor per-program and
add per-program recursion context guard for private stack safety.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Add bpf_prog_run_array_sleepable() for running BPF program arrays
on faultable tracepoints. Unlike bpf_prog_run_array_uprobe(), it
includes per-program recursion checking for private stack safety
and hardcodes is_uprobe to false.

Skip dummy_bpf_prog at the top of the loop. When
bpf_prog_array_delete_safe() replaces a detached program with
dummy_bpf_prog on allocation failure, the dummy is statically
allocated and has NULL active, stats, and aux fields. Identify
it by prog->len == 0, since every real program has at least one
instruction.
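
The placeholder skip described above can be modelled in userspace as follows. This is only a sketch: `struct ph_prog`, `ph_dummy`, and `ph_run_array()` are hypothetical stand-ins, not the kernel's `struct bpf_prog`, `dummy_bpf_prog`, or the new helper. It shows why the `len == 0` test has to come before anything that dereferences per-program state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a program; not struct bpf_prog. */
struct ph_prog {
	unsigned int len;	/* instruction count; 0 marks the placeholder */
	int *stats;		/* NULL on the placeholder */
};

/* Statically allocated placeholder with no per-program state. */
static struct ph_prog ph_dummy = { .len = 0, .stats = NULL };

/* Walk a NULL-terminated array and return how many real programs ran. */
static unsigned int ph_run_array(struct ph_prog *const *items)
{
	unsigned int ran = 0;
	struct ph_prog *prog;

	for (; (prog = *items); items++) {
		/* Skip the placeholder before touching its NULL fields. */
		if (!prog->len)
			continue;

		assert(prog->stats);
		(*prog->stats)++;
		ran++;
	}
	return ran;
}
```

In this model a real program always has `len >= 1`, so the length check alone is enough to tell it apart from the placeholder.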

Keep bpf_prog_run_array_uprobe() unchanged for uprobe callers.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Add trace_call_bpf_faultable(), a variant of trace_call_bpf() for
faultable tracepoints that supports sleepable BPF programs. It uses
rcu_tasks_trace for lifetime protection and
bpf_prog_run_array_sleepable() for per-program RCU flavor selection,
following the uprobe_prog_run() pattern.

Restructure perf_syscall_enter() and perf_syscall_exit() to run BPF
programs before perf event processing. Previously, BPF ran after the
per-cpu perf trace buffer was allocated under preempt_disable,
requiring cleanup via perf_swevent_put_recursion_context() on filter.
Now BPF runs in faultable context before preempt_disable, reading
syscall arguments from local variables instead of the per-cpu trace
record, removing the dependency on buffer allocation. This allows
sleepable BPF programs to execute and avoids unnecessary buffer
allocation when BPF filters the event. The perf event submission
path (buffer allocation, fill, submit) remains under preempt_disable
as before. Since BPF no longer runs within the buffer allocation
context, the fake_regs output parameter to perf_trace_buf_alloc()
is no longer needed and is replaced with NULL.

Add an attach-time check in __perf_event_set_bpf_prog() to reject
sleepable BPF_PROG_TYPE_TRACEPOINT programs on non-syscall
tracepoints, since only syscall tracepoints run in faultable context.

This prepares the classic tracepoint runtime and attach paths for
sleepable programs. The verifier changes to allow loading sleepable
BPF_PROG_TYPE_TRACEPOINT programs are in a subsequent patch.

To: Peter Zijlstra <peterz@infradead.org>
To: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> # for BPF bits
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Allow BPF_PROG_TYPE_RAW_TRACEPOINT, BPF_PROG_TYPE_TRACEPOINT, and
BPF_TRACE_RAW_TP (tp_btf) programs to be sleepable by adding them
to can_be_sleepable().

For BTF-based raw tracepoints (tp_btf), add a load-time check in
bpf_check_attach_target() that rejects sleepable programs attaching
to non-faultable tracepoints with a descriptive error message.

For raw tracepoints (raw_tp), add an attach-time check in
bpf_raw_tp_link_attach() that rejects sleepable programs on
non-faultable tracepoints. The attach-time check is needed because
the tracepoint name is not known at load time for raw_tp.

The attach-time check for classic tracepoints (tp) in
__perf_event_set_bpf_prog() was added in the previous patch.

Replace the verbose error message that enumerates allowed program
types with a generic "Program of this type cannot be sleepable"
message, since the list of sleepable-capable types keeps growing.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Add SEC_DEF entries for sleepable tracepoint variants:
  - "tp_btf.s+"     for sleepable BTF-based raw tracepoints
  - "raw_tp.s+"     for sleepable raw tracepoints
  - "raw_tracepoint.s+" (alias)
  - "tp.s+"         for sleepable classic tracepoints
  - "tracepoint.s+" (alias)

Extract sec_name_match_prefix() to share the prefix matching logic
between attach_tp() and attach_raw_tp(), eliminating duplicated
loops and hardcoded strcmp() checks for bare section names.
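
For illustration, shared prefix matching of this kind might look like the sketch below. The function name `sketch_sec_name_match_prefix()`, its signature, its return convention, and the `sketch_tp_prefixes` table are assumptions made for this example and are not taken from the libbpf patch. It treats a bare section name and a `prefix/target` name as the two valid shapes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Example prefix table (assumed for illustration). */
static const char *const sketch_tp_prefixes[] = {
	"tp", "tracepoint", "tp.s", "tracepoint.s",
};

/*
 * Hypothetical helper, not libbpf's actual code.
 *
 * Returns true if sec_name is exactly one of the prefixes (bare form,
 * *rest set to NULL) or "<prefix>/<rest>" (*rest points past the slash).
 */
static bool sketch_sec_name_match_prefix(const char *sec_name,
					 const char *const *prefixes,
					 size_t n, const char **rest)
{
	assert(sec_name && prefixes && rest);

	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(prefixes[i]);

		if (strncmp(sec_name, prefixes[i], len) != 0)
			continue;
		if (sec_name[len] == '\0') {
			/* Bare name such as "tp.s": no attach target given. */
			*rest = NULL;
			return true;
		}
		if (sec_name[len] == '/') {
			*rest = sec_name + len + 1;
			return true;
		}
		/* e.g. "tp.s/..." against prefix "tp": try the next prefix. */
	}
	return false;
}
```

Requiring the character after the prefix to be either the end of the string or a slash keeps "tp" from matching "tp_btf.s/..." or "tp.s/...".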

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Cover all three sleepable tracepoint types (tp_btf.s, raw_tp.s, tp.s)
and sys_exit (via bpf_task_pt_regs) with functional tests using
bpf_copy_from_user() on getcwd. Verify alias and bare SEC variants,
bpf_prog_test_run_raw_tp() with BPF_F_TEST_RUN_ON_CPU rejection,
attach-time rejection on non-faultable tracepoints, and load-time
rejection for sleepable tp_btf on non-faultable tracepoints.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
@kernel-patches-daemon-bpf

Author

Upstream branch: 9012cf2
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1084341
version: 12

@kernel-patches-review-bot

commit 0d5a925cfda13033de1591b937dafdfc783846fe
Author: Mykyta Yatsenko <yatsenko@meta.com>

bpf: Add bpf_prog_run_array_sleepable()

Add bpf_prog_run_array_sleepable() for running BPF program arrays
on faultable tracepoints. Unlike bpf_prog_run_array_uprobe(), it
includes per-program recursion checking for private stack safety
and hardcodes is_uprobe to false.

Skip dummy_bpf_prog at the top of the loop. When
bpf_prog_array_delete_safe() replaces a detached program with
dummy_bpf_prog on allocation failure, the dummy is statically
allocated and has NULL active, stats, and aux fields. Identify
it by prog->len == 0, since every real program has at least one
instruction.

Keep bpf_prog_run_array_uprobe() unchanged for uprobe callers.

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -3079,6 +3079,56 @@ void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
>  void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr);
>  void bpf_prog_report_arena_violation(bool write, unsigned long addr, unsigned long fault_ip);
>
> +static __always_inline u32
> +bpf_prog_run_array_sleepable(const struct bpf_prog_array *array,
> +			     const void *ctx, bpf_prog_run_fn run_prog)
> +{
> +	const struct bpf_prog_array_item *item;
> +	struct bpf_prog *prog;
> +	struct bpf_run_ctx *old_run_ctx;
> +	struct bpf_trace_run_ctx run_ctx;
> +	u32 ret = 1;
> +
> +	if (unlikely(!array))
> +		return ret;
> +
> +	migrate_disable();
> +
> +	run_ctx.is_uprobe = false;
> +
> +	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
> +	item = &array->items[0];
> +	while ((prog = READ_ONCE(item->prog))) {
> +		/* Skip dummy_bpf_prog placeholder (len == 0) */
> +		if (unlikely(!prog->len)) {
> +			item++;
> +			continue;
> +		}
> +
> +		if (unlikely(!bpf_prog_get_recursion_context(prog))) {
> +			bpf_prog_inc_misses_counter(prog);
> +			bpf_prog_put_recursion_context(prog);
> +			item++;
> +			continue;
> +		}
> +
> +		run_ctx.bpf_cookie = item->bpf_cookie;
> +
> +		if (prog->sleepable) {
> +			guard(rcu)();
> +			ret &= run_prog(prog, ctx);
> +		} else {
> +			ret &= run_prog(prog, ctx);
> +		}

The sleepable/non-sleepable branches appear inverted relative to
bpf_prog_run_array_uprobe(). In the uprobe sibling at
include/linux/bpf.h:2491-2526, classic RCU is acquired only for
non-sleepable programs:

  while ((prog = READ_ONCE(item->prog))) {
      if (!prog->sleepable)
          rcu_read_lock();
      run_ctx.bpf_cookie = item->bpf_cookie;
      ret &= run_prog(prog, ctx);
      item++;
      if (!prog->sleepable)
          rcu_read_unlock();
  }

Here, the condition takes classic RCU (guard(rcu)() expands to
rcu_read_lock()/rcu_read_unlock() via DEFINE_LOCK_GUARD_0 in
include/linux/rcupdate.h:1181) around sleepable programs while taking no
RCU read-side lock for non-sleepable ones.

Sleepable BPF programs are designed to fault and sleep. Executing them
inside a classic RCU read-side critical section violates the RCU
requirement that classic readers must not block. On CONFIG_PREEMPT_RCU=n
kernels this can cause RCU stalls.

Meanwhile, non-sleepable BPF programs often rely on structures protected by
classic RCU, such as map elements. Invoking them without a classic RCU
read-side critical section leaves those accesses unprotected. The caller added
in the subsequent commit 864dee5d55d1 holds only guard(rcu_tasks_trace)(),
so no outer classic RCU read-side section covers them.

Should the condition be inverted to match the uprobe pattern?
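
As a rough userspace model of the uprobe-style selection, the sketch below takes the classic read-side lock only around non-sleepable programs. Everything in it (`struct model_prog`, `model_rcu_read_lock()`, `model_run_array()`) is a hypothetical stand-in: a counter plays the role of the RCU nesting depth and no real kernel API is used. Whether v12 intends this ordering is for the author to confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's real definitions. */
struct model_prog {
	bool sleepable;
	int rcu_depth_seen;	/* classic-RCU nesting observed while "running" */
};

static int model_rcu_depth;

static void model_rcu_read_lock(void)
{
	model_rcu_depth++;
}

static void model_rcu_read_unlock(void)
{
	assert(model_rcu_depth > 0);
	model_rcu_depth--;
}

/* Record the nesting depth the program would have run under. */
static unsigned int model_run_prog(struct model_prog *prog)
{
	prog->rcu_depth_seen = model_rcu_depth;
	return 1;
}

/* NULL-terminated array, mirroring how the quoted loop walks its items. */
static unsigned int model_run_array(struct model_prog **progs)
{
	unsigned int ret = 1;
	struct model_prog *prog;

	while ((prog = *progs++)) {
		if (!prog->sleepable) {
			/* Non-sleepable: classic read-side section. */
			model_rcu_read_lock();
			ret &= model_run_prog(prog);
			model_rcu_read_unlock();
		} else {
			/* Sleepable: no classic section, so it may block. */
			ret &= model_run_prog(prog);
		}
	}
	return ret;
}
```

In this model a non-sleepable program observes a nesting depth of 1 and a sleepable one observes 0, which is the property the uprobe loop quoted earlier provides.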

This was raised by Alexei Starovoitov in v11 review:
https://lore.kernel.org/bpf/DHZ4I3RWVVCU.16XRZX1HPFI8K@gmail.com/

Alexei suggested in his April 21st comment:

  "Why grab rcu_read_lock() and undo it?
  imo it would be cleaner and faster to do
  bpf_prog_get_recursion_context() here ...

  ... and then here:
  if (!prog->sleepable) {
    guard(rcu)();
    ret &= run_prog(prog, ctx);
  } else {
    ret &= run_prog(prog, ctx);
  }"

Was this feedback addressed in v12?

> +
> +		bpf_prog_put_recursion_context(prog);
> +		item++;
> +	}
> +	bpf_reset_run_ctx(old_run_ctx);
> +	migrate_enable();
> +	return ret;
> +}
> +
>  #else /* !CONFIG_BPF_SYSCALL */
>  static inline struct bpf_prog *bpf_prog_get(u32 ufd)
>  {



AI reviewed your patch. Please fix the bug or reply by email explaining why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: Add bpf_prog_run_array_sleepable()
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24787752194

kernel-patches-daemon-bpf[bot] deleted the series/1084341=>bpf-next branch on April 24, 2026 20:11