uprobes/x86: Fix red zone issue for optimized uprobes #7860
kernel-patches-daemon-bpf-rc[bot] wants to merge 14 commits into
|
Upstream branch: 8496d90 |
In the unregister path we use the __in_uprobe_trampoline check with current->mm for the VMA lookup, which is wrong, because we are in the tracer context, not in the traced process.
Add an mm_struct pointer argument to __in_uprobe_trampoline and change the related callers to pass the proper mm_struct pointer.
Fixes: ba2bfc9 ("uprobes/x86: Add support to optimize uprobes")
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Remove the struct uprobe_trampoline object and its tracking code, because it's not needed. We can do the same thing directly on top of struct vm_area_struct objects.
This makes the code simpler and allows easy propagation of the trampoline vma object into the child process in the following change.
Note the original code called destroy_uprobe_trampoline if the optimization failed, but it only freed the struct uprobe_trampoline object, not the vma. The new vma leak is fixed in the following change.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
When we do fork or clone without CLONE_VM, the new process won't have the uprobe trampoline vma objects, but it will still have optimized code calling that trampoline, so it will crash.
Fix this by allowing uprobe trampoline vma objects to be copied to the new process on fork.
Fixes: ba2bfc9 ("uprobes/x86: Add support to optimize uprobes")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
In case the optimization fails, we leak the newly created trampoline vma mapping (in case we just created it), so let's unmap it.
Fixes: ba2bfc9 ("uprobes/x86: Add support to optimize uprobes")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Andrii reported an issue with optimized uprobes [1]: the call instruction
stores its return address on the stack and can clobber the red zone,
where user code may keep temporary data without adjusting rsp.
Fix this by moving the optimized uprobes on top of a 10-byte nop
instruction, so we can squeeze in another instruction to escape the
red zone before doing the call, like:
lea -0x80(%rsp), %rsp
call tramp
Note the lea instruction is used to adjust the rsp register without
changing the flags.
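The effect of the extra lea can be illustrated with a toy model of the stack. This is only a sketch and not kernel code: struct toy_stack, toy_call(), toy_skip_red_zone() and red_zone_survives() are hypothetical names made up for this illustration, and the "stack" is a plain byte array.

```c
#include <stdint.h>
#include <string.h>

#define RED_ZONE 128   /* x86-64 red zone size in bytes */
#define STACK_SZ 512

/* Toy stack: 'rsp' is an index into mem[], the stack grows downward. */
struct toy_stack {
	uint8_t mem[STACK_SZ];
	size_t rsp;
};

/* Model of CALL: decrement rsp by 8 and store the return address there. */
static void toy_call(struct toy_stack *s, uint64_t ret_addr)
{
	s->rsp -= 8;
	memcpy(&s->mem[s->rsp], &ret_addr, sizeof(ret_addr));
}

/* Model of "lea -0x80(%rsp), %rsp": step over the red zone. */
static void toy_skip_red_zone(struct toy_stack *s)
{
	s->rsp -= RED_ZONE;
}

/* Returns 1 if the 8 bytes at -8(%rsp) survive the probe entry sequence. */
static int red_zone_survives(int use_lea)
{
	const uint64_t user_val = 0x1122334455667788ULL;
	struct toy_stack s;
	uint64_t readback;
	size_t rsp0 = 256;

	memset(s.mem, 0, sizeof(s.mem));
	s.rsp = rsp0;
	/* User code keeps a temporary in the red zone at -8(%rsp). */
	memcpy(&s.mem[rsp0 - 8], &user_val, sizeof(user_val));

	if (use_lea)
		toy_skip_red_zone(&s);
	toy_call(&s, 0xdeadbeefULL);

	memcpy(&readback, &s.mem[rsp0 - 8], sizeof(readback));
	return readback == user_val;
}
```

In this model red_zone_survives(0) reports the temporary as clobbered by the pushed return address, while red_zone_survives(1) reports it intact, because the return address lands below the 128-byte red zone.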
We use nop10 and the following transformation to the optimized
instructions above and back, as suggested by Peterz [2].
Optimize path (int3_update_optimize):
1) Initial state after set_swbp() installed the uprobe:
cc 2e 0f 1f 84 00 00 00 00 00
From offset 0 this is INT3 followed by the tail of the original
10-byte NOP.
After a previous unoptimization bytes 5..9 may still contain the
old call instruction, which remains valid for threads already there.
2) Rewrite the LEA tail and call displacement:
cc [8d 64 24 80 e8 d0 d1 d2 d3]
From offset 0 this traps on the uprobe INT3. Bytes 1..9 are not
executable entry points while byte 0 is trapped.
3) Publish the first LEA byte:
[48] 8d 64 24 80 e8 d0 d1 d2 d3
From offset 0 this is:
lea -0x80(%rsp), %rsp
call <uprobe-trampoline>
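The byte sequence written by the optimize steps above can be sketched in user-space C as follows. This is a hedged illustration, not the kernel implementation: encode_lea_call() and optimize_in_place() are hypothetical helpers, the buffer is ordinary memory, a little-endian host is assumed, and the cross-CPU synchronization a real text patching path needs between steps is omitted.

```c
#include <stdint.h>
#include <string.h>

#define NOP10_SIZE 10

/*
 * Fill buf with "lea -0x80(%rsp), %rsp" (48 8d 64 24 80) followed by
 * "call rel32" (e8 + 4 bytes). rel32 is relative to the end of the
 * CALL, which is addr + 10.
 */
static int encode_lea_call(uint8_t buf[NOP10_SIZE], uint64_t addr,
			   uint64_t tramp)
{
	static const uint8_t lea[5] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
	int64_t delta = (int64_t)(tramp - (addr + NOP10_SIZE));
	int32_t rel32;

	if (delta < INT32_MIN || delta > INT32_MAX)
		return -1;	/* trampoline not reachable with rel32 */
	rel32 = (int32_t)delta;

	memcpy(buf, lea, sizeof(lea));
	buf[5] = 0xe8;
	memcpy(&buf[6], &rel32, sizeof(rel32)); /* little-endian assumed */
	return 0;
}

/* Replay steps 2) and 3) on a buffer that starts in state 1). */
static int optimize_in_place(uint8_t insn[NOP10_SIZE], uint64_t addr,
			     uint64_t tramp)
{
	uint8_t want[NOP10_SIZE];

	if (insn[0] != 0xcc || encode_lea_call(want, addr, tramp))
		return -1;

	/* Step 2: rewrite bytes 1..9 while byte 0 still traps. */
	memcpy(&insn[1], &want[1], NOP10_SIZE - 1);
	/* Step 3: publish the first LEA byte. */
	insn[0] = want[0];
	return 0;
}
```

For example, with a probe at the made-up address 0x1000 and a trampoline at 0x2000, the displacement is 0x2000 - 0x100a = 0xff6, so the buffer ends up as 48 8d 64 24 80 e8 f6 0f 00 00.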
Unoptimize path (int3_update_unoptimize):
1) Initial optimized state:
48 8d 64 24 80 e8 d0 d1 d2 d3
Same as 3) above.
2) Trap new entries before restoring the NOP bytes:
[cc] 8d 64 24 80 e8 d0 d1 d2 d3
From offset 0 this traps. A thread that had already executed the
LEA can still reach the intact CALL at offset 5.
3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
and byte 5 as CALL.
cc [2e 0f 1f 84] e8 d0 d1 d2 d3
From offset 0 this still traps. Offset 5 is still the CALL for any
thread that was already past the first LEA byte.
4) Publish the first byte of the original NOP:
[66] 2e 0f 1f 84 e8 d0 d1 d2 d3
From offset 0 this is the restored 10-byte NOP; the CALL opcode and
displacement are now only NOP operands. Offset 5 still decodes as
CALL for a thread that was already there.
There is only a single target uprobe-trampoline for the given nop10
instruction address, so the CALL instruction will not be changed across
unoptimization/optimization cycles.
Therefore, any task that is preempted at the CALL instruction is guaranteed
to observe that CALL and not anything else.
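The invariant behind the unoptimize steps can be checked with a small user-space replay. This is a sketch under stated assumptions, not the kernel code: classify_entry(), step_ok() and unoptimize_in_place() are hypothetical helpers operating on an ordinary byte buffer, and the synchronization between steps is left out. After every single write, a thread entering at offset 0 sees either INT3, the full LEA + CALL, or the NOP, and the five bytes at offset 5 never change.

```c
#include <stdint.h>
#include <string.h>

#define NOP10_SIZE 10

static const uint8_t nop10[NOP10_SIZE] = {
	0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
};
/* lea -0x80(%rsp), %rsp followed by the CALL opcode */
static const uint8_t lea_call[6] = { 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8 };

enum entry { ENTRY_TRAP, ENTRY_LEA_CALL, ENTRY_NOP, ENTRY_OTHER };

/* What a thread entering at offset 0 would execute. */
static enum entry classify_entry(const uint8_t *insn)
{
	if (insn[0] == 0xcc)
		return ENTRY_TRAP;
	if (!memcmp(insn, lea_call, sizeof(lea_call)))
		return ENTRY_LEA_CALL;
	if (!memcmp(insn, nop10, 5))	/* prefixes, escape, opcode, ModRM */
		return ENTRY_NOP;
	return ENTRY_OTHER;
}

/* Offset 0 must stay decodable and the CALL at offset 5 must be intact. */
static int step_ok(const uint8_t *insn, const uint8_t *call)
{
	return classify_entry(insn) != ENTRY_OTHER &&
	       !memcmp(&insn[5], call, 5);
}

/* Replay steps 2) to 4), checking the invariant after each write. */
static int unoptimize_in_place(uint8_t insn[NOP10_SIZE])
{
	uint8_t call[5];

	if (classify_entry(insn) != ENTRY_LEA_CALL)
		return -1;
	memcpy(call, &insn[5], sizeof(call));

	insn[0] = 0xcc;				/* step 2: trap new entries */
	if (!step_ok(insn, call))
		return -1;
	memcpy(&insn[1], &nop10[1], 4);		/* step 3: restore bytes 1..4 */
	if (!step_ok(insn, call))
		return -1;
	insn[0] = nop10[0];			/* step 4: publish 0x66 */
	if (!step_ok(insn, call))
		return -1;
	return 0;
}
```

Starting from 48 8d 64 24 80 e8 d0 d1 d2 d3, the replay ends with 66 2e 0f 1f 84 e8 d0 d1 d2 d3, which matches state 4) above.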
Note that, as explained in [2], we need to use the following nop10:
         PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
  NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
which means we need to allow the 0x2e prefix, which maps to the
INAT_PFX_CS attribute, in the is_prefix_bad function.
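To see how the field breakdown above adds up to 10 bytes, here is a minimal length decoder for the multi-byte NOP form (0f 1f /0). It is only an illustration: is_legacy_prefix() and nopl_length() are hypothetical helpers, they are unrelated to the kernel instruction decoder and to is_prefix_bad, and they ignore REX prefixes and every other opcode.

```c
#include <stddef.h>
#include <stdint.h>

/* Legacy prefixes: operand/address size, segment overrides, lock/rep. */
static int is_legacy_prefix(uint8_t b)
{
	switch (b) {
	case 0x66: case 0x67:
	case 0x26: case 0x2e: case 0x36: case 0x3e: case 0x64: case 0x65:
	case 0xf0: case 0xf2: case 0xf3:
		return 1;
	}
	return 0;
}

/*
 * Length of "[prefixes] 0f 1f ModRM [SIB] [disp]" or -1 if the bytes
 * are not that form or do not fit in 'max' bytes.
 */
static int nopl_length(const uint8_t *insn, size_t max)
{
	uint8_t modrm, mod, rm;
	size_t i = 0;

	while (i < max && is_legacy_prefix(insn[i]))
		i++;
	if (i + 3 > max || insn[i] != 0x0f || insn[i + 1] != 0x1f)
		return -1;
	i += 2;

	modrm = insn[i++];
	mod = modrm >> 6;
	rm = modrm & 7;

	if (mod != 3 && rm == 4) {		/* SIB byte follows */
		uint8_t sib;

		if (i >= max)
			return -1;
		sib = insn[i++];
		if (mod == 0 && (sib & 7) == 5)	/* no base, disp32 */
			i += 4;
	}
	if (mod == 1)
		i += 1;				/* disp8 */
	else if (mod == 2)
		i += 4;				/* disp32 */
	else if (mod == 0 && rm == 5)
		i += 4;				/* RIP-relative disp32 */

	return i <= max ? (int)i : -1;
}
```

For the nop10 above this counts two prefixes (0x66, 0x2e), the 0f 1f opcode, ModRM 0x84 (mod = 2, rm = 4), one SIB byte and a 4-byte displacement, giving 10. The plain 5-byte form 0f 1f 44 00 00 decodes to 5.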
Also change the error returned by the uprobe syscall when it is called
from outside the uprobe trampoline to -EPROTO, so that a fixed kernel
can be detected.
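A detection check built on this error code could look roughly like the sketch below. It is an assumption-laden illustration, not the libbpf probe: enum uprobe_syscall_support and classify_uprobe_syscall() are hypothetical names, and the caller is assumed to have invoked the uprobe syscall directly (outside any trampoline) and to pass in the return value and errno. The ENXIO case reflects the error used before this change, as stated in the libbpf commit of this series.

```c
#include <errno.h>

enum uprobe_syscall_support {
	UPROBE_SYSCALL_ABSENT,	/* e.g. ENOSYS or any unexpected result */
	UPROBE_SYSCALL_UNFIXED,	/* old error code: red zone fix missing */
	UPROBE_SYSCALL_FIXED,	/* new error code: nop10-based optimization */
};

/* Classify the result of calling the uprobe syscall outside a trampoline. */
static enum uprobe_syscall_support classify_uprobe_syscall(long ret, int err)
{
	if (ret < 0 && err == EPROTO)
		return UPROBE_SYSCALL_FIXED;
	if (ret < 0 && err == ENXIO)
		return UPROBE_SYSCALL_UNFIXED;
	return UPROBE_SYSCALL_ABSENT;
}
```

Keeping the classification separate from the syscall invocation makes the three outcomes easy to check without depending on the running kernel.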
The optimized uprobe performance stays the same:
uprobe-nop : 3.129 ± 0.013M/s
uprobe-push : 3.045 ± 0.006M/s
uprobe-ret : 1.095 ± 0.004M/s
--> uprobe-nop10 : 7.170 ± 0.020M/s
uretprobe-nop : 2.143 ± 0.021M/s
uretprobe-push : 2.090 ± 0.000M/s
uretprobe-ret : 0.942 ± 0.000M/s
--> uretprobe-nop10: 3.381 ± 0.003M/s
usdt-nop : 3.245 ± 0.004M/s
--> usdt-nop10 : 7.256 ± 0.023M/s
[1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
[2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
Fixes: ba2bfc9 ("uprobes/x86: Add support to optimize uprobes")
Assisted-by: Codex:GPT-5.5
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
We now expect the nop combo to use a 10-byte nop instead of a 5-byte nop; fix has_nop_combo to reflect that.
Fixes: 41a5c7d ("libbpf: Add support to detect nop,nop5 instructions combo for usdt probe")
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
In the previous optimized uprobe fix we changed the syscall error used for its detection from ENXIO to EPROTO. Change the related probe_uprobe_syscall detection check accordingly.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Fixes: 05738da ("libbpf: Add uprobe syscall feature detection")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Sync the latest usdt.h change [1].
Now that we have nop10 optimization support in the kernel, let's emit nop,nop10 for usdt probes. We leave it up to the library to use the desired nop instruction.
[1] TBD
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Optimized uprobes are now placed on top of 10-byte nop instructions; reflect that in the existing tests.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Change the uprobe/usdt trigger bench code to use nop10 instead of nop5. Also change run_bench_uprobes.sh to use the nop10 triggers.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
|
Upstream branch: c15261b |
Add reattach tests to the uprobe syscall tests to make sure we can re-attach and optimize the same uprobe multiple times.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
The uprobe nop5 optimization used to replace a 5-byte NOP with a 5-byte CALL to a trampoline. The CALL pushes a return address onto the stack at [rsp-8], clobbering whatever was stored there.
On x86-64, the red zone is the 128 bytes below rsp that user code may use for temporary storage without adjusting rsp. Compilers can place USDT argument operands there, generating specs like "8@-8(%rbp)" when rbp == rsp. With the CALL-based optimization, the return address overwrites that argument before the BPF-side USDT argument fetch runs.
Add two tests for this case. The uprobe_syscall subtest stores known values at -8(%rsp), -16(%rsp), and -24(%rsp), executes an optimized nop10 uprobe, and verifies the red-zone data is still intact. The USDT subtest triggers a probe in a function where the compiler places three USDT operands in the red zone and verifies that all 10 optimized invocations deliver the expected argument values to BPF.
On an unfixed kernel, the first hit goes through the INT3 path and later hits use the optimized CALL path, so the red-zone checks fail after optimization.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
[ updates to use nop10 ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Add tests for forked/cloned optimized uprobes to make sure the child can properly execute the optimized probe for both fork (which duplicates the mm) and clone with CLONE_VM.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Pull request for series with
subject: uprobes/x86: Fix red zone issue for optimized uprobes
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1101263