Downstream stall notifier fixes #200

travisdowns · 2025-04-15T14:05:23Z

This downstreams most of the updates and fixes related to CORE-9596:

Diagnostic upgrades (accepted upstream):
scylladb#2691
scylladb#2727

Crash fix (not accepted upstream yet):
scylladb#2714

I did not downstream one diagnostic fix (accepted upstream) because it does not merge cleanly due to earlier changes to the stall notifier which are not yet in our fork:

scylladb#2712

We will get this fix when we rebase on Seastar master.

I plan to backport these to at least 24.2.

On segfault we execute a handler that provides information including a backtrace. This currently emits all information in a single write call after collecting it in a buffer. If anything goes wrong, e.g., the backtrace() call itself crashes, then no information will be emitted. The backtrace() call is not signal safe in theory, and in practice the situation seems mixed as to its safety. So it is not unlikely that situations may arise where no output can be emitted on SIGSEGV.

Because we catch the signal and then re-raise it using pthread_kill, the specific information about the IP is lost in the re-raise: this prevents the line in syslog which usually captures information about segfaults from appearing at all. So we may be left without useful information after a crash.

In this change, we emit additional information before the backtrace() call, which is not likely to have any problem, and we emit each piece as a separate write(2) call so that if there is a failure at any point we at least have the information emitted up to that point.

After this, the start of the output on segfault looks like so:

    Segmentation fault, si_pid: 0, si_addr: 0000000000000000, ip: 0000579ba959751f
    Segmentation fault resolved ip: 0000000005e2751f in [0000579ba3770000+000000000e3f98d8]
    Segmentation fault on shard 0, in scheduling group admin.

Followed by the backtrace.

Closes scylladb#2691

(cherry picked from commit 320f13a)
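The idea of emitting each piece with its own write(2) call can be illustrated with a minimal sketch. This is not the code from the patch: `write_all`, `emit_line`, `report_crash` and the `risky_backtrace` callback are hypothetical names chosen for illustration, and the lines are assumed to be formatted already.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <unistd.h>

// Write the whole buffer with write(2), retrying on EINTR and short writes.
// Returns false if the write fails for any other reason.
static bool write_all(int fd, const char* buf, size_t len) {
    while (len > 0) {
        ssize_t n = ::write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return false;
        }
        buf += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}

// Emit one already-formatted line immediately instead of buffering it.
static void emit_line(int fd, const char* line) {
    write_all(fd, line, std::strlen(line));
}

// Hypothetical report routine: cheap facts first, the risky step last.
// `risky_backtrace` stands in for the backtrace()-based step.
void report_crash(int fd, const char* fault_line, const char* shard_line,
                  void (*risky_backtrace)(int)) {
    emit_line(fd, fault_line);  // already on the fd if anything below crashes
    emit_line(fd, shard_line);
    risky_backtrace(fd);        // may crash and never return
}
```

Because nothing is held back in a buffer, a crash inside the last step still leaves the first two lines in the log.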

Use 0x in the addresses in the segfault initial logging to make it clear they are hex (if there happen not to be any letters in the address this can be ambiguous). Closes scylladb#2727 (cherry picked from commit b843f5a)
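As an illustration of the ambiguity: an address such as 0000000010000000 contains no letters and could be read as decimal, while 0x0000000010000000 cannot. A minimal sketch of a printf-free formatter that always adds the prefix is shown here; `format_hex_0x` is a hypothetical helper, not the function used in the reactor code.

```cpp
#include <cstdint>

// Format `value` as "0x" followed by 16 zero-padded lowercase hex digits,
// NUL-terminated. No allocation and no printf, so it is usable from a handler.
void format_hex_0x(uint64_t value, char (&out)[19]) {
    static const char digits[] = "0123456789abcdef";
    out[0] = '0';
    out[1] = 'x';
    for (int i = 0; i < 16; ++i) {
        // Most significant nibble first.
        out[2 + i] = digits[(value >> (60 - 4 * i)) & 0xf];
    }
    out[18] = '\0';
}
```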

If an exception is in progress when a reactor stall is detected, omit taking the backtrace, as this can crash as detailed in scylladb#2697. Add a test which reproduces the issue, and which crashes (sometimes) before this fix and which runs cleanly afterwards. Fixes scylladb#2697. (cherry picked from commit 7f2aede)
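The guard can be sketched as follows. This sketch assumes the standard `std::uncaught_exceptions()` as the way to detect an exception in flight. The actual patch may detect this differently, and the standard does not promise that this call is safe inside a signal handler. `should_take_backtrace` and `unwind_probe` are hypothetical names.

```cpp
#include <exception>

// Hypothetical policy check: skip the backtrace while the current thread
// is unwinding, since walking the stack at that point may crash.
bool should_take_backtrace() noexcept {
    return std::uncaught_exceptions() == 0;
}

// Records the policy decision from inside a destructor, which runs during
// stack unwinding when an exception propagates through its scope.
struct unwind_probe {
    bool* result;
    ~unwind_probe() { *result = should_take_backtrace(); }
};
```

In normal execution the check returns true. When an `unwind_probe` is destroyed because an exception is propagating, it records false, which is the case where the stall report would omit the backtrace.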

travisdowns and others added 3 commits April 15, 2025 09:52

reactor: use 0x for hex addresses

c68dbf9


travisdowns requested review from StephanDollberg and ballard26 April 15, 2025 14:05

StephanDollberg approved these changes Apr 15, 2025


travisdowns merged commit 57559fd into redpanda-data:v25.2.x-pre Apr 15, 2025
14 checks passed

This was referenced Apr 15, 2025

Backport stall SEGV fixes to 25.1 #204

Merged

Backport stall SEGV fixes to 24.3 #205

Merged

Backport stall SEGV fixes to 24.2 #206

Merged

pgellert mentioned this pull request Apr 23, 2025

Downstream Print incrementally in sigsegv handler #207

Merged
