Downstream stall notifier fixes #200

travisdowns
Member

This downstreams most of the updates and fixes related to CORE-9596:

Diagnostic upgrades (accepted upstream):
scylladb#2691
scylladb#2727

Crash fix (not accepted upstream yet):
scylladb#2714

I did not downstream one diagnostic fix (accepted upstream) because it does not merge cleanly due to earlier changes to the stall notifier which are not yet in our fork:

scylladb#2712

We will get this fix when we rebase on Seastar master.

I plan to backport these to at least 24.2.

travisdowns and others added 3 commits April 15, 2025 09:52
On segfault we execute a handler that provides information including
a backtrace. This currently emits all information in a single write
call after collecting it in a buffer. If anything goes wrong, e.g.,
the backtrace() call itself crashes, then no information will be
emitted. The backtrace() call is not signal safe in theory, and in
practice the situation seems mixed as to its safety. So it is not
unlikely that situations may arise where no output can be emitted on
SIGSEGV.

Because we catch the signal and then re-raise it using pthread_kill, the
specific information about the IP is lost in re-raise: this prevents the
line in syslog which usually captures information about segfaults from
appearing at all. So we may be left without useful information after a
crash.

In this change, we emit additional information before the backtrace()
which is not likely to have any problem, and we emit each as separate
write(2) calls so if there is a failure at any point we at least have
the information emitted up to that point.

After this, the start of the output on segfault looks like so:

Segmentation fault, si_pid: 0, si_addr: 0000000000000000, ip: 0000579ba959751f
Segmentation fault resolved ip: 0000000005e2751f in [0000579ba3770000+000000000e3f98d8]
Segmentation fault on shard 0, in scheduling group admin.

Followed by the backtrace.
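
Reading the sample output above, the "resolved ip" appears to be the raw ip minus the start address of the containing mapping, with the bracketed pair giving that mapping's start and size. The sketch below reproduces that arithmetic under this assumption. `mapping` and `resolve_ip` are hypothetical names, and how the handler actually locates the mapping is not shown.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical description of one loaded mapping: [start, start + size).
struct mapping {
    std::uint64_t start;
    std::uint64_t size;
};

// Return the offset of `ip` inside `m`, or nullopt if `ip` lies outside it.
inline std::optional<std::uint64_t> resolve_ip(std::uint64_t ip, const mapping& m) {
    if (ip < m.start || ip - m.start >= m.size) {
        return std::nullopt;
    }
    return ip - m.start;
}
```

With the values from the sample, `resolve_ip(0x579ba959751f, {0x579ba3770000, 0xe3f98d8})` yields `0x5e2751f`, matching the second line of the output. An offset of this kind does not depend on where the binary happened to be loaded.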

Closes scylladb#2691

(cherry picked from commit 320f13a)
Use 0x in the addresses in the segfault initial logging to make it clear
they are hex (if there happen not to be any letters in the address
this can be ambiguous).
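
A small sketch of the ambiguity, using hypothetical helpers `hex_plain` and `hex_prefixed`. It uses snprintf purely to show the two formats; that is not necessarily how the handler formats addresses.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// 16 zero-padded hex digits with no prefix.
inline std::string hex_plain(std::uint64_t v) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%016" PRIx64, v);
    return buf;
}

// The same digits with a leading 0x so the base is explicit.
inline std::string hex_prefixed(std::uint64_t v) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "0x%016" PRIx64, v);
    return buf;
}
```

For the address 0x10, `hex_plain` gives `0000000000000010`, which a reader could take for decimal ten, while `hex_prefixed` gives `0x0000000000000010`.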

Closes scylladb#2727

(cherry picked from commit b843f5a)
If an exception is in progress when a reactor stall is detected,
omit taking the backtrace, as this can crash as detailed in scylladb#2697.
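
One way such a guard could look is sketched below. It assumes the check is based on std::uncaught_exceptions(); whether the actual fix uses that call is not stated here. `exception_in_flight`, `unwind_probe`, and `probe_during_throw` are hypothetical names used only for illustration.

```cpp
#include <exception>
#include <stdexcept>

// True while at least one exception has been thrown and not yet caught,
// i.e. while the stack may be unwinding on this thread.
inline bool exception_in_flight() noexcept {
    return std::uncaught_exceptions() > 0;
}

// Hypothetical probe: its destructor runs during stack unwinding and
// records what a stall report taken at that moment would observe.
struct unwind_probe {
    bool& seen;
    ~unwind_probe() { seen = exception_in_flight(); }
};

// Throw through a scope holding the probe and report what it saw.
inline bool probe_during_throw() {
    bool seen = false;
    try {
        unwind_probe p{seen};
        throw std::runtime_error("boom");
    } catch (const std::exception&) {
        // The exception has been caught; it is no longer in flight here.
    }
    return seen;
}
```

A stall reporter following this idea would skip the backtrace step whenever `exception_in_flight()` returns true and still emit the rest of its report.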

Add a test which reproduces the issue, and which crashes (sometimes)
before this fix and which runs cleanly afterwards.

Fixes scylladb#2697.

(cherry picked from commit 7f2aede)
@travisdowns travisdowns merged commit 57559fd into redpanda-data:v25.2.x-pre Apr 15, 2025
14 checks passed