
Backport stall SEGV fixes to 24.3 #205

travisdowns
Member

Backport the 3 changes from #200.

travisdowns and others added 3 commits April 15, 2025 12:05
On segfault we execute a handler that provides information including
a backtrace. This currently emits all information in a single write
call after collecting it in a buffer. If anything goes wrong, e.g.,
the backtrace() call itself crashes, then no information will be
emitted. The backtrace() call is not signal safe in theory, and in
practice the situation seems mixed as to its safety. So it is not
unlikely that situations may arise where no output can be emitted on
SIGSEGV.

Because we catch the signal and then re-raise it using pthread_kill, the
specific information about the IP is lost in re-raise: this prevents the
line in syslog which usually captures information about segfaults from
appearing at all. So we may be left without useful information after a
crash.

In this change, we emit additional information before the backtrace()
which is not likely to have any problem, and we emit each as separate
write(2) calls so if there is a failure at any point we at least have
the information emitted up to that point.
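A minimal sketch of the "one write(2) per piece" idea described above. The helper names `emit_line` and `emit_in_stages` are hypothetical and not taken from the patch; the sketch assumes the handler only needs `write(2)` and `strlen`, with no allocation or stdio.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <unistd.h>

// Hypothetical helper: push one message to fd, retrying on EINTR and on
// short writes. Uses only write(2) and strlen, no allocation or stdio.
inline bool emit_line(int fd, const char* msg) noexcept {
    std::size_t left = std::strlen(msg);
    while (left > 0) {
        ssize_t n = ::write(fd, msg, left);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return false;
        }
        msg += n;
        left -= static_cast<std::size_t>(n);
    }
    return true;
}

// Emit each stage on its own instead of collecting everything in one buffer.
// If a later stage fails or crashes, the earlier stages are already out.
// Returns the number of stages that were fully written.
inline std::size_t emit_in_stages(int fd, const char* const* stages,
                                  std::size_t count) noexcept {
    std::size_t done = 0;
    for (; done < count; ++done) {
        if (!emit_line(fd, stages[done])) {
            break;
        }
    }
    return done;
}
```

In a real handler the cheap stages (signal info, shard, scheduling group) would presumably come first and the backtrace last, so a crash inside the backtrace step cannot swallow the earlier lines.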

After this, the start of the output on segfault looks like so:

Segmentation fault, si_pid: 0, si_addr: 0000000000000000, ip: 0000579ba959751f
Segmentation fault resolved ip: 0000000005e2751f in [0000579ba3770000+000000000e3f98d8]
Segmentation fault on shard 0, in scheduling group admin.

Followed by the backtrace.
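For readers decoding the second line of that output: the numbers are consistent with the "resolved ip" being the raw ip minus the start of the mapping shown in brackets, read here as [base+size]. That reading is inferred from the sample values, not confirmed by the patch, and the `mapping` type and `resolve_ip` function below are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Hypothetical description of one loaded mapping: start address and length.
struct mapping {
    std::uint64_t base;
    std::uint64_t size;
};

// Translate a runtime instruction pointer into an offset within the mapping.
// Returns false when ip does not fall inside [base, base + size).
inline bool resolve_ip(std::uint64_t ip, const mapping& m,
                       std::uint64_t& offset) noexcept {
    if (ip < m.base || ip - m.base >= m.size) {
        return false;
    }
    offset = ip - m.base;
    return true;
}
```

With the sample values, `resolve_ip(0x0000579ba959751f, {0x0000579ba3770000, 0x000000000e3f98d8}, off)` succeeds and sets `off` to `0x0000000005e2751f`, matching the output above. An offset of this kind is what tools such as addr2line typically expect for a position-independent executable.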

Closes scylladb#2691

(cherry picked from commit 320f13a)
(cherry picked from commit 749604f)

Use 0x in the addresses in the segfault initial logging to make it clear
they are hex (if there happen not to be any letters in the address
this can be ambiguous).

Closes scylladb#2727

(cherry picked from commit b843f5a)
(cherry picked from commit fec49a0)

If an exception is in progress when a reactor stall is detected,
omit taking the backtrace, as this can crash as detailed in scylladb#2697.

Add a test which reproduces the issue, and which crashes (sometimes)
before this fix and which runs cleanly afterwards.
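A rough sketch of the guard described above. How the patch actually detects an in-flight exception is not shown on this page; this version uses `std::uncaught_exceptions()` (C++17) as a stand-in, and that function is not guaranteed to be async-signal-safe. The names `safe_to_backtrace` and `unwind_probe` are hypothetical.

```cpp
#include <exception>
#include <stdexcept>

// Stand-in check: only take a backtrace when no exception is currently
// thrown and not yet caught on this thread.
inline bool safe_to_backtrace() noexcept {
    return std::uncaught_exceptions() == 0;
}

// Records what the guard reports while the stack is being unwound, since
// destructors of locals run between the throw and the matching catch.
struct unwind_probe {
    bool& observed;
    ~unwind_probe() { observed = safe_to_backtrace(); }
};
```

If an `unwind_probe` is alive when an exception is thrown, its destructor runs during unwinding and sees the guard return false; outside of any in-flight exception the guard returns true.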

Fixes scylladb#2697.

(cherry picked from commit 7f2aede)
(cherry picked from commit 57559fd)
@StephanDollberg StephanDollberg merged commit d998d3f into redpanda-data:v24.3.x Apr 16, 2025