stall_detector: no backtrace if exception #2714



Contributor

@travisdowns travisdowns commented Apr 2, 2025

If an exception is in progress when a reactor stall is detected, skip taking the backtrace, as doing so can crash, as detailed in #2697.

Add a test which reproduces the issue, and which crashes (sometimes) before this fix and which runs cleanly afterwards.

Fixes #2697.

This is the same fix we have applied in our seastar branch for Redpanda, in order to fix a similar crash in our not-upstreamed CPU profiler which relies on the same mechanism as the stall notifier. We have not seen any issues with that approach and the CPU profiler, when enabled, will take backtraces at a much higher frequency than the stall notifier in general (every 100ms, by default).

Of course, the question is whether std::uncaught_exceptions is itself "signal safe", and you won't find any guarantee about that in the C++ standard (which largely ignores the existence of signals). I did check the implementation of this function in libc++ and libstdc++ and it looks OK to me: both just read a single field from an exception globals area.
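For illustration, here is a minimal sketch of the kind of guard described above. It is not the code from this PR: `safe_to_backtrace`, `probe` and `probe_during_unwind` are hypothetical names made up for the example, and it runs in ordinary code rather than in a signal handler. It only shows that `std::uncaught_exceptions()` is non-zero while the stack is being unwound, which is the condition used to skip the backtrace.

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical helper (not Seastar's API): report whether it looks safe
// to take a backtrace, i.e. no exception is currently in flight on this thread.
inline bool safe_to_backtrace() noexcept {
    return std::uncaught_exceptions() == 0;
}

// Records the helper's answer from inside a destructor, so the check runs
// while the stack is being unwound for a thrown exception.
struct probe {
    bool& result;
    ~probe() { result = safe_to_backtrace(); }
};

// Throws with a probe on the stack and returns what the probe observed
// during unwinding.
inline bool probe_during_unwind() {
    bool result = true;
    try {
        probe p{result};
        throw std::runtime_error("boom");
    } catch (const std::exception&) {
        // By the time the handler runs, the exception counts as caught.
    }
    return result;
}
```

In a stall detector the same check would sit at the top of the report path, before any unwinding is attempted; whether that call is acceptable in a signal handler is the assumption discussed above.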

@xemul
Contributor

xemul commented Apr 8, 2025

For the record: there was an attempt to still collect the backtrace and not crash if it fails (#2420).

@travisdowns
Contributor Author

> For the record: there was an attempt to still collect the backtrace and not crash if it fails (#2420).

Thanks for the pointer @xemul, I hadn't seen that PR. I think I had searched the issues but not the PRs. At the least, this PR provides a reproducer which may help in evaluating the other PR as well.

@travisdowns
Contributor Author

I also considered the longjmp approach, but went with this one as I think it is simpler and overall safer. The downside is missing stalls that occur during exceptions; however, IME that is very uncommon. I can't recall any stall which was "always" in an exception (that would probably crash pretty frequently if so), and the ones that are occasionally in an exception are usually there very rarely, so the non-exception reports point just fine to the underlying location. Exceptions are slow, so we also tend to remove them from hot paths.

@avikivity
Member

Maybe test @xemul's fix with @travisdowns's reproducer?

Stalls in exception storms will land in exceptions. Good code should de-exceptionalize such areas, but that's only done once experienced.

Successfully merging this pull request may close these issues.

seastar could crash if a stall report is emitted at an inopportune time