-
Notifications
You must be signed in to change notification settings - Fork 584
Implement synchronous signal handling in umbrella/worker processes #10423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This doesn't compile on my machine🍎.
- Also, I'm afraid this will break OpenBSD https://man.openbsd.org/sigtimedwait which we explicitly ensured to build GHA: also build on Alpine to test LibreSSL which is used on OpenBSD #9949.
9b0ebe5
to
575e4f7
Compare
Well, I've not considered OpenBSD just not implementing POSIX (again)... It seems you're meant to use |
FreeBSD should work btw: https://man.freebsd.org/cgi/man.cgi?query=sigwaitinfo&sektion=2&format=html |
575e4f7
to
4299902
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't compile on my machine🍎.
Now it seems to compile. Though, I recommend you
- to Add MacOS GitHub actions #10424
- and to read https://icinga.com/blog/monitor-all-operating-systems-with-one-solution-icinga-2/
in order to properly test this.
lib/base/application.cpp
Outdated
#if defined(__APPLE__) || defined(__OpenBSD__) | ||
ret = sigwait(&sigSet, &info.si_signo); | ||
VERIFY(ret != EINVAL); | ||
#else | ||
ret = sigwaitinfo(&sigSet, &info); | ||
VERIFY(ret != -1 || errno != EINVAL); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why (to have to write and maintain) two *nix implementations? Is the BSD one not good enough for Linux? Besides less maintenance overhead, one implementation would be tested by Linux, so we won't have to wait until BSD users complain. (Eat your own 🐶food.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sigwaitinfo()
implementation also gets the proper siginfo_t information against which we check the parent or worker pids.
If we want to do this identically on all systems, we should either talk about a proper IPC, or just skip the pid-checking part entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seriously speaking, we already use AF_*NIX sockets for IPC across fork(2), and this use case where we even check PIDs is literally begging for it.
4299902
to
56ea041
Compare
I've just pushed a new version where I addressed some of @Al2Klimov's comments and also added some additional clarification and validation for the timeout passed to |
56ea041
to
49778e7
Compare
Another update that now replaces the timeout value in Also fixed another small oversight where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, SIGUSR2 is used only once in each direction and only between umbrella and worker process. So it should be as simple as an AF_*NIX socketpair(2)/pipe(2) per worker process: instead of kill(2) you'd just write(2) a single byte (and close(2)?). Then we'd only need one *nix implementation AND we'd know where the IPC comes from, EVEN on OpenBSD.
@Al2Klimov We're also using the PID for the other signals, even SIGTERM on the worker side. But I get what you mean. And there's already helper classes that would make this much simpler. I will talk to the others and see what's their opinion about this. If they think it's worth the effort I'll get to work towards a separate PR, since that would be far outside the scope of this one anyway. |
Maybe. Maybe not, remember where this discussion started: #10423 (comment). Sure, it's an option to keep those two implementations for now – IF you test the final PR version on all different OS to be supported. |
I didn't necessarily mean that I want this PR to be merged first. In fact, it makes little sense to merge this if we decide on something else. What I meant is that if we end up deciding on a different approach, it should be implemented in a new PR (with this one being closed as a consequence). |
This PR is an attempt to fix #9676.
Problem
As documented in the issue linked above, on the current master, Signal handling is done in asynchronous handlers installed with
sigaction()
. As per the POSIX standard, these handlers aren't allowed to use any functions that aren't listed asasync-signal-safe
, which we currently violate by calling functions such asLog()
that perform allocations usingoperator new()
.I've had difficulty reproducing the original problem, even when adding multiple big allocations in the event loops. But even if we can't easily reproduce it, doing these things in signal handlers is still incorrect and should be fixed.
Considerations
There are two potential solutions to this problem. Either remove all non-
async-signal-safe
functions from the handlers and make them as simple as possible, or make the signal handling synchronous instead, as suggested in #9676 by usingsigtimedwait()
.Currently there is additional effort involved in setting atomic variables (which is necessary because all threads can potentially receive the signal) that get then queried in the processes main event loops.
Considering this additional effort in re-synchronizing the results from signal handling, the synchronous approach seemed simpler and actually less invasive if we want to still keep a way of logging inside of signal handlers.
The asynchronous handlers for SIGABRT will remain as its critical to what it does.
Implementation
I moved blocking the relevant signals to earlier in
DeamonCommand::Run()
and now we leave them blocked, instead of unblocking them after forking into "umbrella" and worker processes and installing signal handlers instead. Except for SIGHUP, which gets set to be ignore (SIG_IGN) for the worker, same as before.From there, on the "umbrella"-side, pending signals are polled individually directly where needed, removing the necessity for all of the global state variables.
icinga2/lib/cli/daemoncommand.cpp
Lines 472 to 477 in 4299902
However on the worker side, this is not easily possible because the state variables (static members of
Application::
in this case) are shared with the WIN32 code-paths and have other input sources as part of the regular control flow. Therefore aWorkerSignalHandler()
, similar to the old one inlib/cli/daemoncommand.cpp
was added tolib/base/application.cpp
and replaces theUtility::Sleep()
inApplication::RunEventLoop()
on non-WIN32 systems.icinga2/lib/base/application.cpp
Lines 474 to 478 in 4299902
The function
Application::GetPendingSignal()
contains all the platform-y ugliness to support MacOS and OpenBSD, so the rest of the code doesn't have to deal with it. It also supports being called with an infinite timeout and zero timeout.Testing
Manual testing has been done to verify that the signals involved in the umbrella/worker setup and termination (SIGUSR1, SIGUSR2, SIGTERM, SIGINT, SIGHUP) are processed and propagated as expected.
In the future (or pending discussion in the context of this PR), adding automated test cases may be a good idea but considering how entangled these classes and functions currently are this would be difficult without refactoring into more self-contained parts first.