Skip to content

Implement synchronous signal handling in umbrella/worker processes #10423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jschmidt-icinga
Copy link
Contributor

@jschmidt-icinga jschmidt-icinga commented Apr 28, 2025

This PR is an attempt to fix #9676.

Problem

As documented in the issue linked above, on the current master, Signal handling is done in asynchronous handlers installed with sigaction(). As per the POSIX standard, these handlers aren't allowed to use any functions that aren't listed as async-signal-safe, which we currently violate by calling functions such as Log() that perform allocations using operator new().

I've had difficulty reproducing the original problem, even when adding multiple big allocations in the event loops. But even if we can't easily reproduce it, doing these things in signal handlers is still incorrect and should be fixed.

Considerations

There are two potential solutions to this problem. Either remove all non-async-signal-safe functions from the handlers and make them as simple as possible, or make the signal handling synchronous instead, as suggested in #9676 by using sigtimedwait().

Currently there is additional effort involved in setting atomic variables (which is necessary because all threads can potentially receive the signal) that get then queried in the processes main event loops.

Considering this additional effort in re-synchronizing the results from signal handling, the synchronous approach seemed simpler and actually less invasive if we want to still keep a way of logging inside of signal handlers.

The asynchronous handlers for SIGABRT will remain as its critical to what it does.

Implementation

I moved blocking the relevant signals to earlier in DeamonCommand::Run() and now we leave them blocked, instead of unblocking them after forking into "umbrella" and worker processes and installing signal handlers instead. Except for SIGHUP, which gets set to be ignore (SIG_IGN) for the worker, same as before.

From there, on the "umbrella"-side, pending signals are polled individually directly where needed, removing the necessity for all of the global state variables.

auto sig = Application::GetPendingSignal(SIGUSR2);
if (sig && (sig->si_pid == 0 || sig->si_pid == pid)) {
Log(LogNotice, "Application")
<< "Worker process successfully loaded its config";
break;
}

However on the worker side, this is not easily possible because the state variables (static members of Application:: in this case) are shared with the WIN32 code-paths and have other input sources as part of the regular control flow. Therefore a WorkerSignalHandler(), similar to the old one in lib/cli/daemoncommand.cpp was added to lib/base/application.cpp and replaces the Utility::Sleep() in Application::RunEventLoop() on non-WIN32 systems.

#ifndef _WIN32
WorkerSignalHandler(2.5);
#else
Utility::Sleep(2.5);
#endif /* _WIN32 */

The function Application::GetPendingSignal() contains all the platform-y ugliness to support MacOS and OpenBSD, so the rest of the code doesn't have to deal with it. It also supports being called with an infinite timeout and zero timeout.

Testing

Manual testing has been done to verify that the signals involved in the umbrella/worker setup and termination (SIGUSR1, SIGUSR2, SIGTERM, SIGINT, SIGHUP) are processed and propagated as expected.

In the future (or pending discussion in the context of this PR), adding automated test cases may be a good idea but considering how entangled these classes and functions currently are this would be difficult without refactoring into more self-contained parts first.

Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschmidt-icinga jschmidt-icinga force-pushed the jschmidt/synchronous-signal-handling branch from 9b0ebe5 to 575e4f7 Compare April 28, 2025 13:47
@jschmidt-icinga
Copy link
Contributor Author

jschmidt-icinga commented Apr 28, 2025

Well, I've not considered OpenBSD just not implementing POSIX (again)...

It seems you're meant to use sigwait() in combination with timer_settime() on OpenBSD. I will look into it.

@Al2Klimov
Copy link
Member

@jschmidt-icinga jschmidt-icinga force-pushed the jschmidt/synchronous-signal-handling branch from 575e4f7 to 4299902 Compare April 29, 2025 14:24
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't compile on my machine🍎.

Now it seems to compile. Though, I recommend you

in order to properly test this.

Comment on lines 357 to 369
#if defined(__APPLE__) || defined(__OpenBSD__)
ret = sigwait(&sigSet, &info.si_signo);
VERIFY(ret != EINVAL);
#else
ret = sigwaitinfo(&sigSet, &info);
VERIFY(ret != -1 || errno != EINVAL);
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why (to have to write and maintain) two *nix implementations? Is the BSD one not good enough for Linux? Besides less maintenance overhead, one implementation would be tested by Linux, so we won't have to wait until BSD users complain. (Eat your own 🐶food.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sigwaitinfo() implementation also gets the proper siginfo_t information against which we check the parent or worker pids.

If we want to do this identically on all systems, we should either talk about a proper IPC, or just skip the pid-checking part entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see!

talk about a proper IPC

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seriously speaking, we already use AF_*NIX sockets for IPC across fork(2), and this use case where we even check PIDs is literally begging for it.

@jschmidt-icinga jschmidt-icinga force-pushed the jschmidt/synchronous-signal-handling branch from 4299902 to 56ea041 Compare April 30, 2025 08:23
@jschmidt-icinga
Copy link
Contributor Author

I've just pushed a new version where I addressed some of @Al2Klimov's comments and also added some additional clarification and validation for the timeout passed to GetPendingSignal().

@jschmidt-icinga jschmidt-icinga force-pushed the jschmidt/synchronous-signal-handling branch from 56ea041 to 49778e7 Compare April 30, 2025 12:53
@jschmidt-icinga
Copy link
Contributor Author

jschmidt-icinga commented Apr 30, 2025

Another update that now replaces the timeout value in GetPendingSignal() with std::chrono::microseconds in a variant with bool to signify infinite or zero wait.

Also fixed another small oversight where EINTR wasn't handled for sigwaitinfo(). I don't know how relevant that even is, but now it's there.

Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, SIGUSR2 is used only once in each direction and only between umbrella and worker process. So it should be as simple as an AF_*NIX socketpair(2)/pipe(2) per worker process: instead of kill(2) you'd just write(2) a single byte (and close(2)?). Then we'd only need one *nix implementation AND we'd know where the IPC comes from, EVEN on OpenBSD.

@jschmidt-icinga
Copy link
Contributor Author

jschmidt-icinga commented May 5, 2025

@Al2Klimov We're also using the PID for the other signals, even SIGTERM on the worker side.

But I get what you mean. And there's already helper classes that would make this much simpler. I will talk to the others and see what's their opinion about this. If they think it's worth the effort I'll get to work towards a separate PR, since that would be far outside the scope of this one anyway.

@Al2Klimov
Copy link
Member

separate PR, since that would be far outside the scope of this one anyway

Maybe.

Maybe not, remember where this discussion started: #10423 (comment).

Sure, it's an option to keep those two implementations for now – IF you test the final PR version on all different OS to be supported.

@jschmidt-icinga
Copy link
Contributor Author

Sure, it's an option to keep those two implementations for now – IF you test the final PR version on all different OS to be supported.

I didn't necessarily mean that I want this PR to be merged first. In fact, it makes little sense to merge this if we decide on something else. What I meant is that if we end up deciding on a different approach, it should be implemented in a new PR (with this one being closed as a consequence).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Signal handlers should only use async-signal-safe functions
2 participants