Implement synchronous signal handling in umbrella/worker processes #10423

jschmidt-icinga · 2025-04-28T13:09:28Z

This PR is an attempt to fix #9676.

Problem

As documented in the issue linked above, on the current master, Signal handling is done in asynchronous handlers installed with sigaction(). As per the POSIX standard, these handlers aren't allowed to use any functions that aren't listed as async-signal-safe, which we currently violate by calling functions such as Log() that perform allocations using operator new().

I've had difficulty reproducing the original problem, even when adding multiple big allocations in the event loops. But even if we can't easily reproduce it, doing these things in signal handlers is still incorrect and should be fixed.

Considerations

There are two potential solutions to this problem. Either remove all non-async-signal-safe functions from the handlers and make them as simple as possible, or make the signal handling synchronous instead, as suggested in #9676 by using sigtimedwait().

Currently there is additional effort involved in setting atomic variables (which is necessary because all threads can potentially receive the signal) that get then queried in the processes main event loops.

Considering this additional effort in re-synchronizing the results from signal handling, the synchronous approach seemed simpler and actually less invasive if we want to still keep a way of logging inside of signal handlers.

The asynchronous handlers for SIGABRT will remain as its critical to what it does.

Implementation

I moved blocking the relevant signals to earlier in DeamonCommand::Run() and now we leave them blocked, instead of unblocking them after forking into "umbrella" and worker processes and installing signal handlers instead. Except for SIGHUP, which gets set to be ignore (SIG_IGN) for the worker, same as before.

From there, on the "umbrella"-side, pending signals are polled individually directly where needed, removing the necessity for all of the global state variables.

icinga2/lib/cli/daemoncommand.cpp

Lines 472 to 477 in 4299902

    
           auto sig = Application::GetPendingSignal(SIGUSR2); 
        
           if (sig && (sig->si_pid == 0 || sig->si_pid == pid)) { 
        
           	Log(LogNotice, "Application") 
        
           		<< "Worker process successfully loaded its config"; 
        
           	break; 
        
           }

However on the worker side, this is not easily possible because the state variables (static members of Application:: in this case) are shared with the WIN32 code-paths and have other input sources as part of the regular control flow. Therefore a WorkerSignalHandler(), similar to the old one in lib/cli/daemoncommand.cpp was added to lib/base/application.cpp and replaces the Utility::Sleep() in Application::RunEventLoop() on non-WIN32 systems.

icinga2/lib/base/application.cpp

Lines 474 to 478 in 4299902

    
           #ifndef _WIN32 
        
           		WorkerSignalHandler(2.5); 
        
           #else 
        
           		Utility::Sleep(2.5); 
        
           #endif /* _WIN32 */

The function Application::GetPendingSignal() contains all the platform-y ugliness to support MacOS and OpenBSD, so the rest of the code doesn't have to deal with it. It also supports being called with an infinite timeout and zero timeout.

Testing

Manual testing has been done to verify that the signals involved in the umbrella/worker setup and termination (SIGUSR1, SIGUSR2, SIGTERM, SIGINT, SIGHUP) are processed and propagated as expected.

In the future (or pending discussion in the context of this PR), adding automated test cases may be a good idea but considering how entangled these classes and functions currently are this would be difficult without refactoring into more self-contained parts first.

Al2Klimov

This doesn't compile on my machine🍎.
Also, I'm afraid this will break OpenBSD https://man.openbsd.org/sigtimedwait which we explicitly ensured to build #9949.

lib/cli/daemoncommand.cpp

jschmidt-icinga · 2025-04-28T14:00:35Z

This doesn't compile on my machine🍎.

Also, I'm afraid this will break OpenBSD https://man.openbsd.org/sigtimedwait which we explicitly ensured to build GHA: also build on Alpine to test LibreSSL which is used on OpenBSD #9949.

Well, I've not considered OpenBSD just not implementing POSIX (again)...

It seems you're meant to use sigwait() in combination with timer_settime() on OpenBSD. I will look into it.

Al2Klimov · 2025-04-28T14:03:26Z

FreeBSD should work btw: https://man.freebsd.org/cgi/man.cgi?query=sigwaitinfo&sektion=2&format=html

Al2Klimov

This doesn't compile on my machine🍎.

Now it seems to compile. Though, I recommend you

to #10424
and to read https://icinga.com/blog/monitor-all-operating-systems-with-one-solution-icinga-2/

in order to properly test this.

Al2Klimov · 2025-04-29T14:39:05Z

lib/base/application.cpp

+#if defined(__APPLE__) || defined(__OpenBSD__)
+		ret = sigwait(&sigSet, &info.si_signo);
+		VERIFY(ret != EINVAL);
+#else
+		ret = sigwaitinfo(&sigSet, &info);
+		VERIFY(ret != -1 || errno != EINVAL);
+#endif


Also, why (to have to write and maintain) two *nix implementations? Is the BSD one not good enough for Linux? Besides less maintenance overhead, one implementation would be tested by Linux, so we won't have to wait until BSD users complain. (Eat your own 🐶food.)

The sigwaitinfo() implementation also gets the proper siginfo_t information against which we check the parent or worker pids.

If we want to do this identically on all systems, we should either talk about a proper IPC, or just skip the pid-checking part entirely.

I see!

talk about a proper IPC

Seriously speaking, we already use AF_*NIX sockets for IPC across fork(2), and this use case where we even check PIDs is literally begging for it.

lib/cli/daemoncommand.cpp

lib/base/application.cpp

jschmidt-icinga · 2025-04-30T08:32:25Z

I've just pushed a new version where I addressed some of @Al2Klimov's comments and also added some additional clarification and validation for the timeout passed to GetPendingSignal().

jschmidt-icinga · 2025-04-30T12:58:57Z

Another update that now replaces the timeout value in GetPendingSignal() with std::chrono::microseconds in a variant with bool to signify infinite or zero wait.

Also fixed another small oversight where EINTR wasn't handled for sigwaitinfo(). I don't know how relevant that even is, but now it's there.

Al2Klimov

IIRC, SIGUSR2 is used only once in each direction and only between umbrella and worker process. So it should be as simple as an AF_*NIX socketpair(2)/pipe(2) per worker process: instead of kill(2) you'd just write(2) a single byte (and close(2)?). Then we'd only need one *nix implementation AND we'd know where the IPC comes from, EVEN on OpenBSD.

jschmidt-icinga · 2025-05-05T07:05:34Z

@Al2Klimov We're also using the PID for the other signals, even SIGTERM on the worker side.

But I get what you mean. And there's already helper classes that would make this much simpler. I will talk to the others and see what's their opinion about this. If they think it's worth the effort I'll get to work towards a separate PR, since that would be far outside the scope of this one anyway.

Al2Klimov · 2025-05-05T08:55:42Z

separate PR, since that would be far outside the scope of this one anyway

Maybe.

Maybe not, remember where this discussion started: #10423 (comment).

Sure, it's an option to keep those two implementations for now – IF you test the final PR version on all different OS to be supported.

jschmidt-icinga · 2025-05-05T08:59:43Z

Sure, it's an option to keep those two implementations for now – IF you test the final PR version on all different OS to be supported.

I didn't necessarily mean that I want this PR to be merged first. In fact, it makes little sense to merge this if we decide on something else. What I meant is that if we end up deciding on a different approach, it should be implemented in a new PR (with this one being closed as a consequence).

jschmidt-icinga requested a review from julianbrost April 28, 2025 13:09

cla-bot bot added the cla/signed label Apr 28, 2025

Al2Klimov reviewed Apr 28, 2025

View reviewed changes

lib/cli/daemoncommand.cpp Outdated Show resolved Hide resolved

lib/cli/daemoncommand.cpp Outdated Show resolved Hide resolved

lib/cli/daemoncommand.cpp Outdated Show resolved Hide resolved

jschmidt-icinga force-pushed the jschmidt/synchronous-signal-handling branch from 9b0ebe5 to 575e4f7 Compare April 28, 2025 13:47

Al2Klimov mentioned this pull request Apr 28, 2025

Add MacOS GitHub actions #10424

Open

jschmidt-icinga force-pushed the jschmidt/synchronous-signal-handling branch from 575e4f7 to 4299902 Compare April 29, 2025 14:24

Al2Klimov reviewed Apr 29, 2025

View reviewed changes

jschmidt-icinga force-pushed the jschmidt/synchronous-signal-handling branch from 4299902 to 56ea041 Compare April 30, 2025 08:23

Implement synchronous umbrella/worker signal handling

49778e7

jschmidt-icinga force-pushed the jschmidt/synchronous-signal-handling branch from 56ea041 to 49778e7 Compare April 30, 2025 12:53

Al2Klimov reviewed Apr 30, 2025

View reviewed changes

	auto sig = Application::GetPendingSignal(SIGUSR2);
	if (sig && (sig->si_pid == 0 \|\| sig->si_pid == pid)) {
	Log(LogNotice, "Application")
	<< "Worker process successfully loaded its config";
	break;
	}

	#ifndef _WIN32
	WorkerSignalHandler(2.5);
	#else
	Utility::Sleep(2.5);
	#endif /* _WIN32 */

Implement synchronous signal handling in umbrella/worker processes #10423

Are you sure you want to change the base?

Implement synchronous signal handling in umbrella/worker processes #10423

Uh oh!

Conversation

jschmidt-icinga commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Considerations

Implementation

Testing

Uh oh!

Al2Klimov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jschmidt-icinga commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Al2Klimov commented Apr 28, 2025

Uh oh!

Al2Klimov left a comment

Choose a reason for hiding this comment

Uh oh!

Al2Klimov Apr 29, 2025

Choose a reason for hiding this comment

Uh oh!

jschmidt-icinga Apr 30, 2025

Choose a reason for hiding this comment

Uh oh!

Al2Klimov Apr 30, 2025

Choose a reason for hiding this comment

Uh oh!

Al2Klimov Apr 30, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jschmidt-icinga commented Apr 30, 2025

Uh oh!

jschmidt-icinga commented Apr 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Al2Klimov left a comment

Choose a reason for hiding this comment

Uh oh!

jschmidt-icinga commented May 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Al2Klimov commented May 5, 2025

Uh oh!

jschmidt-icinga commented May 5, 2025

Uh oh!

Uh oh!

jschmidt-icinga commented Apr 28, 2025 •

edited

Loading

jschmidt-icinga commented Apr 28, 2025 •

edited

Loading

jschmidt-icinga commented Apr 30, 2025 •

edited

Loading

jschmidt-icinga commented May 5, 2025 •

edited

Loading