Skip to content

Keep object locked until events are dispatched. #10372

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

w1ll-i-code
Copy link

@w1ll-i-code w1ll-i-code commented Mar 12, 2025

fixes #10364
fixes #10366

@cla-bot cla-bot bot added the cla/signed label Mar 12, 2025
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @w1ll-i-code, thanks the PR!

In general, this is exactly how it should be 👍. But since there are still a few boost signals that are triggered while holding the object lock, I'm not sure we won't end up with some mysterious deadlocks. Have you checked that the handlers of these signals aren't accessing something from the checkable that requires locking in combination with some other mutex in-between? Something like a call to checkable->Get...() and Get...() locks another mutex first and then the olock on the checkable. I know it's a lot to check every access to the checkable, but we need to make sure we don't accidentally introduce any deadlocks. The risk of such a thing happening can also be minimised if all handlers of these signals would only operate on the provided CheckResult object (if any) instead of the checkable, e.g. instead of calling checkable->GetState(), we would use cr->GetState() and only access something on the checkable if the cr object does not already provide it. Thank you again and I would be delighted if you could share your findings here.

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Mar 27, 2025

@yhabteab Ok, correct me if I'm wrong, but there should not be any deadlocks possible:

When the boost singal2 is called, all connected functions will be called sequentially, on the same thread. Since the object lock uses std::recursive_mutex, retaking the object lock on the same thread will not cause any deadlocks. If any of the connected functions for the boost signals dispatches work to another thread, then that thread will be waiting until the lock is released.

I haven't seen any use of the actor patter that would suggest to me that any signal will wait for another thread to complete its task. If that's the case in a part of the code base I have not seen yet, I do apologize.

@yhabteab
Copy link
Member

When the boost singal2 is called, all connected functions will be called sequentially, on the same thread. Since the object lock uses std::recursive_mutex, retaking the object lock on the same thread will not cause any deadlocks.

Yes, that alone will not cause any deadlocks, but what I was trying to point out is that we should check all handlers of the boost signals that are triggered within the process check result to make sure that they do not access methods of the checkable that acquires multiple mutexes. I am not saying that there is such code, but make sure that there is none, especially if there is something like that somewhere e.g. in the Host class:

... Host::DoSomething()
{
        // First lock some unrelated mutex and then...
	std::unique_lock<std::mutex> lock(someOtherMutex);

        //... do some other things here and then try to recursively acquire the object lock with
	ObjectLock olock(this);
        ...
}

If thread1 processes a check result for a particular host and locks the object lock on that host, another thread can still call host->DoSomething() from a completely different location and would acquire someOtherMutex and await the object lock to be released. However, if some of the boost signal handlers of ProcessCheckResult() also call host->DoSomething(), they will deadlock each other, since thread1 holds the object lock and wants to lock someOtherMutex and a random thread holds the someOtherMutex lock and wants to acquire the object lock. Again, I'm not saying that there is such a stupid code, but I can't tell you for sure that there isn't one either.

@w1ll-i-code
Copy link
Author

@yhabteab I have scanned through the code and not found anything. But I'm not 100% sure I have not missed some niche interaction in C++ I'm not yet aware of.

@Al2Klimov
Copy link
Member

💡

I'm wondering whether a dedicated per-checkable mutex for the whole ProcessCheckResult() call would fix your problem...

Though, I'm 100% sure it won't deadlock. (It only could in case of nested ProcessCheckResult() for the same checkable which we don't do atm and even the imagination of this is scary.)

@yhabteab
Copy link
Member

yhabteab commented Apr 8, 2025

@yhabteab I have scanned through the code and not found anything. But I'm not 100% sure I have not missed some niche interaction in C++ I'm not yet aware of.

Thanks! I will also try to verify this on my own and get back to you if I find something.

Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All relevant events that were previously triggered outside the object lock and now fall within this lock can be found as follows. I can't rule out the possibility that I've missed something, but this grep should list all relevant events and their subscribers. After the first round I couldn't find anything that would cause a problem with this PR, but I will run Icinga 2 with all components overnight and see tomorrow if something gets broken.

grep -ERn --include=\*.{hpp,cpp} --exclude-dir={test,prefix,doc} 'On(CommentRemoved|NewCheckResult|StateChange|EventCommandExecuted|AcknowledgementCleared|(HostProblem|State|Flapping|Reachability|RemovalInfo|StateBeforeSuppression|SuppressedNotifications)Changed|NotificationsRequested|NotificationSentTo(User|AllUsers)|LastNotifiedStatePerUser(Cleared|Updated))\.connect' .
See Results
./lib/compat/compatlogger.cpp:52:	Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, const MessageOrigin::Ptr&) {
./lib/compat/compatlogger.cpp:55:	Checkable::OnNotificationSentToUser.connect([this](const Notification::Ptr& notification, const Checkable::Ptr& checkable,
./lib/compat/compatlogger.cpp:63:	Checkable::OnEventCommandExecuted.connect([this](const Checkable::Ptr& checkable) { EventCommandHandler(checkable); });
./lib/compat/compatlogger.cpp:65:	Checkable::OnFlappingChanged.connect([this](const Checkable::Ptr& checkable, const Value&) { FlappingChangedHandler(checkable); });
./lib/notification/notificationcomponent.cpp:42:	Checkable::OnNotificationsRequested.connect([this](const Checkable::Ptr& checkable, NotificationType type, const CheckResult::Ptr& cr,
./lib/perfdata/gelfwriter.cpp:93:	m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/gelfwriter.cpp:97:	m_HandleNotifications = Checkable::OnNotificationSentToUser.connect([this](const Notification::Ptr& notification,
./lib/perfdata/gelfwriter.cpp:102:	m_HandleStateChanges = Checkable::OnStateChange.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/opentsdbwriter.cpp:84:	m_HandleCheckResults = Service::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, const MessageOrigin::Ptr&) {
./lib/perfdata/perfdatawriter.cpp:56:	m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/influxdbcommonwriter.cpp:100:	m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/graphitewriter.cpp:98:	m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/elasticsearchwriter.cpp:98:	m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/elasticsearchwriter.cpp:102:	m_HandleStateChanges = Checkable::OnStateChange.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/elasticsearchwriter.cpp:106:	m_HandleNotifications = Checkable::OnNotificationSentToAllUsers.connect([this](const Notification::Ptr& notification,
./lib/icingadb/icingadb-objects.cpp:87:	Checkable::OnStateChange.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, StateType type, const MessageOrigin::Ptr&) {
./lib/icingadb/icingadb-objects.cpp:94:	Checkable::OnAcknowledgementCleared.connect([](const Checkable::Ptr& checkable, const String& removedBy, double changeTime, const MessageOrigin::Ptr&) {
./lib/icingadb/icingadb-objects.cpp:98:	Checkable::OnReachabilityChanged.connect([](const Checkable::Ptr& parent, const CheckResult::Ptr&, std::set<Checkable::Ptr>, const MessageOrigin::Ptr&) {
./lib/icingadb/icingadb-objects.cpp:121:	Checkable::OnNotificationSentToAllUsers.connect([](
./lib/icingadb/icingadb-objects.cpp:130:	Comment::OnCommentRemoved.connect(&IcingaDB::CommentRemovedHandler);
./lib/icingadb/icingadb-objects.cpp:134:	Checkable::OnNewCheckResult.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr&, const MessageOrigin::Ptr&) {
./lib/icingadb/icingadb-objects.cpp:142:	Service::OnHostProblemChanged.connect([](const Service::Ptr& service, const CheckResult::Ptr&, const MessageOrigin::Ptr&) {
./lib/db_ido/dbobject.cpp:39:	ConfigObject::OnStateChanged.connect([](const ConfigObject::Ptr& object) { StateChangedHandler(object); });
./lib/db_ido/dbevents.cpp:31:	Comment::OnCommentRemoved.connect([](const Comment::Ptr& comment) { DbEvents::RemoveComment(comment); });
./lib/db_ido/dbevents.cpp:39:	Checkable::OnAcknowledgementCleared.connect([](const Checkable::Ptr& checkable, const String&, double, const MessageOrigin::Ptr&) {
./lib/db_ido/dbevents.cpp:44:	Checkable::OnFlappingChanged.connect([](const Checkable::Ptr& checkable, const Value&) { FlappingChangedHandler(checkable); });
./lib/db_ido/dbevents.cpp:45:	Checkable::OnNotificationSentToAllUsers.connect([](const Notification::Ptr& notification, const Checkable::Ptr& checkable,
./lib/db_ido/dbevents.cpp:67:	Checkable::OnReachabilityChanged.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr,
./lib/db_ido/dbevents.cpp:80:	Checkable::OnNotificationSentToAllUsers.connect([](const Notification::Ptr& notification, const Checkable::Ptr& checkable,
./lib/db_ido/dbevents.cpp:86:	Checkable::OnStateChange.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, StateType type, const MessageOrigin::Ptr&) {
./lib/db_ido/dbevents.cpp:90:	Checkable::OnNewCheckResult.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, const MessageOrigin::Ptr&) {
./lib/db_ido/dbevents.cpp:93:	Checkable::OnNotificationSentToUser.connect([](const Notification::Ptr& notification, const Checkable::Ptr& checkable,
./lib/db_ido/dbevents.cpp:98:	Checkable::OnFlappingChanged.connect([](const Checkable::Ptr& checkable, const Value&) {
./lib/db_ido/dbevents.cpp:107:	Checkable::OnFlappingChanged.connect([](const Checkable::Ptr& checkable, const Value&) { DbEvents::AddFlappingChangedHistory(checkable); });
./lib/db_ido/dbevents.cpp:109:	Checkable::OnNewCheckResult.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, const MessageOrigin::Ptr&) {
./lib/db_ido/dbevents.cpp:113:	Checkable::OnEventCommandExecuted.connect([](const Checkable::Ptr& checkable) { DbEvents::AddEventHandlerHistory(checkable); });
./lib/icinga/clusterevents.cpp:48:	Checkable::OnNewCheckResult.connect(&ClusterEvents::CheckResultHandler);
./lib/icinga/clusterevents.cpp:51:	Checkable::OnStateBeforeSuppressionChanged.connect(&ClusterEvents::StateBeforeSuppressionChangedHandler);
./lib/icinga/clusterevents.cpp:52:	Checkable::OnSuppressedNotificationsChanged.connect(&ClusterEvents::SuppressedNotificationsChangedHandler);
./lib/icinga/clusterevents.cpp:53:	Notification::OnSuppressedNotificationsChanged.connect(&ClusterEvents::SuppressedNotificationTypesChangedHandler);
./lib/icinga/clusterevents.cpp:55:	Notification::OnLastNotifiedStatePerUserUpdated.connect(&ClusterEvents::LastNotifiedStatePerUserUpdatedHandler);
./lib/icinga/clusterevents.cpp:56:	Notification::OnLastNotifiedStatePerUserCleared.connect(&ClusterEvents::LastNotifiedStatePerUserClearedHandler);
./lib/icinga/clusterevents.cpp:59:	Checkable::OnNotificationsRequested.connect(&ClusterEvents::SendNotificationsHandler);
./lib/icinga/clusterevents.cpp:60:	Checkable::OnNotificationSentToUser.connect(&ClusterEvents::NotificationSentUserHandler);
./lib/icinga/clusterevents.cpp:61:	Checkable::OnNotificationSentToAllUsers.connect(&ClusterEvents::NotificationSentToAllUsersHandler);
./lib/icinga/clusterevents.cpp:64:	Checkable::OnAcknowledgementCleared.connect(&ClusterEvents::AcknowledgementClearedHandler);
./lib/icinga/clusterevents.cpp:66:	Comment::OnRemovalInfoChanged.connect(&ClusterEvents::SetRemovalInfoHandler);
./lib/icinga/clusterevents.cpp:67:	Downtime::OnRemovalInfoChanged.connect(&ClusterEvents::SetRemovalInfoHandler);
./lib/icinga/apievents.cpp:16:	Checkable::OnNewCheckResult.connect(&ApiEvents::CheckResultHandler);
./lib/icinga/apievents.cpp:17:	Checkable::OnStateChange.connect(&ApiEvents::StateChangeHandler);
./lib/icinga/apievents.cpp:18:	Checkable::OnNotificationSentToAllUsers.connect(&ApiEvents::NotificationSentToAllUsersHandler);
./lib/icinga/apievents.cpp:20:	Checkable::OnFlappingChanged.connect(&ApiEvents::FlappingChangedHandler);
./lib/icinga/apievents.cpp:23:	Checkable::OnAcknowledgementCleared.connect(&ApiEvents::AcknowledgementClearedHandler);
./lib/icinga/apievents.cpp:26:	Comment::OnCommentRemoved.connect(&ApiEvents::CommentRemovedHandler);
./build/lib/icinga/service-ti.cpp:186:			ObjectImpl<Service>::OnStateChanged.connect(callback);
./build/lib/icinga/host-ti.cpp:168:			ObjectImpl<Host>::OnStateChanged.connect(callback);
./build/lib/icinga/notification-ti.cpp:299:			ObjectImpl<Notification>::OnSuppressedNotificationsChanged.connect(callback);
./build/lib/icinga/checkable-ti.cpp:535:			ObjectImpl<Checkable>::OnSuppressedNotificationsChanged.connect(callback);
./build/lib/icinga/checkable-ti.cpp:541:			ObjectImpl<Checkable>::OnStateBeforeSuppressionChanged.connect(callback);
./build/lib/icinga/checkable-ti.cpp:619:			ObjectImpl<Checkable>::OnFlappingChanged.connect(callback);
./build/lib/icinga/checkresult-ti.cpp:202:			ObjectImpl<CheckResult>::OnStateChanged.connect(callback);

PS: While you're at it, can you please rebase this against the latest master to get the GHAs pass, thanks!

@w1ll-i-code w1ll-i-code force-pushed the 10364-race-condition-while-calculating-object-state branch from 911238f to 1204d40 Compare April 10, 2025 08:26
@yhabteab
Copy link
Member

Actually, the grep command from my previous comment includes some events like OnFlappingChange, OnStateBeforeSuppressionChanged and OnSuppressedNotificationsChanged, which were also triggered while holding the object lock, even with master branch. So here is the new, customised command.

grep -ERn --include=\*.{hpp,cpp} --exclude-dir={test,prefix,doc} 'On(CommentRemoved|NewCheckResult|StateChange|EventCommandExecuted|AcknowledgementCleared|(HostProblem|State|Flapping|Reachability|RemovalInfo)Changed|NotificationsRequested|NotificationSentTo(User|AllUsers)|LastNotifiedStatePerUser(Cleared|Updated))\.connect' .

Today, after the second round of tracing these events, I was also unable to find any faulty code and my Icinga 2 daemon is still running fine. So I have nothing to complain about except for the two open comments above.

Copy link

cla-bot bot commented Apr 16, 2025

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: William Calliari.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot cla-bot bot removed the cla/signed label Apr 16, 2025
@w1ll-i-code w1ll-i-code force-pushed the 10364-race-condition-while-calculating-object-state branch from 22bc462 to b697dc8 Compare April 16, 2025 10:05
@cla-bot cla-bot bot added the cla/signed label Apr 16, 2025
@w1ll-i-code w1ll-i-code force-pushed the 10364-race-condition-while-calculating-object-state branch from b697dc8 to ff64126 Compare April 22, 2025 07:56
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LFTM! Thanks!

@yhabteab yhabteab added bug Something isn't working area/checks Check execution and results labels May 12, 2025
@yhabteab yhabteab added this to the 2.15.0 milestone May 12, 2025
@yhabteab yhabteab requested a review from julianbrost May 16, 2025 13:52
@yhabteab
Copy link
Member

This PR now has some conflicts introduced with #10397 and needs rebasing! I already did that locally but it seems I'm not allowed to push to your PR, so can you (@w1ll-i-code) please either enable repository maintainer permissions on-existing (your)-pull-request, so that I can do it for you or rebase and resolve the conflicts by yourself! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/checks Check execution and results bug Something isn't working cla/signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race-condition during object shutdown Race-condition while calculating the object state.
3 participants