-
Notifications
You must be signed in to change notification settings - Fork 586
Keep object locked until events are dispatched. #10372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Keep object locked until events are dispatched. #10372
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @w1ll-i-code, thanks the PR!
In general, this is exactly how it should be 👍. But since there are still a few boost signals that are triggered while holding the object lock, I'm not sure we won't end up with some mysterious deadlocks. Have you checked that the handlers of these signals aren't accessing something from the checkable
that requires locking in combination with some other mutex in-between? Something like a call to checkable->Get...()
and Get...()
locks another mutex first and then the olock
on the checkable. I know it's a lot to check every access to the checkable, but we need to make sure we don't accidentally introduce any deadlocks. The risk of such a thing happening can also be minimised if all handlers of these signals would only operate on the provided CheckResult
object (if any) instead of the checkable, e.g. instead of calling checkable->GetState()
, we would use cr->GetState()
and only access something on the checkable if the cr
object does not already provide it. Thank you again and I would be delighted if you could share your findings here.
@yhabteab Ok, correct me if I'm wrong, but there should not be any deadlocks possible: When the boost singal2 is called, all connected functions will be called sequentially, on the same thread. Since the object lock uses std::recursive_mutex, retaking the object lock on the same thread will not cause any deadlocks. If any of the connected functions for the boost signals dispatches work to another thread, then that thread will be waiting until the lock is released. I haven't seen any use of the actor patter that would suggest to me that any signal will wait for another thread to complete its task. If that's the case in a part of the code base I have not seen yet, I do apologize. |
Yes, that alone will not cause any deadlocks, but what I was trying to point out is that we should check all handlers of the boost signals that are triggered within the process check result to make sure that they do not access methods of the checkable that acquires multiple mutexes. I am not saying that there is such code, but make sure that there is none, especially if there is something like that somewhere e.g. in the ... Host::DoSomething()
{
// First lock some unrelated mutex and then...
std::unique_lock<std::mutex> lock(someOtherMutex);
//... do some other things here and then try to recursively acquire the object lock with
ObjectLock olock(this);
...
} If thread1 processes a check result for a particular host and locks the object lock on that host, another thread can still call |
@yhabteab I have scanned through the code and not found anything. But I'm not 100% sure I have not missed some niche interaction in C++ I'm not yet aware of. |
💡I'm wondering whether a dedicated per-checkable mutex for the whole ProcessCheckResult() call would fix your problem... Though, I'm 100% sure it won't deadlock. (It only could in case of nested ProcessCheckResult() for the same checkable which we don't do atm and even the imagination of this is scary.) |
Thanks! I will also try to verify this on my own and get back to you if I find something. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All relevant events that were previously triggered outside the object lock and now fall within this lock can be found as follows. I can't rule out the possibility that I've missed something, but this grep should list all relevant events and their subscribers. After the first round I couldn't find anything that would cause a problem with this PR, but I will run Icinga 2 with all components overnight and see tomorrow if something gets broken.
grep -ERn --include=\*.{hpp,cpp} --exclude-dir={test,prefix,doc} 'On(CommentRemoved|NewCheckResult|StateChange|EventCommandExecuted|AcknowledgementCleared|(HostProblem|State|Flapping|Reachability|RemovalInfo|StateBeforeSuppression|SuppressedNotifications)Changed|NotificationsRequested|NotificationSentTo(User|AllUsers)|LastNotifiedStatePerUser(Cleared|Updated))\.connect' .
See Results
./lib/compat/compatlogger.cpp:52: Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, const MessageOrigin::Ptr&) {
./lib/compat/compatlogger.cpp:55: Checkable::OnNotificationSentToUser.connect([this](const Notification::Ptr& notification, const Checkable::Ptr& checkable,
./lib/compat/compatlogger.cpp:63: Checkable::OnEventCommandExecuted.connect([this](const Checkable::Ptr& checkable) { EventCommandHandler(checkable); });
./lib/compat/compatlogger.cpp:65: Checkable::OnFlappingChanged.connect([this](const Checkable::Ptr& checkable, const Value&) { FlappingChangedHandler(checkable); });
./lib/notification/notificationcomponent.cpp:42: Checkable::OnNotificationsRequested.connect([this](const Checkable::Ptr& checkable, NotificationType type, const CheckResult::Ptr& cr,
./lib/perfdata/gelfwriter.cpp:93: m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/gelfwriter.cpp:97: m_HandleNotifications = Checkable::OnNotificationSentToUser.connect([this](const Notification::Ptr& notification,
./lib/perfdata/gelfwriter.cpp:102: m_HandleStateChanges = Checkable::OnStateChange.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/opentsdbwriter.cpp:84: m_HandleCheckResults = Service::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, const MessageOrigin::Ptr&) {
./lib/perfdata/perfdatawriter.cpp:56: m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/influxdbcommonwriter.cpp:100: m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/graphitewriter.cpp:98: m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/elasticsearchwriter.cpp:98: m_HandleCheckResults = Checkable::OnNewCheckResult.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/elasticsearchwriter.cpp:102: m_HandleStateChanges = Checkable::OnStateChange.connect([this](const Checkable::Ptr& checkable,
./lib/perfdata/elasticsearchwriter.cpp:106: m_HandleNotifications = Checkable::OnNotificationSentToAllUsers.connect([this](const Notification::Ptr& notification,
./lib/icingadb/icingadb-objects.cpp:87: Checkable::OnStateChange.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, StateType type, const MessageOrigin::Ptr&) {
./lib/icingadb/icingadb-objects.cpp:94: Checkable::OnAcknowledgementCleared.connect([](const Checkable::Ptr& checkable, const String& removedBy, double changeTime, const MessageOrigin::Ptr&) {
./lib/icingadb/icingadb-objects.cpp:98: Checkable::OnReachabilityChanged.connect([](const Checkable::Ptr& parent, const CheckResult::Ptr&, std::set<Checkable::Ptr>, const MessageOrigin::Ptr&) {
./lib/icingadb/icingadb-objects.cpp:121: Checkable::OnNotificationSentToAllUsers.connect([](
./lib/icingadb/icingadb-objects.cpp:130: Comment::OnCommentRemoved.connect(&IcingaDB::CommentRemovedHandler);
./lib/icingadb/icingadb-objects.cpp:134: Checkable::OnNewCheckResult.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr&, const MessageOrigin::Ptr&) {
./lib/icingadb/icingadb-objects.cpp:142: Service::OnHostProblemChanged.connect([](const Service::Ptr& service, const CheckResult::Ptr&, const MessageOrigin::Ptr&) {
./lib/db_ido/dbobject.cpp:39: ConfigObject::OnStateChanged.connect([](const ConfigObject::Ptr& object) { StateChangedHandler(object); });
./lib/db_ido/dbevents.cpp:31: Comment::OnCommentRemoved.connect([](const Comment::Ptr& comment) { DbEvents::RemoveComment(comment); });
./lib/db_ido/dbevents.cpp:39: Checkable::OnAcknowledgementCleared.connect([](const Checkable::Ptr& checkable, const String&, double, const MessageOrigin::Ptr&) {
./lib/db_ido/dbevents.cpp:44: Checkable::OnFlappingChanged.connect([](const Checkable::Ptr& checkable, const Value&) { FlappingChangedHandler(checkable); });
./lib/db_ido/dbevents.cpp:45: Checkable::OnNotificationSentToAllUsers.connect([](const Notification::Ptr& notification, const Checkable::Ptr& checkable,
./lib/db_ido/dbevents.cpp:67: Checkable::OnReachabilityChanged.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr,
./lib/db_ido/dbevents.cpp:80: Checkable::OnNotificationSentToAllUsers.connect([](const Notification::Ptr& notification, const Checkable::Ptr& checkable,
./lib/db_ido/dbevents.cpp:86: Checkable::OnStateChange.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, StateType type, const MessageOrigin::Ptr&) {
./lib/db_ido/dbevents.cpp:90: Checkable::OnNewCheckResult.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, const MessageOrigin::Ptr&) {
./lib/db_ido/dbevents.cpp:93: Checkable::OnNotificationSentToUser.connect([](const Notification::Ptr& notification, const Checkable::Ptr& checkable,
./lib/db_ido/dbevents.cpp:98: Checkable::OnFlappingChanged.connect([](const Checkable::Ptr& checkable, const Value&) {
./lib/db_ido/dbevents.cpp:107: Checkable::OnFlappingChanged.connect([](const Checkable::Ptr& checkable, const Value&) { DbEvents::AddFlappingChangedHistory(checkable); });
./lib/db_ido/dbevents.cpp:109: Checkable::OnNewCheckResult.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, const MessageOrigin::Ptr&) {
./lib/db_ido/dbevents.cpp:113: Checkable::OnEventCommandExecuted.connect([](const Checkable::Ptr& checkable) { DbEvents::AddEventHandlerHistory(checkable); });
./lib/icinga/clusterevents.cpp:48: Checkable::OnNewCheckResult.connect(&ClusterEvents::CheckResultHandler);
./lib/icinga/clusterevents.cpp:51: Checkable::OnStateBeforeSuppressionChanged.connect(&ClusterEvents::StateBeforeSuppressionChangedHandler);
./lib/icinga/clusterevents.cpp:52: Checkable::OnSuppressedNotificationsChanged.connect(&ClusterEvents::SuppressedNotificationsChangedHandler);
./lib/icinga/clusterevents.cpp:53: Notification::OnSuppressedNotificationsChanged.connect(&ClusterEvents::SuppressedNotificationTypesChangedHandler);
./lib/icinga/clusterevents.cpp:55: Notification::OnLastNotifiedStatePerUserUpdated.connect(&ClusterEvents::LastNotifiedStatePerUserUpdatedHandler);
./lib/icinga/clusterevents.cpp:56: Notification::OnLastNotifiedStatePerUserCleared.connect(&ClusterEvents::LastNotifiedStatePerUserClearedHandler);
./lib/icinga/clusterevents.cpp:59: Checkable::OnNotificationsRequested.connect(&ClusterEvents::SendNotificationsHandler);
./lib/icinga/clusterevents.cpp:60: Checkable::OnNotificationSentToUser.connect(&ClusterEvents::NotificationSentUserHandler);
./lib/icinga/clusterevents.cpp:61: Checkable::OnNotificationSentToAllUsers.connect(&ClusterEvents::NotificationSentToAllUsersHandler);
./lib/icinga/clusterevents.cpp:64: Checkable::OnAcknowledgementCleared.connect(&ClusterEvents::AcknowledgementClearedHandler);
./lib/icinga/clusterevents.cpp:66: Comment::OnRemovalInfoChanged.connect(&ClusterEvents::SetRemovalInfoHandler);
./lib/icinga/clusterevents.cpp:67: Downtime::OnRemovalInfoChanged.connect(&ClusterEvents::SetRemovalInfoHandler);
./lib/icinga/apievents.cpp:16: Checkable::OnNewCheckResult.connect(&ApiEvents::CheckResultHandler);
./lib/icinga/apievents.cpp:17: Checkable::OnStateChange.connect(&ApiEvents::StateChangeHandler);
./lib/icinga/apievents.cpp:18: Checkable::OnNotificationSentToAllUsers.connect(&ApiEvents::NotificationSentToAllUsersHandler);
./lib/icinga/apievents.cpp:20: Checkable::OnFlappingChanged.connect(&ApiEvents::FlappingChangedHandler);
./lib/icinga/apievents.cpp:23: Checkable::OnAcknowledgementCleared.connect(&ApiEvents::AcknowledgementClearedHandler);
./lib/icinga/apievents.cpp:26: Comment::OnCommentRemoved.connect(&ApiEvents::CommentRemovedHandler);
./build/lib/icinga/service-ti.cpp:186: ObjectImpl<Service>::OnStateChanged.connect(callback);
./build/lib/icinga/host-ti.cpp:168: ObjectImpl<Host>::OnStateChanged.connect(callback);
./build/lib/icinga/notification-ti.cpp:299: ObjectImpl<Notification>::OnSuppressedNotificationsChanged.connect(callback);
./build/lib/icinga/checkable-ti.cpp:535: ObjectImpl<Checkable>::OnSuppressedNotificationsChanged.connect(callback);
./build/lib/icinga/checkable-ti.cpp:541: ObjectImpl<Checkable>::OnStateBeforeSuppressionChanged.connect(callback);
./build/lib/icinga/checkable-ti.cpp:619: ObjectImpl<Checkable>::OnFlappingChanged.connect(callback);
./build/lib/icinga/checkresult-ti.cpp:202: ObjectImpl<CheckResult>::OnStateChanged.connect(callback);
PS: While you're at it, can you please rebase this against the latest master to get the GHAs pass, thanks!
911238f
to
1204d40
Compare
Actually, the grep command from my previous comment includes some events like grep -ERn --include=\*.{hpp,cpp} --exclude-dir={test,prefix,doc} 'On(CommentRemoved|NewCheckResult|StateChange|EventCommandExecuted|AcknowledgementCleared|(HostProblem|State|Flapping|Reachability|RemovalInfo)Changed|NotificationsRequested|NotificationSentTo(User|AllUsers)|LastNotifiedStatePerUser(Cleared|Updated))\.connect' . Today, after the second round of tracing these events, I was also unable to find any faulty code and my Icinga 2 daemon is still running fine. So I have nothing to complain about except for the two open comments above. |
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: William Calliari.
|
22bc462
to
b697dc8
Compare
b697dc8
to
ff64126
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LFTM! Thanks!
This PR now has some conflicts introduced with #10397 and needs rebasing! I already did that locally but it seems I'm not allowed to push to your PR, so can you (@w1ll-i-code) please either enable repository maintainer permissions on-existing (your)-pull-request, so that I can do it for you or rebase and resolve the conflicts by yourself! Thanks! |
fixes #10364
fixes #10366