- 
                Notifications
    
You must be signed in to change notification settings  - Fork 2.3k
 
feat(notify): make mute stage honor send_resolved #4634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Make mute stage honor `send_resolved` in receivers. This fixes cases where an alert was already sent to a receiver before being silenced. Sending resolved notifications for a now silenced alert helps update the receiver's status. For example an opened PagerDuty incident will be closed. Signed-off-by: Siavash Safi <[email protected]>
8546060    to
    e3bc55d      
    Compare
  
    | 
           Nice. I think the only real issue is that this is a reasonably impactful behavior change. So we should probably put it behind a flag as to not surprise users.  | 
    
| 
           Thanks for doing all this work. If possible, I think we should explore making this behavior a first class citizen of silences instead of having silences inherit  The main argument for this is that it keeps silences predictable. When we inherit behavior from downstream integrations, the the same silence now behaves differently in different aggregation groups, as its behaviour changes based on which integration is used. This is likely going to be really really confusing for operators as some silenced alerts will now send resolved notifications while others will not, or even the same silenced alert sends a resolved notification in one aggregation group but not another. By making this a property of the silence instead we can make it an explicit choice of the silence author, and it becomes auditable via the history of expired silences. What do you think?  | 
    
| 
           Sounds good to me, let me try and add this.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great idea! We have some hacky behavior in our internal fork that approximates what this PR does.
I actually think it's very useful for the Silence's send_resolve behavior to be tied to each integration. My main argument for this is that most integrations talk to stateful services, and it's really up to the integration whether resolved messages make sense. When they do make sense, I think it's much better if we make an effort to always deliver them.
The easiest example of this is pagerduty: when alertmanager sends a notification to pagerduty, it creates an incident. The incident isn't resolved until alertmanager sends a resolved notification. If a new alert is sent to pagerduty from the same group (e.g. with the same group_key), pagerduty will not notify a human again.
Today, the behavior of alertmanager silences interacts with this very poorly. If a firing alert gets silenced, it will never send a resolved notification to pagerduty and thus will never close the incident. This means that a future alert in the same group doesn't page until the incident is manually closed. This is such a problem that we run a separate service which periodically closes stale pagerduty incidents for our users.
I think this same argument can apply to most integrations, even if it's not a machine keeping track of state like in pagerduty. If users configure their MSTeams/slack/email integrations with send_resolved, it makes sense to always send those resolved messages, regardless of the silence itself.
| muter: m, | ||
| metrics: metrics, | ||
| notificationLog: notificationLog, | ||
| receivers: receivers, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem I see here is that we need to decide if we should send_resolved for all integrations when we make the decision in the mute stage. What if we changed the position of the MuteStage in the pipeline so that we get one stage per-integration? I think this would be a very simple change (just move construction of this stage into createReceiverStage.
In my mind, that simplifies things a little bit by explicitly making the MuteStage per-integration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the reasons I propose we make this a first class citizen of silences instead. It removes a lot of the ambiguity while still allowing integrations to make their own decisions for resolved notifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, though I'm still not sure I agree. Just to make sure I'm clear on the behavior, are you proposing that:
Silence.send_resolved: true
Integration.send_resolved: true
-> resolved notifications for silenced alerts (that already notified)
Silence.send_resolved: false
Integration.send_resolved: true
-> No resolved notifications for silenced alerts
Silence.send_resolved: true
Integration.send_resolved: false
-> No resolved notifications for silenced alerts
I'd supposed I'd be happy with this behavior, though I'd probably disallow non-send-resolve silences via tooling. That's all right, and it's good to give users the option both ways.
I guess the question here is whether it's more confusing for some silences to allow resolved notifications or for some integrations to always send resolved notifications, regardless of alert state.
My biggest concern is still for integrations that maintain state, like pagerduty. I'm pretty firmly of the belief that it's always wrong to skip sending a resolved notification to pagerduty because it'll cause the incident to become stale. This prevents future alerts from notifying a human. So I wouldn't want a user of our alertmanager cluster to accidentally create a silence that blocks send_resolved to pagerduty.
But really, I'd be happy enough with either version of this behavior... I really want this change 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh and addressing the technical comment: Whether or not this is a flag on the silence, we'd need to adjust the notification pipeline so that the MuteStage is applied per-integration (since send_resolved is a setting on the integration, not the receiver).
That should actually be as simple as moving the stage further into the pipeline. There's some performance impact, but I think it's pretty minimal. The notification pipeline isn't a big resource user and integration pipelines can happen concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I'm clear on the behavior, are you proposing that...
Yep, exactly this!
I guess the question here is whether it's more confusing for some silences to allow resolved notifications or for some integrations to always send resolved notifications, regardless of alert state.
Far more confusing for integrations to influence the behavior of silences (and if we extend this to inhibition rules, the behavior of inhibition rules too).
Whether or not this is a flag on the silence, we'd need to adjust the notification pipeline so that the MuteStage is applied per-integration
If we make this a flag on the silence, I think we avoid having to execute the MuteStage for each integration. Instead we can execute it for each aggregation group which is a big difference, since silences and inhibitions are computationally expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is much less complex! It's actually very similar to our internal fork's hack (except we control it via an always_send_resolved flag on the integration).
However, I don't think DedupStage ever checks that a resolved alert actually sent a non-resolved notification. It only checks if the resolved alert already sent a resolved notification. This means that silences with send_resolved will send resolved notifications for alerts that have been silenced for their entire lifecycle, which is much less useful.
The other problem is that the DedupStage currently passes groups that only contain resolved alerts (because we need to write the nflog entry for these groups whether or not we notify). Since the algorithm here will filter active alerts but not muted alerts for send_resolved silences, that'll probably cause some problems as well.
I think we could probably change the logic of the DebupStage to accommodate the new behavior, but it's not immediately obvious to me that it's better than checking the nflog in the MuteStage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I don't think DedupStage ever checks that a resolved alert actually sent a non-resolved notification
Doesn't that happen here?
There is an edge case if you have a non previously notified resolved alert grouped with other active alerts, in which case the non previously notified resolved alert will be included in the notification.
My answer to this edge case is that it already happens in the general case when you don't even use silences. I think the fix for that needs to be done in the DedupStage for it to cover all edge cases, and we can do it separately from this effort?
because we need to write the nflog entry for these groups whether or not we notify
I'm not following this edge case, can you share an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add, whatever we can do to keep the MuteStage clean is my preferred option. I really want to avoid two "DedupStages" where some nflog deduplication happens in the MuteStage and the rest of the nflog deduplication happens in the DedupStage. I would really like to keep the responsibilities separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't that happen here?
There is an edge case if you have a non previously notified resolved alert grouped with other active alerts, in which case the non previously notified resolved alert will be included in the notification.
Ah yeah, kinda. But as you say, this only stops the notification if the group did not have any firing alerts in the last during the last notification attempt. If it did, we'd send resolutions for all the alerts, even the ones which were not in the original group. It also has an edge case going to opposite direction - if some alert in the group is not silenced, it'll cause us to pass forward all silenced resolved alerts even if we otherwise shouldn't have.
I think the fix for that needs to be done in the DedupStage for it to cover all edge cases, and we can do it separately from this effort?
But perhaps you're right, this is already a problem in some cases today. We'll make it worse since it'll now happen for any group that has resolutions while it's silenced by a send_resolved: true silence. We could adjust the DedupStage so that it only passes resolved alerts that are actually a subset of the previous notification's firing alerts set. I think that'd solve the problem in the general case.
I'm not following this edge case, can you share an example?
This is part of what's handled by the same condition here https://github.com/prometheus/alertmanager/blob/main/notify/notify.go#L679
In order to re-notify after all alerts in a group are resolved, we need to write an nflog entry indicating that alerts have resolved. Otherwise, we need to wait for the nflog entry to expire (which could take a very long time).
The other half of the logic is in RetryStage: https://github.com/prometheus/alertmanager/blob/main/notify/notify.go#L790-L805. The key part being https://github.com/prometheus/alertmanager/blob/main/notify/notify.go#L798. This causes the SetNotifiesStage to see the 'resolution' message as sent and record it in the nflog. So this is all part of the same set of edge cases we're discussing.
Imagine this scenario:
- Alert 
alertname=x, instance=afires and sends a notification - User creates a silence with 
send_resolved: trueforalertname=x - Alert 
alertname=x, instance=bfires. No notification is sent because of the silence - Alert 
alertname=x, instance=bresolves. Since the silence hassend_resolved: true, the the user will get a resolved message. Since the message is filtered by theMuteStageto just resolved alerts, it will write an nflog entry that contains only resolved alerts. If the receiver is pagerduty, the incident will be closed by the resolution message. - The user lifts the silence
 - Since 
alertname=x, instance=ais still firing and the last nflog entry contains no firing alerts, it sends a duplicate notification on the next flush 
This particular case would be avoided if the DedupStage checked that the resolved alerts its forwarding are actually a subset of the previous firing alerts.
I actually think if we re-ordered the steps we could recreate this behavior for any of the proposed implementations... So that's not ideal. The problem here is that we don't record which alerts were muted in the nflog. The next flush isn't able to distinguish between "some alerts were resolved, but not all of them" and "all alerts were resolved".
Maybe we could work around that by limiting the send_resolved silence behavior to groups that only contain resolved alerts? This way we would only send a resolved notification for a silenced group if everything in the group is resolved. That's actually closer to the behavior for unsilenced groups since we only send resolved messages when the whole group is resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are versions of this sequence that can happen without send_resolved silences as well, but I think send_resolved makes it more likely to happen...
The DedupStage logic is actually pretty fragile when silences partially mute alerts...
| 
           @Spaceman1701 Yup, PagerDuty incidents is exactly the reason I raised this issue in the past.  | 
    
          
 @SuperQ 👍 this is why my preference is for alertmanager's contract to be "For integrations with   | 
    
          
 I think if that's a goal it may be worth a wider discussion in an issue as we also would need to address resolved notifications for these additional cases: 
 I can see that being a wider engineering effort that yourself and @siavashs could lead if you would like to!  | 
    
Make mute stage honor
send_resolvedin receivers. This fixes cases where an alert was already sent to a receiver before being silenced.Sending resolved notifications for a now silenced alert helps update the receiver's status.
For example an opened PagerDuty incident will be closed.
Related #226