feat(notify): make mute stage honor send_resolved #4634

siavashs · 2025-10-27T14:31:33Z

Make mute stage honor send_resolved in receivers. This fixes cases where an alert was already sent to a receiver before being silenced.
Sending resolved notifications for a now silenced alert helps update the receiver's status.
For example an opened PagerDuty incident will be closed.

Related #226

Make mute stage honor `send_resolved` in receivers. This fixes cases where an alert was already sent to a receiver before being silenced. Sending resolved notifications for a now silenced alert helps update the receiver's status. For example an opened PagerDuty incident will be closed. Signed-off-by: Siavash Safi <[email protected]>

SuperQ · 2025-10-28T08:56:09Z

Nice. I think the only real issue is that this is a reasonably impactful behavior change. So we should probably put it behind a flag as to not surprise users.

grobinson-grafana · 2025-10-29T08:28:52Z

Thanks for doing all this work. If possible, I think we should explore making this behavior a first class citizen of silences instead of having silences inherit send_resolved from downstream integrations.

The main argument for this is that it keeps silences predictable.

When we inherit behavior from downstream integrations, the the same silence now behaves differently in different aggregation groups, as its behaviour changes based on which integration is used.

This is likely going to be really really confusing for operators as some silenced alerts will now send resolved notifications while others will not, or even the same silenced alert sends a resolved notification in one aggregation group but not another.

By making this a property of the silence instead we can make it an explicit choice of the silence author, and it becomes auditable via the history of expired silences.

What do you think?

siavashs · 2025-10-29T11:46:42Z

Sounds good to me, let me try and add this.

Spaceman1701

I think this is a great idea! We have some hacky behavior in our internal fork that approximates what this PR does.

I actually think it's very useful for the Silence's send_resolve behavior to be tied to each integration. My main argument for this is that most integrations talk to stateful services, and it's really up to the integration whether resolved messages make sense. When they do make sense, I think it's much better if we make an effort to always deliver them.

The easiest example of this is pagerduty: when alertmanager sends a notification to pagerduty, it creates an incident. The incident isn't resolved until alertmanager sends a resolved notification. If a new alert is sent to pagerduty from the same group (e.g. with the same group_key), pagerduty will not notify a human again.

Today, the behavior of alertmanager silences interacts with this very poorly. If a firing alert gets silenced, it will never send a resolved notification to pagerduty and thus will never close the incident. This means that a future alert in the same group doesn't page until the incident is manually closed. This is such a problem that we run a separate service which periodically closes stale pagerduty incidents for our users.

I think this same argument can apply to most integrations, even if it's not a machine keeping track of state like in pagerduty. If users configure their MSTeams/slack/email integrations with send_resolved, it makes sense to always send those resolved messages, regardless of the silence itself.

Spaceman1701 · 2025-10-30T17:48:44Z

notify/notify.go

+		muter:           m,
+		metrics:         metrics,
+		notificationLog: notificationLog,
+		receivers:       receivers,


One problem I see here is that we need to decide if we should send_resolved for all integrations when we make the decision in the mute stage. What if we changed the position of the MuteStage in the pipeline so that we get one stage per-integration? I think this would be a very simple change (just move construction of this stage into createReceiverStage.

In my mind, that simplifies things a little bit by explicitly making the MuteStage per-integration.

This is one of the reasons I propose we make this a first class citizen of silences instead. It removes a lot of the ambiguity while still allowing integrations to make their own decisions for resolved notifications.

That's fair, though I'm still not sure I agree. Just to make sure I'm clear on the behavior, are you proposing that:

Silence.send_resolved: true Integration.send_resolved: true -> resolved notifications for silenced alerts (that already notified)

Silence.send_resolved: false Integration.send_resolved: true -> No resolved notifications for silenced alerts

Silence.send_resolved: true Integration.send_resolved: false -> No resolved notifications for silenced alerts

I'd supposed I'd be happy with this behavior, though I'd probably disallow non-send-resolve silences via tooling. That's all right, and it's good to give users the option both ways.

I guess the question here is whether it's more confusing for some silences to allow resolved notifications or for some integrations to always send resolved notifications, regardless of alert state.

My biggest concern is still for integrations that maintain state, like pagerduty. I'm pretty firmly of the belief that it's always wrong to skip sending a resolved notification to pagerduty because it'll cause the incident to become stale. This prevents future alerts from notifying a human. So I wouldn't want a user of our alertmanager cluster to accidentally create a silence that blocks send_resolved to pagerduty.

But really, I'd be happy enough with either version of this behavior... I really want this change 😄

Oh and addressing the technical comment: Whether or not this is a flag on the silence, we'd need to adjust the notification pipeline so that the MuteStage is applied per-integration (since send_resolved is a setting on the integration, not the receiver).

That should actually be as simple as moving the stage further into the pipeline. There's some performance impact, but I think it's pretty minimal. The notification pipeline isn't a big resource user and integration pipelines can happen concurrently.

Just to make sure I'm clear on the behavior, are you proposing that...

Yep, exactly this!

I guess the question here is whether it's more confusing for some silences to allow resolved notifications or for some integrations to always send resolved notifications, regardless of alert state.

Far more confusing for integrations to influence the behavior of silences (and if we extend this to inhibition rules, the behavior of inhibition rules too).

Whether or not this is a flag on the silence, we'd need to adjust the notification pipeline so that the MuteStage is applied per-integration

If we make this a flag on the silence, I think we avoid having to execute the MuteStage for each integration. Instead we can execute it for each aggregation group which is a big difference, since silences and inhibitions are computationally expensive.

That is much less complex! It's actually very similar to our internal fork's hack (except we control it via an always_send_resolved flag on the integration).

However, I don't think DedupStage ever checks that a resolved alert actually sent a non-resolved notification. It only checks if the resolved alert already sent a resolved notification. This means that silences with send_resolved will send resolved notifications for alerts that have been silenced for their entire lifecycle, which is much less useful.

The other problem is that the DedupStage currently passes groups that only contain resolved alerts (because we need to write the nflog entry for these groups whether or not we notify). Since the algorithm here will filter active alerts but not muted alerts for send_resolved silences, that'll probably cause some problems as well.

I think we could probably change the logic of the DebupStage to accommodate the new behavior, but it's not immediately obvious to me that it's better than checking the nflog in the MuteStage.

However, I don't think DedupStage ever checks that a resolved alert actually sent a non-resolved notification

Doesn't that happen here?

There is an edge case if you have a non previously notified resolved alert grouped with other active alerts, in which case the non previously notified resolved alert will be included in the notification.

My answer to this edge case is that it already happens in the general case when you don't even use silences. I think the fix for that needs to be done in the DedupStage for it to cover all edge cases, and we can do it separately from this effort?

because we need to write the nflog entry for these groups whether or not we notify

I'm not following this edge case, can you share an example?

Just to add, whatever we can do to keep the MuteStage clean is my preferred option. I really want to avoid two "DedupStages" where some nflog deduplication happens in the MuteStage and the rest of the nflog deduplication happens in the DedupStage. I would really like to keep the responsibilities separate.

Doesn't that happen here?
There is an edge case if you have a non previously notified resolved alert grouped with other active alerts, in which case the non previously notified resolved alert will be included in the notification.

Ah yeah, kinda. But as you say, this only stops the notification if the group did not have any firing alerts in the last during the last notification attempt. If it did, we'd send resolutions for all the alerts, even the ones which were not in the original group. It also has an edge case going to opposite direction - if some alert in the group is not silenced, it'll cause us to pass forward all silenced resolved alerts even if we otherwise shouldn't have.

I think the fix for that needs to be done in the DedupStage for it to cover all edge cases, and we can do it separately from this effort?

But perhaps you're right, this is already a problem in some cases today. We'll make it worse since it'll now happen for any group that has resolutions while it's silenced by a send_resolved: true silence. We could adjust the DedupStage so that it only passes resolved alerts that are actually a subset of the previous notification's firing alerts set. I think that'd solve the problem in the general case.

I'm not following this edge case, can you share an example?

This is part of what's handled by the same condition here https://github.com/prometheus/alertmanager/blob/main/notify/notify.go#L679

In order to re-notify after all alerts in a group are resolved, we need to write an nflog entry indicating that alerts have resolved. Otherwise, we need to wait for the nflog entry to expire (which could take a very long time).

The other half of the logic is in RetryStage: https://github.com/prometheus/alertmanager/blob/main/notify/notify.go#L790-L805. The key part being https://github.com/prometheus/alertmanager/blob/main/notify/notify.go#L798. This causes the SetNotifiesStage to see the 'resolution' message as sent and record it in the nflog. So this is all part of the same set of edge cases we're discussing.

Imagine this scenario:

Alert alertname=x, instance=a fires and sends a notification

User creates a silence with send_resolved: true for alertname=x

Alert alertname=x, instance=b fires. No notification is sent because of the silence

Alert alertname=x, instance=b resolves. Since the silence has send_resolved: true, the the user will get a resolved message. Since the message is filtered by the MuteStage to just resolved alerts, it will write an nflog entry that contains only resolved alerts. If the receiver is pagerduty, the incident will be closed by the resolution message.

The user lifts the silence

Since alertname=x, instance=a is still firing and the last nflog entry contains no firing alerts, it sends a duplicate notification on the next flush

This particular case would be avoided if the DedupStage checked that the resolved alerts its forwarding are actually a subset of the previous firing alerts.

I actually think if we re-ordered the steps we could recreate this behavior for any of the proposed implementations... So that's not ideal. The problem here is that we don't record which alerts were muted in the nflog. The next flush isn't able to distinguish between "some alerts were resolved, but not all of them" and "all alerts were resolved".

Maybe we could work around that by limiting the send_resolved silence behavior to groups that only contain resolved alerts? This way we would only send a resolved notification for a silenced group if everything in the group is resolved. That's actually closer to the behavior for unsilenced groups since we only send resolved messages when the whole group is resolved.

There are versions of this sequence that can happen without send_resolved silences as well, but I think send_resolved makes it more likely to happen...

The DedupStage logic is actually pretty fragile when silences partially mute alerts...

SuperQ · 2025-10-31T16:36:58Z

@Spaceman1701 Yup, PagerDuty incidents is exactly the reason I raised this issue in the past.

Spaceman1701 · 2025-10-31T16:52:14Z

@Spaceman1701 Yup, PagerDuty incidents is exactly the reason I raised this issue in the past.

@SuperQ 👍 this is why my preference is for alertmanager's contract to be "For integrations with send_resolved: true, If we sent a notification, we will send a resolved notification"

grobinson-grafana · 2025-10-31T17:24:07Z

@Spaceman1701 Yup, PagerDuty incidents is exactly the reason I raised this issue in the past.

@SuperQ 👍 this is why my preference is for alertmanager's contract to be "For integrations with send_resolved: true, If we sent a notification, we will send a resolved notification"

I think if that's a goal it may be worth a wider discussion in an issue as we also would need to address resolved notifications for these additional cases:

Muted by an inhibition rule. Example: you have a warning alert that fired first, then a critical alert. The critical alert inhibits the warning. Both resolved at the same time, but only the critical sends a resolved notification.
Muted by an active/mute time window. The alert fires outside the window, and then resolves within the window.
Alertmanager crashes/restarts just before a long group_interval flushes (i.e. multi hour group intervals).

I can see that being a wider engineering effort that yourself and @siavashs could lead if you would like to!

siavashs force-pushed the fix/silence-honor-send-resolved branch from 8546060 to e3bc55d Compare October 27, 2025 14:56

Spaceman1701 reviewed Oct 30, 2025

View reviewed changes

feat(notify): make mute stage honor send_resolved #4634

Are you sure you want to change the base?

feat(notify): make mute stage honor send_resolved #4634

Uh oh!

Conversation

siavashs commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SuperQ commented Oct 28, 2025

Uh oh!

grobinson-grafana commented Oct 29, 2025

Uh oh!

siavashs commented Oct 29, 2025

Uh oh!

Spaceman1701 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

grobinson-grafana Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

grobinson-grafana Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SuperQ commented Oct 31, 2025

Uh oh!

Spaceman1701 commented Oct 31, 2025

Uh oh!

grobinson-grafana commented Oct 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

siavashs commented Oct 27, 2025 •

edited

Loading

grobinson-grafana Oct 31, 2025 •

edited

Loading

grobinson-grafana Oct 31, 2025 •

edited

Loading