Alertmanager webhook behaviour when it is restarted with an active alert #4310
-
Hello, I'm using Prometheus connected to Alertmanager, and Alertmanager calls out to a webhook when a service goes up or down. What I am seeing is that if I have an active alert (I stop a service that Prometheus scrapes) which has been sent to the webhook, and I then restart Alertmanager, it behaves differently after the restart than if no alert had been active beforehand.

It seems to maintain state: if it is notified again about the alert that was active before the restart, it will not forward the firing alert to the webhook. However, when the service is restarted, the webhook is notified about the resolution of the alert. This suggests to me that Alertmanager is applying some rules to the old alert, even though that alert does not show up in the web interface after the restart.

If I instead stop Alertmanager, remove the nflog file, and restart Alertmanager, it behaves differently: it sends a firing notification to the webhook when the service is stopped. Most of this behaviour is as I would like, except that it doesn't seem to expire the alert at the endsAt time. The sequence is:
The sequence that behaves differently is:
The reason this is an issue is that I am maintaining counts of "firing" and "resolved" notifications, and in the first sequence above they get out of sync. Note that the actual use case is a restart of a machine with an active alert; I recreated the issue with just a restart of Alertmanager to make it easier to reproduce. Is this expected behaviour? Am I doing something wrong, and is it possible for an alert to be resolved at the endsAt time through a restart of Alertmanager? I am using Alertmanager 0.27.0 and Prometheus 2.45.4, running on Red Hat 9.5. Thank you for any help.
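For reference, the counting behind the webhook is along these lines. This is a minimal sketch, not my actual code: the payload shape follows Alertmanager's webhook JSON format (a top-level `alerts` array whose entries carry a `status` of `firing` or `resolved`), and the function name is illustrative.

```python
import json
from collections import Counter


def count_statuses(payload: str, counts: Counter) -> Counter:
    """Update firing/resolved counts from one Alertmanager webhook payload."""
    body = json.loads(payload)
    for alert in body.get("alerts", []):
        counts[alert["status"]] += 1
    return counts


# Example payloads in Alertmanager's webhook format (labels abridged).
firing = json.dumps({"alerts": [{"status": "firing",
                                 "labels": {"alertname": "InstanceDown"}}]})
resolved = json.dumps({"alerts": [{"status": "resolved",
                                   "labels": {"alertname": "InstanceDown"}}]})

counts = Counter()
count_statuses(firing, counts)
count_statuses(resolved, counts)
```

In the problem sequence, only the "resolved" notification arrives after the restart, so the "firing" count falls behind.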
-
I'm not 100% sure I understand. If we look at the first example:
As long as the alert didn't resolve between restarting Prometheus and Alertmanager, it is expected that Alertmanager will not re-send the alert. As far as Alertmanager is concerned, this is the same alert, and a notification for it was just sent before it was stopped. The notification will be re-sent after the repeat_interval has elapsed.
Yes, this will happen because you deleted the file that Alertmanager uses to track whether it needs to send a notification or not.
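For what it's worth, that interval is configurable per route in `alertmanager.yml`. A sketch of the relevant fragment (the receiver name and webhook URL are illustrative, not from your setup):

```yaml
route:
  receiver: my-webhook        # illustrative receiver name
  group_by: ['alertname']
  group_wait: 30s             # wait before the first notification for a group
  group_interval: 5m          # wait before notifying about changes to a group
  repeat_interval: 4h         # re-send a still-firing, unchanged notification after this long

receivers:
  - name: my-webhook
    webhook_configs:
      - url: http://localhost:9095/alerts   # illustrative webhook endpoint
```

Lowering `repeat_interval` would make the re-send after a restart happen sooner, at the cost of more repeated notifications in normal operation.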
-
But in case 1 above, if Alertmanager is aware that the alert was already sent, shouldn't it honour the endsAt time and send the resolved message to the webhook if it doesn't receive another alert from Prometheus?
Alertmanager keeps alerts in memory; however, as you found, it uses a file to track the last notification sent for each group of alerts. If you restart Alertmanager, it loses any alerts it had in memory. That means that even if the endsAt time has elapsed, Alertmanager won't know about it following a restart, because the alert will have been lost.

What happens is that Prometheus re-sends all of its alerts to Alertmanager at a regular interval, and this is how Alertmanager recovers its state. But since the alert never resolved in Prometheus, Alertmanager just sees the same alert firing as before the crash.
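If it helps to see where that re-send cadence comes from: on the Prometheus side it is controlled by a command-line flag (a configuration fragment, shown here with what I believe is the default; you don't normally need to change it):

```shell
# Prometheus re-sends its currently firing alerts to Alertmanager
# at this interval, which is how Alertmanager rebuilds its state.
prometheus --rules.alert.resend-delay=1m --config.file=prometheus.yml
```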
What people tend to do in this situation is run something called high availability mode. You can r…