Alertmanager webhook behaviour when it is restarted with an active alert #4310
-
Hello, I'm using Prometheus connected to Alertmanager, and Alertmanager calls out to a webhook when a service goes up or down. What I am seeing is that if I have an active alert (I stop a service that Prometheus scrapes) which has been sent to the webhook, and I then restart Alertmanager, it behaves differently after the restart than if no alert had been active beforehand.

It seems to maintain state: if it is notified again about the alert that was active before the restart, it will not forward the firing alert to the webhook. However, when the service is restarted, the webhook is notified about the resolution of the alert. This suggests to me that Alertmanager is applying some rules to the old alert, even though that alert does not show up in the web interface after the restart.

If I instead stop Alertmanager, remove the nflog file, and restart Alertmanager, it behaves differently: it sends a firing notification to the webhook when the service is stopped. Most of this behaviour is as I would like, except that it doesn't seem to expire the alert at the endsAt time. The sequence is:
The sequence that behaves differently is:
The reason this is an issue is that I am maintaining counts of "firing" and "resolved" notifications, and in the first sequence above they get out of sync. Note that the actual use case is a restart of a machine with an active alert; I recreated the issue with just a restart of Alertmanager to make it easier to reproduce. Is this expected behaviour? Am I doing something wrong, and is it possible for an alert to be resolved at the endsAt time through a restart of Alertmanager? I am using Alertmanager 0.27.0 and Prometheus 2.45.4, running on Red Hat 9.5. Thank you for any help.
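For reference, the counting behind the webhook is along these lines. This is a minimal sketch, not my actual code: the payload shape follows Alertmanager's webhook JSON format (a top-level `alerts` array whose entries carry a `status` of `firing` or `resolved`), and the function name is illustrative.

```python
import json
from collections import Counter


def count_statuses(payload: str, counts: Counter) -> Counter:
    """Update firing/resolved counts from one Alertmanager webhook payload."""
    body = json.loads(payload)
    for alert in body.get("alerts", []):
        counts[alert["status"]] += 1
    return counts


# Example payloads in Alertmanager's webhook format (labels abridged).
firing = json.dumps({"alerts": [{"status": "firing",
                                 "labels": {"alertname": "InstanceDown"}}]})
resolved = json.dumps({"alerts": [{"status": "resolved",
                                   "labels": {"alertname": "InstanceDown"}}]})

counts = Counter()
count_statuses(firing, counts)
count_statuses(resolved, counts)
```

In the problem sequence, only the "resolved" notification arrives after the restart, so the "firing" count falls behind.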
-
I'm not 100% sure I understand. If we look at the first example:
As long as the alert didn't resolve between restarting Prometheus and Alertmanager, it is expected that Alertmanager will not re-send the alert. As far as Alertmanager is concerned, this is the same alert, and a notification for it was just sent before it was stopped. The notification will be re-sent after the repeat_interval has elapsed.
Yes, this will happen because you deleted the file that Alertmanager uses to track whether it needs to send a notification or not.
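For what it's worth, that interval is configurable per route in `alertmanager.yml`. A sketch of the relevant fragment (the receiver name and webhook URL are illustrative, not from your setup):

```yaml
route:
  receiver: my-webhook        # illustrative receiver name
  group_by: ['alertname']
  group_wait: 30s             # wait before the first notification for a group
  group_interval: 5m          # wait before notifying about changes to a group
  repeat_interval: 4h         # re-send a still-firing, unchanged notification after this long

receivers:
  - name: my-webhook
    webhook_configs:
      - url: http://localhost:9095/alerts   # illustrative webhook endpoint
```

Lowering `repeat_interval` would make the re-send after a restart happen sooner, at the cost of more repeated notifications in normal operation.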
-
But in case 1 above, if Alertmanager is aware that the alert was already sent, shouldn't it honour the endsAt time and send the resolved message to the webhook if it doesn't receive another alert from Prometheus?
Alertmanager keeps alerts in memory; however, as you found, it uses a file to track the last notification sent for each group of alerts. If you restart Alertmanager, it loses any alerts it had in memory. That means that even if the endsAt time has elapsed, Alertmanager won't know about it following a restart, because the alert will have been lost.

What happens is that Prometheus re-sends all of its alerts to Alertmanager at a regular interval, and this is how Alertmanager recovers its state. But since the alert never resolved in Prometheus, Alertmanager just sees the same alert firing as before the crash.
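If it helps to see where that re-send cadence comes from: on the Prometheus side it is controlled by a command-line flag (a configuration fragment, shown here with what I believe is the default; you don't normally need to change it):

```shell
# Prometheus re-sends its currently firing alerts to Alertmanager
# at this interval, which is how Alertmanager rebuilds its state.
prometheus --rules.alert.resend-delay=1m --config.file=prometheus.yml
```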
What people tend to do in this situation is run something called high availability mode. You can r…