Description
What did you do?
I run two Alertmanager instances in HA mode with the Opsgenie integration. Occasionally the configuration is updated and reloaded automatically on both instances at around the same time.
What did you expect to see?
No alert notifications to integrations are dropped.
What did you see instead? Under which circumstances?
Sometimes an alert remains open in Opsgenie even though it is resolved in Prometheus. After some investigation I traced the issue to the config reload.
If one AM reloads its configuration slightly before the other, there is no problem (this is what happens most of the time). But when the reloads on both instances line up closely enough, some notifications are dropped because the top-level context is canceled on both of them at once, so neither instance delivers the notification. (Logs attached.)
I worked around this by scheduling the config reloads at different times, so that at least one AM is "active" at any moment, but I was wondering whether some kind of graceful shutdown of integrations would be a good idea.
Environment
- Alertmanager version:
alertmanager, version 0.17.0 (branch: HEAD, revision: c7551cd75c414dc81df027f691e2eb21d4fd85b2)
build user: root@932a86a52b76
build date: 20190503-09:10:07
go version: go1.12.4
- Logs:
level=debug ts=2019-12-17T08:05:00.481869442Z caller=dispatch.go:264 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Post https://api.eu.opsgenie.com/v2/alerts: context canceled"
level=debug ts=2019-12-17T08:05:01.504148372Z caller=dispatch.go:264 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Post https://api.eu.opsgenie.com/v2/alerts: context canceled"
level=debug ts=2019-12-17T08:05:01.650700099Z caller=dispatch.go:264 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Post https://api.eu.opsgenie.com/v2/alerts: context canceled"