Description
What did you do?
- Run a single instance of Alertmanager:
rm -r alertmanager-data; alertmanager-0.25.0.linux-amd64/alertmanager --config.file=alertmanager.yml --web.listen-address=:9093 --cluster.listen-address=127.0.0.1:9094 --storage.path=alertmanager-data --log.level=debug
- Post a bunch of alerts to it:
for i in {1..20}; do alertmanager-0.25.0.linux-amd64/amtool alert add alertname=test label=$i --start $(date -u +'%Y-%m-%dT%H:%M:%SZ') --end 2024-01-01T00:00:00Z --alertmanager.url=http://localhost:9093/ --generator-url=http://localhost/ & done
- Wait for at least GroupWait so that the alerts are flushed
- Issue a config reload:
curl -XPOST http://localhost:9093/-/reload
What did you expect to see?
Alertmanager should respect RepeatInterval: it should not issue any notification until at least RepeatInterval has passed since the previous notification, and then it should issue an identical notification.
What did you see instead? Under which circumstances?
Instead, a notification is sent ~immediately after the config reload. This notification contains only a small subset of the active alerts, often just one or two. After GroupInterval, a full notification is sent with all active alerts.
This is pretty bad, especially if GroupInterval is large, because the partial notification hides almost all active alerts until the next flush.
Environment
- Alertmanager version:
$ alertmanager-0.25.0.linux-amd64/alertmanager --version
alertmanager, version 0.25.0 (branch: HEAD, revision: 258fab7cdd551f2cf251ed0348f0ad7289aee789)
build user: root@abe866dd5717
build date: 20221222-14:51:36
go version: go1.19.4
platform: linux/amd64
- Alertmanager configuration file:
$ cat alertmanager.yml
global:
  resolve_timeout: 500000s
route:
  group_by: ["alertname"]
  group_wait: 5s
  group_interval: 30s
  repeat_interval: 30h
  receiver: main
receivers:
  - name: main
Here, no notification configs are specified, so I use the "flushing alerts" log lines as an indication of the bug, but in production we've seen this issue occur with actual receivers and actual incorrect notifications being sent (and with as few as 3 alerts in the group).
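For anyone who wants to observe the spurious notification itself rather than the log lines, the receiver can be pointed at a local webhook. The URL below is a hypothetical listener, not part of my original setup:

receivers:
  - name: main
    webhook_configs:
      # Hypothetical local endpoint used only to capture the notification.
      - url: http://localhost:8080/notify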
Logs
Startup:
$ rm -r alertmanager-data; alertmanager-0.25.0.linux-amd64/alertmanager --config.file=alertmanager.yml --web.listen-address=:9093 --cluster.listen-address=127.0.0.1:9094 --storage.path=alertmanager-data --log.level=debug
ts=2023-06-27T16:45:12.003Z caller=main.go:240 level=info msg="Starting Alertmanager" version="(version=0.25.0, branch=HEAD, revision=258fab7cdd551f2cf251ed0348f0ad7289aee789)"
<...>
ts=2023-06-27T16:45:22.007Z caller=cluster.go:698 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.001910727s
After sending the alerts as described in step 2:
ts=2023-06-27T16:45:46.548Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d93e7a][active]
ts=2023-06-27T16:45:46.548Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d8de7a][active]
ts=2023-06-27T16:45:46.549Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[491a88a][active]
ts=2023-06-27T16:45:46.549Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d89e7a][active]
ts=2023-06-27T16:45:46.550Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d83e7a][active]
ts=2023-06-27T16:45:46.550Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d7de7a][active]
ts=2023-06-27T16:45:46.551Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d9de7a][active]
ts=2023-06-27T16:45:46.551Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[24dabea][active]
ts=2023-06-27T16:45:46.551Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48f688a][active]
ts=2023-06-27T16:45:46.552Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48f888a][active]
ts=2023-06-27T16:45:46.552Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48ec88a][active]
ts=2023-06-27T16:45:46.555Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[490088a][active]
ts=2023-06-27T16:45:46.555Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d91e7a][active]
ts=2023-06-27T16:45:46.555Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d7fe7a][active]
ts=2023-06-27T16:45:46.557Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48ee88a][active]
ts=2023-06-27T16:45:46.565Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48f288a][active]
ts=2023-06-27T16:45:46.565Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2da1e7a][active]
ts=2023-06-27T16:45:46.566Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d87e7a][active]
ts=2023-06-27T16:45:46.566Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[491e88a][active]
ts=2023-06-27T16:45:46.566Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48fc88a][active]
ts=2023-06-27T16:45:51.549Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}:{alertname=\"test\"}" msg=flushing alerts="[test[490088a][active] test[2d83e7a][active] test[2d87e7a][active] test[2d7de7a][active] test[2d7fe7a][active] test[2d91e7a][active] test[2d93e7a][active] test[2d89e7a][active] test[2d8de7a][active] test[2d9de7a][active] test[2da1e7a][active] test[48fc88a][active] test[24dabea][active] test[48f888a][active] test[48f688a][active] test[48f288a][active] test[48ee88a][active] test[48ec88a][active] test[491e88a][active] test[491a88a][active]]"
The alerts are correctly flushed after GroupWait=5s.
After reloading the config as described in step 4:
ts=2023-06-27T16:45:53.959Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=alertmanager.yml
ts=2023-06-27T16:45:53.960Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=alertmanager.yml
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d87e7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d89e7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d83e7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[24dabea][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48f888a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2da1e7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d8de7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d9de7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d91e7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}:{alertname=\"test\"}" msg=flushing alerts="[test[2d83e7a][active] test[2d87e7a][active] test[2d89e7a][active]]"
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d7fe7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48f288a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d93e7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[2d7de7a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48f688a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48ee88a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}:{alertname=\"test\"}" msg=flushing alerts="[test[2d83e7a][active] test[2d87e7a][active] test[2d91e7a][active] test[2d89e7a][active] test[2d8de7a][active] test[2d9de7a][active] test[2da1e7a][active] test[24dabea][active] test[48f888a][active]]"
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48fc88a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[491a88a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[48ec88a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[490088a][active]
ts=2023-06-27T16:45:53.963Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=test[491e88a][active]
ts=2023-06-27T16:46:23.964Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}:{alertname=\"test\"}" msg=flushing alerts="[test[490088a][active] test[2d83e7a][active] test[2d87e7a][active] test[2d7de7a][active] test[2d7fe7a][active] test[2d91e7a][active] test[2d93e7a][active] test[2d89e7a][active] test[2d8de7a][active] test[2d9de7a][active] test[2da1e7a][active] test[48fc88a][active] test[24dabea][active] test[48f888a][active] test[48f688a][active] test[48f288a][active] test[48ee88a][active] test[48ec88a][active] test[491e88a][active] test[491a88a][active]]"
ts=2023-06-27T16:46:53.965Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}:{alertname=\"test\"}" msg=flushing alerts="[test[490088a][active] test[2d83e7a][active] test[2d87e7a][active] test[2d7de7a][active] test[2d7fe7a][active] test[2d91e7a][active] test[2d93e7a][active] test[2d89e7a][active] test[2d8de7a][active] test[2d9de7a][active] test[2da1e7a][active] test[48fc88a][active] test[24dabea][active] test[48f888a][active] test[48f688a][active] test[48f288a][active] test[48ee88a][active] test[48ec88a][active] test[491e88a][active] test[491a88a][active]]"
<...>
Alerts are flushed right after the config reload, before the group state has been fully reconstructed, leading to an incomplete flush. The full flush only occurs GroupInterval=30s after the config reload.
In this case you can also see a second incomplete flush right after the config reload, so the correct flush only comes third. I couldn't figure out what determines whether this second incomplete flush appears; it seems random. If I increase the number of generated alerts to 100, I reliably get the second incomplete flush but, interestingly, no additional incomplete flushes beyond those two.
Root cause
My hypothesis (which I haven't tested) is that this happens because of this special case inside aggrGroup.insert (alertmanager/dispatch/dispatch.go, line 481 at commit 229ecb7):
// Immediately trigger a flush if the wait duration for this
// alert is already over.
ag.mtx.Lock()
defer ag.mtx.Unlock()
if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
	ag.next.Reset(0)
}
Since by the time the config is reloaded the alerts are no longer new, the very first re-inserted alert immediately triggers this condition, before the reload logic has had a chance to add all existing alerts back to the group. The result is a race between the group-state reconstruction performed on config reload and the flush logic in ag.run.
Notably, the first alert is inserted into the aggregation group even before the group's run loop is started (alertmanager/dispatch/dispatch.go, line 337 at commit 229ecb7).
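To make the suspected race concrete, here is a small self-contained Go model. It is not Alertmanager code; the group struct, names, and the millisecond spacing between inserts are invented for illustration. It only mirrors the quoted special case: the first re-inserted alert resets the group timer to zero, so the flush goroutine typically fires while the remaining alerts are still being re-added and only sees a prefix of the group.

package main

import (
	"fmt"
	"sync"
	"time"
)

// group is a toy stand-in for aggrGroup: a list of alerts plus a flush timer.
type group struct {
	mtx    sync.Mutex
	alerts []string
	next   *time.Timer
}

// insert records the alert and, like the special case quoted above, resets
// the flush timer to zero when the alert's GroupWait is already over (which
// is always the case for alerts re-inserted after a reload).
func (g *group) insert(name string, startedBeforeGroupWait bool) {
	g.mtx.Lock()
	defer g.mtx.Unlock()
	g.alerts = append(g.alerts, name)
	if startedBeforeGroupWait {
		g.next.Reset(0)
	}
}

// run waits for the timer and then "flushes" whatever alerts are present.
func (g *group) run(done chan struct{}) {
	<-g.next.C
	g.mtx.Lock()
	fmt.Printf("flushing %d of 20 alerts\n", len(g.alerts))
	g.mtx.Unlock()
	close(done)
}

func main() {
	g := &group{next: time.NewTimer(5 * time.Second)} // GroupWait = 5s
	done := make(chan struct{})
	go g.run(done)

	// Re-insert 20 pre-existing alerts, as the dispatcher does right after a
	// config reload. The very first insert resets the timer to zero; the tiny
	// sleep is artificial spacing to make the partial flush easy to reproduce.
	for i := 1; i <= 20; i++ {
		g.insert(fmt.Sprintf("test-%02d", i), true)
		time.Sleep(time.Millisecond)
	}
	<-done
}

Running this usually prints something like "flushing 1 of 20 alerts" or "flushing 2 of 20 alerts", which matches the small partial flush seen in the logs above.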
Possible solutions:
- Just remove this special case. I'm not sure about the original context and whether it still addresses a real pain point for users, but it's been in place for 8 years (4aa5dcc#diff-f6544e5b8846760723cf71073677dec6cd72e835928d3a31d88795c37e401d96R204), so that context may well have changed, and it clearly clashes with config reloads (or, I suppose, simply with two alerts arriving at the same time).
- Come up with a more principled way to reload configs, for example an intermediate stage that reconstructs the group state without actually issuing notifications.
That said, this hypothesis alone does not explain why there is sometimes a second incomplete flush, so it's possible something else is going on as well.
Related issue: #2492 (dup: #3037), which also happens due to the very blunt method of reloading configs that wipes existing state.