Description
tldr
If clustering is enabled, AM should be considered ready when gossip has settled, this is currently not the case. I'm wondering if this is a bug or intended behavior.
Description
I'm investigation failed inhibitions and some other alertmanager issues that we're facing, and think that waiting for gossip to settle might be an improvement that could reduce these weird behaviors when an instance is restarted.
When alertmanager starts, it immediately report itself as ready and starts serving traffic without having a complete picture of what is going on in the cluster. This can lead to unexpected behavior, such as inhibitions being missed, or duplicate alerts being send out etc.
Code Snippets
Here we can see that there is a method
alertmanager/cluster/cluster.go
Lines 667 to 672 in e252c2e
that gets called in a go routine during startup
alertmanager/cmd/alertmanager/main.go
Lines 326 to 343 in e252c2e
The cluster struct, also provides a method that reports if gossip has settled or not
alertmanager/cluster/cluster.go
Lines 582 to 590 in e252c2e
However, this method is not check during stratup nor during
/-/ready
.