make inhibitions part of the gossip #4315

Open
@OGKevin

tldr

If clustering is enabled, AM should be considered ready only once gossip has settled. This is currently not the case, and I'm wondering whether this is a bug or intended behavior.

Description

I'm investigating failed inhibitions and some other Alertmanager issues that we're facing, and I think that waiting for gossip to settle could reduce this kind of weird behavior when an instance is restarted.

When Alertmanager starts, it immediately reports itself as ready and starts serving traffic without having a complete picture of what is going on in the cluster. This can lead to unexpected behavior, such as inhibitions being missed or duplicate alerts being sent out.

Code Snippets

Here we can see that there is a method

// Settle waits until the mesh is ready (and sets the appropriate internal state when it is).
// The idea is that we don't want to start "working" before we get a chance to know most of the alerts and/or silences.
// Inspired from https://github.com/apache/cassandra/blob/7a40abb6a5108688fb1b10c375bb751cbb782ea4/src/java/org/apache/cassandra/gms/Gossiper.java
// This is clearly not perfect or strictly correct but should prevent the alertmanager to send notification before it is obviously not ready.
// This is especially important for those that do not have persistent storage.
func (p *Peer) Settle(ctx context.Context, interval time.Duration) {

that gets called in a goroutine during startup:
// Peer state listeners have been registered, now we can join and get the initial state.
if peer != nil {
	err = peer.Join(
		*reconnectInterval,
		*peerReconnectTimeout,
	)
	if err != nil {
		logger.Warn("unable to join gossip mesh", "err", err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), *settleTimeout)
	defer func() {
		cancel()
		if err := peer.Leave(10 * time.Second); err != nil {
			logger.Warn("unable to leave gossip mesh", "err", err)
		}
	}()
	go peer.Settle(ctx, *gossipInterval*10)
}

The cluster struct also provides a method that reports whether gossip has settled:

// Return true when router has settled.
func (p *Peer) Ready() bool {
	select {
	case <-p.readyc:
		return true
	default:
	}
	return false
}

However, this method is not checked during startup or by /-/ready.

Related Issues
