Old peer is not forgotten properly #1722

Open
@zltyfsh

Description

When a peer has left the cluster it's not completely forgotten, and Alertmanager still tries to reconnect to it.

Output from the "remaining" host in the cluster:

level=debug ts=2019-01-24T07:39:01.831635458Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:07:46.025185567Z caller=delegate.go:215 component=cluster received=NotifyLeave node=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:07:46.025307733Z caller=cluster.go:501 component=cluster msg="peer left" peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ
level=debug ts=2019-01-24T08:07:54.60162996Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:54 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:07:54.602236841Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:54 [DEBUG] memberlist: Stream connection from=100.99.2.1:60450\n"
level=debug ts=2019-01-24T08:07:54.606199938Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:07:57.611894452Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:57 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect:no route to host\n"
level=debug ts=2019-01-24T08:07:57.611952615Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:07:59.60283097Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:59 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:07:59.603411272Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:59 [DEBUG] memberlist: Stream connection from=100.99.2.1:60478\n"
level=debug ts=2019-01-24T08:07:59.604869096Z caller=cluster.go:450 component=cluster msg=refresh result=success addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:04.601560627Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:04 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:04.602427363Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:04 [DEBUG] memberlist: Stream connection from=100.99.2.1:60508\n"
level=debug ts=2019-01-24T08:08:04.604259465Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:07.609886083Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:07 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect:no route to host\n"
level=debug ts=2019-01-24T08:08:07.609945435Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:14.6018747Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:14 [DEBUG] memberlist: Stream connection from=100.99.2.1:60562\n"
level=debug ts=2019-01-24T08:08:14.601972126Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:14 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:14.603966313Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:14.605202972Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:14 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:14.605790684Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:14 [DEBUG] memberlist: Stream connection from=100.99.2.1:60566\n"
level=debug ts=2019-01-24T08:08:14.607740619Z caller=cluster.go:450 component=cluster msg=refresh result=success addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:17.207859317Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:17 [DEBUG] memberlist: Stream connection from=100.96.6.1:47826\n"
level=debug ts=2019-01-24T08:08:17.210823151Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01D1ZEE2HC3AAXWTCRSNV9MX9J addr=100.96.6.104:9094
level=debug ts=2019-01-24T08:08:17.609865224Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:17 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect:no route to host\n"
level=debug ts=2019-01-24T08:08:17.609909229Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:24.601744694Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:24 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:24.603611165Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:24 [DEBUG] memberlist: Stream connection from=100.99.2.1:60624\n"
level=debug ts=2019-01-24T08:08:24.607934174Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:27.215035735Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:27 [DEBUG] memberlist: Stream connection from=100.96.6.1:47856\n"
level=debug ts=2019-01-24T08:08:27.61386911Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:27 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect: no route to host\n"
level=debug ts=2019-01-24T08:08:27.613914234Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:29.603204984Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:29 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:29.604776566Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:29 [DEBUG] memberlist: Stream connection from=100.99.2.1:60652\n"
level=debug ts=2019-01-24T08:08:29.607991038Z caller=cluster.go:450 component=cluster msg=refresh result=success addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:34.601649401Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:34 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:34.602089135Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:34 [DEBUG] memberlist: Stream connection from=100.99.2.1:60682\n"
level=debug ts=2019-01-24T08:08:34.615795533Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:37.215220886Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:37 [DEBUG] memberlist: Stream connection from=100.96.6.1:47876\n"
level=debug ts=2019-01-24T08:08:37.621912854Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:37 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect:no route to host\n"
level=debug ts=2019-01-24T08:08:37.621961912Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:41.56971564Z caller=nflog.go:334 component=nflog msg="Running maintenance"
level=debug ts=2019-01-24T08:08:41.57000306Z caller=silence.go:262 component=silences msg="Running maintenance"
level=debug ts=2019-01-24T08:08:41.577599624Z caller=silence.go:264 component=silences msg="Maintenance done" duration=7.597322ms size=0
level=debug ts=2019-01-24T08:08:41.577684375Z caller=nflog.go:336 component=nflog msg="Maintenance done" duration=7.971856ms size=12018
level=debug ts=2019-01-24T08:08:47.64798738Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:57.62191812Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:09:07.613923739Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:09:17.611984695Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094

As a side effect, the metric alertmanager_cluster_reconnections_failed_total keeps increasing, which causes our Alertmanager alerts to fire.
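To check the pattern in the debug output above mechanically, a small script can flag peers that still receive failed reconnect attempts after a NotifyLeave was logged for them. This is only a sketch for reading the log, not part of Alertmanager: the function names parse_logfmt and stale_reconnects are made up for illustration, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# key=value pairs; a value is either a double-quoted string or a run of
# non-space characters (possibly empty, as in "peer= addr=...").
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Parse one logfmt-style line into a dict (hypothetical helper)."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def stale_reconnects(lines):
    """Count failed reconnects to peers that were already reported as left."""
    left = set()
    failures = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("received") == "NotifyLeave":
            left.add(fields["node"])
        elif fields.get("received") == "NotifyJoin":
            left.discard(fields["node"])
        elif (fields.get("msg") == "reconnect"
              and fields.get("result") == "failure"
              and fields.get("peer") in left):
            failures[fields["peer"]] += 1
    return failures


# Sample lines taken from the log output in this report.
SAMPLE = [
    r'level=debug ts=2019-01-24T08:07:46.025185567Z caller=delegate.go:215 component=cluster received=NotifyLeave node=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094',
    r'level=debug ts=2019-01-24T08:07:46.025307733Z caller=cluster.go:501 component=cluster msg="peer left" peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ',
    r'level=debug ts=2019-01-24T08:07:54.606199938Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094',
    r'level=debug ts=2019-01-24T08:07:57.611952615Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094',
    r'level=debug ts=2019-01-24T08:08:07.609945435Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094',
]

if __name__ == "__main__":
    for peer, count in stale_reconnects(SAMPLE).items():
        print(f"{peer}: {count} failed reconnects after NotifyLeave")
```

Run against the full log, this should list only the departed peer 01D1ZCRG9XF3NDNYNR3Z55PFWQ, which matches the behaviour described here: the node is reported as left but remains a reconnect target.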

  • System information:

An OpenShift 3.10 cluster where the Alertmanager instances are deployed as separate DeploymentConfigs.

  • Alertmanager version:

level=info ts=2019-01-24T07:38:41.535519959Z caller=main.go:177 msg="Starting Alertmanager" version="(version=0.16.0, branch=HEAD, revision=73bdd966e0055f3b828340ade3ef1f3a38169cdc)"

  • Alertmanager configuration file:

kind: DeploymentConfig
apiVersion: apps.openshift.io/v1
metadata:
  labels:
    app: alertmanager
  name: alertmanager-00a
spec:
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: app
        image: quay.io/prometheus/alertmanager:v0.16.0
        args:
        - --cluster.listen-address=:9094
        - --cluster.peer=alertmanager-clustering:9094
        - --config.file=/etc/alertmanager/alertmanager.yaml
        - --log.level=debug
        - --storage.path=/data
        - --web.external-url=https://alertmanager.example.com
        - --web.listen-address=:9093
        ports:
        - containerPort: 9093
          name: alertmanager
          protocol: TCP
        - containerPort: 9094
          name: clustering
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/alertmanager
          name: config
        - mountPath: /data
          name: data
        - mountPath: /etc/alertmanager/templates/
          name: templates
---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: alertmanager
  name: alertmanager-clustering
spec:
  type: ClusterIP
  selector:
    app: alertmanager
  ports:
  - name: clustering
    port: 9094
    targetPort: clustering
