Description
When a peer has left the cluster it's not completely forgotten, and Alertmanager still tries to reconnect to it.
Output from the "remaining" host in the cluster:
level=debug ts=2019-01-24T07:39:01.831635458Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:07:46.025185567Z caller=delegate.go:215 component=cluster received=NotifyLeave node=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:07:46.025307733Z caller=cluster.go:501 component=cluster msg="peer left" peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ
level=debug ts=2019-01-24T08:07:54.60162996Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:54 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:07:54.602236841Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:54 [DEBUG] memberlist: Stream connection from=100.99.2.1:60450\n"
level=debug ts=2019-01-24T08:07:54.606199938Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:07:57.611894452Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:57 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect: no route to host\n"
level=debug ts=2019-01-24T08:07:57.611952615Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:07:59.60283097Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:59 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:07:59.603411272Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:07:59 [DEBUG] memberlist: Stream connection from=100.99.2.1:60478\n"
level=debug ts=2019-01-24T08:07:59.604869096Z caller=cluster.go:450 component=cluster msg=refresh result=success addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:04.601560627Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:04 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:04.602427363Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:04 [DEBUG] memberlist: Stream connection from=100.99.2.1:60508\n"
level=debug ts=2019-01-24T08:08:04.604259465Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:07.609886083Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:07 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect: no route to host\n"
level=debug ts=2019-01-24T08:08:07.609945435Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:14.6018747Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:14 [DEBUG] memberlist: Stream connection from=100.99.2.1:60562\n"
level=debug ts=2019-01-24T08:08:14.601972126Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:14 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:14.603966313Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:14.605202972Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:14 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:14.605790684Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:14 [DEBUG] memberlist: Stream connection from=100.99.2.1:60566\n"
level=debug ts=2019-01-24T08:08:14.607740619Z caller=cluster.go:450 component=cluster msg=refresh result=success addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:17.207859317Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:17 [DEBUG] memberlist: Stream connection from=100.96.6.1:47826\n"
level=debug ts=2019-01-24T08:08:17.210823151Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01D1ZEE2HC3AAXWTCRSNV9MX9J addr=100.96.6.104:9094
level=debug ts=2019-01-24T08:08:17.609865224Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:17 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect: no route to host\n"
level=debug ts=2019-01-24T08:08:17.609909229Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:24.601744694Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:24 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:24.603611165Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:24 [DEBUG] memberlist: Stream connection from=100.99.2.1:60624\n"
level=debug ts=2019-01-24T08:08:24.607934174Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:27.215035735Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:27 [DEBUG] memberlist: Stream connection from=100.96.6.1:47856\n"
level=debug ts=2019-01-24T08:08:27.61386911Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:27 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect: no route to host\n"
level=debug ts=2019-01-24T08:08:27.613914234Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:29.603204984Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:29 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:29.604776566Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:29 [DEBUG] memberlist: Stream connection from=100.99.2.1:60652\n"
level=debug ts=2019-01-24T08:08:29.607991038Z caller=cluster.go:450 component=cluster msg=refresh result=success addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:34.601649401Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:34 [DEBUG] memberlist: Initiating push/pull sync with: 100.112.120.56:9094\n"
level=debug ts=2019-01-24T08:08:34.602089135Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:34 [DEBUG] memberlist: Stream connection from=100.99.2.1:60682\n"
level=debug ts=2019-01-24T08:08:34.615795533Z caller=cluster.go:406 component=cluster msg=reconnect result=success peer= addr=100.112.120.56:9094
level=debug ts=2019-01-24T08:08:37.215220886Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:37 [DEBUG] memberlist: Stream connection from=100.96.6.1:47876\n"
level=debug ts=2019-01-24T08:08:37.621912854Z caller=cluster.go:295 component=cluster memberlist="2019/01/24 08:08:37 [DEBUG] memberlist: Failed to join 100.99.6.139: dial tcp 100.99.6.139:9094: connect: no route to host\n"
level=debug ts=2019-01-24T08:08:37.621961912Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:41.56971564Z caller=nflog.go:334 component=nflog msg="Running maintenance"
level=debug ts=2019-01-24T08:08:41.57000306Z caller=silence.go:262 component=silences msg="Running maintenance"
level=debug ts=2019-01-24T08:08:41.577599624Z caller=silence.go:264 component=silences msg="Maintenance done" duration=7.597322ms size=0
level=debug ts=2019-01-24T08:08:41.577684375Z caller=nflog.go:336 component=nflog msg="Maintenance done" duration=7.971856ms size=12018
level=debug ts=2019-01-24T08:08:47.64798738Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:08:57.62191812Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:09:07.613923739Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
level=debug ts=2019-01-24T08:09:17.611984695Z caller=cluster.go:403 component=cluster msg=reconnect result=failure peer=01D1ZCRG9XF3NDNYNR3Z55PFWQ addr=100.99.6.139:9094
As a side effect, the metric alertmanager_cluster_reconnections_failed_total keeps increasing, which causes our Alertmanager alerts to fire.
- System information:
An OpenShift 3.10 cluster where the Alertmanager instances are deployed as separate DeploymentConfigs.
- Alertmanager version:
level=info ts=2019-01-24T07:38:41.535519959Z caller=main.go:177 msg="Starting Alertmanager" version="(version=0.16.0, branch=HEAD, revision=73bdd966e0055f3b828340ade3ef1f3a38169cdc)"
- Alertmanager configuration file:
kind: DeploymentConfig
apiVersion: apps.openshift.io/v1
metadata:
  labels:
    app: alertmanager
  name: alertmanager-00a
spec:
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: app
        image: quay.io/prometheus/alertmanager:v0.16.0
        args:
        - --cluster.listen-address=:9094
        - --cluster.peer=alertmanager-clustering:9094
        - --config.file=/etc/alertmanager/alertmanager.yaml
        - --log.level=debug
        - --storage.path=/data
        - --web.external-url=https://alertmanager.example.com
        - --web.listen-address=:9093
        ports:
        - containerPort: 9093
          name: alertmanager
          protocol: TCP
        - containerPort: 9094
          name: clustering
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/alertmanager
          name: config
        - mountPath: /data
          name: data
        - mountPath: /etc/alertmanager/templates/
          name: templates
---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: alertmanager
  name: alertmanager-clustering
spec:
  type: ClusterIP
  selector:
    app: alertmanager
  ports:
  - name: clustering
    port: 9094
    targetPort: clustering