Alertmanager merges peers through IP instead of DNS #2295

Open
@devlucasc

What did you do?
I deployed Alertmanager on an AWS EKS cluster using the prometheus-operator Helm chart with 3 replicas.

What did you expect to see?
I expected alerts to be propagated and synchronized correctly between the pods.

What did you see instead? Under which circumstances?
Alerts are lost between pods when running more than one replica. The StatefulSet pods are started in parallel because podManagementPolicy is set to Parallel, but they do not always come up together. For example, if pod-0 starts last, pod-0 can communicate with pod-1 and pod-2, but not the other way around. The same happens when one pod goes down and a replacement comes up. As a result, the pods cannot join each other and start acting independently: the sync is lost, and alerts are duplicated when they are sent through the API using the DNS name configured in the Ingress.
I checked other issues and tested connectivity over both TCP and UDP. After switching the log level to debug, I found that Alertmanager resolves the peer DNS name to an IP and then keeps using that IP instead of the DNS name. Because EKS allocates private IPs to pods dynamically, the IP can belong to a pod that no longer exists, so the instance can no longer see the peer.
PodManagementPolicy is configured as parallel: here
I looked at issues 1261 and 1312.
I believe this issue is related to how Alertmanager resolves peers: it converts them to direct IP addresses instead of looking them up through Kubernetes DNS names such as svc.cluster.local.
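
To make the suspected failure mode concrete, here is a minimal sketch of the difference between resolving peers once and resolving them on every attempt. It is not Alertmanager's code and does not claim to match its internals; it only models the behavior described in this report. The class names, the `simulate_restart` helper, the hostnames, and the IPs are all hypothetical, and the `dns` dictionary stands in for a real DNS lookup.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical stand-in for a DNS lookup: hostname -> current IP.
Resolver = Callable[[str], str]


class ResolveOncePeers:
    """Resolves each peer hostname once and keeps the resulting IPs."""

    def __init__(self, hostnames: Iterable[str], resolve: Resolver) -> None:
        self.addresses: Dict[str, str] = {h: resolve(h) for h in hostnames}

    def reachable(self, live_ips: Set[str]) -> List[str]:
        # A peer is reachable only if its cached IP still belongs to a pod.
        return [h for h, ip in self.addresses.items() if ip in live_ips]


class ReResolvingPeers:
    """Keeps the hostnames and resolves them again on every check."""

    def __init__(self, hostnames: Iterable[str], resolve: Resolver) -> None:
        self.hostnames = list(hostnames)
        self.resolve = resolve

    def reachable(self, live_ips: Set[str]) -> List[str]:
        return [h for h in self.hostnames if self.resolve(h) in live_ips]


def simulate_restart():
    # Made-up hostnames and IPs, for illustration only.
    dns = {"am-1.example.svc": "10.0.0.11", "am-2.example.svc": "10.0.0.12"}
    resolve = lambda host: dns[host]

    cached = ResolveOncePeers(dns.keys(), resolve)
    fresh = ReResolvingPeers(dns.keys(), resolve)

    # am-1 is rescheduled and receives a new private IP; the old IP is gone.
    dns["am-1.example.svc"] = "10.0.0.42"
    live_ips = {"10.0.0.42", "10.0.0.12"}

    return cached.reachable(live_ips), fresh.reachable(live_ips)


if __name__ == "__main__":
    cached_view, fresh_view = simulate_restart()
    print("resolve once:", cached_view)
    print("re-resolve:  ", fresh_view)
```

In this model the resolve-once instance loses am-1 after the restart, while the re-resolving instance still sees both peers, which matches the one-directional connectivity described above.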

Environment
AWS EKS

  • System information:
    EKS - Kubernetes 1.15 using official docker image from quay.io

  • Alertmanager version:

alertmanager, version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d)
  build user:       root@00c3106655f8
  build date:       20191211-14:13:14
  go version:       go1.13.5
/bin/alertmanager \
  --config.file=/etc/alertmanager/config/alertmanager.yaml \
  --cluster.listen-address=[***.***.***.217]:9094 \
  --storage.path=/alertmanager \
  --data.retention=120h \
  --web.listen-address=:9093 \
  --web.external-url=http://redacted/ \
  --web.route-prefix=/ \
  --cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:9094 \
  --cluster.peer=alertmanager-prometheus-operator-alertmanager-1.alertmanager-operated.monitoring.svc:9094 \
  --cluster.peer=alertmanager-prometheus-operator-alertmanager-2.alertmanager-operated.monitoring.svc:9094
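
The `--cluster.peer` flags above follow the per-pod DNS naming of a StatefulSet behind a headless service: `<statefulset>-<ordinal>.<service>.<namespace>.svc:<port>`. The sketch below rebuilds them from those parts, which may help when checking that every replica is listed. The `peer_flags` function is a hypothetical helper written for this illustration, not part of Alertmanager or the Helm chart.

```python
from typing import List


def peer_flags(statefulset: str, service: str, namespace: str,
               replicas: int, port: int = 9094) -> List[str]:
    """Build one --cluster.peer flag per StatefulSet replica (hypothetical helper)."""
    return [
        f"--cluster.peer={statefulset}-{ordinal}.{service}.{namespace}.svc:{port}"
        for ordinal in range(replicas)
    ]


if __name__ == "__main__":
    # Values taken from the command line in this report.
    for flag in peer_flags(
        "alertmanager-prometheus-operator-alertmanager",
        "alertmanager-operated",
        "monitoring",
        replicas=3,
    ):
        print(flag)
```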
