Alertmanagers in HA mode sometimes go out of sync #2384

Open
@shubhamc183

Description

I am running two Prometheus servers that send alerts to two Alertmanagers running in HA mode. The Alertmanagers sometimes go out of sync (peering is lost), and when that happens alerts are sent twice.

Sometimes only one peer is listed on the Alertmanager /status page (screenshot not included here).
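
One way to catch the lost-peer state without watching the status page is to poll each instance's status API. The sketch below is not part of the original report. It assumes the v2 endpoint `/api/v2/status` returns a JSON object with a `cluster.peers` list, and that two peers are expected; the function names are hypothetical.

```python
import json
import urllib.request

EXPECTED_PEERS = 2  # assumption: the cluster consists of two Alertmanagers


def peer_count(status: dict) -> int:
    """Return how many cluster peers a /api/v2/status payload lists."""
    cluster = status.get("cluster") or {}
    return len(cluster.get("peers") or [])


def fetch_status(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch and decode the status document from one Alertmanager."""
    url = f"{base_url}/api/v2/status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def degraded_instances(base_urls):
    """Return the base URLs that report fewer peers than EXPECTED_PEERS."""
    return [
        url for url in base_urls
        if peer_count(fetch_status(url)) < EXPECTED_PEERS
    ]
```

Calling `degraded_instances(["http://prod-prometheus01.prod:9093", "http://prod-prometheus02.prod:9093"])` from a cron job would return the instances that currently see fewer than two peers. The `alertmanager_cluster_members` metric should expose the same count if alerting on it from Prometheus is preferred.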

  • System information:
    Linux 4.14.106-97.85.amzn2.x86_64 x86_64

  • Alertmanager version:

alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user:       root@dee35927357f
  build date:       20200617-08:54:02
  go version:       go1.14.4
  • Alertmanager Args:
--config.file=/etc/alertmanager/alertmanager.yml
--storage.path=/alertmanager
--cluster.peer=prod-prometheus01.prod:9094
--cluster.peer=prod-prometheus02.prod:9094
  • Prometheus 1/2 Config:
global:
  scrape_interval: 60s # Scrape every 60 seconds (same as the 1-minute default).
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 30s # Default is 10 seconds
  external_labels:
    prometheus : PROD-STG
    replica: prod-prometheus01/02
alerting:
  alert_relabel_configs:
  - source_labels: [replica]
    regex: (.+?)\d+
    target_label: replica
  alertmanagers:
  - static_configs:
    - targets:
       - prod-prometheus01.prod:9093
       - prod-prometheus02.prod:9093
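
To illustrate what the `alert_relabel_configs` rule above is meant to do, here is a small Python emulation (not Prometheus code). It assumes that Prometheus anchors the relabel regex to the whole label value and that the default replacement `$1` applies, since the config sets none; `relabel_replica` is a hypothetical helper.

```python
import re

# Same pattern as in the config; fullmatch() emulates the anchoring
# that Prometheus is assumed to apply to relabel regexes.
REPLICA_RE = re.compile(r"(.+?)\d+")


def relabel_replica(value: str) -> str:
    """Strip the trailing digits from a replica label, as the rule intends."""
    match = REPLICA_RE.fullmatch(value)
    # When the regex does not match, the label is left unchanged.
    return match.group(1) if match else value


for replica in ("prod-prometheus01", "prod-prometheus02"):
    print(replica, "->", relabel_replica(replica))
```

Both replicas end up with the same `replica` value, so alerts from either Prometheus carry identical label sets. Alertmanager needs that to treat them as the same alert, so the duplicates reported here more likely come from the lost peering than from this rule.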
  • Alertmanager Log 1:
level=error ts=2020-10-01T16:55:22.073Z caller=api.go:660 component=api version=v2 path=/silence/08efc9d9-a6bf-4d7a-85f1-686f4a720264 method=GET msg="Failed to find silence" err=null id=08efc9d9-a6bf-4d7a-85f1-686f4a720264
level=error ts=2020-10-01T19:38:44.905Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
level=error ts=2020-10-02T16:47:53.317Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
level=error ts=2020-10-02T16:47:53.323Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
level=error ts=2020-10-05T06:06:41.295Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
  • Alertmanager Log 2:
level=error ts=2020-10-01T16:56:02.180Z caller=api.go:660 component=api version=v2 path=/silence/c855ced3-24e8-481b-9f65-30b3a8b1631e method=GET msg="Failed to find silence" err=null id=c855ced3-24e8-481b-9f65-30b3a8b1631e
level=error ts=2020-10-03T05:26:30.922Z caller=api.go:660 component=api version=v2 path=/silence/4013daf9-9f63-4ac8-955a-be33ab00fef3 method=GET msg="Failed to find silence" err=null id=4013daf9-9f63-4ac8-955a-be33ab00fef3
level=error ts=2020-10-03T05:28:41.603Z caller=api.go:660 component=api version=v2 path=/silence/dffc0882-ea51-45fc-95ca-e6c45dec2a83 method=GET msg="Failed to find silence" err=null id=dffc0882-ea51-45fc-95ca-e6c45dec2a83
level=error ts=2020-10-05T06:06:41.295Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
