
Silenced alert becomes active after cluster restart with silence still active #2158

Opened by @FUSAKLA

What did you do?

Upgraded an Alertmanager cluster of 2 instances from 0.19.0 to 0.20.0.
It runs in Kubernetes, with each instance in a different DC, communicating over NodePort.

  1. Many alerts are firing from multiple Prometheus instances but are silenced in Alertmanager. Everything is working fine.
  2. Upgraded instance A.
  3. Waited for it to be ready.
  4. Upgraded instance B.
  5. Waited for it to be ready.
  6. PagerDuty received a notification for one of these silenced alerts from AM instance B.
  7. Checked the instance B UI: the alert was shown as active, but it also appeared in the silenced alerts.
  8. Examined it using the API: the alert was active and its silencedBy list was empty, yet it was still returned even when querying /api/v1/alerts?silenced=true (see the query sketch after step 11 below).
The alert's JSON data from instance B:

    {
      "labels": {
        "alertname": "xxxUpAbsent",
        "app_label": "xxx",
        "cluster": "clusterA",
        "locality": "localityA",
        "namespace": "xxx",
        "prometheus_type": "harvester",
        "severity": "critical",
        "sre": "true",
        "team": "xxx"
      },
      "annotations": {
        "description": "xxx's metric up is absent for at least tstaleness+4 minutes",
        "playbook": "howto/k8s-apps-down.md",
        "title": "missing metrics at all for xxx"
      },
      "startsAt": "2019-12-20T10:59:19.477329823+01:00",
      "endsAt": "2020-01-08T13:04:59.477329823+01:00",
      "generatorURL": "xxx",
      "status": {
        "state": "active",
        "silencedBy": [],
        "inhibitedBy": []
      },
      "receivers": [
        "mattermost_production_alerts",
        "pagerduty_critical"
      ],
      "fingerprint": "fb03b13ba7d00405"
    }

  9. Tried the exact same API queries on instance A, and there the alert was marked as silenced, as expected.
The alert's JSON data from instance A:

    {
      "labels": {
        "alertname": "xxxUpAbsent",
        "app_label": "xxx",
        "cluster": "clusterA",
        "locality": "localityA",
        "namespace": "xxx",
        "prometheus_type": "harvester",
        "severity": "critical",
        "sre": "true",
        "team": "xxx"
      },
      "annotations": {
        "description": "xxx's metric up is absent for at least tstaleness+4 minutes",
        "playbook": "howto/k8s-apps-down.md",
        "title": "missing metrics at all for xxx"
      },
      "startsAt": "2019-12-20T10:59:19.477329823+01:00",
      "endsAt": "2020-01-08T13:06:59.477329823+01:00",
      "generatorURL": "xxxx",
      "status": {
        "state": "suppressed",
        "silencedBy": [
          "986be0d6-b135-4199-bed8-8d390fd6d288"
        ],
        "inhibitedBy": null
      }
    }

  10. The alert keeps firing and does not become suppressed.
  11. Simply editing the description of the silence that matches the alert and saving it on instance B resolves the whole issue.
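
For reference, the comparison from steps 8 and 9 boiled down to something like the following sketch (the instance URLs are placeholders for our real NodePort endpoints; the fingerprint is the one from the instance B JSON above):

    # Rough sketch of the check from steps 8 and 9; instance URLs are placeholders.
    import json
    import urllib.request

    INSTANCES = {
        "instance-a": "http://alertmanager-a:9093",
        "instance-b": "http://alertmanager-b:9093",
    }
    FINGERPRINT = "fb03b13ba7d00405"  # the affected alert from the JSON above

    for name, base_url in INSTANCES.items():
        # The alert is returned even with silenced=true, so we look at its status.
        url = f"{base_url}/api/v1/alerts?silenced=true"
        with urllib.request.urlopen(url) as resp:
            alerts = json.load(resp)["data"]

        for alert in alerts:
            if alert["fingerprint"] == FINGERPRINT:
                status = alert["status"]
                # Instance A reports state=suppressed with a non-empty silencedBy;
                # instance B reports state=active with silencedBy=[].
                print(f"{name}: state={status['state']} silencedBy={status['silencedBy']}")
                break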

This is not the first time we have run into this issue, but it does not happen every time; it mostly occurs when both instances of the cluster are restarted, so it does not seem specific to version 0.20.0.
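
For now, the workaround from step 11 is to re-save the affected silence. A minimal sketch of the same thing done against the v2 API, assuming the UI edit amounts to POSTing the silence back with its existing id (we have only verified the manual edit through the UI; the URL is a placeholder):

    # Sketch of the step 11 workaround via the v2 API; treat as an approximation,
    # we have only verified the manual edit through the UI.
    import json
    import urllib.request

    AM_B = "http://alertmanager-b:9093"  # placeholder for instance B
    SILENCE_ID = "986be0d6-b135-4199-bed8-8d390fd6d288"

    # Fetch the existing silence...
    with urllib.request.urlopen(f"{AM_B}/api/v2/silence/{SILENCE_ID}") as resp:
        silence = json.load(resp)

    # ...and post it back unchanged apart from a touched comment.
    payload = {
        "id": silence["id"],
        "matchers": silence["matchers"],
        "startsAt": silence["startsAt"],
        "endsAt": silence["endsAt"],
        "createdBy": silence["createdBy"],
        "comment": silence["comment"] + " (re-saved)",
    }
    req = urllib.request.Request(
        f"{AM_B}/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read().decode())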

We can also provide the nflog and silences files of both instances from the moment of the issue, but we would prefer to share them privately.
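
In the meantime, this is the kind of check we can run against both instances, showing what each reports for the silence in question via the v2 API (again a sketch with placeholder URLs):

    # Sketch: confirm the silence is present and active on both instances.
    # Instance URLs are placeholders for our real NodePort endpoints.
    import json
    import urllib.request

    INSTANCES = {
        "instance-a": "http://alertmanager-a:9093",
        "instance-b": "http://alertmanager-b:9093",
    }
    SILENCE_ID = "986be0d6-b135-4199-bed8-8d390fd6d288"  # from the instance A output above

    for name, base_url in INSTANCES.items():
        with urllib.request.urlopen(f"{base_url}/api/v2/silence/{SILENCE_ID}") as resp:
            silence = json.load(resp)
        print(f"{name}: state={silence['status']['state']} "
              f"endsAt={silence['endsAt']} matchers={silence['matchers']}")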

What did you expect to see?

Not to receive notifications for silenced alerts.

What did you see instead? Under which circumstances?

Received a notification for a continuously firing, silenced alert after the cluster upgrade.

Environment
On-premise Kubernetes, running on physical machines.

  • System information:

Linux 5.3.0-24-generic x86_64

  • Alertmanager version:

alertmanager, version 0.20.0 (branch: non-git, revision: non-git)
  build user:       root@runner-xxxxx
  build date:       20200107-08:51:34
  go version:       go1.13.5

  • Alertmanager configuration file:

global:
  resolve_timeout: 1m

route:
  receiver: blackhole

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: [alertname, locality, job, namespace, app, app_label, severity, deployment,
    cluster, exported_namespace, slo_domain, slo_type, original_alertname]

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 8h

  routes:
  - match:
      devops: true
    receiver: mattermost_xxx
    continue: true
    routes:
    - match:
        severity: info
      receiver: blackhole

  - match:
      sre: true
    receiver: mattermost_xxx
    routes:

    - match:
        alertname: PeriodicalMetaMonitoringAlert
        prometheus_type: harvester
      group_wait: 1s
      group_interval: 1s
      repeat_interval: 5m
      receiver: healtcheck_io_harvester
      continue: false

    - match:
        alertname: PeriodicalMetaMonitoringAlert
        prometheus_type: thanos-rule
      group_wait: 1s
      group_interval: 1s
      repeat_interval: 5m
      receiver: healtcheck_io_thanos_rule
      continue: false

        # Filter out alerts which were fired due to metrics based on kube-state-metrics.
        # Such metrics have namespace label containing namespace of kube-state-metrics
        # and exported_namespace of actual resource.
    - match_re:
        exported_namespace: xxx
      receiver: mattermost_xxx
      continue: false

        # Filter out xxx_namespace
    - match_re:
        namespace: <name>
      receiver: mattermost_xxx
      continue: false

        # Send critical alerts to production channel, but continue for possible pagerduty match.
    - match_re:
        severity: ^(critical|warning)$
      receiver: mattermost_xxx
      continue: true

        # Send info alerts to info channel
    - match:
        severity: info
      receiver: mattermost_xxx
      continue: true

        # Creates gitlab issue from warning alert but continue to pagerduty
    - match:
        severity: warning
      group_by: [alertname, app, app_label, namespace, severity, slo_domain, slo_type]
      receiver: xxx
      continue: true
      repeat_interval: 7d

        # Pagerduty match
    - match:
        severity: critical
      receiver: pagerduty_critical
      continue: false
      repeat_interval: 10m

    - match:
        severity: warning
      receiver: pagerduty_warning
      continue: false
      repeat_interval: 10m

        # Info alerts are sent only when they have a channel label
    - match:
        severity: info
        channel: ''
      receiver: blackhole
      continue: false



inhibit_rules:
- source_match:
    sre: true
    severity: critical

    # Matchers that have to be fulfilled in the alerts to be muted.
  target_match:
    severity: warning

    # Apply inhibition if the alertname is the same.
  equal: [alertname, locality, job, namespace, app, app_label]

- source_match:
    alertname: xxxReadinessFailing
    app: xxx
    sre: true
  target_match:
    alertname: ExternalHttpProbeFailing
    sre: true
- source_match:
    alertname: NoExternalBBEIsUp
    sre: true
  target_match:
    alertname: ExternalBBEIsDown
    sre: true
- source_match:
    alertname: KubeDeploymentReadyOnlyOne
    sre: true
  target_match:
    alertname: KubeDeploymentReadyNotAll
    sre: true
  equal: [locality, namespace, app]

- source_match_re:
    alertname: High5xxRate(ByEndpoint)?
    sre: true
    app: xxx

  target_match_re:
    sre: true
    alertname: High5xxRate(ByEndpoint)?
    app: (xxx(-xxx)?|xxx|xxx)

  equal: [locality, namespace, job, cluster]
- source_match_re:
    alertname: High5xxRate(ByEndpoint)?
    app: xxx
    sre: true

  target_match_re:
    alertname: High5xxRate(ByEndpoint)?
    app: (xxx-(xxx|xxx)|xxx)
    sre: true

  equal: [locality, namespace, job, cluster]
- source_match:
    alertname: Inhibit0000to0600
  target_match:
    inhibit0000to0600: true

- source_match:
    alertname: Inhibit1600to0800
  target_match:
    inhibit1600to0800: true

- source_match:
    alertname: LocalityDisabled

  target_match_re:
    alertname: '[A-Za-z]+[Pp]roxyMetricsDoesNotChange'

  equal: [locality]


receivers:
- name: pagerduty_critical
  pagerduty_configs:
  - service_key: xxxx
    description: '{{ template "slack.title" . }}'
    client: Sklik Devops SRE Alertmanager
    client_url: '{{ template "slack.alertmanager.link" . }}'
    details: {note: '{{ template "slack.text" . }}'}
    component: '{{ .CommonLabels.app }}'
    group: '{{ if .CommonLabels.namespace }}.{{ .CommonLabels.namespace }}{{ end }}{{
      if .CommonLabels.cluster }}.{{ .CommonLabels.cluster }}{{ end }}{{ if .CommonLabels.locality
      }}.{{ .CommonLabels.locality }}{{ end }}'
    class: '{{.CommonLabels.alertname }}'
    send_resolved: false
    http_config:
      proxy_url: http://proxy:xxx

- name: pagerduty_warning
  pagerduty_configs:
  - service_key: xxx
    description: '{{ template "slack.title" . }}'
    client: Sklik Devops SRE Alertmanager
    client_url: '{{ template "slack.alertmanager.link" . }}'
        #details: { note : '{{ template "slack.text" . }}'}
    send_resolved: true
    http_config:
      proxy_url: http://proxy:xxx

- name: blackhole
    # Deliberately left empty to not deliver anywhere.

- name: mattermost_production_alerts
  slack_configs:
  - api_url: https://xxx/hooks/xxx
    channel: '{{ if eq .CommonLabels.channel "" }}prod-alert{{ else }}{{ .CommonLabels.channel
      }}{{ end }}'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    title_link: '{{ template "slack.link" . }}'
    text: '{{ template "slack.text" . }}'

- name: mattermost_pre_production_alerts
  slack_configs:
  - api_url: https://xxx/hooks/xxx
    channel: '{{ if eq .CommonLabels.channel "" }}preprod-alerts{{ else }}{{
      .CommonLabels.channel }}{{ end }}'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    title_link: '{{ template "slack.link" . }}'
    text: '{{ template "slack.text" . }}'

- name: mattermost_info_alerts
  slack_configs:
  - api_url: https://xxx/hooks/xxx
    channel: '{{ if eq .CommonLabels.channel "" }}info-alerts{{ else }}{{ .CommonLabels.channel
      }}{{ end }}'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    title_link: '{{ template "slack.link" . }}'
    text: '{{ template "slack.text" . }}'

- name: mattermost_devops_production_alerts
  slack_configs:
  - api_url: https://xxx/hooks/xxx
    channel: '{{ if eq .CommonLabels.channel "" }}devops-alerts{{ else }}{{ .CommonLabels.channel
      }}{{ end }}'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    title_link: '{{ template "slack.link" . }}'
    text: '{{ template "slack.text" . }}'

- name: healtcheck_io_thanos_rule
  webhook_configs:
  - url: https://hchk.io/xxx
    send_resolved: false
    http_config:
      proxy_url: http://proxy:xxx
      bearer_token: '{{ template "slack.title" . }}'

- name: healtcheck_io_harvester
  webhook_configs:
  - url: https://hchk.io/xxx
    send_resolved: false
    http_config:
      proxy_url: http://proxy:xxx
      bearer_token: '{{ template "slack.title" . }}'

  # Incident reporter - temporary URL for a development machine for now
- name: incident_reporter
  webhook_configs:
  - url: http://xxxx/

  # Creates gitlab issue
- name: xxx
  webhook_configs:
  - send_resolved: false
    url: http://xxx
