
Silenced alert becomes active after cluster restart with silence still active #2158

Opened by @FUSAKLA

What did you do?

Upgraded an Alertmanager cluster of 2 instances from 0.19.0 to 0.20.0.
It runs in Kubernetes, with each instance in a different DC, communicating over NodePort.

  1. Many alerts are firing from multiple Prometheus instances but are silenced in Alertmanager. Everything is working fine.
  2. Upgraded instance A.
  3. Waited for it to be ready.
  4. Upgraded instance B.
  5. Waited for it to be ready.
  6. PagerDuty received a notification for one of these silenced alerts from AM instance B.
  7. Checked the instance B UI: the alert was shown as active, but it also appeared in the silenced alerts.
  8. Examined it using the API: the alert was active and its silencedBy list was empty, yet it was still returned even when querying /api/v1/alerts?silenced=true (see the query sketch after step 11 below).
The alert's JSON data from instance B:

    {
      "labels": {
        "alertname": "xxxUpAbsent",
        "app_label": "xxx",
        "cluster": "clusterA",
        "locality": "localityA",
        "namespace": "xxx",
        "prometheus_type": "harvester",
        "severity": "critical",
        "sre": "true",
        "team": "xxx"
      },
      "annotations": {
        "description": "xxx's metric up is absent for at least tstaleness+4 minutes",
        "playbook": "howto/k8s-apps-down.md",
        "title": "missing metrics at all for xxx"
      },
      "startsAt": "2019-12-20T10:59:19.477329823+01:00",
      "endsAt": "2020-01-08T13:04:59.477329823+01:00",
      "generatorURL": "xxx",
      "status": {
        "state": "active",
        "silencedBy": [],
        "inhibitedBy": []
      },
      "receivers": [
        "mattermost_production_alerts",
        "pagerduty_critical"
      ],
      "fingerprint": "fb03b13ba7d00405"
    }

  9. Tried the exact same API queries on instance A, and there the alert was marked as silenced, as expected.
The alert's JSON data from instance A:

    {
      "labels": {
        "alertname": "xxxUpAbsent",
        "app_label": "xxx",
        "cluster": "clusterA",
        "locality": "localityA",
        "namespace": "xxx",
        "prometheus_type": "harvester",
        "severity": "critical",
        "sre": "true",
        "team": "xxx"
      },
      "annotations": {
        "description": "xxx's metric up is absent for at least tstaleness+4 minutes",
        "playbook": "howto/k8s-apps-down.md",
        "title": "missing metrics at all for xxx"
      },
      "startsAt": "2019-12-20T10:59:19.477329823+01:00",
      "endsAt": "2020-01-08T13:06:59.477329823+01:00",
      "generatorURL": "xxxx",
      "status": {
        "state": "suppressed",
        "silencedBy": [
          "986be0d6-b135-4199-bed8-8d390fd6d288"
        ],
        "inhibitedBy": null
      }
    }

  10. The alert keeps firing and does not become suppressed.
  11. Simply editing the description of the silence that matches the alert and saving it on instance B resolves the whole issue.
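
For reference, the comparison from steps 8 and 9 boiled down to something like the following sketch (the instance URLs are placeholders for our real NodePort endpoints; the fingerprint is the one from the instance B JSON above):

    # Rough sketch of the check from steps 8 and 9; instance URLs are placeholders.
    import json
    import urllib.request

    INSTANCES = {
        "instance-a": "http://alertmanager-a:9093",
        "instance-b": "http://alertmanager-b:9093",
    }
    FINGERPRINT = "fb03b13ba7d00405"  # the affected alert from the JSON above

    for name, base_url in INSTANCES.items():
        # The alert is returned even with silenced=true, so we look at its status.
        url = f"{base_url}/api/v1/alerts?silenced=true"
        with urllib.request.urlopen(url) as resp:
            alerts = json.load(resp)["data"]

        for alert in alerts:
            if alert["fingerprint"] == FINGERPRINT:
                status = alert["status"]
                # Instance A reports state=suppressed with a non-empty silencedBy;
                # instance B reports state=active with silencedBy=[].
                print(f"{name}: state={status['state']} silencedBy={status['silencedBy']}")
                break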

This is not the first time we have run into this issue, but it does not happen every time; it mostly occurs when both instances of the cluster are restarted, so it does not seem specific to version 0.20.0.
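
For now, the workaround from step 11 is to re-save the affected silence. A minimal sketch of the same thing done against the v2 API, assuming the UI edit amounts to POSTing the silence back with its existing id (we have only verified the manual edit through the UI; the URL is a placeholder):

    # Sketch of the step 11 workaround via the v2 API; treat as an approximation,
    # we have only verified the manual edit through the UI.
    import json
    import urllib.request

    AM_B = "http://alertmanager-b:9093"  # placeholder for instance B
    SILENCE_ID = "986be0d6-b135-4199-bed8-8d390fd6d288"

    # Fetch the existing silence...
    with urllib.request.urlopen(f"{AM_B}/api/v2/silence/{SILENCE_ID}") as resp:
        silence = json.load(resp)

    # ...and post it back unchanged apart from a touched comment.
    payload = {
        "id": silence["id"],
        "matchers": silence["matchers"],
        "startsAt": silence["startsAt"],
        "endsAt": silence["endsAt"],
        "createdBy": silence["createdBy"],
        "comment": silence["comment"] + " (re-saved)",
    }
    req = urllib.request.Request(
        f"{AM_B}/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read().decode())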

We can also provide the nflog and silences files of both instances from the moment of the issue, but we would prefer to share them privately.
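
In the meantime, this is the kind of check we can run against both instances, showing what each reports for the silence in question via the v2 API (again a sketch with placeholder URLs):

    # Sketch: confirm the silence is present and active on both instances.
    # Instance URLs are placeholders for our real NodePort endpoints.
    import json
    import urllib.request

    INSTANCES = {
        "instance-a": "http://alertmanager-a:9093",
        "instance-b": "http://alertmanager-b:9093",
    }
    SILENCE_ID = "986be0d6-b135-4199-bed8-8d390fd6d288"  # from the instance A output above

    for name, base_url in INSTANCES.items():
        with urllib.request.urlopen(f"{base_url}/api/v2/silence/{SILENCE_ID}") as resp:
            silence = json.load(resp)
        print(f"{name}: state={silence['status']['state']} "
              f"endsAt={silence['endsAt']} matchers={silence['matchers']}")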

What did you expect to see?

Not to receive notifications for silenced alerts.

What did you see instead? Under which circumstances?

Received a notification for a continuously firing, silenced alert after the cluster upgrade.

Environment
On-premise Kubernetes, running on physical machines.

  • System information:

Linux 5.3.0-24-generic x86_64

  • Alertmanager version:

alertmanager, version 0.20.0 (branch: non-git, revision: non-git)
  build user:       root@runner-xxxxx
  build date:       20200107-08:51:34
  go version:       go1.13.5

  • Alertmanager configuration file:

global:
  resolve_timeout: 1m

route:
  receiver: blackhole

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: [alertname, locality, job, namespace, app, app_label, severity, deployment,
    cluster, exported_namespace, slo_domain, slo_type, original_alertname]

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 8h

  routes:
  - match:
      devops: true
    receiver: mattermost_xxx
    continue: true
    routes:
    - match:
        severity: info
      receiver: blackhole

  - match:
      sre: true
    receiver: mattermost_xxx
    routes:

    - match:
        alertname: PeriodicalMetaMonitoringAlert
        prometheus_type: harvester
      group_wait: 1s
      group_interval: 1s
      repeat_interval: 5m
      receiver: healtcheck_io_harvester
      continue: false

    - match:
        alertname: PeriodicalMetaMonitoringAlert
        prometheus_type: thanos-rule
      group_wait: 1s
      group_interval: 1s
      repeat_interval: 5m
      receiver: healtcheck_io_thanos_rule
      continue: false

        # Filter out alerts which were fired due to metrics based on kube-state-metrics.
        # Such metrics have namespace label containing namespace of kube-state-metrics
        # and exported_namespace of actual resource.
    - match_re:
        exported_namespace: xxx
      receiver: mattermost_xxx
      continue: false

        # Filter out xxx_namespace
    - match_re:
        namespace: <name>
      receiver: mattermost_xxx
      continue: false

        # Send critical alerts to production channel, but continue for possible pagerduty match.
    - match_re:
        severity: ^(critical|warning)$
      receiver: mattermost_xxx
      continue: true

        # Send info alerts to info channel
    - match:
        severity: info
      receiver: mattermost_xxx
      continue: true

        # Creates gitlab issue from warning alert but continue to pagerduty
    - match:
        severity: warning
      group_by: [alertname, app, app_label, namespace, severity, slo_domain, slo_type]
      receiver: xxx
      continue: true
      repeat_interval: 7d

        # Pagerduty match
    - match:
        severity: critical
      receiver: pagerduty_critical
      continue: false
      repeat_interval: 10m

    - match:
        severity: warning
      receiver: pagerduty_warning
      continue: false
      repeat_interval: 10m

        # Info alerts are sent only when they have a channel label
    - match:
        severity: info
        channel: ''
      receiver: blackhole
      continue: false



inhibit_rules:
- source_match:
    sre: true
    severity: critical

    # Matchers that have to be fulfilled in the alerts to be muted.
  target_match:
    severity: warning

    # Apply inhibition if the alertname is the same.
  equal: [alertname, locality, job, namespace, app, app_label]

- source_match:
    alertname: xxxReadinessFailing
    app: xxx
    sre: true
  target_match:
    alertname: ExternalHttpProbeFailing
    sre: true
- source_match:
    alertname: NoExternalBBEIsUp
    sre: true
  target_match:
    alertname: ExternalBBEIsDown
    sre: true
- source_match:
    alertname: KubeDeploymentReadyOnlyOne
    sre: true
  target_match:
    alertname: KubeDeploymentReadyNotAll
    sre: true
  equal: [locality, namespace, app]

- source_match_re:
    alertname: High5xxRate(ByEndpoint)?
    sre: true
    app: xxx

  target_match_re:
    sre: true
    alertname: High5xxRate(ByEndpoint)?
    app: (xxx(-xxx)?|xxx|xxx)

  equal: [locality, namespace, job, cluster]
- source_match_re:
    alertname: High5xxRate(ByEndpoint)?
    app: xxx
    sre: true

  target_match_re:
    alertname: High5xxRate(ByEndpoint)?
    app: (xxx-(xxx|xxx)|xxx)
    sre: true

  equal: [locality, namespace, job, cluster]
- source_match:
    alertname: Inhibit0000to0600
  target_match:
    inhibit0000to0600: true

- source_match:
    alertname: Inhibit1600to0800
  target_match:
    inhibit1600to0800: true

- source_match:
    alertname: LocalityDisabled

  target_match_re:
    alertname: '[A-Za-z]+[Pp]roxyMetricsDoesNotChange'

  equal: [locality]


receivers:
- name: pagerduty_critical
  pagerduty_configs:
  - service_key: xxxx
    description: '{{ template "slack.title" . }}'
    client: Sklik Devops SRE Alertmanager
    client_url: '{{ template "slack.alertmanager.link" . }}'
    details: {note: '{{ template "slack.text" . }}'}
    component: '{{ .CommonLabels.app }}'
    group: '{{ if .CommonLabels.namespace }}.{{ .CommonLabels.namespace }}{{ end }}{{
      if .CommonLabels.cluster }}.{{ .CommonLabels.cluster }}{{ end }}{{ if .CommonLabels.locality
      }}.{{ .CommonLabels.locality }}{{ end }}'
    class: '{{.CommonLabels.alertname }}'
    send_resolved: false
    http_config:
      proxy_url: http://proxy:xxx

- name: pagerduty_warning
  pagerduty_configs:
  - service_key: xxx
    description: '{{ template "slack.title" . }}'
    client: Sklik Devops SRE Alertmanager
    client_url: '{{ template "slack.alertmanager.link" . }}'
        #details: { note : '{{ template "slack.text" . }}'}
    send_resolved: true
    http_config:
      proxy_url: http://proxy:xxx

- name: blackhole
    # Deliberately left empty to not deliver anywhere.

- name: mattermost_production_alerts
  slack_configs:
  - api_url: https://xxx/hooks/xxx
    channel: '{{ if eq .CommonLabels.channel "" }}prod-alert{{ else }}{{ .CommonLabels.channel
      }}{{ end }}'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    title_link: '{{ template "slack.link" . }}'
    text: '{{ template "slack.text" . }}'

- name: mattermost_pre_production_alerts
  slack_configs:
  - api_url: https://xxx/hooks/xxx
    channel: '{{ if eq .CommonLabels.channel "" }}preprod-alerts{{ else }}{{
      .CommonLabels.channel }}{{ end }}'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    title_link: '{{ template "slack.link" . }}'
    text: '{{ template "slack.text" . }}'

- name: mattermost_info_alerts
  slack_configs:
  - api_url: https://xxx/hooks/xxx
    channel: '{{ if eq .CommonLabels.channel "" }}info-alerts{{ else }}{{ .CommonLabels.channel
      }}{{ end }}'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    title_link: '{{ template "slack.link" . }}'
    text: '{{ template "slack.text" . }}'

- name: mattermost_devops_production_alerts
  slack_configs:
  - api_url: https://xxx/hooks/xxx
    channel: '{{ if eq .CommonLabels.channel "" }}devops-alerts{{ else }}{{ .CommonLabels.channel
      }}{{ end }}'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    title_link: '{{ template "slack.link" . }}'
    text: '{{ template "slack.text" . }}'

- name: healtcheck_io_thanos_rule
  webhook_configs:
  - url: https://hchk.io/xxx
    send_resolved: false
    http_config:
      proxy_url: http://proxy:xxx
      bearer_token: '{{ template "slack.title" . }}'

- name: healtcheck_io_harvester
  webhook_configs:
  - url: https://hchk.io/xxx
    send_resolved: false
    http_config:
      proxy_url: http://proxy:xxx
      bearer_token: '{{ template "slack.title" . }}'

  # Incident reporter - temporary URL for a development machine for now
- name: incident_reporter
  webhook_configs:
  - url: http://xxxx/

  # Creates gitlab issue
- name: xxx
  webhook_configs:
  - send_resolved: false
    url: http://xxx
