Description
What did you do?
Upgraded an Alertmanager cluster with 2 instances from 0.19.0 to 0.20.0.
Running in Kubernetes, with each instance in a different DC, communicating over NodePort.
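For context, each instance reaches its peer through the other DC's NodePort. The startup flags look roughly like this; the addresses, ports, and paths are placeholders, not our exact manifest:
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=<this-node-ip>:<nodeport> \
  --cluster.peer=<other-dc-node>:<nodeport>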
- Many alerts are firing from multiple Prometheus instances but are silenced in the Alertmanager. Everything is working fine.
- Upgraded instance A
- Wait for it to be ready
- Upgraded instance B
- Wait for it to be ready
- PagerDuty receives a notification for one of these silenced alerts from AM instance B.
- Checked the instance B UI: the alert was shown as active, but it also appeared under the silenced alerts.
- Tried to examine it using the API: the alert had an empty silencedBy list and its state was active, yet it was still returned even when the query was /api/v1/alerts?silenced=true (example queries below).
{
"labels": {
"alertname": "xxxUpAbsent",
"app_label": "xxx",
"cluster": "clusterA",
"locality": "localityA",
"namespace": "xxx",
"prometheus_type": "harvester",
"severity": "critical",
"sre": "true",
"team": "xxx"
},
"annotations": {
"description": "xxx's metric up is absent for at least tstaleness+4 minutes",
"playbook": "howto/k8s-apps-down.md",
"title": "missing metrics at all for xxx"
},
"startsAt": "2019-12-20T10:59:19.477329823+01:00",
"endsAt": "2020-01-08T13:04:59.477329823+01:00",
"generatorURL": "xxx",
"status": {
"state": "active",
"silencedBy": [],
"inhibitedBy": []
},
"receivers": [
"mattermost_production_alerts",
"pagerduty_critical"
],
"fingerprint": "fb03b13ba7d00405"
}
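For reference, the queries against instance B were along these lines (the host and port are placeholders):
# Alerts the instance considers silenced; the alert above came back from this
# query even though its status said "active" with an empty silencedBy list.
curl -s 'http://<instance-b>:9093/api/v1/alerts?silenced=true'

# The silence covering the alert's labels is present on this instance too.
curl -s 'http://<instance-b>:9093/api/v1/silences'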
- Tried the exact same API queries on instance A, and there the alert was marked as silenced, as expected:
{
"labels": {
"alertname": "xxxUpAbsent",
"app_label": "xxx",
"cluster": "clusterA",
"locality": "localityA",
"namespace": "xxx",
"prometheus_type": "harvester",
"severity": "critical",
"sre": "true",
"team": "xxx"
},
"annotations": {
"description": "xxx's metric up is absent for at least tstaleness+4 minutes",
"playbook": "howto/k8s-apps-down.md",
"title": "missing metrics at all for xxx"
},
"startsAt": "2019-12-20T10:59:19.477329823+01:00",
"endsAt": "2020-01-08T13:06:59.477329823+01:00",
"generatorURL": "xxxx",
"status": {
"state": "suppressed",
"silencedBy": [
"986be0d6-b135-4199-bed8-8d390fd6d288"
],
"inhibitedBy": null
}
}
- The alert is still firing and does not become suppressed.
- Simply editing the description of the silence that matches the alert and saving it on instance B resolves the whole issue (sketched below in API terms).
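In API terms the workaround looks roughly like the following. This is only a sketch that assumes the v2 silences endpoint accepts the fetched silence back as an update; the ID is the one from instance A's silencedBy above, and the host is a placeholder:
# Fetch the silence that matches the alert from instance B.
curl -s 'http://<instance-b>:9093/api/v2/silence/986be0d6-b135-4199-bed8-8d390fd6d288' > silence.json

# Tweak the comment and POST the silence back. This mimics "edit + save" in the
# UI and is enough to make instance B apply the silence to the alert again.
jq '.comment += " (resaved)"' silence.json \
  | curl -s -XPOST -H 'Content-Type: application/json' -d @- \
      'http://<instance-b>:9093/api/v2/silences'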
This is not the first time we have experienced this issue, but it does not happen every time. It mostly occurs when both instances of the cluster are restarted, so it is probably not specific to version 0.20.0.
We can also provide the nflog and silences files of both instances from the time of the issue, but we would prefer to share them privately.
What did you expect to see?
To not receive notifications for silenced alerts.
What did you see instead? Under which circumstances?
Received a notification for a continuously firing, silenced alert after the cluster upgrade.
Environment
On-premises Kubernetes, running on physical machines.
- System information:
Linux 5.3.0-24-generic x86_64
- Alertmanager version:
alertmanager, version 0.20.0 (branch: non-git, revision: non-git)
build user: root@runner-xxxxx
build date: 20200107-08:51:34
go version: go1.13.5
- Alertmanager configuration file:
global:
resolve_timeout: 1m
route:
receiver: blackhole
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
group_by: [alertname, locality, job, namespace, app, app_label, severity, deployment,
cluster, exported_namespace, slo_domain, slo_type, original_alertname]
# When a new group of alerts is created by an incoming alert, wait at
# least 'group_wait' to send the initial notification.
# This way ensures that you get multiple alerts for the same group that start
# firing shortly after another are batched together on the first
# notification.
group_wait: 30s
# When the first notification was sent, wait 'group_interval' to send a batch
# of new alerts that started firing for that group.
group_interval: 5m
# If an alert has successfully been sent, wait 'repeat_interval' to
# resend them.
repeat_interval: 8h
routes:
- match:
devops: true
receiver: mattermost_xxx
continue: true
routes:
- match:
severity: info
receiver: blackhole
- match:
sre: true
receiver: mattermost_xxx
routes:
- match:
alertname: PeriodicalMetaMonitoringAlert
prometheus_type: harvester
group_wait: 1s
group_interval: 1s
repeat_interval: 5m
receiver: healtcheck_io_harvester
continue: false
- match:
alertname: PeriodicalMetaMonitoringAlert
prometheus_type: thanos-rule
group_wait: 1s
group_interval: 1s
repeat_interval: 5m
receiver: healtcheck_io_thanos_rule
continue: false
# Filter out alerts which were fired due to metrics based on kube-state-metrics.
# Such metrics have namespace label containing namespace of kube-state-metrics
# and exported_namespace of actual resource.
- match_re:
exported_namespace: xxx
receiver: mattermost_xxx
continue: false
# Filter out xxx_namespace
- match_re:
namespace: <name>
receiver: mattermost_xxx
continue: false
# Send critical alerts to production channel, but continue for possible pagerduty match.
- match_re:
severity: ^(critical|warning)$
receiver: mattermost_xxx
continue: true
# Send info alerts to info channel
- match:
severity: info
receiver: mattermost_xxx
continue: true
# Creates gitlab issue from warning alert but continue to pagerduty
- match:
severity: warning
group_by: [alertname, app, app_label, namespace, severity, slo_domain, slo_type]
receiver: xxx
continue: true
repeat_interval: 7d
# Pagerduty match
- match:
severity: critical
receiver: pagerduty_critical
continue: false
repeat_interval: 10m
- match:
severity: warning
receiver: pagerduty_warning
continue: false
repeat_interval: 10m
# Info alerts are sent only when has channel label
- match:
severity: info
channel: ''
receiver: blackhole
continue: false
inhibit_rules:
- source_match:
sre: true
severity: critical
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
severity: warning
# Apply inhibition if the alertname is the same.
equal: [alertname, locality, job, namespace, app, app_label]
- source_match:
alertname: xxxReadinessFailing
app: xxx
sre: true
target_match:
alertname: ExternalHttpProbeFailing
sre: true
- source_match:
alertname: NoExternalBBEIsUp
sre: true
target_match:
alertname: ExternalBBEIsDown
sre: true
- source_match:
alertname: KubeDeploymentReadyOnlyOne
sre: true
target_match:
alertname: KubeDeploymentReadyNotAll
sre: true
equal: [locality, namespace, app]
- source_match_re:
alertname: High5xxRate(ByEndpoint)?
sre: true
app: xxx
target_match_re:
sre: true
alertname: High5xxRate(ByEndpoint)?
app: (xxx(-xxx)?|xxx|xxx)
equal: [locality, namespace, job, cluster]
- source_match_re:
alertname: High5xxRate(ByEndpoint)?
app: xxx
sre: true
target_match_re:
alertname: High5xxRate(ByEndpoint)?
app: (xxx-(xxx|xxx)|xxx)
sre: true
equal: [locality, namespace, job, cluster]
- source_match:
alertname: Inhibit0000to0600
target_match:
inhibit0000to0600: true
- source_match:
alertname: Inhibit1600to0800
target_match:
inhibit1600to0800: true
- source_match:
alertname: LocalityDisabled
target_match_re:
alertname: '[A-Za-z]+[Pp]roxyMetricsDoesNotChange'
equal: [locality]
receivers:
- name: pagerduty_critical
pagerduty_configs:
- service_key: xxxx
description: '{{ template "slack.title" . }}'
client: Sklik Devops SRE Alertmanager
client_url: '{{ template "slack.alertmanager.link" . }}'
details: {note: '{{ template "slack.text" . }}'}
component: '{{ .CommonLabels.app }}'
group: '{{ if .CommonLabels.namespace }}.{{ .CommonLabels.namespace }}{{ end }}{{
if .CommonLabels.cluster }}.{{ .CommonLabels.cluster }}{{ end }}{{ if .CommonLabels.locality
}}.{{ .CommonLabels.locality }}{{ end }}'
class: '{{.CommonLabels.alertname }}'
send_resolved: false
http_config:
proxy_url: http://proxy:xxx
- name: pagerduty_warning
pagerduty_configs:
- service_key: xxx
description: '{{ template "slack.title" . }}'
client: Sklik Devops SRE Alertmanager
client_url: '{{ template "slack.alertmanager.link" . }}'
#details: { note : '{{ template "slack.text" . }}'}
send_resolved: true
http_config:
proxy_url: http://proxy:xxx
- name: blackhole
# Deliberately left empty to not deliver anywhere.
- name: mattermost_production_alerts
slack_configs:
- api_url: https://xxx/hooks/xxx
channel: '{{ if eq .CommonLabels.channel "" }}prod-alert{{ else }}{{ .CommonLabels.channel
}}{{ end }}'
send_resolved: true
color: '{{ template "slack.color" . }}'
title: '{{ template "slack.title" . }}'
title_link: '{{ template "slack.link" . }}'
text: '{{ template "slack.text" . }}'
- name: mattermost_pre_production_alerts
slack_configs:
- api_url: https://xxx/hooks/xxx
channel: '{{ if eq .CommonLabels.channel "" }}preprod-alerts{{ else }}{{
.CommonLabels.channel }}{{ end }}'
send_resolved: true
color: '{{ template "slack.color" . }}'
title: '{{ template "slack.title" . }}'
title_link: '{{ template "slack.link" . }}'
text: '{{ template "slack.text" . }}'
- name: mattermost_info_alerts
slack_configs:
- api_url: https://xxx/hooks/xxx
channel: '{{ if eq .CommonLabels.channel "" }}info-alerts{{ else }}{{ .CommonLabels.channel
}}{{ end }}'
send_resolved: true
color: '{{ template "slack.color" . }}'
title: '{{ template "slack.title" . }}'
title_link: '{{ template "slack.link" . }}'
text: '{{ template "slack.text" . }}'
- name: mattermost_devops_production_alerts
slack_configs:
- api_url: https://xxx/hooks/xxx
channel: '{{ if eq .CommonLabels.channel "" }}devops-alerts{{ else }}{{ .CommonLabels.channel
}}{{ end }}'
send_resolved: true
color: '{{ template "slack.color" . }}'
title: '{{ template "slack.title" . }}'
title_link: '{{ template "slack.link" . }}'
text: '{{ template "slack.text" . }}'
- name: healtcheck_io_thanos_rule
webhook_configs:
- url: https://hchk.io/xxx
send_resolved: false
http_config:
proxy_url: http://proxy:xxx
bearer_token: '{{ template "slack.title" . }}'
- name: healtcheck_io_harvester
webhook_configs:
- url: https://hchk.io/xxx
send_resolved: false
http_config:
proxy_url: http://proxy:xxx
bearer_token: '{{ template "slack.title" . }}'
# Incident reporter - temporary URL for a development machine for now
- name: incident_reporter
webhook_configs:
- url: http://xxxx/
# Creates gitlab issue
- name: xxx
webhook_configs:
- send_resolved: false
url: http://xxx