Description
We are observing a bug where three of the default Prometheus/Alertmanager alert rules are firing when they shouldn't be. These alerts are:
- KubeProxyDown
- KubeAPIDown
- KubeletDown
Screenshot from Alertmanager (image not reproduced here).
We have not been able to track down why these default alerts are firing. We haven't upgraded the kube-prometheus-stack Helm chart in quite a while, though we have performed GKE control plane and node pool upgrades within the last couple of months.
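For context, these three alerts in the chart's bundled kubernetes-mixin rules are all absent()-based, i.e. they fire when Prometheus no longer has any up series for the corresponding scrape job. Approximate definitions, paraphrased from the mixin (the exact for/severity values shipped with 47.6.1 may differ slightly):
# Paraphrased from the kubernetes-mixin rules bundled with the chart; not copied from the 47.6.1 output.
- alert: KubeAPIDown
  expr: absent(up{job="apiserver"} == 1)
  for: 15m
  labels:
    severity: critical
- alert: KubeletDown
  expr: absent(up{job="kubelet", metrics_path="/metrics"} == 1)
  for: 15m
  labels:
    severity: critical
- alert: KubeProxyDown
  expr: absent(up{job="kube-proxy"} == 1)
  for: 15m
  labels:
    severity: critical
If that is what is happening here, the Prometheus targets page should show the apiserver, kubelet, and kube-proxy jobs either missing or failing to scrape after the GKE upgrades.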
What's your helm version?
3.17.3
What's your kubectl version?
1.32.8
Which chart?
kube-prometheus-stack
What's the chart version?
47.6.1
What happened?
We install kube-prometheus-stack version 47.6.1 via Helmfile, with some modified values, against a GKE 1.32.9 Kubernetes cluster, and the default KubeAPIDown, KubeProxyDown, and KubeletDown alerts are firing.
What you expected to happen?
These default alerts should not be firing.
How to reproduce it?
No response
Enter the changed values of values.yaml?
global:
imageRegistry: "us-docker.pkg.dev/devtools-1/taulia-docker"
alertmanager:
## Alertmanager configuration directives
## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
## https://prometheus.io/webtools/alerting/routing-tree-editor/
victorOps:
apiKey: <omitted for security>
# ## creates a victorOps receiver for each routing key. determines who gets paged
# ##
# routingKeys: []
config:
global:
slack_api_url: "<omitted>"
victorops_api_url: "<omitted>"
## mute alerts when other alerts are already firing
##
inhibit_rules:
## Ignore bursty, non-critical workloads. ex: mysql-operator exporter
##
- target_matchers: [alertname="CPUThrottlingHigh", container="metrics-exporter"]
source_matchers: [alertname="Watchdog"] # this alert is always firing
equal: ["prometheus"] # label shared by Watchdog and target
receivers:
- name: slack
slack_configs:
- send_resolved: true
channel: ttt-alerts-{{ .Values.jxRequirements.cluster.clusterName }}
username: '{{`{{ template "taulia.slack.default.username" . }}`}}'
color: '{{`{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}`}}'
title: '{{`{{ template "taulia.slack.default.title" . }}`}}'
title_link: '{{`{{ template "taulia.slack.default.titlelink" . }}`}}'
text: '{{`{{ template "taulia.slack.default.text" . }}`}}'
icon_url: https://avatars3.githubusercontent.com/u/3380462
pretext: "{{`{{ .CommonAnnotations.summary }}`}}"
actions:
- type: "button"
text: "Snooze this alert"
url: '{{`{{ template "taulia.slack.default.buttonlink" . }}`}}'
style: "danger"
- type: "button"
text: "Grafana"
url: '{{`{{ template "taulia.slack.default.grafanaLink" . }}`}}'
style: "danger"
- name: slack-critical
slack_configs:
- send_resolved: true
channel: ttt-alerts-{{ .Values.jxRequirements.cluster.clusterName }}-critical
username: '{{`{{ template "taulia.slack.default.username" . }}`}}'
color: '{{`{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}`}}'
title: '{{`{{ template "taulia.slack.default.title" . }}`}}'
title_link: '{{`{{ template "taulia.slack.default.titlelink" . }}`}}'
text: '{{`{{ template "taulia.slack.default.text" . }}`}}'
icon_url: https://avatars3.githubusercontent.com/u/3380462
pretext: "{{`{{ .CommonAnnotations.summary }}`}}"
actions:
- type: "button"
text: "Snooze this alert"
url: '{{`{{ template "taulia.slack.default.buttonlink" . }}`}}'
style: "danger"
- type: "button"
text: "Grafana"
url: '{{`{{ template "taulia.slack.default.grafanaLink" . }}`}}'
style: "danger"
- name: victorops-operations
victorops_configs:
- api_key: <omitted for security>
routing_key: operations
- name: "null"
route:
receiver: slack
group_by: ['namespace']
# group_by: ["job"]
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
routes:
- receiver: "null"
matchers:
- alertname =~ "InfoInhibitor|Watchdog|CPUThrottlingHigh|KubeJobFailed"
- receiver: "victorops-operations"
matchers:
- page= "operations"
- receiver: "slack-critical"
matchers:
- namespace = {{ .Values.jxRequirements.cluster.clusterName }}
- severity =~ "warning|critical"
templates:
- "/etc/alertmanager/config/*.tmpl"
## Settings affecting alertmanagerSpec
## ref: https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#alertmanagerspec
##
alertmanagerSpec:
## Namespaces to be selected for AlertmanagerConfig discovery. If nil, only check own namespace.
##
alertmanagerConfigNamespaceSelector: {}
## Example which selects all namespaces
## with label "alertmanagerconfig" with values any of "example-namespace" or "example-namespace-2"
# alertmanagerConfigNamespaceSelector:
# matchExpressions:
# - key: alertmanagerconfig
# operator: In
# values:
# - example-namespace
# - example-namespace-2
#
## Size is the expected size of the alertmanager cluster. The controller will eventually make the size of the
## running cluster equal to the expected size.
replicas: 2
## Pod anti-affinity can prevent the scheduler from placing alertmanager replicas on the same node.
## The default value "soft" means that the scheduler should *prefer* to not schedule two replica pods onto the same node but no guarantee is provided.
## The value "hard" means that the scheduler is *required* to not schedule two replica pods onto the same node.
## The value "" will disable pod anti-affinity so that no anti-affinity rules will be configured.
##
podAntiAffinity: soft
## If anti-affinity is enabled sets the topologyKey to use for anti-affinity.
## This can be changed to, for example, failure-domain.beta.kubernetes.io/zone
##
podAntiAffinityTopologyKey: failure-domain.beta.kubernetes.io/zone
# tplConfig: false
## Alertmanager template files to format alerts
## ref: https://prometheus.io/docs/alerting/notifications/
## https://prometheus.io/docs/alerting/notification_examples/
##
templateFiles:
slack.tmpl: |-
{{`{{ define "__slack_text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }} - {{ printf "\x60" }}{{ .Labels.severity }}{{ printf "\x60" }}
*Description:* {{ .Annotations.description }}{{ end }}
{{ end }}
{{ define "taulia.slack.default.username" }}Prometheus AlertManager{{ end }}
{{ define "taulia.slack.default.title" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]{{ .GroupLabels.alertname}} in {{ .CommonLabels.environment }}{{end}}
{{ define "taulia.slack.default.text" }} {{ template "__slack_text" . }}{{end}}
{{ define "taulia.slack.default.buttonlink" }}http://alert-manager.{{ .CommonLabels.environment }}.tauliatrade.com/#/alerts?silenced=false&inhibited=false&filter=%7Balertname%3D%22{{ .GroupLabels.alertname}}%22%7D{{ end }}
{{ define "taulia.slack.default.grafanaLink" }}http://grafana.{{ .CommonLabels.environment }}.tauliatrade.com{{ end }}
{{ define "taulia.slack.default.titlelink" }}http://alert-manager.{{ .CommonLabels.environment }}.tauliatrade.com/#/alerts{{ end }}`}}
ingress:
enabled: true
hosts:
- alert-manager.{{ .Values.jxRequirements.ingress.domain }}
secret:
annotations:
## avoid converting alertmanager config secret to ES
secret.jenkins-x.io/convert-exclude: "true"
## Component scraping coreDns. Use either this or kubeDns. Disable on GKE
##
coreDns:
enabled: false
## Component scraping kubeDns. Use either this or coreDns
##
kubeDns:
enabled: true
## Component scraping kube scheduler. Disable on GKE
##
kubeScheduler:
enabled: false
## Component scraping the kube controller manager. Disable on GKE
##
kubeControllerManager:
enabled: false
grafana:
enabled: false
## Deploy node exporter as a daemonset to all nodes
##
nodeExporter:
enabled: true
prometheusOperator:
resources:
requests:
cpu: 100m
memory: 216Mi
limits:
cpu: 300m
memory: 320Mi
## Deploy a Prometheus instance
##
prometheus:
serviceAccount:
annotations:
iam.gke.io/gcp-service-account: thanos-{{ .Values.jxRequirements.cluster.clusterName }}@{{ .Values.jxRequirements.cluster.project }}.iam.gserviceaccount.com
prometheusSpec:
## avoid converting alertmanager config secret to ES
additionalPrometheusSecretsAnnotations:
secret.jenkins-x.io/convert-exclude: "true"
additionalScrapeConfigs:
{{- range $index,$item := readDirEntries "files/prometheus-scrape-configs/" }}
{{- if $item.IsDir -}}
{{- $item.Name -}}
{{- end -}}
{{- end }}
externalLabels:
environment: {{ .Values.jxRequirements.cluster.clusterName }}
resources:
requests:
cpu: 300m
memory: 256Mi
limits:
cpu: 2
storageSpec:
## Use PersistentVolumeClaim to persist data between restarts
##
volumeClaimTemplate:
spec:
storageClassName: standard
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
## Inject thanos sidecar into prometheus pod
##
thanos:
baseImage: us-docker.pkg.dev/devtools-1/taulia-docker/bitnami/thanos
version: 0.28.1-scratch-r0
objectStorageConfig:
name: thanos-objstore-secret
key: objstore.yml
# externalLabels:
# cluster: thanos-operator-test
## If true, a nil or {} value for prometheus.prometheusSpec.ruleSelector will cause the
## prometheus resource to be created with selectors based on values in the helm deployment,
## which will also match the PrometheusRule resources created
##
ruleSelectorNilUsesHelmValues: false
## If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector will cause the
## prometheus resource to be created with selectors based on values in the helm deployment,
## which will also match the servicemonitors created
##
serviceMonitorSelectorNilUsesHelmValues: false
## If true, a nil or {} value for prometheus.prometheusSpec.podMonitorSelector will cause the
## prometheus resource to be created with selectors based on values in the helm deployment,
## which will also match the podmonitors created
##
podMonitorSelectorNilUsesHelmValues: false
## If true, a nil or {} value for prometheus.prometheusSpec.probeSelector will cause the
## prometheus resource to be created with selectors based on values in the helm deployment,
## which will also match the probes created
##
probeSelectorNilUsesHelmValues: false
## Number of replicas of each shard to deploy for a Prometheus deployment.
## Number of replicas multiplied by shards is the total number of Pods created.
##
replicas: 1
## Pod anti-affinity can prevent the scheduler from placing Prometheus replicas on the same node.
## The default value "soft" means that the scheduler should *prefer* to not schedule two replica pods onto the same node but no guarantee is provided.
## The value "hard" means that the scheduler is *required* to not schedule two replica pods onto the same node.
## The value "" will disable pod anti-affinity so that no anti-affinity rules will be configured.
podAntiAffinity: soft
## If anti-affinity is enabled sets the topologyKey to use for anti-affinity.
## This can be changed to, for example, failure-domain.beta.kubernetes.io/zone
##
podAntiAffinityTopologyKey: failure-domain.beta.kubernetes.io/zone
thanosService:
enabled: true
annotations: {}
labels: {}
thanosServiceMonitor:
enabled: true
interval: ""
Enter the command that you execute and failing/misfunctioning.
We install via the Jenkins X boot job, which runs helmfile template and then kubectl apply on the rendered manifests.
Anything else we need to know?
We have disabled these three alerts by adding this to values.yaml for now:
defaultRules:
disabled:
KubeProxyDown: true
KubeAPIDown: true
KubeletDown: true
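For KubeProxyDown specifically, an alternative to the disabled entries above (a sketch only, assuming the kubeProxy.enabled and defaultRules.rules.kubeProxy toggles behave as documented in this chart version) would be to turn off the kube-proxy scrape component and its rule group entirely, since GKE's kube-proxy does not expose a scrapeable metrics endpoint by default and does not run at all on Dataplane V2 clusters:
kubeProxy:
  enabled: false   # skip the kube-proxy service/ServiceMonitor (assumed toggle in this chart version)
defaultRules:
  rules:
    kubeProxy: false   # skip the kube-proxy default rule group (assumed toggle in this chart version)
KubeAPIDown and KubeletDown presumably should still work on GKE, so for those we would rather understand why the apiserver and kubelet targets stopped reporting up than disable the alerts permanently.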