OBSDOCS-1327: Improve troubleshooting monitoring issues: new section troubleshooting alertmanager configurations #92246

Open: wants to merge 1 commit into main
188 changes: 188 additions & 0 deletions modules/monitoring-troubleshooting-alertmanager-configurations.adoc
@@ -0,0 +1,188 @@
// Module included in the following assemblies:
//
// * monitoring/troubleshooting-monitoring-issues.adoc

:_mod-docs-content-type: PROCEDURE
[id="troubleshooting-alertmanager-configurations_{context}"]
= Troubleshooting Alertmanager configurations

If your Alertmanager configuration does not work properly, you can compare the `alertmanager-main` secret with the running Alertmanager configuration to identify possible errors. You can also test your alert routing configuration by creating a test alert.
Review comment (Contributor):

I'd like to learn more about those errors. Are we talking about cases where the user breaks the config and Alertmanager cannot or does not load it?

Review comment (@juzhao, Apr 18, 2025):

See https://github.com/openshift/openshift-docs/pull/92246/files#r2050261250: in `email_configs`, if `smarthost` is missing the port, the `AlertmanagerFailedReload` alert is fired. If `smarthost` is set to an unreachable value, for example:

receivers:
  - name: 'web.hook'
    email_configs:
    - to: ***
      from: ***
      smarthost: 'smtp.non-exist.com:25'

then `AlertmanagerFailedToSendAlerts` is fired, but `AlertmanagerFailedReload` is not:

# token=`oc create token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=count (ALERTS{alertname=~"AlertmanagerFailedReload|AlertmanagerFailedToSendAlerts"}) by (alertname)' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "alertname": "AlertmanagerFailedToSendAlerts"
        },
        "value": [
          1744962937.043,
          "2"
        ]
      }
    ],
    "analysis": {}
  }
}

You also see an error in the Alertmanager pod logs:

time=2025-04-18T07:46:52.334Z level=ERROR source=dispatch.go:360 msg="Notify for alerts failed" component=dispatcher num_alerts=1 err="web.hook/email[0]: notify retry canceled after 7 attempts: establish connection to server: dial tcp: lookup smtp.non-exist.com on 172.30.0.10:53: no such host"

I think we could mention `AlertmanagerFailedToSendAlerts` and `AlertmanagerFailedReload`, or just mention checking for any alerts related to Alertmanager.
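
A minimal sketch of such a check, reusing the thanos-querier query from above; the `Alertmanager.*` name pattern is an assumption and can be adjusted:

# token=`oc create token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname=~"Alertmanager.*", alertstate="firing"}' | jq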


.Prerequisites

* You have access to the cluster as a user with the `cluster-admin` cluster role.
* You have installed the {oc-first}.

.Procedure

. Compare the `alertmanager-main` secret with the running Alertmanager configuration:

.. Extract the Alertmanager configuration from the `alertmanager-main` secret into the `alertmanager.yaml` file:
+
[source,terminal]
----
$ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' | base64 --decode > alertmanager.yaml
----

.. Pull the running Alertmanager configuration from the API:
+
[source,terminal]
----
$ oc exec -n openshift-monitoring alertmanager-main-0 -- amtool config show --alertmanager.url http://localhost:9093
----
+
.Example output
[source,terminal]
Review comment (Contributor):

Could we have this example as a collapsible block, collapsed by default?

----
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
    enable_http2: true
    proxy_from_environment: true
...
route:
  receiver: default
  group_by:
  - namespace
  continue: false
  routes:
  ...
  - matchers: # <1>
    - service="example-app"
    continue: false
    routes:
    - receiver: team-frontend-page
      matchers:
      - severity="critical"
      continue: false
...
receivers:
...
- name: team-frontend-page # <2>
  pagerduty_configs:
  - send_resolved: true
    http_config:
      authorization:
        type: Bearer
        credentials: <secret>
      follow_redirects: true
      enable_http2: true
      proxy_from_environment: true
    service_key: <secret>
    url: https://events.pagerduty.com/v2/enqueue
...
templates: []
----
<1> The example shows the route to the `team-frontend-page` receiver. Alertmanager routes alerts with `service="example-app"` and `severity="critical"` labels to this receiver.
<2> The `team-frontend-page` receiver configuration. The example shows PagerDuty as a receiver.
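+
For comparison, the corresponding fragment in the extracted `alertmanager.yaml` file might look like the following. This is a sketch only; the PagerDuty `service_key` value is a placeholder and your configuration might differ:
+
[source,yaml]
----
route:
  routes:
  - matchers:
    - service="example-app"
    routes:
    - receiver: team-frontend-page
      matchers:
      - severity="critical"
receivers:
- name: team-frontend-page
  pagerduty_configs:
  - service_key: <your_pagerduty_service_key>
----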

.. Compare the contents of the `route` and `receiver` fields of the `alertmanager.yaml` file with the fields in the running Alertmanager configuration. Look for any discrepancies.
Review comment (Contributor):

Comparing manually could be tedious; maybe we should think about a diff command or something similar. If we only suspect that Alertmanager is not picking up the secret, maybe checking a log somewhere after changing the secret is sufficient?
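
A possible diff-based approach along those lines (a sketch only; `amtool config show` expands defaults and masks secrets, so some benign differences are expected):

$ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' | base64 --decode > alertmanager.yaml
$ oc exec -n openshift-monitoring alertmanager-main-0 -- amtool config show --alertmanager.url http://localhost:9093 > alertmanager-running.yaml
$ diff -u alertmanager.yaml alertmanager-running.yaml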


.. If you used an `AlertmanagerConfig` object to configure alert routing for user-defined projects, you can use the `alertmanager.yaml` file to see the configuration before the `AlertmanagerConfig` object was applied. The running Alertmanager configuration shows the changes after the object was applied:
+
.Example running configuration with AlertmanagerConfig applied
[source,terminal]
----
...
route:
  ...
  routes:
  - receiver: ns1/example-routing/UWM-receiver <1>
    group_by:
    - job
    matchers:
    - namespace="ns1"
    continue: true
...
receivers:
...
- name: ns1/example-routing/UWM-receiver <1>
  webhook_configs:
  - send_resolved: true
    http_config:
      follow_redirects: true
      enable_http2: true
      proxy_from_environment: true
    url: <secret>
    url_file: ""
    max_alerts: 0
    timeout: 0s
templates: []
----
<1> The routing configuration from the `example-routing` `AlertmanagerConfig` object in the `ns1` project for the `UWM-receiver` receiver.
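+
For reference, an `AlertmanagerConfig` object of roughly the following shape could produce the running configuration shown above. This is a sketch only; the `v1beta1` API version and the webhook URL are assumptions:
+
[source,yaml]
----
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: example-routing
  namespace: ns1
spec:
  route:
    receiver: UWM-receiver
    groupBy:
    - job
  receivers:
  - name: UWM-receiver
    webhookConfigs:
    - url: "https://example.com/notify" # hypothetical webhook endpoint
----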

. Check Alertmanager pod logs to see if there are any errors:
Review comment (Contributor):

Maybe we can be more precise about the errors, for example some keywords that they would probably contain?

Review comment (@juzhao, Apr 18, 2025):

We don't need to be too explicit, since there are too many possible conditions. For example, with the following `email_configs` where `smarthost` is missing the port:

receivers:
  - name: 'web.hook'
    email_configs:
    - to: ****
      from: ****
      smarthost: 'smtp.gmail.com'
      require_tls: false
      auth_username: ****
      auth_password: ****

you will see this error in the Alertmanager pod logs:

$ oc -n openshift-monitoring logs -c alertmanager alertmanager-main-0
time=2025-04-18T07:28:00.274Z level=ERROR source=coordinator.go:117 msg="Loading configuration file failed" component=configuration file=/etc/alertmanager/config_out/alertmanager.env.yaml err="address smtp.gmail.com: missing port in address"

So just telling users to check for the error logs is fine.

+
[source,terminal]
----
$ oc -n openshift-monitoring logs -c alertmanager <alertmanager_pod>
----
+
[NOTE]
====
For multi-node clusters, ensure that you check all Alertmanager pods and their logs.
====
+
.Example command
[source,terminal]
----
$ oc -n openshift-monitoring logs -c alertmanager alertmanager-main-0
----

. Verify that your receiver is configured correctly by creating a test alert.

.. Get a list of the configured routes:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool config routes show --alertmanager.url http://localhost:9093
----
+
.Example output
[source,terminal]
----
Routing tree:
.
└── default-route receiver: default
    ├── {alertname="Watchdog"} receiver: Watchdog
    └── {service="example-app"} receiver: default
        └── {severity="critical"} receiver: team-frontend-page
----

.. Print the route to your chosen receiver. The following example shows the receiver used for alerts with `service=example-app` and `severity=critical` matchers.
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool config routes test service=example-app severity=critical --alertmanager.url http://localhost:9093
----
+
.Example output
[source,terminal]
----
team-frontend-page
----

.. Create a test alert and add it to the Alertmanager. The following example creates an alert with `service=example-app` and `severity=critical` to test the `team-frontend-page` receiver:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert add --alertmanager.url http://localhost:9093 alertname=myalarm --start="2025-03-31T00:00:00-00:00" service=example-app severity=critical --annotation="summary=\"This is a test alert with a custom summary\""
----

.. Verify that the alert was generated:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert --alertmanager.url http://localhost:9093
----
+
.Example output
[source,terminal]
----
Alertname  Starts At                Summary                                                                                    State
myalarm    2025-03-31 00:00:00 UTC  This is a test alert with a custom summary                                                 active
Watchdog   2025-04-07 10:07:16 UTC  An alert that should always be firing to certify that Alertmanager is working properly.   active
----

.. Verify that the receiver was notified with the `myalarm` alert.
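+
When you are finished testing, one way to clean up is to re-add the test alert with an end time in the past so that Alertmanager marks it as resolved. This is a sketch only; it assumes that your version of `amtool alert add` supports the `--end` flag:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert add --alertmanager.url http://localhost:9093 alertname=myalarm --start="2025-03-31T00:00:00-00:00" --end="2025-03-31T00:01:00-00:00" service=example-app severity=critical
----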
@@ -57,3 +57,8 @@ include::modules/monitoring-resolving-the-alertmanagerreceiversnotconfigured-ale
* xref:../../observability/monitoring/configuring-core-platform-monitoring/configuring-alerts-and-notifications.adoc#configuring-alert-notifications_configuring-alerts-and-notifications[Configuring alert notifications for default platform monitoring]
* xref:../../observability/monitoring/configuring-user-workload-monitoring/configuring-alerts-and-notifications-uwm.adoc#configuring-alert-notifications_configuring-alerts-and-notifications-uwm[Configuring alert notifications for user workload monitoring]
endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]

ifndef::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
//Troubleshooting Alertmanager configurations
include::modules/monitoring-troubleshooting-alertmanager-configurations.adoc[leveloffset=+1]
endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]