
OBSDOCS-1327: Improve troubleshooting monitoring issues: new section troubleshooting alertmanager configurations #92246


Closed
eromanova97 wants to merge 1 commit.

Conversation

@eromanova97 (Contributor) commented Apr 16, 2025

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 16, 2025

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 16, 2025
openshift-ci bot commented Apr 16, 2025

@eromanova97: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


openshift-ci-robot commented Apr 16, 2025

@eromanova97: This pull request references OBSDOCS-1327 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Version(s): 4.12 and later

Issue: https://issues.redhat.com/browse/OBSDOCS-1327

Link to docs preview: https://92246--ocpdocs-pr.netlify.app/openshift-enterprise/latest/observability/monitoring/troubleshooting-monitoring-issues.html#troubleshooting-alertmanager-configurations_troubleshooting-monitoring-issues

QE review:

  • QE has approved this change.

Additional information:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


juzhao commented Apr 17, 2025

LGTM, waiting for others to review


[id="troubleshooting-alertmanager-configurations_{context}"]
= Troubleshooting Alertmanager configuration

If your Alertmanager configuration does not work properly, you can compare the `alertmanager-main` secret with the running Alertmanager configuration to identify possible errors. You can also test your alert routing configuration by creating a test alert.
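The comparison described above can be sketched as follows. This is a minimal local sketch: the `oc` and `amtool` invocations in the comments are assumptions and have not been run against a live cluster, and the two stand-in files below simulate a mismatch so the diff step can be demonstrated anywhere.

```shell
# Sketch: compare the saved Alertmanager configuration with the running one.
# On a cluster you could dump both configurations first (assumed commands):
#   oc -n openshift-monitoring get secret alertmanager-main \
#     --template='{{ index .data "alertmanager.yaml" }}' | base64 -d > saved.yaml
#   oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
#     amtool config show --alertmanager.url=http://localhost:9093 > running.yaml
# The two hand-written files below stand in for those dumps and contain a
# deliberate mismatch (missing smarthost port) so the diff has something to find.
cat > saved.yaml <<'EOF'
receivers:
- name: web.hook
  email_configs:
  - to: team@example.com
    smarthost: smtp.example.com:25
EOF
cat > running.yaml <<'EOF'
receivers:
- name: web.hook
  email_configs:
  - to: team@example.com
    smarthost: smtp.example.com
EOF
# diff exits non-zero when the files differ, so capture the output explicitly
diff -u saved.yaml running.yaml > config.diff || true
cat config.diff
```

Any line pair prefixed with `-`/`+` in the diff marks a field where the running configuration has drifted from the secret.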
Contributor commented:
I'd like to learn more about those errors. Are we talking about cases where the user breaks the config and Alertmanager cannot or doesn't load it?

@juzhao commented Apr 18, 2025

See https://github.com/openshift/openshift-docs/pull/92246/files#r2050261250: with email_configs, if smarthost is missing the port, the AlertmanagerFailedReload alert would be fired. If smarthost is set to an unreachable value, for example:

receivers:
  - name: 'web.hook'
    email_configs:
    - to: ***
      from: ***
      smarthost: 'smtp.non-exist.com:25'

then AlertmanagerFailedToSendAlerts would be fired, but AlertmanagerFailedReload would not be fired.

# token=`oc create token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=count (ALERTS{alertname=~"AlertmanagerFailedReload|AlertmanagerFailedToSendAlerts"}) by (alertname)' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "alertname": "AlertmanagerFailedToSendAlerts"
        },
        "value": [
          1744962937.043,
          "2"
        ]
      }
    ],
    "analysis": {}
  }
}

Error in the Alertmanager pod logs:

time=2025-04-18T07:46:52.334Z level=ERROR source=dispatch.go:360 msg="Notify for alerts failed" component=dispatcher num_alerts=1 err="web.hook/email[0]: notify retry canceled after 7 attempts: establish connection to server: dial tcp: lookup smtp.non-exist.com on 172.30.0.10:53: no such host"

I think we could mention AlertmanagerFailedToSendAlerts and AlertmanagerFailedReload, or just suggest checking any alerts related to Alertmanager.
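The Thanos Querier response shown above can be reduced to a per-alert summary. The sketch below is a local simulation: the JSON is a hand-written stand-in mirroring the sample output, not live query data, and on a cluster it would come from the curl query against thanos-querier.

```shell
# Local sketch: summarize which Alertmanager-related alerts are firing
# from a Thanos Querier instant-query response.
cat > response.json <<'EOF'
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": { "alertname": "AlertmanagerFailedToSendAlerts" },
        "value": [ 1744962937.043, "2" ]
      }
    ]
  }
}
EOF
# Print "<alertname> <count>" for each firing alert group
jq -r '.data.result[] | "\(.metric.alertname) \(.value[1])"' response.json
```

An empty result here means neither alert is firing, which points the investigation elsewhere.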

@eromanova97 (author) commented:
@machine424 @juzhao I could add step 1 like this. WDYT?

. Check active alerts related to Alertmanager:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert --alertmanager.url http://localhost:9093
----
+
.Example output
[source,terminal]
----
Alertname                           Starts At                Summary                                                                                  State   
Watchdog                            2025-04-28 08:01:41 UTC  An alert that should always be firing to certify that Alertmanager is working properly.  active  
AlertmanagerFailedToSendAlerts      2025-04-28 08:11:54 UTC  An Alertmanager instance failed to send notifications.                             active <1>
----
<1> Look for alerts that indicate an issue with Alertmanager, such as `AlertmanagerFailedToSendAlerts` or `AlertmanagerFailedReload`.

.. If you identified an alert related to Alertmanager, list the alert's runbook URL:
+
.Example command
[source,terminal]
----
$ oc get prometheusrules -n openshift-monitoring -o yaml | grep 'AlertmanagerFailedToSendAlerts' | grep 'runbook_url'
----
+
.Example output
[source,terminal]
----
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
----

.. Open the runbook URL and follow the instructions described in the runbook.

@eromanova97 (author) commented:
I am not sure if this is an overkill or not 😄 but we do not really mention the existence of runbooks anywhere in monitoring docs, so maybe having this here could be useful.

Or, I could plan to have something like the virtualization team has:
https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/virtualization/monitoring#virt-runbooks

Contributor commented:
Yes, the alerts should have links to their runbooks.
I suggest we enrich runbooks as users will mainly be notified about these issues via alerts.
Also that will avoid having to duplicate and maintain the troubleshooting in two different places.
(we could consider/discuss having links to the runbooks in the docs, yes)

Let's mention in the AlertmanagerFailedReload runbook that one should look for logs with "Loading configuration file failed" (as shown here https://github.com/openshift/openshift-docs/pull/92246/files#r2050261250), something like:

$ NAMESPACE='<value of namespace label from alert>'

$ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=alertmanager' --tail=-1 | \
grep 'Loading configuration file failed.*' \
| sort | uniq -c | sort -n
time=2025-04-18T07:28:00.274Z level=ERROR source=coordinator.go:117 msg="Loading configuration file failed" component=configuration file=/etc/alertmanager/config_out/alertmanager.env.yaml err="address smtp.gmail.com: missing port in address"

As we do in other runbooks, e.g. in https://github.com/openshift/runbooks/blob/f31c57f491b68b07ad6d1a39d45189bd780be8a7/alerts/cluster-monitoring-operator/PrometheusRuleFailures.md.
Note that the alert could be triggered for the platform or the UWM Alertmanager.
We can say that the err field should help locate the issue.
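The log scan suggested above can be sketched locally. The log lines below are hand-written stand-ins mirroring the error shown earlier; on a cluster they would come from `oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=alertmanager' --tail=-1`.

```shell
# Local sketch: count distinct "Loading configuration file failed" errors.
cat > alertmanager.log <<'EOF'
time=2025-04-18T07:28:00.274Z level=ERROR source=coordinator.go:117 msg="Loading configuration file failed" component=configuration err="address smtp.gmail.com: missing port in address"
time=2025-04-18T07:28:05.101Z level=ERROR source=coordinator.go:117 msg="Loading configuration file failed" component=configuration err="address smtp.gmail.com: missing port in address"
time=2025-04-18T07:28:10.330Z level=INFO source=coordinator.go:113 msg="Completed loading of configuration file" component=configuration
EOF
# Keep only the err= field so identical failures collapse into one line;
# uniq -c then shows how often each distinct error recurs
grep 'Loading configuration file failed' alertmanager.log \
  | sed 's/.*err=/err=/' | sort | uniq -c | sort -n
```

Sorting by count surfaces the most frequent failure first; the err field then points at the broken setting.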

Contributor commented:

We can do the same for the AlertmanagerFailedToSendAlerts runbook with a grep on "Notify for alerts failed", plus add the guide for how to send a test alert to a receiver, as that could help reproduce the issue and assist with diagnostics.

<1> The example shows the route to the `team-frontend-page` receiver. Alertmanager routes alerts with `service="example-app"` and `severity="critical"` labels to this receiver.
<2> The `team-frontend-page` receiver configuration. The example shows PagerDuty as a receiver.

.. Compare the contents of the `route` and `receiver` fields of the `alertmanager.yaml` file with the fields in the running Alertmanager configuration. Look for any discrepancies.
Contributor commented:
Comparing could be tedious; maybe we should think about a diff command or something.
If we only suspect that the secret is not being taken into account by Alertmanager, maybe checking a log somewhere after changing the secret is sufficient?

@eromanova97 (author) commented:
@machine424 Note: I asked the team during the office hours, and there does not seem to be a diff command that comes to mind right away, so if we do not come up with something, I will consider this as a possible improvement that can be added later 👍

@machine424 (Contributor) commented:
Thanks for this, Eliska.
I have some questions/suggestions; I'll need to tag Nigel as well, I think.

@eromanova97 (author) commented:

After the discussions, I will be closing the PR as the enhancements will be done in other parts of our docs experience.

@eromanova97 eromanova97 closed this May 9, 2025
@eromanova97 eromanova97 deleted the OBSDOCS-1327 branch May 12, 2025 08:33
5 participants