OBSDOCS-1327: Improve troubleshooting monitoring issues: new section troubleshooting alertmanager configurations #92246

Open: wants to merge 1 commit into base: main

eromanova97 (Contributor) commented Apr 16, 2025

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Apr 16, 2025

openshift-ci-robot commented Apr 16, 2025

@eromanova97: This pull request references OBSDOCS-1327 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Version(s): 4.12 and later

Issue: https://issues.redhat.com/browse/OBSDOCS-1327

Link to docs preview:

QE review:

  • QE has approved this change.

Additional information:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Apr 16, 2025

openshift-ci bot commented Apr 16, 2025

@eromanova97: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot commented Apr 16, 2025

@eromanova97: This pull request references OBSDOCS-1327 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Version(s): 4.12 and later

Issue: https://issues.redhat.com/browse/OBSDOCS-1327

Link to docs preview: https://92246--ocpdocs-pr.netlify.app/openshift-enterprise/latest/observability/monitoring/troubleshooting-monitoring-issues.html#troubleshooting-alertmanager-configurations_troubleshooting-monitoring-issues

QE review:

  • QE has approved this change.

Additional information:


juzhao commented Apr 17, 2025

LGTM, waiting for others to review

----
+
.Example output
[source,terminal]

Could we make this a collapsible block, collapsed by default?

[id="troubleshooting-alertmanager-configurations_{context}"]
= Troubleshooting Alertmanager configuration

If your Alertmanager configuration does not work properly, you can compare the `alertmanager-main` secret with the running Alertmanager configuration to identify possible errors. You can also test your alert routing configuration by creating a test alert.

I'd like to learn more about those errors. Are we talking about cases where the user breaks the config and Alertmanager cannot or does not load it?

@juzhao juzhao Apr 18, 2025

See https://github.com/openshift/openshift-docs/pull/92246/files#r2050261250 (email_configs with a smarthost that is missing the port). In that case, the AlertmanagerFailedReload alert fires.
If smarthost is set to an unreachable value, for example:

receivers:
  - name: 'web.hook'
    email_configs:
    - to: ***
      from: ***
      smarthost: 'smtp.non-exist.com:25'

then AlertmanagerFailedToSendAlerts fires, but AlertmanagerFailedReload does not:

# token=`oc create token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=count (ALERTS{alertname=~"AlertmanagerFailedReload|AlertmanagerFailedToSendAlerts"}) by (alertname)' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "alertname": "AlertmanagerFailedToSendAlerts"
        },
        "value": [
          1744962937.043,
          "2"
        ]
      }
    ],
    "analysis": {}
  }
}

The Alertmanager pod logs show this error:

time=2025-04-18T07:46:52.334Z level=ERROR source=dispatch.go:360 msg="Notify for alerts failed" component=dispatcher num_alerts=1 err="web.hook/email[0]: notify retry canceled after 7 attempts: establish connection to server: dial tcp: lookup smtp.non-exist.com on 172.30.0.10:53: no such host"

I think we could mention AlertmanagerFailedToSendAlerts and AlertmanagerFailedReload, or just tell users to check for any alerts related to Alertmanager.
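
To check for these alerts without reading the raw response by eye, the query output can be summarized with a small script. This is only a minimal sketch: it assumes the JSON returned by the Thanos Querier call has already been captured as a string, and `firing_alert_counts` is a hypothetical helper name, not part of any OpenShift or Prometheus tooling. The sample string is a trimmed copy of the response shown in this comment.

```python
import json


def firing_alert_counts(response_text):
    """Map each alertname in a Prometheus instant-query response to its count.

    Hypothetical helper for illustration; expects the JSON shape printed by
    the /api/v1/query endpoint for a `count(...) by (alertname)` query.
    """
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    counts = {}
    for sample in body["data"]["result"]:
        name = sample["metric"].get("alertname", "<none>")
        # Each sample value is [unix_timestamp, "value-as-string"].
        counts[name] = int(float(sample["value"][1]))
    return counts


# Trimmed copy of the response shown in the comment above.
sample = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"alertname":"AlertmanagerFailedToSendAlerts"},'
    '"value":[1744962937.043,"2"]}]}}'
)
print(firing_alert_counts(sample))
```

An empty dictionary would mean that neither alert is present in the query result.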

<1> The example shows the route to the `team-frontend-page` receiver. Alertmanager routes alerts with `service="example-app"` and `severity="critical"` labels to this receiver.
<2> The `team-frontend-page` receiver configuration. The example shows PagerDuty as a receiver.

.. Compare the contents of the `route` and `receiver` fields of the `alertmanager.yaml` file with the fields in the running Alertmanager configuration. Look for any discrepancies.

Comparing could be tedious; maybe we should think about a diff command or something similar.
If we only suspect that Alertmanager (AM) is not picking up the secret, maybe checking a log somewhere after changing the secret is sufficient?
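
One way the diff idea could look, as a rough sketch only: it assumes the secret's `alertmanager.yaml` and the running configuration have both already been saved as text (how they are retrieved is out of scope here), and `config_diff` is a hypothetical helper name. The two sample strings are made up for illustration. A plain text diff may also report differences in formatting or ordering rather than meaning, so the output would still need a human look.

```python
import difflib


def config_diff(secret_yaml, running_yaml):
    """Return unified-diff lines between two configuration texts.

    Hypothetical helper: a line-based text comparison, not a semantic
    YAML comparison.
    """
    return list(
        difflib.unified_diff(
            secret_yaml.splitlines(),
            running_yaml.splitlines(),
            fromfile="secret/alertmanager.yaml",
            tofile="running-config",
            lineterm="",
        )
    )


# Made-up sample inputs for illustration only.
secret = "route:\n  receiver: default\nreceivers:\n- name: default\n"
running = "route:\n  receiver: team-frontend-page\nreceivers:\n- name: default\n"

for line in config_diff(secret, running):
    print(line)
```

An empty result means the two texts are identical line for line.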

----
<1> The routing configuration from the `example-routing` `AlertmanagerConfig` object in the `ns1` project for the `UWM-receiver` receiver.

. Check Alertmanager pod logs to see if there are any errors:

Maybe we can be more precise about the errors? For example, some keywords that they would probably contain.

@juzhao juzhao Apr 18, 2025

We don't need to be too explicit, since there are too many possible conditions. For example, with the email_configs below, where smarthost is missing the port:

receivers:
  - name: 'web.hook'
    email_configs:
    - to: ****
      from: ****
      smarthost: 'smtp.gmail.com'
      require_tls: false
      auth_username: ****
      auth_password: ****

you will see this error in the Alertmanager pod logs:

$ oc -n openshift-monitoring logs -c alertmanager alertmanager-main-0
time=2025-04-18T07:28:00.274Z level=ERROR source=coordinator.go:117 msg="Loading configuration file failed" component=configuration file=/etc/alertmanager/config_out/alertmanager.env.yaml err="address smtp.gmail.com: missing port in address"

Checking the logs for errors is fine.
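
Rather than listing keywords, a generic filter on the log level may be enough. The sketch below assumes the logs use the `key=value` format seen in the lines quoted in this thread and that they have been saved as text; `parse_logfmt` and `error_entries` are hypothetical helper names. The INFO line in the sample is made up for illustration; the ERROR line is the one quoted above.

```python
import re

# key=value pairs where the value is either a double-quoted string or a bare token.
_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Parse one key=value log line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, value in _FIELD.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def error_entries(log_text):
    """Return (msg, err) pairs for every line logged at level ERROR."""
    entries = []
    for line in log_text.splitlines():
        fields = parse_logfmt(line)
        if fields.get("level", "").upper() == "ERROR":
            entries.append((fields.get("msg", ""), fields.get("err", "")))
    return entries


logs = "\n".join([
    # Made-up non-error line, for illustration only.
    'time=2025-04-18T07:27:59.000Z level=INFO msg="made-up informational line"',
    # Error line quoted in the comment above.
    'time=2025-04-18T07:28:00.274Z level=ERROR source=coordinator.go:117 '
    'msg="Loading configuration file failed" component=configuration '
    'file=/etc/alertmanager/config_out/alertmanager.env.yaml '
    'err="address smtp.gmail.com: missing port in address"',
])

for msg, err in error_entries(logs):
    print(f"{msg}: {err}")
```

Filtering on the level avoids having to enumerate every possible error message.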

@machine424 (Contributor)

Thanks for this, Eliska.
I have some questions and suggestions. I think I'll need to tag Nigel as well.

Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants