Troubleshooting Alertmanager configuration

If your Alertmanager configuration does not work properly, you can compare the alertmanager-main secret with the running Alertmanager configuration to identify possible errors. You can also test your alert routing configuration by creating a test alert.

Prerequisites
  • You have access to the cluster as a user with the cluster-admin cluster role.

  • You have installed the OpenShift CLI (oc).

Procedure
  1. Compare the alertmanager-main secret with the running Alertmanager configuration:

    1. Extract the Alertmanager configuration from the alertmanager-main secret into the alertmanager.yaml file:

      $ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' | base64 --decode > alertmanager.yaml
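      Optionally, if the amtool binary is installed on your workstation, you can run a quick syntax check on the extracted file before comparing it with the running configuration. This is an extra sanity check, not a required step:

      $ amtool check-config alertmanager.yaml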
    2. Pull the running Alertmanager configuration from the API:

      $ oc exec -n openshift-monitoring alertmanager-main-0 -- amtool config show --alertmanager.url http://localhost:9093
      Example output
      global:
        resolve_timeout: 5m
        http_config:
          follow_redirects: true
          enable_http2: true
          proxy_from_environment: true
      ...
      route:
        receiver: default
        group_by:
        - namespace
        continue: false
        routes:
        ...
        - matchers: # (1)
          - service="example-app"
          continue: false
          routes:
          - receiver: team-frontend-page
            matchers:
            - severity="critical"
            continue: false
        ...
      receivers:
      ...
      - name: team-frontend-page # (2)
        pagerduty_configs:
        - send_resolved: true
          http_config:
            authorization:
              type: Bearer
              credentials: <secret>
            follow_redirects: true
            enable_http2: true
            proxy_from_environment: true
          service_key: <secret>
          url: https://events.pagerduty.com/v2/enqueue
          ...
      templates: []
      1. The example shows the route to the team-frontend-page receiver. Alertmanager routes alerts with service="example-app" and severity="critical" labels to this receiver.

      2. The team-frontend-page receiver configuration. The example shows PagerDuty as a receiver.

    3. Compare the contents of the route and receiver fields of the alertmanager.yaml file with the fields in the running Alertmanager configuration. Look for any discrepancies.
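      One way to spot discrepancies is to save the running configuration to a local file, for example running-alertmanager.yaml, and diff it against the extracted secret. This is a minimal sketch; because amtool prints the parsed configuration with default values filled in, expect some differences that are not errors, and focus on the route and receivers sections:

      $ oc exec -n openshift-monitoring alertmanager-main-0 -- amtool config show --alertmanager.url http://localhost:9093 > running-alertmanager.yaml
      $ diff -u alertmanager.yaml running-alertmanager.yaml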

    4. If you used an AlertmanagerConfig object to configure alert routing for user-defined projects, you can use the alertmanager.yaml file to see the configuration before the AlertmanagerConfig object was applied. The running Alertmanager configuration shows the changes after the object was applied:

      Example running configuration with AlertmanagerConfig applied
      ...
      route:
        ...
        routes:
        - receiver: ns1/example-routing/UWM-receiver (1)
          group_by:
          - job
          matchers:
          - namespace="ns1"
          continue: true
        ...
      receivers:
      ...
      - name: ns1/example-routing/UWM-receiver (1)
        webhook_configs:
        - send_resolved: true
          http_config:
            follow_redirects: true
            enable_http2: true
            proxy_from_environment: true
          url: <secret>
          url_file: ""
          max_alerts: 0
          timeout: 0s
      templates: []
      1. The routing configuration from the example-routing AlertmanagerConfig object in the ns1 project for the UWM-receiver receiver.
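      For reference, routing such as the example above could be produced by an AlertmanagerConfig object similar to the following minimal sketch. The monitoring.coreos.com/v1beta1 API version and the webhook URL are assumptions for illustration; the operator automatically prefixes the receiver name with the <namespace>/<object_name>/ string and adds the namespace matcher:

      apiVersion: monitoring.coreos.com/v1beta1
      kind: AlertmanagerConfig
      metadata:
        name: example-routing
        namespace: ns1
      spec:
        route:
          receiver: UWM-receiver
          groupBy:
          - job
        receivers:
        - name: UWM-receiver
          webhookConfigs:
          - url: https://example.org/alert-webhook # hypothetical webhook endpoint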

  2. Check Alertmanager pod logs to see if there are any errors:

    $ oc -n openshift-monitoring logs -c alertmanager <alertmanager_pod>
    Note

    For multi-node clusters, ensure that you check all Alertmanager pods and their logs.

    Example command
    $ oc -n openshift-monitoring logs -c alertmanager alertmanager-main-0
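    To list all Alertmanager pods in the namespace, you can filter by label. This sketch assumes the app.kubernetes.io/name=alertmanager label that the Prometheus Operator sets on Alertmanager pods:

    $ oc -n openshift-monitoring get pods -l app.kubernetes.io/name=alertmanager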
  3. Verify that your receiver is configured correctly by creating a test alert.

    1. Get a list of the configured routes:

      $ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool config routes show --alertmanager.url http://localhost:9093
      Example output
      Routing tree:
      .
      └── default-route  receiver: default
          ├── {alertname="Watchdog"}  receiver: Watchdog
          └── {service="example-app"}  receiver: default
              └── {severity="critical"}  receiver: team-frontend-page
    2. Print the route to your chosen receiver. The following example shows the receiver used for alerts with the service=example-app and severity=critical matchers.

      $ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool config routes test service=example-app severity=critical --alertmanager.url http://localhost:9093
      Example output
      team-frontend-page
    3. Create a test alert and add it to the Alertmanager. The following example creates an alert with service=example-app and severity=critical to test the team-frontend-page receiver:

      $ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert add --alertmanager.url http://localhost:9093 alertname=myalarm --start="2025-03-31T00:00:00-00:00" service=example-app severity=critical --annotation="summary=\"This is a test alert with a custom summary\""
    4. Verify that the alert was generated:

      $ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert --alertmanager.url http://localhost:9093
      Example output
      Alertname  Starts At                Summary                                                                                  State
      myalarm    2025-03-31 00:00:00 UTC  This is a test alert with a custom summary                                               active
      Watchdog   2025-04-07 10:07:16 UTC  An alert that should always be firing to certify that Alertmanager is working properly.  active
    5. Verify that the receiver received a notification for the myalarm alert.
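      After the test, you can either let the myalarm alert expire or mark it as resolved by re-adding it with an end time in the past. A minimal sketch, assuming that your amtool version supports the --end flag:

      $ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert add --alertmanager.url http://localhost:9093 alertname=myalarm --start="2025-03-31T00:00:00-00:00" --end="2025-03-31T00:01:00-00:00" service=example-app severity=critical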