From eebc8fa54e612f24293d5ffdb21fbc772944bf48 Mon Sep 17 00:00:00 2001
From: Eliska Romanova
Date: Thu, 27 Mar 2025 12:33:14 +0100
Subject: [PATCH] OBSDOCS-1327: Improve troubleshooting monitoring issues: new section troubleshooting alertmanager configurations

---
 ...eshooting-alertmanager-configurations.adoc | 188 ++++++++++++++++++
 .../troubleshooting-monitoring-issues.adoc    |   5 +
 2 files changed, 193 insertions(+)
 create mode 100644 modules/monitoring-troubleshooting-alertmanager-configurations.adoc

diff --git a/modules/monitoring-troubleshooting-alertmanager-configurations.adoc b/modules/monitoring-troubleshooting-alertmanager-configurations.adoc
new file mode 100644
index 000000000000..848222e878c4
--- /dev/null
+++ b/modules/monitoring-troubleshooting-alertmanager-configurations.adoc
@@ -0,0 +1,188 @@
// Module included in the following assemblies:
//
// * monitoring/troubleshooting-monitoring-issues.adoc

:_mod-docs-content-type: PROCEDURE
[id="troubleshooting-alertmanager-configurations_{context}"]
= Troubleshooting Alertmanager configurations

If your Alertmanager configuration does not work as expected, you can compare the `alertmanager-main` secret with the running Alertmanager configuration to identify possible errors. You can also test your alert routing configuration by creating a test alert.

.Prerequisites

* You have access to the cluster as a user with the `cluster-admin` cluster role.
* You have installed the {oc-first}.

.Procedure

. Compare the `alertmanager-main` secret with the running Alertmanager configuration:

.. Extract the Alertmanager configuration from the `alertmanager-main` secret into the `alertmanager.yaml` file:
+
[source,terminal]
----
$ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' | base64 --decode > alertmanager.yaml
----

.. Pull the running Alertmanager configuration from the API:
+
[source,terminal]
----
$ oc exec -n openshift-monitoring alertmanager-main-0 -- amtool config show --alertmanager.url http://localhost:9093
----
+
.Example output
[source,terminal]
----
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
    enable_http2: true
    proxy_from_environment: true
...
route:
  receiver: default
  group_by:
  - namespace
  continue: false
  routes:
  ...
  - matchers: # <1>
    - service="example-app"
    continue: false
    routes:
    - receiver: team-frontend-page
      matchers:
      - severity="critical"
      continue: false
  ...
receivers:
...
- name: team-frontend-page # <2>
  pagerduty_configs:
  - send_resolved: true
    http_config:
      authorization:
        type: Bearer
        credentials: <secret>
      follow_redirects: true
      enable_http2: true
      proxy_from_environment: true
    service_key: <secret>
    url: https://events.pagerduty.com/v2/enqueue
    ...
templates: []
----
<1> The route to the `team-frontend-page` receiver. Alertmanager routes alerts with the `service="example-app"` and `severity="critical"` labels to this receiver.
<2> The `team-frontend-page` receiver configuration. The example shows PagerDuty as the receiver.

.. Compare the contents of the `route` and `receivers` fields of the `alertmanager.yaml` file with the fields in the running Alertmanager configuration. Look for any discrepancies.
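+
To make the comparison easier, you can save the running configuration to a local file and compare the two files directly. The following commands are a minimal sketch that reuses the commands from the previous sub-steps; the `alertmanager-running.yaml` file name is arbitrary. Expect some differences even when the configuration is correct, because the running configuration includes expanded default values and any routes generated from `AlertmanagerConfig` objects.
+
[source,terminal]
----
# Save the running configuration to a local file
$ oc exec -n openshift-monitoring alertmanager-main-0 -- amtool config show --alertmanager.url http://localhost:9093 > alertmanager-running.yaml

# Compare the configuration from the secret with the running configuration
$ diff -u alertmanager.yaml alertmanager-running.yaml
----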
.. If you used an `AlertmanagerConfig` object to configure alert routing for user-defined projects, you can use the `alertmanager.yaml` file to see the configuration before the `AlertmanagerConfig` object was applied. The running Alertmanager configuration shows the changes after the object was applied:
+
.Example running configuration with an AlertmanagerConfig object applied
[source,terminal]
----
...
route:
  ...
  routes:
  - receiver: ns1/example-routing/UWM-receiver # <1>
    group_by:
    - job
    matchers:
    - namespace="ns1"
    continue: true
  ...
receivers:
...
- name: ns1/example-routing/UWM-receiver # <1>
  webhook_configs:
  - send_resolved: true
    http_config:
      follow_redirects: true
      enable_http2: true
      proxy_from_environment: true
    url: <secret>
    url_file: ""
    max_alerts: 0
    timeout: 0s
templates: []
----
<1> The routing configuration generated from the `example-routing` `AlertmanagerConfig` object in the `ns1` project for the `UWM-receiver` receiver.

. Check the Alertmanager pod logs for any errors:
+
[source,terminal]
----
$ oc -n openshift-monitoring logs -c alertmanager <alertmanager_pod>
----
+
[NOTE]
====
For multi-node clusters, ensure that you check all Alertmanager pods and their logs.
====
+
.Example command
[source,terminal]
----
$ oc -n openshift-monitoring logs -c alertmanager alertmanager-main-0
----

. Verify that your receiver is configured correctly by creating a test alert:

.. Get a list of the configured routes:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool config routes show --alertmanager.url http://localhost:9093
----
+
.Example output
[source,terminal]
----
Routing tree:
.
└── default-route  receiver: default
    ├── {alertname="Watchdog"}  receiver: Watchdog
    └── {service="example-app"}  receiver: default
        └── {severity="critical"}  receiver: team-frontend-page
----

.. Print the route for your chosen receiver. The following example shows the receiver used for alerts with the `service=example-app` and `severity=critical` matchers:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool config routes test service=example-app severity=critical --alertmanager.url http://localhost:9093
----
+
.Example output
[source,terminal]
----
team-frontend-page
----

.. Create a test alert and add it to Alertmanager. The following example creates an alert with the `service=example-app` and `severity=critical` labels to test the `team-frontend-page` receiver:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert add --alertmanager.url http://localhost:9093 alertname=myalarm --start="2025-03-31T00:00:00-00:00" service=example-app severity=critical --annotation="summary=\"This is a test alert with a custom summary\""
----

.. Verify that the alert was generated:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert --alertmanager.url http://localhost:9093
----
+
.Example output
[source,terminal]
----
Alertname  Starts At                Summary                                                                                   State
myalarm    2025-03-31 00:00:00 UTC  This is a test alert with a custom summary                                                active
Watchdog   2025-04-07 10:07:16 UTC  An alert that should always be firing to certify that Alertmanager is working properly.  active
----

.. Verify that your receiver was notified about the `myalarm` alert.
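.. Optional: After you verify the notification, resolve the `myalarm` test alert so that it does not continue to fire. The following command is a cleanup sketch that assumes your `amtool` version supports the `--end` flag; it re-sends the alert with the same labels and an end time in the past so that Alertmanager marks the alert as resolved:
+
[source,terminal]
----
$ oc exec alertmanager-main-0 -n openshift-monitoring -- amtool alert add --alertmanager.url http://localhost:9093 alertname=myalarm --start="2025-03-31T00:00:00-00:00" --end="2025-03-31T00:01:00-00:00" service=example-app severity=critical
----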
\ No newline at end of file
diff --git a/observability/monitoring/troubleshooting-monitoring-issues.adoc b/observability/monitoring/troubleshooting-monitoring-issues.adoc
index 1daa769b81f4..65655604fd10 100644
--- a/observability/monitoring/troubleshooting-monitoring-issues.adoc
+++ b/observability/monitoring/troubleshooting-monitoring-issues.adoc
@@ -57,3 +57,8 @@ include::modules/monitoring-resolving-the-alertmanagerreceiversnotconfigured-ale
 * xref:../../observability/monitoring/configuring-core-platform-monitoring/configuring-alerts-and-notifications.adoc#configuring-alert-notifications_configuring-alerts-and-notifications[Configuring alert notifications for default platform monitoring]
 * xref:../../observability/monitoring/configuring-user-workload-monitoring/configuring-alerts-and-notifications-uwm.adoc#configuring-alert-notifications_configuring-alerts-and-notifications-uwm[Configuring alert notifications for user workload monitoring]
 endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+
+ifndef::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+//Troubleshooting Alertmanager configurations
+include::modules/monitoring-troubleshooting-alertmanager-configurations.adoc[leveloffset=+1]
+endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]