OBSDOCS-1330: Improve 'troubleshooting monitoring issues': New section troubleshooting remote write

eromanova97 · eromanova97 · commit 52bfddcd9823 · 2025-04-15T13:43:56.000+02:00
diff --git a/modules/monitoring-table-of-remote-write-metrics.adoc b/modules/monitoring-table-of-remote-write-metrics.adoc
@@ -0,0 +1,23 @@
+// Module included in the following assemblies:
+//
+// * observability/monitoring/troubleshooting-monitoring-issues.adoc
+
+:_mod-docs-content-type: REFERENCE
+[id="table-of-remote-write-metrics_{context}"]
+= Table of remote write metrics
+
+The following table lists remote write and related metrics, with descriptions to help you troubleshoot remote write storage.
+
+[options="header"]
+|===
+| Metric | Description
+| `prometheus_remote_storage_highest_timestamp_in_seconds` | Shows the newest timestamp that Prometheus stored in the write-ahead log (WAL) for any sample.
+| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent.
+| `prometheus_remote_storage_retried_samples_total` | Shows the number of samples that remote write failed to send and had to resend to remote storage. A consistently high rate for this metric indicates problems with the network or the remote storage endpoint.
+| `prometheus_remote_storage_shards` | Shows how many shards are currently running for each remote endpoint.
+| `prometheus_remote_storage_shards_desired` | Shows the number of shards that Prometheus calculates it needs, based on the current write throughput and the rate of incoming versus sent samples.
+| `prometheus_remote_storage_shards_max` | Shows the maximum number of shards based on the current configuration.
+| `prometheus_remote_storage_shards_min` | Shows the minimum number of shards based on the current configuration.
+| `prometheus_tsdb_wal_segment_current` | Shows the WAL segment file that Prometheus is currently writing new data to.
+| `prometheus_wal_watcher_current_segment` | Shows the WAL segment file that each remote write instance is currently reading from.
+|===
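The two timestamp metrics and the two WAL segment metrics in this table are meant to be read in pairs. As a minimal sketch of that comparison, the following Python helpers compute how far remote write trails the WAL. The function names and the sample values are hypothetical and only illustrate the arithmetic; in practice the values come from querying the metrics.

```python
def remote_write_lag_seconds(highest_in: float, highest_sent: float) -> float:
    """Return how many seconds the newest sent sample trails the newest WAL sample.

    highest_in:   value of prometheus_remote_storage_highest_timestamp_in_seconds
    highest_sent: value of prometheus_remote_storage_queue_highest_sent_timestamp_seconds
    """
    # A negative difference is clamped to zero: remote write is not behind.
    return max(0.0, highest_in - highest_sent)


def wal_segment_gap(tsdb_segment: int, watcher_segment: int) -> int:
    """Return how many WAL segments the remote write reader is behind the writer.

    tsdb_segment:    value of prometheus_tsdb_wal_segment_current
    watcher_segment: value of prometheus_wal_watcher_current_segment
    """
    return max(0, tsdb_segment - watcher_segment)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from a real cluster.
    lag = remote_write_lag_seconds(1_744_717_436.0, 1_744_717_316.0)
    gap = wal_segment_gap(120, 117)
    print(f"remote write lag: {lag:.0f}s, WAL segment gap: {gap}")
```

A lag or gap that stays near zero suggests remote write is keeping up. What counts as a significant gap depends on the workload, so this sketch does not set a threshold.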
diff --git a/modules/monitoring-troubleshooting-remote-write-configuration.adoc b/modules/monitoring-troubleshooting-remote-write-configuration.adoc
@@ -0,0 +1,122 @@
+// Module included in the following assemblies:
+//
+// * observability/monitoring/troubleshooting-monitoring-issues.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="troubleshooting-remote-write-configuration_{context}"]
+= Troubleshooting remote write configuration
+
+If your remote write configuration does not work properly, you can verify the running configuration from the Prometheus API and inspect metrics to discover other possible issues.
+
+.Prerequisites
+
+ifndef::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
+* You have access to the cluster as a user with the `cluster-admin` cluster role.
+endif::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
+ifdef::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
+* You have access to the cluster as a user with the `dedicated-admin` role.
+endif::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
+* You have configured remote write storage.
+* You have installed the {oc-first}.
+
+.Procedure
+
+. Verify the running remote write configuration from the Prometheus API:
+// tag::CPM[]
+
+** To verify remote write configuration for core platform monitoring:
++
+[source,terminal]
+----
+$ oc exec prometheus-k8s-0 -c prometheus -n openshift-monitoring -- curl -s 'http://localhost:9090/api/v1/status/config'
+----
+** To verify remote write configuration for user workload monitoring:
+// end::CPM[]
++
+[source,terminal]
+----
+$ oc exec prometheus-user-workload-0 -c prometheus -n openshift-user-workload-monitoring -- curl -s 'http://localhost:9090/api/v1/status/config'
+----
++
+.Example formatted output
+[source,terminal]
+----
+...
+remote_write:
+- url: https://remote-write-endpoint.example.com
+  remote_timeout: 30s
+  write_relabel_configs:
+  - separator: ;
+    target_label: __tmp_openshift_cluster_id__
+    replacement: 0b02e767-c309-41e9-8727-03bb50f0fc89
+    action: replace
+  - separator: ;
+    regex: __tmp_openshift_cluster_id__
+    replacement: $1
+    action: labeldrop
+  protobuf_message: prometheus.WriteRequest
+  authorization:
+    type: Bearer
+    credentials: <secret>
+  follow_redirects: true
+  enable_http2: true
+  queue_config:
+    capacity: 10000
+    max_shards: 50
+    min_shards: 1
+    max_samples_per_send: 2000
+    batch_send_deadline: 5s
+    min_backoff: 30ms
+    max_backoff: 5s
+  metadata_config:
+    send: true
+    send_interval: 1m
+    max_samples_per_send: 2000
+...
+----
++
+[NOTE]
+====
+The formatted example output uses a filtering tool, such as `jq`, to extract the YAML configuration from the JSON response. For more information about using `jq`, see the link:https://stedolan.github.io/jq/manual/[jq Manual].
+====
+
+. From the *Administrator* perspective of the {product-title} web console, go to *Observe* -> *Metrics* and query the relevant remote write metrics:
++
+[NOTE]
+====
+For more information about the metrics used in the following steps, see "Table of remote write metrics".
+====
+
+.. Query the `prometheus_remote_storage_retried_samples_total` metric. If you see a consistently high rate for this metric, reduce the throughput of remote write to reduce load on the endpoint.
+
+.. Check if remote write is falling behind in reading from the write-ahead log (WAL) and sending data to the remote endpoint:
+
+... Query the `prometheus_remote_storage_highest_timestamp_in_seconds` and `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` metrics to see how far behind remote write is for an endpoint.
+
+... Query the `prometheus_wal_watcher_current_segment` and `prometheus_tsdb_wal_segment_current` metrics to see if the values are the same. If you see a significant gap between the two values, remote write might be falling behind.
+
+.. If you determine from the previous steps that remote write is behind, perform the following actions:
+
+... Query the `prometheus_remote_storage_shards` and `prometheus_remote_storage_shards_max` metrics to see if you are running the maximum number of shards.
+
+... Query the `prometheus_remote_storage_shards_desired` metric to see if its value is greater than the `prometheus_remote_storage_shards_max` metric value.
+
+... If you are running the maximum number of shards and the value for desired shards is greater than the value for maximum shards, increase the `maxShards` value for your remote write configuration.
++
+.Example remote write maxShards parameter for user workload monitoring
+[source,yaml]
+----
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: user-workload-monitoring-config
+  namespace: openshift-user-workload-monitoring
+data:
+  config.yaml: |
+    prometheus:
+      remoteWrite:
+        # ...
+        queueConfig:
+          # ...
+          maxShards: 100
+----
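The last three substeps reduce to a two-part condition: the running shard count has reached the maximum, and the desired count exceeds the maximum. A minimal Python sketch of that check follows. The `ShardMetrics` class, the function name, and the sample values are hypothetical and only illustrate the logic described in the procedure.

```python
from dataclasses import dataclass


@dataclass
class ShardMetrics:
    """Values read from the shard metrics for one remote endpoint."""
    shards: int            # prometheus_remote_storage_shards
    shards_desired: float  # prometheus_remote_storage_shards_desired
    shards_max: int        # prometheus_remote_storage_shards_max


def should_raise_max_shards(m: ShardMetrics) -> bool:
    """Return True when both conditions from the procedure hold."""
    running_at_max = m.shards >= m.shards_max
    wants_more_than_max = m.shards_desired > m.shards_max
    return running_at_max and wants_more_than_max


if __name__ == "__main__":
    # Hypothetical sample values, not taken from a real cluster.
    saturated = ShardMetrics(shards=50, shards_desired=72.4, shards_max=50)
    healthy = ShardMetrics(shards=20, shards_desired=18.0, shards_max=50)
    print(should_raise_max_shards(saturated))
    print(should_raise_max_shards(healthy))
```

The sketch only flags when raising `maxShards` matches the procedure. It does not suggest a new value, because a suitable value depends on what the remote endpoint can absorb.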
diff --git a/observability/monitoring/troubleshooting-monitoring-issues.adoc b/observability/monitoring/troubleshooting-monitoring-issues.adoc
@@ -57,3 +57,22 @@ include::modules/monitoring-resolving-the-alertmanagerreceiversnotconfigured-ale
 * xref:../../observability/monitoring/configuring-core-platform-monitoring/configuring-alerts-and-notifications.adoc#configuring-alert-notifications_configuring-alerts-and-notifications[Configuring alert notifications for default platform monitoring]
 * xref:../../observability/monitoring/configuring-user-workload-monitoring/configuring-alerts-and-notifications-uwm.adoc#configuring-alert-notifications_configuring-alerts-and-notifications-uwm[Configuring alert notifications for user workload monitoring]
 endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+
+ifndef::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+// Troubleshooting remote write configuration
+include::modules/monitoring-troubleshooting-remote-write-configuration.adoc[leveloffset=+1,tags=**;CPM]
+[role="_additional-resources"]
+.Additional resources
+* xref:../../observability/monitoring/configuring-core-platform-monitoring/configuring-metrics.adoc#example-remote-write-queue-configuration_configuring-metrics[Example remote write queue configuration]
+endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+
+ifdef::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+// Troubleshooting remote write configuration
+include::modules/monitoring-troubleshooting-remote-write-configuration.adoc[leveloffset=+1,tags=**;!CPM]
+[role="_additional-resources"]
+.Additional resources
+* xref:../../observability/monitoring/configuring-the-monitoring-stack.adoc#example-remote-write-queue-configuration_configuring-the-monitoring-stack[Example remote write queue configuration]
+endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+
+//Table of remote write metrics
+include::modules/monitoring-table-of-remote-write-metrics.adoc[leveloffset=+2]
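Step 1 of the included procedure reads the running configuration from the `/api/v1/status/config` endpoint and shows it as formatted YAML. As a minimal sketch of that formatting step, and assuming the usual Prometheus response envelope in which the configuration is a YAML string under `data.yaml`, the following Python function extracts that string from the response body. The function name and the sample response are hypothetical.

```python
import json


def extract_running_config(response_body: str) -> str:
    """Return the YAML configuration embedded in a status/config response."""
    payload = json.loads(response_body)
    # Assumption: a successful response carries status "success".
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    return payload["data"]["yaml"]


if __name__ == "__main__":
    # Hypothetical, shortened response body for illustration only.
    sample = json.dumps({
        "status": "success",
        "data": {
            "yaml": "remote_write:\n- url: https://remote-write-endpoint.example.com\n"
        },
    })
    print(extract_running_config(sample))
```

The command output from step 1 could be passed to this function as a string. A filtering tool such as `jq` serves the same purpose on the command line.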