
Commit 52bfddc

OBSDOCS-1330: Improve 'troubleshooting monitoring issues': New section troubleshooting remote write
1 parent 1de0f04 commit 52bfddc

3 files changed, +164 -0 lines changed
+23
@@ -0,0 +1,23 @@
+// Module included in the following assemblies:
+//
+// * observability/monitoring/troubleshooting-monitoring-issues.adoc
+
+:_mod-docs-content-type: REFERENCE
+[id="table-of-remote-write-metrics_{context}"]
+= Table of remote write metrics
+
+The following table lists remote write metrics and related metrics, with descriptions to help you troubleshoot remote write storage.
+
+[options="header"]
+|===
+| Metric | Description
+| `prometheus_remote_storage_highest_timestamp_in_seconds` | Shows the newest timestamp that Prometheus stored in the write-ahead log (WAL) for any sample.
+| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent.
+| `prometheus_remote_storage_retried_samples_total` | The number of samples that remote write failed to send and had to resend to remote storage. A steady high rate for this metric indicates problems with the network or remote storage endpoint.
+| `prometheus_remote_storage_shards` | Shows how many shards are currently running for each remote endpoint.
+| `prometheus_remote_storage_shards_desired` | Shows the number of shards that Prometheus calculates it needs, based on the current write throughput and the rate of incoming versus sent samples.
+| `prometheus_remote_storage_shards_max` | Shows the maximum number of shards based on the current configuration.
+| `prometheus_remote_storage_shards_min` | Shows the minimum number of shards based on the current configuration.
+| `prometheus_tsdb_wal_segment_current` | The WAL segment file that Prometheus is currently writing new data to.
+| `prometheus_wal_watcher_current_segment` | The WAL segment file that each remote write instance is currently reading from.
+|===
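
As a rough illustration of how the timestamp and WAL segment metrics in the table can be compared, the following Python sketch computes a send lag and a segment gap from values that you have already read from the metrics. The `RemoteWriteSample` class, its field names, and both helper functions are hypothetical names introduced for this example; they are not part of Prometheus or the documented module, and the sketch does not define what counts as a "significant" gap.

```python
from dataclasses import dataclass


@dataclass
class RemoteWriteSample:
    """Hypothetical container for metric values read for one remote endpoint."""
    # prometheus_remote_storage_highest_timestamp_in_seconds
    highest_timestamp_in_seconds: float
    # prometheus_remote_storage_queue_highest_sent_timestamp_seconds
    highest_sent_timestamp_seconds: float
    # prometheus_tsdb_wal_segment_current
    tsdb_wal_segment_current: int
    # prometheus_wal_watcher_current_segment
    wal_watcher_current_segment: int


def send_lag_seconds(sample: RemoteWriteSample) -> float:
    """Seconds between the newest stored sample and the newest sent sample."""
    lag = sample.highest_timestamp_in_seconds - sample.highest_sent_timestamp_seconds
    # Clamp at zero so small clock or scrape skew never reports a negative lag.
    return max(0.0, lag)


def wal_segment_gap(sample: RemoteWriteSample) -> int:
    """Number of WAL segments between the writer and the remote write reader."""
    gap = sample.tsdb_wal_segment_current - sample.wal_watcher_current_segment
    return max(0, gap)


if __name__ == "__main__":
    # Made-up example values, not output from a real cluster.
    example = RemoteWriteSample(
        highest_timestamp_in_seconds=1700000120.0,
        highest_sent_timestamp_seconds=1700000000.0,
        tsdb_wal_segment_current=42,
        wal_watcher_current_segment=40,
    )
    print("send lag (s):", send_lag_seconds(example))
    print("WAL segment gap:", wal_segment_gap(example))
```

A lag or gap that keeps growing over time is the pattern to look for; a single non-zero reading on its own might only reflect normal batching.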
@@ -0,0 +1,122 @@
+// Module included in the following assemblies:
+//
+// * observability/monitoring/troubleshooting-monitoring-issues.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="troubleshooting-remote-write-configuration_{context}"]
+= Troubleshooting remote write configuration
+
+If your remote write configuration does not work properly, you can verify the running configuration from the Prometheus API and inspect metrics to discover other possible issues.
+
+.Prerequisites
+
+ifndef::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
+* You have access to the cluster as a user with the `cluster-admin` cluster role.
+endif::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
+ifdef::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
+* You have access to the cluster as a user with the `dedicated-admin` role.
+endif::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
+* You have configured remote write storage.
+* You have installed the {oc-first}.
+
+.Procedure
+
+. Verify the running remote write configuration from the Prometheus API:
+// tag::CPM[]
+
+** To verify remote write configuration for core platform monitoring:
++
+[source,terminal]
+----
+$ oc exec prometheus-k8s-0 -c prometheus -n openshift-monitoring -- curl -s 'http://localhost:9090/api/v1/status/config'
+----
+** To verify remote write configuration for user workload monitoring:
+// end::CPM[]
++
+[source,terminal]
+----
+$ oc exec prometheus-user-workload-0 -c prometheus -n openshift-user-workload-monitoring -- curl -s 'http://localhost:9090/api/v1/status/config'
+----
++
+.The formatted example output
+[source,terminal]
+----
+...
+remote_write:
+- url: https://remote-write-endpoint.example.com
+  remote_timeout: 30s
+  write_relabel_configs:
+  - separator: ;
+    target_label: __tmp_openshift_cluster_id__
+    replacement: 0b02e767-c309-41e9-8727-03bb50f0fc89
+    action: replace
+  - separator: ;
+    regex: __tmp_openshift_cluster_id__
+    replacement: $1
+    action: labeldrop
+  protobuf_message: prometheus.WriteRequest
+  authorization:
+    type: Bearer
+    credentials: <secret>
+  follow_redirects: true
+  enable_http2: true
+  queue_config:
+    capacity: 10000
+    max_shards: 50
+    min_shards: 1
+    max_samples_per_send: 2000
+    batch_send_deadline: 5s
+    min_backoff: 30ms
+    max_backoff: 5s
+  metadata_config:
+    send: true
+    send_interval: 1m
+    max_samples_per_send: 2000
+...
+----
++
+[NOTE]
+====
+The formatted example output uses a filtering tool, such as `jq`, to provide the formatted indented JSON. See the link:https://stedolan.github.io/jq/manual/[jq Manual] (jq documentation) for more information about using `jq`.
+====
+
+. From the *Administrator* perspective of the {product-title} web console, go to *Observe* -> *Metrics* and query the relevant remote write metrics:
++
+[NOTE]
+====
+For more information about the metrics used in the following steps, see "Table of remote write metrics".
+====
+
+.. Query the `prometheus_remote_storage_retried_samples_total` metric. If you see a steady high rate for this metric, reduce the throughput of remote write to lower the load on the endpoint.
+
+.. Check if remote write is falling behind in reading from the write-ahead log (WAL) and sending data to the remote endpoint:
+
+... Query the `prometheus_remote_storage_highest_timestamp_in_seconds` and `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` metrics to see how far behind remote write is for an endpoint.
+
+... Query the `prometheus_wal_watcher_current_segment` and `prometheus_tsdb_wal_segment_current` metrics to see if the values are the same. If you see a significant gap between the two values, it could mean that remote write is falling behind.
+
+.. If you determine from the previous steps that remote write is behind, perform the following actions:
+
+... Query the `prometheus_remote_storage_shards` and `prometheus_remote_storage_shards_max` metrics to see if you are running the maximum number of shards.
+
+... Query the `prometheus_remote_storage_shards_desired` metric to see if its value is greater than the `prometheus_remote_storage_shards_max` metric value.
+
+... If you are running the maximum number of shards and the value for desired shards is greater than the value for maximum shards, increase the `maxShards` value for your remote write configuration.
++
+.Example remote write maxShards parameter for user workload monitoring
+[source,yaml]
+----
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: user-workload-monitoring-config
+  namespace: openshift-user-workload-monitoring
+data:
+  config.yaml: |
+    prometheus:
+      remoteWrite:
+      # ...
+        queueConfig:
+        # ...
+          maxShards: 100
+----
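
The shard checks in the procedure above reduce to a single condition, which the following Python sketch spells out. The function name `should_increase_max_shards` and its parameter names are hypothetical and exist only for this illustration; the inputs are values you read yourself from the `prometheus_remote_storage_shards`, `prometheus_remote_storage_shards_desired`, and `prometheus_remote_storage_shards_max` metrics.

```python
def should_increase_max_shards(shards: float, shards_desired: float, shards_max: float) -> bool:
    """Return True when the procedure's condition for raising maxShards holds.

    The condition has two parts, both taken from the procedure:
    1. remote write is already running the maximum number of shards, and
    2. the desired number of shards is greater than the maximum.
    """
    running_at_max = shards >= shards_max
    wants_more_than_max = shards_desired > shards_max
    return running_at_max and wants_more_than_max


if __name__ == "__main__":
    # Made-up readings, not output from a real cluster.
    print(should_increase_max_shards(shards=50, shards_desired=72.4, shards_max=50))
    print(should_increase_max_shards(shards=12, shards_desired=15.0, shards_max=50))
```

The sketch only mirrors the decision in the procedure. It does not suggest a new `maxShards` value, because that depends on what the remote endpoint can absorb.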

Diff for: observability/monitoring/troubleshooting-monitoring-issues.adoc

+19
@@ -57,3 +57,22 @@ include::modules/monitoring-resolving-the-alertmanagerreceiversnotconfigured-ale
 * xref:../../observability/monitoring/configuring-core-platform-monitoring/configuring-alerts-and-notifications.adoc#configuring-alert-notifications_configuring-alerts-and-notifications[Configuring alert notifications for default platform monitoring]
 * xref:../../observability/monitoring/configuring-user-workload-monitoring/configuring-alerts-and-notifications-uwm.adoc#configuring-alert-notifications_configuring-alerts-and-notifications-uwm[Configuring alert notifications for user workload monitoring]
 endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+
+ifndef::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+// Troubleshooting remote write configuration
+include::modules/monitoring-troubleshooting-remote-write-configuration.adoc[leveloffset=+1,tags=**;CPM]
+[role="_additional-resources"]
+.Additional resources
+* xref:../../observability/monitoring/configuring-core-platform-monitoring/configuring-metrics.adoc#example-remote-write-queue-configuration_configuring-metrics[Example remote write queue configuration]
+endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+
+ifdef::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+// Troubleshooting remote write configuration
+include::modules/monitoring-troubleshooting-remote-write-configuration.adoc[leveloffset=+1,tags=**;!CPM]
+[role="_additional-resources"]
+.Additional resources
+* xref:../../observability/monitoring/configuring-the-monitoring-stack.adoc#example-remote-write-queue-configuration_configuring-the-monitoring-stack[Example remote write queue configuration]
+endif::openshift-dedicated,openshift-rosa,openshift-rosa-hcp[]
+
+//Table of remote write metrics
+include::modules/monitoring-table-of-remote-write-metrics.adoc[leveloffset=+2]
