// Module included in the following assemblies:
//
// * observability/monitoring/troubleshooting-monitoring-issues.adoc

:_mod-docs-content-type: PROCEDURE
[id="troubleshooting-remote-write-configuration_{context}"]
= Troubleshooting remote write configuration

If your remote write configuration does not work properly, you can verify the running configuration from the Prometheus API and inspect metrics to discover other possible issues.

.Prerequisites

ifndef::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
* You have access to the cluster as a user with the `cluster-admin` cluster role.
endif::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
ifdef::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
* You have access to the cluster as a user with the `dedicated-admin` role.
endif::openshift-dedicated,openshift-rosa-hcp,openshift-rosa[]
* You have configured remote write storage.
* You have installed the {oc-first}.

.Procedure

. Verify the running remote write configuration from the Prometheus API:
// tag::CPM[]

** To verify remote write configuration for core platform monitoring:
+
[source,terminal]
----
$ oc exec prometheus-k8s-0 -c prometheus -n openshift-monitoring -- curl -s 'http://localhost:9090/api/v1/status/config'
----
** To verify remote write configuration for user workload monitoring:
// end::CPM[]
+
[source,terminal]
----
$ oc exec prometheus-user-workload-0 -c prometheus -n openshift-user-workload-monitoring -- curl -s 'http://localhost:9090/api/v1/status/config'
----
+
.The formatted example output
[source,terminal]
----
...
remote_write:
- url: https://remote-write-endpoint.example.com
  remote_timeout: 30s
  write_relabel_configs:
  - separator: ;
    target_label: __tmp_openshift_cluster_id__
    replacement: 0b02e767-c309-41e9-8727-03bb50f0fc89
    action: replace
  - separator: ;
    regex: __tmp_openshift_cluster_id__
    replacement: $1
    action: labeldrop
  protobuf_message: prometheus.WriteRequest
  authorization:
    type: Bearer
    credentials: <secret>
  follow_redirects: true
  enable_http2: true
  queue_config:
    capacity: 10000
    max_shards: 50
    min_shards: 1
    max_samples_per_send: 2000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s
  metadata_config:
    send: true
    send_interval: 1m
    max_samples_per_send: 2000
...
----
+
[NOTE]
====
The formatted example output was produced with a filtering tool, such as `jq`, which provides formatted, indented JSON. See the link:https://stedolan.github.io/jq/manual/[jq Manual] (jq documentation) for more information about using `jq`.
====
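+
If `jq` is available where you run `oc`, you can extract only the configuration text from the JSON response. The following command is a sketch for core platform monitoring; the `.data.yaml` field of the API response holds the running Prometheus configuration as a single string:
+
[source,terminal]
----
$ oc exec prometheus-k8s-0 -c prometheus -n openshift-monitoring -- curl -s 'http://localhost:9090/api/v1/status/config' | jq -r '.data.yaml'
----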

. From the *Administrator* perspective of the {product-title} web console, go to *Observe* -> *Metrics* and query the relevant remote write metrics:
+
[NOTE]
====
For more information about the metrics used in the following steps, see "Table of remote write metrics".
====

.. Query the `prometheus_remote_storage_retried_samples_total` metric. If you see a steady high rate for this metric, reduce the remote write throughput to lower the load on the endpoint.
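+
For example, the following query (a sketch; adjust the time window to suit your environment) shows the per-second rate of retried samples over the last five minutes:
+
[source,promql]
----
rate(prometheus_remote_storage_retried_samples_total[5m])
----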

.. Check if remote write is falling behind in reading from the write-ahead log (WAL) and sending data to the remote endpoint:

... Query the `prometheus_remote_storage_highest_timestamp_in_seconds` and `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` metrics to see how far behind remote write is for an endpoint.
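+
The difference between these two timestamps approximates the remote write lag in seconds. The following query is a sketch that assumes a single remote write endpoint; the `ignoring` clause drops the `remote_name` and `url` labels that exist only on the sent-timestamp metric:
+
[source,promql]
----
prometheus_remote_storage_highest_timestamp_in_seconds
- ignoring(remote_name, url)
  prometheus_remote_storage_queue_highest_sent_timestamp_seconds
----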

... Query the `prometheus_wal_watcher_current_segment` and `prometheus_tsdb_wal_segment_current` metrics to see if the values are the same. If you see a significant gap between the two values, it could mean that remote write is falling behind.
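+
You can also compute the gap directly. The following query is a sketch that assumes a single remote write endpoint; the `ignoring` clause drops the `consumer` label that exists only on the WAL watcher metric:
+
[source,promql]
----
prometheus_tsdb_wal_segment_current
- ignoring(consumer)
  prometheus_wal_watcher_current_segment
----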

.. If you determine from the previous steps that remote write is falling behind, perform the following actions:

... Query the `prometheus_remote_storage_shards` and `prometheus_remote_storage_shards_max` metrics to see if you are running the maximum number of shards.

... Query the `prometheus_remote_storage_shards_desired` metric to see if its value is greater than the `prometheus_remote_storage_shards_max` metric value.
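+
The comparison in the previous two steps can be expressed as a single query (a sketch; both metrics carry the same endpoint labels, so they match one-to-one). It returns a result only when the desired shard count exceeds the configured maximum:
+
[source,promql]
----
prometheus_remote_storage_shards_desired > prometheus_remote_storage_shards_max
----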

... If you are running the maximum number of shards and the desired shards value is greater than the maximum shards value, increase the `maxShards` value in your remote write configuration.
+
.Example remote write maxShards parameter for user workload monitoring
[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      remoteWrite:
      # ...
        queueConfig:
          # ...
          maxShards: 100
----
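+
For core platform monitoring, the corresponding setting belongs in the `cluster-monitoring-config` config map in the `openshift-monitoring` namespace, under the `prometheusK8s` key. The following is a sketch of the equivalent change:
+
[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      # ...
        queueConfig:
          # ...
          maxShards: 100
----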