OBSDOCS-1330: Improve 'troubleshooting monitoring issues': New section troubleshooting remote write #92129
base: main
@eromanova97: This pull request references OBSDOCS-1330 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
c74390c to 3c1ba68
beb1fd2 to 52bfddc
For more information about the metrics used in the following steps, see: "Table of remote write metrics".
====
.. Query the `prometheus_remote_storage_retried_samples_total` metric. If you see a steady high rate for this metric, reduce throughput of remote write to reduce load on the endpoint.
If I include this step, I need information about how to reduce throughput, that is, what users can do to fix this.
Throughput can really only be reduced by sending fewer metrics.
Maybe it's better to state this more generically: retried samples indicate that the network or the receiving side cannot keep up, so either send fewer metrics or improve the network or the receiver.
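For illustration only, a query for this check could look like the following PromQL sketch. It uses the prometheus_remote_storage_samples_retried_total spelling that is confirmed later in this review, and the 15-minute window is an arbitrary choice:
# Per-queue rate of samples that remote write had to resend; a steady non-zero
# rate suggests the network or the receiving endpoint cannot keep up.
rate(prometheus_remote_storage_samples_retried_total[15m])
# The same signal as a proportion of all samples handed to remote write.
rate(prometheus_remote_storage_samples_retried_total[15m]) / rate(prometheus_remote_storage_samples_total[15m])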
... Query the `prometheus_remote_storage_shards_desired` metric to see if its value is greater than the `prometheus_remote_storage_shards_max` metric value.
... If you are running the maximum number of shards and the value for wanted shards is greater than the value for maximum shards, increase the `maxShards` value for your remote write configuration.
Are there other possible mitigations?
We should be careful here. Increasing shards means increased memory usage. It's also unclear whether that solves this issue, as remote write could fall behind because the receiving end is slow or down.
If that is the case, increasing shards will simply increase memory usage but not solve the perceived falling behind.
I thought so; just increasing the value did not seem like an ideal solution even to me 😄 I would rather have some more suggestions here on what to do, and one of those could be increasing the `maxShards` value while including the warning about the memory usage (this was the only solution in the article I based this procedure on).
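To make that option concrete, here is a minimal sketch of where maxShards would go, assuming the queueConfig tuning that the Cluster Monitoring Operator exposes for remote write; the endpoint URL is the same placeholder used elsewhere in this thread:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint"
        queueConfig:
          # Raising maxShards can increase Prometheus memory usage and does not
          # help when the receiving endpoint itself is slow or unreachable.
          maxShards: 100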
modules/monitoring-troubleshooting-remote-write-configuration.adoc (outdated)
…n troubleshooting remote write
52bfddc to 44d7e64
/retest
@eromanova97: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
+
[NOTE]
====
The formatted example output uses a filtering tool, such as `jq`, to provide the formatted indented JSON. See the link:https://stedolan.github.io/jq/manual/[jq Manual] (jq documentation) for more information about using `jq`.
The paragraph is fine, just FYI: using the jq tool alone will not get a formatted indented JSON, example: https://privatebin.corp.redhat.com/?35bf7440a03eabfa#5uhjxvXRfRSD9gubPgWWAjL5Tr1Yk6UsyNmTKeThhsE2. You also need to replace `\n` with new lines for the data.yaml part, then you can get a formatted indented JSON.
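A rough sketch of that difference, using the cluster-monitoring-config ConfigMap purely as a stand-in for whatever object the documented step retrieves:
# jq alone pretty-prints the JSON, but the embedded YAML string keeps its literal \n sequences:
$ oc -n openshift-monitoring get configmap cluster-monitoring-config -o json | jq .
# Printing the field with jq -r emits the raw string, so the \n sequences become real line breaks:
$ oc -n openshift-monitoring get configmap cluster-monitoring-config -o json | jq -r '.data["config.yaml"]'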
For more information about the metrics used in the following steps, see: "Table of remote write metrics".
====
.. Query the `prometheus_remote_storage_retried_samples_total` metric. If you see a steady high rate for this metric, reduce throughput of remote write to reduce load on the endpoint.
Should be prometheus_remote_storage_samples_retried_total, see:
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep remote_storage
"prometheus_remote_storage_bytes_total",
"prometheus_remote_storage_enqueue_retries_total",
"prometheus_remote_storage_exemplars_failed_total",
"prometheus_remote_storage_exemplars_in_total",
"prometheus_remote_storage_exemplars_pending",
"prometheus_remote_storage_exemplars_retried_total",
"prometheus_remote_storage_exemplars_total",
"prometheus_remote_storage_highest_timestamp_in_seconds",
"prometheus_remote_storage_histograms_failed_total",
"prometheus_remote_storage_histograms_in_total",
"prometheus_remote_storage_histograms_pending",
"prometheus_remote_storage_histograms_retried_total",
"prometheus_remote_storage_histograms_total",
"prometheus_remote_storage_max_samples_per_send",
"prometheus_remote_storage_metadata_bytes_total",
"prometheus_remote_storage_metadata_failed_total",
"prometheus_remote_storage_metadata_retried_total",
"prometheus_remote_storage_metadata_total",
"prometheus_remote_storage_queue_highest_sent_timestamp_seconds",
"prometheus_remote_storage_samples_failed_total",
"prometheus_remote_storage_samples_in_total",
"prometheus_remote_storage_samples_pending",
"prometheus_remote_storage_samples_retried_total",
"prometheus_remote_storage_samples_total",
"prometheus_remote_storage_sent_batch_duration_seconds_bucket",
"prometheus_remote_storage_sent_batch_duration_seconds_count",
"prometheus_remote_storage_sent_batch_duration_seconds_sum",
"prometheus_remote_storage_shard_capacity",
"prometheus_remote_storage_shards",
"prometheus_remote_storage_shards_desired",
"prometheus_remote_storage_shards_max",
"prometheus_remote_storage_shards_min",
"prometheus_remote_storage_string_interner_zero_reference_releases_total",
| Metric | Description
| `prometheus_remote_storage_highest_timestamp_in_seconds` | Shows the newest timestamp that Prometheus stored in the write-ahead log (WAL) for any sample.
| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent.
| `prometheus_remote_storage_retried_samples_total` | The number of samples that remote write failed to send and had to resend to remote storage. A steady high rate for this metric indicates problems with the network or remote storage endpoint.
The formatted example output uses a filtering tool, such as `jq`, to provide the formatted indented JSON. See the link:https://stedolan.github.io/jq/manual/[jq Manual] (jq documentation) for more information about using `jq`.
====
. From the *Administrator* perspective of the {product-title} web console, go to *Observe* -> *Metrics* and query the relevant remote write metrics:
I think this should be step 3; add "Check the Prometheus pod logs to see if there are errors" as step 2. Example:
$ oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0
ts=2025-04-17T14:28:35.584Z caller=dedupe.go:112 component=remote level=warn remote_name=03b4b3 url=https://remote-write.endpoint msg="Failed to send batch, retrying" err="Post \"https://remote-write.endpoint\": dial tcp: lookup remote-write.endpoint on 172.30.0.10:53: no such host"
ts=2025-04-17T14:29:35.626Z caller=dedupe.go:112 component=remote level=warn remote_name=03b4b3 url=https://remote-write.endpoint msg="Failed to send batch, retrying" err="Post \"https://remote-write.endpoint\": dial tcp: lookup remote-write.endpoint on 172.30.0.10:53: no such host"
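If this step is added, a small follow-up could narrow the output to remote write activity, since those log lines carry a component=remote field as in the sample above:
$ oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0 | grep 'component=remote'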
|===
| Metric | Description
| `prometheus_remote_storage_highest_timestamp_in_seconds` | Shows the newest timestamp that Prometheus stored in the write-ahead log (WAL) for any sample.
| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent.
| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent. | |
| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent by this queue. |
based on
# HELP prometheus_remote_storage_queue_highest_sent_timestamp_seconds Timestamp from a WAL sample, the highest timestamp successfully sent by this queue, in seconds since epoch. Initialized to 0 when no data has been sent yet.
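Since both timestamp metrics appear in this table, a lag check along the lines of the upstream PrometheusRemoteWriteBehind alert might also be worth mentioning; this is only a sketch of that comparison:
# Seconds by which each remote write queue lags behind what Prometheus has ingested locally.
  max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
- ignoring(remote_name, url) group_right
  max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])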
Version(s): 4.12 and later
Issue: https://issues.redhat.com/browse/OBSDOCS-1330
Link to docs preview:
Enterprise: https://92129--ocpdocs-pr.netlify.app/openshift-enterprise/latest/observability/monitoring/troubleshooting-monitoring-issues.html#troubleshooting-remote-write-configuration_troubleshooting-monitoring-issues
ROSA/OSD: https://92129--ocpdocs-pr.netlify.app/openshift-rosa/latest/observability/monitoring/troubleshooting-monitoring-issues.html#troubleshooting-remote-write-configuration_troubleshooting-monitoring-issues
QE review:
Additional information: