OBSDOCS-1330: Improve 'troubleshooting monitoring issues': New section troubleshooting remote write #92129


Open · wants to merge 1 commit into base: main

Conversation

@openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Apr 14, 2025
@openshift-ci-robot commented Apr 14, 2025

@eromanova97: This pull request references OBSDOCS-1330 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Version(s): 4.12 and later

Issue: https://issues.redhat.com/browse/OBSDOCS-1330

Link to docs preview:

QE review:

  • QE has approved this change.

Additional information:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) Apr 14, 2025
@eromanova97 force-pushed the OBSDOCS-1330 branch 2 times, most recently from c74390c to 3c1ba68 on April 14, 2025 12:47
@eromanova97 force-pushed the OBSDOCS-1330 branch 2 times, most recently from beb1fd2 to 52bfddc on April 15, 2025 11:44
For more information about the metrics used in the following steps, see "Table of remote write metrics".
====

.. Query the `prometheus_remote_storage_retried_samples_total` metric. If you see a steady high rate for this metric, reduce the throughput of remote write to reduce load on the endpoint.
Contributor Author

If I include this step, I need information about how to reduce throughput, that is, what users can do to fix this.

Contributor

Throughput can really only be reduced by sending fewer metrics.
Maybe it's better to state this more generically: retried samples indicate that the network or the receiving side cannot keep up, so either send fewer metrics or improve the network or the receiver.
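For illustration, a minimal PromQL sketch of the check being discussed, using the corrected metric name that comes up later in this review (`prometheus_remote_storage_samples_retried_total`):

# Per-queue rate of samples that remote write had to resend over the last 5 minutes
rate(prometheus_remote_storage_samples_retried_total[5m])

A steady non-zero rate here points at the network or the receiving endpoint rather than at Prometheus itself.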


... Query the `prometheus_remote_storage_shards_desired` metric to see if its value is greater than the `prometheus_remote_storage_shards_max` metric value.

... If you are running the maximum number of shards and the value for desired shards is greater than the value for maximum shards, increase the `maxShards` value for your remote write configuration.
Contributor Author

Are there other possible mitigations?

Contributor

We should be careful here. Increasing shards means increased memory usage. It's also unclear whether that solves the issue, as remote write could fall behind because the receiving side is slow or down.
If that is the case, increasing shards will simply increase memory usage but not fix the perceived falling behind.

Contributor Author

I thought so; just increasing the value did not seem like an ideal solution even to me 😄 I would rather have some more suggestions here on what to do, and one of those could be increasing the `maxShards` value while including a warning about the memory usage (this was the only solution in the article I based this procedure on).
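For reference, a minimal sketch of what raising `maxShards` could look like in the `cluster-monitoring-config` config map; the endpoint URL and the value 100 are illustrative only, and the memory caveat above still applies:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint" # illustrative endpoint
        queueConfig:
          maxShards: 100 # illustrative value; each additional shard increases memory usage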

@openshift-ci-robot commented Apr 15, 2025

@eromanova97: This pull request references OBSDOCS-1330 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Version(s): 4.12 and later

Issue: https://issues.redhat.com/browse/OBSDOCS-1330

Link to docs preview:
Enterprise: https://92129--ocpdocs-pr.netlify.app/openshift-enterprise/latest/observability/monitoring/troubleshooting-monitoring-issues.html#troubleshooting-remote-write-configuration_troubleshooting-monitoring-issues
ROSA/OSD: https://92129--ocpdocs-pr.netlify.app/openshift-rosa/latest/observability/monitoring/troubleshooting-monitoring-issues.html#troubleshooting-remote-write-configuration_troubleshooting-monitoring-issues

QE review:

  • QE has approved this change.

Additional information:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@eromanova97
Contributor Author

/retest


openshift-ci bot commented Apr 17, 2025

@eromanova97: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

+
[NOTE]
====
The formatted example output uses a filtering tool, such as `jq`, to provide the formatted indented JSON. See the link:https://stedolan.github.io/jq/manual/[jq Manual] (jq documentation) for more information about using `jq`.
The paragraph is fine, just FYI: using `jq` alone will not produce formatted, indented JSON (example: https://privatebin.corp.redhat.com/?35bf7440a03eabfa#5uhjxvXRfRSD9gubPgWWAjL5Tr1Yk6UsyNmTKeThhsE2). You also need to replace `\n` with new lines for the data.yaml part; then you get formatted, indented JSON.
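For example, a hypothetical sketch of that extra step (the resource name and the data.yaml key are placeholders): extracting the field with `jq -r` emits the raw string, turning the escaped `\n` sequences into real new lines:

$ oc get <resource> -n openshift-monitoring -o json | jq -r '.data["data.yaml"]'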

For more information about the metrics used in the following steps, see "Table of remote write metrics".
====

.. Query the `prometheus_remote_storage_retried_samples_total` metric. If you see a steady high rate for this metric, reduce throughput of remote write to reduce load on the endpoint.
This should be `prometheus_remote_storage_samples_retried_total`. See:

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep remote_storage
    "prometheus_remote_storage_bytes_total",
    "prometheus_remote_storage_enqueue_retries_total",
    "prometheus_remote_storage_exemplars_failed_total",
    "prometheus_remote_storage_exemplars_in_total",
    "prometheus_remote_storage_exemplars_pending",
    "prometheus_remote_storage_exemplars_retried_total",
    "prometheus_remote_storage_exemplars_total",
    "prometheus_remote_storage_highest_timestamp_in_seconds",
    "prometheus_remote_storage_histograms_failed_total",
    "prometheus_remote_storage_histograms_in_total",
    "prometheus_remote_storage_histograms_pending",
    "prometheus_remote_storage_histograms_retried_total",
    "prometheus_remote_storage_histograms_total",
    "prometheus_remote_storage_max_samples_per_send",
    "prometheus_remote_storage_metadata_bytes_total",
    "prometheus_remote_storage_metadata_failed_total",
    "prometheus_remote_storage_metadata_retried_total",
    "prometheus_remote_storage_metadata_total",
    "prometheus_remote_storage_queue_highest_sent_timestamp_seconds",
    "prometheus_remote_storage_samples_failed_total",
    "prometheus_remote_storage_samples_in_total",
    "prometheus_remote_storage_samples_pending",
    "prometheus_remote_storage_samples_retried_total",
    "prometheus_remote_storage_samples_total",
    "prometheus_remote_storage_sent_batch_duration_seconds_bucket",
    "prometheus_remote_storage_sent_batch_duration_seconds_count",
    "prometheus_remote_storage_sent_batch_duration_seconds_sum",
    "prometheus_remote_storage_shard_capacity",
    "prometheus_remote_storage_shards",
    "prometheus_remote_storage_shards_desired",
    "prometheus_remote_storage_shards_max",
    "prometheus_remote_storage_shards_min",
    "prometheus_remote_storage_string_interner_zero_reference_releases_total",

| Metric | Description
| `prometheus_remote_storage_highest_timestamp_in_seconds` | Shows the newest timestamp that Prometheus stored in the write-ahead log (WAL) for any sample.
| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent.
| `prometheus_remote_storage_retried_samples_total` | The number of samples that remote write failed to send and had to resend to remote storage. A steady high rate for this metric indicates problems with the network or remote storage endpoint.
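For context, a minimal PromQL sketch that combines the two timestamp metrics above into an approximate remote write lag in seconds; the label matching assumes the per-queue `remote_name` and `url` labels on the sent-timestamp metric:

# Highest WAL timestamp minus the highest timestamp each queue has sent
prometheus_remote_storage_highest_timestamp_in_seconds - ignoring(remote_name, url) group_right prometheus_remote_storage_queue_highest_sent_timestamp_seconds

A sustained or growing result suggests that remote write is falling behind the write-ahead log.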
The formatted example output uses a filtering tool, such as `jq`, to provide the formatted indented JSON. See the link:https://stedolan.github.io/jq/manual/[jq Manual] (jq documentation) for more information about using `jq`.
====

. From the *Administrator* perspective of the {product-title} web console, go to *Observe* -> *Metrics* and query the relevant remote write metrics:
I think this should be step 3. Add "Check the Prometheus pod logs to see if there are errors" as step 2, for example:

$ oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0
ts=2025-04-17T14:28:35.584Z caller=dedupe.go:112 component=remote level=warn remote_name=03b4b3 url=https://remote-write.endpoint msg="Failed to send batch, retrying" err="Post \"https://remote-write.endpoint\": dial tcp: lookup remote-write.endpoint on 172.30.0.10:53: no such host"
ts=2025-04-17T14:29:35.626Z caller=dedupe.go:112 component=remote level=warn remote_name=03b4b3 url=https://remote-write.endpoint msg="Failed to send batch, retrying" err="Post \"https://remote-write.endpoint\": dial tcp: lookup remote-write.endpoint on 172.30.0.10:53: no such host"
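If the log volume is large, a sketch for narrowing the output to recent remote write messages; the one-hour window is arbitrary:

$ oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0 --since=1h | grep -i 'component=remote'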

|===
| Metric | Description
| `prometheus_remote_storage_highest_timestamp_in_seconds` | Shows the newest timestamp that Prometheus stored in the write-ahead log (WAL) for any sample.
| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent.
Suggested change:
| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent.
| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Shows the newest timestamp that remote write successfully sent by this queue.

Based on:
# HELP prometheus_remote_storage_queue_highest_sent_timestamp_seconds Timestamp from a WAL sample, the highest timestamp successfully sent by this queue, in seconds since epoch. Initialized to 0 when no data has been sent yet.
