
Conversation

@iblancasa
Contributor

Description

  • Add a dedicated experr.NewRetriesExhaustedErr wrapper so exporters can detect when all retry attempts failed (see the sketch after this list)
  • Record new otelcol_exporter_retry_dropped_{spans,metric_points,log_records} counters when retries are exhausted, alongside existing send-failed metrics
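
Below is a minimal sketch of what such a wrapper could look like; the type and helper names here are hypothetical, and the actual experr additions in this PR may differ.

```go
// A minimal sketch of the wrapper idea, assuming hypothetical names
// (retriesExhaustedErr, IsRetriesExhausted); the real experr additions
// in this PR may differ.
package experr

import "errors"

// retriesExhaustedErr marks an error as the terminal failure returned
// after every retry attempt has been used up.
type retriesExhaustedErr struct {
	err error
}

// NewRetriesExhaustedErr wraps err so downstream senders can tell that
// the retry budget was exhausted.
func NewRetriesExhaustedErr(err error) error {
	return retriesExhaustedErr{err: err}
}

func (e retriesExhaustedErr) Error() string { return "retries exhausted: " + e.err.Error() }

func (e retriesExhaustedErr) Unwrap() error { return e.err }

// IsRetriesExhausted reports whether err, or any error it wraps,
// was produced by NewRetriesExhaustedErr.
func IsRetriesExhausted(err error) bool {
	var target retriesExhaustedErr
	return errors.As(err, &target)
}
```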

Link to tracking issue

Fixes #13956

@iblancasa iblancasa requested review from a team, bogdandrutu and dmitryax as code owners October 9, 2025 14:46
@codecov

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 98.90110% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 92.17%. Comparing base (7012862) to head (3aabe2c).
⚠️ Report is 11 commits behind head on main.

Files with missing lines                           Patch %   Lines
exporter/exporterhelper/internal/retry_sender.go   50.00%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #13957   +/-   ##
=======================================
  Coverage   92.16%   92.17%           
=======================================
  Files         668      668           
  Lines       41463    41557   +94     
=======================================
+ Hits        38216    38305   +89     
- Misses       2214     2217    +3     
- Partials     1033     1035    +2     


@iblancasa iblancasa force-pushed the 13956 branch 3 times, most recently from 869d3f8 to ddfd2b6 on October 21, 2025 10:47
@iblancasa
Contributor Author

@open-telemetry/collector-approvers can you take a look?

Contributor

@jmacd jmacd left a comment


Non-blocking feedback (cc @jade-guiton-dd @axw).

Question 1: The universal telemetry RFC describes the use of an attribute otelcol.component.outcome=failure to indicate when an export fails. Why would we need a separate counter to indicate when retry fails?

Question 2: If the exporterhelper is configured with wait_for_result=true then it's difficult to call these failures "drops". Wouldn't the same sort of "drop" happen if the queue is configured (without wait_for_result=true) but also without the retry processor?

I guess these questions lead me to suspect that it's the queue (not the retry sender) that should count drops, i.e. requests that fail and get no upstream response because wait_for_result=false. Otherwise, failures are failures, and I see no reason to count them in a new way.

@iblancasa
Contributor Author

Thanks for your always valuable feedback @jmacd :D

Question 1: The universal telemetry RFC describes the use of an attribute otelcol.component.outcome=failure to indicate when an export fails. Why would we need a separate counter to indicate when retry fails?

The RFC attribute only tells you whether a single export span ended in success or failure. It doesn’t say why it failed or how many items were lost. Before this change, the obsreport sender only knew that err != nil. It could increment otelcol_exporter_send_failed_*, but it couldn’t tell whether the failure was because retries were exhausted, a permanent error was returned on the first attempt, the context was cancelled, the collector shut down, etc.

By having the retry sender wrap the terminal error with experr.NewRetriesExhaustedErr, the obsreport sender can now distinguish “we ran out of retries” from other failure cases. We found this metric valuable in the past because that distinction matters operationally: running out of retries usually points to a long-lived availability problem on the destination side, while other failures (permanent errors, shutdown, context cancellation) have different remediation.
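
To make that concrete, here is a hedged sketch of how an obsreport-style sender could split the two counters. The struct, field, and function names are illustrative rather than the exact ones in this PR, and the error type from the earlier sketch is re-declared locally so the snippet stands alone.

```go
package obsreportsketch

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel/metric"
)

// retriesExhaustedErr mirrors the hypothetical wrapper from the earlier
// sketch so this snippet stands alone.
type retriesExhaustedErr struct{ err error }

func (e retriesExhaustedErr) Error() string { return "retries exhausted: " + e.err.Error() }
func (e retriesExhaustedErr) Unwrap() error { return e.err }

type obsSender struct {
	sendFailed   metric.Int64Counter // e.g. otelcol_exporter_send_failed_spans
	retryDropped metric.Int64Counter // e.g. otelcol_exporter_retry_dropped_spans
}

// endOp records terminal outcomes: every failure still counts toward
// send_failed, and only failures wrapped as "retries exhausted" also
// count toward the new retry_dropped counter.
func (s *obsSender) endOp(ctx context.Context, numItems int64, err error) {
	if err == nil {
		return
	}
	s.sendFailed.Add(ctx, numItems)
	var re retriesExhaustedErr
	if errors.As(err, &re) {
		s.retryDropped.Add(ctx, numItems)
	}
}
```

The key point is that send_failed still counts every terminal failure, while retry_dropped only counts the subset where the retry budget was spent.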

Question 2: If the exporterhelper is configured with wait_for_result=true then it's difficult to call these failures "drops". Wouldn't the same sort of "drop" happen if the queue is configured (without wait_for_result=true) but also without the retry processor?

wait_for_result only controls whether the queue’s Offer call waits for the downstream sender to finish. When it’s true, upstream components see the error immediately; when false, they don’t. In both cases, once the retry sender gives up, the data is gone: the collector has accepted it but cannot deliver it. So it still qualifies as a drop.

The queue already accounts for the situations it is responsible for (otelcol_exporter_enqueue_failed_* covers queue-capacity drops). What it cannot know is why the downstream sender failed. It simply forwards the error it gets back.

In the configuration you mentioned (queue enabled, wait_for_result=false, retry disabled), the queue returns success to the producer, the exporter fails, and obsReportSender.endOp increments otelcol_exporter_send_failed_*. No retry ever ran, so the new retry-drop counter remains zero. That’s intentional: the terminal failure was due to a permanent error, not because a retry budget was exhausted. Conversely, when retries are enabled and eventually fail, the retry sender wraps the error and the obsreport sender increments both send_failed and the new retry_dropped counters. Upstream may or may not have seen the error, depending on wait_for_result, but the counter captures the fact that “we tried retrying and still had to drop these items.”

So the queue doesn’t have enough context to produce a “retry exhausted” metric, while the retry sender does. That’s why the new counters live alongside the retry logic instead of inside the queue.

@jade-guiton-dd
Contributor

(For the record, the type of failure that occurred is already visible in logs. Of course, that doesn't mean we can't also surface it as metrics.)

…r exporter helper retries

Signed-off-by: Israel Blancas <[email protected]>
jaysoncena added a commit to jaysoncena/opentelemetry-collector that referenced this pull request Nov 19, 2025
@iblancasa iblancasa requested a review from jmacd November 19, 2025 19:25
@codspeed-hq

codspeed-hq bot commented Nov 25, 2025

CodSpeed Performance Report

Merging #13957 will improve performance by ×4.3

Comparing iblancasa:13956 (3aabe2c) with main (fd17e51)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 1 improvement
✅ 72 untouched

Benchmarks breakdown

Benchmark            BASE      HEAD     Change
zstdWithConcurrency  28.1 µs   6.5 µs   ×4.3

@iblancasa
Contributor Author

@open-telemetry/collector-approvers can you take a look at this PR?

Contributor

@jade-guiton-dd jade-guiton-dd left a comment


I know some have reservations about adding new metrics that are enabled by default, so I would be interested in other maintainers' opinions on whether we should add this or not, and whether it should be under level: detailed / behind a feature gate.

I'm especially ambivalent about this given that the error message (visible in both logs and spans) already allows telling this scenario apart, and gives more information than the metric does.

@iblancasa
Contributor Author

I'm especially ambivalent about this given that the error message (visible in both logs and spans) already allows telling this scenario apart, and gives more information than the metric does.

I understand, but with the metric approach teams can set threshold-based alerts or SLO burn-rate monitors on “retry drops” without building custom log pipelines.

@jade-guiton-dd
Contributor

That's true. I'm just not convinced of the utility of monitoring only "retry exhausted" failures, as opposed to all export failures, including permanent ones. You mentioned previously that the remediation is not the same, but in all cases, remediation will require you to check the logs to see the specifics, no?

@iblancasa
Contributor Author

You're right that logs are needed to find the root cause, but the key problem we're solving is alert fatigue. Right now the send_failed metric increments on every failed attempt. This means teams either get noisy alerts for transient errors that self-heal, or they raise the alert threshold to match the retry duration, which delays detection of real problems.

We got feedback from users telling us they actually want to alert on “did we lose data?”, not “did an attempt fail?”. If the first export fails but succeeds on retry, firing an alert is of little use. But if retries are exhausted, data is lost and that needs immediate attention.

@jade-guiton-dd
Contributor

jade-guiton-dd commented Dec 1, 2025

That doesn't seem right... My understanding of exporterhelper (and my experience using it so far) is that the send_failed metric only increments when the retry sender returns an error, i.e. when the exporter either returned a permanent error or ran out of retries.

So assuming that your exporter doesn't return errors marked as permanent for "transient" issues (which I think would be a bug with that particular exporter), the send_failed metric should already do what you want.

To put it another way, if you see an increment of send_failed due to retryable errors, that means the exporterhelper ran out of retries, which means this new retry_dropped metric will be incremented as well, so it won't help with alert fatigue.

If there are further retries after that, it's because a component upstream of the exporterhelper is performing retries. (Either a client, or maybe something like the loadbalancing exporter.) In those cases, I think the best way to reduce alert fatigue would be to change the config to allow the exporterhelper to retry for longer. Simply because the exporterhelper can't possibly tell whether upstream will retry a request, and filter out the error metric in that case.
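
For illustration, a stripped-down retry loop along these lines shows why: a permanent error returns immediately, while a retryable error only surfaces, wrapped as retries-exhausted, once the elapsed-time budget runs out. This is a simplified sketch with invented field names, not the actual retry_sender.go code.

```go
package retrysketch

import (
	"context"
	"fmt"
	"time"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// request and sender stand in for the exporterhelper request pipeline.
type request interface{}

type sender interface {
	send(ctx context.Context, req request) error
}

// NewRetriesExhaustedErr stands in for the wrapper sketched earlier.
func NewRetriesExhaustedErr(err error) error {
	return fmt.Errorf("retries exhausted: %w", err)
}

type retrySender struct {
	next            sender
	initialInterval time.Duration
	maxInterval     time.Duration
	maxElapsedTime  time.Duration
}

func (rs *retrySender) send(ctx context.Context, req request) error {
	start := time.Now()
	backoff := rs.initialInterval
	for {
		err := rs.next.send(ctx, req)
		if err == nil {
			return nil
		}
		// A permanent error is terminal on the first attempt: no retries,
		// and send_failed is incremented exactly once.
		if consumererror.IsPermanent(err) {
			return err
		}
		// A retryable error only becomes terminal (and countable) once the
		// elapsed-time budget is spent.
		if time.Since(start)+backoff > rs.maxElapsedTime {
			return NewRetriesExhaustedErr(err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff = min(backoff*2, rs.maxInterval)
	}
}
```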

@iblancasa
Contributor Author

Actually, you are right. I reviewed it again: send_failed only increments on terminal failures, not on intermediate retries; I misspoke. Since the PR has been open for so long, I had partially lost the context of the problem being solved.

The real value here is distinguishing why it failed. For instance, we can detect whether a send failure is caused by availability issues in the backend versus another kind of issue in the collector (bad data or configuration, perhaps). This lets teams alert on retry_dropped specifically for availability issues without conflating them with data-quality problems. Both are terminal failures, but they need different remediation.

Regarding your point about upstream retries: you're absolutely right that if retries are configured at multiple layers (like loadbalancing exporter + individual exporters), then tuning the retry settings at the exporterhelper level is the right approach. The metric is most useful when the exporterhelper is the only retry layer.

Contributor

@jade-guiton-dd jade-guiton-dd left a comment


A few remaining issues, but looks mostly good. I'll check with other approvers if there are objections to a new level: detailed metric.

@jmacd
Contributor

jmacd commented Dec 1, 2025

At a philosophical level I would prefer to see a new detailed-level attribute to explain failure causes or categories on the existing metrics, i.e. new attributes, not new instruments.

I would call this a reason attribute. I also agree there's some confusion: "retries exhausted" is not the reason for failure; it just means the last retry and all the prior attempts returned a transient failure. So I imagine a different solution is to add a new boolean attribute, emitted only at metric level detailed, which is permanent=true/false. A slightly better, if more complicated, version of this is to attach the HTTP status or gRPC code to say why the failure happened in more detail than a simple success/failure outcome.
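
A hedged sketch of that attribute-based alternative (the attribute name and wiring here are illustrative, not a concrete proposal):

```go
package attrsketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSendFailed adds to the existing send-failed counter, attaching a
// detailed-level attribute describing the failure category instead of
// introducing a new instrument. The "permanent" attribute name is illustrative.
func recordSendFailed(ctx context.Context, counter metric.Int64Counter, numItems int64, err error) {
	counter.Add(ctx, numItems, metric.WithAttributes(
		attribute.Bool("permanent", consumererror.IsPermanent(err)),
	))
}
```

A richer variant could attach the gRPC code or HTTP status as the attribute value instead of a boolean, as suggested above.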

@iblancasa
Contributor Author

I applied the requested feedback. At the same time, I found @jmacd's suggestion very interesting, so I created a PR with that solution here: #14247

Please tell me which solution you prefer and I will close the PR with the undesired approach :)

@jade-guiton-dd
Contributor

Personally, I would prefer the attributes solution: it avoids redundancy, makes it easier to isolate failures that weren't due to exhausted retries (subtracting metrics is not really an exact science), and provides the opportunity to add more information / use cases in the future.

@iblancasa
Contributor Author

Closing in favor of #14247

@iblancasa iblancasa closed this Dec 9, 2025