
Conversation

@iblancasa
Contributor

Description

  • Add a dedicated experr.NewRetriesExhaustedErr wrapper so exporters can detect when all retry attempts have failed (an illustrative sketch of such a wrapper follows this list)
  • Record new otelcol_exporter_retry_dropped_{spans,metric_points,log_records} counters when retries are exhausted, alongside the existing send-failed metrics
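For illustration only, here is a minimal sketch of what such a marker wrapper could look like. The package name matches the PR description, but the type, helper, and exact signatures below are assumptions rather than the code actually added in this PR:

```go
// Hypothetical sketch, not the actual experr implementation in this PR.
package experr

import "errors"

// retriesExhaustedErr marks an error as the terminal failure left over
// after the retry budget was fully consumed.
type retriesExhaustedErr struct {
	err error
}

// NewRetriesExhaustedErr wraps err so downstream senders can detect that
// all retry attempts failed.
func NewRetriesExhaustedErr(err error) error {
	return retriesExhaustedErr{err: err}
}

func (e retriesExhaustedErr) Error() string { return "no more retries left: " + e.err.Error() }
func (e retriesExhaustedErr) Unwrap() error { return e.err }

// IsRetriesExhaustedErr reports whether err, or any error it wraps, was
// produced by NewRetriesExhaustedErr.
func IsRetriesExhaustedErr(err error) bool {
	var target retriesExhaustedErr
	return errors.As(err, &target)
}
```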

Link to tracking issue

Fixes #13956

@iblancasa iblancasa requested review from a team, bogdandrutu and dmitryax as code owners October 9, 2025 14:46
@codecov

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 98.76543% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 92.26%. Comparing base (6f29b34) to head (0630965).

Files with missing lines Patch % Lines
exporter/exporterhelper/internal/retry_sender.go 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13957      +/-   ##
==========================================
- Coverage   92.27%   92.26%   -0.02%     
==========================================
  Files         657      657              
  Lines       41111    41188      +77     
==========================================
+ Hits        37936    38001      +65     
- Misses       2173     2181       +8     
- Partials     1002     1006       +4     


@iblancasa iblancasa force-pushed the 13956 branch 3 times, most recently from 869d3f8 to ddfd2b6 on October 21, 2025 10:47
@iblancasa
Contributor Author

@open-telemetry/collector-approvers can you take a look?

…r exporter helper retries

Signed-off-by: Israel Blancas <[email protected]>
Contributor

@jmacd jmacd left a comment


Non-blocking feedback (cc @jade-guiton-dd @axw).

Question 1: The universal telemetry RFC describes the use of an attribute otelcol.component.outcome=failure to indicate when an export fails. Why would we need a separate counter to indicate when retry fails?

Question 2: If the exporterhelper is configured with wait_for_result=true then it's difficult to call these failures "drops". Wouldn't the same sort of "drop" happen if the queue is configured (without wait_for_result=true) but also without the retry sender?

I guess these questions lead me to suspect that it's the queue (not the retry sender) that should count drops, i.e. requests that fail with no response propagated upstream because wait_for_result=false. Otherwise, failures are failures; I see no reason to count them in a new way.

@iblancasa
Contributor Author

Thanks for your always valuable feedback @jmacd :D

Question 1: The universal telemetry RFC describes the use of an attribute otelcol.component.outcome=failure to indicate when an export fails. Why would we need a separate counter to indicate when retry fails?

The RFC attribute only tells you whether a single export operation ended in success or failure. It doesn't say why it failed or how many items were lost. Before this change, the obsreport sender only knew that err != nil: it could increment otelcol_exporter_send_failed_*, but it couldn't tell whether the failure happened because retries were exhausted, a permanent error was returned on the first attempt, the context was cancelled, the collector shut down, and so on.

By having the retry sender wrap the terminal error with experr.NewRetriesExhaustedErr, the obsreport sender can now distinguish “we ran out of retries” from other failure cases. We found this metric valuable in the past because that distinction matters operationally: running out of retries usually points to a long-lived availability problem on the destination side, while other failures (permanent errors, shutdown, context cancellation) have different remediation.
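To make that concrete, here is a heavily simplified sketch of the give-up path. The attempt-counting loop is illustrative (the real retry sender uses configurable exponential backoff), NewRetriesExhaustedErr refers to the hypothetical wrapper sketched in the description above, and consumererror.IsPermanent is the collector's existing helper for permanent errors:

```go
import (
	"context"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// exportWithRetries is an illustrative stand-in for the retry sender:
// export represents the downstream sender, and maxAttempts replaces the
// real backoff/elapsed-time configuration.
func exportWithRetries(ctx context.Context, maxAttempts int, export func(context.Context) error) error {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if lastErr = export(ctx); lastErr == nil {
			return nil
		}
		if consumererror.IsPermanent(lastErr) {
			// Permanent errors are never retried, so they are not wrapped:
			// the failure was not caused by an exhausted retry budget.
			return lastErr
		}
		// ... back off here, returning early if ctx is cancelled ...
	}
	// Only the terminal "out of retries" failure carries the marker,
	// so the metrics layer can tell it apart from other failure modes.
	return NewRetriesExhaustedErr(lastErr)
}
```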

Question 2: If the exporterhelper is configured with wait_for_result=true then it's difficult to call these failures "drops". Wouldn't the same sort of "drop" happen if the queue is configured (without wait_for_result=true) but also without the retry sender?

wait_for_result only controls whether the queue's Offer call waits for the downstream sender to finish. When it's true, upstream components see the error immediately; when false, they don't. In both cases, once the retry sender gives up, the data is gone: the collector has accepted it but cannot deliver it. So it still qualifies as a drop.

The queue already accounts for the situations it is responsible for (otelcol_exporter_enqueue_failed_* covers queue-capacity drops). What it cannot know is why the downstream sender failed. It simply forwards the error it gets back.

In the configuration you mentioned (queue enabled, wait_for_result=false, retry disabled), the queue returns success to the producer, the export fails, and obsReportSender.endOp increments otelcol_exporter_send_failed_*. No retry ever ran, so the new retry-drop counter stays at zero. That's intentional: the terminal failure was due to a permanent error, not because a retry budget was exhausted. Conversely, when retries are enabled and eventually fail, the retry sender wraps the error and the obsreport sender increments both the send_failed and the new retry_dropped counters. Upstream may or may not have seen the error depending on wait_for_result, but the counter captures the fact that "we tried retrying and still had to drop these items."
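As a rough sketch of that accounting (the counter names mirror the PR description; the function shape and the way the instruments and detection helper are passed in are assumptions, not the actual obsreport code):

```go
import (
	"context"

	"go.opentelemetry.io/otel/metric"
)

// recordFailedSpans illustrates the endOp-style bookkeeping described
// above: every terminal failure is counted as send-failed, and the items
// are additionally counted as retry-dropped when the error carries the
// retries-exhausted marker.
func recordFailedSpans(
	ctx context.Context,
	err error,
	numSpans int64,
	sendFailed, retryDropped metric.Int64Counter,
	isRetriesExhausted func(error) bool, // e.g. the hypothetical experr.IsRetriesExhaustedErr sketched above
) {
	if err == nil {
		return
	}
	sendFailed.Add(ctx, numSpans) // existing otelcol_exporter_send_failed_spans
	if isRetriesExhausted(err) {
		retryDropped.Add(ctx, numSpans) // new otelcol_exporter_retry_dropped_spans
	}
}
```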

So the queue doesn’t have enough context to produce a “retry exhausted” metric, while the retry sender does. That’s why the new counters live alongside the retry logic instead of inside the queue.

@jade-guiton-dd
Contributor

(For the record, the type of failure that occurred is already visible in logs. Of course, that doesn't mean we can't also surface it as metrics.)



Development

Successfully merging this pull request may close these issues.

[exporter/exporterhelper] Add exporter retry-drop metrics
