
Conversation

@iblancasa
Contributor

Description

  • Add a dedicated experr.NewRetriesExhaustedErr wrapper so exporters can detect when all retry attempts failed (see the sketch after this list)
  • Record new otelcol_exporter_retry_dropped_{spans,metric_points,log_records} counters when retries are exhausted, alongside existing send-failed metrics
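
Below is a minimal sketch of what such a wrapper could look like; the type and helper names here are hypothetical, and the actual experr additions in this PR may differ.

```go
// A minimal sketch of the wrapper idea, assuming hypothetical names
// (retriesExhaustedErr, IsRetriesExhausted); the real experr additions
// in this PR may differ.
package experr

import "errors"

// retriesExhaustedErr marks an error as the terminal failure returned
// after every retry attempt has been used up.
type retriesExhaustedErr struct {
	err error
}

// NewRetriesExhaustedErr wraps err so downstream senders can tell that
// the retry budget was exhausted.
func NewRetriesExhaustedErr(err error) error {
	return retriesExhaustedErr{err: err}
}

func (e retriesExhaustedErr) Error() string { return "retries exhausted: " + e.err.Error() }

func (e retriesExhaustedErr) Unwrap() error { return e.err }

// IsRetriesExhausted reports whether err, or any error it wraps,
// was produced by NewRetriesExhaustedErr.
func IsRetriesExhausted(err error) bool {
	var target retriesExhaustedErr
	return errors.As(err, &target)
}
```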

Link to tracking issue

Fixes #13956

@iblancasa iblancasa requested review from a team, bogdandrutu and dmitryax as code owners October 9, 2025 14:46
@codecov

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 98.90110% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 92.17%. Comparing base (7012862) to head (3aabe2c).
⚠️ Report is 11 commits behind head on main.

Files with missing lines                           Patch %   Lines
exporter/exporterhelper/internal/retry_sender.go   50.00%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #13957   +/-   ##
=======================================
  Coverage   92.16%   92.17%           
=======================================
  Files         668      668           
  Lines       41463    41557   +94     
=======================================
+ Hits        38216    38305   +89     
- Misses       2214     2217    +3     
- Partials     1033     1035    +2     


@iblancasa iblancasa force-pushed the 13956 branch 3 times, most recently from 869d3f8 to ddfd2b6 on October 21, 2025 10:47
@iblancasa
Contributor Author

@open-telemetry/collector-approvers can you take a look?

Contributor

@jmacd jmacd left a comment


Non-blocking feedback (cc @jade-guiton-dd @axw).

Question 1: The universal telemetry RFC describes the use of an attribute otelcol.component.outcome=failure to indicate when an export fails. Why would we need a separate counter to indicate when retry fails?

Question 2: If the exporterhelper is configured with wait_for_result=true then it's difficult to call these failures "drops". Wouldn't the same sort of "drop" happen if the queue is configured (without wait_for_result=true) but also without the retry processor?

I guess these questions lead me to suspect that it's the queue (not the retry sender) that should count drops, i.e. requests that fail and get no upstream response because wait_for_result=false. Otherwise, failures are failures, and I see no reason to count them in a new way.

@iblancasa
Contributor Author

Thanks for your always valuable feedback @jmacd :D

Question 1: The universal telemetry RFC describes the use of an attribute otelcol.component.outcome=failure to indicate when an export fails. Why would we need a separate counter to indicate when retry fails?

The RFC attribute only tells you whether a single export span ended in success or failure. It doesn’t say why it failed or how many items were lost. Before this change, the obsreport sender only knew that err != nil. It could increment otelcol_exporter_send_failed_*, but it couldn’t tell whether the failure was because retries were exhausted, a permanent error was returned on the first attempt, the context was cancelled, the collector shut down, etc.

By having the retry sender wrap the terminal error with experr.NewRetriesExhaustedErr, the obsreport sender can now distinguish “we ran out of retries” from other failure cases. We found this metric valuable in the past because that distinction matters operationally: running out of retries usually points to a long-lived availability problem on the destination side, while other failures (permanent errors, shutdown, context cancellation) have different remediation.
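
To make that concrete, here is a hedged sketch of how an obsreport-style sender could split the two counters. The struct, field, and function names are illustrative rather than the exact ones in this PR, and the error type from the earlier sketch is re-declared locally so the snippet stands alone.

```go
package obsreportsketch

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel/metric"
)

// retriesExhaustedErr mirrors the hypothetical wrapper from the earlier
// sketch so this snippet stands alone.
type retriesExhaustedErr struct{ err error }

func (e retriesExhaustedErr) Error() string { return "retries exhausted: " + e.err.Error() }
func (e retriesExhaustedErr) Unwrap() error { return e.err }

type obsSender struct {
	sendFailed   metric.Int64Counter // e.g. otelcol_exporter_send_failed_spans
	retryDropped metric.Int64Counter // e.g. otelcol_exporter_retry_dropped_spans
}

// endOp records terminal outcomes: every failure still counts toward
// send_failed, and only failures wrapped as "retries exhausted" also
// count toward the new retry_dropped counter.
func (s *obsSender) endOp(ctx context.Context, numItems int64, err error) {
	if err == nil {
		return
	}
	s.sendFailed.Add(ctx, numItems)
	var re retriesExhaustedErr
	if errors.As(err, &re) {
		s.retryDropped.Add(ctx, numItems)
	}
}
```

The key point is that send_failed still counts every terminal failure, while retry_dropped only counts the subset where the retry budget was spent.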

Question 2: If the exporterhelper is configured with wait_for_result=true then it's difficult to call these failures "drops". Wouldn't the same sort of "drop" happen if the queue is configured (without wait_for_result=true) but also without the retry processor?

wait_for_result only controls whether the queue’s Offer call waits for the downstream sender to finish. When it’s true, upstream components see the error immediately; when false, they don’t. In both cases, once the retry sender gives up, the data is gone: the collector has accepted it but cannot deliver it. So it still qualifies as a drop.

The queue already accounts for the situations it is responsible for (otelcol_exporter_enqueue_failed_* covers queue-capacity drops). What it cannot know is why the downstream sender failed. It simply forwards the error it gets back.

In the configuration you mentioned (queue enabled, wait_for_result=false, retry disabled), the queue returns success to the producer, the exporter fails, and obsReportSender.endOp increments otelcol_exporter_send_failed_*. No retry ever ran, so the new retry-drop counter remains zero. That’s intentional: the terminal failure was due to a permanent error, not because a retry budget was exhausted. Conversely, when retries are enabled and eventually fail, the retry sender wraps the error and the obsreport sender increments both send_failed and the new retry_dropped counters. Upstream may or may not have seen the error, depending on wait_for_result, but the counter captures the fact that “we tried retrying and still had to drop these items.”

So the queue doesn’t have enough context to produce a “retry exhausted” metric, while the retry sender does. That’s why the new counters live alongside the retry logic instead of inside the queue.

@jade-guiton-dd
Contributor

(For the record, the type of failure that occurred is already visible in logs. Of course, that doesn't mean we can't also surface it as metrics.)

…r exporter helper retries

Signed-off-by: Israel Blancas <[email protected]>
jaysoncena added a commit to jaysoncena/opentelemetry-collector that referenced this pull request Nov 19, 2025
@iblancasa iblancasa requested a review from jmacd November 19, 2025 19:25
@codspeed-hq

codspeed-hq bot commented Nov 25, 2025

CodSpeed Performance Report

Merging #13957 will improve performance by ×4.3

Comparing iblancasa:13956 (3aabe2c) with main (fd17e51)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 1 improvement
✅ 72 untouched

Benchmarks breakdown

Benchmark            BASE      HEAD     Change
zstdWithConcurrency  28.1 µs   6.5 µs   ×4.3

@iblancasa
Contributor Author

@open-telemetry/collector-approvers can you take a look at this PR?

Contributor

@jade-guiton-dd jade-guiton-dd left a comment


I know some have reservations about adding new metrics that are enabled by default, so I would be interested in other maintainers' opinions on whether we should add this or not, and whether it should be under level: detailed / behind a feature gate.

I'm especially ambivalent about this given that the error message (visible in both logs and spans) already allows telling this scenario apart, and gives more information than the metric does.

@iblancasa
Contributor Author

I'm especially ambivalent about this given that the error message (visible in both logs and spans) already allows telling this scenario apart, and gives more information than the metric does.

I understand, but with the metric approach teams can set threshold-based alerts or SLO burn-rate monitors on “retry drops” without building custom log pipelines.

@jade-guiton-dd
Contributor

That's true. I'm just not convinced of the utility of monitoring only "retry exhausted" failures, as opposed to all export failures, including permanent ones. You mentioned previously that the remediation is not the same, but in all cases, remediation will require you to check the logs to see the specifics, no?

@iblancasa
Contributor Author

You're right that logs are needed to find the root cause, but the key problem we're solving is alert fatigue. Right now the send_failed metric increments on every failed attempt. This means teams either get noisy alerts for transient errors that self-heal, or they raise the alert threshold to match the retry duration, which delays detection of real problems.

We got feedback from users telling us they actually want to alert on “did we lose data?”, not “did an attempt fail?”. If the first export fails but succeeds on retry, firing an alert is of little use. But if retries are exhausted, data is lost and that needs immediate attention.

@jade-guiton-dd
Contributor

jade-guiton-dd commented Dec 1, 2025

That doesn't seem right... My understanding of exporterhelper (and my experience using it so far) is that the send_failed metric only increments when the retry sender returns an error, i.e. when the exporter either returned a permanent error or ran out of retries.

So assuming that your exporter doesn't return errors marked as permanent for "transient" issues (which I think would be a bug with that particular exporter), the send_failed metric should already do what you want.

To put it another way, if you see an increment of send_failed due to retryable errors, that means the exporterhelper ran out of retries, which means this new retry_dropped metric will be incremented as well, so it won't help with alert fatigue.

If there are further retries after that, it's because a component upstream of the exporterhelper is performing retries. (Either a client, or maybe something like the loadbalancing exporter.) In those cases, I think the best way to reduce alert fatigue would be to change the config to allow the exporterhelper to retry for longer. Simply because the exporterhelper can't possibly tell whether upstream will retry a request, and filter out the error metric in that case.
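
For illustration, a stripped-down retry loop along these lines shows why: a permanent error returns immediately, while a retryable error only surfaces, wrapped as retries-exhausted, once the elapsed-time budget runs out. This is a simplified sketch with invented field names, not the actual retry_sender.go code.

```go
package retrysketch

import (
	"context"
	"fmt"
	"time"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// request and sender stand in for the exporterhelper request pipeline.
type request interface{}

type sender interface {
	send(ctx context.Context, req request) error
}

// NewRetriesExhaustedErr stands in for the wrapper sketched earlier.
func NewRetriesExhaustedErr(err error) error {
	return fmt.Errorf("retries exhausted: %w", err)
}

type retrySender struct {
	next            sender
	initialInterval time.Duration
	maxInterval     time.Duration
	maxElapsedTime  time.Duration
}

func (rs *retrySender) send(ctx context.Context, req request) error {
	start := time.Now()
	backoff := rs.initialInterval
	for {
		err := rs.next.send(ctx, req)
		if err == nil {
			return nil
		}
		// A permanent error is terminal on the first attempt: no retries,
		// and send_failed is incremented exactly once.
		if consumererror.IsPermanent(err) {
			return err
		}
		// A retryable error only becomes terminal (and countable) once the
		// elapsed-time budget is spent.
		if time.Since(start)+backoff > rs.maxElapsedTime {
			return NewRetriesExhaustedErr(err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff = min(backoff*2, rs.maxInterval)
	}
}
```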

@iblancasa
Contributor Author

Actually, you are right. I reviewed it again: send_failed only increments on terminal failures, not on intermediate retries; I misspoke. Since the PR has been open for so long, I had partially lost the context of the problem being solved.

The real value here is distinguishing why it failed. For instance, we can detect whether a send failure is caused by availability issues in the backend versus another kind of issue in the collector (bad data or configuration, perhaps). This lets teams alert on retry_dropped specifically for availability issues without conflating them with data-quality problems. Both are terminal failures, but they need different remediation.

Regarding your point about upstream retries: you're absolutely right that if retries are configured at multiple layers (like loadbalancing exporter + individual exporters), then tuning the retry settings at the exporterhelper level is the right approach. The metric is most useful when the exporterhelper is the only retry layer.

Contributor

@jade-guiton-dd jade-guiton-dd left a comment


A few remaining issues, but looks mostly good. I'll check with other approvers if there are objections to a new level: detailed metric.

@jmacd
Contributor

jmacd commented Dec 1, 2025

At a philosophical level I would prefer to see a new detailed-level attribute to explain failure causes or categories on the existing metrics, i.e. new attributes, not new instruments.

I would call this a reason attribute. I also agree there's some confusion: "retries exhausted" is not the reason for failure; it just means the last retry and all the prior attempts returned a transient failure. So I imagine a different solution is to add a new boolean attribute, emitted only at metric level detailed, which is permanent=true/false. A slightly better, if more complicated, version of this is to attach the HTTP status or gRPC code to say why the failure happened in more detail than a simple success/failure outcome.
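
A hedged sketch of that attribute-based alternative (the attribute name and wiring here are illustrative, not a concrete proposal):

```go
package attrsketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSendFailed adds to the existing send-failed counter, attaching a
// detailed-level attribute describing the failure category instead of
// introducing a new instrument. The "permanent" attribute name is illustrative.
func recordSendFailed(ctx context.Context, counter metric.Int64Counter, numItems int64, err error) {
	counter.Add(ctx, numItems, metric.WithAttributes(
		attribute.Bool("permanent", consumererror.IsPermanent(err)),
	))
}
```

A richer variant could attach the gRPC code or HTTP status as the attribute value instead of a boolean, as suggested above.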

@iblancasa
Contributor Author

I applied the requested feedback. At the same time, I found @jmacd's suggestion very interesting, so I created a PR with that solution here: #14247

Please tell me which solution you prefer and I will close the PR with the undesired approach :)

@jade-guiton-dd
Contributor

Personally, I would prefer the attributes solution: it avoids redundancy, makes it easier to isolate failures that weren't due to exhausted retries (subtracting metrics is not really an exact science), and provides the opportunity to add more information / use cases in the future.

@iblancasa
Contributor Author

Closing in favor of #14247

@iblancasa iblancasa closed this Dec 9, 2025