Add retry dropped item metrics and an exhausted retry error marker for exporter helper retries #13957
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #13957   +/-   ##
========================================
  Coverage   92.16%   92.17%
========================================
  Files         668      668
  Lines       41463    41557     +94
========================================
+ Hits        38216    38305     +89
- Misses       2214     2217      +3
- Partials     1033     1035      +2
@open-telemetry/collector-approvers, can you take a look?
jmacd left a comment:
Non-blocking feedback (cc @jade-guiton-dd @axw).
Question 1: The universal telemetry RFC describes the use of an attribute `otelcol.component.outcome=failure` to indicate when an export fails. Why would we need a separate counter to indicate when retry fails?
Question 2: If the exporterhelper is configured with `wait_for_result=true`, then it's difficult to call these failures "drops". Wouldn't the same sort of "drop" happen if the queue is configured (without `wait_for_result=true`) but also without the retry sender?
I guess these questions lead me to suspect that it's the queue (not the retry sender) that should count drops, i.e. requests that fail and have no upstream response returned because `wait_for_result=false`. Otherwise, failures are failures; I see no reason to count them in a new way.
Thanks for your always valuable feedback @jmacd :D
The RFC attribute only tells you whether a single export span ended in success or failure. It doesn't say why it failed or how many items were lost. Before this change, the obsreport sender only knew that a send had failed; it couldn't tell whether the retry budget had been exhausted. By having the retry sender wrap the terminal error with the new exhausted-retry marker, downstream senders can tell those failures apart and count the dropped items.
The queue already accounts for the situations it is responsible for, such as items rejected when the queue is full. In the configuration you mentioned (queue enabled, `wait_for_result=false`, no retry sender), the failure happens after the item has already left the queue and there is no retry budget to exhaust in the first place. So the queue doesn't have enough context to produce a "retry exhausted" metric, while the retry sender does. That's why the new counters live alongside the retry logic instead of inside the queue.
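As a rough illustration of that wrapping pattern (the type name `retriesExhaustedError` and the surrounding functions below are hypothetical, not the identifiers introduced in this PR), the retry sender can wrap its final error in a marker type that downstream senders detect with `errors.As`:

```go
package main

import (
	"errors"
	"fmt"
)

// retriesExhaustedError is a hypothetical marker wrapped around the last
// error once the retry budget is spent, so downstream senders can tell
// "retries exhausted" apart from other failures.
type retriesExhaustedError struct {
	err error
}

func (e *retriesExhaustedError) Error() string {
	return fmt.Sprintf("no more retries left: %v", e.err)
}

func (e *retriesExhaustedError) Unwrap() error { return e.err }

// send simulates the retry sender giving up and returning the marked error.
func send() error {
	lastErr := errors.New("backend unavailable")
	return &retriesExhaustedError{err: lastErr}
}

func main() {
	err := send()

	// A downstream, obsreport-style sender detects the marker and can
	// count the dropped items under a dedicated metric or attribute.
	var exhausted *retriesExhaustedError
	if errors.As(err, &exhausted) {
		fmt.Println("retries exhausted, counting items as dropped:", exhausted)
	}
}
```

Because the marker wraps the original error, the underlying cause remains visible in error messages and logs; the marker only adds a machine-checkable signal that the retry budget was spent.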
(For the record, the type of failure that occurred is already visible in logs. Of course, that doesn't mean we can't also surface it as metrics.)
CodSpeed Performance Report

Merging #13957 will improve performance by ×4.3.

| | Benchmark | BASE | HEAD | Change |
|---|---|---|---|---|
| ⚡ | zstdWithConcurrency | 28.1 µs | 6.5 µs | ×4.3 |
@open-telemetry/collector-approvers, can you take a look at this PR?
jade-guiton-dd left a comment:
I know some have reservations about adding new metrics that are enabled by default, so I would be interested in other maintainers' opinions on whether we should add this or not, and whether it should be under level: detailed / behind a feature gate.
I'm especially ambivalent about this given that the error message (visible in both logs and spans) already allows telling this scenario apart, and gives more information than the metric does.
I understand, but with the metric approach teams can set threshold-based alerts or SLO burn-rate monitors on "retry drops" without building custom log pipelines.
That's true. I'm just not convinced of the utility of monitoring only "retry exhausted" failures, as opposed to all export failures, including permanent ones. You mentioned previously that the remediation is not the same, but in all cases, remediation will require you to check the logs to see the specifics, no?
You're right that logs are needed to find the root cause of the issue, but the key problem we're solving is alert fatigue. Right now, alerting on export failures means you can get paged for attempts that a later retry recovers. We got feedback from users telling us they actually want to alert on "did we lose data?", not "did an attempt fail?". If the first export fails but succeeds on retry, firing an alert can be useless. But if retries are exhausted, data is lost and that needs immediate attention.
That doesn't seem right... My understanding is that the existing send-failed metrics are only incremented once the exporterhelper has given up on the data, either because the error was permanent or because the retry sender exhausted its retries.
So assuming that your exporter doesn't return errors marked as permanent for "transient" issues (which I think would be a bug with that particular exporter), the existing failure metrics already correspond to data loss rather than to individual failed attempts.
To put it another way, if you see an increment of the failure metric, the exporterhelper will not try to send that data again. If there are further retries after that, it's because a component upstream of the exporterhelper is performing retries (either a client, or another component sitting in front of the exporterhelper that does its own retrying).
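For context, this is roughly how an exporter opts out of retries for non-transient failures. `consumererror.NewPermanent` is the real collector helper for marking permanent errors; the `classifyError` function and its status-code mapping below are illustrative assumptions, not code from any particular exporter:

```go
package myexporter // hypothetical exporter package

import (
	"fmt"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// classifyError sketches the split described above: errors marked permanent
// are failed immediately by the exporterhelper (no retries), while everything
// else goes through the retry sender until it succeeds or gives up.
func classifyError(statusCode int, err error) error {
	if err == nil {
		return nil
	}
	if statusCode >= 400 && statusCode < 500 {
		// e.g. bad data rejected by the backend: retrying will not help,
		// so mark it permanent and let the failure be counted right away.
		return consumererror.NewPermanent(fmt.Errorf("rejected by backend: %w", err))
	}
	// e.g. 5xx or connection errors: returned as-is and retried; only
	// counted as a failure once the retry budget is exhausted.
	return err
}
```

Under that contract, an increment of the send-failed metrics already means the exporterhelper will not retry that data again, which is the point made above.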
Actually you are right. I reviewed it again and the existing failure metrics only increment once the exporterhelper has given up, so they already represent lost data.
The real value here is distinguishing why it failed. For instance, we can detect whether the failure comes from availability problems in the backend vs. another kind of issue in the collector (bad data or configuration, maybe). This lets teams alert on backend availability (exhausted retries) separately from data or configuration problems.
Regarding your point about upstream retries: you're absolutely right that retries configured at multiple layers (like a client or an upstream component retrying on top of the exporterhelper) are outside what this marker can observe.
jade-guiton-dd left a comment:
A few remaining issues, but looks mostly good. I'll check with other approvers if there are objections to a new level: detailed metric.
At a philosophical level, I would prefer to see a new detailed-level attribute that explains failure causes or categories on the existing metrics, i.e. new attributes, not new instruments.
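A minimal sketch of that attribute-based shape, using the OpenTelemetry Go metrics API; the instrument name `exporter.send_failed_items` and the `failure.reason` attribute below are hypothetical placeholders, not names defined by the collector or this PR:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	meter := otel.Meter("exporterhelper-sketch")

	// A single counter for failed items; the failure category is carried by
	// an attribute instead of a separate instrument.
	sendFailed, err := meter.Int64Counter("exporter.send_failed_items")
	if err != nil {
		panic(err)
	}

	ctx := context.Background()

	// Hypothetical attribute values distinguishing why the items failed.
	sendFailed.Add(ctx, 42, metric.WithAttributes(
		attribute.String("failure.reason", "retry_exhausted"),
	))
	sendFailed.Add(ctx, 7, metric.WithAttributes(
		attribute.String("failure.reason", "permanent_error"),
	))
}
```

With this shape, a single instrument carries all failures, and "retry exhausted" becomes one attribute value that can be filtered on, rather than a separate counter that has to be subtracted from the total.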
Personally, I would prefer the attributes solution: it avoids redundancy, makes it easier to isolate failures that weren't due to exhausted retries (subtracting metrics is not really an exact science), and provides the opportunity to add more information / use cases in the future.
Closing in favor of #14247.
Description
Link to tracking issue: Fixes #13956