fix(telemetry): retain spans in pending list when upload fails#417
Closed
mariobrajkovski wants to merge 1 commit into
Closed
fix(telemetry): retain spans in pending list when upload fails#417mariobrajkovski wants to merge 1 commit into
mariobrajkovski wants to merge 1 commit into
Conversation
_do_upload catches all exceptions internally and returns False on failure — it never raises. This means f.exception() is always None regardless of whether the upload actually succeeded, so the previous guard 'if not f.exception()' evaluated to True even on failure, causing _cleanup_done to silently remove the span from _pending_spans as if it had been delivered successfully. When a run hit a transient network hiccup mid-flight, the affected spans were evicted from _pending_spans by the callback, so the flush() call at context exit found nothing to retry. The run finished cleanly (no exception surfaced to user code), but those steps were permanently lost from the trace viewer. Fix: check f.result() is True instead of f.exception() is None. Only remove a span from the pending list when the upload function explicitly signals success. Any other outcome (False return, raised exception, cancelled future) leaves the span in place so flush() can retry it before the eval context is torn down.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When a run hits a brief upload hiccup mid-flight, some of the steps it recorded never make it into the trace viewer. The run itself finishes cleanly — no error surfaced, nothing in the logs — so at a glance everything looks fine.
Root Cause
_do_uploadcatches all exceptions internally and returnsFalseon failure — it never raises. Because of this,f.exception()is alwaysNoneregardless of whether the upload actually succeeded or silently failed.The
_cleanup_donecallback usedif not f.exception()as its success guard:Because the condition was always
True, the span was removed from_pending_spanseven when_do_uploadreturnedFalse(i.e. the upload failed). Whenflush()ran at context exit,_pending_spanswas empty for that span, so there was nothing to retry. The span was silently dropped.Fix
Check
f.result() is Trueinstead off.exception() is None. The span is only removed from the pending list when the upload function explicitly signals success. Any other outcome —Falsereturn, raised exception, cancelled future — leaves the span in place soflush()can retry it before the eval context tears down.Behaviour
flush()retry ✅flush()retry on exitNote
Medium Risk
Changes telemetry delivery semantics on upload failure; low blast radius but affects trace completeness for eval runs.
Overview
Fixes silent loss of telemetry spans when async uploads fail: the
_cleanup_donefuture callback no longer treats “no exception” as success.Because
_do_uploadswallows errors and returnsFalse,f.exception()was alwaysNone, so spans were removed from_pending_spanseven on failed uploads andflush()had nothing to retry at eval exit. The callback now removes a span only whenf.result() is True, keeping failed uploads in the pending list for exit-time retry.Reviewed by Cursor Bugbot for commit 7251875. Bugbot is set up for automated code reviews on this repo. Configure here.