
fix(telemetry): retain spans in pending list when upload fails #417

Closed
mariobrajkovski wants to merge 1 commit into main from fix/telemetry-span-loss-on-transient-upload-failure-v2

@mariobrajkovski mariobrajkovski commented Jun 8, 2026

Problem

When a run hits a brief upload hiccup mid-flight, some of the steps it recorded never make it into the trace viewer. The run itself finishes cleanly — no error surfaced, nothing in the logs — so at a glance everything looks fine.

Root Cause

_do_upload catches all exceptions internally and returns False on failure — it never raises. Because of this, f.exception() is always None regardless of whether the upload actually succeeded or silently failed.
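To make the point concrete, here is a minimal sketch of how a function that swallows its own errors looks to a `concurrent.futures.Future`. `do_upload_stub` is a hypothetical stand-in for `_do_upload`, assumed only to catch everything and return a bool as described above; it is not the real implementation.

```python
import concurrent.futures as cf


def do_upload_stub(should_fail: bool) -> bool:
    """Hypothetical stand-in for _do_upload: swallows errors, returns a bool."""
    try:
        if should_fail:
            raise ConnectionError("simulated transient network failure")
        return True
    except Exception:
        # The error is swallowed here, so the Future never sees it.
        return False


with cf.ThreadPoolExecutor(max_workers=1) as pool:
    ok_future = pool.submit(do_upload_stub, False)
    failed_future = pool.submit(do_upload_stub, True)

# From the executor's point of view both calls completed normally.
ok_exception = ok_future.exception()
failed_exception = failed_future.exception()
ok_result = ok_future.result()
failed_result = failed_future.result()

print(ok_exception, failed_exception)  # None None
print(ok_result, failed_result)        # True False
```

Only the return value distinguishes the two futures, which is why a guard based on `f.exception()` cannot tell a failed upload from a successful one.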

The _cleanup_done callback used if not f.exception() as its success guard:

# BEFORE (broken)
def _cleanup_done(f: cf.Future[bool]) -> None:
    ...
    if not f.exception():          # always True — even on failure!
        _pending_spans[task_run_id].remove(span)

Because the condition was always True, the span was removed from _pending_spans even when _do_upload returned False (i.e. the upload failed). When flush() ran at context exit, _pending_spans no longer contained that span, so there was nothing to retry. The span was silently dropped.

Fix

Check f.result() is True instead of f.exception() is None. The span is only removed from the pending list when the upload function explicitly signals success. Any other outcome — False return, raised exception, cancelled future — leaves the span in place so flush() can retry it before the eval context tears down.

# AFTER (fixed)
def _cleanup_done(f: cf.Future[bool]) -> None:
    ...
    upload_succeeded = False
    with contextlib.suppress(Exception):
        upload_succeeded = f.result() is True
    if upload_succeeded:
        _pending_spans[task_run_id].remove(span)
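As a quick check of the claim that every non-success outcome keeps the span, the sketch below applies the same guard to futures in each final state. `upload_succeeded` and `make_future` are hypothetical helpers written for this illustration; the futures are constructed directly, which the standard library permits for testing.

```python
import concurrent.futures as cf
import contextlib


def upload_succeeded(f: cf.Future[bool]) -> bool:
    """Same guard as the fix: True only for an explicit True result."""
    succeeded = False
    # result() re-raises the task's exception, or CancelledError if the
    # future was cancelled; both are Exception subclasses and are suppressed.
    with contextlib.suppress(Exception):
        succeeded = f.result() is True
    return succeeded


def make_future(result=None, exception=None, cancel=False) -> cf.Future[bool]:
    """Hypothetical helper: build a Future that is already in a final state."""
    f: cf.Future[bool] = cf.Future()
    if cancel:
        f.cancel()
    elif exception is not None:
        f.set_exception(exception)
    else:
        f.set_result(result)
    return f


outcomes = {
    "returned True": upload_succeeded(make_future(result=True)),
    "returned False": upload_succeeded(make_future(result=False)),
    "raised": upload_succeeded(make_future(exception=RuntimeError("boom"))),
    "cancelled": upload_succeeded(make_future(cancel=True)),
}
print(outcomes)
```

Only the "returned True" case evaluates to True, so only that case would remove the span from the pending list.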

Behaviour

| Scenario | Before | After |
| --- | --- | --- |
| Upload succeeds | Span removed from pending ✅ | Span removed from pending ✅ |
| Upload fails (network hiccup) | Span incorrectly removed | Span retained for flush() retry ✅ |
| flush() retry on exit | Nothing to retry (span already lost) ❌ | Retries and delivers the span ✅ |
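The failure rows of the table can be reproduced with a small simulation. This is a sketch under stated assumptions, not the project's code: the pending list, the upload that fails once and then succeeds, and the synchronous retry loop standing in for flush() are all hypothetical.

```python
import concurrent.futures as cf
import contextlib


def run_scenario(use_fixed_guard: bool) -> tuple[list[str], list[str]]:
    """Return (spans pending after the callback, spans delivered after flush)."""
    pending: list[str] = ["span-1"]
    delivered: list[str] = []
    attempts = {"count": 0}

    def do_upload(span: str) -> bool:
        # Assumed behaviour: the first attempt fails, later attempts succeed.
        attempts["count"] += 1
        if attempts["count"] == 1:
            return False  # error swallowed and reported as False
        delivered.append(span)
        return True

    def cleanup_done(f: cf.Future[bool]) -> None:
        if use_fixed_guard:
            succeeded = False
            with contextlib.suppress(Exception):
                succeeded = f.result() is True
        else:
            succeeded = not f.exception()  # old guard: True even on failure
        if succeeded:
            pending.remove("span-1")

    # Leaving the with-block waits for the worker, so the callback has run.
    with cf.ThreadPoolExecutor(max_workers=1) as pool:
        pool.submit(do_upload, "span-1").add_done_callback(cleanup_done)

    pending_after_callback = list(pending)

    # Hypothetical flush(): retry whatever is still pending, synchronously.
    for span in list(pending):
        if do_upload(span):
            pending.remove(span)

    return pending_after_callback, delivered


before = run_scenario(use_fixed_guard=False)
after = run_scenario(use_fixed_guard=True)
print(before)  # ([], [])
print(after)   # (['span-1'], ['span-1'])
```

With the old guard the span has already left the pending list when the retry loop runs, so it is never delivered. With the fixed guard it is still pending and the retry delivers it.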

Note

Medium Risk
Changes telemetry delivery semantics on upload failure; low blast radius but affects trace completeness for eval runs.

Overview
Fixes silent loss of telemetry spans when async uploads fail: the _cleanup_done future callback no longer treats “no exception” as success.

Because _do_upload swallows errors and returns False, f.exception() was always None, so spans were removed from _pending_spans even on failed uploads and flush() had nothing to retry at eval exit. The callback now removes a span only when f.result() is True, keeping failed uploads in the pending list for exit-time retry.

Reviewed by Cursor Bugbot for commit 7251875. Bugbot is set up for automated code reviews on this repo.

_do_upload catches all exceptions internally and returns False on
failure — it never raises. This means f.exception() is always None
regardless of whether the upload actually succeeded, so the previous
guard 'if not f.exception()' evaluated to True even on failure,
causing _cleanup_done to silently remove the span from _pending_spans
as if it had been delivered successfully.

When a run hit a transient network hiccup mid-flight, the affected
spans were evicted from _pending_spans by the callback, so the
flush() call at context exit found nothing to retry. The run finished
cleanly (no exception surfaced to user code), but those steps were
permanently lost from the trace viewer.

Fix: check f.result() is True instead of f.exception() is None.
Only remove a span from the pending list when the upload function
explicitly signals success. Any other outcome (False return, raised
exception, cancelled future) leaves the span in place so flush() can
retry it before the eval context is torn down.
@mariobrajkovski mariobrajkovski deleted the fix/telemetry-span-loss-on-transient-upload-failure-v2 branch June 8, 2026 20:16