fix(anthropic/openai/google): wrap io.ErrUnexpectedEOF as ProviderError#198

Merged
andreynering merged 4 commits into
charmbracelet:mainfrom
ljuti:fix/anthropic-transient-stream-retry
Apr 22, 2026

Conversation

@ljuti ljuti commented Apr 11, 2026

Contributor

Summary

When a provider's SSE stream drops mid-response with a raw I/O error like io.ErrUnexpectedEOF (idle timeouts, proxy resets, mid-stream TCP disconnects), toProviderErr in each of the three direct providers returns the error unchanged because errors.As(err, &apiErr) fails for non-API-error values. The retry loop gates on errors.As(err, &ProviderError) + IsRetryable(), so the retry branch is never entered and the bare error propagates to the caller on the first attempt.

This is especially ironic because ProviderError.IsRetryable in errors.go already special-cases io.ErrUnexpectedEOF as retryable; that check is simply never reached for raw stream transport errors in any direct provider except kronk.

This PR wraps io.ErrUnexpectedEOF (including wrapped forms via errors.Is) into a ProviderError with Cause set in all three direct providers, so the existing IsRetryable path engages and transient mid-stream disconnects are retried as intended.

Provider coverage

  • anthropic: fixed
  • openai: fixed
  • google: fixed
  • kronk: no change
  • bedrock: inherits anthropic fix
  • azure: inherits openai fix
  • openaicompat: inherits openai fix
  • openrouter: inherits openai fix
  • vercel: inherits openai fix

Failure trace (Anthropic example)

  1. providers/anthropic/anthropic.go: stream.Err() returns io.ErrUnexpectedEOF from the SSE decoder.
  2. toProviderErr(err) is called.
  3. providers/anthropic/error.go (before this PR): errors.As(err, &apiErr) fails; raw error returned unchanged.
  4. retry.go:112-113: errors.As(err, &providerErr) fails; retry branch skipped.
  5. retry.go:132-133: first-attempt non-retryable errors return without wrapping in RetryError. Caller gets bare unexpected EOF.

Net effect: IsRetryable never runs, and a failure mode the code already knows how to handle is treated as fatal.

Repro

Reliably reproducible on long-running tool-heavy sessions (~8 minutes of continuous streaming) where idle timeouts / proxy resets sever the SSE connection mid-response. The top-level error bubbles up as a bare unexpected EOF after thousands of successful stream parts.

Change

Minimal targeted wrap in providers/anthropic/error.go:

if errors.Is(err, io.ErrUnexpectedEOF) {
    return &fantasy.ProviderError{
        Title:   "stream transport error",
        Message: err.Error(),
        Cause:   err,
    }
}

Only the io.ErrUnexpectedEOF case is wrapped, to stay conservative. Other transient network errors (*net.OpError, syscall.ECONNRESET, http2.StreamError) are good candidates for a follow-up but aren't included here.

Tests

New providers/anthropic/error_test.go covers:

  • TestToProviderErr_WrapsUnexpectedEOF — direct, wrapped once, wrapped twice; verifies the result is *fantasy.ProviderError, Cause chains back to io.ErrUnexpectedEOF, and IsRetryable() returns true.

  • TestToProviderErr_PassesThroughUnrelatedErrors — arbitrary non-EOF errors are returned unchanged.

  • TestToProviderErr_PassesThroughPlainEOF — clean io.EOF is not wrapped (the stream handler treats io.EOF as a clean terminator; wrapping would invite false retries).

  • I have read CONTRIBUTING.md.

…engage

When the Anthropic SSE stream drops mid-response with a raw I/O error
like io.ErrUnexpectedEOF, toProviderErr previously returned the error
unchanged because errors.As(err, &apiErr) fails for non-*anthropic.Error
values. The retry loop in retry.go gates on errors.As(err, &ProviderError),
so the retry branch is never entered and the bare error is returned to
the caller on the first attempt.

This is especially ironic because ProviderError.IsRetryable already
special-cases io.ErrUnexpectedEOF as retryable — that check is simply
never reached for raw stream transport errors.

Wrap io.ErrUnexpectedEOF (including wrapped forms via errors.Is) into a
ProviderError with Cause set, so the existing IsRetryable path engages
and transient mid-stream disconnects are retried as intended.

Observed on long-running tool-heavy sessions (~8 minutes of continuous
streaming) where idle timeouts / proxy resets would otherwise abort the
whole turn with a bare "unexpected EOF".
…ies engage

Same bug as anthropic/error.go: when the SSE stream drops mid-response
with a raw io.ErrUnexpectedEOF, toProviderErr returned the error
unchanged because errors.As(err, &apiErr) fails for non-API-error
values. The retry loop in retry.go gates on *ProviderError, so the
retry branch is never entered and the bare error is returned to the
caller on the first attempt.

ProviderError.IsRetryable already special-cases io.ErrUnexpectedEOF
as retryable — wrapping it here lets that existing check engage.

Transitively fixes azure, openaicompat, openrouter, and vercel (which
wrap the openai provider). bedrock was already covered by the
anthropic fix. kronk is unaffected — it already wraps all errors as
ProviderError.
@ljuti ljuti changed the title from "fix(anthropic): wrap io.ErrUnexpectedEOF as ProviderError so retries engage" to "fix(providers): wrap io.ErrUnexpectedEOF as ProviderError in anthropic/openai/google" on Apr 11, 2026
ljuti commented Apr 11, 2026

Contributor Author

Found the issue to be in openai and google providers as well. Updated the title and the description, and added another commit to fix these two as well.

@andreynering andreynering left a comment

Member

Good patch. Thank you @ljuti!

@andreynering andreynering changed the title from "fix(providers): wrap io.ErrUnexpectedEOF as ProviderError in anthropic/openai/google" to "fix(anthropic/openai/google): wrap io.ErrUnexpectedEOF as ProviderError" on Apr 22, 2026
@andreynering andreynering enabled auto-merge (squash) April 22, 2026 15:01
@andreynering andreynering merged commit 471520c into charmbracelet:main Apr 22, 2026
8 checks passed