fix: do not retry or eject endpoint on client-side timeout #13

Merged
killme2008 merged 5 commits into main from fix/no-retry-client-timeout
Jun 16, 2026

Conversation

@killme2008

Member

Problem

A gRPC DEADLINE_EXCEEDED surfaces as TimeoutError (see promise-adapter), and it was classified as retriable in both modes and as an endpoint-health failure. Both are wrong for a client-side timeout:

  • Retry multiplies the latency budget. timeoutMs is the caller's hard latency budget for the write, but each attempt resets the deadline (unaryCall does new Date(Date.now() + deadlineMs) per attempt). Retrying a timeout therefore silently stretched the budget to maxAttempts × timeoutMs + backoff — a caller asking for 60s could wait minutes — and usually just timed out again.
  • It ejected healthy endpoints. isEndpointFailure(TimeoutError) returned true, so a tight caller deadline reported the endpoint as failed and could eject it. A timeout reflects the caller's clock, not the endpoint's health.
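
The budget arithmetic in the first bullet can be made concrete with a small sketch. Only timeoutMs and maxAttempts come from the description above; the function name worstCaseWaitMs, the RetryBudget shape, the fixed backoffMs between attempts, and the example values of 3 attempts and 1s backoff are assumptions made for illustration.

```typescript
// Illustrative only: the SDK's real retry configuration may be shaped differently.
interface RetryBudget {
  timeoutMs: number;   // per-attempt deadline (reset on every attempt)
  maxAttempts: number; // total attempts, including the first one
  backoffMs: number;   // assumed fixed delay between consecutive attempts
}

// Worst case when every attempt runs into its deadline:
// maxAttempts × timeoutMs, plus one backoff between each pair of attempts.
function worstCaseWaitMs({ timeoutMs, maxAttempts, backoffMs }: RetryBudget): number {
  const attempts = Math.max(1, maxAttempts);
  return attempts * timeoutMs + (attempts - 1) * backoffMs;
}

// Retrying timeouts: a 60s budget can stretch to 3 × 60s + 2 × 1s.
const retriedTimeouts = worstCaseWaitMs({ timeoutMs: 60000, maxAttempts: 3, backoffMs: 1000 });
console.log(retriedTimeouts); // 182000

// Not retrying timeouts is equivalent to a single attempt: the budget holds.
const singleAttempt = worstCaseWaitMs({ timeoutMs: 60000, maxAttempts: 1, backoffMs: 1000 });
console.log(singleAttempt); // 60000
```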

This also aligns the TS client with the Go ingester, which already excludes DEADLINE_EXCEEDED from both retry and endpoint-failure classification.

Fix

  • Make TimeoutError non-retriable in every mode via an early return (same treatment as AbortedError).
  • Exclude TimeoutError from isEndpointFailure.
  • Drop the dead DEADLINE_EXCEEDED (4) entries from CONSERVATIVE_RETRIABLE_GRPC_CODES and ENDPOINT_FAILURE_GRPC_CODES: promise-adapter always maps code 4 to TimeoutError, so a TransportError with code 4 never reaches those sets.
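
A minimal sketch of the classification described by these three bullets follows. It is not the code in src/errors.ts: the class bodies, the grpcCode field, the RetryMode names, the AGGRESSIVE_RETRIABLE_GRPC_CODES set, and the membership of every set are assumptions. What it takes from the description is the early return for TimeoutError and AbortedError, the exclusion of TimeoutError from isEndpointFailure, and the absence of code 4 from the code sets.

```typescript
// Hypothetical stand-ins for the SDK's error classes.
class TimeoutError extends Error {}
class AbortedError extends Error {}
class TransportError extends Error {
  readonly grpcCode: number;
  constructor(message: string, grpcCode: number) {
    super(message);
    this.grpcCode = grpcCode;
  }
}

type RetryMode = "conservative" | "aggressive"; // mode names are assumed

// Assumed membership. The point is only that DEADLINE_EXCEEDED (4) is absent.
const CONSERVATIVE_RETRIABLE_GRPC_CODES: ReadonlySet<number> = new Set([14]); // UNAVAILABLE
const AGGRESSIVE_RETRIABLE_GRPC_CODES: ReadonlySet<number> = new Set([8, 14]); // + RESOURCE_EXHAUSTED
const ENDPOINT_FAILURE_GRPC_CODES: ReadonlySet<number> = new Set([14]);

function isRetriable(err: unknown, mode: RetryMode): boolean {
  // Early return: caller-driven outcomes are never retried, in any mode.
  if (err instanceof TimeoutError || err instanceof AbortedError) return false;
  if (err instanceof TransportError) {
    const codes =
      mode === "conservative" ? CONSERVATIVE_RETRIABLE_GRPC_CODES : AGGRESSIVE_RETRIABLE_GRPC_CODES;
    return codes.has(err.grpcCode);
  }
  return false;
}

function isEndpointFailure(err: unknown): boolean {
  // A timeout reflects the caller's clock, so it says nothing about endpoint health.
  if (err instanceof TimeoutError) return false;
  return err instanceof TransportError && ENDPOINT_FAILURE_GRPC_CODES.has(err.grpcCode);
}
```

Under these assumptions, a TransportError carrying code 4 is also non-retriable and not an endpoint failure, simply because 4 is in none of the sets.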

The server-side business DeadlineExceeded (status 1008) was already non-retriable and is unchanged.

Tests

  • Updated test/unit/errors.test.ts: TimeoutError is non-retriable in both modes; raw transport code 4 is non-retriable; TimeoutError is not an endpoint failure.
  • Full unit suite (207 tests), typecheck, and lint all pass.

A gRPC DEADLINE_EXCEEDED surfaces as TimeoutError and was treated as
retriable in both modes and as an endpoint-health failure. Both are wrong:

- timeoutMs is the caller's hard latency budget, and each attempt resets the
  deadline (unaryCall), so retrying silently multiplied the budget by
  maxAttempts and usually timed out again.
- a tight caller deadline ejected an otherwise-healthy endpoint, since the
  timeout reflects the caller's clock, not endpoint health.

Make TimeoutError non-retriable (early return, like AbortedError) and exclude
it from isEndpointFailure. Drop the dead DEADLINE_EXCEEDED (4) entries from
both gRPC-code sets: promise-adapter always maps code 4 to TimeoutError, so a
TransportError with code 4 never reaches them.

Copilot AI left a comment

Contributor

Pull request overview

This PR corrects retry and endpoint-health classification for client-side gRPC timeouts. A DEADLINE_EXCEEDED that surfaces as TimeoutError is now treated as neither retriable nor an endpoint failure, which prevents inflated latency budgets and accidental ejection of healthy endpoints.

Changes:

  • Make TimeoutError non-retriable in both retry modes and remove DEADLINE_EXCEEDED (4) from conservative retriable code sets.
  • Exclude TimeoutError (and transport code 4) from endpoint-failure detection so client deadlines don’t affect endpoint health.
  • Update unit tests to cover the new retry/health behavior and bump SDK version to 0.2.1.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/errors.ts | Updates retry and endpoint-failure classification to exclude TimeoutError / gRPC code 4 and adjusts related docs/comments. |
| test/unit/errors.test.ts | Adds/updates tests ensuring TimeoutError and transport code 4 are non-retriable and not endpoint failures. |
| test/unit/client-failover.test.ts | Adds a failover test asserting no retries and no endpoint-health reporting on TimeoutError. |
| src/version.ts | Bumps internal SDK version to 0.2.1. |
| package.json | Bumps package version to 0.2.1. |
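
The behavior that the failover test is described as asserting (one attempt, no endpoint-health report on TimeoutError) can be sketched roughly as below. This is not the SDK's client or its test: callWithRetry, HealthReporter, UnavailableError, and the synchronous attempt callback are all hypothetical, chosen to keep the example short.

```typescript
// Hypothetical stand-ins; the SDK's real client and error classes are not shown in this PR page.
class TimeoutError extends Error {}
class UnavailableError extends Error {}

interface HealthReporter {
  reportFailure(endpoint: string): void;
}

// Assumed policy for this sketch: only UnavailableError is retried and reported.
function callWithRetry<T>(
  endpoint: string,
  attempt: () => T,
  maxAttempts: number,
  health: HealthReporter
): T {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return attempt();
    } catch (err) {
      lastError = err;
      // The caller's budget is spent: no retry and no health report.
      if (err instanceof TimeoutError) throw err;
      if (err instanceof UnavailableError) {
        health.reportFailure(endpoint);
        continue;
      }
      throw err;
    }
  }
  throw lastError ?? new Error("maxAttempts must be at least 1");
}

// Usage: an attempt that always times out is tried once and leaves health untouched.
const reported: string[] = [];
let attempts = 0;
let caught: unknown;
try {
  callWithRetry<string>(
    "endpoint-a",
    (): string => {
      attempts++;
      throw new TimeoutError("deadline exceeded");
    },
    3,
    { reportFailure: (endpoint: string): void => { reported.push(endpoint); } }
  );
} catch (err) {
  caught = err;
}
console.log(attempts, reported.length, caught instanceof TimeoutError); // 1 0 true
```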

Comment thread src/errors.ts Outdated
@killme2008 merged commit 205bdeb into main on Jun 16, 2026
6 checks passed
@killme2008 deleted the fix/no-retry-client-timeout branch on June 16, 2026 03:07