fix: do not retry or eject endpoint on client-side timeout#13
Merged
Conversation
A gRPC DEADLINE_EXCEEDED surfaces as TimeoutError and was treated as retriable in both modes and as an endpoint-health failure. Both are wrong: - timeoutMs is the caller's hard latency budget, and each attempt resets the deadline (unaryCall), so retrying silently multiplied the budget by maxAttempts and usually timed out again. - a tight caller deadline ejected an otherwise-healthy endpoint, since the timeout reflects the caller's clock, not endpoint health. Make TimeoutError non-retriable (early return, like AbortedError) and exclude it from isEndpointFailure. Drop the dead DEADLINE_EXCEEDED (4) entries from both gRPC-code sets: promise-adapter always maps code 4 to TimeoutError, so a TransportError with code 4 never reaches them.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR corrects retry and endpoint-health classification for client-side gRPC timeouts by treating DEADLINE_EXCEEDED surfaced as TimeoutError as neither retriable nor an endpoint failure, preventing inflated latency budgets and accidental ejection of healthy endpoints.
Changes:
- Make
TimeoutErrornon-retriable in both retry modes and removeDEADLINE_EXCEEDED (4)from conservative retriable code sets. - Exclude
TimeoutError(and transport code4) from endpoint-failure detection so client deadlines don’t affect endpoint health. - Update unit tests to cover the new retry/health behavior and bump SDK version to
0.2.1.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/errors.ts |
Updates retry and endpoint-failure classification to exclude TimeoutError / gRPC code 4 and adjusts related docs/comments. |
test/unit/errors.test.ts |
Adds/updates tests ensuring TimeoutError and transport code 4 are non-retriable and not endpoint failures. |
test/unit/client-failover.test.ts |
Adds a failover test asserting no retries and no endpoint-health reporting on TimeoutError. |
src/version.ts |
Bumps internal SDK version to 0.2.1. |
package.json |
Bumps package version to 0.2.1. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
A gRPC
DEADLINE_EXCEEDEDsurfaces asTimeoutError(seepromise-adapter), and it was classified as retriable in both modes and as an endpoint-health failure. Both are wrong for a client-side timeout:timeoutMsis the caller's hard latency budget for the write, but each attempt resets the deadline (unaryCalldoesnew Date(Date.now() + deadlineMs)per attempt). Retrying a timeout therefore silently stretched the budget tomaxAttempts × timeoutMs + backoff— a caller asking for 60s could wait minutes — and usually just timed out again.isEndpointFailure(TimeoutError)returnedtrue, so a tight caller deadline reported the endpoint as failed and could eject it. A timeout reflects the caller's clock, not the endpoint's health.This also aligns the TS client with the Go ingester, which already excludes
DEADLINE_EXCEEDEDfrom both retry and endpoint-failure classification.Fix
TimeoutErrornon-retriable in every mode via an early return (same treatment asAbortedError).TimeoutErrorfromisEndpointFailure.DEADLINE_EXCEEDED(4) entries fromCONSERVATIVE_RETRIABLE_GRPC_CODESandENDPOINT_FAILURE_GRPC_CODES:promise-adapteralways maps code 4 toTimeoutError, so aTransportErrorwith code 4 never reaches those sets.The server-side business
DeadlineExceeded(status 1008) was already non-retriable and is unchanged.Tests
test/unit/errors.test.ts:TimeoutErroris non-retriable in both modes; raw transport code 4 is non-retriable;TimeoutErroris not an endpoint failure.typecheck, andlintall pass.