Add retry logic to token acquisition for auth credentials#1602
Open
renaudhartert-db wants to merge 7 commits intomainfrom
Open
Add retry logic to token acquisition for auth credentials#1602renaudhartert-db wants to merge 7 commits intomainfrom
renaudhartert-db wants to merge 7 commits intomainfrom
Conversation
Token acquisition for OIDC, M2M OAuth, and Azure client secret credentials now retries on transient failures (429, 5xx, network errors). Retries happen at the token source level rather than the HTTP transport level. The existing RoundTrip implementation on ApiClient violates the standard http.RoundTripper contract by retrying requests and interpreting HTTP status codes -- an anti-pattern that arguably led to #1398 in the first place, since the transport-level retry cannot rewind request bodies created by the oauth2 library. By retrying at the application level, each attempt creates a fresh HTTP request, sidestepping the body-reset problem entirely. Fixes #1398 Fixes #1072 Co-authored-by: Isaac Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Only retry token acquisition on HTTP 429 (rate limited) and 503 (service unavailable). Other 5xx errors are not retried to avoid masking persistent server-side failures. Co-authored-by: Isaac Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Expand retriable codes to include 502 (Bad Gateway) and 504 (Gateway Timeout) alongside 429 and 503. These are infrastructure-layer transients from load balancers and API gateways, not application errors. 500 and 501 remain non-retriable as they indicate server bugs or missing endpoints. This matches the retry behavior of Azure SDK, AWS SDK, and Google Cloud client libraries for token endpoints. Co-authored-by: Isaac Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Simplify test assertions: compare token and error values directly instead of using boolean flags. Rename callCount to gotNumCalls for clarity. Co-authored-by: Isaac Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Shorter file name. No code changes. Co-authored-by: Isaac Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Use errors.Is(syscall.ECONNRESET), errors.Is(syscall.ECONNREFUSED), and net.Error.Timeout() instead of substring matching on error messages. String matching is brittle: error text is not covered by Go's compatibility promise and differs across platforms. Remove duplicate "tls handshake timeout" test case (identical to "i/o timeout") and update all network error tests to use real error chains (net.OpError -> os.SyscallError -> syscall.Errno). Co-authored-by: Isaac Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Replace the simple isRetriableTokenError predicate with httpRetrier, which adds Retry-After header parsing and uses the server's value as a delay hint. The delay is max(backoff, Retry-After) so we never retry sooner than the server asked, and no upper cap is applied since the context timeout already bounds total retry duration. Unify httpStatusCoder and httpHeaderCoder into a single httpResponseError interface, with a single httpResponse extractor that returns both status code and headers. Add HttpError.Header() to httpclient to satisfy the interface. Move defaultOptions into NewRetryingTokenSource to prevent slice aliasing. Co-authored-by: Isaac Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
|
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below: Trigger: Inputs:
Checks will be approved automatically on success. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Token acquisition for OIDC, M2M OAuth, and Azure client secret credentials
now retries on transient failures with Retry-After header support. Retries
happen at the token source level (application layer) rather than the HTTP
transport level, sidestepping the body-rewind problem that caused #1398.
Why
The SDK's
ApiClient.RoundTripretries failed requests at the transportlevel, violating the
http.RoundTrippercontract. This is what led to #1398:the
oauth2library wraps request bodies in a non-seekable reader, and whenthe transport retry tries to rewind the body, it fails with "cannot reset
reader".
A related problem is #1072: OIDC token endpoint errors like
TEMPORARILY_UNAVAILABLEwere never retried because token acquisitionbypasses the SDK's retry middleware entirely.
By retrying at the application level, each attempt calls
Token()on theunderlying source, creating a fresh HTTP request from scratch. No body to
rewind.
Retriable status codes
Transient network errors are also retried: ECONNRESET, ECONNREFUSED, and
timeouts.
Retry-After
For retriable HTTP errors, the
Retry-Afterheader is used as a hint: thedelay is
max(backoff, Retry-After), so we never retry sooner than theserver asked. No upper cap is applied -- the context timeout already bounds
total retry duration. Retry-After is intentionally NOT treated as a
retriability signal: a 400 with Retry-After is still not retried.
What changed
Interface changes
auth.NewRetryingTokenSource(inner, opts...)-- wraps anyauth.TokenSourcewith retry logic. Exponential backoff, 1-minute timeout by default.api.ExecuteWithResult[T](ctx, fn, opts...)-- generic retry executor inexperimental/api/.(*httpclient.HttpError).HTTPStatusCode()-- exposes the status code for error classification without import cycles.(*httpclient.HttpError).Header()-- exposes headers for Retry-After parsing without import cycles.Behavioral changes
Retry-Afterheader value is not respected #767).Internal changes
httpRetrierinconfig/experimental/auth/retry.goclassifies errors and determines retry delay, including Retry-After parsing.httpResponseErrorinterface unifies status code and header extraction from errors, avoiding import cycles withhttpclient.httpResponse()extracts metadata from bothoauth2.RetrieveErrorand SDKHttpErrortypes.NewRetryingTokenSource.How is this tested?
TestRetryingTokenSource-- integration test: retry on 429/503/network errors, no retry on 400/401.TestHTTPRetrier_IsRetriable-- exhaustive retriability classification for all status codes and error types.TestHTTPRetrier_RetryAfterDelay-- Retry-After raises delay above backoff, not capped by backoff maximum.TestParseRetryAfter-- seconds, zero, negative, non-numeric, HTTP-date in the past.TestExecuteWithResult-- timeout, retry, and success paths.Issues
Fixes #1398 -- OIDC auth retries fail with "cannot reset reader"
Fixes #1072 -- OIDC token endpoint TEMPORARILY_UNAVAILABLE not retried
Partially addresses #767 -- Retry-After header respected for token acquisition
Related: #1102 (504 now retriable)
NO_CHANGELOG=true