Skip to content

Add retry logic to token acquisition for auth credentials#1602

Open
renaudhartert-db wants to merge 7 commits intomainfrom
fix/1398-oidc-retry-reset-reader
Open

Add retry logic to token acquisition for auth credentials#1602
renaudhartert-db wants to merge 7 commits intomainfrom
fix/1398-oidc-retry-reset-reader

Conversation

@renaudhartert-db
Copy link
Copy Markdown
Contributor

@renaudhartert-db renaudhartert-db commented Mar 28, 2026

Summary

Token acquisition for OIDC, M2M OAuth, and Azure client secret credentials
now retries on transient failures with Retry-After header support. Retries
happen at the token source level (application layer) rather than the HTTP
transport level, sidestepping the body-rewind problem that caused #1398.

Why

The SDK's ApiClient.RoundTrip retries failed requests at the transport
level, violating the http.RoundTripper contract. This is what led to #1398:
the oauth2 library wraps request bodies in a non-seekable reader, and when
the transport retry tries to rewind the body, it fails with "cannot reset
reader".

A related problem is #1072: OIDC token endpoint errors like
TEMPORARILY_UNAVAILABLE were never retried because token acquisition
bypasses the SDK's retry middleware entirely.

By retrying at the application level, each attempt calls Token() on the
underlying source, creating a fresh HTTP request from scratch. No body to
rewind.

Retriable status codes

Code Name Rationale
429 Too Many Requests Rate limiting from the IdP.
502 Bad Gateway Gateway could not reach the upstream token service. Common during rolling deployments.
503 Service Unavailable Token service temporarily overloaded or down.
504 Gateway Timeout Gateway timed out waiting for the upstream. Addresses #1102.

Transient network errors are also retried: ECONNRESET, ECONNREFUSED, and
timeouts.

Retry-After

For retriable HTTP errors, the Retry-After header is used as a hint: the
delay is max(backoff, Retry-After), so we never retry sooner than the
server asked. No upper cap is applied -- the context timeout already bounds
total retry duration. Retry-After is intentionally NOT treated as a
retriability signal: a 400 with Retry-After is still not retried.

What changed

Interface changes

  • auth.NewRetryingTokenSource(inner, opts...) -- wraps any auth.TokenSource with retry logic. Exponential backoff, 1-minute timeout by default.
  • api.ExecuteWithResult[T](ctx, fn, opts...) -- generic retry executor in experimental/api/.
  • (*httpclient.HttpError).HTTPStatusCode() -- exposes the status code for error classification without import cycles.
  • (*httpclient.HttpError).Header() -- exposes headers for Retry-After parsing without import cycles.

Behavioral changes

Internal changes

  • httpRetrier in config/experimental/auth/retry.go classifies errors and determines retry delay, including Retry-After parsing.
  • httpResponseError interface unifies status code and header extraction from errors, avoiding import cycles with httpclient.
  • httpResponse() extracts metadata from both oauth2.RetrieveError and SDK HttpError types.
  • All three credential providers wrap their token sources with NewRetryingTokenSource.

How is this tested?

  • TestRetryingTokenSource -- integration test: retry on 429/503/network errors, no retry on 400/401.
  • TestHTTPRetrier_IsRetriable -- exhaustive retriability classification for all status codes and error types.
  • TestHTTPRetrier_RetryAfterDelay -- Retry-After raises delay above backoff, not capped by backoff maximum.
  • TestParseRetryAfter -- seconds, zero, negative, non-numeric, HTTP-date in the past.
  • TestExecuteWithResult -- timeout, retry, and success paths.

Issues

Fixes #1398 -- OIDC auth retries fail with "cannot reset reader"
Fixes #1072 -- OIDC token endpoint TEMPORARILY_UNAVAILABLE not retried
Partially addresses #767 -- Retry-After header respected for token acquisition

Related: #1102 (504 now retriable)

NO_CHANGELOG=true

Token acquisition for OIDC, M2M OAuth, and Azure client secret
credentials now retries on transient failures (429, 5xx, network errors).

Retries happen at the token source level rather than the HTTP transport
level. The existing RoundTrip implementation on ApiClient violates the
standard http.RoundTripper contract by retrying requests and interpreting
HTTP status codes -- an anti-pattern that arguably led to #1398 in the
first place, since the transport-level retry cannot rewind request bodies
created by the oauth2 library. By retrying at the application level,
each attempt creates a fresh HTTP request, sidestepping the body-reset
problem entirely.

Fixes #1398
Fixes #1072

Co-authored-by: Isaac
Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Only retry token acquisition on HTTP 429 (rate limited) and 503
(service unavailable). Other 5xx errors are not retried to avoid
masking persistent server-side failures.

Co-authored-by: Isaac
Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Expand retriable codes to include 502 (Bad Gateway) and 504 (Gateway
Timeout) alongside 429 and 503. These are infrastructure-layer transients
from load balancers and API gateways, not application errors. 500 and 501
remain non-retriable as they indicate server bugs or missing endpoints.

This matches the retry behavior of Azure SDK, AWS SDK, and Google Cloud
client libraries for token endpoints.

Co-authored-by: Isaac
Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Simplify test assertions: compare token and error values directly
instead of using boolean flags. Rename callCount to gotNumCalls for
clarity.

Co-authored-by: Isaac
Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Shorter file name. No code changes.

Co-authored-by: Isaac
Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Use errors.Is(syscall.ECONNRESET), errors.Is(syscall.ECONNREFUSED),
and net.Error.Timeout() instead of substring matching on error messages.
String matching is brittle: error text is not covered by Go's
compatibility promise and differs across platforms.

Remove duplicate "tls handshake timeout" test case (identical to
"i/o timeout") and update all network error tests to use real error
chains (net.OpError -> os.SyscallError -> syscall.Errno).

Co-authored-by: Isaac
Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Replace the simple isRetriableTokenError predicate with httpRetrier, which
adds Retry-After header parsing and uses the server's value as a delay hint.
The delay is max(backoff, Retry-After) so we never retry sooner than the
server asked, and no upper cap is applied since the context timeout already
bounds total retry duration.

Unify httpStatusCoder and httpHeaderCoder into a single httpResponseError
interface, with a single httpResponse extractor that returns both status
code and headers. Add HttpError.Header() to httpclient to satisfy the
interface.

Move defaultOptions into NewRetryingTokenSource to prevent slice aliasing.

Co-authored-by: Isaac
Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
@github-actions
Copy link
Copy Markdown

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-go

Inputs:

  • PR number: 1602
  • Commit SHA: 6a37d89c877e4489949607095f4a6e0206668483

Checks will be approved automatically on success.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant