fix(auth/m2m): remove double-caching to enable proactive token refresh#1550
Conversation
|
Thanks for the thorough analysis and the fix! The double-caching problem is well explained and the root cause is clearly correct. I think the tests need to change though. The two tests in We can mock the HTTP layer through func TestM2mCredentials_DirectTokenSource(t *testing.T) {
var tokenCalls int32
transport := &tokenCountingTransport{
calls: &tokenCalls,
inner: fixtures.MappingTransport{
"GET /oidc/.well-known/oauth-authorization-server": {
Response: u2m.OAuthAuthorizationServer{
TokenEndpoint: "https://localhost/token",
},
},
"POST /token": {
Response: oauth2.Token{
TokenType: "Bearer",
AccessToken: "test-token",
},
},
},
}
cfg := &Config{
Host: "a",
ClientID: "b",
ClientSecret: "c",
AuthType: "oauth-m2m",
ConfigFile: "/dev/null",
DisableAsyncTokenRefresh: true,
HTTPTransport: transport,
}
err := cfg.EnsureResolved()
if err != nil {
t.Fatalf("EnsureResolved(): %v", err)
}
ctx := cfg.refreshClient.InContextForOAuth2(cfg.refreshCtx)
provider, err := M2mCredentials{}.Configure(ctx, cfg)
if err != nil {
t.Fatalf("Configure(): %v", err)
}
oauthProvider := provider.(credentials.OAuthCredentialsProvider)
// Call Token() 3 times. With the fix, each call hits the endpoint.
// Without the fix (ReuseTokenSource), only the first call does.
const nCalls = 3
for i := 0; i < nCalls; i++ {
tok, err := oauthProvider.Token(context.Background())
if err != nil {
t.Fatalf("Token() call %d: %v", i+1, err)
}
if tok.AccessToken == "" {
t.Fatalf("Token() call %d: empty access token", i+1)
}
}
got := int(atomic.LoadInt32(&tokenCalls))
if got != nCalls {
t.Errorf("token endpoint calls = %d, want %d", got, nCalls)
}
}
type tokenCountingTransport struct {
calls *int32
inner http.RoundTripper
}
func (t *tokenCountingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
if req.Method == "POST" {
atomic.AddInt32(t.calls, 1)
}
return t.inner.RoundTrip(req)
}
func (t *tokenCountingTransport) SkipRetryOnIO() bool { return true }Would you prefer to adjust the tests, or would it be easier for us to take over the PR from here? Happy either way. |
## Changes Fixes #1549. Related: #1550, #1573. Google's Cloud Go libraries (`impersonate.IDTokenSource`, `impersonate.CredentialsTokenSource`, `idtoken.NewTokenSource`, `google.CredentialsFromJSON`) all internally wrap their token sources in `oauth2.ReuseTokenSource`. This means the SDK's `cachedTokenSource` async refresh is swallowed by the inner `ReuseTokenSource` cache, making it entirely wasted work. This is the same double-caching bug as M2M OAuth (#1549). Though, unlike M2M (#1550) and Azure Client Secret (#1573), Google's libraries don't expose an uncached `Token(ctx)` method -- the inner sources are unexported types. Recreating the token source on each call is expensive and will require further considerations. The pragmatic fix is to disable async refresh for both GCP providers: - `GoogleDefaultCredentials` (`config/auth_gcp_google_id.go`) - `GoogleCredentials` (`config/auth_gcp_google_credentials.go`) Google's own `ReuseTokenSource` (with a 10-second early refresh window) still provides token renewal before expiry. Thorough comments explain the rationale and warn against re-enabling async refresh. ## Tests Existing `config` tests pass (`go test ./config/ -v -count=1`). No new tests needed -- this is a configuration change (appending `auth.WithAsyncRefresh(false)` to cache options) and the async refresh mechanism itself is already well-tested. --------- Signed-off-by: Ubuntu <renaud.hartert@databricks.com> Co-authored-by: simon <simon.faltum@databricks.com>
e4438cf to
38971e9
Compare
Thanks for the review and for catching that! I gave the fix a whirl. I think the tests are correct now? |
renaudhartert-db
left a comment
There was a problem hiding this comment.
LGTM! Could you please handle the fmt issue and the merge conflict? I'll trigger the integration test and complete the merge afterward. Thanks!
38971e9 to
1dce339
Compare
clientcredentials.Config.TokenSource returns an oauth2.ReuseTokenSource, which caches the token internally with a 10s expiryDelta. Wrapping this in cachedTokenSource creates a double-caching stack where async refresh calls return the inner-cached token instead of making a real HTTP request. As a result, the proactive 20-min async refresh window is wasted: the underlying token endpoint is not reached until ~10s before expiry. Any request that holds the about-to-expire token and whose HTTP round-trip to Databricks completes after the expiry time receives HTTP 401. Replace clientcredentials.Config.TokenSource (ReuseTokenSource) with a direct TokenSourceFn that always calls ccfg.Token(ctx). cachedTokenSource becomes the sole cache layer and async refresh proactively fetches a fresh token at T-20min as intended. Fixes databricks#1549. Tests: - TestCachedTokenSource_AsyncRefreshBlockedByInnerCache: documents that inner ReuseTokenSource delays the real fetch to near T-10s - TestCachedTokenSource_AsyncRefreshWithDirectSource: verifies that a direct source causes the fetch at T-20min as intended - Existing TestM2mHappyFlow / TestM2mHappyFlowForAccount: still pass Signed-off-by: Alex Ausch <alex@ausch.name>
Replace the two cache_test.go tests that demonstrated the bug at the cachedTokenSource level with TestM2mCredentials_DirectTokenSource in auth_m2m_test.go. The new test goes through M2mCredentials.Configure and verifies that calling Token() N times results in N HTTP fetches, which directly guards the fix in auth_m2m.go against regression. Signed-off-by: Alex Ausch <alex@ausch.name>
…omment Signed-off-by: Alex Ausch <alex@ausch.name>
1dce339 to
d653977
Compare
…hubOIDCCredentials EnsureResolved now calls resolveHostMetadata, which makes a real network request when no HTTPTransport is set. Four test cases with a Host but no transport were hanging indefinitely. Add an empty MappingTransport to each so the host-metadata request gets a fast "missing stub" error and falls back gracefully. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Alex Ausch <alex@ausch.name>
d653977 to
f3dc334
Compare
|
Added a fourth commit (f3dc334) that fixes a pre-existing test hang I found while running the full suite locally. Root cause: Fix: Add This is unrelated to the M2M double-caching fix but reproducible on the current (also rebased on main and force-pushed, to keep the commit logs clean) |
…#1573) ## Summary Fixes the double-caching of OAuth tokens in Azure client secret credentials (#1549) and unifies the internal token source infrastructure on `auth.TokenSource`. ## Why `AzureClientSecretCredentials.tokenSourceFor` called `clientcredentials.Config{}.TokenSource(ctx)`, which returns an `oauth2.ReuseTokenSource` with a hardcoded 10-second `expiryDelta`. This token source was then wrapped in `azureReuseTokenSource` (a `cachedTokenSource`) and again in `serviceToServiceVisitor` (another `cachedTokenSource`). The inner `ReuseTokenSource` swallowed proactive refresh attempts from the outer layers -- it kept returning its cached token until only ~10 seconds remained before expiry, defeating the 20-minute async refresh window and causing bursts of 401s at token expiry. This is the same class of bug as #1550 (M2M OAuth), but in the Azure client secret path. ## What changed ### Interface changes None. All changes are to unexported types and functions. ### Behavioral changes - `AzureClientSecretCredentials` no longer double-caches tokens. Each call to the underlying token source results in an HTTP request to the token endpoint; caching is purely controlled by the outer `azureReuseTokenSource` and `serviceToServiceVisitor` layers. ### Internal changes - **`azureHostResolver` interface** now returns `auth.TokenSource` instead of `oauth2.TokenSource`, and no longer takes a `context.Context` parameter. Context is passed per-call via `Token(ctx)`. - **`serviceToServiceVisitor` and `refreshableVisitor`** now accept `auth.TokenSource` directly, removing the `authconv.AuthTokenSource` wrapping at call sites. - **`azureReuseTokenSource` and `wrap`** operate on `auth.TokenSource` instead of `oauth2.TokenSource`, eliminating the `authconv` round-trip. - **`AzureClientSecretCredentials`**: removed the duplicate `tokenSourceFor` (which used `clientcredentials.Config.TokenSource` with inner caching). The former `authTokenSourceFor` -- which calls `clientcredentials.Config.Token(ctx)` without inner caching -- is now the sole `tokenSourceFor`. - **`AzureCliCredentials.tokenSourceFor`**: returns an `auth.TokenSource` that creates a fresh `azureCliTokenSource` per call, properly threading the caller's context. - **`AzureMsiCredentials.tokenSourceFor`**: wraps `NewAzureMsiTokenSource` with `authconv.AuthTokenSource`. - **GCP and M2M credential providers**: callers wrap their `oauth2.TokenSource` with `authconv.AuthTokenSource` before passing to the visitor functions. ## How is this tested? - `TestAzureClientSecretCredentials_tokenSourceFor_noCaching`: calls `Token()` N times and asserts N HTTP requests hit the token endpoint. Fails when the fix is reverted (inner caching collapses N calls into 1). - `TestAzureClientSecretCredentials_Configure`: happy-path test for the full Azure client secret credential flow. - All existing `config/` tests pass. Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
The m2m test file declared tokenCountingTransport which conflicts with the same type in auth_azure_client_secret_test.go. Rename to postCountingTransport since this variant only counts POST requests. Co-authored-by: Isaac Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
|
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below: Trigger: Inputs:
Checks will be approved automatically on success. |
Fixes a double-caching bug in M2M OAuth that caused the proactive async token refresh to have no effect, resulting in bursts of HTTP 401 or 403 errors at each token rotation boundary (~every hour).
Closes #1549.
Why
M2mCredentials.Configurepreviously calledclientcredentials.Config.TokenSource(ctx), which returns anoauth2.ReuseTokenSource. That source was then passed torefreshableVisitor, which wraps it in acachedTokenSource. The resulting stack is:When
cachedTokenSourcetriggers its async refresh at T−20 min, it calls through toReuseTokenSource.Token(). Because the token still has 20 minutes of life,ReuseTokenSourceconsiders it valid and returns the cached token without an HTTP call. The async refresh fires repeatedly (with an accelerating schedule) but each call hits the same inner cache — until only ~10 s remain, at which pointReuseTokenSource'sexpiryDeltawindow is crossed and a real network call is finally made.Any request that receives the about-to-expire token and whose round-trip to Databricks completes after the token's expiry time gets an auth error. In production this manifests as a burst of errors at precisely one-hour intervals, correlated with pod startup times.
Note on status codes: the exact HTTP status code returned for an expired M2M OAuth token varies by workspace and request type. A completely invalid token returns 401 on all endpoints. A naturally expired token has been observed to return 403
"Invalid Token"on both management endpoints (GET /api/2.0/serving-endpoints/{name}) and inference endpoints (POST /serving-endpoints/{name}/invocations). Local workarounds that intercept auth errors at the transport layer need to handle both 401 and 403.What changed
Interface changes
None.
Behavioral changes
Internal changes
auth_m2m.go: replacedclientcredentials.Config.TokenSource(ctx)(which returns aReuseTokenSource) with aTokenSourceFnclosure that callsccfg.Token(ctx)directly.cachedTokenSourceis now the sole caching layer.cache_test.go: addedreuseTokenSourcetest helper and two new tests:TestCachedTokenSource_AsyncRefreshBlockedByInnerCache— documents the double-caching behaviour using a mock clock.TestCachedTokenSource_AsyncRefreshWithDirectSource— verifies that a direct (non-caching) inner source triggers an HTTP fetch at the proactive window.NEXT_CHANGELOG.md: updated.How is this tested?
TestCachedTokenSource_AsyncRefreshBlockedByInnerCache: uses a fake clock and a controlledreuseTokenSourcehelper to confirm that inner caching delays the HTTP fetch to T−10 s rather than T−20 min.TestCachedTokenSource_AsyncRefreshWithDirectSource: uses the same fake clock with a direct token source to confirm the fetch occurs at T−20 min after the fix.TestM2mHappyFlowandTestM2mHappyFlowForAccountcontinue to pass.