Skip to content

fix(auth/m2m): remove double-caching to enable proactive token refresh#1550

Merged
renaudhartert-db merged 8 commits intodatabricks:mainfrom
aausch:aausch/fix/cached-token-source-stale-refresh
Mar 27, 2026
Merged

fix(auth/m2m): remove double-caching to enable proactive token refresh#1550
renaudhartert-db merged 8 commits intodatabricks:mainfrom
aausch:aausch/fix/cached-token-source-stale-refresh

Conversation

@aausch
Copy link
Copy Markdown
Contributor

@aausch aausch commented Mar 16, 2026

Fixes a double-caching bug in M2M OAuth that caused the proactive async token refresh to have no effect, resulting in bursts of HTTP 401 or 403 errors at each token rotation boundary (~every hour).

Closes #1549.

Why

M2mCredentials.Configure previously called clientcredentials.Config.TokenSource(ctx), which returns an oauth2.ReuseTokenSource. That source was then passed to refreshableVisitor, which wraps it in a cachedTokenSource. The resulting stack is:

cachedTokenSource          ← Databricks async cache (proactive, T-20min)
    └── authTokenSource    ← discards context
            └── oauth2.ReuseTokenSource  ← Go stdlib cache, expiryDelta=10s
                    └── clientcredentials.tokenSource  ← HTTP /token endpoint

When cachedTokenSource triggers its async refresh at T−20 min, it calls through to ReuseTokenSource.Token(). Because the token still has 20 minutes of life, ReuseTokenSource considers it valid and returns the cached token without an HTTP call. The async refresh fires repeatedly (with an accelerating schedule) but each call hits the same inner cache — until only ~10 s remain, at which point ReuseTokenSource's expiryDelta window is crossed and a real network call is finally made.

Any request that receives the about-to-expire token and whose round-trip to Databricks completes after the token's expiry time gets an auth error. In production this manifests as a burst of errors at precisely one-hour intervals, correlated with pod startup times.

Note on status codes: the exact HTTP status code returned for an expired M2M OAuth token varies by workspace and request type. A completely invalid token returns 401 on all endpoints. A naturally expired token has been observed to return 403 "Invalid Token" on both management endpoints (GET /api/2.0/serving-endpoints/{name}) and inference endpoints (POST /serving-endpoints/{name}/invocations). Local workarounds that intercept auth errors at the transport layer need to handle both 401 and 403.

What changed

Interface changes

None.

Behavioral changes

  • M2M OAuth token refresh now makes a real HTTP call to the token endpoint at T−20 min (the intended proactive window) rather than at T−10 s.
  • Callers will see one additional token endpoint call per hour per process (the proactive refresh), offset by eliminating the burst of auth errors at expiry.

Internal changes

  • auth_m2m.go: replaced clientcredentials.Config.TokenSource(ctx) (which returns a ReuseTokenSource) with a TokenSourceFn closure that calls ccfg.Token(ctx) directly. cachedTokenSource is now the sole caching layer.
  • cache_test.go: added reuseTokenSource test helper and two new tests:
    • TestCachedTokenSource_AsyncRefreshBlockedByInnerCache — documents the double-caching behaviour using a mock clock.
    • TestCachedTokenSource_AsyncRefreshWithDirectSource — verifies that a direct (non-caching) inner source triggers an HTTP fetch at the proactive window.
  • NEXT_CHANGELOG.md: updated.

How is this tested?

  • TestCachedTokenSource_AsyncRefreshBlockedByInnerCache: uses a fake clock and a controlled reuseTokenSource helper to confirm that inner caching delays the HTTP fetch to T−10 s rather than T−20 min.
  • TestCachedTokenSource_AsyncRefreshWithDirectSource: uses the same fake clock with a direct token source to confirm the fetch occurs at T−20 min after the fix.
  • Both tests operate at the token-source level using a mock clock — they count real HTTP fetches to the token endpoint rather than checking the status code returned by Databricks API calls. This means they validate the fix regardless of whether the downstream auth failure manifests as 401 or 403.
  • Existing TestM2mHappyFlow and TestM2mHappyFlowForAccount continue to pass.

@renaudhartert-db
Copy link
Copy Markdown
Contributor

renaudhartert-db commented Mar 21, 2026

Thanks for the thorough analysis and the fix! The double-caching problem is well explained and the root cause is clearly correct.

I think the tests need to change though. The two tests in cache_test.go demonstrate the bug at the cachedTokenSource level, but they don't exercise the fix in auth_m2m.go. Could we rather have a test that goes through M2mCredentials.Configure and verifies the returned token source is direct (no inner caching)?

We can mock the HTTP layer through Config.HTTPTransport. Here's what I have in mind — I've tested locally, it passes with the fix and fails without:

func TestM2mCredentials_DirectTokenSource(t *testing.T) {
	var tokenCalls int32
	transport := &tokenCountingTransport{
		calls: &tokenCalls,
		inner: fixtures.MappingTransport{
			"GET /oidc/.well-known/oauth-authorization-server": {
				Response: u2m.OAuthAuthorizationServer{
					TokenEndpoint: "https://localhost/token",
				},
			},
			"POST /token": {
				Response: oauth2.Token{
					TokenType:   "Bearer",
					AccessToken: "test-token",
				},
			},
		},
	}

	cfg := &Config{
		Host:                     "a",
		ClientID:                 "b",
		ClientSecret:             "c",
		AuthType:                 "oauth-m2m",
		ConfigFile:               "/dev/null",
		DisableAsyncTokenRefresh: true,
		HTTPTransport:            transport,
	}

	err := cfg.EnsureResolved()
	if err != nil {
		t.Fatalf("EnsureResolved(): %v", err)
	}

	ctx := cfg.refreshClient.InContextForOAuth2(cfg.refreshCtx)
	provider, err := M2mCredentials{}.Configure(ctx, cfg)
	if err != nil {
		t.Fatalf("Configure(): %v", err)
	}

	oauthProvider := provider.(credentials.OAuthCredentialsProvider)

	// Call Token() 3 times. With the fix, each call hits the endpoint.
	// Without the fix (ReuseTokenSource), only the first call does.
	const nCalls = 3
	for i := 0; i < nCalls; i++ {
		tok, err := oauthProvider.Token(context.Background())
		if err != nil {
			t.Fatalf("Token() call %d: %v", i+1, err)
		}
		if tok.AccessToken == "" {
			t.Fatalf("Token() call %d: empty access token", i+1)
		}
	}

	got := int(atomic.LoadInt32(&tokenCalls))
	if got != nCalls {
		t.Errorf("token endpoint calls = %d, want %d", got, nCalls)
	}
}

type tokenCountingTransport struct {
	calls *int32
	inner http.RoundTripper
}

func (t *tokenCountingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if req.Method == "POST" {
		atomic.AddInt32(t.calls, 1)
	}
	return t.inner.RoundTrip(req)
}

func (t *tokenCountingTransport) SkipRetryOnIO() bool { return true }

Would you prefer to adjust the tests, or would it be easier for us to take over the PR from here? Happy either way.

@renaudhartert-db renaudhartert-db self-requested a review March 21, 2026 20:44
github-merge-queue bot pushed a commit that referenced this pull request Mar 22, 2026
## Changes

Fixes #1549. Related: #1550, #1573.

Google's Cloud Go libraries (`impersonate.IDTokenSource`,
`impersonate.CredentialsTokenSource`, `idtoken.NewTokenSource`,
`google.CredentialsFromJSON`) all internally wrap their token sources in
`oauth2.ReuseTokenSource`. This means the SDK's `cachedTokenSource`
async refresh is swallowed by the inner `ReuseTokenSource` cache, making
it entirely wasted work.


This is the same double-caching bug as M2M OAuth (#1549). Though, unlike
M2M (#1550) and Azure Client Secret (#1573), Google's libraries don't
expose an uncached `Token(ctx)` method -- the inner sources are
unexported types. Recreating the token source on each call is expensive
and will require further considerations.

The pragmatic fix is to disable async refresh for both GCP providers:
- `GoogleDefaultCredentials` (`config/auth_gcp_google_id.go`)
- `GoogleCredentials` (`config/auth_gcp_google_credentials.go`)

Google's own `ReuseTokenSource` (with a 10-second early refresh window)
still provides token renewal before expiry. Thorough comments explain
the rationale and warn against re-enabling async refresh.

## Tests

Existing `config` tests pass (`go test ./config/ -v -count=1`). No new
tests needed -- this is a configuration change (appending
`auth.WithAsyncRefresh(false)` to cache options) and the async refresh
mechanism itself is already well-tested.

---------

Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
Co-authored-by: simon <simon.faltum@databricks.com>
@aausch aausch force-pushed the aausch/fix/cached-token-source-stale-refresh branch from e4438cf to 38971e9 Compare March 23, 2026 14:41
@aausch
Copy link
Copy Markdown
Contributor Author

aausch commented Mar 23, 2026

Would you prefer to adjust the tests, or would it be easier for us to take over the PR from here? Happy either way.

Thanks for the review and for catching that! I gave the fix a whirl. I think the tests are correct now?

@aausch aausch temporarily deployed to test-trigger-is March 23, 2026 16:47 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@renaudhartert-db renaudhartert-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Could you please handle the fmt issue and the merge conflict? I'll trigger the integration test and complete the merge afterward. Thanks!

@aausch aausch force-pushed the aausch/fix/cached-token-source-stale-refresh branch from 38971e9 to 1dce339 Compare March 23, 2026 16:58
aausch added 3 commits March 23, 2026 14:57
clientcredentials.Config.TokenSource returns an oauth2.ReuseTokenSource,
which caches the token internally with a 10s expiryDelta. Wrapping this
in cachedTokenSource creates a double-caching stack where async refresh
calls return the inner-cached token instead of making a real HTTP request.

As a result, the proactive 20-min async refresh window is wasted: the
underlying token endpoint is not reached until ~10s before expiry. Any
request that holds the about-to-expire token and whose HTTP round-trip
to Databricks completes after the expiry time receives HTTP 401.

Replace clientcredentials.Config.TokenSource (ReuseTokenSource) with a
direct TokenSourceFn that always calls ccfg.Token(ctx). cachedTokenSource
becomes the sole cache layer and async refresh proactively fetches a fresh
token at T-20min as intended.

Fixes databricks#1549.

Tests:
- TestCachedTokenSource_AsyncRefreshBlockedByInnerCache: documents that
  inner ReuseTokenSource delays the real fetch to near T-10s
- TestCachedTokenSource_AsyncRefreshWithDirectSource: verifies that a
  direct source causes the fetch at T-20min as intended
- Existing TestM2mHappyFlow / TestM2mHappyFlowForAccount: still pass

Signed-off-by: Alex Ausch <alex@ausch.name>
Replace the two cache_test.go tests that demonstrated the bug at the
cachedTokenSource level with TestM2mCredentials_DirectTokenSource in
auth_m2m_test.go. The new test goes through M2mCredentials.Configure and
verifies that calling Token() N times results in N HTTP fetches, which
directly guards the fix in auth_m2m.go against regression.

Signed-off-by: Alex Ausch <alex@ausch.name>
…omment

Signed-off-by: Alex Ausch <alex@ausch.name>
@aausch aausch force-pushed the aausch/fix/cached-token-source-stale-refresh branch from 1dce339 to d653977 Compare March 23, 2026 19:39
…hubOIDCCredentials

EnsureResolved now calls resolveHostMetadata, which makes a real network
request when no HTTPTransport is set. Four test cases with a Host but no
transport were hanging indefinitely. Add an empty MappingTransport to each
so the host-metadata request gets a fast "missing stub" error and falls
back gracefully.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Alex Ausch <alex@ausch.name>
@aausch aausch force-pushed the aausch/fix/cached-token-source-stale-refresh branch from d653977 to f3dc334 Compare March 23, 2026 19:40
@aausch
Copy link
Copy Markdown
Contributor Author

aausch commented Mar 23, 2026

Added a fourth commit (f3dc334) that fixes a pre-existing test hang I found while running the full suite locally.

Root cause: EnsureResolved now calls resolveHostMetadata (introduced in #1572), which makes a real network request when no HTTPTransport is set. Four test cases in TestAzureGithubOIDCCredentials had a Host but no HTTPTransport, so they would hang indefinitely waiting for a connection to http://host.com/.well-known/databricks-config.

Fix: Add HTTPTransport: fixtures.MappingTransport{} to those four cases. An empty MappingTransport returns a fast "missing stub" error, which resolveHostMetadata already handles gracefully (logs a warning and falls back to user config). The tests now complete in milliseconds.

This is unrelated to the M2M double-caching fix but reproducible on the current main.

(also rebased on main and force-pushed, to keep the commit logs clean)

@aausch aausch temporarily deployed to test-trigger-is March 23, 2026 19:57 — with GitHub Actions Inactive
github-merge-queue bot pushed a commit that referenced this pull request Mar 24, 2026
…#1573)

## Summary

Fixes the double-caching of OAuth tokens in Azure client secret
credentials (#1549) and unifies the internal token source infrastructure
on `auth.TokenSource`.

## Why

`AzureClientSecretCredentials.tokenSourceFor` called
`clientcredentials.Config{}.TokenSource(ctx)`, which returns an
`oauth2.ReuseTokenSource` with a hardcoded 10-second `expiryDelta`. This
token source was then wrapped in `azureReuseTokenSource` (a
`cachedTokenSource`) and again in `serviceToServiceVisitor` (another
`cachedTokenSource`). The inner `ReuseTokenSource` swallowed proactive
refresh attempts from the outer layers -- it kept returning its cached
token until only ~10 seconds remained before expiry, defeating the
20-minute async refresh window and causing bursts of 401s at token
expiry.

This is the same class of bug as #1550 (M2M OAuth), but in the Azure
client secret path.

## What changed

### Interface changes

None. All changes are to unexported types and functions.

### Behavioral changes

- `AzureClientSecretCredentials` no longer double-caches tokens. Each
call to the underlying token source results in an HTTP request to the
token endpoint; caching is purely controlled by the outer
`azureReuseTokenSource` and `serviceToServiceVisitor` layers.

### Internal changes

- **`azureHostResolver` interface** now returns `auth.TokenSource`
instead of `oauth2.TokenSource`, and no longer takes a `context.Context`
parameter. Context is passed per-call via `Token(ctx)`.
- **`serviceToServiceVisitor` and `refreshableVisitor`** now accept
`auth.TokenSource` directly, removing the `authconv.AuthTokenSource`
wrapping at call sites.
- **`azureReuseTokenSource` and `wrap`** operate on `auth.TokenSource`
instead of `oauth2.TokenSource`, eliminating the `authconv` round-trip.
- **`AzureClientSecretCredentials`**: removed the duplicate
`tokenSourceFor` (which used `clientcredentials.Config.TokenSource` with
inner caching). The former `authTokenSourceFor` -- which calls
`clientcredentials.Config.Token(ctx)` without inner caching -- is now
the sole `tokenSourceFor`.
- **`AzureCliCredentials.tokenSourceFor`**: returns an
`auth.TokenSource` that creates a fresh `azureCliTokenSource` per call,
properly threading the caller's context.
- **`AzureMsiCredentials.tokenSourceFor`**: wraps
`NewAzureMsiTokenSource` with `authconv.AuthTokenSource`.
- **GCP and M2M credential providers**: callers wrap their
`oauth2.TokenSource` with `authconv.AuthTokenSource` before passing to
the visitor functions.

## How is this tested?

- `TestAzureClientSecretCredentials_tokenSourceFor_noCaching`: calls
`Token()` N times and asserts N HTTP requests hit the token endpoint.
Fails when the fix is reverted (inner caching collapses N calls into 1).
- `TestAzureClientSecretCredentials_Configure`: happy-path test for the
full Azure client secret credential flow.
- All existing `config/` tests pass.

Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
@hectorcast-db hectorcast-db disabled auto-merge March 25, 2026 10:44
The m2m test file declared tokenCountingTransport which conflicts with
the same type in auth_azure_client_secret_test.go. Rename to
postCountingTransport since this variant only counts POST requests.

Co-authored-by: Isaac
Signed-off-by: Ubuntu <renaud.hartert@databricks.com>
@github-actions
Copy link
Copy Markdown

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-go

Inputs:

  • PR number: 1550
  • Commit SHA: ce9e0d6ff5c4a84163141fa5431bf9e07c8a7e3c

Checks will be approved automatically on success.

@renaudhartert-db renaudhartert-db added this pull request to the merge queue Mar 27, 2026
Merged via the queue into databricks:main with commit d269fdb Mar 27, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ISSUE] M2M OAuth: double-caching causes async token refresh to be ineffective until ~10s before expiry, causing 401 bursts

2 participants