Skip to content

Invalidate host creds in case of failure and retry#1925

Closed
prafgup wants to merge 3 commits into
awslabs:mainfrom
prafgup:prafulg/try-refresh-creds-on-fail
Closed

Invalidate host creds in case of failure and retry#1925
prafgup wants to merge 3 commits into
awslabs:mainfrom
prafgup:prafulg/try-refresh-creds-on-fail

Conversation

@prafgup
Copy link
Copy Markdown
Contributor

@prafgup prafgup commented Apr 2, 2026

Issue #, if available:

Related to #1924
Disclaimer - Coded by AI

Description of changes:

Testing performed:

Proof:

  • InvalidateRegistryHosts is real → integration test PASSES (container runs after credential rotation)
  • InvalidateRegistryHosts is no-op → integration test FAILS with layer "5" unavailable: unavailable (exact same error as your production issue)

The mechanism: Short-lived Bearer tokens expire → docker.Authorizer's cached handler tries fetchToken with old password baked in TokenOptions → token server rejects → error inside doBearerAuth (before AddResponses) → handler never deleted → Authorizer stuck → our fix invalidates the registryHostMap → fresh Authorizer with empty handlers → re-authenticates with new password from docker config → works.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

prafgup added 3 commits April 3, 2026 00:51
Unit tests:
- TestRegistryHostsCachesAndReturnsHosts: verifies cache returns same hosts
- TestInvalidateRegistryHostsForcesFreshHosts: verifies invalidation creates
  fresh AuthClient (fails when InvalidateRegistryHosts is a no-op)
- TestInvalidateIsNoOpForUnknownRef: no panic on unknown ref

Integration test:
- TestCredentialRefreshOnCheckFailure: sets up Bearer token auth registry with
  a custom token server, pulls image via SOCI with check_always=true, rotates
  the token server password, and verifies the container can still run after
  credential rotation.
Use 3-second token TTL so cached Bearer tokens expire during the test.
After expiry, the docker.Authorizer's cached handler tries to re-fetch
a token with old credentials baked into its TokenOptions. The token
server rejects the old password, causing doBearerAuth to fail without
calling AddResponses — so the handler is never deleted and the
Authorizer is stuck.

Without InvalidateRegistryHosts: layer "N" unavailable (permanent failure).
With InvalidateRegistryHosts: cache cleared, fresh Authorizer created,
new handler fetches token with new password, container runs.
@prafgup prafgup force-pushed the prafulg/try-refresh-creds-on-fail branch from 407a251 to aeb4d65 Compare April 2, 2026 23:54
@prafgup
Copy link
Copy Markdown
Contributor Author

prafgup commented Apr 3, 2026

@sondavidb I (+AI) was able to replicate the layer "5" unavailable: unavailable issue with the integration test here (very hacky, basically expire the custom token with TTL and fetch again to trigger check()). This fails without the change in this PR but actually re-auths (refreshes) when the auth is expired.

  • The main logic change is in fs.go where we invalidate the cache in case of error when refreshing and refresh again.

This in no way is ready for review. I just want your views on the FS cache invalidation logic :) (I will refine the PR once I have your views aligned, tho integration test is not really viable in this case without a lot of custom changes like in this PR)

@sondavidb
Copy link
Copy Markdown
Contributor

Idea LGTM, I'm happy to move forward with this. I do have some concerns in terms of credentials on a per-layer basis, as I believe we cache the credentials per layer as well. But I don't think this has to be comprehensive by any means — if we can refresh credentials on a snapshotter level that is already helping a ton of pain points for Kubernetes use cases.

@prafgup
Copy link
Copy Markdown
Contributor Author

prafgup commented Apr 3, 2026

Thanks, I will go ahead and refine the PR (by Monday, afk for some days) 😄

Note: Integration test is gonna be very complex with the current testing setup thus would not have any for now. I'll research a bit in future if theres any easier way of implementing a integration test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants