[EXTENSION] Fall back to current pull's image ref for auth when shared base layer credentials are revoked #3
Open
prafgup wants to merge 1 commit into
6978f26 to ff3e95b
Related to - awslabs#1927
Description Generated by AI -
Body:
Summary
Fix "layer unavailable" errors when pulling images that share base layers with a previously pulled image whose registry credentials have been revoked.
Problem
When using SOCI with short-lived registry credentials (e.g., CRI keychain with per-pull tokens), shared base layers become inaccessible after the original pull's credentials are revoked.
Root Cause
Consider two images sharing base layers in the same repository:
image:tag-A (pulled first, with SOCI index)     image:tag-B (pulled later)
┌──────────────────────┐                        ┌──────────────────────┐
│ Layer 5 (top/unique) │                        │ Layer 5 (top/unique) │
├──────────────────────┤                        ├──────────────────────┤
│ Layer 4 (base)       │──────── SHARED ────────│ Layer 4 (base)       │
│ Layer 3 (base)       │──────── SHARED ────────│ Layer 3 (base)       │
│ Layer 2 (base)       │──────── SHARED ────────│ Layer 2 (base)       │
│ Layer 1 (base)       │──────── SHARED ────────│ Layer 1 (base)       │
└──────────────────────┘                        └──────────────────────┘
creds: token-A                                  creds: token-B
1. image:tag-A is pulled with SOCI → base layers are created as remote snapshots with label cri.image-ref=image:tag-A. CRI keychain stores token-A for image:tag-A.
2. The pod using tag-A is terminated → token-A is revoked server-side, but the CRI keychain still holds it (only cleared on RemoveImage, not pod termination). The remote snapshots persist.
3. image:tag-B is pulled → PullImage stores fresh token-B under image:tag-B. Containerd sees the shared base layers already exist as snapshots and reuses them.
4. During extraction of tag-B's top layer, checkAvailability walks parent snapshots. Each parent has label cri.image-ref=image:tag-A → the check calls getSources which reads this label → hosts("image:tag-A") → CRI keychain returns revoked token-A → 401 → "layer unavailable".
5. The fresh token-B (stored under image:tag-B) is never tried because the snapshot labels point to image:tag-A.
Failure Flow (before fix)
Prepare(extraction for image:tag-B)
└─► mounts() → checkAvailability()
└─► Walk parent snapshots (labeled image:tag-A)
└─► fs.Check() → check()
1. l.Check() fails (stale fetcher with revoked token-A)
2. refreshLayer(labels with image:tag-A) → revoked token-A → 401
3. invalidateHosts → refreshLayer(image:tag-A) → still revoked → 401
4. ✗ No fallback — gives up
└─► "layer unavailable"
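The failure flow above can be illustrated with a toy model. This is only a sketch under stated assumptions: the keychain type, fetchLayer, and checkParent are hypothetical stand-ins invented for illustration, not the snapshotter's or containerd's actual types. It shows that when the ref is taken only from the snapshot label, the fresh token stored under the current pull's ref is never consulted.

```go
package main

import (
	"errors"
	"fmt"
)

// keychain is a hypothetical stand-in for the CRI keychain: it maps an
// image ref to the token stored by the pull of that ref.
type keychain map[string]string

// errUnauthorized models the registry's 401 response.
var errUnauthorized = errors.New("401 unauthorized")

// fetchLayer models a registry request. It succeeds only if a token is
// stored for ref and that token has not been revoked server-side.
func fetchLayer(kc keychain, revoked map[string]bool, ref string) error {
	token, ok := kc[ref]
	if !ok || revoked[token] {
		return errUnauthorized
	}
	return nil
}

// checkParent models the pre-fix behavior: the ref comes only from the
// snapshot's label, so no other ref's credentials are ever tried.
func checkParent(kc keychain, revoked map[string]bool, snapshotRef string) error {
	if err := fetchLayer(kc, revoked, snapshotRef); err != nil {
		return fmt.Errorf("layer unavailable: %w", err)
	}
	return nil
}

func main() {
	kc := keychain{"image:tag-A": "token-A", "image:tag-B": "token-B"}
	revoked := map[string]bool{"token-A": true}

	// The shared base layer carries the label of the first pull, so the
	// check fails even though token-B is valid and present in the keychain.
	fmt.Println(checkParent(kc, revoked, "image:tag-A"))
}
```

In this model the only way to succeed is to look up credentials under image:tag-B, which the pre-fix code path never does for snapshots labeled image:tag-A.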
Fix
Pass the current pull's image ref through the existing context chain as a fallback. When the snapshot's original ref fails auth, retry with the current pull's ref, which has fresh credentials in the CRI keychain.
Approach: Context-based fallback (zero interface changes)
- fs/source/source.go: Add WithFallbackImageRef/FallbackImageRef context helpers
- snapshot/snapshot.go: In Prepare, store the current pull's cri.image-ref in context
- fs/fs.go: In check(), after all retries with the original ref fail, extract the fallback ref from context and retry refreshLayer with it
Fixed Flow
Prepare(extraction for image:tag-B)
ctx = WithFallbackImageRef(ctx, "image:tag-B") ← NEW
└─► mounts(ctx) → checkAvailability(ctx)
└─► Walk parent snapshots (labeled image:tag-A)
└─► fs.Check(ctx) → check(ctx)
1. l.Check() fails (stale fetcher)
2. refreshLayer(image:tag-A) → revoked → 401
3. invalidateHosts → refreshLayer(image:tag-A) → still 401
4. FallbackImageRef(ctx) = "image:tag-B" ← NEW
5. refreshLayer(image:tag-B) → fresh token-B → 200 ✓
└─► Layer available
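The fixed flow above might look roughly like the following sketch. The helper names WithFallbackImageRef and FallbackImageRef come from this description, but their bodies here are assumptions based on the standard context.WithValue pattern, not the PR's actual code. refreshFunc, checkWithFallback, and simulateCheck are hypothetical names introduced only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// fallbackKey is an unexported key type, so the context value cannot
// collide with keys set by other packages.
type fallbackKey struct{}

// WithFallbackImageRef returns a context carrying the current pull's ref.
func WithFallbackImageRef(ctx context.Context, ref string) context.Context {
	return context.WithValue(ctx, fallbackKey{}, ref)
}

// FallbackImageRef returns the ref stored in ctx, or "" if none was set.
func FallbackImageRef(ctx context.Context) string {
	ref, _ := ctx.Value(fallbackKey{}).(string)
	return ref
}

// refreshFunc is a hypothetical stand-in for refreshLayer: it tries to
// re-resolve a layer using the credentials stored for ref.
type refreshFunc func(ctx context.Context, ref string) error

// checkWithFallback tries the snapshot's original ref first and retries
// with the fallback ref from ctx only if the original ref fails.
func checkWithFallback(ctx context.Context, originalRef string, refresh refreshFunc) error {
	err := refresh(ctx, originalRef)
	if err == nil {
		return nil
	}
	fallback := FallbackImageRef(ctx)
	if fallback == "" || fallback == originalRef {
		return fmt.Errorf("layer unavailable: %w", err)
	}
	if fbErr := refresh(ctx, fallback); fbErr != nil {
		return fmt.Errorf("layer unavailable: %w", fbErr)
	}
	return nil
}

// simulateCheck checks a parent snapshot labeled image:tag-A whose token
// is revoked. An empty fallbackRef models the pre-fix context.
func simulateCheck(fallbackRef string) error {
	refresh := func(_ context.Context, ref string) error {
		if ref == "image:tag-A" {
			return errors.New("401 unauthorized")
		}
		return nil
	}
	ctx := context.Background()
	if fallbackRef != "" {
		ctx = WithFallbackImageRef(ctx, fallbackRef)
	}
	return checkWithFallback(ctx, "image:tag-A", refresh)
}

func main() {
	fmt.Println(simulateCheck("image:tag-B")) // fallback ref has a fresh token
	fmt.Println(simulateCheck(""))            // no fallback: the check still fails
}
```

Carrying the ref in the context keeps existing function signatures unchanged, which matches the "zero interface changes" goal stated above. The original ref is still tried first, so behavior is unchanged whenever its credentials remain valid.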