[EXTENSION] Fall back to current pull's image ref for auth when shared base layer credentials are revoked #3

Open
prafgup wants to merge 1 commit into prafug/refresh-creds-on-check-error from prafulg/fallback-to-current-pull-image-ref

@prafgup prafgup commented Apr 7, 2026

Related to - awslabs#1927

Description Generated by AI -

Body:

Summary

Fix "layer unavailable" errors when pulling images that share base layers with a previously pulled image whose registry credentials have been revoked.

Problem

When using SOCI with short-lived registry credentials (e.g., CRI keychain with per-pull tokens), shared base layers become inaccessible after the original pull's credentials are revoked.

Root Cause

Consider two images sharing base layers in the same repository:

image:tag-A (pulled first, with SOCI index)       image:tag-B (pulled later)
┌──────────────────────┐                          ┌──────────────────────┐
│ Layer 5 (top/unique) │                          │ Layer 5 (top/unique) │
├──────────────────────┤                          ├──────────────────────┤
│ Layer 4 (base)       │───────── SHARED ─────────│ Layer 4 (base)       │
│ Layer 3 (base)       │───────── SHARED ─────────│ Layer 3 (base)       │
│ Layer 2 (base)       │───────── SHARED ─────────│ Layer 2 (base)       │
│ Layer 1 (base)       │───────── SHARED ─────────│ Layer 1 (base)       │
└──────────────────────┘                          └──────────────────────┘
creds: token-A                                    creds: token-B

  1. image:tag-A is pulled with SOCI → base layers are created as remote snapshots with label cri.image-ref=image:tag-A. CRI keychain stores token-A for image:tag-A.

  2. The pod using tag-A is terminated → token-A is revoked server-side, but the CRI keychain still holds it (only cleared on RemoveImage, not pod termination). The remote snapshots persist.

  3. image:tag-B is pulled → PullImage stores fresh token-B under image:tag-B. Containerd sees the shared base layers already exist as snapshots and reuses them.

  4. During extraction of tag-B's top layer, checkAvailability walks parent snapshots. Each parent has label cri.image-ref=image:tag-A → the check calls getSources, which reads this label → hosts("image:tag-A") → CRI keychain returns revoked token-A → 401 → "layer unavailable".

The fresh token-B (stored under image:tag-B) is never tried because the snapshot labels point to image:tag-A.
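The lookup problem can be illustrated with a toy model. This is only a sketch: the `keychain` type and `tokenForSnapshot` function are hypothetical stand-ins, not the CRI keychain or getSources code, and they assume credentials are keyed purely by image ref.

```go
package main

import "fmt"

// keychain is a hypothetical model of a credential store keyed by image ref.
type keychain map[string]string

// tokenForSnapshot resolves credentials from the snapshot's own label only,
// which mirrors the behavior described above.
func tokenForSnapshot(kc keychain, labels map[string]string) (string, bool) {
	tok, ok := kc[labels["cri.image-ref"]]
	return tok, ok
}

func main() {
	kc := keychain{
		"image:tag-A": "token-A", // revoked server-side, still cached
		"image:tag-B": "token-B", // fresh, stored by the current pull
	}
	// A shared base layer created during the pull of image:tag-A.
	sharedLayer := map[string]string{"cri.image-ref": "image:tag-A"}

	tok, _ := tokenForSnapshot(kc, sharedLayer)
	fmt.Println(tok) // prints "token-A": token-B is never consulted
}
```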

Failure Flow (before fix)

Prepare(extraction for image:tag-B)
  └─► mounts() → checkAvailability()
        └─► Walk parent snapshots (labeled image:tag-A)
              └─► fs.Check() → check()
                    1. l.Check() fails (stale fetcher with revoked token-A)
                    2. refreshLayer(labels with image:tag-A) → revoked token-A → 401
                    3. invalidateHosts → refreshLayer(image:tag-A) → still revoked → 401
                    4. ✗ No fallback: gives up
                    └─► "layer unavailable"

Fix

Pass the current pull's image ref through the existing context chain as a fallback. When the snapshot's original ref fails auth, retry with the current pull's ref, which has fresh credentials in the CRI keychain.

Approach: Context-based fallback (zero interface changes)

  • fs/source/source.go: Add WithFallbackImageRef/FallbackImageRef context helpers
  • snapshot/snapshot.go: In Prepare, store the current pull's cri.image-ref in context
  • fs/fs.go: In check(), after all retries with the original ref fail, extract the fallback ref from context and retry refreshLayer with it

Fixed Flow

Prepare(extraction for image:tag-B)
  ctx = WithFallbackImageRef(ctx, "image:tag-B")   ← NEW
  └─► mounts(ctx) → checkAvailability(ctx)
        └─► Walk parent snapshots (labeled image:tag-A)
              └─► fs.Check(ctx) → check(ctx)
                    1. l.Check() fails (stale fetcher)
                    2. refreshLayer(image:tag-A) → revoked → 401
                    3. invalidateHosts → refreshLayer(image:tag-A) → still 401
                    4. FallbackImageRef(ctx) = "image:tag-B"   ← NEW
                    5. refreshLayer(image:tag-B) → fresh token-B → 200 ✓
                    └─► Layer available
