[Bug] SOCI 2nd Image pull not working with layer unavailable error #1924

@prafgup

Description

Hey Devs,
(I have used AI for some analysis and writeup, but have confirmed that things are correct)

When using soci-snapshotter-grpc with a private registry that uses Basic auth (username and token) and short-lived credentials (supplied via the CRI keychain; we use k8s Secrets and remove them once the pod has finished), layers that were successfully or partially pulled become unavailable hours later. The checkAvailability health check fails because the snapshotter cannot re-authenticate once the original credentials have expired.

Environment

  • soci-snapshotter version: 0.12.1
  • Registry auth: Basic (WWW-Authenticate: Basic realm="XXXX")
  • Credential source: CRI keychain (cri_keychain.enable_keychain = true)
  • Container runtime: containerd

Behavior

First pull (works): Kubelet issues PullImage() → CRI keychain receives fresh credentials → SOCI artifacts fetched and layers mounted via FUSE successfully.

Hours later (fails): A new pod pulls the same image. checkAvailability runs on the existing FUSE-mounted layers, attempts to verify blob connectivity, receives 401, and tries to re-authenticate. Re-authentication fails with "failed to find supported auth scheme: not implemented", causing layers to be marked unavailable.

When a pod is launched and kubelet tries to pull the same image hours later on the same node, I see this error -

failed to create containerd container: error unpacking image: failed to prepare extraction snapshot 
"extract-86525.... sha256:c49c6f8114925036148a3596d382b44b770483093594ba21be9923ea3d51a71c": 
layer "177" unavailable: unavailable

My soci config -

[cri_keychain]
enable_keychain = true
image_service_path = "/run/containerd/containerd.sock"

Here's a rough timeline of the error -

  ┌─────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │    Time     │                                                              Event                                                               │
  ├─────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 15:04:51    │ Image pull starts, snapshots prepared. Credentials in CRI keychain.                                                              │
  ├─────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 15:05:19-22 │ Remote snapshots (FUSE) successfully prepared for layers with ztocs (snapshots 14, 16, 17, 20). Auth works here.                 │
  ├─────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 15:05:49    │ Background fetcher kicks in (~27 seconds after pull). Tries to fetch span 0 from layer sha256:18d8de.... Gets 401 Unauthorized.  │
  │             │ "failed to find supported auth scheme: not implemented". CRI keychain credentials are already gone.                              │
  ├─────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 15:05:51-58 │ More background fetch auth failures for layers sha256:17431c..., sha256:ea2a54..., sha256:ac273d...                              │
  ├─────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 18:06:43    │ New pod tries to use same image. checkAvailability() runs → tries to connect to blobs → 401 on all 4 remote layers → "layer      │
  │             │ unavailable" → container creation fails                                                                                          │
  └─────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Error logs

  {"level":"WARN","message":"check failed","params":{"error":"failed(layer:\"sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd\",
  ref:\"registry.example.com/image:tag\"): unable to create remote fetcher: failed to redirect: request to registry failed: failed to handle challenge: failed to find     
  supported auth scheme: not implemented (host \"registry.example.com\", ref:\"registry.example.com/image:tag\",
  digest:\"sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd\"): failed to refresh connection"}}                                                     
                                                                  
  {"level":"WARN","message":"layer is unavailable","params":{"error":"failed(layer:\"sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd\",            
  ref:\"registry.example.com/image:tag\"): unable to create remote fetcher: failed to redirect: request to registry failed: failed to handle challenge: failed to find
  supported auth scheme: not implemented (host \"registry.example.com\", ref:\"registry.example.com/image:tag\",                                                           
  digest:\"sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd\"): failed to refresh connection"}}

  {"level":"WARN","message":"failed to refresh the layer \"sha256:17431c26b529de6c07033da89d4c81ddd0b76bbec8f6edc10c11d670cb6ff1db\" from                                  
  \"registry.example.com/image:tag\"","params":{"error":"unable to create remote fetcher: failed to redirect: request to registry failed: failed to handle challenge:
  failed to find supported auth scheme: not implemented (host \"registry.example.com\", ref:\"registry.example.com/image:tag\",                                            
  digest:\"sha256:17431c26b529de6c07033da89d4c81ddd0b76bbec8f6edc10c11d670cb6ff1db\")"}}


  Background fetcher also fails with the same auth issue (I can disable it, but I don't think that is going to help much):
  {"level":"WARN","message":"error trying to resolve layer, removing it from the queue","params":{"error":"error trying to fetch span with spanId = 0 from layerDigest =
  sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd: failed to handle challenge: failed to find supported auth scheme: not implemented"}} 

I have checked the config, but I don't see any option that would help with this. I would appreciate your expertise here. The credentials we provide to the CRI keychain are short-lived and are removed/expired once the pod has finished. New credentials are injected when launching a new pod.

From AI -

  The full sequence:                                                                                                                                                       
                                                                  
  1. Pod A pulls image X → PullImage() intercepted → CRI keychain stores creds-A → RegistryManager creates AuthClient with creds-A, caches it permanently in               
  registryHostMap                                                 
  2. Hours pass → creds-A expire                                                                                                                                           
  3. Pod B pulls same image X → PullImage() intercepted → CRI keychain updated with fresh creds-B → but containerd sees image already exists, skips layer work
  4. Container starts → Mounts() → checkAvailability() → blob.Check() → 401                                                                                                
  5. Refresh path → getSources() → hosts(refspec) → cache hit → returns stale AuthClient with expired creds-A → re-auth fails → "layer is unavailable"                     
                                                                                                                                                                           
  The CRI keychain HAS fresh creds-B, but RegistryManager never asks for them again — it returns the cached RegistryHost from the first pull.
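
To make the suspected caching behaviour concrete, here is a minimal Go sketch of the pattern described in the sequence above. It is not SOCI's code: `Credentials`, `Keychain`, `HostCache`, `Get`, and `Refresh` are hypothetical names invented for illustration, and the cache stores bare credentials rather than a real auth client. It only shows how a cache that consults the keychain on a miss keeps returning creds-A, and how re-resolving on refresh would pick up creds-B.

```go
package main

import (
	"fmt"
	"sync"
)

// Credentials is a hypothetical username/secret pair.
type Credentials struct {
	Username string
	Secret   string
}

// Keychain is a hypothetical stand-in for the CRI keychain lookup.
type Keychain func(ref string) Credentials

// HostCache is a hypothetical per-ref cache, loosely modelled on the
// registryHostMap behaviour described above. It is not SOCI's actual type.
type HostCache struct {
	mu       sync.Mutex
	keychain Keychain
	entries  map[string]Credentials
}

func NewHostCache(kc Keychain) *HostCache {
	return &HostCache{keychain: kc, entries: make(map[string]Credentials)}
}

// Get returns the cached credentials for ref and consults the keychain only
// on a cache miss. This is the pattern that can hand back expired creds-A.
func (c *HostCache) Get(ref string) Credentials {
	c.mu.Lock()
	defer c.mu.Unlock()
	if creds, ok := c.entries[ref]; ok {
		return creds
	}
	creds := c.keychain(ref)
	c.entries[ref] = creds
	return creds
}

// Refresh ignores any cached entry for ref and asks the keychain again, so a
// caller that just saw a 401 picks up whatever credentials are current.
func (c *HostCache) Refresh(ref string) Credentials {
	c.mu.Lock()
	defer c.mu.Unlock()
	creds := c.keychain(ref)
	c.entries[ref] = creds
	return creds
}

func main() {
	current := Credentials{Username: "user", Secret: "creds-A"}
	cache := NewHostCache(func(string) Credentials { return current })

	ref := "registry.example.com/image:tag"
	fmt.Println(cache.Get(ref).Secret) // first pull: creds-A

	// The keychain now holds fresh credentials for the second pod.
	current = Credentials{Username: "user", Secret: "creds-B"}
	fmt.Println(cache.Get(ref).Secret)     // cache hit: still creds-A
	fmt.Println(cache.Refresh(ref).Secret) // re-resolved: creds-B
}
```

Whether the real refresh path can be changed this way depends on how the cached host entry is built, which I have not verified.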

Containerd tries Prepare() → SOCI's mounts() → checkAvailability() → blob.Check() → 401 → layer unavailable

More AI analysis, which sounds about right -

[Image]

Steps to reproduce the bug

No response

Describe the results you expected

The 2nd pod should launch successfully, and the layer should not be marked unavailable. The layer check should authenticate with the new creds if required.

Host information

  1. OS: x86_64
  2. Snapshotter Version: 0.12.1
  3. Containerd Version: v2.2.2

Any additional context or information about the bug

No response
