[Bug] SOCI 2nd Image pull not working with layer unavailable error

### Description

Hey Devs,
(I have used AI for some analysis and writeup, but have confirmed that things are correct)

When using soci-snapshotter-grpc with a private registry that uses `Basic auth` (username and token) and short-lived credentials (via the CRI keychain; we use k8s `Secrets` and remove them once the pod has finished), layers that were successfully or partially pulled become unavailable hours later. The checkAvailability health check fails because the snapshotter cannot re-authenticate once the original credentials expire.

#### Environment                                                     
  - soci-snapshotter version: 0.12.1
  - Registry auth: Basic (WWW-Authenticate: Basic realm="XXXX")
  - Credential source: CRI keychain (cri_keychain.enable_keychain = true)                                                                                                  
  - Container runtime: containerd                                                                                                                                          
                                                                                                                                                                           
#### Behavior                                                                                                                                                                 
                                                                  
First pull (works): Kubelet issues PullImage() → CRI keychain receives fresh credentials → SOCI artifacts fetched and layers mounted via FUSE successfully.              

Hours later (fails): A new pod pulls the same image. checkAvailability runs on the existing FUSE-mounted layers, attempts to verify blob connectivity, receives a 401, and tries to re-authenticate. Re-authentication fails with "failed to find supported auth scheme: not implemented", causing the layers to be marked unavailable.
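To illustrate how a `Basic` challenge could end in a "no supported auth scheme" error rather than an explicit credentials error, here is a minimal Go sketch. It is not the soci-snapshotter or containerd code: `authorize`, `credentialFunc`, and `errNoScheme` are hypothetical names, and the fall-through behaviour is an assumption about one plausible failure mode, not something confirmed from the source.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// errNoScheme reuses the wording of the logged error; it is illustrative only.
var errNoScheme = errors.New("failed to find supported auth scheme: not implemented")

// credentialFunc is a hypothetical lookup that returns the username and
// secret currently known for a registry host (empty strings if none).
type credentialFunc func(host string) (user, secret string)

// authorize inspects a WWW-Authenticate challenge and, for the Basic scheme,
// builds an Authorization header value from whatever credentials are available.
func authorize(challenge, host string, creds credentialFunc) (string, error) {
	scheme, _, _ := strings.Cut(strings.TrimSpace(challenge), " ")
	if strings.EqualFold(scheme, "Basic") {
		user, secret := creds(host)
		if user != "" && secret != "" {
			token := base64.StdEncoding.EncodeToString([]byte(user + ":" + secret))
			return "Basic " + token, nil
		}
	}
	// Either the scheme is unknown or no usable credentials were found.
	// (Assumption: both cases surface as the same generic error.)
	return "", errNoScheme
}

func main() {
	challenge := `Basic realm="XXXX"`
	fresh := func(string) (string, string) { return "user", "token" }
	gone := func(string) (string, string) { return "", "" }

	header, err := authorize(challenge, "registry.example.com", fresh)
	fmt.Println("with credentials:", header, err)

	_, err = authorize(challenge, "registry.example.com", gone)
	fmt.Println("without credentials:", err)
}
```

If the real code path behaves like this, the "not implemented" wording would be a symptom of missing credentials rather than of the registry using an unsupported scheme.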



When a pod is launched and kubelet tries to pull the same image hours later on the same node, I see this error:

```
failed to create containerd container: error unpacking image: failed to prepare extraction snapshot 
"extract-86525.... sha256:c49c6f8114925036148a3596d382b44b770483093594ba21be9923ea3d51a71c": 
layer "177" unavailable: unavailable
```

My soci config - 
```
[cri_keychain]
enable_keychain = true
image_service_path = "/run/containerd/containerd.sock"
```


Here's a rough timeline of the error:

| Time | Event |
| --- | --- |
| 15:04:51 | Image pull starts, snapshots prepared. Credentials in CRI keychain. |
| 15:05:19-22 | Remote snapshots (FUSE) successfully prepared for layers with ztocs (snapshots 14, 16, 17, 20). Auth works here. |
| 15:05:49 | Background fetcher kicks in (~27 seconds after pull). Tries to fetch span 0 from layer sha256:18d8de.... Gets 401 Unauthorized. "failed to find supported auth scheme: not implemented". CRI keychain credentials are already gone. |
| 15:05:51-58 | More background fetch auth failures for layers sha256:17431c..., sha256:ea2a54..., sha256:ac273d... |
| 18:06:43 | New pod tries to use same image. checkAvailability() runs → tries to connect to blobs → 401 on all 4 remote layers → "layer unavailable" → container creation fails |


#### Error logs

```
  {"level":"WARN","message":"check failed","params":{"error":"failed(layer:\"sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd\",
  ref:\"registry.example.com/image:tag\"): unable to create remote fetcher: failed to redirect: request to registry failed: failed to handle challenge: failed to find     
  supported auth scheme: not implemented (host \"registry.example.com\", ref:\"registry.example.com/image:tag\",
  digest:\"sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd\"): failed to refresh connection"}}                                                     
                                                                  
  {"level":"WARN","message":"layer is unavailable","params":{"error":"failed(layer:\"sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd\",            
  ref:\"registry.example.com/image:tag\"): unable to create remote fetcher: failed to redirect: request to registry failed: failed to handle challenge: failed to find
  supported auth scheme: not implemented (host \"registry.example.com\", ref:\"registry.example.com/image:tag\",                                                           
  digest:\"sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd\"): failed to refresh connection"}}

  {"level":"WARN","message":"failed to refresh the layer \"sha256:17431c26b529de6c07033da89d4c81ddd0b76bbec8f6edc10c11d670cb6ff1db\" from                                  
  \"registry.example.com/image:tag\"","params":{"error":"unable to create remote fetcher: failed to redirect: request to registry failed: failed to handle challenge:
  failed to find supported auth scheme: not implemented (host \"registry.example.com\", ref:\"registry.example.com/image:tag\",                                            
  digest:\"sha256:17431c26b529de6c07033da89d4c81ddd0b76bbec8f6edc10c11d670cb6ff1db\")"}}


  Background fetcher also fails with the same auth issue (I can disable it, but I don't think that would help much):
  {"level":"WARN","message":"error trying to resolve layer, removing it from the queue","params":{"error":"error trying to fetch span with spanId = 0 from layerDigest =
  sha256:18d8de3e8af0f2fb1b2fca8bea9f20cc5d4c783192bcb6cad631b6fee0686bbd: failed to handle challenge: failed to find supported auth scheme: not implemented"}} 
```


I have checked the config, but I don't see any option that would help with this. I would appreciate your expertise here. The credentials we provide to the CRI keychain are short-lived and are removed or expire once the pod has finished. New credentials are injected when a new pod is launched.


From AI - 

```
  The full sequence:                                                                                                                                                       
                                                                  
  1. Pod A pulls image X → PullImage() intercepted → CRI keychain stores creds-A → RegistryManager creates AuthClient with creds-A, caches it permanently in               
  registryHostMap                                                 
  2. Hours pass → creds-A expire                                                                                                                                           
  3. Pod B pulls same image X → PullImage() intercepted → CRI keychain updated with fresh creds-B → but containerd sees image already exists, skips layer work
  4. Container starts → Mounts() → checkAvailability() → blob.Check() → 401                                                                                                
  5. Refresh path → getSources() → hosts(refspec) → cache hit → returns stale AuthClient with expired creds-A → re-auth fails → "layer is unavailable"                     
                                                                                                                                                                           
  The CRI keychain HAS fresh creds-B, but RegistryManager never asks for them again — it returns the cached RegistryHost from the first pull.

Containerd tries Prepare() → SOCI's mounts() → checkAvailability() → blob.Check() → 401 → layer unavailable
```
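The caching behaviour described in the analysis above can be modelled in a few lines of Go. This is a toy model, not the soci-snapshotter implementation: `keychain`, `registry`, `cachedClient`, and `lazyClient` are hypothetical stand-ins for the CRI keychain, the remote registry, and the two ways a client could obtain credentials (captured once at creation versus looked up on every call).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errUnauthorized = errors.New("401 unauthorized")

// keychain is a hypothetical stand-in for the CRI keychain: it holds the
// most recent credential seen for each registry host.
type keychain struct {
	mu    sync.Mutex
	creds map[string]string
}

func (k *keychain) set(host, cred string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.creds[host] = cred
}

func (k *keychain) get(host string) string {
	k.mu.Lock()
	defer k.mu.Unlock()
	return k.creds[host]
}

// registry is a fake registry that accepts exactly one credential at a time.
type registry struct{ valid string }

func (r *registry) check(cred string) error {
	if cred != r.valid {
		return errUnauthorized
	}
	return nil
}

// cachedClient copies the credential once, when it is created.
type cachedClient struct{ cred string }

func (c cachedClient) check(r *registry) error { return r.check(c.cred) }

// lazyClient asks the keychain for the current credential on every call.
type lazyClient struct {
	host string
	kc   *keychain
}

func (c lazyClient) check(r *registry) error { return r.check(c.kc.get(c.host)) }

// simulate replays the sequence: first pull with creds-A, rotation to
// creds-B, then an availability check from both kinds of client.
func simulate() (staleErr, lazyErr error) {
	const host = "registry.example.com"
	kc := &keychain{creds: map[string]string{}}
	reg := &registry{valid: "creds-A"}

	// Pod A: the first pull stores creds-A and both clients are created.
	kc.set(host, "creds-A")
	cached := cachedClient{cred: kc.get(host)}
	lazy := lazyClient{host: host, kc: kc}

	// Hours later: creds-A expire and Pod B's pull stores creds-B.
	reg.valid = "creds-B"
	kc.set(host, "creds-B")

	return cached.check(reg), lazy.check(reg)
}

func main() {
	staleErr, lazyErr := simulate()
	fmt.Println("cached client:", staleErr)
	fmt.Println("lazy client:", lazyErr)
}
```

In this model only the client that captured the credential at creation fails after rotation, which matches the shape of the reported symptom. Whether the real `RegistryManager` caches clients this way is the AI analysis's claim and has not been verified against the source here.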

More AI analysis but sounds about right - 

<img width="1652" height="690" alt="Image" src="https://github.com/user-attachments/assets/a4605031-e198-4bb8-9cb3-e6895dcd0aa7" />

### Steps to reproduce the bug

_No response_

### Describe the results you expected

The 2nd pod should launch successfully, and the layer should not be marked unavailable. The layer check should re-authenticate with the new credentials if required.
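As a rough sketch of the expected behaviour, the check could retry once with refreshed credentials when the cached ones are rejected. The Go example below uses only the standard library and a local test server. `blobChecker`, its `refresh` hook, and `demo` are hypothetical names introduced for illustration and do not correspond to existing soci-snapshotter APIs.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// blobChecker holds credentials cached from an earlier pull plus a
// hypothetical hook that returns whatever the keychain holds right now.
type blobChecker struct {
	client       *http.Client
	user, secret string
	refresh      func() (user, secret string)
}

// head sends one HEAD request using the currently cached credentials.
func (b *blobChecker) head(url string) (int, error) {
	req, err := http.NewRequest(http.MethodHead, url, nil)
	if err != nil {
		return 0, err
	}
	req.SetBasicAuth(b.user, b.secret)
	resp, err := b.client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// check retries exactly once with refreshed credentials after a 401.
func (b *blobChecker) check(url string) (int, error) {
	status, err := b.head(url)
	if err != nil || status != http.StatusUnauthorized {
		return status, err
	}
	b.user, b.secret = b.refresh()
	return b.head(url)
}

// demo runs the checker against a fake registry that only accepts creds-B
// and reports the final status and the number of requests the server saw.
func demo() (status int, requests int, err error) {
	var count atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		count.Add(1)
		user, secret, ok := r.BasicAuth()
		if !ok || user != "user" || secret != "creds-B" {
			w.Header().Set("WWW-Authenticate", `Basic realm="XXXX"`)
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	checker := &blobChecker{
		client:  srv.Client(),
		user:    "user",
		secret:  "creds-A", // expired credentials cached from the first pull
		refresh: func() (string, string) { return "user", "creds-B" },
	}
	status, err = checker.check(srv.URL + "/v2/image/blobs/sha256:example")
	return status, int(count.Load()), err
}

func main() {
	status, requests, err := demo()
	fmt.Println("status:", status, "requests:", requests, "err:", err)
}
```

The retry is limited to a single attempt so that a genuinely invalid credential still fails fast instead of looping.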

### Host information

1. OS: x86_64
2. Snapshotter Version: 0.12.1
3. Containerd Version: v2.2.2


### Any additional context or information about the bug

_No response_

[Bug] SOCI 2nd Image pull not working with layer unavailable error #1924
