Conversation

@johnrichardrinehart commented Jan 8, 2026

Okay, straight up: Claude worked through all of this. I confirmed these patches with a build of k3d, and I can walk through any missing contextual information below that would support landing this patch, if it's deemed acceptable. I can confirm that it seems to solve the nasty problem below:

$ ./deploy/local-k8s.sh --fast 1
[INFO] Checking prerequisites...
[INFO] All prerequisites met!
[INFO] Starting background tasks...
[INFO] Podman network 'k3d' already exists
[INFO] Creating k3d cluster: foo with rancher/k3s:v1.31.6-k3s1
"kedacore" already exists with the same configuration, skipping
INFO[0000] portmapping '8443:443' targets the loadbalancer: defaulting to [servers:*:proxy agents:*:proxy] 
INFO[0000] portmapping '8080:80' targets the loadbalancer: defaulting to [servers:*:proxy agents:*:proxy] 
INFO[0000] Prep: Network                                
INFO[0000] Re-using existing network 'k3d' (731d924213a2cbd01d9dead162d08c4a2cb11bf3f271486d539bdd24e2233aab) 
INFO[0000] Created image volume k3d-foo-images     
INFO[0000] Starting new tools node...                   
ERRO[0000] Failed to run tools container for cluster 'foo' 
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "kedacore" chart repository
Update Complete. ⎈Happy Helming!⎈
INFO[0001] Creating node 'k3d-foo-server-0'        
INFO[0001] Creating LoadBalancer 'k3d-foo-serverlb' 
ERRO[0001] failed to ensure tools node: failed to run k3d-tools node for cluster 'foo': failed to create node 'k3d-foo-tools': runtime failed to create node 'k3d-foo-tools': failed to create container for node 'k3d-foo-tools': docker failed to create container 'k3d-foo-tools': Error response from daemon: make cli opts(): making volume mountpoint for volume /var/run/docker.sock: mkdir /var/run/docker.sock: permission denied 
ERRO[0001] Failed to create cluster >>> Rolling Back    
INFO[0001] Deleting cluster 'foo'                  
INFO[0001] Deleting 1 attached volumes...               
FATA[0001] Cluster creation FAILED, all changes have been rolled back! 

Below is Claude's generated analysis/summary:

Problem:

When creating a k3d cluster with rootless Podman, cluster creation fails with:

ERRO failed to ensure tools node: failed to retrieve cluster: No nodes found for given cluster

This is a race condition between EnsureToolsNode and ClusterCreate.

Root Cause:

In ClusterRun(), EnsureToolsNode was launched in a goroutine that ran in parallel with ClusterCreate:

g.Go(func() error {
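    // Starts immediately, racing the ClusterCreate call below: at this
    // point no cluster nodes exist for ClusterGet to find.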
    _, err := EnsureToolsNode(ctx, runtime, &clusterConfig.Cluster)
    return err
})

if err := ClusterCreate(...) { ... }

if err := g.Wait(); err != nil { ... }

EnsureToolsNode calls ClusterGet() to discover the cluster's network name and image volume from existing node labels. However, ClusterGet() queries the container runtime for nodes carrying the cluster label, and those nodes don't exist yet because ClusterCreate() hasn't created them.

With Docker, the timing usually works out because the Docker daemon is fast enough that nodes are queryable by the time EnsureToolsNode needs them. With rootless Podman (especially with crun), the container runtime is slightly slower, exposing this race condition reliably.

The error path is:

  1. ClusterRun starts EnsureToolsNode goroutine
  2. EnsureToolsNode calls ClusterGet to get cluster info
  3. ClusterGet calls runtime.GetNodesByLabel()
  4. GetNodesByLabel returns empty list (nodes don't exist yet)
  5. ClusterGet returns ClusterGetNoNodesFoundError
  6. EnsureToolsNode fails with "failed to retrieve cluster"
  7. ClusterCreate may succeed, but g.Wait() returns the error
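
To make steps 3–5 concrete, the failing lookup inside ClusterGet has roughly this shape (a paraphrased sketch, not the verbatim k3d source; k3d.LabelClusterName stands in for whatever label key the query actually uses):

nodes, err := runtime.GetNodesByLabel(ctx, map[string]string{k3d.LabelClusterName: cluster.Name})
if err != nil {
    return nil, err
}
if len(nodes) == 0 {
    // During the race, ClusterCreate hasn't created any labeled containers
    // yet, so the list comes back empty (step 4) and the sentinel error
    // from step 5 is returned.
    return nil, ClusterGetNoNodesFoundError
}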

Solution:

Move EnsureToolsNode to run sequentially after ClusterCreate completes. The tools node is only needed for later operations (image importing, etc.), so running it in parallel with container creation bought little beyond the race.

This ordering guarantees that cluster nodes exist and are queryable before EnsureToolsNode attempts to discover cluster metadata from them.
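
Concretely, the reordered flow looks like this (a minimal sketch in the same shorthand as the snippet above, not the literal diff):

if err := ClusterCreate(...) { ... }

// Nodes now exist and carry the cluster label, so the ClusterGet call
// inside EnsureToolsNode can find them.
if _, err := EnsureToolsNode(ctx, runtime, &clusterConfig.Cluster); err != nil {
    return fmt.Errorf("failed to ensure tools node: %w", err)
}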

The slight increase in cluster creation time (tools node creation is no longer parallelized) is negligible compared to the reliability improvement, especially for rootless container runtimes.

Testing:

Verified fix with:

  • Podman 5.7.0 (rootless, crun 1.24, cgroups v2)
  • k3d v5.8.3
  • k3s v1.31.6-k3s1

Before fix: Cluster creation fails ~100% of the time with rootless Podman
After fix: Cluster creation succeeds reliably

Fixes: #1312
Fixes: #1439
Related: #1284

Caveats

I have no idea whether it fixes the referenced issues; I honestly haven't looked at them. I'm just taking ten minutes to push a patch and offering to ping-pong a bit on a patch set that seems to have solved a problem for me.

Thanks for your consideration!
