fix(cluster): resolve race condition causing "No nodes found" with rootless podman
#1640
Okay, straight up: claude worked through all of this. I confirmed these patches with a build of k3d, and I can walk through any contextual information missing below that would support landing this patch, if it's deemed acceptable. I can confirm that it seems to solve the nasty problem described below.
Below is claude's generated analysis/summary.
Problem:
When creating a k3d cluster with rootless Podman, cluster creation fails with:
ERRO failed to ensure tools node: failed to retrieve cluster: No nodes found for given cluster
This is a race condition between EnsureToolsNode and ClusterCreate.
Root Cause:
In ClusterRun(), EnsureToolsNode was launched in a goroutine that ran in parallel with ClusterCreate:
EnsureToolsNode calls ClusterGet() to discover the cluster's network name and image volume from existing node labels. However, ClusterGet() queries the container runtime for nodes with the cluster label - but those nodes don't exist yet because ClusterCreate() hasn't created them.
With Docker, the timing usually works out because the Docker daemon is fast enough that nodes are queryable by the time EnsureToolsNode needs them. With rootless Podman (especially with crun), the container runtime is slightly slower, exposing this race condition reliably.
The error path is:
Solution:
Move EnsureToolsNode to run sequentially after ClusterCreate completes. The tools node is only needed for later operations (image importing, etc.), so there's no benefit to running it in parallel with container creation.
This ordering guarantees that cluster nodes exist and are queryable before EnsureToolsNode attempts to discover cluster metadata from them.
The slight increase in cluster creation time (tools node creation is no longer parallelized) is negligible compared to the reliability improvement, especially for rootless container runtimes.
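The ordering change can be sketched with a toy in-memory runtime. This is an illustration under assumptions, not the actual patch: `fakeRuntime`, `createNodes`, `ensureTools`, `runToolsFirst`, and `runSequential` are hypothetical names. `runToolsFirst` replays, deterministically, the interleaving that the parallel launch allowed (the tools lookup winning the race), while `runSequential` mirrors the order proposed here.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeRuntime is a hypothetical in-memory stand-in for the container runtime.
type fakeRuntime struct {
	nodes []string
}

// createNodes registers cluster nodes, standing in for cluster creation.
func (r *fakeRuntime) createNodes(names ...string) {
	r.nodes = append(r.nodes, names...)
}

// ensureTools stands in for tools node setup: it needs existing nodes
// to read cluster metadata from, and fails if there are none.
func (r *fakeRuntime) ensureTools() error {
	if len(r.nodes) == 0 {
		return errors.New("failed to ensure tools node: no nodes found")
	}
	r.nodes = append(r.nodes, "tools")
	return nil
}

// runToolsFirst replays the losing interleaving of the parallel version:
// the tools lookup runs before any cluster node has been created.
func runToolsFirst(r *fakeRuntime) error {
	err := r.ensureTools()
	r.createNodes("server-0", "agent-0")
	return err
}

// runSequential creates the cluster nodes first, then the tools node.
func runSequential(r *fakeRuntime) error {
	r.createNodes("server-0", "agent-0")
	return r.ensureTools()
}

func main() {
	fmt.Println("tools first:", runToolsFirst(&fakeRuntime{}))
	fmt.Println("sequential: ", runSequential(&fakeRuntime{}))
}
```

In the sequential variant there is no schedule in which the lookup can observe an empty node list, so the outcome no longer depends on how fast the runtime is.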
Testing:
Verified fix with:
Before fix: Cluster creation fails ~100% of the time with rootless Podman
After fix: Cluster creation succeeds reliably
Fixes: #1312
Fixes: #1439
Related: #1284
Caveats
I have no idea if it fixes those referenced issues. I honestly haven't looked at them. I'm just taking 10m to push a patch and offering my willingness to ping-pong a bit on a patch set that seems to have solved a problem for me.
Thanks for your consideration!