Skip to content

Commit 6aa737b

Browse files
devantlerclaude
andauthored
feat(k3d): reuse lowest-available agent index when scaling (#5312) (#5317)
* feat(talos): reuse lowest-available node index when provisioning (#5312) When provisioning a replacement node during `ksail cluster update`, KSail named the new node max(existing index)+1 instead of filling the lowest freed index. So after a lost node was removed (e.g. etcd remove-member + hcloud server delete), recreating it allocated a brand-new higher name (prod-control-plane-4) rather than reclaiming the freed one (prod-control-plane-2), drifting names ever higher over repeated failures. Replace the shared max+1 allocator (nextNodeIndexFromNames) with availableNodeIndices, which returns the next N suffixes preferring the lowest free indices: gaps within 1..max are reclaimed (lowest first) before the series extends past its highest used index. This applies to both control-plane and worker series, on Docker and Hetzner, across the scale-up (addDockerNodes/addHetznerNodes) and rolling-recreate (createReplacementServer) paths. For Docker, the reclaimed index also reclaims the freed node's static IP. Returning a per-node index list (rather than a single base index plus a running offset) keeps multi-node scale-ups that span a gap collision-free (e.g. existing {1,3}, add 2 -> {2,4}, not {2,3}). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * refactor(talos): collapse availableNodeIndices to a single gap-filling loop Code-review/simplify pass: the two-loop + maxIndex version reclaimed gaps within 1..max and then extended past max in a separate loop. Since `used` only ever holds existing indices, any index past the max is free by definition — so a single linear scan from 1 that takes each free index yields gaps-first-then-extend with no maxIndex bookkeeping and one fewer loop. Drops the scanUsedNodeIndices helper's second return value. Behaviour is unchanged; the existing availableNodeIndices table tests (lowest-gap reclaim, gap-then-extend, leading-gap, zero-count, etc.) all still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(k3d): reuse lowest-available agent index when scaling (#5312) Follow-up to #5313: k3d named new K3s agent nodes max(existing index)+1 (plus a per-iteration batch offset), so a removed agent left a gap that a later `ksail cluster update` scale-up never reclaimed — the same drift fixed for Talos in #5313. Generalize the Talos lowest-index allocator into clusterupdate.AvailableNodeIndices(names, prefix, base, count): it returns the next N suffixes preferring the lowest free indices (gaps within base..max reclaimed first, then the series extends), parameterised by base so it serves both the Talos 1-based control-plane/worker series and the k3d 0-based agent series. Talos's availableNodeIndices becomes a thin adapter over it (behaviour unchanged), keeping a single implementation so jscpd sees no duplication. k3d's addAgentNodes now lists agents once and computes all node names up front via agentNodeNames -> AvailableNodeIndices(..., base 0, ...). Besides gap-filling, this fixes a latent index-skip: the old code re-listed nodes each iteration AND added batchOffset = i, so once a freshly created node (k3d node create --wait) became visible to the next list call, both the list and the offset counted it (e.g. 0 agents -> create agent-0 -> next iter maxIndex 0 -> 0+1+1 = 2, skipping agent-1). Computing indices from one snapshot is correct regardless of list-visibility timing. Adds clusterupdate.AvailableNodeIndices tests (base 0 and base 1) and a k3d agentNodeNames table test (lowest-gap reclaim, gap-then-extend, leading-gap, prefix filtering, zero count). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent af349b2 commit 6aa737b

6 files changed

Lines changed: 337 additions & 95 deletions

File tree

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
package clusterupdate
2+
3+
import (
4+
"strconv"
5+
"strings"
6+
)
7+
8+
// AvailableNodeIndices returns the next `count` node-name indices to allocate for
9+
// a "<prefix><n>" naming series, preferring the lowest free indices. Gaps left by
10+
// removed nodes are reclaimed (lowest first) before the series extends past its
11+
// highest used index, so a recreated node reclaims a removed node's name instead
12+
// of drifting to an ever-higher index (#5312).
13+
//
14+
// base is the first index in the series: 1 for Talos control-plane/worker names
15+
// (e.g. "<cluster>-control-plane-1"), 0 for k3d agent names (e.g.
16+
// "k3d-<cluster>-agent-0"). Names without the prefix, or whose suffix is not an
17+
// integer >= base, are ignored.
18+
//
19+
// With base 1, existing indices {1, 3} and count 2 it returns [2, 4]; with no
20+
// matching names it returns [base, base+1, ..., base+count-1].
21+
func AvailableNodeIndices(names []string, prefix string, base, count int) []int {
22+
if count <= 0 {
23+
return []int{}
24+
}
25+
26+
used := usedNodeIndices(names, prefix, base)
27+
indices := make([]int, 0, count)
28+
29+
// Walk the series from base and take each free index. Gaps left by removed
30+
// nodes are reclaimed first; once they run out, every higher index is free
31+
// (nothing past the max is in `used`), so the series simply extends.
32+
for index := base; len(indices) < count; index++ {
33+
if !used[index] {
34+
indices = append(indices, index)
35+
}
36+
}
37+
38+
return indices
39+
}
40+
41+
// usedNodeIndices returns the set of numeric suffixes (>= base) in use among names
42+
// sharing the given prefix. Names without the prefix or without a parsable suffix
43+
// >= base are ignored.
44+
func usedNodeIndices(names []string, prefix string, base int) map[int]bool {
45+
used := make(map[int]bool, len(names))
46+
47+
for _, name := range names {
48+
suffix, found := strings.CutPrefix(name, prefix)
49+
if !found {
50+
continue
51+
}
52+
53+
index, err := strconv.Atoi(suffix)
54+
if err != nil || index < base {
55+
continue
56+
}
57+
58+
used[index] = true
59+
}
60+
61+
return used
62+
}
Lines changed: 161 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,161 @@
1+
package clusterupdate_test
2+
3+
import (
4+
"testing"
5+
6+
"github.com/devantler-tech/ksail/v7/pkg/svc/provisioner/cluster/clusterupdate"
7+
"github.com/stretchr/testify/assert"
8+
)
9+
10+
//nolint:funlen // Table-driven allocation cases are clearest enumerated inline.
11+
func TestAvailableNodeIndices(t *testing.T) {
12+
t.Parallel()
13+
14+
tests := []struct {
15+
name string
16+
names []string
17+
prefix string
18+
base int
19+
count int
20+
want []int
21+
}{
22+
// --- base 1 (Talos control-plane/worker series) ---
23+
{
24+
name: "base1 empty list starts at 1",
25+
prefix: "c-control-plane-",
26+
base: 1,
27+
count: 1,
28+
want: []int{1},
29+
},
30+
{
31+
name: "base1 empty list multiple nodes are contiguous",
32+
prefix: "c-control-plane-",
33+
base: 1,
34+
count: 3,
35+
want: []int{1, 2, 3},
36+
},
37+
{
38+
name: "base1 no matching prefix",
39+
names: []string{"other-control-plane-1", "c-worker-1"},
40+
prefix: "c-control-plane-",
41+
base: 1,
42+
count: 1,
43+
want: []int{1},
44+
},
45+
{
46+
name: "base1 contiguous extends past max",
47+
names: []string{"c-control-plane-1", "c-control-plane-2", "c-control-plane-3"},
48+
prefix: "c-control-plane-",
49+
base: 1,
50+
count: 1,
51+
want: []int{4},
52+
},
53+
{
54+
// Core #5312 behaviour at base 1.
55+
name: "base1 reclaims a freed middle index",
56+
names: []string{"c-control-plane-1", "c-control-plane-3"},
57+
prefix: "c-control-plane-",
58+
base: 1,
59+
count: 1,
60+
want: []int{2},
61+
},
62+
{
63+
name: "base1 fills gap then extends",
64+
names: []string{"c-worker-1", "c-worker-3"},
65+
prefix: "c-worker-",
66+
base: 1,
67+
count: 2,
68+
want: []int{2, 4},
69+
},
70+
{
71+
name: "base1 picks lowest of multiple gaps",
72+
names: []string{"c-worker-1", "c-worker-5"},
73+
prefix: "c-worker-",
74+
base: 1,
75+
count: 1,
76+
want: []int{2},
77+
},
78+
{
79+
name: "base1 ignores zero suffix below base",
80+
names: []string{"c-control-plane-0", "c-control-plane-1"},
81+
prefix: "c-control-plane-",
82+
base: 1,
83+
count: 1,
84+
want: []int{2},
85+
},
86+
{
87+
name: "base1 ignores non-numeric suffix",
88+
names: []string{"c-worker-abc", "c-worker-1"},
89+
prefix: "c-worker-",
90+
base: 1,
91+
count: 1,
92+
want: []int{2},
93+
},
94+
// --- base 0 (k3d agent series) ---
95+
{
96+
name: "base0 empty list starts at 0",
97+
prefix: "k3d-c-agent-",
98+
base: 0,
99+
count: 1,
100+
want: []int{0},
101+
},
102+
{
103+
name: "base0 empty list multiple nodes are contiguous from 0",
104+
prefix: "k3d-c-agent-",
105+
base: 0,
106+
count: 3,
107+
want: []int{0, 1, 2},
108+
},
109+
{
110+
name: "base0 reclaims a freed middle index",
111+
names: []string{"k3d-c-agent-0", "k3d-c-agent-2"},
112+
prefix: "k3d-c-agent-",
113+
base: 0,
114+
count: 1,
115+
want: []int{1},
116+
},
117+
{
118+
name: "base0 fills gap then extends",
119+
names: []string{"k3d-c-agent-0", "k3d-c-agent-2"},
120+
prefix: "k3d-c-agent-",
121+
base: 0,
122+
count: 2,
123+
want: []int{1, 3},
124+
},
125+
{
126+
name: "base0 reclaims leading gap at 0",
127+
names: []string{"k3d-c-agent-1", "k3d-c-agent-2"},
128+
prefix: "k3d-c-agent-",
129+
base: 0,
130+
count: 1,
131+
want: []int{0},
132+
},
133+
// --- guards ---
134+
{
135+
name: "zero count yields empty",
136+
names: []string{"c-worker-1"},
137+
prefix: "c-worker-",
138+
base: 1,
139+
count: 0,
140+
want: []int{},
141+
},
142+
{
143+
name: "negative count yields empty",
144+
prefix: "c-worker-",
145+
base: 1,
146+
count: -2,
147+
want: []int{},
148+
},
149+
}
150+
151+
for _, testCase := range tests {
152+
t.Run(testCase.name, func(t *testing.T) {
153+
t.Parallel()
154+
155+
got := clusterupdate.AvailableNodeIndices(
156+
testCase.names, testCase.prefix, testCase.base, testCase.count,
157+
)
158+
assert.Equal(t, testCase.want, got)
159+
})
160+
}
161+
}

pkg/svc/provisioner/cluster/k3d/export_test.go

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -69,6 +69,11 @@ func ParseClusterNamesForTest(output string) ([]string, error) {
6969
return parseClusterNames(output)
7070
}
7171

72+
// AgentNodeNamesForTest exposes agentNodeNames for unit testing.
73+
func AgentNodeNamesForTest(existing []string, clusterName string, count int) []string {
74+
return agentNodeNames(existing, clusterName, count)
75+
}
76+
7277
// ResolveNameForTest exposes resolveName for unit testing.
7378
func (k *Provisioner) ResolveNameForTest(name string) string {
7479
return k.resolveName(name)

pkg/svc/provisioner/cluster/k3d/update.go

Lines changed: 25 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -141,20 +141,21 @@ func (k *Provisioner) applyWorkerScaling(
141141
return err
142142
}
143143

144-
// addAgentNodes adds new agent nodes to the cluster.
144+
// addAgentNodes adds new agent nodes to the cluster, reclaiming the lowest
145+
// available index for each new node so a removed agent's name is reused rather
146+
// than always appending past the highest index (#5312).
145147
func (k *Provisioner) addAgentNodes(
146148
ctx context.Context,
147149
clusterName string,
148150
count int,
149151
result *clusterupdate.UpdateResult,
150152
) error {
151-
for i := range count {
152-
nodeName := fmt.Sprintf(
153-
"k3d-%s-agent-%d",
154-
clusterName,
155-
k.nextAgentIndex(ctx, clusterName, i),
156-
)
153+
existing, err := k.listAgentNodes(ctx, clusterName)
154+
if err != nil {
155+
return fmt.Errorf("failed to list agent nodes: %w", err)
156+
}
157157

158+
for _, nodeName := range agentNodeNames(existing, clusterName, count) {
158159
args := []string{
159160
nodeName,
160161
"--cluster", clusterName,
@@ -169,14 +170,14 @@ func (k *Provisioner) addAgentNodes(
169170
nodeRunner := runner.NewCobraCommandRunner(io.Discard, io.Discard)
170171
cmd := nodecommand.NewCmdNodeCreate()
171172

172-
_, err := nodeRunner.Run(ctx, cmd, args)
173-
if err != nil {
173+
_, runErr := nodeRunner.Run(ctx, cmd, args)
174+
if runErr != nil {
174175
result.FailedChanges = append(result.FailedChanges, clusterupdate.Change{
175176
Field: "k3d.agents",
176-
Reason: fmt.Sprintf("failed to create agent node %s: %v", nodeName, err),
177+
Reason: fmt.Sprintf("failed to create agent node %s: %v", nodeName, runErr),
177178
})
178179

179-
return fmt.Errorf("failed to create agent node %s: %w", nodeName, err)
180+
return fmt.Errorf("failed to create agent node %s: %w", nodeName, runErr)
180181
}
181182

182183
result.AppliedChanges = append(result.AppliedChanges, clusterupdate.Change{
@@ -382,39 +383,23 @@ func (k *Provisioner) listAgentNodes(
382383
return agents, nil
383384
}
384385

385-
// nextAgentIndex calculates the next agent index for naming.
386-
// It finds the maximum existing agent index to avoid naming collisions when there
387-
// are gaps in the index sequence, then adds 1 plus the batch offset.
388-
func (k *Provisioner) nextAgentIndex(
389-
ctx context.Context,
390-
clusterName string,
391-
batchOffset int,
392-
) int {
393-
agents, err := k.listAgentNodes(ctx, clusterName)
394-
if err != nil {
395-
return batchOffset
396-
}
397-
398-
if len(agents) == 0 {
399-
return batchOffset
400-
}
401-
402-
maxIndex := -1
386+
// agentNodeNames returns the names for `count` new agent nodes, reclaiming the
387+
// lowest-available index in the 0-based "k3d-<cluster>-agent-<n>" series before
388+
// extending past the highest existing index (#5312). All names are computed up
389+
// front from a single inventory snapshot: this keeps a multi-node scale-up's
390+
// indices distinct and gap-filling, and avoids the previous per-node re-list +
391+
// batch-offset scheme, which double-counted (and so skipped indices) once a
392+
// freshly created node became visible to the next list call.
393+
func agentNodeNames(existing []string, clusterName string, count int) []string {
403394
prefix := fmt.Sprintf("k3d-%s-agent-", clusterName)
395+
indices := clusterupdate.AvailableNodeIndices(existing, prefix, 0, count)
404396

405-
for _, name := range agents {
406-
idx, found := strings.CutPrefix(name, prefix)
407-
if !found {
408-
continue
409-
}
410-
411-
n, parseErr := strconv.Atoi(idx)
412-
if parseErr == nil && n > maxIndex {
413-
maxIndex = n
414-
}
397+
names := make([]string, len(indices))
398+
for i, index := range indices {
399+
names[i] = fmt.Sprintf("k3d-%s-agent-%d", clusterName, index)
415400
}
416401

417-
return maxIndex + 1 + batchOffset
402+
return names
418403
}
419404

420405
// GetCurrentConfig retrieves the current cluster configuration by probing the

0 commit comments

Comments
 (0)