fix: deletion of nodes#1919
Conversation
WalkthroughReformatted ETCD-related kubectl constants; node deletion now obtains etcd members via Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Operator
participant Deleter as deleteFromEtcd
participant Kubectl as Kubectl.Exec
participant Etcd as etcdctl
Operator->>Deleter: trigger node deletion
Deleter->>Kubectl: exec exportEtcdEnvsCmd && etcdctl member list -w json --hex=true
Kubectl->>Etcd: run "member list" (JSON, hex)
Etcd-->>Kubectl: JSON members
Kubectl-->>Deleter: stdout (members JSON)
Deleter->>Deleter: unmarshal JSON -> etcdMemberList
alt members found
loop for each Member in Members
Deleter->>Kubectl: exec exportEtcdEnvsCmd && etcdctl member remove <Member.Id>
Kubectl->>Etcd: run "member remove <Id>"
Etcd-->>Kubectl: removal result
Kubectl-->>Deleter: result
end
else no matching member
Deleter->>Deleter: log warning for missing node
end
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
🧰 Additional context used🧠 Learnings (2)📓 Common learnings📚 Learning: 2025-06-10T08:04:36.596ZApplied to files:
🔇 Additional comments (2)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
services/kuber/server/domain/utils/nodes/delete.go (2)
245-261: JSON parsing ofetcdctl member listis a clean improvement; watch for mixed stderr/stdoutThe new
getEtcdMembershelper is a clean abstraction and correctly surfaces both kubectl/exec errors and JSON parsing issues. One subtle point:KubectlExecEtcdcallsrunWithOutput, which explicitly usesCombinedOutput()(line 298 of internal/kubectl/kubectl.go), combining both stdout and stderr into a single byte stream. Since etcdctl 3.5.x writes deprecation warnings and other logging output to stderr while JSON goes to stdout, any such warnings will causejson.Unmarshalto fail. That's safer than mis‑parsing warnings as members, but if you start seeing frequent "failed to unmarshal etcd member list output" errors, consider an alternative that reads only etcdctl's stdout for this specific JSON call.
23-34: Add explicit JSON tags toetcdMemberListfor clarity and robustnessThe struct currently works due to Go's case-insensitive JSON field matching (falling back when no exact case match exists), but adding explicit tags improves clarity and future-proofs against changes. Replace the inline struct with a named type:
type etcdMember struct { Id string `json:"ID"` Name string `json:"name"` } type etcdMemberList struct { Members []etcdMember }This makes the JSON mapping explicit and self-documenting.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
internal/kubectl/kubectl.go(1 hunks)services/kuber/server/domain/utils/nodes/delete.go(4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-29T13:01:46.081Z
Learnt from: Despire
Repo: berops/claudie PR: 1803
File: internal/kubectl/kubectl.go:279-279
Timestamp: 2025-07-29T13:01:46.081Z
Learning: In kubectl wrapper functions in internal/kubectl/kubectl.go, the commands executed through bash -c are internally constructed using predefined templates, not user-supplied commands. Parameters like resource names and manifests come from internal codebase functions rather than external user input.
Applied to files:
internal/kubectl/kubectl.go
📚 Learning: 2025-08-13T13:49:44.424Z
Learnt from: Despire
Repo: berops/claudie PR: 1819
File: services/kuber/server/domain/utils/nodes/patch.go:22-24
Timestamp: 2025-08-13T13:49:44.424Z
Learning: In the Claudie codebase, kubectl commands have a linear backoff algorithm with up to 3 retries built-in (as indicated by MaxKubectlRetries: 3 in kubectl.Kubectl configuration).
Applied to files:
internal/kubectl/kubectl.go
📚 Learning: 2025-08-25T06:54:50.759Z
Learnt from: Despire
Repo: berops/claudie PR: 1838
File: services/kube-eleven/server/main.go:37-37
Timestamp: 2025-08-25T06:54:50.759Z
Learning: In services/kube-eleven/server/main.go, Despire prefers to use environment variable values (like KUBE_ELEVEN_CONCURRENT_CLUSTERS) directly without adding validation guards or clamping logic, keeping the behavior simple where env values are used "as is".
Applied to files:
internal/kubectl/kubectl.go
🧬 Code graph analysis (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
internal/kubectl/kubectl.go (1)
Kubectl(20-26)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: check-changes
🔇 Additional comments (1)
internal/kubectl/kubectl.go (1)
30-37: etcd command templates align with supported versions; behavior is correctDefining
getEtcdPodsCmdandexportEtcdEnvsCmdas constants keepsKubectlGetEtcdPodsandKubectlExecEtcdreadable. ThegetEtcdPodsCmdwith thegrep etcdpattern correctly filters for etcd pods (matching the standardetcd-<nodeName>naming convention in Kubernetes), and removingETCDCTL_API=3fromexportEtcdEnvsCmdis appropriate—etcd v3.5 and v3.4 use the v3 API by default, so the explicit export is unnecessary for supported clusters. Theetcdctl member removecommand correctly accepts the hex IDs produced byetcdctl member listas documented in etcd's membership management examples.
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
192-201: Optional: log unmatched masters and break after successful etcd member removalWithin
deleteFromEtcd, if a master’sk8sNamenever matches anymember.Name, we silently skip etcd removal for that node. Given this leaves a stale etcd member, a debug/info log would be useful for operators to spot name mismatches or drift (this was also noted in an earlier review).You could also break out of the inner loop once a match is found to avoid unnecessary scans:
- for _, node := range d.masterNodes { - for _, member := range etcd.Members { - if node.k8sName == member.Name { - d.logger.Debug().Msgf("Deleting etcd member %s, with hash %s", member.Name, member.Id) - etcdctlCmd := fmt.Sprintf("etcdctl member remove %s", member.Id) - if _, err := kc.KubectlExecEtcd(etcdPods[0], etcdctlCmd); err != nil { - return fmt.Errorf("error while executing \"etcdctl member remove\" on node %s, cluster %s: %w", member.Name, d.clusterPrefix, err) - } - } - } - } + for _, node := range d.masterNodes { + found := false + for _, member := range etcd.Members { + if node.k8sName != member.Name { + continue + } + found = true + d.logger.Debug().Msgf("Deleting etcd member %s, with ID %s", member.Name, member.Id) + etcdctlCmd := fmt.Sprintf("etcdctl member remove %s", member.Id) + if _, err := kc.KubectlExecEtcd(etcdPods[0], etcdctlCmd); err != nil { + return fmt.Errorf("error while executing \"etcdctl member remove\" on node %s, cluster %s: %w", member.Name, d.clusterPrefix, err) + } + break + } + if !found { + d.logger.Debug().Msgf("No etcd member found for master %s; skipping etcd removal", node.k8sName) + } + }Purely optional, but would make mismatches much easier to diagnose.
🧹 Nitpick comments (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
23-34: Structured etcd member parsing is a good move; please make JSON mapping explicit and confirm ID encodingSwitching from ad‑hoc text parsing to
etcdctl ... -w jsonwith a typedetcdMemberListis a solid improvement for robustness. A couple of follow‑ups:
- The etcd docs show
member list -w jsonreturning amembersarray with fieldsID(numeric) andname(string). (pkg.go.dev) Your struct expects a stringIdand assumes this is a hex‑encoded identifier. Because behavior of--hex=truewith JSON isn’t clearly documented, it’s worth explicitly verifying on the etcd versions you target that:
members[].IDis indeed emitted as a string (or thatmember removehappily accepts the decimal form you’re passing), and- the JSON shape (
members,ID,name) matches whatencoding/jsonis unmarshalling into here.- For maintainability, I’d also consider adding explicit JSON tags (and optionally a named member type) so the contract with etcdctl is clear and resilient to refactors, e.g.:
-type etcdMemberList struct { - Members []struct { - // Hex encoded id of the member within etcd. - Id string - - // Name of the node within the cluster. - Name string - } -} +type etcdMemberList struct { + Members []struct { + // Hex-encoded ID of the member within etcd, as printed by etcdctl. + ID string `json:"ID"` + + // Name of the node within the cluster. + Name string `json:"name"` + } `json:"members"` +} @@ - d.logger.Debug().Msgf("Deleting etcd member %s, with hash %s", member.Name, member.Id) - etcdctlCmd := fmt.Sprintf("etcdctl member remove %s", member.Id) + d.logger.Debug().Msgf("Deleting etcd member %s, with ID %s", member.Name, member.ID) + etcdctlCmd := fmt.Sprintf("etcdctl member remove %s", member.ID)(Adjust naming/ID format as needed once you’ve confirmed what
-w json --hex=trueactually returns on your clusters.)Also applies to: 257-272
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
services/kuber/server/domain/utils/nodes/delete.go(4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-13T14:22:14.668Z
Learnt from: Despire
Repo: berops/claudie PR: 1819
File: services/kuber/server/domain/utils/nodes/patch.go:230-237
Timestamp: 2025-08-13T14:22:14.668Z
Learning: In services/kuber/server/domain/utils/nodes/patch.go, the user Despire has observed that using "replace" operation for JSON-Patch on /spec/taints works fine on newly added taints in their testing environment, even though the JSON-Patch specification suggests "replace" should fail on missing fields.
Applied to files:
services/kuber/server/domain/utils/nodes/delete.go
📚 Learning: 2025-06-18T12:23:27.872Z
Learnt from: Despire
Repo: berops/claudie PR: 1770
File: services/builder/domain/usecases/kube_eleven_caller.go:15-18
Timestamp: 2025-06-18T12:23:27.872Z
Learning: In the claudie builder workflow, the terraform stage comes before the kube-eleven stage. If DNS building fails during terraform, the process wouldn't proceed to the kube-eleven stage. Therefore, nil-checks for ep.Dns are unnecessary in reconcileK8sCluster since the DNS is guaranteed to be properly built by that point in the workflow.
Applied to files:
services/kuber/server/domain/utils/nodes/delete.go
🧬 Code graph analysis (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
internal/kubectl/kubectl.go (1)
Kubectl(20-26)
🔇 Additional comments (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
182-189: Safer etcd pod parsing confirmed; Go 1.25 meetsstrings.SplitSeqrequirementThe rework of
getEtcdPodNamesto iteratestrings.SplitSeq, trim whitespace, drop empty lines, and explicitly error onlen(lines) == 0makes the subsequentetcdPods[0]access indeleteFromEtcdsafe and avoids empty pod names.The
go.moddirective is set togo 1.25, which meets and exceeds the Go 1.24+ requirement forstrings.SplitSeq. No toolchain adjustments needed.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
26-34: Add explicit json tags for clarity and correctness.The etcd JSON output uses specific casing (
members,ID,name). While Go'sjson.Unmarshalperforms case-insensitive fallback matching, adding explicit tags prevents subtle bugs and improves readability.type etcdMemberList struct { - Members []struct { + Members []struct { // Hex encoded id of the member within etcd. - Id string + Id string `json:"ID"` // Name of the node within the cluster. - Name string - } + Name string `json:"name"` + } `json:"members"` }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
services/kuber/server/domain/utils/nodes/delete.go(4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-13T14:22:14.668Z
Learnt from: Despire
Repo: berops/claudie PR: 1819
File: services/kuber/server/domain/utils/nodes/patch.go:230-237
Timestamp: 2025-08-13T14:22:14.668Z
Learning: In services/kuber/server/domain/utils/nodes/patch.go, the user Despire has observed that using "replace" operation for JSON-Patch on /spec/taints works fine on newly added taints in their testing environment, even though the JSON-Patch specification suggests "replace" should fail on missing fields.
Applied to files:
services/kuber/server/domain/utils/nodes/delete.go
📚 Learning: 2025-06-18T12:23:27.872Z
Learnt from: Despire
Repo: berops/claudie PR: 1770
File: services/builder/domain/usecases/kube_eleven_caller.go:15-18
Timestamp: 2025-06-18T12:23:27.872Z
Learning: In the claudie builder workflow, the terraform stage comes before the kube-eleven stage. If DNS building fails during terraform, the process wouldn't proceed to the kube-eleven stage. Therefore, nil-checks for ep.Dns are unnecessary in reconcileK8sCluster since the DNS is guaranteed to be properly built by that point in the workflow.
Applied to files:
services/kuber/server/domain/utils/nodes/delete.go
🧬 Code graph analysis (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
internal/kubectl/kubectl.go (1)
Kubectl(20-26)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: check-changes
🔇 Additional comments (2)
services/kuber/server/domain/utils/nodes/delete.go (2)
180-214: LGTM! Past review concerns addressed.The unsafe
etcdPods[0]access concern is now mitigated bygetEtcdPodNamesreturning an error when no pods are found (lines 260-262). The silent no-op concern is addressed with the warning log at lines 208-210 when a node is not found among etcd members.
267-283: LGTM! JSON-based parsing is a robust improvement.The shift from text parsing to JSON unmarshalling eliminates the risk of parsing etcd warning text that could be printed to stdout. The
--hex=trueflag ensures the member ID is returned as a hex string, matching the struct definition. Error handling is properly contextualized.
the eviction used to stuck when the kuber removes the master node from etcd members first
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
99-123:deleteFromEtcdremoves all masters on every iteration; sequencing contradicts "cordon/drain first" commentInside
DeleteNodes, you calld.deleteFromEtcd(kubectl, etcdEpNode)for eachmasterin a loop (line 116):for _, master := range d.masterNodes { // cordon + drain master if err := d.deleteFromEtcd(kubectl, etcdEpNode); err != nil { ... } // deleteNodesByName(master, ...) }But
deleteFromEtcditself loops over alld.masterNodes(line 210) and removes every matching etcd member in a single call. The net effect:
- On the first iteration, etcd membership for all masters is removed, even for those not yet cordoned/drained.
- Subsequent iterations re-scan members that no longer exist and log
"is not a member of etcd, ignoring"warnings.This contradicts the "IMPORTANT: first you have to cordon and drain, and only after that you can remove from etcd" comment and performs redundant etcdctl calls.
Pass the current master node to
deleteFromEtcdinstead of processing all masters:-func (d *Deleter) deleteFromEtcd(kc kubectl.Kubectl, etcdEpNode *spec.Node) error { +func (d *Deleter) deleteFromEtcd(kc kubectl.Kubectl, etcdEpNode *spec.Node, node nodeInfo) error {And call it with the current master at line 116:
- if err := d.deleteFromEtcd(kubectl, etcdEpNode); err != nil { + if err := d.deleteFromEtcd(kubectl, etcdEpNode, master); err != nil {Then update the inner loop in
deleteFromEtcd(line 210) to operate on the single passed node instead of iterating overd.masterNodes. This is safe—there is only one call site in the codebase.
🧹 Nitpick comments (2)
services/kuber/server/domain/utils/nodes/delete.go (2)
23-33: Make etcd JSON shape explicit with a named member type and tagsThe
etcdMemberListstruct and its anonymous inner struct rely on default, case-insensitive JSON field-name matching formembers/ID/name. This works but is a bit implicit and harder to reuse or test.Consider introducing a named type with explicit tags for clarity and future-proofing:
-type etcdMemberList struct { - Members []struct { - // Hex encoded id of the member within etcd. - Id string - - // Name of the node within the cluster. - Name string - } -} +type etcdMember struct { + // Hex-encoded ID of the member within etcd. + ID string `json:"ID"` + Name string `json:"name"` +} + +type etcdMemberList struct { + Members []etcdMember `json:"members"` +}And update the usage in
deleteFromEtcdaccordingly (e.g.member.IDinstead ofmember.Id).Also applies to: 280-295
132-141: Cordon/drain error handling is consistent for workers; optional: aggregate drain failures tooWorker handling mirrors the master path: cordon errors are accumulated into
errDelwhileKubectlDrainfailures are treated as fatal and abortDeleteNodes. That’s reasonable given the risk around draining.If you ever want “best-effort” deletion of multiple workers in one call, you could mirror the cordon pattern and aggregate drain failures into
errDelinstead of returning immediately, but that’s more of a product decision than a correctness issue.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
services/kuber/server/domain/utils/nodes/delete.go(6 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: Despire
Repo: berops/claudie PR: 1749
File: manifests/testing-framework/test-sets/test-set5/1.yaml:65-65
Timestamp: 2025-06-10T14:27:44.312Z
Learning: When Despire declines to address consistency issues like version string prefixes in a focused PR (e.g., kubeone updates), they prefer to handle such standardization in separate issues rather than expanding the scope of the current PR.
📚 Learning: 2025-08-13T14:22:14.668Z
Learnt from: Despire
Repo: berops/claudie PR: 1819
File: services/kuber/server/domain/utils/nodes/patch.go:230-237
Timestamp: 2025-08-13T14:22:14.668Z
Learning: In services/kuber/server/domain/utils/nodes/patch.go, the user Despire has observed that using "replace" operation for JSON-Patch on /spec/taints works fine on newly added taints in their testing environment, even though the JSON-Patch specification suggests "replace" should fail on missing fields.
Applied to files:
services/kuber/server/domain/utils/nodes/delete.go
📚 Learning: 2025-06-18T12:23:27.872Z
Learnt from: Despire
Repo: berops/claudie PR: 1770
File: services/builder/domain/usecases/kube_eleven_caller.go:15-18
Timestamp: 2025-06-18T12:23:27.872Z
Learning: In the claudie builder workflow, the terraform stage comes before the kube-eleven stage. If DNS building fails during terraform, the process wouldn't proceed to the kube-eleven stage. Therefore, nil-checks for ep.Dns are unnecessary in reconcileK8sCluster since the DNS is guaranteed to be properly built by that point in the workflow.
Applied to files:
services/kuber/server/domain/utils/nodes/delete.go
📚 Learning: 2025-06-10T14:27:44.312Z
Learnt from: Despire
Repo: berops/claudie PR: 1749
File: manifests/testing-framework/test-sets/test-set5/1.yaml:65-65
Timestamp: 2025-06-10T14:27:44.312Z
Learning: When Despire declines to address consistency issues like version string prefixes in a focused PR (e.g., kubeone updates), they prefer to handle such standardization in separate issues rather than expanding the scope of the current PR.
Applied to files:
services/kuber/server/domain/utils/nodes/delete.go
🧬 Code graph analysis (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
internal/kubectl/kubectl.go (1)
Kubectl(20-26)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: check-changes
🔇 Additional comments (1)
services/kuber/server/domain/utils/nodes/delete.go (1)
259-278: Improved etcd pod name parsing and bounds safety looks goodThe new
getEtcdPodNamesimplementation that iterates line-by-line, trims whitespace, filters out empty lines, and errors when no pods are found makes the subsequentetcdPods[0]access safe and avoids spurious empty entries from trailing newlines:var lines []string for line := range strings.SplitSeq(string(etcdPodsBytes), "\n") { if line := strings.TrimSpace(line); line != "" { lines = append(lines, line) } } if len(lines) == 0 { return nil, errors.New("no etcd pods found in cluster") } return lines, nilThis is a solid robustness improvement over the previous direct
strings.Splitusage.
supporting kubeAPI v1.32.x, v1.33.x, and v1.34.x
the newer version of actions-runner (2.329.x and later) don't have the BuildKit
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
services/kuber/Dockerfile (1)
6-7: Update to kubectl v1.33.0 from official distribution site aligns with PR objectives.The changes correctly update kubectl to v1.33.0 and switch to the official distribution URL (
dl.k8s.iorather thangoogleapis.com). This addresses the PR's goal to fix node deletion for newer kubectl versions.One minor note: the wget command lacks an explicit
-O kubectlflag to specify the output filename. While this works implicitly (wget will use the basenamekubectlfrom the URL and save to/go/kubectl, matching the subsequent COPY expectation), being explicit would improve clarity:-RUN KC_VERSION=v1.33.0 && \ - wget -q https://dl.k8s.io/release/${KC_VERSION}/bin/linux/$TARGETARCH/kubectl +RUN KC_VERSION=v1.33.0 && \ + wget -q -O kubectl https://dl.k8s.io/release/${KC_VERSION}/bin/linux/$TARGETARCH/kubectl
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
services/kuber/Dockerfile(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Despire
Repo: berops/claudie PR: 1749
File: manifests/testing-framework/test-sets/test-set5/1.yaml:65-65
Timestamp: 2025-06-10T14:27:44.312Z
Learning: When Despire declines to address consistency issues like version string prefixes in a focused PR (e.g., kubeone updates), they prefer to handle such standardization in separate issues rather than expanding the scope of the current PR.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: gotest
🔇 Additional comments (1)
services/kuber/Dockerfile (1)
39-42: Post-install steps appropriately prepare the runtime image.The multi-line RUN command correctly chains executable permissions, package updates, bash installation, and cache cleanup in a single layer. This follows Docker best practices for reducing image size while meeting the service's shell execution requirements (noted in the comment at line 48).
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
manifests/testing-framework/test-sets/proxy-without-hetzner/8.yaml (1)
31-44: Server type bump looks fine; just be mindful of cost/quota impact for this test set.
No functional concerns in the manifest itself; this will increase resources forgcp-cmpt-nodes, which may affect CI budget/quotas if these test-sets run frequently.
🧹 Nitpick comments (1)
manifests/testing-framework/test-sets/succeeds-on-last-2/2.yaml (1)
35-45: The intermediate test file uses an obviously-invalid placeholder OCID that differs from realistic patterns in other test-sets (e.g.,ami-00000000000000001). While this file is meant to fail and won't block the overall test progression to file 3.yaml, consider using a more realistic-looking invalid OCID or clarifying in the comment what specific failure mode is expected (image validation, availability, etc.) to improve debugging clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (50)
manifests/testing-framework/test-sets/autoscaling-1/1.yaml(1 hunks)manifests/testing-framework/test-sets/autoscaling-1/2.yaml(1 hunks)manifests/testing-framework/test-sets/autoscaling-1/3.yaml(1 hunks)manifests/testing-framework/test-sets/autoscaling-2/1.yaml(2 hunks)manifests/testing-framework/test-sets/autoscaling-2/2.yaml(2 hunks)manifests/testing-framework/test-sets/proxy-with-hetzner/1.yaml(2 hunks)manifests/testing-framework/test-sets/proxy-with-hetzner/2.yaml(2 hunks)manifests/testing-framework/test-sets/proxy-with-hetzner/3.yaml(2 hunks)manifests/testing-framework/test-sets/proxy-with-hetzner/4.yaml(2 hunks)manifests/testing-framework/test-sets/proxy-with-hetzner/5.yaml(2 hunks)manifests/testing-framework/test-sets/proxy-with-hetzner/6.yaml(2 hunks)manifests/testing-framework/test-sets/proxy-with-hetzner/7.yaml(2 hunks)manifests/testing-framework/test-sets/proxy-with-hetzner/8.yaml(3 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/1.yaml(1 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/10.yaml(1 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/2.yaml(1 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/3.yaml(1 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/4.yaml(1 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/5.yaml(1 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/6.yaml(1 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/7.yaml(1 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/8.yaml(1 hunks)manifests/testing-framework/test-sets/proxy-without-hetzner/9.yaml(1 hunks)manifests/testing-framework/test-sets/rolling-update-2/1.yaml(3 hunks)manifests/testing-framework/test-sets/rolling-update-2/2.yaml(2 hunks)manifests/testing-framework/test-sets/rolling-update/1.yaml(1 hunks)manifests/testing-framework/test-sets/rolling-update/2.yaml(1 hunks)manifests/testing-framework/test-sets/succeeds-on-last-1/1.yaml(1 hunks)manifests/testing-framework/test-sets/succeeds-on-last-1/2.yaml(1 hunks)manifests/testing-framework/test-sets/succeeds-on-last-2/1.yaml(1 hunks)manifests/testing-framework/test-sets/succeeds-on-last-2/2.yaml(1 hunks)manifests/testing-framework/test-sets/succeeds-on-last-2/3.yaml(1 hunks)manifests/testing-framework/test-sets/succeeds-on-last-4/1.yaml(1 hunks)manifests/testing-framework/test-sets/succeeds-on-last-4/2.yaml(1 hunks)manifests/testing-framework/test-sets/succeeds-on-last-4/3.yaml(1 hunks)manifests/testing-framework/test-sets/test-set1/1.yaml(8 hunks)manifests/testing-framework/test-sets/test-set1/2.yaml(7 hunks)manifests/testing-framework/test-sets/test-set2/1.yaml(1 hunks)manifests/testing-framework/test-sets/test-set2/2.yaml(1 hunks)manifests/testing-framework/test-sets/test-set2/3.yaml(1 hunks)manifests/testing-framework/test-sets/test-set3/1.yaml(3 hunks)manifests/testing-framework/test-sets/test-set3/2.yaml(2 hunks)manifests/testing-framework/test-sets/test-set3/3.yaml(1 hunks)manifests/testing-framework/test-sets/test-set3/4.yaml(1 hunks)manifests/testing-framework/test-sets/test-set4/1.yaml(1 hunks)manifests/testing-framework/test-sets/test-set4/2.yaml(1 hunks)manifests/testing-framework/test-sets/test-set4/3.yaml(1 hunks)manifests/testing-framework/test-sets/test-set5/1.yaml(1 hunks)manifests/testing-framework/test-sets/test-set5/2.yaml(1 hunks)manifests/testing-framework/test-sets/test-set5/3.yaml(1 hunks)
✅ Files skipped from review due to trivial changes (2)
- manifests/testing-framework/test-sets/proxy-with-hetzner/1.yaml
- manifests/testing-framework/test-sets/test-set3/4.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-07T15:17:32.154Z
Learnt from: m-brando
Repo: berops/claudie PR: 1857
File: manifests/testing-framework/test-sets/test-set1/1.yaml:70-78
Timestamp: 2025-10-07T15:17:32.154Z
Learning: In the Claudie repository, secrets referenced in test manifests under `manifests/testing-framework/test-sets/` (such as ovh-secret, gcp-secret, hetzner-secret, etc.) are provisioned via an external secret management system in Kubernetes and are not stored as static Secret manifests in the repository. Do not flag missing Secret manifests for these secretRef entries as they are managed externally at runtime.
Applied to files:
manifests/testing-framework/test-sets/proxy-with-hetzner/8.yamlmanifests/testing-framework/test-sets/proxy-with-hetzner/7.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run golangci-lint
- GitHub Check: build-and-push
🔇 Additional comments (61)
manifests/testing-framework/test-sets/succeeds-on-last-1/2.yaml (1)
18-27: Validate CI/e2e quota & intent fort3.mediumineu-west-3a.Bumping
serverTypetot3.mediumcan affect AWS capacity/quota and test cost; please confirm the e2e AWS account has sufficient vCPU limits ineu-west-3aand that this aligns with the broader test-set sizing changes in the PR.manifests/testing-framework/test-sets/succeeds-on-last-1/1.yaml (1)
24-24: Clarify the rationale for the instance type upgrade.The
serverTypehas been changed fromt3.smalltot3.medium. While the AI summary indicates this is part of a broader set of test-manifest updates, the connection to the PR's core objective (fixing node deletion for kubectl >= 1.33) is unclear.Is this instance size increase required for the node deletion fix to function correctly, or is it a separate infrastructure improvement? If the latter, consider tracking it as a distinct concern or PR to maintain focus on the node deletion fix.
manifests/testing-framework/test-sets/succeeds-on-last-4/2.yaml (1)
52-53: Server flavor bump looks fine; just confirm availability/quota/cost assumptions for Hetznercpx22.
Sincehtz-ctrl-nodesis used as both control+compute (and LB targets), this upsizes the whole cluster footprint for this test-set.manifests/testing-framework/test-sets/succeeds-on-last-4/3.yaml (1)
51-52: Server flavor bump looks fine; just confirm availability/quota/cost assumptions for Hetznercpx22.
This node pool backs control+compute (and LB targets), so the increase applies to the full test cluster.manifests/testing-framework/test-sets/succeeds-on-last-4/1.yaml (1)
24-25: Server flavor bump looks fine; just confirm availability/quota/cost assumptions for Hetznercpx22.manifests/testing-framework/test-sets/autoscaling-1/1.yaml (1)
44-44: Verify that thee2-mediumupgrade is necessary for Kubernetes 1.34.0 compatibility.The manifest upgrades the
gcp-cmpt-nodesserverTypefrome2-smalltoe2-medium. While the change is safe and the manifest structure is valid, this appears orthogonal to the core PR objective (fixing node deletion for kubectl >= 1.33). Clarify whether this upgrade is required for the new Kubernetes version or if it's part of a separate infrastructure improvement.manifests/testing-framework/test-sets/autoscaling-1/2.yaml (1)
42-42: Verify that thee2-mediumupgrade is necessary for Kubernetes 1.34.0 compatibility.The manifest upgrades the
gcp-cmpt-nodesserverTypefrome2-smalltoe2-medium. While the change is safe and the manifest structure is valid, this appears orthogonal to the core PR objective (fixing node deletion for kubectl >= 1.33). Clarify whether this upgrade is required for the new Kubernetes version or if it's part of a separate infrastructure improvement.manifests/testing-framework/test-sets/autoscaling-1/3.yaml (1)
44-44: Verify that thee2-mediumupgrade is necessary for Kubernetes 1.34.0 compatibility.The manifest upgrades the
gcp-cmpt-nodesserverTypefrome2-smalltoe2-medium. While the change is safe and the manifest structure is valid, this appears orthogonal to the core PR objective (fixing node deletion for kubectl >= 1.33). Clarify whether this upgrade is required for the new Kubernetes version or if it's part of a separate infrastructure improvement.manifests/testing-framework/test-sets/test-set5/3.yaml (1)
34-34: Instance type upgrade for compute nodes looks good.The serverType upgrade from t3.small to t3.medium for the aws-cmpt-nodes pool aligns with the broader test infrastructure improvements in this PR to support node deletion fixes for newer kubectl versions.
manifests/testing-framework/test-sets/test-set5/1.yaml (1)
34-34: Instance type upgrade for control plane nodes is appropriate.The serverType change from t3.small to t3.medium for aws-ctrl-nodes provides additional headroom for control plane operations and aligns with the infrastructure updates across test-set5 manifests in this PR.
manifests/testing-framework/test-sets/test-set5/2.yaml (1)
34-34: Control plane instance type upgrade is consistent.The serverType change from t3.small to t3.medium for aws-ctrl-nodes maintains consistency with the parallel test-set5/1.yaml and supports the enhanced node deletion handling required for kubectl >= 1.33.
manifests/testing-framework/test-sets/test-set4/2.yaml (1)
42-42: Configuration update approved.The serverType change from
VM.Standard1.1toVM.Standard1.2for the OCI node pool is a straightforward infrastructure upgrade. This aligns with similar updates in test-set4/1.yaml and test-set4/3.yaml.manifests/testing-framework/test-sets/test-set4/1.yaml (1)
24-24: Configuration update approved.The serverType change from
VM.Standard1.1toVM.Standard1.2is consistent with the test-set4 infrastructure update across all related manifests.manifests/testing-framework/test-sets/test-set4/3.yaml (1)
42-42: Configuration update approved.The serverType change from
VM.Standard1.1toVM.Standard1.2is consistent with the broader test-set4 infrastructure update.manifests/testing-framework/test-sets/test-set3/2.yaml (1)
33-34: Verify provider availability/cost impact of the larger instance types (cpx22, e2-medium).
This is a pure capacity bump, but it can break CI if the types aren’t available innbg1/europe-west1or if quotas/budget assumptions were tuned forcpx11/e2-small.Also applies to: 43-45
manifests/testing-framework/test-sets/test-set3/3.yaml (1)
51-54: Confirmcpx22is available for Hetzner innbg1and won’t regress test reliability/cost.manifests/testing-framework/test-sets/test-set3/1.yaml (1)
84-88: Double-check quotas/availability + LB pool sizing consistency.
- Ensure
t3.medium(AWSeu-west-2a),cpx22(Hetznernbg1), ande2-medium(GCPeurope-west1) won’t hit quota/capacity issues in CI.gcp-ldbl-nodesis stille2-small(Line 126) while other GCP pools aree2-medium—confirm that’s intentional (LB workload) vs an inconsistency with other updated test-set3 manifests.Also applies to: 95-98, 114-128
manifests/testing-framework/test-sets/autoscaling-2/1.yaml (1)
33-33: Verify the necessity of serverType upgrade for kubectl 1.33 compatibility.The
serverTypechanges fromt3.smalltot3.mediumare consistent across both control and compute node pools. However, the PR objectives focus on etcd API and node deletion logic changes. Please clarify the relationship between the kubectl 1.33 upgrade and the increased instance sizes, and consider documenting this rationale in the commit message or a comment within the manifest if the upgrade is resource-driven.Also applies to: 49-49
manifests/testing-framework/test-sets/autoscaling-2/2.yaml (1)
35-35: Consistency with autoscaling-2/1.yaml confirmed.The
serverTypechanges in this file are consistent with autoscaling-2/1.yaml, applying the same upgrade pattern across both control and compute pools. Note that this file uses template tagv0.9.14(vs.v0.9.12in the other file), which may indicate different test scenarios or progression stages.Also applies to: 51-51
manifests/testing-framework/test-sets/test-set2/1.yaml (1)
78-78: Test infrastructure configuration updated appropriately for K8s 1.34.0 testing.The serverType upgrade to cpx22 for htz-kube-nodes and the addition of installationProxy for the OCI cluster are consistent with the PR's objective to validate node deletion with newer kubectl versions (1.34.0). The manifest structure is valid and all references are properly configured.
Also applies to: 169-171
manifests/testing-framework/test-sets/test-set2/2.yaml (1)
87-87: Test infrastructure configuration updated appropriately for K8s 1.34.0 testing.The serverType upgrade to cpx22 for htz-kube-nodes and the addition of installationProxy for the OCI cluster maintain consistency across test-set2 manifests. The manifest is valid and properly configured for the newer Kubernetes version.
Also applies to: 179-181
manifests/testing-framework/test-sets/test-set2/3.yaml (1)
78-78: Test infrastructure configuration updated appropriately for K8s 1.34.0 testing.The serverType upgrade to cpx22 and installationProxy addition are consistent with the broader test-set2 updates. The manifest properly configures the test environment for validating node deletion with newer Kubernetes versions.
Also applies to: 170-172
manifests/testing-framework/test-sets/test-set1/2.yaml (2)
87-87: Verify serverType upgrades are aligned across test infrastructure.Instance type changes across all providers have been consistently upgraded (cpx11→cpx22, e2-small→e2-standard-4, etc.), which aligns with infrastructure scaling for newer Kubernetes versions. However, ensure these changes are mirrored across the parallel test-set1/1.yaml file and that the upgraded instance sizes provide adequate resources for the test scenarios.
Also applies to: 100-100, 127-127, 151-151, 162-162, 175-175, 236-236
265-267: Confirm intentionality: OCI cluster has installationProxy while others do not.The OCI cluster (ts1-oci-cluster-test-set-no1) is the only cluster in this file with an
installationProxyconfiguration (mode: "on"). Verify whether this is intentional or if installationProxy should be added to other clusters. Check test-set1/1.yaml for consistency on this configuration pattern.manifests/testing-framework/test-sets/test-set1/1.yaml (2)
87-87: Approve: serverType upgrades are consistent with test-set1/2.yaml.The instance type upgrades in this file align exactly with the changes in test-set1/2.yaml, indicating consistent test infrastructure scaling across both test-set configurations.
Also applies to: 100-100, 125-125, 152-152, 163-163, 176-176, 237-237
292-293: Discrepancy detected: installationProxy added to OVH cluster here, but not in test-set1/2.yaml.Line 292-293 adds
installationProxy: mode "on"to the OVH cluster (ts1-ovh-cluster-test-set-no1) in this file. However, the corresponding OVH cluster in test-set1/2.yaml does not have this configuration. Verify whether this is intentional (test-set1/1.yaml uses proxy, test-set1/2.yaml does not) or if the configuration should be mirrored in both files for consistency.manifests/testing-framework/test-sets/rolling-update/2.yaml (1)
92-92: Test infrastructure upgrade approved.The upgrade from
cpx11tocpx22aligns with the broader test infrastructure updates in this PR. Verify that this server type is available in the Hetzner region (nbg1) and that resource quotas permit this upgrade.manifests/testing-framework/test-sets/rolling-update/1.yaml (1)
92-92: Test infrastructure upgrade approved.Consistent with the parallel upgrade in
rolling-update/2.yaml, this Hetzner server type upgrade fromcpx11tocpx22is appropriate for the test infrastructure changes in this PR.manifests/testing-framework/test-sets/rolling-update-2/2.yaml (2)
34-34: Hetzner server type upgrade approved.The upgrade from
cpx11tocpx22is consistent with parallel changes in other rolling-update test-sets.
48-48: AWS instance type upgrade approved.The upgrade from
t3.smalltot3.mediumaligns with the broader AWS infrastructure upgrades documented across this PR's test-set changes.manifests/testing-framework/test-sets/rolling-update-2/1.yaml (3)
70-70: Hetzner server type upgrade approved.The upgrade of
htz-ctrl-nodesfromcpx11tocpx22is consistent with parallel infrastructure upgrades in this PR.
95-95: AWS instance type upgrade approved.The upgrade from
t3.smalltot3.mediumaligns with AWS infrastructure upgrades documented across this PR's test-set changes.
106-106: Hetzner server type upgrade approved.The upgrade of
htz-cmpt-nodesfromcpx11tocpx22is consistent with parallel infrastructure upgrades across rolling-update test-sets.manifests/testing-framework/test-sets/proxy-with-hetzner/2.yaml (2)
24-25: Server type bump looks consistent with the rest of the Hetzner proxy test-sets.
37-40: Server type bump looks consistent with the rest of the Hetzner proxy test-sets.manifests/testing-framework/test-sets/proxy-with-hetzner/7.yaml (2)
25-33: Pool rename + serverType update are internally consistent (definition matches later control reference).
63-65: Control pool reference correctly follows the renamed dynamic pool.manifests/testing-framework/test-sets/proxy-with-hetzner/5.yaml (2)
31-34: Hetzner serverType bump (cpx22) is straightforward.
45-48: GCP serverType bump (e2-medium) is straightforward.manifests/testing-framework/test-sets/proxy-with-hetzner/8.yaml (3)
25-33: Renamed control pool + serverType bump are consistent within the manifest.
40-49: Compute pool serverType bump is consistent with the rest of the Hetzner proxy set.
63-66: Control pool reference correctly follows the renamed pool.manifests/testing-framework/test-sets/proxy-with-hetzner/6.yaml (2)
31-34: Hetzner serverType bump (cpx22) is straightforward.
45-48: GCP serverType bump (e2-medium) is straightforward.manifests/testing-framework/test-sets/proxy-with-hetzner/3.yaml (2)
23-26: Server type bump looks consistent with the rest of the Hetzner proxy test-sets.
36-41: Server type bump looks consistent with the rest of the Hetzner proxy test-sets.manifests/testing-framework/test-sets/proxy-with-hetzner/4.yaml (2)
23-26: Server type bump looks consistent with the rest of the Hetzner proxy test-sets.
36-41: Server type bump looks consistent with the rest of the Hetzner proxy test-sets.manifests/testing-framework/test-sets/proxy-without-hetzner/10.yaml (1)
37-37: LGTM. Straightforward infrastructure update aligning with broader test-set sizing improvements.manifests/testing-framework/test-sets/proxy-without-hetzner/3.yaml (1)
37-37: LGTM. Consistent with broader infrastructure upgrades across proxy-without-hetzner test-sets.manifests/testing-framework/test-sets/proxy-without-hetzner/7.yaml (1)
37-37: LGTM. Consistent infrastructure upgrade pattern.manifests/testing-framework/test-sets/proxy-without-hetzner/6.yaml (1)
37-37: LGTM. Consistent sizing upgrade; proxy mode variation ("off") is intentional for test coverage.manifests/testing-framework/test-sets/proxy-without-hetzner/1.yaml (1)
37-37: LGTM. Consistent infrastructure upgrade across test-sets.manifests/testing-framework/test-sets/proxy-without-hetzner/5.yaml (1)
37-37: LGTM. Consistent sizing upgrade; proxy disabled for this test scenario.manifests/testing-framework/test-sets/proxy-without-hetzner/2.yaml (1)
37-37: LGTM. Consistent sizing upgrade; proxy disabled for test variation.manifests/testing-framework/test-sets/proxy-without-hetzner/4.yaml (1)
37-37: LGTM. Consistent sizing upgrade; proxy enabled for this test scenario.manifests/testing-framework/test-sets/proxy-without-hetzner/9.yaml (1)
31-44: Server type bump is OK; verify the intent is to standardize compute sizing across proxy-without-hetzner cases.
Given this is a test-set manifest, the only practical concern is higher resource usage (and potential quota pressure) in CI.manifests/testing-framework/test-sets/succeeds-on-last-2/1.yaml (1)
16-35: Verify Hetzner flavor availability/quotas aftercpx22bump.
serverType: cpx22(Line 24) is likely fine, but test stability will depend on whether the Hetzner account/region used by CI supports this flavor and has sufficient quota (and cost expectations).manifests/testing-framework/test-sets/succeeds-on-last-2/3.yaml (2)
26-34: Hetzner pool flavor bump looks consistent—just re-check CI capacity.
serverType: cpx22(Line 33) matches the broader test-set upgrades; please ensure CI credentials/quotas innbg1still cover this flavor.
35-44: Verify OCI shape availability for the test tenancy/AD.
serverType: VM.Standard1.2(Line 41) is an infra sizing change that can fail if the tenancy/region/AD lacks capacity or shape entitlement; worth validating foreu-frankfurt-1/AD-1.manifests/testing-framework/test-sets/succeeds-on-last-2/2.yaml (1)
26-34: Hetznercpx22change is fine; just validate quota/cost for the ordinary test set.
Since this is an e2e test-set manifest, ensure the Hetzner project used by CI has capacity forcpx22so runs don’t become flaky.
|
Closing because the deletion of the nodes works. |
Fixes deletion of nodes in newer version of kubectl (>= 1.33)
the
export ETCDCTL_API=3is no longer needed as the lowest supported kuberenetes version (1.32.x) ships withetcd version = 3.5.xand the env has been made as a default since version
3.4.xAlso changed the dumping of the etcd members as a JSON, instead of parsing raw
StdoutoutputThe issue with the approach in master was that etcd started to print warnings which would be parsed instead of the
Stdoutoutput, leading to panics in the kuber service.Summary by CodeRabbit
Bug Fixes
Refactor
Chores
Configuration
✏️ Tip: You can customize this high-level summary in your review settings.