Skip to content

[CASCL-1386] Add evict-legacy-nodes subcommand to drain non-Datadog node groups#3026

Draft
L3n41c wants to merge 24 commits into
mainfrom
lenaic/CASCL-1386-evict-legacy-nodes
Draft

[CASCL-1386] Add evict-legacy-nodes subcommand to drain non-Datadog node groups#3026
L3n41c wants to merge 24 commits into
mainfrom
lenaic/CASCL-1386-evict-legacy-nodes

Conversation

@L3n41c

@L3n41c L3n41c commented May 18, 2026

Copy link
Copy Markdown
Member

Summary

Adds kubectl datadog autoscaling cluster evict-legacy-nodes, which migrates workloads off non-Datadog-managed node groups (cluster-autoscaler ASGs, EKS managed node groups, user-created Karpenter NodePools, standalone EC2 instances) onto the Datadog-managed Karpenter NodePools created by install, then scales the legacy groups to zero.

Jira: CASCL-1386
Builds on: CASCL-1304

Highlights

  • Re-classifies the cluster via clusterinfo.Classify and updates the dd-cluster-info ConfigMap before any destructive work.
  • Scales the legacy cluster-autoscaler Deployment to 0 replicas as step 1 (opt-out via --skip-cluster-autoscaler).
  • Creates temporary maxUnavailable: 1 PodDisruptionBudgets for workloads without one. Cleanup is label-based (app.kubernetes.io/managed-by=kubectl-datadog + autoscaling.datadoghq.com/temporary-pdb=true), so a SIGKILL'd run is reaped by the next invocation.
  • Evicts in parallel by manager type, with errors aggregated rather than aborting other types:
    • ASG → cordon + evict + UpdateAutoScalingGroup(0,0,0) (only when all nodes drained — avoids AZ-rebalance and MinSize > DesiredCapacity hazards).
    • EKS managed node groupUpdateNodegroupConfig(0,0,0), then waits for the K8s nodes carrying eks.amazonaws.com/nodegroup=<name> to disappear so the EKS-side drain finishes before temp PDBs are removed (sentinel error `errEKSDrainIncomplete`).
    • Karpenter user NodePool → cordon + evict (NodePool spec left alone, by design).
    • Standalone EC2 → cordon + evict + TerminateInstances.
  • Pre-flight refuses when no Datadog-managed NodePool exists; warns on user NodePool weight conflicts that could cause evicted pods to land back on user-managed nodes.
  • Every K8s read-modify-write is wrapped in retry.RetryOnConflict.

Out of scope (by design)

  • Modifying user-managed Karpenter NodePool specs (only their existing nodes are drained).
  • Deleting ASGs, EKS managed node groups, or NodePools — the command scales to 0; ownership and deletion belong to whoever provisioned them (Terraform, Helm, eksctl, etc.).
  • Migrating Fargate-hosted pods (Fargate has its own lifecycle).

Test plan

  • Unit tests with -race (go test -race ./cmd/kubectl-datadog/autoscaling/cluster/evict/...).
  • make lint — 0 issues.
  • make kubectl-datadog builds.
  • Pre-commit review passed: Codex (6 rounds), code-reviewer agent (2 rounds), /simplify (2 rounds), cross-validation (1 cycle).
  • Smoke test on a real EKS cluster with a CA + ASG + EKS MNG + user Karpenter NodePool + workloads.

🤖 Generated with Claude Code

… node groups

Introduces `kubectl datadog autoscaling cluster evict-legacy-nodes`, which
drains workloads from cluster-autoscaler ASGs, EKS managed node groups,
user-created Karpenter NodePools and standalone EC2 instances onto the
Datadog-managed Karpenter NodePools created by `install`, then scales the
legacy groups to zero.

Highlights:
- Re-classifies the cluster (reuses `clusterinfo.Classify`) and updates the
  `dd-cluster-info` ConfigMap before any destructive work.
- Scales the legacy cluster-autoscaler Deployment to 0 replicas (opt-out
  via `--skip-cluster-autoscaler`).
- Creates temporary `maxUnavailable: 1` PodDisruptionBudgets for workloads
  without one; cleanup is label-based, so a SIGKILL'd run is reaped on the
  next invocation.
- Evicts in parallel by manager type:
    * ASG: cordon + evict + UpdateAutoScalingGroup(0,0,0) (only when all
      nodes drained; avoids AZ-rebalance and MinSize/DesiredCapacity
      hazards).
    * EKS managed node group: UpdateNodegroupConfig(0,0,0), then waits
      for the K8s nodes to disappear so the EKS-side drain finishes
      before temp PDBs are removed.
    * Karpenter user NodePool: cordon + evict (NodePool spec untouched).
    * Standalone EC2: cordon + evict + TerminateInstances.
- Pre-flight refuses when no Datadog-managed NodePool exists; warns on
  user NodePool weight conflicts.
- Every K8s read-modify-write wrapped in retry.RetryOnConflict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented May 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 61.95509% with 288 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.35%. Comparing base (20ecb9e) to head (e75e7bf).
⚠️ Report is 61 commits behind head on main.

Files with missing lines Patch % Lines
...d/kubectl-datadog/autoscaling/cluster/evict/run.go 15.78% 96 Missing ⚠️
...kubectl-datadog/autoscaling/cluster/evict/evict.go 0.00% 80 Missing ⚠️
...d/kubectl-datadog/autoscaling/cluster/evict/pdb.go 69.00% 45 Missing and 8 partials ⚠️
...tl-datadog/autoscaling/cluster/evict/evict_pods.go 78.94% 11 Missing and 5 partials ⚠️
...d/kubectl-datadog/autoscaling/cluster/evict/asg.go 71.79% 8 Missing and 3 partials ⚠️
...tl-datadog/autoscaling/cluster/evict/standalone.go 77.14% 5 Missing and 3 partials ⚠️
...atadog/autoscaling/cluster/evict/karpenter_user.go 54.54% 3 Missing and 2 partials ⚠️
...ctl-datadog/autoscaling/cluster/evict/preflight.go 85.29% 4 Missing and 1 partial ⚠️
.../kubectl-datadog/autoscaling/cluster/evict/plan.go 95.12% 2 Missing and 2 partials ⚠️
.../autoscaling/cluster/common/karpenter/fromnodes.go 0.00% 3 Missing ⚠️
... and 4 more

❌ Your patch status has failed because the patch coverage (61.95%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3026      +/-   ##
==========================================
+ Coverage   41.50%   44.35%   +2.84%     
==========================================
  Files         335      368      +33     
  Lines       28714    33373    +4659     
==========================================
+ Hits        11919    14802    +2883     
- Misses      16001    17625    +1624     
- Partials      794      946     +152     
Flag Coverage Δ
unittests 44.35% <61.95%> (+2.84%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ctl-datadog/autoscaling/cluster/common/aws/node.go 100.00% <100.00%> (ø)
...autoscaling/cluster/common/clusterinfo/classify.go 88.46% <100.00%> (-0.07%) ⬇️
...ubectl-datadog/autoscaling/cluster/evict/prompt.go 100.00% <100.00%> (ø)
cmd/kubectl-datadog/autoscaling/cluster/cluster.go 0.00% <0.00%> (ø)
...dog/autoscaling/cluster/evict/clusterautoscaler.go 88.23% <88.23%> (ø)
...ubectl-datadog/autoscaling/cluster/evict/cordon.go 86.66% <86.66%> (ø)
...bectl-datadog/autoscaling/cluster/evict/eks_mng.go 95.12% <95.12%> (ø)
.../autoscaling/cluster/common/karpenter/fromnodes.go 11.93% <0.00%> (ø)
.../kubectl-datadog/autoscaling/cluster/evict/plan.go 95.12% <95.12%> (ø)
...atadog/autoscaling/cluster/evict/karpenter_user.go 54.54% <54.54%> (ø)
... and 7 more

... and 78 files with indirect coverage changes


Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20ecb9e...e75e7bf. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented May 18, 2026

Copy link
Copy Markdown

Code Coverage

Fix all issues with BitsAI

🛑 Gate Violations

🎯 1 Code Coverage issue detected

A Patch coverage percentage gate may be blocking this PR.

Patch coverage: 60.76% (threshold: 80.00%)

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 60.76%
Overall Coverage: 44.48% (+2.66%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: e75e7bf | Docs | Datadog PR Page | Give us feedback!

Trailing alignment / extra blank line cleaned up by `gofmt -s` and
`golangci-lint --fix` on three of the evict-package test files. No
behavior change; restores `git diff --exit-code` after `make fmt` for
the `check_formatting` CI gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c L3n41c added the enhancement New feature or request label May 18, 2026
@L3n41c L3n41c added this to the v1.27.0 milestone May 18, 2026
Adds three small test files to lift `evict/` package coverage from 55%
to 66% so the patch-coverage gate passes:

- `preflight_test.go`: covers `warnKarpenterWeightConflicts` for no
  Karpenter targets, no conflict, equal-weight conflict, nil-weight
  defaulting to 0, and an unknown user-NodePool name.
- `prompt_test.go`: covers `printPlan` rendering of each section
  (CA, PDB, per-manager evictions, dry-run skips) and
  `promptConfirmation`'s y/N/yes parsing.
- `pdb_test.go`: covers `uniqueNodes` excluding EKS MNG entries and
  deduplicating shared node names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c

L3n41c commented May 18, 2026

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7ee9656a7e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/kubectl-datadog/autoscaling/cluster/evict/evict_pods.go Outdated
Comment thread cmd/kubectl-datadog/autoscaling/cluster/evict/pdb.go Outdated
L3n41c and others added 14 commits May 18, 2026 22:14
P1: `shouldSkipPod` collapsed two distinct concerns — "do not call Evict()
on this pod" (drainNode) and "this pod no longer occupies the node"
(waitForNodeEmpty). A pod with DeletionTimestamp set was treated as
absent, so for ASG/standalone targets the orchestrator could terminate
the EC2 instance before the container finished its grace period.
Splits the predicate:

- `shouldSkipEviction`: skip the Eviction call for DS / mirror /
  terminating / completed pods.
- `podOccupiesNode`: a terminating pod STILL occupies the node and
  must keep `waitForNodeEmpty` blocking. DS / mirror / completed pods
  don't.

P2: `uniqueNodes` no longer filters out EKS managed node group entries
when discovering controllers for temporary PDBs. The orchestrator now
blocks on `waitEKSNodegroupEmpty` before cleaning up the PDBs, so EKS
observes them throughout its drain. Excluding MNG nodes left a workload
whose every replica lived on a single MNG without any PDB protection;
EKS could then disrupt all replicas at once.

Test coverage:
- New `TestPodOccupiesNode` locks in the "terminating pods still
  occupy the node" semantics.
- `TestShouldSkipPod` renamed to `TestShouldSkipEviction`.
- `TestUniqueNodes_ExcludesEKSMNG` renamed to
  `TestUniqueNodes_IncludesAllManagerTypes` and inverted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…can dedup

`ExtractEC2InstanceID` and `LabelEKSNodegroup` were introduced in
`common/clusterinfo` so the evict package could reuse them, but two more
copies of the same `aws:///[^/]+/(i-[0-9a-f]+)$` regex remained inline:
in `common/clusterinfo/classify.go:197` and `common/karpenter/fromnodes.go:65`.

`common/clusterinfo` already imports `common/karpenter` (for the Karpenter
detection helpers), so karpenter can't reach back into clusterinfo without
producing an import cycle. Move the helper to a neutral home —
`common/aws/node.go` — that both can import. The unit test moves with the
function.

Net result: one regex definition, four call sites, three packages dedup'd.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing tests under `cmd/kubectl-datadog/autoscaling/cluster/` (e.g.
`install/install_test.go:TestValidate`, `apply/install_mode_test.go`,
`apply/inference_method_test.go`, `apply/create_karpenter_resources_test.go`)
all declare the table inline inside `for _, tc := range []struct{ … }{ … } {`
rather than via an intermediate `tests :=` variable.

Aligns four newly-added tests with that convention:
- `common/aws/node_test.go: TestExtractEC2InstanceID`
- `evict/evict_pods_test.go: TestShouldSkipEviction`, `TestPodOccupiesNode`
- `evict/plan_test.go: TestParseTargetSpec`
- `evict/pdb_test.go: TestTempPDBName`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`evictASG` extracted the EC2 instance ID and warned on an unexpected
providerID, but discarded the ID immediately afterwards. The check was
load-bearing in an earlier revision that terminated instances per-instance
via `TerminateInstanceInAutoScalingGroup`; after the round-2 Codex review
replaced that loop with a single `UpdateAutoScalingGroup(0,0,0)`, the ID
is no longer used here and the warning misleadingly suggested otherwise.

The `standalone` and `karpenter/fromnodes` call sites still use the
extracted ID, so the helper stays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tASG`

The five `TestEvictASG_*` functions shared the same setup (fake clientset
+ stubAutoscaling), the same call (`evictASG(ctx, client, stub, …)`) and
the same axes of assertion (`err`, `stub.scaledASGs`, optionally
`Spec.Unschedulable`). They only differed on the initial K8s objects,
whether an Eviction reactor was wired, and the drain timeouts.

Collapsing them into one `TestEvictASG` with an inline case slice keeps
the matrix of (objects × reactor × dry-run × wantErr × wantScaledASGs ×
wantUnschedulable) visible at a glance, aligns with the convention used
in `install/install_test.go`, `apply/install_mode_test.go`, etc., and
makes it cheap to add a new scenario.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same shape as the earlier `TestEvictASG` consolidation, applied to the
rest of the package. Tests that exercised the same target function with
different inputs are now grouped into one `Test<Function>` with an
inline case slice; one test per *function* rather than per *scenario*
collapses ~65 cousin functions down to ~25.

- `clusterautoscaler_test.go`: 5 → 1 (`TestScaleDownClusterAutoscaler`)
- `cordon_test.go`: 4 → 1 (`TestCordonNode`)
- `eks_mng_test.go`: 5 → 1 (`TestEvictEKSManagedNodeGroup`)
- `evict_pods_test.go`: 5 EvictPodWithRetry tests → 1 (`TestEvictPodWithRetry`)
  with a typed `evictionResponder` closure capturing each scenario's reactor
- `karpenter_user_test.go`: 2 → 1 (`TestEvictKarpenterUserNodePool`)
- `standalone_test.go`: 5 → 1 (`TestEvictStandalone`)
- `preflight_test.go`: 5 conflict tests → 1 (`TestWarnKarpenterWeightConflicts`)
- `prompt_test.go`: 4 printPlan + 2 promptConfirmation → 2 (`TestPrintPlan`,
  `TestPromptConfirmation`)
- `run_test.go`: 5 → 1 (`TestEvictAllTargetsParallel`)
- `plan_test.go`: 3 BuildPlan tests + the inline-subtests in
  `TestHasDatadogManagedNodePool` → `TestBuildPlan` + table-driven
  `TestHasDatadogManagedNodePool`
- `pdb_test.go`: per-helper consolidations
  (`TestUniqueNodes`, `TestIsTemporaryPDB`, `TestHasUserPDB`,
  `TestCleanupTempPDBs`, `TestCreateTempPDB`); `TestTempPDBName` was already
  table-driven; `TestDiscoverControllers_FiltersByNodeSet` left as a single
  scenario since the fixtures don't vary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hand-rolled for/return-true loop with the stdlib
`slices.ContainsFunc` (Go 1.21+). Same semantics, half the lines,
no extra dependency — `slices` is already imported elsewhere in the
plugin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e boolean expression

Both functions were cascades of `if cond { return true/false }` short-circuiting
the same way `||` and `&&` already do. The two predicates collapse to a single
return line each, which makes the asymmetry between the two — `||` includes
`p.DeletionTimestamp \!= nil`, `&&` doesn't — visible at a glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hand-rolled accumulator loop in `waitForNodeEmpty` with
`lo.CountBy`. Same semantics, fewer lines, intent visible at the
function name. The stdlib `slices` package has `ContainsFunc` /
`IndexFunc` but no `CountFunc`, so `lo.CountBy` is the right call —
`lo` is already a project dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the if/else cascade in evictPodWithRetry with a tagless switch
and move the retry-sleep into the default branch so the four dispatch
outcomes (success, 404, non-429 error, deadline) sit as peers next to
the retry path. Tightens listPodsOnNode with named returns.
PollInterval is an internal field with a single construction site
(run.go) that never set it, leaving two divergent defensive defaults
(2s in drainNode, 10s in waitEKSNodegroupEmpty) to paper over the gap.
Set it explicitly to 2s in run.go and remove both fallbacks; tests
still inject their own short intervals.
Drop the per-manager-type goroutines, mutex and WaitGroup in favour of
a plain loop over targets. The original rationale for parallelism was
asynchronous EKS draining (UpdateNodegroupConfig returning while EKS
worked in the background), but waitEKSNodegroupEmpty now blocks until
the EKS-managed nodes disappear, so every manager already runs
synchronously. Sequential execution yields linear logs, bounded
apiserver pressure, and easier debugging when a target fails.
Realign inline struct literals after recent field-list changes. Adding
fields whose name is longer than any existing one widens the column
gofmt computes for the whole literal; the previous commits introduced
such fields without re-running gofmt, so check_formatting on the CI
pipeline reported diffs in five test files. Pure whitespace, no
semantic change.
@L3n41c

L3n41c commented May 20, 2026

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 96d5f50e4f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/kubectl-datadog/autoscaling/cluster/evict/eks_mng.go Outdated
Comment thread cmd/kubectl-datadog/autoscaling/cluster/evict/run.go Outdated
L3n41c and others added 2 commits May 20, 2026 11:42
Two fixes triggered by Codex review of 96d5f50:

1. EKS managed node group: the API rejects `maxSize < 1`, so the previous
   `min=max=desired=0` Update would have failed before any drain started.
   DescribeNodegroup the target first, then preserve its current MaxSize
   while still zeroing min/desired (defaulting to 1 if the described
   scaling config is nil). The EKS-side drain behaviour is unchanged —
   desired=0 still triggers it — but the API call now succeeds.

2. Dry-run: both `clusterinfo.Persist` call sites in Run mutated the
   cluster-info ConfigMap unconditionally, before any dry-run gate. A
   preview must not require write RBAC on the Karpenter namespace and
   must not leave behind state; gate both calls on `\!opts.DryRun` and
   log the would-be persistence instead.

Test updates: extend stubEKS with DescribeNodegroup, add cases for
DescribeNodegroup failure (short-circuits Update) and for a nil
ScalingConfig (falls back to max=1).
Replace context.Background() with t.Context() across the eviction
test files. t.Context() (Go 1.24+) returns a context that is canceled
when the test completes, so any goroutine still using the context is
notified instead of leaking past the test boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@levan-m levan-m modified the milestones: v1.27.0, v1.28.0 Jun 5, 2026
@khewonc khewonc modified the milestones: v1.28.0, v1.29.0 Jun 10, 2026
L3n41c and others added 2 commits June 11, 2026 16:35
…elpers

- Embed a controllerKey (Namespace/Kind/Name) into controllerInfo and key the
  seen map on it with the selector as value, removing the stringly-typed key
  and the key/value redundancy.
- Key the per-kind caches on client.ObjectKey instead of string concatenation.
- Collapse getReplicaSet/getDeployment/getStatefulSet into one generic
  getCached[T] taking the namespace-bound typed accessor as a func value.
- Rewrite hasUserPDB with slices.ContainsFunc and formatSelector/discoverControllers
  map-to-slice loops with lo.MapToSlice; inline uniqueNodes via lo.FlatMap/SliceToMap.
- Inline throwaway locals in tempPDBName; drop the redundant client-side label
  re-check in cleanupTempPDBs (List already filters by MatchingLabels).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The pdb.go refactor (struct-key dedup + generic getCached) rewrote the
controllerInfo construction in every arm of resolveTopLevelController, but
TestDiscoverControllers_FiltersByNodeSet only exercised the
Deployment-via-ReplicaSet path, dropping patch coverage from 100% to 78%.

Extend the discovery test to also cover the StatefulSet arm, the bare
ReplicaSet arm (no owning Deployment), the default arm (DaemonSet -> no
PDB), and the orphan-pod path (no controller owner).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@L3n41c

L3n41c commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d77c9cc3a7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +93 to +95
if len(targets) == 0 {
display.PrintBox(streams.Out, "Nothing to evict — the cluster is already on Datadog-managed Karpenter NodePools.")
return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clean leaked temp PDBs before returning no-op

When a previous run leaves temporary PDBs in place because an EKS managed node group timed out, the documented recovery path is to rerun after EKS finishes. In that scenario classification now finds zero legacy targets, so this early return exits before cleanupTempPDBs runs, leaving the temporary maxUnavailable: 1 PDBs in the cluster indefinitely and potentially throttling later rollouts/disruptions. Please run the label-based cleanup on the no-target path when --ensure-pdbs is enabled.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 7093d74. The no-target early-return now runs the idempotent label-based cleanupTempPDBs (when --ensure-pdbs is set) before exiting, so temp PDBs leaked by a prior interrupted run are reclaimed once the cluster has fully migrated. Extracted into a reclaimLeakedTempPDBs helper with unit tests for the enabled/disabled/dry-run branches.

L3n41c and others added 3 commits June 11, 2026 17:40
Addresses Codex review on run.go: when a prior run was interrupted (e.g.
an EKS managed node group timed out, leaving its temporary PDBs in place
for a later rerun), the rerun finds the cluster fully migrated and exits
early via the no-target no-op — before the temp-PDB cleanup runs. The
leaked maxUnavailable:1 PDBs would then persist indefinitely and throttle
future rollouts/disruptions.

Run the idempotent, label-based cleanup on the no-target path when
--ensure-pdbs is enabled, via a small reclaimLeakedTempPDBs helper, and
unit-test all three branches (enabled/disabled/dry-run).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The patch_coverage gate (threshold 80%, carryforward:false) evaluates the
head commit's changed lines; reclaimLeakedTempPDBs' error-logging branch
was uncovered because cleanupTempPDBs never errored with the plain fake
client. Add a case using an interceptor client whose List fails, so the
swallow-and-log path is exercised, bringing the helper to 100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- buildAllPlan: replace the manual key-extract + sort + filter + append loop
  with lo.FilterMap over slices.Sorted(maps.Keys(bucket)).
- printPlan: iterate managers via slices.Sorted(maps.Keys(...)) and inline the
  grouping with lo.GroupBy, removing the now-trivial single-use groupByManager
  helper (and its test, which would only exercise lo.GroupBy).
- Drop the sort import in both files; add maps/slices (and lo in prompt.go).

Behavior-preserving: same sorted-by-name ordering within a manager, same
manager string ordering, same ManagedByDatadog filtering and Nodes defensive copy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants