Version
- Karpenter controller: v1.8.1 (public.ecr.aws/karpenter/controller:1.8.1@sha256:41c28a606cbad86869384ff8ae8345203b63f81612b6fcfd2e136197dccc03ef)
- Helm chart: karpenter-1.8.1
- CRDs: ec2nodeclasses.karpenter.k8s.aws/v1, nodepools.karpenter.sh/v1
- Platform: EKS, us-east-1
Symptom
When an EC2NodeClass is deleted (via kubectl delete, or implicitly via helm uninstall on a chart that owns it), the resource enters a terminating state with deletionTimestamp set, but the karpenter.k8s.aws/termination finalizer is never released. The resource remains as a tombstone indefinitely.
Observed state on a stuck NodeClass:
metadata:
  deletionTimestamp: "2026-05-21T05:27:03Z"
  finalizers:
    - karpenter.k8s.aws/termination
  generation: 3
status:
  conditions:
    - type: Ready
      status: "True"
      reason: Ready
      observedGeneration: <empty> # Karpenter has stopped reconciling updates
Other observations:
- No NodeClaims reference the NodeClass at the time of (or after) deletion.
- No Nodes attributable to this NodeClass exist in the cluster.
- Subnets, security groups, and instance profile selectors are all valid, and the NodeClass was Ready=True immediately before the delete.
- The NodePool that referenced this NodeClass was deleted normally (no finalizer issues).
- Karpenter controller logs at the time of the delete contain no errors related to this NodeClass; the controller appears to never attempt finalizer release.
Once stuck, the only way out is manually patching the finalizer off:
kubectl patch ec2nodeclass <name> --type=merge -p '{"metadata":{"finalizers":null}}'
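The stuck state described above can be detected mechanically before it blocks a deploy. The following is a minimal sketch, not part of Karpenter: it assumes the objects come from `kubectl get ec2nodeclass -o json`, and the function names and the 10-minute grace period are illustrative choices rather than anything Karpenter defines.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

TERMINATION_FINALIZER = "karpenter.k8s.aws/termination"


def stuck_nodeclasses(items, now, grace=timedelta(minutes=10)):
    """Return names of objects that have been terminating longer than `grace`
    while still holding the Karpenter termination finalizer.

    `grace` is an arbitrary threshold chosen for this sketch.
    """
    stuck = []
    for obj in items:
        meta = obj.get("metadata", {})
        deleted_at = meta.get("deletionTimestamp")
        finalizers = meta.get("finalizers") or []
        if not deleted_at or TERMINATION_FINALIZER not in finalizers:
            continue
        # Kubernetes serializes timestamps in UTC, e.g. 2026-05-21T05:27:03Z
        started = datetime.strptime(deleted_at, "%Y-%m-%dT%H:%M:%SZ")
        started = started.replace(tzinfo=timezone.utc)
        if now - started > grace:
            stuck.append(meta["name"])
    return stuck


def load_nodeclasses():
    """Fetch all EC2NodeClass objects; assumes kubectl is configured for the cluster."""
    out = subprocess.run(
        ["kubectl", "get", "ec2nodeclass", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["items"]
```

Calling `stuck_nodeclasses(load_nodeclasses(), datetime.now(timezone.utc))` against a live cluster would list candidates for the manual finalizer patch. The sketch deliberately only reports names and does not apply the patch.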
Reproduction (intermittent, not 100% deterministic)
- Install a Helm chart that creates an EC2NodeClass and a NodePool referencing it.
- Let Karpenter provision a handful of nodes for the NodePool.
- Run workloads on those nodes briefly.
- Run helm uninstall on the chart. Helm deletes pods → nodes drain → NodeClaims are removed → both the NodePool and the EC2NodeClass receive delete requests.
- Observe: the NodePool deletes cleanly. The EC2NodeClass sits with deletionTimestamp set and the finalizer present, indefinitely.
Hit twice in our staging cluster during a deploy cutover, ~24 hours apart, on two different NodeClass instances (llamacloud-helm-unified, llamacloud-parse-helm-unified). On the first occurrence we also had an older orphaned NodeClass from April that had been stuck for 43 days under similar circumstances.
Impact
A stuck tombstoned EC2NodeClass blocks subsequent attempts to manage a NodeClass with the same name. Helm cannot meaningfully update it: Karpenter ignores spec updates on objects pending deletion, so observedGeneration stays empty even though generation advances on each helm upgrade. The NodePool reports NodeClassReady=False with reason NodeClassTerminating. Karpenter logs "ignoring nodepool, not ready" and refuses to provision. Workload pods sit Pending until an operator manually clears the finalizer.
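The NodePool side of this failure can be recognized from its status conditions alone. Here is a small sketch, assuming the NodePool has been loaded as a dict (for example from `kubectl get nodepool <name> -o json`). The function name is hypothetical; the condition type and reason are the ones observed above.

```python
def nodepool_blocked_by_terminating_nodeclass(nodepool):
    """Return True if the NodePool reports NodeClassReady=False
    with reason NodeClassTerminating."""
    conditions = nodepool.get("status", {}).get("conditions") or []
    for cond in conditions:
        if (
            cond.get("type") == "NodeClassReady"
            and cond.get("status") == "False"
            and cond.get("reason") == "NodeClassTerminating"
        ):
            return True
    return False
```

A check like this could back an alert, so that Pending pods caused by a tombstoned NodeClass are distinguished from ordinary capacity problems.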
Workaround
Annotate the EC2NodeClass and NodePool with helm.sh/resource-policy: keep so they survive helm uninstall. Updates via helm upgrade continue to work; only the delete path is avoided. This sidesteps the bug but is unsatisfying: kubectl delete against a healthy NodeClass should still complete cleanly.
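Because the workaround only helps if the annotation is actually present in the rendered chart, a lint step can guard against it being dropped. This is a rough sketch under stated assumptions: the manifests are already parsed into dicts (for example from `helm template` output via a YAML library of your choice), and the function name is hypothetical.

```python
KEEP_ANNOTATION = "helm.sh/resource-policy"
PROTECTED_KINDS = {"EC2NodeClass", "NodePool"}


def missing_keep_policy(manifests):
    """Return "Kind/name" for every EC2NodeClass or NodePool manifest
    that lacks the helm.sh/resource-policy: keep annotation."""
    missing = []
    for manifest in manifests:
        if manifest.get("kind") not in PROTECTED_KINDS:
            continue
        meta = manifest.get("metadata", {})
        annotations = meta.get("annotations") or {}
        if annotations.get(KEEP_ANNOTATION) != "keep":
            missing.append(f'{manifest["kind"]}/{meta.get("name", "<unnamed>")}')
    return missing
```

An empty result means both resource kinds would be left in place by helm uninstall; any entries point at manifests that would still go through the delete path.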
What I think is happening
Speculation, not verified: the finalizer release logic appears to be conditional on some signal that doesn't reliably fire when a NodeClass has produced zero (or has already cleaned up all) NodeClaims. The release path may be triggered by NodeClaim deletion events; if the NodeClaims were drained and removed before the NodeClass was marked for deletion, there's no further event to drive finalizer cleanup. Just a guess — the maintainers will know better.
Happy to provide controller logs from the time of the stuck deletion, or to reproduce with a fresh cluster if useful.