
EC2NodeClass karpenter.k8s.aws/termination finalizer stuck indefinitely after deletion, even with zero NodeClaims #9185

@liukrimhrim

Version

  • Karpenter controller: v1.8.1 (public.ecr.aws/karpenter/controller:1.8.1@sha256:41c28a606cbad86869384ff8ae8345203b63f81612b6fcfd2e136197dccc03ef)
  • Helm chart: karpenter-1.8.1
  • CRDs: ec2nodeclasses.karpenter.k8s.aws/v1, nodepools.karpenter.sh/v1
  • Platform: EKS, us-east-1

Symptom

When an EC2NodeClass is deleted (via kubectl delete, or implicitly via helm uninstall on a chart that owns it), the resource enters terminating state with deletionTimestamp set, but the karpenter.k8s.aws/termination finalizer is never released. The resource remains as a tombstone indefinitely.

Observed state on a stuck NodeClass:

metadata:
  deletionTimestamp: "2026-05-21T05:27:03Z"
  finalizers:
  - karpenter.k8s.aws/termination
  generation: 3
status:
  conditions:
  - type: Ready
    status: "True"
    reason: Ready
  observedGeneration: <empty>   # Karpenter has stopped reconciling updates

Other observations:

  • No NodeClaims reference the NodeClass at the time of (or after) deletion.
  • No Nodes attributable to this NodeClass exist in the cluster.
  • Subnets, security groups, and instance profile selectors are all valid and the NodeClass was Ready=True immediately before the delete.
  • The NodePool that referenced this NodeClass was deleted normally (no finalizer issues).
  • Karpenter controller logs at the time of the delete contain no errors related to this NodeClass; the controller appears to never attempt finalizer release.
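The stuck state described above can be checked mechanically. The following is a minimal sketch, assuming the objects have been fetched with `kubectl get ec2nodeclass -o json` and `kubectl get nodeclaims -o json` and parsed into Python dicts. The function names are hypothetical helpers, not part of Karpenter or kubectl.

```python
from datetime import datetime, timezone

TERMINATION_FINALIZER = "karpenter.k8s.aws/termination"


def is_terminating_with_finalizer(nodeclass: dict) -> bool:
    """True if deletionTimestamp is set and the Karpenter finalizer is still present."""
    meta = nodeclass.get("metadata", {})
    finalizers = meta.get("finalizers") or []
    return bool(meta.get("deletionTimestamp")) and TERMINATION_FINALIZER in finalizers


def referencing_nodeclaims(nodeclass_name: str, nodeclaims: list) -> list:
    """Names of NodeClaims whose spec.nodeClassRef.name points at the NodeClass."""
    return [
        claim["metadata"]["name"]
        for claim in nodeclaims
        if claim.get("spec", {}).get("nodeClassRef", {}).get("name") == nodeclass_name
    ]


def seconds_terminating(nodeclass: dict, now: datetime) -> float:
    """Seconds elapsed between deletionTimestamp and `now` (timezone-aware UTC)."""
    raw = nodeclass["metadata"]["deletionTimestamp"]
    deleted_at = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (now - deleted_at).total_seconds()


# Sample object mirroring the observed state in this report.
stuck_nodeclass = {
    "metadata": {
        "name": "llamacloud-helm-unified",
        "deletionTimestamp": "2026-05-21T05:27:03Z",
        "finalizers": [TERMINATION_FINALIZER],
    }
}
```

A NodeClass for which `is_terminating_with_finalizer` is true, `referencing_nodeclaims` is empty, and `seconds_terminating` keeps growing matches the symptom reported here.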

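Once the resource is stuck, the only way out is to patch the finalizer off manually (note that this merge patch clears every finalizer on the object, not only the Karpenter one):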

kubectl patch ec2nodeclass <name> --type=merge -p '{"metadata":{"finalizers":null}}'
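A narrower variant of the manual patch removes only the Karpenter finalizer and leaves any others in place. The sketch below builds a JSON Patch (the `--type=json` form accepted by `kubectl patch`) with a `test` operation so the patch fails instead of removing the wrong entry if the list changed in the meantime. The helper names are hypothetical, and the NodeClass is assumed to be available as a parsed dict.

```python
import json
import shlex

TERMINATION_FINALIZER = "karpenter.k8s.aws/termination"


def finalizer_removal_patch(nodeclass: dict, finalizer: str = TERMINATION_FINALIZER) -> list:
    """Build JSON Patch operations that remove exactly one finalizer.

    Returns an empty list when the finalizer is not present.
    """
    finalizers = nodeclass.get("metadata", {}).get("finalizers") or []
    if finalizer not in finalizers:
        return []
    path = f"/metadata/finalizers/{finalizers.index(finalizer)}"
    return [
        # Guard: fail the patch if this index no longer holds the expected value.
        {"op": "test", "path": path, "value": finalizer},
        {"op": "remove", "path": path},
    ]


def kubectl_patch_command(name: str, patch: list) -> str:
    """Render the equivalent kubectl invocation as a shell-quoted string."""
    return "kubectl patch ec2nodeclass {} --type=json -p {}".format(
        shlex.quote(name), shlex.quote(json.dumps(patch))
    )
```

This is still a workaround: it bypasses whatever cleanup the finalizer was meant to guard, so it is only appropriate once no NodeClaims or instances remain.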

Reproduction (intermittent, not 100% deterministic)

  1. Install a Helm chart that creates an EC2NodeClass and a NodePool referencing it.
  2. Let Karpenter provision a handful of nodes for the NodePool.
  3. Run workloads on those nodes briefly.
  4. Run helm uninstall on the chart. Helm deletes the pods → nodes drain → NodeClaims are removed → Helm issues delete requests for both the NodePool and the EC2NodeClass.
  5. Observe: NodePool deletes cleanly. EC2NodeClass sits with deletionTimestamp set, finalizer present, indefinitely.

We hit this twice in our staging cluster during a deploy cutover, ~24 hours apart, on two different NodeClass instances (llamacloud-helm-unified, llamacloud-parse-helm-unified). At the first occurrence we also found an older orphaned NodeClass from April that had been stuck for 43 days under similar circumstances.

Impact

A stuck tombstoned EC2NodeClass blocks subsequent attempts to manage a NodeClass with the same name. Helm cannot meaningfully update it: Karpenter ignores spec updates on objects pending deletion, so observedGeneration stays empty even though generation advances on each helm upgrade. The NodePool reports NodeClassReady=False with reason NodeClassTerminating. Karpenter logs "ignoring nodepool, not ready" and refuses to provision. Workload pods sit Pending until an operator manually clears the finalizer.
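The blocked NodePools can be found by scanning their status conditions for the combination described above. A minimal sketch, assuming the `items` list from `kubectl get nodepools -o json` has been parsed into dicts; the function name is a hypothetical helper.

```python
def nodepools_blocked_by_terminating_nodeclass(nodepools: list) -> list:
    """Return (nodepool name, referenced NodeClass name) pairs for NodePools
    reporting NodeClassReady=False with reason NodeClassTerminating."""
    blocked = []
    for pool in nodepools:
        conditions = pool.get("status", {}).get("conditions") or []
        for cond in conditions:
            if (
                cond.get("type") == "NodeClassReady"
                and cond.get("status") == "False"
                and cond.get("reason") == "NodeClassTerminating"
            ):
                ref = (
                    pool.get("spec", {})
                    .get("template", {})
                    .get("spec", {})
                    .get("nodeClassRef", {})
                    .get("name")
                )
                blocked.append((pool["metadata"]["name"], ref))
                break  # one matching condition per NodePool is enough
    return blocked
```

Any pair returned here points at a NodeClass name worth checking for a lingering deletionTimestamp and finalizer.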

Workaround

Annotate the EC2NodeClass and NodePool with helm.sh/resource-policy: keep so that they survive helm uninstall. Updates via helm upgrade continue to work; only the delete path is avoided. This sidesteps the bug but is unsatisfying, because kubectl delete against a healthy NodeClass should still complete cleanly.

What I think is happening

Speculation, not verified: the finalizer release logic appears to be conditional on some signal that doesn't reliably fire when a NodeClass has produced zero (or has already cleaned up all) NodeClaims. The release path may be triggered by NodeClaim deletion events; if the NodeClaims were drained and removed before the NodeClass was marked for deletion, there's no further event to drive finalizer cleanup. Just a guess — the maintainers will know better.
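To make the hypothesis concrete, here is a toy model of the suspected ordering problem. It is not Karpenter code and makes no claim about how the real controller is written; every class and method name is invented for illustration. One controller re-evaluates the finalizer only when a NodeClaim deletion event arrives, the other also re-evaluates when the NodeClass itself is marked for deletion.

```python
class ToyNodeClass:
    """Stand-in for an EC2NodeClass with a single finalizer."""

    def __init__(self, name: str):
        self.name = name
        self.deleting = False
        self.finalizer_present = True


class EventOnlyController:
    """Hypothetical: checks for finalizer release only on NodeClaim deletion events."""

    def __init__(self, nodeclass: ToyNodeClass, nodeclaims: set):
        self.nodeclass = nodeclass
        self.nodeclaims = set(nodeclaims)

    def on_nodeclaim_deleted(self, claim: str) -> None:
        self.nodeclaims.discard(claim)
        self._maybe_release()

    def on_nodeclass_deleted(self) -> None:
        # Marks the object as terminating but deliberately performs no release check.
        self.nodeclass.deleting = True

    def _maybe_release(self) -> None:
        if self.nodeclass.deleting and not self.nodeclaims:
            self.nodeclass.finalizer_present = False


class RecheckingController(EventOnlyController):
    """Hypothetical fix: also checks when the NodeClass itself is deleted."""

    def on_nodeclass_deleted(self) -> None:
        super().on_nodeclass_deleted()
        self._maybe_release()


def finalizer_left_after_teardown(controller_cls) -> bool:
    """Replay the reported order: all NodeClaims go first, then the NodeClass is deleted."""
    nodeclass = ToyNodeClass("demo")
    controller = controller_cls(nodeclass, {"claim-a", "claim-b"})
    controller.on_nodeclaim_deleted("claim-a")
    controller.on_nodeclaim_deleted("claim-b")
    controller.on_nodeclass_deleted()
    return nodeclass.finalizer_present
```

In this model `finalizer_left_after_teardown(EventOnlyController)` returns `True` (the finalizer stays) while `finalizer_left_after_teardown(RecheckingController)` returns `False`, which is the behavior difference the guess above would predict if it turns out to be right.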

Happy to provide controller logs from the time of the stuck deletion, or to reproduce with a fresh cluster if useful.


Labels: triage/solved (marked as solved by a Karpenter maintainer; the issue author is given time to confirm)
