Spot interrupted instances persist as ghosts with all pods terminating #9226

@macb

Description

Observed Behavior:
We upgraded Karpenter from 1.8.6 to 1.12.1 because of an unsupported-version log message (the role Karpenter uses has the ec2:DescribeInstanceStatus permission). Since the upgrade, spot-terminated nodes are not removed from the cluster until well after AWS has reclaimed the instance. This leaves pods stuck in the Terminating state and their associated volumes orphaned until the termination grace periods are exceeded and everything unwinds.

We use relatively long termination grace periods on both pods and nodes so that, in the best case, consolidation does not interrupt long-running work. We expect spot interruptions to supersede these termination grace periods, which was the observed behavior before the 1.12.1 upgrade. All nodepool and pod configuration is unchanged. We had been on Kubernetes 1.35 for a number of weeks before upgrading Karpenter, and we started hitting this case immediately after that upgrade.

The graphs below show exactly when we upgraded Karpenter in this cluster: the behavior began at that point, even though our spot interruption rate did not change drastically.

Hard to tell but the value is non-zero prior to the spike:

[Image: karpenter.interruption.message.latency.time_seconds.sum / karpenter.interruption.message.latency.time_seconds.count]
[Image: karpenter.interruption.received_messages.count]

Logs for associated nodeclaim:

{
  "level": "INFO",
  "time": "2026-06-05T02:47:05.971Z",
  "logger": "controller",
  "message": "initiating delete from interruption message",
  "commit": "15781e3",
  "controller": "interruption",
  "namespace": "",
  "name": "",
  "reconcileID": "9cbd1899-ada4-4a34-be8e-ccc58089bc72",
  "queue": "queue-name",
  "messageKind": "spot_interrupted",
  "NodeClaim": {
    "name": "nodeclaim-n4lzk"
  },
  "action": "CordonAndDrain",
  "Node": {
    "name": "ip-10-0-0-0.ec2.internal"
  }
}
{
  "level": "INFO",
  "time": "2026-06-05T02:47:06.569Z",
  "logger": "controller",
  "message": "annotated nodeclaim",
  "commit": "15781e3",
  "controller": "nodeclaim.lifecycle",
  "controllerGroup": "karpenter.sh",
  "controllerKind": "NodeClaim",
  "NodeClaim": {
    "name": "nodeclaim-n4lzk"
  },
  "namespace": "",
  "name": "nodeclaim-n4lzk",
  "reconcileID": "b5322649-c05c-4b36-b98c-94a9d49efb08",
  "provider-id": "aws:///region-1a/i-001",
  "Node": {
    "name": "ip-10-0-0-0.ec2.internal"
  },
  "karpenter.sh/nodeclaim-termination-timestamp": "2026-06-06T02:47:05Z"
}
{
  "level": "INFO",
  "time": "2026-06-05T02:47:07.693Z",
  "logger": "controller",
  "message": "tainted node",
  "commit": "15781e3",
  "controller": "node.termination",
  "controllerGroup": "",
  "controllerKind": "Node",
  "Node": {
    "name": "ip-10-0-0-0.ec2.internal"
  },
  "namespace": "",
  "name": "ip-10-0-0-0.ec2.internal",
  "reconcileID": "91b58de0-b212-492f-9d84-00c0ab4653b9",
  "NodeClaim": {
    "name": "nodeclaim-n4lzk"
  },
  "taint.Key": "karpenter.sh/disrupted",
  "taint.Value": "",
  "taint.Effect": "NoSchedule"
}
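
To make the timing in these entries easier to read, here is a minimal Python sketch that measures the gap between the interruption log entry and the `karpenter.sh/nodeclaim-termination-timestamp` annotation. Both timestamps are copied from the logs above; the helper name `parse_ts` is hypothetical, and reading the gap as the node's configured termination grace period is an assumption.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp with a trailing 'Z' into an aware datetime."""
    # Older Python versions do not accept 'Z' in fromisoformat, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# "time" of the "initiating delete from interruption message" entry.
interruption_logged = parse_ts("2026-06-05T02:47:05.971Z")
# Value of karpenter.sh/nodeclaim-termination-timestamp on the nodeclaim.
termination_deadline = parse_ts("2026-06-06T02:47:05Z")

# How long the nodeclaim can linger before the termination timestamp is reached.
window = termination_deadline - interruption_logged
print(f"termination timestamp is {window} after the interruption message")
```

The gap comes out at just under 24 hours, far longer than a reclaimed spot instance stays alive after an interruption notice.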

This resulted in all pods stuck in the Terminating state, a kubelet reporting Status=Unknown, and an event history like:

Events:
  Type     Reason             Age                    From                             Message
  ----     ------             ----                   ----                             -------
  Normal   DisruptionBlocked  4m56s (x63 over 169m)  karpenter                        Node is deleting or marked for deletion
  Normal   DeletingNode       107s (x959 over 166m)  cloud-node-lifecycle-controller  Deleting node ip-10-0-0-0.ec2.internal because it does not exist in the cloud provider
  Warning  FailedDraining     67s (x85 over 169m)    karpenter                        Failed to drain node, 10 pods are waiting to be evicted

Expected Behavior:
Karpenter recognizes that the node is no longer active in the cloud provider and cleans up resources accordingly. This was largely the behavior we observed before the upgrade, apart from a small percentage of completely deadlocked nodeclaims.
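
As an external diagnostic only (not Karpenter's implementation), the check described here can be sketched in Python: compare each node's provider ID against the set of instance IDs that still exist. The function names are hypothetical, and the inputs are hardcoded. In practice they would come from the Kubernetes API and an EC2 instance listing. The first node and its provider ID are taken from the logs above; the second node and `i-002` are made up for contrast.

```python
from typing import Dict, Iterable, List


def instance_id_from_provider_id(provider_id: str) -> str:
    """Return the trailing instance ID of a provider ID like 'aws:///region-1a/i-001'."""
    return provider_id.rsplit("/", 1)[-1]


def find_ghost_nodes(
    node_provider_ids: Dict[str, str], live_instance_ids: Iterable[str]
) -> List[str]:
    """Return the names of nodes whose backing instance is not in the live set."""
    live = set(live_instance_ids)
    return sorted(
        name
        for name, provider_id in node_provider_ids.items()
        if instance_id_from_provider_id(provider_id) not in live
    )


# Node name -> spec.providerID.
nodes = {
    "ip-10-0-0-0.ec2.internal": "aws:///region-1a/i-001",
    "ip-10-0-0-1.ec2.internal": "aws:///region-1a/i-002",  # hypothetical
}
# Instance IDs the cloud provider still reports (hypothetical snapshot).
live = ["i-002"]

ghosts = find_ghost_nodes(nodes, live)
print(ghosts)
```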

Reproduction Steps (Please include YAML):

  • Configure Karpenter to receive spot interruption messages via SQS.
  • Configure a deployment with a multi-hour termination grace period whose pods, once terminated, do not exit before the spot shutdown.
  • Configure a nodepool with a multi-hour termination grace period.
  • Wait for a spot interruption and for AWS to reclaim the instance.
  • The node is still present in the Kubernetes API but is no longer backed by an EC2 instance.
    • It remains until either all pods reach their termination grace period or the node reaches its termination grace period.
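
The steps above could be approximated with manifests along these lines. This is a sketch under stated assumptions, not our actual configuration: every name (`ghost-repro`, `default`), the image, and the grace period values (4h for pods, 24h for the nodepool) are placeholders. The script builds the objects as Python dicts and prints them as a JSON `List`, which `kubectl apply -f -` accepts in place of YAML.

```python
import json

POD_GRACE_SECONDS = 4 * 60 * 60  # assumed multi-hour pod grace period
NODE_GRACE_PERIOD = "24h"        # assumed multi-hour nodepool grace period


def build_manifests() -> dict:
    """Build a spot-only NodePool and a Deployment whose pods ignore SIGTERM."""
    nodepool = {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": "ghost-repro"},
        "spec": {"template": {"spec": {
            # Placeholder EC2NodeClass reference; point this at an existing one.
            "nodeClassRef": {"group": "karpenter.k8s.aws",
                             "kind": "EC2NodeClass", "name": "default"},
            "requirements": [{"key": "karpenter.sh/capacity-type",
                              "operator": "In", "values": ["spot"]}],
            "terminationGracePeriod": NODE_GRACE_PERIOD,
        }}},
    }
    labels = {"app": "ghost-repro"}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "ghost-repro"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": {
                "terminationGracePeriodSeconds": POD_GRACE_SECONDS,
                "containers": [{
                    "name": "sleeper",
                    "image": "busybox",
                    # Ignore SIGTERM so the pod outlives the spot shutdown notice.
                    "command": ["sh", "-c",
                                "trap '' TERM; while true; do sleep 3600; done"],
                }],
            }},
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [nodepool, deployment]}


if __name__ == "__main__":
    print(json.dumps(build_manifests(), indent=2))
```

If saved as a file (here called `repro.py`, a hypothetical name), the output can be piped to `kubectl apply -f -`. The deployment would still need scheduling constraints that place it on nodes from this nodepool.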

Versions:

  • Chart Version: 1.12.1
  • Kubernetes Version (kubectl version): 1.35

Metadata

Labels

  • bug: Something isn't working
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • triage/needs-information: Marks that the issue still needs more information to properly triage
