Spot interrupted instances persist as ghosts with all pods terminating

### Description

**Observed Behavior**:
After upgrading from 1.8.6 to 1.12.1 (an upgrade prompted by an unsupported-version log message; the role Karpenter uses has the `ec2:DescribeInstanceStatus` permission), we noticed that our spot-terminated nodes are not removed from the cluster until well after AWS has reclaimed the instance. This leaves pods stuck in the `Terminating` state, and their associated volumes stay orphaned until the termination grace periods are exceeded and everything unwinds.

We use relatively long termination grace periods on both pods and nodes so that, in the best case, consolidation does not interrupt long-running work. We expect spot interruptions to supersede these termination grace periods, which was the behavior we observed before the 1.12.1 upgrade. All NodePool and pod configuration is unchanged. We had been running Kubernetes 1.35 for a number of weeks before upgrading Karpenter, and we started hitting this case immediately after the Karpenter upgrade.

The graphs below show exactly when we upgraded Karpenter in this cluster. The behavior began at that point even though our spot interruption rate did not change drastically.

Average interruption message latency (hard to tell from the graph, but the value is non-zero prior to the spike):
```
karpenter.interruption.message.latency.time_seconds.sum / karpenter.interruption.message.latency.time_seconds.count
```
<img width="1090" height="576" alt="Image" src="https://github.com/user-attachments/assets/c91f7116-f6e5-48d3-8561-335ebfe9b491" />

```
karpenter.interruption.received_messages.count
```
<img width="1090" height="576" alt="Image" src="https://github.com/user-attachments/assets/4d5a7112-7684-415d-9f58-2f0051661d22" />

Logs for associated nodeclaim:
```
{
  "level": "INFO",
  "time": "2026-06-05T02:47:05.971Z",
  "logger": "controller",
  "message": "initiating delete from interruption message",
  "commit": "15781e3",
  "controller": "interruption",
  "namespace": "",
  "name": "",
  "reconcileID": "9cbd1899-ada4-4a34-be8e-ccc58089bc72",
  "queue": "queue-name",
  "messageKind": "spot_interrupted",
  "NodeClaim": {
    "name": "nodeclaim-n4lzk"
  },
  "action": "CordonAndDrain",
  "Node": {
    "name": "ip-10-0-0-0.ec2.internal"
  }
}
{
  "level": "INFO",
  "time": "2026-06-05T02:47:06.569Z",
  "logger": "controller",
  "message": "annotated nodeclaim",
  "commit": "15781e3",
  "controller": "nodeclaim.lifecycle",
  "controllerGroup": "karpenter.sh",
  "controllerKind": "NodeClaim",
  "NodeClaim": {
    "name": "nodeclaim-n4lzk"
  },
  "namespace": "",
  "name": "nodeclaim-n4lzk",
  "reconcileID": "b5322649-c05c-4b36-b98c-94a9d49efb08",
  "provider-id": "aws:///region-1a/i-001",
  "Node": {
    "name": "ip-10-0-0-0.ec2.internal"
  },
  "karpenter.sh/nodeclaim-termination-timestamp": "2026-06-06T02:47:05Z"
}
{
  "level": "INFO",
  "time": "2026-06-05T02:47:07.693Z",
  "logger": "controller",
  "message": "tainted node",
  "commit": "15781e3",
  "controller": "node.termination",
  "controllerGroup": "",
  "controllerKind": "Node",
  "Node": {
    "name": "ip-10-0-0-0.ec2.internal"
  },
  "namespace": "",
  "name": "ip-10-0-0-0.ec2.internal",
  "reconcileID": "91b58de0-b212-492f-9d84-00c0ab4653b9",
  "NodeClaim": {
    "name": "nodeclaim-n4lzk"
  },
  "taint.Key": "karpenter.sh/disrupted",
  "taint.Value": "",
  "taint.Effect": "NoSchedule"
}
```

This resulted in all pods on the node being stuck in the `Terminating` state, a kubelet with `Status=Unknown`, and an event history like:
```
Events:
  Type     Reason             Age                    From                             Message
  ----     ------             ----                   ----                             -------
  Normal   DisruptionBlocked  4m56s (x63 over 169m)  karpenter                        Node is deleting or marked for deletion
  Normal   DeletingNode       107s (x959 over 166m)  cloud-node-lifecycle-controller  Deleting node ip-10-0-0-0.ec2.internal because it does not exist in the cloud provider
  Warning  FailedDraining     67s (x85 over 169m)    karpenter                        Failed to drain node, 10 pods are waiting to be evicted
```

**Expected Behavior**:
Karpenter recognizes that the node is no longer backed by an active instance in the cloud provider and cleans up resources accordingly. This was largely the behavior we observed before the upgrade, apart from a small percentage of completely deadlocked NodeClaims.

**Reproduction Steps** (Please include YAML):
- Configure Karpenter to receive spot interruption messages via SQS.
- Configure a Deployment with a multi-hour termination grace period whose pods, when terminated, do not exit before the spot shutdown.
- Configure a NodePool with a multi-hour termination grace period.
- Wait for a spot interruption and for AWS to reclaim the instance.
- The node is still present in the Kubernetes API but is no longer backed by an EC2 instance.
  - It remains until either all pods reach their termination grace period or the node reaches its termination grace period.
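
For illustration, here is a minimal sketch of manifests matching the steps above. It is not our exact configuration: the resource names, the image, the replica count, the `nodeClassRef` target, and the four-hour pod grace period are all placeholders. The `24h` node value is an inference from the `karpenter.sh/nodeclaim-termination-timestamp` annotation in the logs, which falls 24 hours after the interruption message.

```yaml
# Hypothetical NodePool: spot capacity with a long node-level grace period.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-long-grace            # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # placeholder; assumes an existing EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      # Inferred from the logs: termination timestamp is interruption time + 24h.
      terminationGracePeriod: 24h
---
# Hypothetical Deployment: pods ignore SIGTERM, so they never exit
# before the spot instance is shut down.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: long-running-worker        # placeholder name
spec:
  replicas: 2                      # placeholder count
  selector:
    matchLabels:
      app: long-running-worker
  template:
    metadata:
      labels:
        app: long-running-worker
    spec:
      # Multi-hour pod grace period (placeholder value: 4 hours).
      terminationGracePeriodSeconds: 14400
      containers:
        - name: worker
          image: busybox:1.36      # placeholder image
          # Ignore SIGTERM and keep running to simulate long-running work.
          command: ["sh", "-c", "trap '' TERM; while true; do sleep 3600; done"]
```

With something like this in place, the pods should still be running when the two-minute spot interruption notice expires, which is the condition under which we see the node linger in the Kubernetes API.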


**Versions**:
- Chart Version: 1.12.1
- Kubernetes Version (`kubectl version`): 1.35

* Please vote on this issue by adding a 👍 [reaction](https://blog.github.com/2016-03-10-add-reactions-to-pull-requests-issues-and-comments/) to the original issue to help the community and maintainers prioritize this request
* Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
* If you are interested in working on this issue or have submitted a pull request, please leave a comment

