Description
Observed Behavior:
After upgrading Karpenter from 1.8.6 to 1.12.1 (prompted by an unsupported-version log message; the role Karpenter uses has the ec2:DescribeInstanceStatus permission), we noticed that spot-terminated nodes remain in the cluster long after AWS has reclaimed the instance. Pods are left stuck in the Terminating state and their associated volumes are orphaned until the termination grace periods are exceeded and everything unwinds.
We use relatively long termination grace periods on both pods and nodes so that, in the best case, consolidation does not interrupt long-running work. We expect spot interruptions to supersede these termination grace periods, which was the behavior we observed before the 1.12.1 upgrade. All nodepool and pod configuration is unchanged. We had been running Kubernetes 1.35 for a number of weeks before upgrading Karpenter, at which point we immediately started hitting this case.
You can see exactly when we upgraded karpenter in this cluster and the behavior began occurring even though our spot interrupt rate did not drastically change.
It is hard to tell from the graphs, but the value is non-zero prior to the spike:
karpenter.interruption.message.latency.time_seconds.sum / karpenter.interruption.message.latency.time_seconds.count
karpenter.interruption.received_messages.count
Logs for associated nodeclaim:
{
  "level": "INFO",
  "time": "2026-06-05T02:47:05.971Z",
  "logger": "controller",
  "message": "initiating delete from interruption message",
  "commit": "15781e3",
  "controller": "interruption",
  "namespace": "",
  "name": "",
  "reconcileID": "9cbd1899-ada4-4a34-be8e-ccc58089bc72",
  "queue": "queue-name",
  "messageKind": "spot_interrupted",
  "NodeClaim": {
    "name": "nodeclaim-n4lzk"
  },
  "action": "CordonAndDrain",
  "Node": {
    "name": "ip-10-0-0-0.ec2.internal"
  }
}
{
  "level": "INFO",
  "time": "2026-06-05T02:47:06.569Z",
  "logger": "controller",
  "message": "annotated nodeclaim",
  "commit": "15781e3",
  "controller": "nodeclaim.lifecycle",
  "controllerGroup": "karpenter.sh",
  "controllerKind": "NodeClaim",
  "NodeClaim": {
    "name": "nodeclaim-n4lzk"
  },
  "namespace": "",
  "name": "nodeclaim-n4lzk",
  "reconcileID": "b5322649-c05c-4b36-b98c-94a9d49efb08",
  "provider-id": "aws:///region-1a/i-001",
  "Node": {
    "name": "ip-10-0-0-0.ec2.internal"
  },
  "karpenter.sh/nodeclaim-termination-timestamp": "2026-06-06T02:47:05Z"
}
{
  "level": "INFO",
  "time": "2026-06-05T02:47:07.693Z",
  "logger": "controller",
  "message": "tainted node",
  "commit": "15781e3",
  "controller": "node.termination",
  "controllerGroup": "",
  "controllerKind": "Node",
  "Node": {
    "name": "ip-10-0-0-0.ec2.internal"
  },
  "namespace": "",
  "name": "ip-10-0-0-0.ec2.internal",
  "reconcileID": "91b58de0-b212-492f-9d84-00c0ab4653b9",
  "NodeClaim": {
    "name": "nodeclaim-n4lzk"
  },
  "taint.Key": "karpenter.sh/disrupted",
  "taint.Value": "",
  "taint.Effect": "NoSchedule"
}
This resulted in all pods being stuck in the Terminating state, a kubelet with Status=Unknown, and an event history like:
Events:
  Type     Reason             Age                    From                             Message
  ----     ------             ----                   ----                             -------
  Normal   DisruptionBlocked  4m56s (x63 over 169m)  karpenter                        Node is deleting or marked for deletion
  Normal   DeletingNode       107s (x959 over 166m)  cloud-node-lifecycle-controller  Deleting node ip-10-0-0-0.ec2.internal because it does not exist in the cloud provider
  Warning  FailedDraining     67s (x85 over 169m)    karpenter                        Failed to drain node, 10 pods are waiting to be evicted
Expected Behavior:
Karpenter recognizes that the node is no longer active in the cloud provider and cleans up resources accordingly. This was largely the behavior we observed before the upgrade, apart from a small percentage of completely deadlocked nodeclaims.
Reproduction Steps (Please include YAML):
- Configure Karpenter to receive spot interruption messages via SQS.
- Configure a deployment with a multi-hour termination grace period whose pods, when terminated, do not exit before the spot shutdown.
- Configure a nodepool with a multi-hour termination grace period.
- Wait for an interruption and for AWS to reclaim the instance.
- The node is still present in the Kubernetes API but is no longer backed by an EC2 instance.
- The node remains until either all of its pods reach their termination grace period or the node reaches its own termination grace period.
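The steps above could be reproduced with manifests along these lines. This is a minimal sketch, not the reporter's actual configuration: the resource names, the `default` EC2NodeClass, the busybox image, the replica count, and the 4h pod grace period are all hypothetical placeholders. The 24h node grace period is inferred from the logs, where the `karpenter.sh/nodeclaim-termination-timestamp` annotation is set 24 hours after the interruption message.

```yaml
# Hypothetical reproduction manifests; names, image, and durations are placeholders.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-long-tgp              # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumes an EC2NodeClass with this name already exists
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      # Node-level grace period. 24h matches the gap between the interruption
      # message and the nodeclaim-termination-timestamp annotation in the logs.
      terminationGracePeriod: 24h
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: long-running-work          # hypothetical name
spec:
  replicas: 10                     # assumption; the FailedDraining event mentions 10 pods
  selector:
    matchLabels:
      app: long-running-work
  template:
    metadata:
      labels:
        app: long-running-work
    spec:
      terminationGracePeriodSeconds: 14400   # 4h, standing in for "multi-hour"
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
        - name: worker
          image: busybox:1.36
          # Ignore SIGTERM so the pod is still running when the instance is reclaimed.
          command: ["sh", "-c", "trap '' TERM; while true; do sleep 30; done"]
```

With manifests like these, a spot interruption should leave the pods in Terminating after the instance is gone, which is the state described in this report.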
Versions:
- Chart Version: 1.12.1
- Kubernetes Version (kubectl version): 1.35
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment