
Karpenter Disrupted Nodes and EBS CSI Volume Attachment #2318

Open
@jammerful

Description


I'm running into an issue where Karpenter wants to disrupt a node that has a StatefulSet running on it. Karpenter terminates all the non-DaemonSet pods on that node, but when a pod is rescheduled to the new node it is unable to start because its volume is still attached to the old node, and Karpenter is unable to terminate that node:

$ kubectl describe pod
Status:                    Terminating (lasts 3h5m)
...
Events:
  Type    Reason     Age                   From       Message
  ----    ------     ----                  ----       -------
  Normal  Nominated  6m1s (x79 over 164m)  karpenter  Pod should schedule on: nodeclaim/default-on-demand-p27q8, node/ip-10-221-64-33.ec2.internal

When trying to find the VolumeAttachment and which node it's attached to:

kubectl describe volumeattachment csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f
Name:         csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f
Namespace:
Labels:       <none>
Annotations:  csi.alpha.kubernetes.io/node-id: i-048a0a6c6c9e79dd2
API Version:  storage.k8s.io/v1
Kind:         VolumeAttachment
Metadata:
  Creation Timestamp:  2025-01-24T03:25:51Z
  Finalizers:
    external-attacher/ebs-csi-aws-com
  Resource Version:  913618302
  UID:               26ce1744-6c4d-440b-a54b-aa4e9e02eb5c
Spec:
  Attacher:   ebs.csi.aws.com
  Node Name:  ip-10-221-66-172.ec2.internal
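For anyone debugging the same thing, a quick way to cross-check every VolumeAttachment against the node it is bound to (standard kubectl output columns, nothing specific to this cluster):

```shell
# List each VolumeAttachment with its target node, backing PV, and
# attach status, to spot attachments still pointing at an old node.
kubectl get volumeattachments -o custom-columns=\
'NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached'
```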

You can see that this is a different node from the one the pod is scheduled to. Looking at the EBS CSI driver's csi-attacher logs, I don't see any mention of that attachment:

$ kubectl logs -n system-storage ebs-csi-driver-controller-659467997f-5rw4s -c csi-attacher | grep csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f
<empty> (no output; I confirmed this pod was the leader)
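For reference, how I checked the leader: the csi-attacher sidecar records its leader in a Lease object. The Lease name below follows external-attacher's convention of sanitizing the driver name (ebs.csi.aws.com → ebs-csi-aws-com); adjust the namespace/name if your deployment differs:

```shell
# Show which csi-attacher replica currently holds the leader lease
# (namespace matches where the EBS CSI controller runs in this cluster)
kubectl get lease -n system-storage external-attacher-leader-ebs-csi-aws-com \
  -o jsonpath='{.spec.holderIdentity}{"\n"}'
```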

Once I run kubectl delete volumeattachment csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f the pod that was stuck terminating comes up on the new node.
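For anyone hitting the same symptom, this is the manual cleanup I ended up with (a last-resort sketch: if the delete hangs because the external-attacher finalizer is never removed, the finalizer can be cleared by hand, at the risk of leaving the EBS volume actually attached in AWS):

```shell
VA=csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f

# Delete the stale VolumeAttachment; normally the csi-attacher removes
# the finalizer once the volume is detached from the node.
kubectl delete volumeattachment "$VA"

# Last resort only: clear the finalizer if the delete hangs. This skips
# the driver's detach path, so verify the attachment state in AWS after.
kubectl patch volumeattachment "$VA" --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```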

What could be causing this? I would expect the EBS CSI attacher to detach the volume at some point.

Labels

triage/needs-information — Indicates an issue needs more information in order to work on it.