Description
/kind bug
As discussed in #1302, some users experience a considerable delay when using PersistentVolumes in combination with Karpenter.
This ticket is to capture that specific issue, highlight where the error is coming from, and discuss possible ways forward.
What happened?
During a Karpenter scaling event, it’s possible that the Kubelet cannot update its Node.Status.VolumesAttached and Node.Status.VolumesInUse values, either because EC2 TerminateInstances has been executed or the Node object has been removed. The Kubelet VolumeManager processes jobs asynchronously, so a terminated pod does not mean all the volumes have been fully unmounted and detached from the node.
Due to how the controllermanager attachdetach-controller uses this node information, the VolumeAttachment objects are not cleaned up immediately, often waiting 6+ minutes. Once they are cleaned up by the control plane, the EBS CSI controller does its job as expected, typically throwing the error "Already detached" in the logs (as the EC2 instance had detached the volumes on termination):
DetachDisk: called on non-attached volume
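To make the delay mechanics concrete, below is a simplified, illustrative Go model (not the actual kube-controller-manager source) of the reconciler decision behind the 6+ minute wait: a volume that the node still reports as in use is only force-detached after a maximum-wait safety timer expires.

```go
package main

import (
	"fmt"
	"time"
)

// maxWaitForUnmount mirrors the controller's default force-detach timeout (~6 minutes).
const maxWaitForUnmount = 6 * time.Minute

// attachedVolume is a simplified stand-in for an entry in the controller's ActualStateOfWorld.
type attachedVolume struct {
	name            string
	mountedByNode   bool      // derived from Node.Status.VolumesInUse
	detachRequested time.Time // set when the volume is no longer desired on the node
}

// shouldDetach reports whether the reconciler would issue a detach right now.
func shouldDetach(v attachedVolume, now time.Time) bool {
	if !v.mountedByNode {
		return true // the node no longer reports the volume as in use: detach immediately
	}
	// The node (possibly already deleted) still reports the volume in use:
	// only force-detach once the safety timer expires.
	return now.Sub(v.detachRequested) > maxWaitForUnmount
}

func main() {
	v := attachedVolume{
		name:            "kubernetes.io/csi/ebs.csi.aws.com^vol-039c251ec3bb18e1a",
		mountedByNode:   true,
		detachRequested: time.Now(),
	}
	fmt.Println(shouldDetach(v, time.Now()))                    // false: still within the safety window
	fmt.Println(shouldDetach(v, time.Now().Add(7*time.Minute))) // true: force-detach after ~6 minutes
}
```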
What you expected to happen?
If a node is deleted and the underlying EBS volumes become detached in the real world, the ebs-csi driver should be able to reflect the actual state in Kubernetes before the timeout condition triggers in the controllermanager attachdetach-controller.
Ideally, we can then catch this dangling-volume condition and reconcile the state within seconds instead of 6+ minutes, allowing the volumes to be remounted quickly on another node.
How to reproduce it (as minimally and precisely as possible)?
- Create a cluster with Karpenter and a provisioning group
- Create a StatefulSet with 1 replica and ~20 volumes
- 20 volumes are used here to reliably reproduce the bug, but it can occur with fewer volumes. It’s a factor of how quickly the Kubelet VolumeManager can unmount volumes after pod termination and update the Node status object.
- Use the Kubernetes API to delete the Karpenter node the pod is running on (see the client-go sketch after this list)
- This will be captured by the Karpenter termination controller, causing a cordon and drain
- Once this is complete, the node is Terminated using the EC2 API
- If volumes are still listed in the Node status when the finalizer is removed, this bug occurs
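For the "delete the node via the Kubernetes API" step, a minimal client-go sketch is shown below; the kubeconfig path and node name are placeholders, and `kubectl delete node <name>` achieves the same thing.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config); adjust for your environment.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Example node name taken from this report; substitute the node your pod runs on.
	nodeName := "ip-10-227-165-182.eu-central-1.compute.internal"

	// The delete is intercepted by Karpenter's termination finalizer, which cordons,
	// drains, and finally calls EC2 TerminateInstances before removing the finalizer.
	if err := client.CoreV1().Nodes().Delete(context.TODO(), nodeName, metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("requested deletion of node %s", nodeName)
}
```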
Reproduction Walkthrough
- Apply the StatefulSet available at https://github.com/martysweet/eks-ebs-node-detachment-reproduction/blob/main/manifests/stateful_set.yml to a cluster and ensure it is scheduled on a Karpenter-provisioned node.
- Run `make py-watch-nodes` and `make py-watch-pods` to listen for updates to the applied Kubernetes objects (a rough Go equivalent of the node watch is sketched below)
- Delete the Karpenter node the pod is running on, e.g. `ip-10-227-165-182.eu-central-1.compute.internal`
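For completeness, here is a rough client-go equivalent of the repository's `py-watch-nodes` helper, printing the Node.Status fields this issue revolves around on every update; it is illustrative only and not required for the reproduction.

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch all nodes and print the status fields the attachdetach-controller relies on.
	w, err := client.CoreV1().Nodes().Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		node, ok := ev.Object.(*corev1.Node)
		if !ok {
			continue
		}
		fmt.Printf("%s %s deletionTimestamp=%v finalizers=%v\n",
			ev.Type, node.Name, node.DeletionTimestamp, node.Finalizers)
		fmt.Printf("  volumesInUse=%d volumesAttached=%d\n",
			len(node.Status.VolumesInUse), len(node.Status.VolumesAttached))
	}
}
```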
- In the Karpenter logs, we can observe that the node is cordoned and drained, which corresponds to https://github.com/aws/karpenter-core/blob/1a05bdff59ced95c8868c75efdddfa77a3540092/pkg/controllers/termination/controller.go#L84-L101
2023-06-30T09:04:30.218Z INFO controller.termination cordoned node {"commit": "26e2d35-dirty", "node": "ip-10-227-165-182.eu-central-1.compute.internal"} │
2023-06-30T09:04:31.769Z INFO controller.termination deleted node {"commit": "26e2d35-dirty", "node": "ip-10-227-165-182.eu-central-1.compute.internal"} │
2023-06-30T09:04:32.049Z INFO controller.machine.termination deleted machine {"commit": "26e2d35-dirty", "machine": "nodes-ubuntu-j67qz", "node": "ip-10-227-165-182.eu-central-1.compute.internal",
- The pod is successfully terminated
[2023-06-30T09:04:31.364131Z] Name: ebstest-0
[2023-06-30T09:04:31.364192Z] Type: MODIFIED
[2023-06-30T09:04:31.364213Z] Phase: Running
[2023-06-30T09:04:31.364228Z] Deletion Timestamp: 2023-06-30 09:04:30+00:00
[2023-06-30T09:04:31.364252Z] Node Name: ip-10-227-165-182.eu-central-1.compute.internal
[2023-06-30T09:04:31.364267Z] ====================================================================================== =
[2023-06-30T09:04:31.368870Z] Name: ebstest-0
[2023-06-30T09:04:31.368932Z] Type: DELETED
[2023-06-30T09:04:31.368955Z] Phase: Running
[2023-06-30T09:04:31.368972Z] Deletion Timestamp: 2023-06-30 09:04:30+00:00
[2023-06-30T09:04:31.368997Z] Node Name: ip-10-227-165-182.eu-central-1.compute.internal
[2023-06-30T09:04:31.369037Z] ====================================================================================== =
- The node is officially deleted from the API even though it still has VolumesAttached and VolumesInUse entries; the instance is then purged from existence by the EC2 API. With only one or two volumes on the node, the Kubelet can sometimes process them all in time, but with any more volumes this bug becomes consistently reproducible.
[2023-06-30T09:04:30.451152Z] Name: ip-10-227-165-182.eu-central-1.compute.internal
[2023-06-30T09:04:30.451202Z] Type: MODIFIED
[2023-06-30T09:04:30.451216Z] Deletion Timestamp: 2023-06-30 09:04:30+00:00
[2023-06-30T09:04:30.451237Z] Finalizers: ['karpenter.sh/termination']
[2023-06-30T09:04:30.451582Z] Status volumes attached: [{'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-039c251ec3bb18e1a'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-044c85228f50c3cc3'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0a04dc850d50fde39'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ae8ca9ce6efd78c3'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-01e3c874d68c54844'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-000bf0a07713e801c'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-08cdaad8d4fb97ded'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-062a56ba3b8325000'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-03d4144b21e2e4240'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-00ab6f6923e2cedd2'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0e082a6e4eee7a6d3'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0bd58963b12e47220'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ac5723c4698267bc'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0d8944ef66977a5e9'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0d97e48cc694a6e90'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0bd917ecb162abf39'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-009a8731a6c1a5d75'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-010c6f981ce7fb716'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ca6be22757fe2860'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0dd2106a0b2e12980'}]
[2023-06-30T09:04:30.451628Z] Status volumes in use: ['kubernetes.io/csi/ebs.csi.aws.com^vol-000bf0a07713e801c', 'kubernetes.io/csi/ebs.csi.aws.com^vol-009a8731a6c1a5d75', 'kubernetes.io/csi/ebs.csi.aws.com^vol-00ab6f6923e2cedd2', 'kubernetes.io/csi/ebs.csi.aws.com^vol-010c6f981ce7fb716', 'kubernetes.io/csi/ebs.csi.aws.com^vol-01e3c874d68c54844', 'kubernetes.io/csi/ebs.csi.aws.com^vol-039c251ec3bb18e1a', 'kubernetes.io/csi/ebs.csi.aws.com^vol-03d4144b21e2e4240', 'kubernetes.io/csi/ebs.csi.aws.com^vol-044c85228f50c3cc3', 'kubernetes.io/csi/ebs.csi.aws.com^vol-062a56ba3b8325000', 'kubernetes.io/csi/ebs.csi.aws.com^vol-08cdaad8d4fb97ded', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0a04dc850d50fde39', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ac5723c4698267bc', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ae8ca9ce6efd78c3', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0bd58963b12e47220', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0bd917ecb162abf39', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ca6be22757fe2860', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0d8944ef66977a5e9', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0d97e48cc694a6e90', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0dd2106a0b2e12980', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0e082a6e4eee7a6d3']
[2023-06-30T09:04:30.451656Z] ====================================================================================== =
[2023-06-30T09:04:31.966559Z] Name: ip-10-227-165-182.eu-central-1.compute.internal
[2023-06-30T09:04:31.966637Z] Type: DELETED
[2023-06-30T09:04:31.966657Z] Deletion Timestamp: 2023-06-30 09:04:30+00:00
[2023-06-30T09:04:31.966682Z] Finalizers: ['karpenter.sh/termination']
[2023-06-30T09:04:31.967037Z] Status volumes attached: [{'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-039c251ec3bb18e1a'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-044c85228f50c3cc3'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0a04dc850d50fde39'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ae8ca9ce6efd78c3'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-01e3c874d68c54844'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-000bf0a07713e801c'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-08cdaad8d4fb97ded'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-062a56ba3b8325000'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-03d4144b21e2e4240'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-00ab6f6923e2cedd2'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0e082a6e4eee7a6d3'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0bd58963b12e47220'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ac5723c4698267bc'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0d8944ef66977a5e9'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0d97e48cc694a6e90'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0bd917ecb162abf39'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-009a8731a6c1a5d75'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-010c6f981ce7fb716'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ca6be22757fe2860'}, {'device_path': '', 'name': 'kubernetes.io/csi/ebs.csi.aws.com^vol-0dd2106a0b2e12980'}]
[2023-06-30T09:04:31.967086Z] Status volumes in use: ['kubernetes.io/csi/ebs.csi.aws.com^vol-000bf0a07713e801c', 'kubernetes.io/csi/ebs.csi.aws.com^vol-009a8731a6c1a5d75', 'kubernetes.io/csi/ebs.csi.aws.com^vol-00ab6f6923e2cedd2', 'kubernetes.io/csi/ebs.csi.aws.com^vol-010c6f981ce7fb716', 'kubernetes.io/csi/ebs.csi.aws.com^vol-01e3c874d68c54844', 'kubernetes.io/csi/ebs.csi.aws.com^vol-039c251ec3bb18e1a', 'kubernetes.io/csi/ebs.csi.aws.com^vol-03d4144b21e2e4240', 'kubernetes.io/csi/ebs.csi.aws.com^vol-044c85228f50c3cc3', 'kubernetes.io/csi/ebs.csi.aws.com^vol-062a56ba3b8325000', 'kubernetes.io/csi/ebs.csi.aws.com^vol-08cdaad8d4fb97ded', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0a04dc850d50fde39', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ac5723c4698267bc', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ae8ca9ce6efd78c3', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0bd58963b12e47220', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0bd917ecb162abf39', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0ca6be22757fe2860', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0d8944ef66977a5e9', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0d97e48cc694a6e90', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0dd2106a0b2e12980', 'kubernetes.io/csi/ebs.csi.aws.com^vol-0e082a6e4eee7a6d3']
- While this may seem like a Karpenter or Kubelet issue, we can't find any supporting documentation suggesting that autoscalers should be aware of such conditions, so we have raised the issue here.
- In the controllermanager attachdetach-controller (https://github.com/kubernetes/kubernetes/blob/d07c2688fe07d7a65bca66d26853c637743544d4/pkg/controller/volume/attachdetach/attach_detach_controller.go#L571-L584), volumes are still tracked for deleted nodes - there are presumably edge cases in other volume plugins where this matters.
- As a result, users can expect to see dangling VolumeAttachment objects until the attachdetach-controller safety timer expires and sets a DeletionTimestamp on the VolumeAttachment object.
Here is a snippet of VolumeAttachment objects after the node has been deleted - referencing the stale node. After 6 minutes, these objects will be cleaned up and recreated for the newly scheduled node.
csi-2b69ecb15e39f04be54aff1ca9c435cf3b699c71abc97a0817b1f48e56035cd9 ebs.csi.aws.com pvc-a12f2685-a61b-4f35-b5db-d0d2d120dae3 ip-10-227-165-182.eu-central-1.compute.internal true 20h
csi-4983722e4e0284f939fe90e75ff3b7bd7631ef8034b3fe9301680d05c0d29c57 ebs.csi.aws.com pvc-323c2507-e3b7-405b-8bdd-bdb451190c6e ip-10-227-165-182.eu-central-1.compute.internal true 20h
csi-52a39a2029b26218f3c50f9b7199e7627f748d84348dfb7e3baf18914dd04e50 ebs.csi.aws.com pvc-63131420-107f-4560-a4be-f5478b6a7b10 ip-10-227-165-182.eu-central-1.compute.internal true 20h
csi-7d391010e1749ceeaf03ac930de8f38c287bbfaedba589f32c0e5907d82d754d ebs.csi.aws.com pvc-b097ef05-7813-4813-b245-23af831d9609 ip-10-227-165-182.eu-central-1.compute.internal true 20h
**Environment**
Versions
- Kubernetes version (use `kubectl version`): v1.25.10-eks-c12679a running Ubuntu Kubelet v1.25.10
- Driver version: 1.19.0
Possible Solutions
There are lots of components at play in this specific issue.
- Karpenter is deleting the node before the Status.Volumes* fields are empty - but is it really the responsibility of Karpenter to care about this?
- The attachdetach-controller uses the Status.Volumes* data to populate its ActualStateOfWorld, which, in our specific case, does not reflect reality
- Maybe the Kubelet is not detaching volumes quickly enough? This, however, is an asynchronous reconciliation process
- The ebs-csi-controller is not notified about any of these actions
To help explain all the components involved, we put together a simplified diagram.
As the ebs-csi controller, we have access to real-world EBS volume attachment information via the EC2 API. As such, the driver should be able to update the VolumeAttachment object to notify the attachdetach-controller of the correct state.
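As a hedged illustration of that idea (this is not existing driver behaviour), the sketch below marks a VolumeAttachment as detached once a hypothetical EC2 lookup reports the volume is no longer attached to the deleted node's instance. The `ebsVolumeAttachedToInstance` helper and the choice of updating the status rather than deleting the object are assumptions made for the example.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// ebsVolumeAttachedToInstance stands in for a DescribeVolumes call against the EC2 API;
// it is not part of the driver today and exists here only to illustrate the flow.
func ebsVolumeAttachedToInstance(pvName, nodeName string) bool {
	return false // pretend EC2 reports the volume as already detached
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Example VolumeAttachment name taken from the listing earlier in this report.
	vaName := "csi-2b69ecb15e39f04be54aff1ca9c435cf3b699c71abc97a0817b1f48e56035cd9"
	va, err := client.StorageV1().VolumeAttachments().Get(context.TODO(), vaName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// spec.source holds the PV name (pvc-...); in practice it would be resolved to a vol-... ID.
	pvName := ""
	if va.Spec.Source.PersistentVolumeName != nil {
		pvName = *va.Spec.Source.PersistentVolumeName
	}

	if !ebsVolumeAttachedToInstance(pvName, va.Spec.NodeName) {
		// EC2 already detached the volume when the instance terminated, so surface that
		// to the control plane instead of waiting for the 6-minute safety timer.
		va.Status.Attached = false
		if _, err := client.StorageV1().VolumeAttachments().UpdateStatus(context.TODO(), va, metav1.UpdateOptions{}); err != nil {
			log.Fatal(err)
		}
		log.Printf("marked VolumeAttachment %s as detached", vaName)
	}
}
```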
One possible solution to this problem could be (sketched below):
- When a CSI driver is deployed to a node, add a `csi-ebs` finalizer to the Node
- During runtime: on node deletion, capture the event
  - Log and wait until all `csi-ebs` volumes are detached before removing the finalizer
- On controller startup, check if there are any pending node deletions and run the above logic
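A minimal sketch of the flow above, using the `csi-ebs` finalizer name from the proposal (a real finalizer would likely be domain-qualified) and omitting conflict handling and retries:

```go
package main

import (
	"context"
	"log"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// nodeFinalizer is the hypothetical finalizer name from the proposal; the driver does not add one today.
const nodeFinalizer = "csi-ebs"

// waitForVolumesAndRemoveFinalizer polls the (deleting) Node until the Kubelet no longer
// reports any EBS CSI volumes as attached, then removes the protective finalizer.
func waitForVolumesAndRemoveFinalizer(client kubernetes.Interface, nodeName string) error {
	for {
		node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}

		// Count EBS CSI volumes still reported in Node.Status.VolumesAttached.
		pending := 0
		for _, v := range node.Status.VolumesAttached {
			if strings.Contains(string(v.Name), "ebs.csi.aws.com") {
				pending++
			}
		}

		if pending == 0 {
			// Safe to let the Node object (and then the instance) go away.
			var keep []string
			for _, f := range node.Finalizers {
				if f != nodeFinalizer {
					keep = append(keep, f)
				}
			}
			node.Finalizers = keep
			_, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{})
			return err
		}

		log.Printf("node %s still reports %d EBS volumes attached, waiting", nodeName, pending)
		time.Sleep(2 * time.Second)
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForVolumesAndRemoveFinalizer(client, "ip-10-227-165-182.eu-central-1.compute.internal"); err != nil {
		log.Fatal(err)
	}
}
```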
The advantage of the above flow is that we are just forcing a wait and allowing the standard CSI flows to happen. However, looking at the Karpenter logic, it seems that it wouldn't care about our finalizer and would terminate the instance regardless: https://github.com/aws/karpenter-core/blob/1a05bdff59ced95c8868c75efdddfa77a3540092/pkg/controllers/termination/controller.go#L84-L102. Perhaps that is a bug on their side, and they should wait for a final delete event before proceeding with the TerminateInstances call?
Another solution could be (see the startup-check sketch below):
- Listen to Node deletion events
- When a node has been removed from the API:
  - Add each dangling VolumesAttached entry to a queue
  - Periodically check the EC2 API to see whether the volume is still attached to the instance behind the dangling nodeName
  - Update any dangling VolumeAttachment objects to reflect the actual EC2 state
- On csi-controller startup, check VolumeAttachments for bindings to any nodes which no longer exist
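A sketch of the startup check from the list above, assuming client-go access from the csi-controller: list VolumeAttachment objects owned by `ebs.csi.aws.com`, flag any that reference a Node which no longer exists, and hand them to a placeholder queue for the EC2 verification and cleanup steps.

```go
package main

import (
	"context"
	"log"

	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// enqueueDangling is a placeholder for the periodic EC2 check / VolumeAttachment cleanup queue.
func enqueueDangling(va storagev1.VolumeAttachment) {
	log.Printf("dangling VolumeAttachment %s -> missing node %s", va.Name, va.Spec.NodeName)
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	vas, err := client.StorageV1().VolumeAttachments().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, va := range vas.Items {
		if va.Spec.Attacher != "ebs.csi.aws.com" {
			continue // only consider attachments owned by this driver
		}
		if _, err := client.CoreV1().Nodes().Get(context.TODO(), va.Spec.NodeName, metav1.GetOptions{}); errors.IsNotFound(err) {
			enqueueDangling(va)
		} else if err != nil {
			log.Printf("checking node %s: %v", va.Spec.NodeName, err)
		}
	}
}
```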
In either case, missed events (say, during a controller restart) will fall back to the 6-minute stale timeout enforced by the attachdetach-controller. The aim is to make this reconciliation faster while keeping it safe.