Reported allocatable volume attachment is erroneous #2796

@Gui13

Description

/kind bug

What happened?

The reported maximum number of allocatable volumes depends on the moment when the ebs-csi-node daemonset pod starts.

On some of our nodes, the reported allocatable volume count is extremely low (12, 13, sometimes even 2), whereas we use instances that should report 25 volumes (we have 1 root disk and 1 ENI).

You can see the issue when listing the csinodes objects (here is an example on a 4-node cluster):

[screenshot: csinodes listing showing one node reporting a much lower allocatable count than the others]
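
For reference, here is a minimal client-go sketch that prints the same per-node allocatable count the driver reports (it assumes a local kubeconfig and the standard ebs.csi.aws.com driver name):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List CSINode objects and print the allocatable attachment count of the EBS driver.
	csiNodes, err := clientset.StorageV1().CSINodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range csiNodes.Items {
		for _, drv := range n.Spec.Drivers {
			if drv.Name == "ebs.csi.aws.com" && drv.Allocatable != nil && drv.Allocatable.Count != nil {
				fmt.Printf("%s: %d allocatable attachments\n", n.Name, *drv.Allocatable.Count)
			}
		}
	}
}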

What you expected to happen?

I expected the driver to report 25 allocatable volumes on this node, just like on the other nodes.

How to reproduce it (as minimally and precisely as possible)?

Restart the ebs-csi-node-xxxx pod on any node where some EBS volumes are already mounted (by pods).
The reported allocatable count will be lower than expected.

This has severe effects:

  • it becomes impossible to schedule new pods that mount EBS volumes on these nodes
  • this is compounded by the fact that EBS volumes are AZ-dependent, so the scheduler has less and less leeway to schedule these pods
  • because the reserved count already includes the pod volumes that are attached, the scheduler ends up counting those volumes twice: once in the lowered limit and once for its own scheduling constraints (a rough numeric illustration follows this list)
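
To make the double counting concrete, here is a small Go sketch; the numbers (27 attachment slots, 10 attached pod volumes) are purely hypothetical and not taken from a real node:

package main

import "fmt"

func main() {
	// Hypothetical numbers, for illustration only.
	instanceSlots := 27      // attachment slots available on the instance
	rootAndENI := 2          // root disk + one ENI
	podVolumesAttached := 10 // EBS volumes already attached for running pods

	// When the driver starts late, the heuristic reserves every attached block
	// device, including the pod volumes, so the reported limit shrinks.
	reportedLimit := instanceSlots - rootAndENI - podVolumesAttached // 15 instead of 25

	// The scheduler then also counts the 10 attached volumes against that limit,
	// so only 5 slots remain instead of the expected 15.
	fmt.Println("reported allocatable:", reportedLimit)
	fmt.Println("slots the scheduler sees as free:", reportedLimit-podVolumesAttached)
}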

The workaround we found is to manually override node.reservedVolumeAttachments to 1, so that the auto-detection heuristic is skipped.

I'm actually surprised that nobody else has run into this problem. We make extensive use of EBS, though, so maybe that explains it.

Anything else we need to know?:

We use the ebs-csi-driver add-on on our EKS cluster with the default configuration, provisioned through a Terraform module.

From a glance at the driver code, it looks like there are two issues:

  • when the driver boots, it removes and recreates the EBS driver section in the csinode object, using the (erroneous) computation of the allocatable volume attachments
  • the heuristic uses the number of block devices attached to the node at the moment the driver starts, which includes the EBS volumes already attached for pods. Those volumes are therefore counted "twice": once as reserved attachments, and once by the scheduler when it decides whether it can schedule new pods (it sees that these pods each consume an "allocatable" slot).

The faulty code (IMHO) is here:

if reservedVolumeAttachments == -1 {
	// Auto-detect number of reserved volume attachments - plus 1 to account for the root volume
	reservedVolumeAttachments = d.metadata.GetNumBlockDeviceMappings() + 1
}

This counts ALL block devices attached at the time the driver starts, instead of only those that are present by default (root disk, ENIs).
The driver can start at ANY point in the node lifecycle, so it cannot be expected to see the correct baseline number of devices if it starts after pods have already been scheduled on the node.
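
As a simplified illustration of that start-time dependence (the function and numbers below are illustrative, not the driver's actual code):

package main

import "fmt"

// allocatableAttachments mimics the auto-detect heuristic: it reserves one slot
// per block device seen at driver startup, plus one for the root volume, and
// subtracts that from the instance's total attachment slots.
func allocatableAttachments(instanceSlots, blockDevicesAtStartup int) int {
	reserved := blockDevicesAtStartup + 1
	return instanceSlots - reserved
}

func main() {
	// Fresh node: only the root device is visible at startup.
	fmt.Println(allocatableAttachments(27, 1)) // 25
	// Same node, driver restarted after 10 pod volumes were attached.
	fmt.Println(allocatableAttachments(27, 11)) // 15
}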

Environment

  • Kubernetes version (use kubectl version): 1.32
  • Driver version: 1.53 (but reproduced on 1.52 as well)
