Description
/kind bug
What happened?
The reported maximum number of allocatable volumes depends on the moment when the ebs-csi-node DaemonSet pod starts.
On some of our nodes, the reported allocatable volume count is extremely low (12, 13, sometimes even 2), whereas we use instances that should be able to report 25 volumes (we have 1 root disk and 1 ENI).
You can see the issue when listing the csinode objects (observed on a 4-node cluster).
What you expected to happen?
I expected the driver to expose 25 volumes, just like the other nodes.
How to reproduce it (as minimally and precisely as possible)?
Restart the ebs-csi-node-xxxx pod on any node that already has some EBS volumes attached (from pods).
The reported allocatable count will be lower than expected.
This has radical effects:
- impossible to schedule new pods that mount EBS volumes on these nodes
- compounded by the fact that EBS volumes are AZ-dependent, so the scheduler has less and less leeway to place these pods
- the allocatable limit includes the actual pod volumes, which compounds it even further, since the scheduler counts pod volumes twice: once in the limit and once for its own scheduling constraints
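The double counting works out roughly like this (a minimal sketch; the 26-attachment instance limit, the 10 pod volumes, and the assumption that the block-device count seen at restart is root disk + pod volumes are all illustrative, not taken from the driver):

```go
package main

import "fmt"

// allocatable mirrors the driver's heuristic: instance attachment limit
// minus (attached block devices + 1 for the root volume).
func allocatable(instanceLimit, blockDevices int) int {
	return instanceLimit - (blockDevices + 1)
}

func main() {
	const limit = 26   // illustrative attachment limit for the instance type
	const podVols = 10 // EBS volumes already attached by pods

	// At node boot only the root disk is attached: 26 - 2 = 24.
	boot := allocatable(limit, 1)

	// After a driver restart the pod volumes are counted as block
	// devices too: 26 - 12 = 14.
	restart := allocatable(limit, 1+podVols)

	// The scheduler then subtracts the same pod volumes a second time
	// from whatever allocatable count was reported.
	fmt.Println(boot-podVols, restart-podVols) // 14 4
}
```

So a node that could in reality still take 14 more EBS-backed pods advertises room for only 4 after a driver restart.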
The current workaround we found is to manually set node.reservedVolumeAttachments to 1, which skips the volume-count heuristic entirely.
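The override's effect can be sketched the same way (a sketch, not the driver's actual code; the 26-slot limit is illustrative). With an explicit value the reserved count no longer depends on when the pod starts, which is why limit - 1 = 25 stays stable:

```go
package main

import "fmt"

// reserved returns the slots subtracted from the instance limit. A
// non-negative override bypasses the block-device heuristic.
func reserved(override, blockDevices int) int {
	if override >= 0 {
		return override
	}
	return blockDevices + 1 // heuristic: attached devices + root volume
}

func main() {
	const limit = 26 // illustrative attachment limit

	// Heuristic path: driver restarted with 11 block devices visible.
	fmt.Println(limit - reserved(-1, 11)) // 14

	// node.reservedVolumeAttachments=1: stable regardless of start time.
	fmt.Println(limit - reserved(1, 11)) // 25
}
```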
I'm actually surprised that nobody else has run into this. We do make extensive use of EBS, though, so maybe that explains it.
Anything else we need to know?:
We use the ebs-csi-driver add-on on our EKS cluster with the default configuration, provisioned through a Terraform module.
From a glance in the driver code, it looks like there are two issues:
- when the driver boots, it removes and recreates the ebs driver section in the csinode object with the (erroneous) computation of the allocatable volume attachments
- the heuristic takes the number of block devices attached to the node when the driver starts, which includes the EBS volumes attached by pods. So those volumes are counted twice: as reserved volumes, and by the scheduler when deciding whether to schedule new pods (it sees those pods occupying "allocatable" slots).
The faulty code (IMHO) is here:
aws-ebs-csi-driver/pkg/driver/node.go
Line 815 in 79e8b1d
if reservedVolumeAttachments == -1 {
	// Auto-detect number of reserved volume attachments - plus 1 to account for the root volume
	reservedVolumeAttachments = d.metadata.GetNumBlockDeviceMappings() + 1
}

This counts ALL block devices, instead of only those that are mounted by default.
The driver can boot at ANY point in the node lifecycle, so it cannot be expected to receive a correct number of devices if started after pods have been scheduled on it.
Environment
- Kubernetes version (use kubectl version): 1.32
- Driver version: 1.53 (also reproduced on 1.52)