Reported allocatable volume attachment is erroneous #2796

@Gui13

Description

/kind bug

What happened?

The reported maximum number of allocatable volumes depends on the moment when the ebs-csi-node daemonset pod starts.

On some of our nodes, the reported allocatable volume count is extremely low (12, 13, sometimes even 2), whereas we use instances that should report 25 volumes (we have 1 root disk and 1 ENI).

You can see the issue when listing the csinodes objects (here is an example on a 4-node cluster):

[screenshot: csinodes listing showing one node reporting a much lower allocatable count than the others]
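
For reference, here is a minimal client-go sketch that prints the same per-node allocatable count the driver reports (it assumes a local kubeconfig and the standard ebs.csi.aws.com driver name):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List CSINode objects and print the allocatable attachment count of the EBS driver.
	csiNodes, err := clientset.StorageV1().CSINodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range csiNodes.Items {
		for _, drv := range n.Spec.Drivers {
			if drv.Name == "ebs.csi.aws.com" && drv.Allocatable != nil && drv.Allocatable.Count != nil {
				fmt.Printf("%s: %d allocatable attachments\n", n.Name, *drv.Allocatable.Count)
			}
		}
	}
}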

What you expected to happen?

I expected the driver to report 25 allocatable volumes on this node, just like on the other nodes.

How to reproduce it (as minimally and precisely as possible)?

Restart the ebs-csi-node-xxxx pod on any node where some EBS volumes are already mounted (by pods).
The reported allocatable count will be lower than expected.

This has severe effects:

  • it becomes impossible to schedule new pods that mount EBS volumes on these nodes
  • this is compounded by the fact that EBS volumes are AZ-dependent, so the scheduler has less and less leeway to schedule these pods
  • because the reserved count already includes the pod volumes that are attached, the scheduler ends up counting those volumes twice: once in the lowered limit and once for its own scheduling constraints (a rough numeric illustration follows this list)
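
To make the double counting concrete, here is a small Go sketch; the numbers (27 attachment slots, 10 attached pod volumes) are purely hypothetical and not taken from a real node:

package main

import "fmt"

func main() {
	// Hypothetical numbers, for illustration only.
	instanceSlots := 27      // attachment slots available on the instance
	rootAndENI := 2          // root disk + one ENI
	podVolumesAttached := 10 // EBS volumes already attached for running pods

	// When the driver starts late, the heuristic reserves every attached block
	// device, including the pod volumes, so the reported limit shrinks.
	reportedLimit := instanceSlots - rootAndENI - podVolumesAttached // 15 instead of 25

	// The scheduler then also counts the 10 attached volumes against that limit,
	// so only 5 slots remain instead of the expected 15.
	fmt.Println("reported allocatable:", reportedLimit)
	fmt.Println("slots the scheduler sees as free:", reportedLimit-podVolumesAttached)
}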

The workaround we found is to manually override node.reservedVolumeAttachments to 1, so that the auto-detection heuristic is skipped.

I'm actually surprised that nobody else has run into this problem. We make extensive use of EBS, though, so maybe that explains it.

Anything else we need to know?:

We use the ebs-csi-driver add-on on our EKS cluster with the default configuration, provisioned through a Terraform module.

From a glance at the driver code, it looks like there are two issues:

  • when the driver boots, it removes and recreates the EBS driver section in the csinode object, using the (erroneous) computation of the allocatable volume attachments
  • the heuristic uses the number of block devices attached to the node at the moment the driver starts, which includes the EBS volumes already attached for pods. Those volumes are therefore counted "twice": once as reserved attachments, and once by the scheduler when it decides whether it can schedule new pods (it sees that these pods each consume an "allocatable" slot).

The faulty code (IMHO) is here:

if reservedVolumeAttachments == -1 {
	// Auto-detect number of reserved volume attachments - plus 1 to account for the root volume
	reservedVolumeAttachments = d.metadata.GetNumBlockDeviceMappings() + 1
}

This counts ALL block devices attached at the time the driver starts, instead of only those that are present by default (root disk, ENIs).
The driver can start at ANY point in the node lifecycle, so it cannot be expected to see the correct baseline number of devices if it starts after pods have already been scheduled on the node.
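
As a simplified illustration of that start-time dependence (the function and numbers below are illustrative, not the driver's actual code):

package main

import "fmt"

// allocatableAttachments mimics the auto-detect heuristic: it reserves one slot
// per block device seen at driver startup, plus one for the root volume, and
// subtracts that from the instance's total attachment slots.
func allocatableAttachments(instanceSlots, blockDevicesAtStartup int) int {
	reserved := blockDevicesAtStartup + 1
	return instanceSlots - reserved
}

func main() {
	// Fresh node: only the root device is visible at startup.
	fmt.Println(allocatableAttachments(27, 1)) // 25
	// Same node, driver restarted after 10 pod volumes were attached.
	fmt.Println(allocatableAttachments(27, 11)) // 15
}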

Environment

  • Kubernetes version (use kubectl version): 1.32
  • Driver version: 1.53 (but reproduced on 1.52 as well)
