Summary
After upgrading the Mountpoint for Amazon S3 CSI driver from v1 to v2.x on EKS, Cluster Autoscaler (CA) stopped scaling nodes down reliably.
CA marks nodes as unneeded but refuses to remove them because of mount-s3/mp-* pods created by the S3 CSI driver, which it treats as “pods not backed by a controller”. This effectively pins many nodes and prevents scale-down across the cluster.
Downgrading the S3 CSI driver back to v1.13.0 removes the mount-s3/mp-* pods and immediately restores normal scale-down behavior, with no other changes to Cluster Autoscaler or workloads.
Environment details
EKS clusters: multiple (perf/assembly/etc.)
EKS control plane: v1.33 (also observed on v1.32)
Node AMI: amazon-eks-node-al2023-x86_64-standard-1.33-v20251209 (example from AWS support’s lab, our cluster is similar)
Cluster Autoscaler image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.33.3 (also reproduced on v1.32.x)
Mountpoint for Amazon S3 CSI driver:
Working version (no issue): v1.13.0-eksbuild.1
Problematic version: v2.0.0-eksbuild.1
S3 CSI driver installed as an EKS add-on.
Node groups: ASG tagged for CA with minSize < maxSize (e.g., min:1 max:10)
Workloads: multiple Deployments/StatefulSets using S3 volumes via Mountpoint S3 CSI driver.
Minimal reproduction steps
Create an EKS 1.33 cluster with:
Cluster Autoscaler v1.33.3 configured per EKS best practices (ASG tags k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster-name>; see the sketch below).
Node group(s) with minSize < maxSize so scale-up/down is allowed.
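For reference, a minimal sketch of the tagging/auto-discovery setup assumed above (the ASG name my-nodegroup-asg and cluster name my-cluster are placeholders, not our real resources):
# Tag the node group's ASG so Cluster Autoscaler auto-discovers it
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-nodegroup-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=my-nodegroup-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=true"
# Matching Cluster Autoscaler flag:
#   --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster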
Install Mountpoint for Amazon S3 CSI driver v2.0.0-eksbuild.1 as an EKS add-on.
Deploy a workload that uses S3 via the CSI driver, for example:
A Deployment with Pods mounting an S3 bucket via the statically provisioned PersistentVolume / PVC configuration recommended in the driver docs (sketched below).
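A minimal sketch of that kind of workload, following the static-provisioning pattern from the driver docs (bucket name, namespace, image, and mount options are placeholders; adjust to your environment):
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 1200Gi            # ignored by the driver, required by the API
  accessModes:
    - ReadWriteMany
  storageClassName: ""         # static provisioning only
  mountOptions:
    - allow-delete
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-csi-driver-volume
    volumeAttributes:
      bucketName: my-example-bucket    # placeholder
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: s3-pv
  resources:
    requests:
      storage: 1200Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3-reader
  namespace: default
spec:
  replicas: 5
  selector:
    matchLabels:
      app: s3-reader
  template:
    metadata:
      labels:
        app: s3-reader
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:stable
          command: ["sh", "-c", "ls /data && sleep 3600"]
          volumeMounts:
            - name: s3-volume
              mountPath: /data
      volumes:
        - name: s3-volume
          persistentVolumeClaim:
            claimName: s3-pvc
EOF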
Generate load:
Scale that Deployment up to dozens/hundreds of replicas so the cluster scales out.
Reduce load:
Scale the Deployment back down (e.g., from 100 replicas to 5) and wait for Cluster Autoscaler’s scale-down window.
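Concretely, the load generation/reduction above is just (deployment name taken from the sketch earlier, or any S3-consuming Deployment):
kubectl scale deployment s3-reader -n default --replicas=100   # forces scale-out
# ...wait for nodes to be added and pods to settle, then:
kubectl scale deployment s3-reader -n default --replicas=5     # CA should now see unneeded nodes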
Observe:
kubectl get pods -n mount-s3 -o wide shows multiple mp-* pods (one per node or per S3 consumer).
Cluster Autoscaler logs show nodes cannot be removed due to these pods:
Node cannot be removed: mount-s3/mp-xxxxx is not replicated
failed to find place for mount-s3/mp-xxxxx: couldn't find a matching Node with passing predicates
As a result, nodes with these mp-* pods never get scaled down, even when they are underutilized and CA considers them unneeded.
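Commands we use for this observation step (the Cluster Autoscaler label selector is an assumption; adjust it to however CA is deployed in your cluster):
kubectl get pods -n mount-s3 -o wide
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-cluster-autoscaler --tail=1000 \
  | grep -E "cannot be removed|mount-s3/mp-"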
Downgrade S3 CSI:
Downgrade the EKS S3 CSI add-on back to v1.13.0-eksbuild.1.
Delete any existing mp-* pods in mount-s3 (they are not recreated on v1).
Repeat steps 4–5 (generate load, then reduce load).
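The downgrade itself was done via the EKS add-on API, roughly as follows (cluster name is a placeholder):
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name aws-mountpoint-s3-csi-driver \
  --addon-version v1.13.0-eksbuild.1 \
  --resolve-conflicts OVERWRITE
# v1 does not create per-mount pods, so remove any leftovers from v2
kubectl delete pods -n mount-s3 --all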
After downgrade:
kubectl get pods -n mount-s3 shows no mp-* pods.
Cluster Autoscaler logs no longer mention mount-s3/mp-* is not replicated.
CA successfully drains and removes unneeded nodes from the ASGs (only PDBs/minSize may block some nodes, which is expected).
Expected vs actual behavior
Expected behavior
Mountpoint S3 CSI driver v2.x should not introduce pods that effectively prevent Cluster Autoscaler from scaling down nodes in normal EKS configurations.
Nodes that are underutilized and only host replicated workloads should be removable by CA, regardless of S3 mounts.
Actual behavior with v2.0.0-eksbuild.1
mount-s3/mp-* pods are created on many nodes.
Those pods are not owned by a controller (Deployment/DaemonSet), so CA classifies them as non‑replicated singleton pods.
CA repeatedly logs that nodes cannot be removed because mount-s3/mp-* is not replicated and fails to scale down those nodes, even when they are unneeded from a utilization perspective.
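This can be checked directly on the pod metadata; as far as we can tell, the mp-* pods carry no ownerReference to any controller kind CA treats as replicated (ReplicaSet, DaemonSet, StatefulSet, Job), which is exactly what its “not replicated” check keys on (pod name below is just an example):
kubectl get pod -n mount-s3 mp-4c5r7 -o jsonpath='{.metadata.ownerReferences}'
# expect empty output, or at least no controller owner of a kind CA recognizes as replicated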
Actual behavior after downgrading to v1.13.0-eksbuild.1
mount-s3/mp-* pods disappear.
CA stops logging the “not replicated” errors for Mountpoint pods.
CA is able to mark nodes as unneeded, cordon/taint them, drain workloads, and successfully terminate nodes from the ASGs; the only remaining blockers are PDBs and usual minSize constraints.
Relevant logs / snippets
Cluster Autoscaler log snippets
I1216 20:28:55.343278 1 cluster.go:169] Node ip-10-112-14-230.ec2.internal cannot be removed: mount-s3/mp-4c5r7 is not replicated
I1216 20:31:27.637682 1 klogx.go:87] failed to find place for mount-s3/mp-phkzn: can't schedule pod mount-s3/mp-phkzn: couldn't find a matching Node with passing predicates
After downgrade
I1217 01:57:41.443] Scale-down removing node ip-10-112-15-103.ec2.internal, utilization 0.009, pods to reschedule com-wwex-container-finsearchsettlement-...
# no more "mount-s3/mp-*" blocking messages
Pods
# With v2.x
kubectl get pods -n mount-s3 -o wide
NAME       READY   STATUS    ...   NODE
mp-4c5r7   1/1     Running   ...   ip-...
...
# After downgrade to v1.13.x
kubectl get pods -n mount-s3
No resources found in mount-s3 namespace
Additional context
AWS Support is already aware of this case and has indicated that v1 and v2 differ in how Mountpoint pods are scheduled, and that our issue appears related to that change. They were unable to reproduce it in a simplified lab environment and advised us to open this GitHub issue with our reproduction steps so the Mountpoint S3 CSI engineering team can investigate against a more realistic workload mix.
We need guidance on:
Whether this CA interaction is expected with the current v2 design.
Any recommended configuration/workarounds (e.g., different pod ownership model, annotations, or scheduling tweaks) to make v2 compatible with CA scale‑down.
Whether future v2.x releases will change how mp-* pods are represented so they no longer block CA.
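For context on the kind of workaround we have in mind: Cluster Autoscaler skips its “not backed by a controller” check for pods annotated as safe to evict, so something along these lines (purely illustrative and not validated by us; presumably it would need to be set by the driver on the mp-* pods it creates rather than patched by hand) might be one direction:
kubectl annotate pod -n mount-s3 mp-4c5r7 \
  cluster-autoscaler.kubernetes.io/safe-to-evict="true"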