[cluster-autoscaler] Autoscaler does not scale down after failed scale-up and workload disappears #9120

@thatmidwesterncoder

Description

Which component are you using?:

/area cluster-autoscaler

What version of the component are you using?:

  • Chart version: 9.54.1
  • App version: 1.35.0
  • Image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.34.2

(Built from head as well; issue persists)

What k8s version are you using (kubectl version)?:

kubectl version Output
Client Version: v1.35.0
Kustomize Version: v5.7.1
Server Version: v1.34.1+k3s1

What environment is this in?:
Kubernetes cluster with nodes managed by CAPI (Cluster API) using an infrastructure provider. The issue is specifically observed when the infra provider is at its upper capacity due to infrastructure limitations and the infra provider fails to provision new nodes as requested by the autoscaler.

What did you expect to happen?:
I expect the autoscaler to scale the machinePool (node group) back down once the excess load disappears after a failed scale-up (e.g. one caused by provider limits or no available capacity upstream).

What happened instead?:
After the failed scale-up attempt, the CAPI infra provider keeps retrying to provision the new node indefinitely, even after the load returns to normal. Scale-down never triggers until a node actually appears and the autoscaler then notices it is unused.

How to reproduce it (as minimally and precisely as possible):

  1. Set your machine pool's max size higher than your infrastructure actually allows (i.e. simulate insufficient capacity in your CAPI infrastructure provider).
  2. Trigger a scale-up event (e.g. a workload spike).
  3. The infrastructure provider fails to fulfill the new node request (the CAPI infra provider cannot provision the node).
  4. Once the load disappears, observe that cluster-autoscaler does not scale the requested replicas back down, and the infrastructure provider keeps retrying (manual intervention is needed).
  5. Also note that when the scaleUpRequest times out, the registry deletes the scaleUpRequest and puts the machineDeployment into backoff without scaling its target size back down (see the simplified sketch after this list).
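For reference, here is a simplified, self-contained Go sketch of the behavior described in step 5. It does not use the real cluster-autoscaler types; the names (scaleUpRequest, targetSize, updateScaleRequests) only mirror the concepts from clusterstate.go: when a tracked scale-up passes its deadline, the request is dropped and the group is put into backoff, but the target size the infra provider reconciles towards is never reduced.

```go
package main

import (
	"fmt"
	"time"
)

// scaleUpRequest mirrors the fields the issue refers to: which group was
// scaled up, when, by how much, and the deadline for the nodes to appear.
type scaleUpRequest struct {
	nodeGroup       string
	requestTime     time.Time
	expectedAddTime time.Time
	increase        int
}

type registry struct {
	scaleUpRequests map[string]*scaleUpRequest
	targetSize      map[string]int // what the infra provider (e.g. CAPI) reconciles towards
}

func (r *registry) updateScaleRequests(now time.Time) {
	for name, req := range r.scaleUpRequests {
		if now.After(req.expectedAddTime) {
			// Current behavior: record the failure/backoff and forget the request...
			fmt.Printf("scale-up of +%d for %s timed out, marking node group as backed off\n", req.increase, name)
			delete(r.scaleUpRequests, name)
			// ...but targetSize[name] is never decreased, so the provider
			// keeps retrying to provision the missing node indefinitely.
		}
	}
}

func main() {
	now := time.Now()
	r := &registry{
		scaleUpRequests: map[string]*scaleUpRequest{
			"machinepool-a": {
				nodeGroup:       "machinepool-a",
				requestTime:     now.Add(-20 * time.Minute),
				expectedAddTime: now.Add(-5 * time.Minute),
				increase:        1,
			},
		},
		// 3 nodes actually exist; the 4th was requested but never provisioned.
		targetSize: map[string]int{"machinepool-a": 4},
	}
	r.updateScaleRequests(now)
	fmt.Println("target size after timeout:", r.targetSize["machinepool-a"]) // still 4
}
```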

Anything else we need to know?:

  • This creates operational headaches for teams who want to run close to their infra capacity, since they have to manually scale the clusters back down once capacity exhaustion is observed.
  • See below for design/debug discussion for the root cause and a possible fix (via patch):

Patch for workaround

I was able to work around this by modifying clusterstate.go to decrease the target size (i.e. "undo" the scale-up) when a scale-up times out and the scaleUpRequest is removed from tracking in the ClusterStateRegistry:

diff --git a/cluster-autoscaler/clusterstate/clusterstate.go b/cluster-autoscaler/clusterstate/clusterstate.go
index 564641df6..3e2e9fa47 100644
--- a/cluster-autoscaler/clusterstate/clusterstate.go
+++ b/cluster-autoscaler/clusterstate/clusterstate.go
@@ -307,6 +307,21 @@ func (csr *ClusterStateRegistry) updateScaleRequests(currentTime time.Time) {
 				ErrorCode:    "timeout",
 				ErrorMessage: fmt.Sprintf("Scale-up timed out for node group %v after %v", nodeGroupName, currentTime.Sub(scaleUpRequest.Time)),
 			}, gpuResource, gpuType, currentTime)
+
+			// Attempt to revert the failed scale-up by decreasing target size.
+			// This prevents cloud providers from indefinitely retrying failed provisioning attempts.
+			if scaleUpRequest.Increase > 0 {
+				klog.V(2).Infof("Reverting timed-out scale-up for node group %v by decreasing target size by %d",
+					nodeGroupName, scaleUpRequest.Increase)
+				err := scaleUpRequest.NodeGroup.DecreaseTargetSize(-scaleUpRequest.Increase)
+				if err != nil {
+					klog.Warningf("Failed to revert timed-out scale-up for node group %v: %v", nodeGroupName, err)
+					csr.logRecorder.Eventf(apiv1.EventTypeWarning, "FailedToRevertScaleUp",
+						"Failed to decrease target size for group %s after scale-up timeout: %v",
+						scaleUpRequest.NodeGroup.Id(), err)
+				}
+			}
+
 			delete(csr.scaleUpRequests, nodeGroupName)
 		}
 	}

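For context on why the patch passes -scaleUpRequest.Increase: DecreaseTargetSize on the NodeGroup interface expects a negative delta and is only meant to lower the target back towards the number of nodes that actually registered, without deleting existing nodes. The standalone sketch below uses a fake node group (not the real cloudprovider.NodeGroup implementation) to illustrate that contract and the call shape used in the patch.

```go
package main

import "fmt"

// fakeNodeGroup is a stand-in for cloudprovider.NodeGroup; only the two
// methods the patch uses (Id and DecreaseTargetSize) are modeled here.
type fakeNodeGroup struct {
	id         string
	targetSize int // desired size the infra provider reconciles towards
	registered int // nodes that actually joined the cluster
}

func (g *fakeNodeGroup) Id() string { return g.id }

// DecreaseTargetSize mimics the documented contract: the delta must be
// negative, and the target can only shrink towards the number of nodes
// that are actually registered -- existing nodes are never deleted.
func (g *fakeNodeGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease must be negative, got %d", delta)
	}
	if g.targetSize+delta < g.registered {
		return fmt.Errorf("cannot decrease target below %d registered nodes", g.registered)
	}
	g.targetSize += delta
	return nil
}

func main() {
	// 3 nodes joined; the 4th never provisioned because the provider is at capacity.
	g := &fakeNodeGroup{id: "machinepool-a", targetSize: 4, registered: 3}
	increase := 1 // scaleUpRequest.Increase from the timed-out request

	// Same call shape as the patch: negate the increase to undo the scale-up.
	if err := g.DecreaseTargetSize(-increase); err != nil {
		fmt.Println("revert failed:", err)
		return
	}
	fmt.Println("target size after revert:", g.targetSize) // back to 3
}
```

Lowering the target back to the registered count should, as far as I can tell, let the CAPI reconcile loop stop retrying the node that can never be provisioned.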
I'd love input from maintainers on whether this is a generally sound approach or whether a broader fix is needed for autoscaler deployments. I'm happy to open a PR if desired!
