[cluster-autoscaler] Autoscaler does not scale down after failed scale-up and workload disappears #9120

@thatmidwesterncoder

Description

Which component are you using?:

/area cluster-autoscaler

What version of the component are you using?:

  • Chart version: 9.54.1
  • App version: 1.35.0
  • Image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.34.2

(Built from head as well; issue persists)

What k8s version are you using (kubectl version)?:

kubectl version Output
Client Version: v1.35.0
Kustomize Version: v5.7.1
Server Version: v1.34.1+k3s1

What environment is this in?:
Kubernetes cluster with nodes managed by CAPI (Cluster API) using an infrastructure provider. The issue is specifically observed when the infra provider is at its upper capacity due to infrastructure limitations and the infra provider fails to provision new nodes as requested by the autoscaler.

What did you expect to happen?:
I expect the autoscaler to scale the machinePool (node group) back down once the excess load disappears after a failed scale-up (e.g. one caused by provider limits or no available capacity upstream).

What happened instead?:
After the failed scale-up attempt, the CAPI infra provider keeps retrying to provision the new node indefinitely, even after the load returns to normal. Scale-down never triggers until a node actually appears and the autoscaler then notices it is unused.

How to reproduce it (as minimally and precisely as possible):

  1. Set your machine pool's max size higher than your infrastructure actually allows (i.e. simulate insufficient capacity in your CAPI infrastructure provider).
  2. Trigger a scale-up event (e.g. a workload spike).
  3. The infrastructure provider fails to fulfill the new node request (the CAPI infra provider cannot provision the node).
  4. Once the load disappears, observe that cluster-autoscaler does not scale the requested replicas back down, and the infrastructure provider keeps retrying (manual intervention is needed).
  5. Also note that when the scaleUpRequest times out, the registry deletes the scaleUpRequest and puts the machineDeployment into backoff without scaling its target size back down (see the simplified sketch after this list).
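For reference, here is a simplified, self-contained Go sketch of the behavior described in step 5. It does not use the real cluster-autoscaler types; the names (scaleUpRequest, targetSize, updateScaleRequests) only mirror the concepts from clusterstate.go: when a tracked scale-up passes its deadline, the request is dropped and the group is put into backoff, but the target size the infra provider reconciles towards is never reduced.

```go
package main

import (
	"fmt"
	"time"
)

// scaleUpRequest mirrors the fields the issue refers to: which group was
// scaled up, when, by how much, and the deadline for the nodes to appear.
type scaleUpRequest struct {
	nodeGroup       string
	requestTime     time.Time
	expectedAddTime time.Time
	increase        int
}

type registry struct {
	scaleUpRequests map[string]*scaleUpRequest
	targetSize      map[string]int // what the infra provider (e.g. CAPI) reconciles towards
}

func (r *registry) updateScaleRequests(now time.Time) {
	for name, req := range r.scaleUpRequests {
		if now.After(req.expectedAddTime) {
			// Current behavior: record the failure/backoff and forget the request...
			fmt.Printf("scale-up of +%d for %s timed out, marking node group as backed off\n", req.increase, name)
			delete(r.scaleUpRequests, name)
			// ...but targetSize[name] is never decreased, so the provider
			// keeps retrying to provision the missing node indefinitely.
		}
	}
}

func main() {
	now := time.Now()
	r := &registry{
		scaleUpRequests: map[string]*scaleUpRequest{
			"machinepool-a": {
				nodeGroup:       "machinepool-a",
				requestTime:     now.Add(-20 * time.Minute),
				expectedAddTime: now.Add(-5 * time.Minute),
				increase:        1,
			},
		},
		// 3 nodes actually exist; the 4th was requested but never provisioned.
		targetSize: map[string]int{"machinepool-a": 4},
	}
	r.updateScaleRequests(now)
	fmt.Println("target size after timeout:", r.targetSize["machinepool-a"]) // still 4
}
```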

Anything else we need to know?:

  • This creates operational headaches for teams who want to run close to their infra capacity, since they have to manually scale the clusters back down once capacity exhaustion is observed.
  • See below for design/debug discussion for the root cause and a possible fix (via patch):

Patch for workaround

I was able to work around this by modifying clusterstate.go to decrease the target size (i.e. "undo" the scale-up) when a scale-up times out and the scaleUpRequest is removed from tracking in the ClusterStateRegistry:

diff --git a/cluster-autoscaler/clusterstate/clusterstate.go b/cluster-autoscaler/clusterstate/clusterstate.go
index 564641df6..3e2e9fa47 100644
--- a/cluster-autoscaler/clusterstate/clusterstate.go
+++ b/cluster-autoscaler/clusterstate/clusterstate.go
@@ -307,6 +307,21 @@ func (csr *ClusterStateRegistry) updateScaleRequests(currentTime time.Time) {
 				ErrorCode:    "timeout",
 				ErrorMessage: fmt.Sprintf("Scale-up timed out for node group %v after %v", nodeGroupName, currentTime.Sub(scaleUpRequest.Time)),
 			}, gpuResource, gpuType, currentTime)
+
+			// Attempt to revert the failed scale-up by decreasing target size.
+			// This prevents cloud providers from indefinitely retrying failed provisioning attempts.
+			if scaleUpRequest.Increase > 0 {
+				klog.V(2).Infof("Reverting timed-out scale-up for node group %v by decreasing target size by %d",
+					nodeGroupName, scaleUpRequest.Increase)
+				err := scaleUpRequest.NodeGroup.DecreaseTargetSize(-scaleUpRequest.Increase)
+				if err != nil {
+					klog.Warningf("Failed to revert timed-out scale-up for node group %v: %v", nodeGroupName, err)
+					csr.logRecorder.Eventf(apiv1.EventTypeWarning, "FailedToRevertScaleUp",
+						"Failed to decrease target size for group %s after scale-up timeout: %v",
+						scaleUpRequest.NodeGroup.Id(), err)
+				}
+			}
+
 			delete(csr.scaleUpRequests, nodeGroupName)
 		}
 	}

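For context on why the patch passes -scaleUpRequest.Increase: DecreaseTargetSize on the NodeGroup interface expects a negative delta and is only meant to lower the target back towards the number of nodes that actually registered, without deleting existing nodes. The standalone sketch below uses a fake node group (not the real cloudprovider.NodeGroup implementation) to illustrate that contract and the call shape used in the patch.

```go
package main

import "fmt"

// fakeNodeGroup is a stand-in for cloudprovider.NodeGroup; only the two
// methods the patch uses (Id and DecreaseTargetSize) are modeled here.
type fakeNodeGroup struct {
	id         string
	targetSize int // desired size the infra provider reconciles towards
	registered int // nodes that actually joined the cluster
}

func (g *fakeNodeGroup) Id() string { return g.id }

// DecreaseTargetSize mimics the documented contract: the delta must be
// negative, and the target can only shrink towards the number of nodes
// that are actually registered -- existing nodes are never deleted.
func (g *fakeNodeGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease must be negative, got %d", delta)
	}
	if g.targetSize+delta < g.registered {
		return fmt.Errorf("cannot decrease target below %d registered nodes", g.registered)
	}
	g.targetSize += delta
	return nil
}

func main() {
	// 3 nodes joined; the 4th never provisioned because the provider is at capacity.
	g := &fakeNodeGroup{id: "machinepool-a", targetSize: 4, registered: 3}
	increase := 1 // scaleUpRequest.Increase from the timed-out request

	// Same call shape as the patch: negate the increase to undo the scale-up.
	if err := g.DecreaseTargetSize(-increase); err != nil {
		fmt.Println("revert failed:", err)
		return
	}
	fmt.Println("target size after revert:", g.targetSize) // back to 3
}
```

Lowering the target back to the registered count should, as far as I can tell, let the CAPI reconcile loop stop retrying the node that can never be provisioned.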
I'd love input from maintainers on whether this is a generally sound approach or whether a broader fix is needed for autoscaler deployments. I'm happy to open a PR if desired!
