Which component are you using?
cluster-autoscaler
What version of the component are you using?
cluster-autoscaler 1.32.0
What k8s version are you using?
Kubernetes 1.32.11 (LTS)
What environment is this in?
Azure, self-deployed cluster-autoscaler running on AKS with multiple VMSS node pools (GPU workloads).
What did you expect to happen?
The cluster-autoscaler should maintain the correct node count (90 nodes in the GPU VMSS) and only scale up/down based on actual workload demand. When getCurSize() detects a drastic capacity change (e.g., from 90 to 0), it should perform some sanity check rather than unconditionally trusting the value and acting on it.
What happened instead?
getCurSize() returned new size: 0 for a VMSS that had 90 running instances. The cluster-autoscaler unconditionally accepted this value, updated its in-memory size from 90 to 0, and then proceeded to "scale up" to 1 node to satisfy pending pods — effectively issuing a CreateOrUpdate call that set the VMSS capacity to 1, destroying 89 running GPU nodes.
The exact cause of getCurSize() returning 0 is unclear. Possible causes include:
- The Azure VMSS API returned SKU.Capacity = 0 (a transient API anomaly)
- The cached VMSS object had a nil or zero SKU.Capacity value
- Some other code path corrupted the cached capacity
Regardless of the root cause, the autoscaler should not blindly act on such a drastic capacity change without any validation.
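For illustration, a defensive read along these lines would distinguish "capacity unknown" from "capacity zero", so a nil or missing field cannot masquerade as a real size of 0. The struct shapes and the helper below are hypothetical stand-ins for the Azure SDK types, not the actual cluster-autoscaler code:

```go
package main

import "fmt"

// Minimal stand-ins for the Azure SDK shapes read by getCurSize()
// (hypothetical; the real types live in the Azure SDK for Go).
type SKU struct {
	Capacity *int64
}

type VirtualMachineScaleSet struct {
	SKU *SKU
}

// capacityOf treats a nil SKU or nil Capacity as "unknown" rather than
// zero, so the caller can keep its previous in-memory size instead of
// acting on a value that may simply be missing.
func capacityOf(set *VirtualMachineScaleSet) (int64, bool) {
	if set == nil || set.SKU == nil || set.SKU.Capacity == nil {
		return 0, false // unknown — keep the previous size
	}
	return *set.SKU.Capacity, true
}

func main() {
	ninety := int64(90)
	fmt.Println(capacityOf(&VirtualMachineScaleSet{SKU: &SKU{Capacity: &ninety}}))
	fmt.Println(capacityOf(&VirtualMachineScaleSet{SKU: &SKU{}}))
}
```

This would only catch the nil-field hypothesis; it does not help if the API genuinely returned a capacity of 0.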
Timeline of events
21:38:11 — getCurSize() refreshes VMSS cache. All other VMSS pools return correct sizes, but the GPU pool returns capacity=0:
I0320 21:38:11.829158 azure_scale_set.go:247] VMSS: vmss-pool-c, in-memory size: 19, new size: 19
I0320 21:38:11.829172 azure_scale_set_instance_cache.go:78] invalidating instanceCache for vmss-gpu-pool
I0320 21:38:11.829178 azure_scale_set.go:247] VMSS: vmss-gpu-pool, in-memory size: 90, new size: 0
Note: three other VMSS pools refreshed at the same timestamp all returned correct sizes. Only the GPU pool was affected.
21:38:13 — CA now believes the pool has 0 nodes. Scale-down pre-filtering skips all 90 nodes because current: 0 < min: 3:
I0320 21:38:13.812664 pre_filtering_processor.go:67] Skipping vmss-gpu-pool000y9a - node group min size reached (current: 0, min: 3)
... (repeated for ~80+ nodes)
21:39:25 — CA detects pending pods, sees current size as 0, and decides to scale up to 1:
I0320 21:39:25.215090 orchestrator.go:189] Estimated 1 nodes needed in vmss-gpu-pool
I0320 21:39:25.215105 orchestrator.go:254] Final scale-up plan: [{vmss-gpu-pool 0->1 (max: 600)}]
I0320 21:39:25.215131 executor.go:166] Scale-up: setting group vmss-gpu-pool size to 1
I0320 21:39:25.215162 azure_scale_set.go:309] Remaining unsatisfied count is 1. Attempting to increase scale set capacity
I0320 21:39:25.215171 azure_scale_set.go:465] Waiting for virtualMachineScaleSetsClient.CreateOrUpdateAsync(vmss-gpu-pool)
21:40:26 — Azure executes the CreateOrUpdate successfully. The VMSS is now actually at 1 node — 89 GPU nodes destroyed.
I0320 21:40:26.660769 azure_scale_set.go:289] waitForCreateOrUpdateInstances(vmss-gpu-pool) success
I0320 21:40:28.753165 azure_scale_set.go:247] VMSS: vmss-gpu-pool, in-memory size: 1, new size: 1
21:41:34 ~ 21:44:57 — CA slowly scales back up: 1→2, then 2→43, but the damage is already done.
I0320 21:41:34.653097 orchestrator.go:254] Final scale-up plan: [{vmss-gpu-pool 1->2 (max: 600)}]
I0320 21:44:57.221711 orchestrator.go:254] Final scale-up plan: [{vmss-gpu-pool 2->43 (max: 600)}]
Root Cause Analysis
In getCurSize() (azure_scale_set.go), the new size is read from *set.SKU.Capacity and applied unconditionally:
curSize := *set.SKU.Capacity
// ...
scaleSet.curSize = curSize
There is no sanity check for drastic capacity drops. When getCurSize() returns 0 for a VMSS that previously had 90 nodes, the autoscaler blindly trusts it. This is then compounded when the scale-up logic issues a CreateOrUpdate call with a small target size, which Azure interprets as a request to scale down the VMSS — deleting the excess nodes.
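One possible shape for such a guard is sketched below. The zero-drop rule, its threshold, and its placement in getCurSize() are assumptions for discussion, not an agreed fix:

```go
package main

import "fmt"

// acceptReportedSize is a sketch of a plausibility check for the size
// that getCurSize() reads from the VMSS cache. A sudden drop to zero
// from a non-trivial in-memory size is rejected; the caller would keep
// the previous value and force a fresh, non-cached GET before acting.
func acceptReportedSize(inMemory, reported int64) bool {
	if reported == 0 && inMemory > 0 {
		return false // suspicious: e.g. 90 -> 0 in a single refresh cycle
	}
	return true
}

func main() {
	fmt.Println(acceptReportedSize(90, 0))  // the incident: rejected
	fmt.Println(acceptReportedSize(19, 19)) // normal refresh: accepted
	fmt.Println(acceptReportedSize(90, 85)) // genuine scale-down: accepted
}
```

A guard like this trades a slightly stale size during a real scale-to-zero for protection against acting on a corrupted one; re-verifying against the live API would resolve the ambiguity either way.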
Distinction from #7432
This issue is different from #7432. In #7432, the in-memory size was gradually decremented due to repeated failed deletion retries. In this case, the in-memory size jumped from 90 to 0 in a single step. The StrictCacheUpdates flag introduced by #7481 does not protect against this scenario.
How to reproduce it
This is difficult to reproduce on demand, as the root cause of getCurSize() returning 0 is not fully understood. It appears to be a rare, transient issue, but when it occurs the consequences are catastrophic. It affected only one VMSS out of several in the same refresh cycle — the other three VMSS pools returned correct capacity values at the same timestamp.
Anything else we need to know?
Impact
- 89 GPU nodes destroyed in a production cluster within ~1 minute
- Workloads disrupted across the entire GPU node pool
- Slow recovery: CA had to gradually scale back up (0→1→2→43→...) over many minutes, plus node provisioning and GPU driver initialization time
/kind bug