
Azure: getCurSize() returned 0 for a 90-node VMSS, causing catastrophic scale-down to 1 node #9452

@birdemily

Description

Which component are you using?

cluster-autoscaler

What version of the component are you using?

cluster-autoscaler 1.32.0

What k8s version are you using?

Kubernetes 1.32.11 (LTS)

What environment is this in?

Azure, self-deployed cluster-autoscaler running on AKS with multiple VMSS node pools (GPU workloads).

What did you expect to happen?

The cluster-autoscaler should maintain the correct node count (90 nodes in the GPU VMSS) and only scale up/down based on actual workload demand. When getCurSize() observes a drastic capacity change (e.g., from 90 to 0), it should apply some sanity check — for example, rejecting the implausible value or re-querying the API — rather than unconditionally trusting it and acting on it.

What happened instead?

getCurSize() returned new size: 0 for a VMSS that had 90 running instances. The cluster-autoscaler unconditionally accepted this value, updated its in-memory size from 90 to 0, and then proceeded to "scale up" to 1 node to satisfy pending pods — effectively issuing a CreateOrUpdate call that set the VMSS capacity to 1, destroying 89 running GPU nodes.

The exact cause of getCurSize() returning 0 is unclear. Possible causes include:

  • The Azure VMSS API returned SKU.Capacity = 0 (transient API anomaly)
  • The cached VMSS object had a nil or zero SKU.Capacity value
  • Some other code path corrupted the cached capacity

Regardless of the root cause, the autoscaler should not blindly act on such a drastic capacity change without any validation.
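For illustration, the nil/zero cache scenario listed above can be guarded at the read site. The types below are a minimal mock of the Azure SDK shape (on the real VMSS model, SKU.Capacity is a pointer to an integer); `capacityOrFallback` is a hypothetical helper, not existing autoscaler code:

```go
package main

import "fmt"

// Minimal mock of the Azure SDK shape involved: Capacity is a *int64,
// so a nil SKU or nil Capacity makes an unguarded dereference like
// *set.SKU.Capacity panic — or silently read 0 if the cache stored a
// zeroed value.
type sku struct{ Capacity *int64 }
type vmss struct{ SKU *sku }

// capacityOrFallback is a hypothetical defensive read: it returns the
// reported capacity only when the pointer chain is fully populated,
// and falls back to the last known size otherwise.
func capacityOrFallback(set *vmss, lastKnown int64) int64 {
	if set == nil || set.SKU == nil || set.SKU.Capacity == nil {
		return lastKnown
	}
	return *set.SKU.Capacity
}

func main() {
	n := int64(90)
	fmt.Println(capacityOrFallback(&vmss{SKU: &sku{Capacity: &n}}, 3)) // prints 90
	fmt.Println(capacityOrFallback(&vmss{SKU: &sku{}}, 90))            // nil Capacity: prints 90
	fmt.Println(capacityOrFallback(nil, 42))                           // nil set: prints 42
}
```

This would not explain a well-formed API response that genuinely reported capacity 0, but it would rule out the corrupted-cache and nil-pointer paths as sources of a spurious zero.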

Timeline of events

21:38:11 — getCurSize() refreshes the VMSS cache. All other VMSS pools return correct sizes, but the GPU pool returns capacity=0:

I0320 21:38:11.829158  azure_scale_set.go:247] VMSS: vmss-pool-c, in-memory size: 19, new size: 19
I0320 21:38:11.829172  azure_scale_set_instance_cache.go:78] invalidating instanceCache for vmss-gpu-pool
I0320 21:38:11.829178  azure_scale_set.go:247] VMSS: vmss-gpu-pool, in-memory size: 90, new size: 0

Note: three other VMSS pools refreshed at the same timestamp all returned correct sizes. Only the GPU pool was affected.

21:38:13 — CA now believes the pool has 0 nodes. Scale-down pre-filtering skips all 90 nodes because current: 0 < min: 3:

I0320 21:38:13.812664  pre_filtering_processor.go:67] Skipping vmss-gpu-pool000y9a - node group min size reached (current: 0, min: 3)
... (repeated for ~80+ nodes)

21:39:25 — CA detects pending pods, sees current size as 0, and decides to scale up to 1:

I0320 21:39:25.215090  orchestrator.go:189] Estimated 1 nodes needed in vmss-gpu-pool
I0320 21:39:25.215105  orchestrator.go:254] Final scale-up plan: [{vmss-gpu-pool 0->1 (max: 600)}]
I0320 21:39:25.215131  executor.go:166] Scale-up: setting group vmss-gpu-pool size to 1
I0320 21:39:25.215162  azure_scale_set.go:309] Remaining unsatisfied count is 1. Attempting to increase scale set capacity
I0320 21:39:25.215171  azure_scale_set.go:465] Waiting for virtualMachineScaleSetsClient.CreateOrUpdateAsync(vmss-gpu-pool)

21:40:26 — Azure executes the CreateOrUpdate successfully. The VMSS is now actually at 1 node — 89 GPU nodes destroyed.

I0320 21:40:26.660769  azure_scale_set.go:289] waitForCreateOrUpdateInstances(vmss-gpu-pool) success
I0320 21:40:28.753165  azure_scale_set.go:247] VMSS: vmss-gpu-pool, in-memory size: 1, new size: 1

21:41:34 – 21:44:57 — CA slowly scales back up: 1→2, then 2→43, but the damage is already done.

I0320 21:41:34.653097  orchestrator.go:254] Final scale-up plan: [{vmss-gpu-pool 1->2 (max: 600)}]
I0320 21:44:57.221711  orchestrator.go:254] Final scale-up plan: [{vmss-gpu-pool 2->43 (max: 600)}]

Root Cause Analysis

In getCurSize() (azure_scale_set.go), the new size is read from *set.SKU.Capacity and applied unconditionally:

curSize := *set.SKU.Capacity
// ...
scaleSet.curSize = curSize

There is no sanity check for drastic capacity drops. When getCurSize() returns 0 for a VMSS that previously had 90 nodes, the autoscaler blindly trusts it. This is then compounded when the scale-up logic issues a CreateOrUpdate call with a small target size, which Azure interprets as a request to scale down the VMSS — deleting the excess nodes.
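One possible mitigation, sketched below as a hypothetical helper (`validateCurSize` and the 50% threshold are illustrative and not part of the autoscaler): refuse to accept a single-refresh capacity drop beyond some ratio of the cached size, and keep the last known value so the next refresh can re-validate against a fresh API read:

```go
package main

import "fmt"

// validateCurSize is a hypothetical guard for getCurSize(): it accepts
// the capacity reported by the VMSS API unless the value drops
// drastically relative to the cached in-memory size in a single
// refresh, in which case it keeps the cached value. The "more than
// half" threshold is illustrative only.
func validateCurSize(cached, reported int64) int64 {
	if reported >= cached {
		return reported // growth or no change: always trust the API
	}
	if cached > 0 && reported < cached/2 {
		fmt.Printf("suspicious capacity drop %d -> %d, keeping cached size\n", cached, reported)
		return cached
	}
	return reported // plausible scale-down: accept it
}

func main() {
	fmt.Println(validateCurSize(90, 0))  // drastic drop rejected: prints 90
	fmt.Println(validateCurSize(90, 88)) // normal scale-down accepted: prints 88
	fmt.Println(validateCurSize(19, 19)) // unchanged: prints 19
}
```

A ratio check alone can reject legitimate large scale-downs (e.g., an operator manually draining a pool), so a real fix would likely pair it with a confirmation re-read of the VMSS before committing the new size.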

Distinction from #7432

This issue is different from #7432. In #7432, the in-memory size was gradually decremented due to repeated failed deletion retries. In this case, the in-memory size jumped from 90 to 0 in a single step. The StrictCacheUpdates flag introduced by #7481 does not protect against this scenario.

How to reproduce it

This is difficult to reproduce on demand as the root cause of getCurSize() returning 0 is not fully understood. It appears to be a rare, transient issue, but when it occurs, the consequences are catastrophic. The issue only affected one VMSS out of several in the same refresh cycle — the other three VMSS pools returned correct capacity values at the same timestamp.

Anything else we need to know?

Impact

  • 89 GPU nodes destroyed in a production cluster within ~1 minute
  • Workloads disrupted across the entire GPU node pool
  • Slow recovery: CA had to gradually scale back up (0→1→2→43→...) over many minutes, plus node provisioning and GPU driver initialization time

/kind bug
