
Azure: getCurSize() returned 0 for a 90-node VMSS, causing catastrophic scale-down to 1 node #9452

@birdemily

Description

Which component are you using?

cluster-autoscaler

What version of the component are you using?

cluster-autoscaler 1.32.0

What k8s version are you using?

Kubernetes 1.32.11 (LTS)

What environment is this in?

Azure, self-deployed cluster-autoscaler running on AKS with multiple VMSS node pools (GPU workloads).

What did you expect to happen?

The cluster-autoscaler should maintain the correct node count (90 nodes in the GPU VMSS) and only scale up/down based on actual workload demand. When getCurSize() observes a drastic capacity change (e.g., from 90 to 0), it should apply some sanity check — for example, rejecting the implausible value or re-querying the API — rather than unconditionally trusting it and acting on it.

What happened instead?

getCurSize() returned new size: 0 for a VMSS that had 90 running instances. The cluster-autoscaler unconditionally accepted this value, updated its in-memory size from 90 to 0, and then proceeded to "scale up" to 1 node to satisfy pending pods — effectively issuing a CreateOrUpdate call that set the VMSS capacity to 1, destroying 89 running GPU nodes.

The exact cause of getCurSize() returning 0 is unclear. Possible causes include:

  • The Azure VMSS API returned SKU.Capacity = 0 (transient API anomaly)
  • The cached VMSS object had a nil or zero SKU.Capacity value
  • Some other code path corrupted the cached capacity

Regardless of the root cause, the autoscaler should not blindly act on such a drastic capacity change without any validation.
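For illustration, the nil/zero cache scenario listed above can be guarded at the read site. The types below are a minimal mock of the Azure SDK shape (on the real VMSS model, SKU.Capacity is a pointer to an integer); `capacityOrFallback` is a hypothetical helper, not existing autoscaler code:

```go
package main

import "fmt"

// Minimal mock of the Azure SDK shape involved: Capacity is a *int64,
// so a nil SKU or nil Capacity makes an unguarded dereference like
// *set.SKU.Capacity panic — or silently read 0 if the cache stored a
// zeroed value.
type sku struct{ Capacity *int64 }
type vmss struct{ SKU *sku }

// capacityOrFallback is a hypothetical defensive read: it returns the
// reported capacity only when the pointer chain is fully populated,
// and falls back to the last known size otherwise.
func capacityOrFallback(set *vmss, lastKnown int64) int64 {
	if set == nil || set.SKU == nil || set.SKU.Capacity == nil {
		return lastKnown
	}
	return *set.SKU.Capacity
}

func main() {
	n := int64(90)
	fmt.Println(capacityOrFallback(&vmss{SKU: &sku{Capacity: &n}}, 3)) // prints 90
	fmt.Println(capacityOrFallback(&vmss{SKU: &sku{}}, 90))            // nil Capacity: prints 90
	fmt.Println(capacityOrFallback(nil, 42))                           // nil set: prints 42
}
```

This would not explain a well-formed API response that genuinely reported capacity 0, but it would rule out the corrupted-cache and nil-pointer paths as sources of a spurious zero.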

Timeline of events

21:38:11 — getCurSize() refreshes the VMSS cache. All other VMSS pools return correct sizes, but the GPU pool returns capacity=0:

I0320 21:38:11.829158  azure_scale_set.go:247] VMSS: vmss-pool-c, in-memory size: 19, new size: 19
I0320 21:38:11.829172  azure_scale_set_instance_cache.go:78] invalidating instanceCache for vmss-gpu-pool
I0320 21:38:11.829178  azure_scale_set.go:247] VMSS: vmss-gpu-pool, in-memory size: 90, new size: 0

Note: three other VMSS pools refreshed at the same timestamp all returned correct sizes. Only the GPU pool was affected.

21:38:13 — CA now believes the pool has 0 nodes. Scale-down pre-filtering skips all 90 nodes because current: 0 < min: 3:

I0320 21:38:13.812664  pre_filtering_processor.go:67] Skipping vmss-gpu-pool000y9a - node group min size reached (current: 0, min: 3)
... (repeated for ~80+ nodes)

21:39:25 — CA detects pending pods, sees current size as 0, and decides to scale up to 1:

I0320 21:39:25.215090  orchestrator.go:189] Estimated 1 nodes needed in vmss-gpu-pool
I0320 21:39:25.215105  orchestrator.go:254] Final scale-up plan: [{vmss-gpu-pool 0->1 (max: 600)}]
I0320 21:39:25.215131  executor.go:166] Scale-up: setting group vmss-gpu-pool size to 1
I0320 21:39:25.215162  azure_scale_set.go:309] Remaining unsatisfied count is 1. Attempting to increase scale set capacity
I0320 21:39:25.215171  azure_scale_set.go:465] Waiting for virtualMachineScaleSetsClient.CreateOrUpdateAsync(vmss-gpu-pool)

21:40:26 — Azure executes the CreateOrUpdate successfully. The VMSS is now actually at 1 node — 89 GPU nodes destroyed.

I0320 21:40:26.660769  azure_scale_set.go:289] waitForCreateOrUpdateInstances(vmss-gpu-pool) success
I0320 21:40:28.753165  azure_scale_set.go:247] VMSS: vmss-gpu-pool, in-memory size: 1, new size: 1

21:41:34 – 21:44:57 — CA slowly scales back up: 1→2, then 2→43, but the damage is already done.

I0320 21:41:34.653097  orchestrator.go:254] Final scale-up plan: [{vmss-gpu-pool 1->2 (max: 600)}]
I0320 21:44:57.221711  orchestrator.go:254] Final scale-up plan: [{vmss-gpu-pool 2->43 (max: 600)}]

Root Cause Analysis

In getCurSize() (azure_scale_set.go), the new size is read from *set.SKU.Capacity and applied unconditionally:

curSize := *set.SKU.Capacity
// ...
scaleSet.curSize = curSize

There is no sanity check for drastic capacity drops. When getCurSize() returns 0 for a VMSS that previously had 90 nodes, the autoscaler blindly trusts it. This is then compounded when the scale-up logic issues a CreateOrUpdate call with a small target size, which Azure interprets as a request to scale down the VMSS — deleting the excess nodes.
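One possible mitigation, sketched below as a hypothetical helper (`validateCurSize` and the 50% threshold are illustrative and not part of the autoscaler): refuse to accept a single-refresh capacity drop beyond some ratio of the cached size, and keep the last known value so the next refresh can re-validate against a fresh API read:

```go
package main

import "fmt"

// validateCurSize is a hypothetical guard for getCurSize(): it accepts
// the capacity reported by the VMSS API unless the value drops
// drastically relative to the cached in-memory size in a single
// refresh, in which case it keeps the cached value. The "more than
// half" threshold is illustrative only.
func validateCurSize(cached, reported int64) int64 {
	if reported >= cached {
		return reported // growth or no change: always trust the API
	}
	if cached > 0 && reported < cached/2 {
		fmt.Printf("suspicious capacity drop %d -> %d, keeping cached size\n", cached, reported)
		return cached
	}
	return reported // plausible scale-down: accept it
}

func main() {
	fmt.Println(validateCurSize(90, 0))  // drastic drop rejected: prints 90
	fmt.Println(validateCurSize(90, 88)) // normal scale-down accepted: prints 88
	fmt.Println(validateCurSize(19, 19)) // unchanged: prints 19
}
```

A ratio check alone can reject legitimate large scale-downs (e.g., an operator manually draining a pool), so a real fix would likely pair it with a confirmation re-read of the VMSS before committing the new size.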

Distinction from #7432

This issue is different from #7432. In #7432, the in-memory size was gradually decremented due to repeated failed deletion retries. In this case, the in-memory size jumped from 90 to 0 in a single step. The StrictCacheUpdates flag introduced by #7481 does not protect against this scenario.

How to reproduce it

This is difficult to reproduce on demand as the root cause of getCurSize() returning 0 is not fully understood. It appears to be a rare, transient issue, but when it occurs, the consequences are catastrophic. The issue only affected one VMSS out of several in the same refresh cycle — the other three VMSS pools returned correct capacity values at the same timestamp.

Anything else we need to know?

Impact

  • 89 GPU nodes destroyed in a production cluster within ~1 minute
  • Workloads disrupted across the entire GPU node pool
  • Slow recovery: CA had to gradually scale back up (0→1→2→43→...) over many minutes, plus node provisioning and GPU driver initialization time

/kind bug
