
Cluster Autoscaler uses SetDesiredCapacity causing AWS to terminate nodes with active workloads #9182

@gfvirga

Summary

The Kubernetes Cluster Autoscaler scaled down an ASG using SetDesiredCapacity instead of selectively terminating empty nodes via TerminateInstanceInAutoScalingGroup. This left AWS Auto Scaling to pick 32 instances to terminate according to its own termination policy, with no awareness of Kubernetes state, and the terminated instances included nodes running active GitLab CI jobs.

Environment

  • Cluster Autoscaler Version: v1.34.2
  • Kubernetes Version: 1.34
  • Cloud Provider: AWS

Timeline (2026-02-04 UTC)

  • 04:47:13 - Autoscaler sets desired capacity to 76
  • 04:47:23 - Autoscaler sets desired capacity to 94 (scale up)
  • 04:47:34 - Autoscaler sets desired capacity to 62 (scale down by 32)
  • 04:48:00 - Active GitLab runner pod evicted (job 668628 failed)
  • 04:48:49-04:49:05 - AWS Auto Scaling terminates 32 instances

Evidence

CloudTrail Event (04:47:34 UTC):

```json
{
  "eventName": "SetDesiredCapacity",
  "userAgent": "aws-sdk-go/1.48.7 (go1.24.10; linux; amd64) cluster-autoscaler/v1.34.2",
  "requestParameters": {
    "autoScalingGroupName": "REDACTED",
    "desiredCapacity": 62,
    "honorCooldown": false
  }
}
```
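
For clarity, below is roughly what that call looks like when issued through aws-sdk-go v1 (the SDK version shown in the userAgent). This is a reconstruction for illustration only; the session setup and error handling are assumptions, not taken from the cluster-autoscaler source.

```go
// Illustrative reproduction of the SetDesiredCapacity call recorded in the
// CloudTrail event above, using aws-sdk-go v1.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Bulk scale-down: the ASG itself chooses which instances to terminate
	// according to its termination policy; Kubernetes is not consulted.
	_, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("REDACTED"),
		DesiredCapacity:      aws.Int64(62),
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		fmt.Println("SetDesiredCapacity failed:", err)
	}
}
```

With SetDesiredCapacity, the ASG rather than the autoscaler decides which instances to remove, which is the crux of this report.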

Kubernetes Events:

  • Node ip-10-0-128-21.ec2.internal was marked "unremovable" by the autoscaler at 04:40:00
  • No ToBeDeletedByClusterAutoscaler taint (the marker Cluster Autoscaler applies to nodes it intends to delete) was found on the node; one way to check for it is sketched after this list
  • The node was terminated by AWS at 04:48:49 without any Kubernetes-side coordination
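
For reference, this is a minimal client-go sketch of one way to confirm whether a node carries that taint (assuming in-cluster access; the node name is taken from this report, everything else is illustrative and not how the autoscaler itself performs the check):

```go
// Check a node for the taint Cluster Autoscaler places on nodes it plans to delete.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	node, err := client.CoreV1().Nodes().Get(context.TODO(),
		"ip-10-0-128-21.ec2.internal", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Cluster Autoscaler taints nodes it intends to delete with this key.
	const taintKey = "ToBeDeletedByClusterAutoscaler"
	for _, t := range node.Spec.Taints {
		if t.Key == taintKey {
			fmt.Println("node is marked for deletion by the autoscaler")
			return
		}
	}
	fmt.Println("node is NOT marked for deletion by the autoscaler")
}
```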

Expected Behavior

The cluster autoscaler should:

  1. Identify specific empty or underutilized nodes as scale-down candidates
  2. Mark them with the ToBeDeletedByClusterAutoscaler taint
  3. Use the TerminateInstanceInAutoScalingGroup API to remove those specific instances (sketched after this list)
  4. Never terminate nodes with active workloads
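
A minimal sketch of the targeted call we expected to see in CloudTrail, again using aws-sdk-go v1; the instance ID and surrounding setup are hypothetical, and cluster-autoscaler's real AWS provider code may differ in detail:

```go
// Targeted scale-down: terminate a specific, already-drained instance.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Remove exactly this instance and decrement desired capacity so the
	// ASG does not launch a replacement or pick another victim.
	_, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String("i-0123456789abcdef0"), // hypothetical ID
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	if err != nil {
		fmt.Println("TerminateInstanceInAutoScalingGroup failed:", err)
	}
}
```

Called with ShouldDecrementDesiredCapacity set to true, this removes only the chosen instance and shrinks the group by one, so the ASG never selects a victim on its own.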

Actual Behavior

The cluster autoscaler:

  1. Made rapid capacity adjustments (76→94→62 in 21 seconds)
  2. Used SetDesiredCapacity for bulk scale-down
  3. Left AWS Auto Scaling to choose which instances to terminate, with no awareness of pod placement
  4. Resulted in the termination of nodes with running pods

Impact

  • GitLab CI job failures
  • Workload disruption
  • Loss of trust in autoscaler safety

Questions

  1. Why did the autoscaler use SetDesiredCapacity instead of TerminateInstanceInAutoScalingGroup?
  2. Is there a configuration issue causing aggressive scale-down behavior?
  3. Should the autoscaler ever use bulk capacity changes that delegate instance selection to AWS?

Proposed Fix

The autoscaler should always use TerminateInstanceInAutoScalingGroup for scale-down operations to ensure only safe-to-remove nodes are terminated.
