Summary
The Kubernetes Cluster Autoscaler scaled down an ASG by lowering its desired capacity with SetDesiredCapacity instead of selectively terminating empty nodes via TerminateInstanceInAutoScalingGroup. AWS Auto Scaling then picked 32 instances to terminate on its own, with no awareness of pod placement, including nodes running active GitLab CI jobs.
Environment
- Cluster Autoscaler Version: v1.34.2
- Kubernetes Version: 1.34
- Cloud Provider: AWS
Timeline (2026-02-04 UTC)
Capacity changes are as recorded by CloudTrail (a lookup sketch follows the list):
- 04:47:13 - Autoscaler sets desired capacity to 76
- 04:47:23 - Autoscaler sets desired capacity to 94 (scale-up)
- 04:47:34 - Autoscaler sets desired capacity to 62 (scale-down by 32)
- 04:48:00 - Active GitLab runner pod evicted (job 668628 failed)
- 04:48:49-04:49:05 - AWS Auto Scaling terminates 32 instances
Evidence
CloudTrail Event (04:47:34 UTC):
```json
{
  "eventName": "SetDesiredCapacity",
  "userAgent": "aws-sdk-go/1.48.7 (go1.24.10; linux; amd64) cluster-autoscaler/v1.34.2",
  "requestParameters": {
    "autoScalingGroupName": "REDACTED",
    "desiredCapacity": 62,
    "honorCooldown": false
  }
}
```
Kubernetes Events:
- Node ip-10-0-128-21.ec2.internal was marked "unremovable" by the autoscaler at 04:40:00
- No "ToBeDeletedByClusterAutoscaler" taint was present on the node (a client-go check is sketched below)
- The node was terminated by AWS at 04:48:49 without any Kubernetes-side coordination
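For reference, the cluster autoscaler marks nodes it intends to delete with the ToBeDeletedByClusterAutoscaler taint. A minimal client-go sketch for inspecting the node (the kubeconfig path is a placeholder; the node name is the one above):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig path is a placeholder; in-cluster config also works.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	node, err := client.CoreV1().Nodes().Get(context.TODO(),
		"ip-10-0-128-21.ec2.internal", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, taint := range node.Spec.Taints {
		if taint.Key == "ToBeDeletedByClusterAutoscaler" {
			// The taint value is the timestamp at which CA marked the node.
			fmt.Println("marked for deletion by cluster-autoscaler:", taint.Value)
		}
	}
}
```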
Expected Behavior
The cluster autoscaler should:
- Identify specific empty/underutilized nodes as scale-down candidates
- Taint them with ToBeDeletedByClusterAutoscaler
- Use the TerminateInstanceInAutoScalingGroup API to remove those specific instances (see the sketch after this list)
- Never terminate nodes with active workloads
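A minimal sketch of the expected call, using aws-sdk-go v1 to match the user agent in the CloudTrail record; the instance ID is a placeholder:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// terminateSpecificInstance removes one known-safe instance from its ASG.
// ShouldDecrementDesiredCapacity shrinks the group in the same call, so AWS
// never has to pick a victim on its own.
func terminateSpecificInstance(svc *autoscaling.AutoScaling, instanceID string) error {
	_, err := svc.TerminateInstanceInAutoScalingGroup(
		&autoscaling.TerminateInstanceInAutoScalingGroupInput{
			InstanceId:                     aws.String(instanceID),
			ShouldDecrementDesiredCapacity: aws.Bool(true),
		})
	return err
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	// The instance ID is a placeholder for a drained, CA-tainted node.
	if err := terminateSpecificInstance(svc, "i-0123456789abcdef0"); err != nil {
		fmt.Println("terminate failed:", err)
	}
}
```

Because the decrement happens atomically with the targeted termination, there is no window in which the ASG's desired capacity exceeds its membership and AWS has to select instances itself.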
Actual Behavior
The cluster autoscaler:
- Made rapid capacity adjustments (76 → 94 → 62 within 21 seconds)
- Used SetDesiredCapacity for the bulk scale-down
- Left instance selection to AWS Auto Scaling's termination policy, which has no awareness of pod placement
- Resulted in the termination of nodes with running pods
Impact
- GitLab CI job failures
- Workload disruption
- Loss of trust in autoscaler safety
Questions
- Why did the autoscaler use SetDesiredCapacity instead of TerminateInstanceInAutoScalingGroup?
- Is there a configuration issue causing this aggressive scale-down behavior? (The relevant flags are listed below.)
- Should the autoscaler ever use bulk capacity changes that delegate instance selection to AWS?
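For triage, these are the scale-down knobs I would expect to matter; the values shown are the documented upstream defaults, not confirmed from this cluster. Notably, the 94 → 62 drop came 11 seconds after a scale-up, which a default --scale-down-delay-after-add=10m should have prevented, so either the configuration is non-default or this was not the normal scale-down path.

```
--scan-interval=10s                      # main reconcile loop interval
--scale-down-delay-after-add=10m         # wait after a scale-up before considering scale-down
--scale-down-unneeded-time=10m           # how long a node must be unneeded before removal
--scale-down-utilization-threshold=0.5   # below this utilization a node is a removal candidate
```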
Proposed Fix
The autoscaler should always scale down by terminating specific, safe-to-remove instances via TerminateInstanceInAutoScalingGroup with ShouldDecrementDesiredCapacity set (as sketched under Expected Behavior), rather than lowering desired capacity and delegating instance selection to AWS.