
Cluster Autoscaler uses SetDesiredCapacity causing AWS to terminate nodes with active workloads #9182

@gfvirga

Summary

The Kubernetes Cluster Autoscaler scaled down an ASG using SetDesiredCapacity instead of selectively terminating empty nodes via TerminateInstanceInAutoScalingGroup. This left AWS Auto Scaling to pick 32 instances to terminate according to its own termination policy, with no awareness of Kubernetes state, and the terminated instances included nodes running active GitLab CI jobs.

Environment

  • Cluster Autoscaler Version: v1.34.2
  • Kubernetes Version: 1.34
  • Cloud Provider: AWS

Timeline (2026-02-04 UTC)

  • 04:47:13 - Autoscaler sets desired capacity to 76
  • 04:47:23 - Autoscaler sets desired capacity to 94 (scale up)
  • 04:47:34 - Autoscaler sets desired capacity to 62 (scale down by 32)
  • 04:48:00 - Active GitLab runner pod evicted (job 668628 failed)
  • 04:48:49-04:49:05 - AWS Auto Scaling terminates 32 instances

Evidence

CloudTrail Event (04:47:34 UTC):

```json
{
  "eventName": "SetDesiredCapacity",
  "userAgent": "aws-sdk-go/1.48.7 (go1.24.10; linux; amd64) cluster-autoscaler/v1.34.2",
  "requestParameters": {
    "autoScalingGroupName": "REDACTED",
    "desiredCapacity": 62,
    "honorCooldown": false
  }
}
```
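
For clarity, below is roughly what that call looks like when issued through aws-sdk-go v1 (the SDK version shown in the userAgent). This is a reconstruction for illustration only; the session setup and error handling are assumptions, not taken from the cluster-autoscaler source.

```go
// Illustrative reproduction of the SetDesiredCapacity call recorded in the
// CloudTrail event above, using aws-sdk-go v1.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Bulk scale-down: the ASG itself chooses which instances to terminate
	// according to its termination policy; Kubernetes is not consulted.
	_, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("REDACTED"),
		DesiredCapacity:      aws.Int64(62),
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		fmt.Println("SetDesiredCapacity failed:", err)
	}
}
```

With SetDesiredCapacity, the ASG rather than the autoscaler decides which instances to remove, which is the crux of this report.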

Kubernetes Events:

  • Node ip-10-0-128-21.ec2.internal was marked "unremovable" by the autoscaler at 04:40:00
  • No ToBeDeletedByClusterAutoscaler taint (the marker Cluster Autoscaler applies to nodes it intends to delete) was found on the node; one way to check for it is sketched after this list
  • The node was terminated by AWS at 04:48:49 without any Kubernetes-side coordination
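
For reference, this is a minimal client-go sketch of one way to confirm whether a node carries that taint (assuming in-cluster access; the node name is taken from this report, everything else is illustrative and not how the autoscaler itself performs the check):

```go
// Check a node for the taint Cluster Autoscaler places on nodes it plans to delete.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	node, err := client.CoreV1().Nodes().Get(context.TODO(),
		"ip-10-0-128-21.ec2.internal", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Cluster Autoscaler taints nodes it intends to delete with this key.
	const taintKey = "ToBeDeletedByClusterAutoscaler"
	for _, t := range node.Spec.Taints {
		if t.Key == taintKey {
			fmt.Println("node is marked for deletion by the autoscaler")
			return
		}
	}
	fmt.Println("node is NOT marked for deletion by the autoscaler")
}
```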

Expected Behavior

The cluster autoscaler should:

  1. Identify specific empty or underutilized nodes as scale-down candidates
  2. Mark them with the ToBeDeletedByClusterAutoscaler taint
  3. Use the TerminateInstanceInAutoScalingGroup API to remove those specific instances (sketched after this list)
  4. Never terminate nodes with active workloads
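
A minimal sketch of the targeted call we expected to see in CloudTrail, again using aws-sdk-go v1; the instance ID and surrounding setup are hypothetical, and cluster-autoscaler's real AWS provider code may differ in detail:

```go
// Targeted scale-down: terminate a specific, already-drained instance.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Remove exactly this instance and decrement desired capacity so the
	// ASG does not launch a replacement or pick another victim.
	_, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String("i-0123456789abcdef0"), // hypothetical ID
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	if err != nil {
		fmt.Println("TerminateInstanceInAutoScalingGroup failed:", err)
	}
}
```

Called with ShouldDecrementDesiredCapacity set to true, this removes only the chosen instance and shrinks the group by one, so the ASG never selects a victim on its own.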

Actual Behavior

The cluster autoscaler:

  1. Made rapid capacity adjustments (76→94→62 in 21 seconds)
  2. Used SetDesiredCapacity for bulk scale-down
  3. Left AWS Auto Scaling to choose which instances to terminate, with no awareness of pod placement
  4. Resulted in the termination of nodes with running pods

Impact

  • GitLab CI job failures
  • Workload disruption
  • Loss of trust in autoscaler safety

Questions

  1. Why did the autoscaler use SetDesiredCapacity instead of TerminateInstanceInAutoScalingGroup?
  2. Is there a configuration issue causing aggressive scale-down behavior?
  3. Should the autoscaler ever use bulk capacity changes that delegate instance selection to AWS?

Proposed Fix

The autoscaler should always use TerminateInstanceInAutoScalingGroup for scale-down operations to ensure only safe-to-remove nodes are terminated.
