keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/README.md
@@ -92,6 +92,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [Design Details](#design-details)
- [new API changes](#new-api-changes)
- [ScaleUpFailed](#scaleupfailed)
+ - [How we implement <code>TriggeredScaleUp</code> in the cluster autoscaler](#how-we-implement--in-the-cluster-autoscaler)
- [PreemptionFailed](#preemptionfailed)
- [What if are both specified in <code>FallbackCriterion</code>?](#what-if-are-both-specified-in-)
- [Test Plan](#test-plan)
@@ -378,6 +379,26 @@ which creates new Node for Pod typically by the cluster autoscaler.
3. The cluster autoscaler adds `TriggeredScaleUp: false` to the Pod.
4. The scheduler notices `TriggeredScaleUp: false` on the Pod and schedules it while falling back to `ScheduleAnyway` on Pod Topology Spread (see the scheduler-side sketch after these steps).
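
To make step 4 concrete, here is a minimal, hypothetical scheduler-side sketch. It assumes `TriggeredScaleUp` is surfaced as a Pod condition (the exact API shape is defined in the API section of this KEP), and the helper name is illustrative rather than part of the proposal:

```go
// Hypothetical scheduler-side helper; the condition type is an assumption
// based on this KEP's description, not an existing Kubernetes API.
package fallback

import (
	v1 "k8s.io/api/core/v1"
)

// triggeredScaleUpCondition is the Pod condition this KEP expects the cluster
// autoscaler to report (assumed shape).
const triggeredScaleUpCondition v1.PodConditionType = "TriggeredScaleUp"

// shouldFallBackToScheduleAnyway returns true when the cluster autoscaler has
// reported TriggeredScaleUp=False on the Pod, meaning no node group can be
// scaled up for it; the PodTopologySpread plugin would then treat
// DoNotSchedule constraints as ScheduleAnyway.
func shouldFallBackToScheduleAnyway(pod *v1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == triggeredScaleUpCondition && cond.Status == v1.ConditionFalse {
			return true
		}
	}
	return false
}
```
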
+ #### How we implement `TriggeredScaleUp` in the cluster autoscaler
+
+ Basically, we set `TriggeredScaleUp: false` on the Pods in [status.ScaleUpStatus.PodsRemainUnschedulable](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/processors/status/scale_up_status_processor.go#L37) in every [reconciliation (RunOnce)](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/static_autoscaler.go#L296).
+
+ `status.ScaleUpStatus.PodsRemainUnschedulable` contains the Pods for which the cluster autoscaler [simulates](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go#L536) the scheduling process and determines that they wouldn't be schedulable in any node group.
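
The following sketch shows roughly where this would hook into the autoscaler's reconciliation: at the end of `RunOnce`, the Pods left in `status.ScaleUpStatus.PodsRemainUnschedulable` get the condition. It assumes `TriggeredScaleUp` is a Pod condition, and `markScaleUpFailed`/`setCondition` are hypothetical helpers rather than the real `ScaleUpStatusProcessor` plumbing:

```go
// Minimal sketch, not the actual cluster-autoscaler change; names and the
// condition shape are assumptions for illustration.
package fallback

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markScaleUpFailed sets TriggeredScaleUp=False on every Pod that the
// scale-up simulation left unschedulable, so the scheduler can fall back.
func markScaleUpFailed(ctx context.Context, client kubernetes.Interface, pods []*v1.Pod) error {
	for _, pod := range pods {
		podCopy := pod.DeepCopy()
		setCondition(podCopy, v1.PodCondition{
			Type:               "TriggeredScaleUp", // assumed condition type from this KEP
			Status:             v1.ConditionFalse,
			Reason:             "ScaleUpFailed",
			Message:            "no node group can make this Pod schedulable",
			LastTransitionTime: metav1.Now(),
		})
		if _, err := client.CoreV1().Pods(podCopy.Namespace).UpdateStatus(ctx, podCopy, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// setCondition adds the condition or replaces an existing one of the same type.
func setCondition(pod *v1.Pod, cond v1.PodCondition) {
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == cond.Type {
			pod.Status.Conditions[i] = cond
			return
		}
	}
	pod.Status.Conditions = append(pod.Status.Conditions, cond)
}
```
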
+
+ As a simple example, if a Pod requests 64 CPUs but no node group can satisfy that requirement,
+ the Pod ends up in `status.ScaleUpStatus.PodsRemainUnschedulable` and gets `TriggeredScaleUp: false`.
+
+ A more complicated scenario is also covered this way:
+ suppose a Pod requests 64 CPUs and only one node group can satisfy that requirement,
+ but that node group is currently out of instances.
+ In this case, the first reconciliation selects that node group to make the Pod schedulable,
+ but the cloud provider rejects the node group size increase because of the stockout.
+ The node group is then considered unsafe for a while,
+ and the next reconciliation runs without taking the failed node group into account.
+ Since no other node group can satisfy the 64 CPU requirement,
+ the Pod eventually ends up in `status.ScaleUpStatus.PodsRemainUnschedulable` and gets `TriggeredScaleUp: false`.

### PreemptionFailed

`PreemptionFailed` is used to fall back when preemption fails.