You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
docs(scheduler): extend semi-preemptible design to subgroup-level elasticity
Replace the pod-only semi-elasticity model with a tree-level one: a
SubGroupSet's minSubGroup defines how many child subgroups are core
(non-preemptible) while surplus subgroups are elastic, alongside the
existing leaf minMember pod split. The non-preemptible resource count
is redefined as the tree's minimal satisfying set, matching what the
allocator/eviction already compute.
Document the interaction with segmented subgroups (subgroup-level
elasticity is what gives segmented workloads an elastic tier, and
whole-segment eviction preserves the forced topology shape) and record
the leaf-only quota-accounting follow-up gap for the impl branch.
Signed-off-by: SiorMeir <msior@nvidia.com>
Copy file name to clipboardExpand all lines: docs/developer/designs/semi-preemptible/README.md
+39-17Lines changed: 39 additions & 17 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -4,7 +4,7 @@
4
4
5
5
In v0.10 we separated Priority and Preemption to allow users to control the two parameters independently, where Preemption has 2 modes (values) - **preemptible** and **non-preemptible**.
6
6
7
-
We want to add a new 3rd mode, named **semi-preemptible**, where the podgroup will be non-preemptible up to the `minMember`count of each leaf PodSet, and any extra pods are "elastic" pods and preemptible.
7
+
We want to add a new 3rd mode, named **semi-preemptible**, where the podgroup is non-preemptible up to its **minimum required shape** — `minMember`pods at each leaf PodSet and `minSubGroup` child subgroups at each intermediate node — and anything beyond that minimum is "elastic" and preemptible. Elasticity therefore applies at **every level of the subgroup tree**, not just to pods.
8
8
9
9
## Use Cases
10
10
@@ -30,43 +30,65 @@ Future options considered and deferred:
30
30
31
31
## Subgroups and Multi-Level Trees
32
32
33
-
### Scope: semi-elasticity is a pod-level concept
33
+
### Scope: semi-elasticity applies at every level of the tree
34
34
35
-
Semi-elasticity (the core/elastic split) applies**exclusively to pods**. Subgroups and groups are atomic scheduling units — they are either fully scheduled or not. There is no "semi-elastic subgroup" concept. A user who defines fine-grained subgroups intends them to be scheduled as a whole.
35
+
The core/elastic split is**not** limited to pods. It is defined by the minimum requirement of each node in the subgroup tree:
36
36
37
-
### Inheritance
37
+
- A **leaf PodSet** with `minMember = m` keeps `m` pods core (non-preemptible); any pods beyond `m` are elastic.
38
+
- An **intermediate SubGroupSet** with `minSubGroup = k` keeps its `k` most-prioritized child subgroups core; any additional scheduled child subgroups are elastic and reclaimed as a whole.
39
+
40
+
Subgroups remain **atomic** as a scheduling unit: an elastic subgroup is evicted in its entirety (gang), never partially. "Semi-elasticity at the subgroup level" means *which* subgroups are protected, not that a subgroup is itself split.
38
41
39
-
Subgroups inherit the preemption mode from the root PodGroup. If the PodGroup is **semi-preemptible**, all subgroups are **semi-preemptible**.
42
+
This matches what the scheduler already does. The allocator's gang phase (`collectTasksFromSubGroupSet` in `allocation_info.go`) schedules the `minSubGroup` highest-priority children first and then bursts extra children opportunistically; eviction (`eviction_info.go`) protects exactly the `minSubGroup` core children (`GetMinMembersToSatisfy()`) and reclaims surplus subgroups whole. As today, an elastic (extra) subgroup is only deployed if its pods can be gang-scheduled — otherwise it simply stays unsatisfied.
40
43
41
-
### `minMember` vs. `minSubGroups`
44
+
### Inheritance
42
45
43
-
The core/elastic split is determined **only at nodes that have `minMember` set** (leaf PodSets). Intermediate nodes that use `minSubGroups` define a scheduling gate (how many child subgroups must be satisfied) but do not themselves define a non-preemptible pod threshold.
46
+
Subgroups inherit the preemption mode from the root PodGroup. If the PodGroup is **semi-preemptible**, the whole tree is **semi-preemptible**, and each node's minimum (`minMember` or `minSubGroup`) defines its own core/elastic boundary.
44
47
45
-
Since pods are always attached to leaf PodSets and never to intermediate SubGroupSets, this is a natural boundary: `minMember` is always a leaf-level concept.
48
+
### Non-preemptible resource count = the minimal satisfying set
46
49
47
-
**Non-preemptible resource count** = sum of (`minMember × pod resource request`) across all scheduled leaf PodSets.
50
+
The non-preemptible (core) resources of a semi-preemptible job are the resources of the tree's **minimal satisfying set**, computed recursively from the root:
48
51
49
-
### Behavior when `minSubGroups < scheduled children`
52
+
- at each SubGroupSet, descend into the `minSubGroup` most-prioritized children (all of them if `minSubGroup` is unset);
53
+
- at each leaf PodSet, take `minMember` pods × the per-pod request.
50
54
51
-
When a parent node requires fewer children than are actually scheduled (i.e., some children are "extra" from the scheduling perspective), each child's core/elastic split is still determined independently by that child's own `minMember`. The "extra-ness" is a scheduling-gate concept handled by the existing elastic subgroup scheduling; it does not override the per-subgroup semi-preemptible semantics.
55
+
This is **the same set the allocator computes in its gang phase**, so quota and scheduling agree on what "core" means. It replaces the earlier leaf-only definition (`Σ minMember × request` over all leaf PodSets), which over-counted core resources whenever an intermediate `minSubGroup` made some leaf subgroups elastic.
56
+
57
+
Degenerate cases follow naturally:
58
+
- a leaf where total pods `== minMember`, or a node where scheduled children `== minSubGroup`, is fully non-preemptible at that level;
59
+
- the lower a node's `minSubGroup` (or a leaf's `minMember`) relative to what is scheduled, the wider its elastic tier;
60
+
-`minMember == 0` / no minimum ⇒ fully preemptible at that node.
52
61
53
62
## Immutability Constraint
54
63
55
64
A validation webhook must **prevent increases** to `minMember` and `minSubGroups` on a semi-preemptible PodGroup after creation. This applies to the root PodGroup spec and to all SubGroup entries within it.
56
65
57
-
**Rationale:** once a semi-preemptible job is running, some pods may be over-quota (the elastic ones). Increasing `minMember` or `minSubGroups` would silently reclassify those over-quota pods as "core" non-preemptible pods, violating quota invariants without a rescheduling cycle.
66
+
**Rationale:** once a semi-preemptible job is running, some of its shape may be over-quota (the elastic pods *and* the elastic subgroups). Increasing `minMember` would reclassify over-quota elastic pods as core; increasing `minSubGroup` would reclassify whole over-quota elastic subgroups as core. Either silently grows the minimal satisfying set and violates quota invariants without a rescheduling cycle — which is why both fields are immutable upward.
67
+
68
+
Decreasing these fields is allowed — it can only widen the elastic tier (at the pod or subgroup level, respectively).
69
+
70
+
## Interaction with Segmented Subgroups
71
+
72
+
The [segmented subgroups](../segmented-subgroups/README.md) feature auto-builds a specific tree shape: a workload's replicas are split into `N` fixed-size **segments**, each emitted as a leaf PodSet with `minAvailable = segmentSize`, nested under a parent SubGroupSet that may carry a `minSubGroup`. Segmentation forces this shape so each segment lands in its own topology domain (e.g. a rack).
73
+
74
+
Subgroup-level semi-elasticity is **what makes semi-preemptible meaningful for segmented workloads**:
75
+
76
+
- Each segment is fully gang (`minAvailable == segmentSize`), so there is **no pod-level elastic tier inside a segment**. Under a pod-only model every pod of every segment would be core, and a segmented semi-preemptible job would collapse to plain non-preemptible.
77
+
- With the tree-level model, the parent's `minSubGroup` defines how many **segments** are core; extra segments are elastic. A job with `minSubGroup: 2` over 4 segments keeps 2 segments core (in-quota, non-preemptible) and lets the other 2 burst as elastic, over-quota, reclaimed-first.
58
78
59
-
Decreasing these fields is allowed — it can only widen the elastic tier.
79
+
Because elastic eviction at the subgroup level drops a **whole** segment (a segment is a gang PodSet with no internal surplus), reclaim never tears apart a segment — the forced topology shape is preserved. Segmentation and semi-preemptible therefore compose cleanly: segmentation decides the shape, semi-preemptible decides which parts of that shape are guaranteed.
60
80
61
81
## Simulation Considerations
62
82
63
-
In simulations, only the "extra" (`n - minMember`) pods per leaf PodSet are considered as possible victims. This is applied independently per PodSet; no cross-subgroup victim selection is needed. This approach may miss some solutions when checking all $\binom{n}{m}$ orderings, but the added complexity is not justified for the MVP.
83
+
Victim selection considers only the **surplus** of each node: the "extra" (`n - minMember`) pods at a leaf PodSet, and the extra (`scheduled - minSubGroup`) child subgroups at an intermediate node (evicted whole). This is applied independently per node; no cross-subgroup victim selection is needed. This approach may miss some solutions when checking all orderings, but the added complexity is not justified for the MVP.
64
84
65
85
## Implementation Notes
66
86
67
-
-**Over-quota checks**: core pods (up to `minMember` per leaf) must be in-quota; extra pods may be over-quota
68
-
-**Podgroup and queue status**: count only `minMember` resources per leaf PodSet toward the non-preemptible totals. The pod ordering plugin determines which specific pods are "core" vs. "extra"
69
-
-**Solver simulation**: represent the "extra" pods of a semi-preemptible job as a fully-preemptible representative job
87
+
-**Allocation & eviction — already tree-aware.**`allocation_info.go` (`collectTasksFromSubGroupSet`) and `eviction_info.go` (`hasElasticSurplusInSubGroupSet`, `GetMinMembersToSatisfy()`) already gate core vs. elastic at every level of the tree. Admitting semi-preemptible jobs into the victim pools is enough for reclaim/preempt to evict only the surplus (elastic pods and surplus subgroups); no change to allocation or eviction logic is needed.
88
+
-**Over-quota checks**: the minimal satisfying set (see above) must be in-quota; anything beyond it may be over-quota.
89
+
-**Quota accounting — follow-up gap (impl branch, PR #1713).** The current quota accounting is **leaf-only**: `pkg/scheduler/plugins/proportion/proportion.go` (`isCoreTaskForSemiPreemptible`, `updateQueuesCurrentResourceUsage`) and `pkg/scheduler/plugins/proportion/capacity_policy/capacity_policy.go` (`filterCoreTasksToAllocate`, `isTaskCoreForSemiPreemptibleJob`) classify a pod as core purely by `count ≤ minMember` within its PodSet and ignore `minSubGroup`. This over-counts `AllocatedNotPreemptible` whenever an intermediate `minSubGroup` makes some leaf subgroups elastic (e.g. a `minSubGroup: 2` job over 4 segments counts all 4 segments as core while eviction protects only 2). These must be reworked to use the tree's minimal satisfying set, so quota matches the allocation/eviction semantics.
90
+
-**Correctness item to verify during impl**: ensure preempt/reclaim only ever take the elastic-phase victims for semi-preemptible jobs and never fall back to the phase-3 full eviction in `collectTasksToEvictFromSubGroupSet`, so core subgroups/pods are never offered as victims.
91
+
-**Solver simulation**: represent the elastic surplus of a semi-preemptible job (extra pods and extra subgroups) as a fully-preemptible representative job.
0 commit comments