Commit ba0b5bb

KEP-2724: Add multi-level topology aware scheduling design
Add the design for multi-level TAS, which extends two-level scheduling to support N slice layers across deeper topology hierarchies (e.g., datacenter → block → rack → host).
Parent: 9c069ea

2 files changed (+148 -3 lines)
keps/2724-topology-aware-scheduling/README.md

Lines changed: 146 additions & 2 deletions
@@ -14,9 +14,11 @@
- [Story 5](#story-5)
- [Story 6](#story-6)
- [Story 7](#story-7)
- [Story 8](#story-8)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Integration support](#integration-support)
- [Job](#job)
- [Job with multi-level scheduling](#job-with-multi-level-scheduling)
- [JobSet](#jobset)
- [LeaderWorkerSet](#leaderworkerset)
- [MPIJob with runLauncherAsWorker](#mpijob-with-runlauncherasworker)
@@ -51,12 +53,14 @@
- [Since v0.15](#since-v015)
- [Two-level Topology Aware scheduling](#two-level-topology-aware-scheduling)
- [Example](#example-1)
- [Multi-level Topology Aware scheduling](#multi-level-topology-aware-scheduling)
- [Example](#example-2)
- [Cross-PodSet Topology Aware scheduling](#cross-podset-topology-aware-scheduling)
- [Ensure leader and workers end up on the same flavor](#ensure-leader-and-workers-end-up-on-the-same-flavor)
- [Enforcing the assignment](#enforcing-the-assignment)
- [Support for Elastic Workloads](#support-for-elastic-workloads)
- [Balanced placement](#balanced-placement)
- [Example](#example-2)
- [Example](#example-3)
- [Support for ProvisioningRequests](#support-for-provisioningrequests)
- [Determining the need for second pass](#determining-the-need-for-second-pass)
- [Targeting the newly provisioned nodes](#targeting-the-newly-provisioned-nodes)
@@ -80,7 +84,7 @@
- [Rename the topologyAssignment.domains.values field as levelValues](#rename-the-topologyassignmentdomainsvalues-field-as-levelvalues)
- [Drop dedicated TAS label](#drop-dedicated-tas-label)
- [MostFreeCapacity algorithm](#mostfreecapacity-algorithm)
- [Example](#example-3)
- [Example](#example-4)
- [TopologyAssignmentSlices as separate CRD instances](#topologyassignmentslices-as-separate-crd-instances)
<!-- /toc -->

@@ -203,6 +207,13 @@ Similar to [Story 1](#story-1), but I want Leader and its Workers across multipl
for MPIJob with runLauncherAsWorker (`.spec.runLauncherAsWorker`) which should be scheduled considering
Pod index order.

#### Story 8

Similar to [Story 1](#story-1), but I want a Job's Pods to be placed across a
multi-layer topology based on user-specified constraints. For example, I want
to ensure that a Job is scheduled within the same data center, in multiples
of 64 on the same "block", and in multiples of 16 on the same "rack".

### Notes/Constraints/Caveats (Optional)

#### Integration support
@@ -242,6 +253,42 @@ spec:
In this example we indicate that all Pods created by the Job should be contained
within the same "rack".

##### Job with multi-level scheduling

According to [Story 8](#story-8), some users would like to place Pods of a Job
across multi-layer topologies.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  namespace: tas-example-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 128
  completions: 128
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-required-topology: cloud.provider.com/datacenter
        kueue.x-k8s.io/podset-slice-size: "64"
        kueue.x-k8s.io/podset-slice-required-topology: cloud.provider.com/aizone
        kueue.x-k8s.io/podset-slice-size-1: "32"
        kueue.x-k8s.io/podset-slice-required-topology-1: cloud.provider.com/block
        kueue.x-k8s.io/podset-slice-size-2: "16"
        kueue.x-k8s.io/podset-slice-required-topology-2: cloud.provider.com/rack
    spec:
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
```

This example ensures the Job is placed within a single data center while
leaving room for the user to tune the "multiple-of" constraint at each layer:
all 128 Pods land in one data center, in groups of 64 per aizone, 32 per
block, and 16 per rack. This setup achieves "gang-of-gang-of-gang" semantics
in complex NVIDIA GB200/GB300 topology architectures.

##### JobSet

One complication we noticed for JobSet is that the proposed design assumes
@@ -780,6 +827,26 @@ const (
	// This annotation is required if `kueue.x-k8s.io/podset-slice-required-topology`
	// is defined
	PodSetSliceSizeAnnotation = "kueue.x-k8s.io/podset-slice-size"

	// PodSetSliceRequiredTopologyAnnotation1 indicates the topology level required
	// by the second slice layer (first additional layer). Each group from the
	// parent slice layer is further subdivided into groups of size
	// PodSetSliceSizeAnnotation1, each constrained to the indicated topology domain.
	PodSetSliceRequiredTopologyAnnotation1 = "kueue.x-k8s.io/podset-slice-required-topology-1"

	// PodSetSliceSizeAnnotation1 describes the requested size of the second
	// slice layer (first additional layer).
	PodSetSliceSizeAnnotation1 = "kueue.x-k8s.io/podset-slice-size-1"

	// PodSetSliceRequiredTopologyAnnotation2 indicates the topology level required
	// by the third slice layer (second additional layer). Each group from the
	// parent slice layer is further subdivided into groups of size
	// PodSetSliceSizeAnnotation2, each constrained to the indicated topology domain.
	PodSetSliceRequiredTopologyAnnotation2 = "kueue.x-k8s.io/podset-slice-required-topology-2"

	// PodSetSliceSizeAnnotation2 describes the requested size of the third
	// slice layer (second additional layer).
	PodSetSliceSizeAnnotation2 = "kueue.x-k8s.io/podset-slice-size-2"
)
```

@@ -811,6 +878,12 @@ the rules is deactivated):
  specified its own default. See [Slice size validation](#slice-size-validation))
- The value of `kueue.x-k8s.io/podset-slice-size` has to be a numeric value greater than
  or equal to 1. It has to evenly divide the size of a PodSet.
- The two `podset-slice-*` rules above also apply to the additional slice layers
  (`kueue.x-k8s.io/podset-slice-required-topology-[X]`), where `[X]` can be up to `2`.
- The value of `kueue.x-k8s.io/podset-slice-size-[X]` must evenly divide the
  slice size of the parent layer (i.e., `podset-slice-size` for the first
  additional layer, or the preceding `podset-slice-size-[X-1]` for subsequent layers);
  see the sketch after this list.
- Additional slice layers must be specified in order of increasing topology depth.
- If `kueue.x-k8s.io/podset-group-name` is specified, the `kueue.x-k8s.io/podset-required-topology`
  or `kueue.x-k8s.io/podset-preferred-topology` has to also be specified in all other
  PodTemplates included in the PodSet Group and it has to have the same value.
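
The divisibility chain implied by these rules can be sketched as follows. This is a
minimal, self-contained illustration; `validateSliceSizes` is a hypothetical helper
invented for this sketch and is not the actual Kueue webhook code:

```go
package main

import "fmt"

// validateSliceSizes checks the divisibility rules above: the first slice size
// must evenly divide the PodSet size, and each additional layer's size must
// evenly divide the size of its parent layer.
func validateSliceSizes(podSetSize int32, sliceSizes []int32) error {
	parent := podSetSize
	for layer, size := range sliceSizes {
		if size < 1 {
			return fmt.Errorf("slice size at layer %d must be at least 1, got %d", layer, size)
		}
		if parent%size != 0 {
			return fmt.Errorf("slice size %d at layer %d does not evenly divide the parent size %d", size, layer, parent)
		}
		parent = size
	}
	return nil
}

func main() {
	// Matches the multi-level Job example above: 128 pods, slice sizes 64/32/16.
	fmt.Println(validateSliceSizes(128, []int32{64, 32, 16})) // <nil>
	fmt.Println(validateSliceSizes(128, []int32{64, 24}))     // 24 does not divide 64
}
```
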
@@ -826,6 +899,10 @@ However, in case of the JobSet we expect that the most frequent use-case will be
define PodSet Slice as a single Job, thus if `kueue.x-k8s.io/podset-slice-size`
is not defined for JobSet it defaults to `parallelism`.

For each additional slice layer `kueue.x-k8s.io/podset-slice-required-topology-[X]`, the
corresponding `kueue.x-k8s.io/podset-slice-size-[X]` is required. No defaulting logic
is applied for these layers, even for JobSet.

### Internal APIs

We extend the `Workload` structure to reflect the topology request at the
@@ -896,6 +973,33 @@ type PodSetTopologyRequest struct {
	//
	// +optional
	PodSetSliceSize *int32 `json:"podSetSliceSize,omitempty"`

	// additionalSliceLayers defines additional layers of recursive slice
	// subdivision beyond the first slice layer (podSetSliceRequiredTopology /
	// podSetSliceSize). Each layer further subdivides the parent layer's
	// groups into smaller groups constrained to a finer topology domain.
	// At most 2 additional layers are supported (for a total of 3 slice layers).
	//
	// +optional
	// +listType=atomic
	// +kubebuilder:validation:MaxItems=2
	AdditionalSliceLayers []SliceLayer `json:"additionalSliceLayers,omitempty"`
}

// SliceLayer defines a single additional slice subdivision layer.
type SliceLayer struct {
	// topology indicates the topology level required for this slice layer.
	//
	// +required
	// +kubebuilder:validation:MinLength=1
	// +kubebuilder:validation:MaxLength=63
	Topology string `json:"topology"`

	// size indicates the number of pods in each group at this slice layer.
	//
	// +required
	// +kubebuilder:validation:Minimum=1
	Size int32 `json:"size"`
}
```
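
As an illustration of how the multi-level Job example maps onto this structure, the
sketch below builds the corresponding `PodSetTopologyRequest` by hand. The structs are
abbreviated copies of the types above limited to the relevant fields, and the `Required`
field and `ptr` helper are assumptions made for the sketch rather than a statement about
the generated code:

```go
package main

import "fmt"

// Abbreviated, illustrative copies of the API types above.
type SliceLayer struct {
	Topology string
	Size     int32
}

type PodSetTopologyRequest struct {
	Required                    *string
	PodSetSliceRequiredTopology *string
	PodSetSliceSize             *int32
	AdditionalSliceLayers       []SliceLayer
}

func ptr[T any](v T) *T { return &v }

func main() {
	// Reflects the annotations from the multi-level Job example:
	// datacenter (required), aizone slices of 64, block slices of 32, rack slices of 16.
	req := PodSetTopologyRequest{
		Required:                    ptr("cloud.provider.com/datacenter"),
		PodSetSliceRequiredTopology: ptr("cloud.provider.com/aizone"),
		PodSetSliceSize:             ptr(int32(64)),
		AdditionalSliceLayers: []SliceLayer{
			{Topology: "cloud.provider.com/block", Size: 32},
			{Topology: "cloud.provider.com/rack", Size: 16},
		},
	}
	fmt.Printf("%+v\n", req)
}
```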

@@ -1428,6 +1532,46 @@ Explanation:

It is worth noting that the tight fit mentioned above does not guarantee that no free capacity will be left within the assigned domains.

### Multi-level Topology Aware scheduling

> [!NOTE]
> For the alpha implementation, the multi-level code path and API surface are
> kept separate from two-level scheduling. In a future iteration, we plan to
> unify them behind a single internal data structure.

In consideration of [Story 8](#story-8), multi-level scheduling extends two-level
scheduling to support more than two levels of topology constraints (e.g.,
datacenter → block → rack → host). Up to 2 additional slice layers
(`AdditionalSliceLayers`) can be specified beyond the first slice layer, for a
total of 3 slice layers. Each additional layer must reference a topology level
strictly lower (deeper) than the previous layer, and its size must evenly divide
the parent layer's size. This feature is gated by `TASMultiLayerTopology`
(alpha, disabled by default since v0.17).

In two-level scheduling, below the outermost slice level the algorithm
distributes individual pods (unit size = 1). With multi-level, a
`sliceSizeAtLevel` map records the required group size at each intermediate
level. During the downward traversal, the algorithm looks up this map to
distribute pods in correctly sized groups rather than individually. Before
assigning groups at a given inner level, the algorithm recomputes `sliceState`
on the child domains as `state / innerSliceSize`, since the `sliceState`
populated during phase 1 reflects only the outermost slice size. The same
sorting and selection logic (BestFit / LeastFreeCapacity) is applied at each
level.

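To make the traversal concrete, below is a simplified, self-contained Go sketch. It
starts below an already-selected top-level domain, ignores capacity tracking, domain
sorting, and the BestFit / LeastFreeCapacity heuristics, and the names (`distribute`,
`sliceSizeAtLevel`, `childrenPerDomain`) are invented for illustration rather than taken
from the Kueue code base:

```go
package main

import "fmt"

// distribute splits pods across the child domains of each level, in groups of
// the size required at that level (falling back to single pods below the
// innermost slice layer). Groups are spread round-robin; the real algorithm
// additionally sorts domains and checks their capacity.
func distribute(pods int32, levels []string, childrenPerDomain, sliceSizeAtLevel map[string]int32) {
	remaining := []int32{pods}
	for _, level := range levels {
		groupSize := sliceSizeAtLevel[level]
		if groupSize == 0 {
			groupSize = 1 // below the innermost slice layer pods are placed individually
		}
		var next []int32
		for _, podsInParent := range remaining {
			children := childrenPerDomain[level]
			perChild := make([]int32, children)
			for g := int32(0); g < podsInParent/groupSize; g++ {
				perChild[g%children] += groupSize
			}
			next = append(next, perChild...)
		}
		fmt.Printf("%-4s level: %v pods per domain (group size %d)\n", level, next, groupSize)
		remaining = next
	}
}

func main() {
	// Numbers from the example below: one selected block with 2 racks,
	// 4 hosts per rack, 64 pods, and an inner slice size of 16 at the rack level.
	distribute(64,
		[]string{"rack", "host"},
		map[string]int32{"rack": 2, "host": 4},
		map[string]int32{"rack": 16},
	)
}
```

Running it prints 32 pods per rack (two groups of 16 each) and 8 pods per host, matching
the walk-through in the example below.
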
#### Example

Consider a topology with 3 levels: block → rack → host. A block contains 2
racks, and each rack contains 4 hosts. Each host can accommodate 8 pods. The
PodSet has 64 pods with `sliceSize=32` at the block level and one additional
slice layer with `sliceSize=16` at the rack level.

| Phase | Level | Action |
|-------|-------|--------|
| 1 | block | Each block has capacity 64 pods. `sliceState` = 64/32 = 2 slices per block. Select the block with the best fit - one block hosts all 64 pods (2 slices). |
| 2 | rack | `sliceSizeAtLevel` gives 16 at the rack level. Recompute child `sliceState` = 32/16 = 2. Distribute 64 pods across 2 racks in groups of 16: each rack gets 32 pods (2 × 16). |
| 3 | host | No further slice layer, so `sliceSizeAtLevel` = 1. Distribute 32 pods per rack across 4 hosts individually: each host gets 8 pods. |

14321576
In consideration of a [Story 6](#story-6) a cross-podset topology aware scheduling

keps/2724-topology-aware-scheduling/kep.yaml

Lines changed: 2 additions & 1 deletion
@@ -46,8 +46,9 @@ feature-gates:
  - name: TASReplaceNodeOnPodTermination
  - name: TASBalancedPlacement
  - name: TASReplaceNodeOnNodeTaints
  - name: TASMultiLayerTopology
disable-supported: true

# The following PRR answers are required at beta release
#metrics:
# - my_feature_metric
