- [Story 5](#story-5)
- [Story 6](#story-6)
- [Story 7](#story-7)
- [Story 8](#story-8)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Integration support](#integration-support)
- [Job](#job)
- [Job with multi-level scheduling](#job-with-multi-level-scheduling)
- [JobSet](#jobset)
- [LeaderWorkerSet](#leaderworkerset)
- [MPIJob with runLauncherAsWorker](#mpijob-with-runlauncherasworker)
- [Since v0.15](#since-v015)
- [Two-level Topology Aware scheduling](#two-level-topology-aware-scheduling)
- [Example](#example-1)
- [Multi-level Topology Aware scheduling](#multi-level-topology-aware-scheduling)
- [Example](#example-2)
- [Cross-PodSet Topology Aware scheduling](#cross-podset-topology-aware-scheduling)
- [Ensure leader and workers end up on the same flavor](#ensure-leader-and-workers-end-up-on-the-same-flavor)
- [Enforcing the assignment](#enforcing-the-assignment)
- [Support for Elastic Workloads](#support-for-elastic-workloads)
- [Balanced placement](#balanced-placement)
- [Example](#example-3)
- [Support for ProvisioningRequests](#support-for-provisioningrequests)
- [Determining the need for second pass](#determining-the-need-for-second-pass)
- [Targeting the newly provisioned nodes](#targeting-the-newly-provisioned-nodes)
- [Rename the topologyAssignment.domains.values field as levelValues](#rename-the-topologyassignmentdomainsvalues-field-as-levelvalues)
- [Drop dedicated TAS label](#drop-dedicated-tas-label)
- [MostFreeCapacity algorithm](#mostfreecapacity-algorithm)
- [Example](#example-4)
- [TopologyAssignmentSlices as separate CRD instances](#topologyassignmentslices-as-separate-crd-instances)
<!-- /toc -->
@@ -203,6 +207,13 @@ Similar to [Story 1](#story-1), but I want Leader and its Workers across multipl
for MPIJob with runLauncherAsWorker (`.spec.runLauncherAsWorker`) which should be scheduled considering
Pod index order.

#### Story 8

Similar to [Story 1](#story-1), but I want a Job's Pods to be placed across a
multi-layer topology based on user-specified constraints. For example, I want
the Job to be scheduled within a single data center, with its Pods grouped in
multiples of 64 on the same "block" and in multiples of 16 on the same "rack".

### Notes/Constraints/Caveats (Optional)

#### Integration support
@@ -242,6 +253,42 @@ spec:
In this example we indicate that all Pods created by the Job should be contained
within the same "rack".

##### Job with multi-level scheduling

According to [Story 8](#story-8), some users would like to place Pods of a Job
across multi-layer topologies.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  namespace: tas-example-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 128
  completions: 128
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-required-topology: cloud.provider.com/datacenter
        kueue.x-k8s.io/podset-slice-size: "64"
        kueue.x-k8s.io/podset-slice-required-topology: cloud.provider.com/aizone
        kueue.x-k8s.io/podset-slice-size-1: "32"
        kueue.x-k8s.io/podset-slice-required-topology-1: cloud.provider.com/block
        kueue.x-k8s.io/podset-slice-size-2: "16"
        kueue.x-k8s.io/podset-slice-required-topology-2: cloud.provider.com/rack
    spec:
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
```

This example ensures the Job is placed within a single data center, with its Pods
evenly grouped at each layer, while letting the user tune the "multiple-of" group
size per layer. This setup provides nested-gang ("gang-of-gang-of-gang") semantics
for complex NVIDIA GB200/GB300 topologies.

##### JobSet

One complication we noticed for JobSet is that the proposed design assumes
@@ -780,6 +827,26 @@ const (
	// This annotation is required if `kueue.x-k8s.io/podset-slice-required-topology`
	// is defined
	PodSetSliceSizeAnnotation = "kueue.x-k8s.io/podset-slice-size"

	// PodSetSliceRequiredTopologyAnnotation1 indicates the topology level required
	// by the second slice layer (the first additional layer). Each group from the
	// parent slice layer is further subdivided into groups of size
	// PodSetSliceSizeAnnotation1, each constrained to the indicated topology domain.
	PodSetSliceRequiredTopologyAnnotation1 = "kueue.x-k8s.io/podset-slice-required-topology-1"

	// PodSetSliceSizeAnnotation1 describes the requested size of the second
	// slice layer (the first additional layer).
	PodSetSliceSizeAnnotation1 = "kueue.x-k8s.io/podset-slice-size-1"

	// PodSetSliceRequiredTopologyAnnotation2 indicates the topology level required
	// by the third slice layer (the second additional layer). Each group from the
	// parent slice layer is further subdivided into groups of size
	// PodSetSliceSizeAnnotation2, each constrained to the indicated topology domain.
	PodSetSliceRequiredTopologyAnnotation2 = "kueue.x-k8s.io/podset-slice-required-topology-2"

	// PodSetSliceSizeAnnotation2 describes the requested size of the third
	// slice layer (the second additional layer).
	PodSetSliceSizeAnnotation2 = "kueue.x-k8s.io/podset-slice-size-2"
)
```
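To make the layered annotations concrete, the following is a minimal sketch of how an integration could collect them into an ordered list of (topology, size) pairs. The helper, its types, and the contiguity check are illustrative assumptions rather than Kueue's actual parsing code; base-layer size defaulting (e.g. for JobSet) is ignored.

```go
package main

import (
	"fmt"
	"strconv"
)

// sliceLayer is a hypothetical in-memory representation of one slice layer.
type sliceLayer struct {
	Topology string
	Size     int32
}

// sliceLayersFromAnnotations collects the base slice layer and the numbered
// additional layers ("-1", "-2") into an ordered list. The contiguity rule
// (no "-2" without "-1") is an assumption for this sketch.
func sliceLayersFromAnnotations(ann map[string]string) ([]sliceLayer, error) {
	keys := []struct{ topo, size string }{
		{"kueue.x-k8s.io/podset-slice-required-topology", "kueue.x-k8s.io/podset-slice-size"},
		{"kueue.x-k8s.io/podset-slice-required-topology-1", "kueue.x-k8s.io/podset-slice-size-1"},
		{"kueue.x-k8s.io/podset-slice-required-topology-2", "kueue.x-k8s.io/podset-slice-size-2"},
	}
	var layers []sliceLayer
	for _, k := range keys {
		topo, ok := ann[k.topo]
		if !ok {
			break // stop at the first missing layer
		}
		sizeStr, ok := ann[k.size]
		if !ok {
			return nil, fmt.Errorf("%s requires %s to be set", k.topo, k.size)
		}
		size, err := strconv.ParseInt(sizeStr, 10, 32)
		if err != nil || size < 1 {
			return nil, fmt.Errorf("invalid value %q for %s", sizeStr, k.size)
		}
		layers = append(layers, sliceLayer{Topology: topo, Size: int32(size)})
	}
	return layers, nil
}

func main() {
	layers, err := sliceLayersFromAnnotations(map[string]string{
		"kueue.x-k8s.io/podset-slice-required-topology":   "cloud.provider.com/aizone",
		"kueue.x-k8s.io/podset-slice-size":                "64",
		"kueue.x-k8s.io/podset-slice-required-topology-1": "cloud.provider.com/block",
		"kueue.x-k8s.io/podset-slice-size-1":              "32",
	})
	fmt.Println(layers, err)
}
```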
@@ -811,6 +878,12 @@ the rules is deactivated):
  specified its own default. See [Slice size validation](#slice-size-validation))
- The value of `kueue.x-k8s.io/podset-slice-size` has to be a numeric value greater
  than or equal to 1. It has to evenly divide the size of a PodSet.
- The two `podset-slice-*` rules above also apply to the additional slice layers
  (`kueue.x-k8s.io/podset-slice-required-topology-[X]` and `kueue.x-k8s.io/podset-slice-size-[X]`),
  where `[X]` can be `1` or `2`.
- The value of `kueue.x-k8s.io/podset-slice-size-[X]` must evenly divide the
  slice size of the parent layer (i.e., `podset-slice-size` for the first
  additional layer, or the preceding `podset-slice-size-[X-1]` for subsequent
  layers); see the sketch after this list.
- Additional slice layers must be specified in order of increasing topology depth.
- If `kueue.x-k8s.io/podset-group-name` is specified, the `kueue.x-k8s.io/podset-required-topology`
  or `kueue.x-k8s.io/podset-preferred-topology` has to also be specified in all other
  PodTemplates included in the PodSet Group and it has to have the same value.
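The divisibility rules above can be illustrated with a small sketch; the helper below is a hypothetical standalone check, not Kueue's actual webhook code.

```go
package main

import "fmt"

// validateSliceLayerSizes checks that each slice-layer size evenly divides its
// parent: the PodSet size for the base layer, then the preceding layer's size.
// The helper itself is hypothetical and only mirrors the rules listed above.
func validateSliceLayerSizes(podSetSize int32, layerSizes []int32) error {
	parent := podSetSize
	for i, size := range layerSizes {
		if size < 1 {
			return fmt.Errorf("slice layer %d: size must be >= 1", i)
		}
		if parent%size != 0 {
			return fmt.Errorf("slice layer %d: size %d does not evenly divide the parent size %d", i, size, parent)
		}
		parent = size
	}
	return nil
}

func main() {
	// 128-pod PodSet with slice sizes 64 -> 32 -> 16: valid.
	fmt.Println(validateSliceLayerSizes(128, []int32{64, 32, 16}))
	// 48 does not evenly divide 64, so this combination would be rejected.
	fmt.Println(validateSliceLayerSizes(128, []int32{64, 48}))
}
```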
@@ -826,6 +899,10 @@ However, in case of the JobSet we expect that the most frequent use-case will be
define PodSet Slice as a single Job, thus if `kueue.x-k8s.io/podset-slice-size`
is not defined for JobSet it defaults to `parallelism`.

For each additional slice layer `kueue.x-k8s.io/podset-slice-required-topology-[X]`,
the corresponding `kueue.x-k8s.io/podset-slice-size-[X]` is required. No defaulting
logic is applied to additional layers, even for JobSet.

### Internal APIs

We extend the `Workload` structure to reflect the topology request at the
@@ -896,6 +973,33 @@ type PodSetTopologyRequest struct {
	//
	// +optional
	PodSetSliceSize *int32 `json:"podSetSliceSize,omitempty"`

	// additionalSliceLayers defines additional layers of recursive slice
	// subdivision beyond the first slice layer (podSetSliceRequiredTopology /
	// podSetSliceSize). Each layer further subdivides the parent layer's
	// groups into smaller groups constrained to a finer topology domain.
	// At most 2 additional layers are supported (for a total of 3 slice layers).
	//
	// +optional
	// +listType=atomic
	// +kubebuilder:validation:MaxItems=2
	AdditionalSliceLayers []SliceLayer `json:"additionalSliceLayers,omitempty"`
}

// SliceLayer defines a single additional slice subdivision layer.
type SliceLayer struct {
	// topology indicates the topology level required for this slice layer.
	//
	// +required
	// +kubebuilder:validation:MinLength=1
	// +kubebuilder:validation:MaxLength=63
	Topology string `json:"topology"`

	// size indicates the number of pods in each group at this slice layer.
	//
	// +required
	// +kubebuilder:validation:Minimum=1
	Size int32 `json:"size"`
}
```
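For illustration, one possible rendering of the earlier multi-level Job example as a Workload PodSet topology request is sketched below. The `required`, `podSetSliceRequiredTopology`, and `podSetSliceSize` fields follow the API already described in this KEP; the `additionalSliceLayers` shape is an assumption based on the proposed types above.

```yaml
# Hypothetical Workload excerpt for the 128-pod multi-level Job example.
spec:
  podSets:
  - name: main
    count: 128
    topologyRequest:
      required: cloud.provider.com/datacenter
      podSetSliceRequiredTopology: cloud.provider.com/aizone
      podSetSliceSize: 64
      additionalSliceLayers:
      - topology: cloud.provider.com/block
        size: 32
      - topology: cloud.provider.com/rack
        size: 16
```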
@@ -1428,6 +1532,46 @@ Explanation:

It is worth noting that the tight fit mentioned above does not guarantee that no free capacity will be left within the assigned domains.

### Multi-level Topology Aware scheduling

> [!NOTE]
> For the alpha implementation, the multi-level code path and API surface are
> kept separate from two-level scheduling. In a future iteration, we plan to
> unify them behind a single internal data structure.

In consideration of [Story 8](#story-8), multi-level scheduling extends two-level
scheduling to support more than two levels of topology constraints (e.g.,
datacenter → block → rack → host). Up to 2 additional slice layers
(`AdditionalSliceLayers`) can be specified beyond the first slice layer, for a
total of 3 slice layers. Each additional layer must reference a topology level
strictly lower (deeper) than the previous layer, and its size must evenly divide
the parent layer's size. This feature is gated by `TASMultiLayerTopology`
(alpha, disabled by default since v0.17).

In two-level scheduling, below the outermost slice level the algorithm
distributes individual pods (unit size = 1). With multi-level scheduling, a
`sliceSizeAtLevel` map records the required group size at each intermediate
level. During the downward traversal, the algorithm looks up this map to
distribute pods in correctly sized groups rather than individually. Before
assigning groups at a given inner level, the algorithm recomputes `sliceState`
on the child domains as `state / innerSliceSize`, since the `sliceState`
populated during phase 1 reflects only the outermost slice size. The same
sorting and selection logic (BestFit / LeastFreeCapacity) is applied at each
level.
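The downward traversal can be sketched as follows, under simplified assumptions: a plain capacity tree, no BestFit/LeastFreeCapacity sorting, no feasibility backtracking, and a `sliceSizeAtLevel` map keyed by level index. The types and function are illustrative, not the actual Kueue scheduler code.

```go
package main

import "fmt"

// domain is a node in the topology tree with a pod capacity.
type domain struct {
	name     string
	capacity int32
	children []*domain
}

// distribute places count pods below d. Pods are handed to the child level in
// groups of sliceSizeAtLevel[childLevel]; a missing entry means individual pods.
func distribute(d *domain, count int32, level int, sliceSizeAtLevel map[int]int32, out map[string]int32) {
	if len(d.children) == 0 {
		out[d.name] += count
		return
	}
	groupSize, ok := sliceSizeAtLevel[level+1]
	if !ok {
		groupSize = 1 // below the innermost slice layer, distribute single pods
	}
	groups := count / groupSize
	for _, child := range d.children {
		if groups == 0 {
			break
		}
		// Recompute the child's slice state against the inner group size,
		// since phase 1 only counted slices of the outermost size.
		fit := child.capacity / groupSize
		if fit > groups {
			fit = groups
		}
		if fit == 0 {
			continue
		}
		distribute(child, fit*groupSize, level+1, sliceSizeAtLevel, out)
		groups -= fit
	}
}

func main() {
	hosts := func(rack string) []*domain {
		var hs []*domain
		for i := 0; i < 4; i++ {
			hs = append(hs, &domain{name: fmt.Sprintf("%s-host%d", rack, i), capacity: 8})
		}
		return hs
	}
	// The block from the example below: 2 racks x 4 hosts x 8 pods = 64 pods.
	block := &domain{name: "block0", capacity: 64, children: []*domain{
		{name: "rack0", capacity: 32, children: hosts("rack0")},
		{name: "rack1", capacity: 32, children: hosts("rack1")},
	}}
	out := map[string]int32{}
	// Phase 1 already picked this block for the 32-pod outer slices; the inner
	// layer requires groups of 16 at the rack level (child level 1).
	distribute(block, 64, 0, map[int]int32{1: 16}, out)
	fmt.Println(out) // each of the 8 hosts ends up with 8 pods
}
```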

#### Example

Consider a topology with 3 levels: block → rack → host. A block contains 2
racks, and each rack contains 4 hosts. Each host can accommodate 8 pods. The
PodSet has 64 pods with `sliceSize=32` at the block level and one additional
slice layer with `sliceSize=16` at the rack level.

| Phase | Level | Action |
|-------|-------|--------|
| 1 | block | Each block has capacity for 64 pods. `sliceState` = 64/32 = 2 slices per block. Select the best-fitting block; one block hosts all 64 pods (2 slices). |
| 2 | rack | `sliceSizeAtLevel` gives 16 at the rack level. Recompute the child `sliceState` = 32/16 = 2. Distribute 64 pods across 2 racks in groups of 16: each rack gets 32 pods (2 × 16). |
| 3 | host | No further slice layer, so `sliceSizeAtLevel` = 1. Distribute the 32 pods per rack across 4 hosts individually: each host gets 8 pods. |

### Cross-PodSet Topology Aware scheduling

In consideration of a [Story 6](#story-6) a cross-podset topology aware scheduling