Conversation

@Ronkahn21 (Contributor) commented Jan 12, 2026

PR Description

What type of PR is this?
/kind feature
/kind testing

What this PR does / why we need it:
Adds comprehensive end-to-end tests for Topology-Aware Scheduling (TAS) functionality in Grove. The tests validate that Grove's translation mechanism correctly converts user-defined topology constraints (pack domains) to the KAI scheduler format, and that pods are placed according to topology constraints across all hierarchy levels (zone, block, rack, host).

Which issue(s) this PR fixes:
Fixes #305

Special notes for your reviewer:

  • This PR covers the translation part and topology infrastructure validation (ClusterTopology + KAI Topology CRs)
  • Webhook validation tests will be added in a separate PR once webhook implementation is complete
  • All tests verify end-to-end pod placement behavior, not intermediate PodGang CR state
  • Tests use a 28-node shared cluster to provide sufficient topology diversity
  • 9 out of 10 test scenarios are implemented; EC-2 (infrastructure failure) remains

Does this PR introduce an API change?
No

Release note:
Add comprehensive E2E tests for Topology-Aware Scheduling (TAS) covering translation of topology constraints, infrastructure validation, placement verification, scaling scenarios, and failure handling.

Additional documentation:
E2E test coverage includes:

  • Infrastructure validation (ClusterTopology + KAI Topology CRs)
  • Full hierarchy constraints (PCS→PCSG→PodClique levels)
  • Independent clique placement with different constraints
  • Multi-replica and scaling scenarios
  • Failure scenarios (insufficient capacity)

Test execution: make e2e-test TEST_PATTERN="^Test_TAS"
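
For orientation, here is a minimal sketch (in Go) of the translation idea these tests exercise: a user-declared pack domain at each hierarchy level maps to a scheduler-facing pack constraint. Only TopologyConstraint, PackDomain, and the level names come from this PR; the struct shapes and the translate function are illustrative assumptions, not the actual Grove or KAI types.

```go
package main

import "fmt"

// TopologyConstraint mirrors the user-facing field referenced in this PR: a
// required PackDomain naming the topology level to pack pods into.
type TopologyConstraint struct {
	PackDomain string // e.g. "zone", "block", "rack", "host"
}

// kaiPackConstraint stands in for the KAI-scheduler-side representation.
type kaiPackConstraint struct {
	TopologyLevel string
}

// translate converts a Grove-style pack constraint to the KAI-style one.
func translate(tc TopologyConstraint) kaiPackConstraint {
	return kaiPackConstraint{TopologyLevel: tc.PackDomain}
}

func main() {
	// Cascading hierarchy exercised by the tests: PCS → PCSG → PodClique.
	for level, tc := range map[string]TopologyConstraint{
		"PCS":       {PackDomain: "block"},
		"PCSG":      {PackDomain: "rack"},
		"PodClique": {PackDomain: "host"},
	} {
		fmt.Printf("%s: pack within one %s\n", level, translate(tc).TopologyLevel)
	}
}
```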

@Ronkahn21 Ronkahn21 marked this pull request as ready for review January 13, 2026 11:30
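
// Diff context: fields assembled for a PodGang's info; the PCS-level pack
// constraint is built via createTopologyPackConstraint.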
fqn: podGangName,
pclqs: pcsgPodCliqueInfos,
pcsgTopologyConstraints: pcsgTopologyConstraints,
topologyConstraint: createTopologyPackConstraint(sc, apicommonconstants.KindPodCliqueSet, client.ObjectKeyFromObject(sc.pcs), sc.pcs.Spec.Template.TopologyConstraint),
A collaborator commented on this hunk:
Thanks for fixing this but can we make a separate PR for this? cc @unmarshall

unmarshall and others added 23 commits January 18, 2026 10:09
* Introduced a new condition TopologyLevelsUnavailable in PCS status.
* Added reconciliation code to update the PCS status condition.
* Added missing cluster role to delete KAI topology CR.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
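
A minimal sketch of maintaining that condition with apimachinery's helpers, assuming the PCS status exposes a standard Conditions slice; the reason and message strings here are illustrative, not the operator's actual values.

```go
package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setTopologyLevelsUnavailable flips the TopologyLevelsUnavailable condition
// on a conditions slice; SetStatusCondition handles transition timestamps.
func setTopologyLevelsUnavailable(conds *[]metav1.Condition, unavailable bool, msg string) {
	cond := metav1.Condition{
		Type:    "TopologyLevelsUnavailable",
		Status:  metav1.ConditionFalse,
		Reason:  "TopologyLevelsAvailable",
		Message: msg,
	}
	if unavailable {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "TopologyLevelsUnavailable"
	}
	meta.SetStatusCondition(conds, cond)
}
```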
* Added code to update or remove the condition on PCS.
* Created a utility function for cluster topology, with a unit test.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Upgraded the KAI scheduler version dependency for e2e tests to v0.12.0.
* Changed the polling timeout for e2e tests to 2 minutes.
* Removed installation of the NVIDIA GPU operator as it's not required.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved `synchronizeTopology` in main to clustertopology package.
* Adjusted unit tests for clustertopology.go
* Removed the previously added delete cluster role for KAI Topology
  resource.
* Removed the code to set up the NVIDIA GPU Operator in e2e tests as
  it's not required.
* Increased the poll timeout to 4 minutes.
* Set restartPolicy to Always for the Grove operator deployment.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
  will be set later after requirements are clear.
* Added unit tests for computeExpectedPodGangs function.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved GetClusterTopologyLevels to clustertopology package.
* Added docstring for buildClusterTopology function.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Renamed ClusterTopologyConfiguration to
  TopologyAwareSchedulingConfiguration in operator config.
* Introduced a new condition TopologyLevelsUnavailable on PCS.
* PackDomain field in corev1alpha1 TopologyConstraint is now required.
* When creating ClusterTopology, if the host topology level is not
  defined in TopologyAwareSchedulingConfiguration, the operator sets it
  in ClusterTopology, as this is a required level.
* Adapted PodGang component to set pack constraints at all hierarchy levels.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
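
The host-level defaulting described above, sketched as a pure function over an ordered list of level names; the real ClusterTopology types and the helper name are assumptions for illustration.

```go
package clustertopology

// ensureHostLevel returns the configured topology levels with the required
// "host" level appended if the operator configuration omitted it.
func ensureHostLevel(levels []string) []string {
	for _, l := range levels {
		if l == "host" {
			return levels
		}
	}
	return append(levels, "host")
}
```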
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add topology-aware scheduling tests for multi-clique constraints:
- BP-1: Multiple cliques with different topology constraints
- SP-1: Full 3-level hierarchy with cascading constraints

Changes:
- Add workload7.yaml for BP-1 (rack+block constraints)
- Add workload8.yaml for SP-1 (block->rack->host cascade)
- Implement Test_BP1_MultipleCliquesWithDifferentConstraints
- Implement Test_SP1_FullHierarchyWithCascadingConstraints
- Add helper functions for pod labeling and topology verification
- Enable topology-test profile in e2e cluster setup
- Fix pod label selectors to use correct Grove labels

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
The test now correctly verifies:
- Each PCLQ's 2 pods on the same host (4 cliques total)
- Each PCSG replica's 4 pods in the same rack (2 replicas)
- All 8 pods in the same block (PCS constraint)

This properly tests the cascading constraint hierarchy, where each child
constraint is stricter than its parent (host is stricter than rack,
which is stricter than block).
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
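
A sketch of the placement checks these tests rely on, written with controller-runtime: list a group of pods, then assert their nodes share one value for the topology label of the level under test. The helper name and the label key passed in are illustrative; only the check itself reflects the verification described above.

```go
package e2e

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// verifyPackedAtLevel asserts that every pod matching sel runs on a node
// sharing one value for levelLabel (e.g. all pods in the same rack).
func verifyPackedAtLevel(ctx context.Context, c client.Client, ns string, sel client.MatchingLabels, levelLabel string) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns), sel); err != nil {
		return err
	}
	domain := ""
	for _, p := range pods.Items {
		var node corev1.Node
		if err := c.Get(ctx, client.ObjectKey{Name: p.Spec.NodeName}, &node); err != nil {
			return err
		}
		v := node.Labels[levelLabel]
		if domain == "" {
			domain = v
		} else if v != domain {
			return fmt.Errorf("pod %s is in %s %q, expected %q", p.Name, levelLabel, v, domain)
		}
	}
	return nil
}
```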
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…l, and decode

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… management

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ogy constraints

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… optional test pattern

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ndencies

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…roup name in topology.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…specifications for disaggregated inference and host-level packing

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…pdate Makefile usage

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…logy tests

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…proach

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ment

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ment

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Fix scaled PodGang PCS topology constraints

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Scaled PodGangs (PCSG replicas above MinAvailable) were missing
PCSG-level topology constraints in TopologyConstraintGroupConfigs.
This caused pods in scaled PCSG replicas to be scheduled without
proper topology grouping constraints.

Changes:
- Collect pclqFQNs while building scaled PodGang
- Create TopologyConstraintGroupConfig for PCSG when TAS enabled
- Set pcsgTopologyConstraints in podGangInfo for scaled PodGangs
- Now mirrors the pattern used in base PodGang creation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
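
A condensed sketch of the fix, reusing names from the commit message (pclqFQNs, TopologyConstraintGroupConfig); the type shape and the helper are simplified assumptions, not the actual Grove code.

```go
package podgang

// topologyConstraintGroupConfig is a simplified stand-in for the PodGang
// field that groups cliques under one pack constraint.
type topologyConstraintGroupConfig struct {
	PackDomain string   // topology level to pack within, e.g. "rack"
	PodCliques []string // fully-qualified clique names in the group
}

// pcsgConstraintGroup mirrors the base-PodGang pattern for scaled replicas:
// when TAS is enabled, the clique FQNs collected while building the scaled
// PodGang are grouped under the PCSG-level pack constraint.
func pcsgConstraintGroup(tasEnabled bool, pcsgPackDomain string, pclqFQNs []string) []topologyConstraintGroupConfig {
	if !tasEnabled || len(pclqFQNs) == 0 {
		return nil
	}
	return []topologyConstraintGroupConfig{{
		PackDomain: pcsgPackDomain,
		PodCliques: pclqFQNs,
	}}
}
```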
Add comprehensive E2E test validating topology constraints across:
- 2 PCS replicas creating 6 PodGangs (2 base + 4 scaled)
- 3-level topology hierarchy: PCS (block) → PCSG (rack) → PCLQ (host)
- 20 pods total with proper topology constraint enforcement

This test ensures the topology constraint fix for scaled PodGangs
works correctly across multiple PCS replicas.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
SP5 and SP8 tests were using the non-existent label 'grove.io/podcliquescalinggroupreplica'.
Fixed them to use the two-label filtering approach from SP4:
- grove.io/podcliquescalinggroup (identifies PCSG)
- grove.io/podcliquescalinggroup-replica-index (identifies replica)

Both tests now pass successfully.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
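
The two-label filter from this fix, sketched with controller-runtime; the label keys are the ones named in the commit, while the surrounding helper is illustrative.

```go
package e2e

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// podsOfPCSGReplica lists the pods of one PCSG replica. No single label
// identifies a replica, so both labels are combined.
func podsOfPCSGReplica(ctx context.Context, c client.Client, ns, pcsg, replicaIdx string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := c.List(ctx, pods, client.InNamespace(ns), client.MatchingLabels{
		"grove.io/podcliquescalinggroup":               pcsg,
		"grove.io/podcliquescalinggroup-replica-index": replicaIdx,
	})
	return pods, err
}
```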
SP7 test duplicates BP1 (multiple PCLQs with different constraints).
The combination of PCSG + PCLQ is already validated by SP1 and SP5.
Reduces test count from 17 to 16 with zero coverage loss.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Updated test expectations in TestComputeExpectedPodGangsWithTopologyConstraints
to correctly validate the 3-level constraint hierarchy for scaled PodGangs.

Changes:
- Fixed topologyLevel from rack to zone (PCS-level constraint)
- Added missing pcsgConstraints field for PCSG-level constraints

Both base and scaled PodGangs have a unified 3-level structure with
PCS-level at top, PCLQ-level for PodGroups, and PCSG-level in
TopologyConstraintGroupConfigs.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…tion

Signed-off-by: Ron Kahn <rkahn@nvidia.com>