Conversation

@Ronkahn21 (Contributor) commented Jan 12, 2026

PR Description

What type of PR is this?
/kind feature
/kind testing

What this PR does / why we need it:
Adds comprehensive end-to-end tests for Topology-Aware Scheduling (TAS) functionality in Grove. The tests validate that Grove's translation mechanism correctly converts user-defined topology constraints (pack domains) to the KAI scheduler format, and that pods are placed according to topology constraints across all hierarchy levels (zone, block, rack, host).

Which issue(s) this PR fixes:
Fixes #305

Special notes for your reviewer:

  • This PR covers the translation part and topology infrastructure validation (ClusterTopology + KAI Topology CRs)
  • Webhook validation tests will be added in a separate PR once webhook implementation is complete
  • All tests verify end-to-end pod placement behavior, not intermediate PodGang CR state
  • Tests use a 28-node shared cluster to provide sufficient topology diversity
  • 9 out of 10 test scenarios are implemented; EC-2 (infrastructure failure) remains

Does this PR introduce an API change?
No

Release note:
Add comprehensive E2E tests for Topology-Aware Scheduling (TAS) covering translation of topology constraints, infrastructure validation, placement verification, scaling scenarios, and failure handling.

Additional documentation:
E2E test coverage includes:

  • Infrastructure validation (ClusterTopology + KAI Topology CRs)
  • Full hierarchy constraints (PCS→PCSG→PodClique levels)
  • Independent clique placement with different constraints
  • Multi-replica and scaling scenarios
  • Failure scenarios (insufficient capacity)

Test execution: make e2e-test TEST_PATTERN="^Test_TAS"
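
For orientation, here is a minimal sketch (in Go) of the translation idea these tests exercise: a user-declared pack domain at each hierarchy level maps to a scheduler-facing pack constraint. Only TopologyConstraint, PackDomain, and the level names come from this PR; the struct shapes and the translate function are illustrative assumptions, not the actual Grove or KAI types.

```go
package main

import "fmt"

// TopologyConstraint mirrors the user-facing field referenced in this PR: a
// required PackDomain naming the topology level to pack pods into.
type TopologyConstraint struct {
	PackDomain string // e.g. "zone", "block", "rack", "host"
}

// kaiPackConstraint stands in for the KAI-scheduler-side representation.
type kaiPackConstraint struct {
	TopologyLevel string
}

// translate converts a Grove-style pack constraint to the KAI-style one.
func translate(tc TopologyConstraint) kaiPackConstraint {
	return kaiPackConstraint{TopologyLevel: tc.PackDomain}
}

func main() {
	// Cascading hierarchy exercised by the tests: PCS → PCSG → PodClique.
	for level, tc := range map[string]TopologyConstraint{
		"PCS":       {PackDomain: "block"},
		"PCSG":      {PackDomain: "rack"},
		"PodClique": {PackDomain: "host"},
	} {
		fmt.Printf("%s: pack within one %s\n", level, translate(tc).TopologyLevel)
	}
}
```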

@Ronkahn21 Ronkahn21 marked this pull request as ready for review January 13, 2026 11:30
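
// Diff context: fields assembled for a PodGang's info; the PCS-level pack
// constraint is built via createTopologyPackConstraint.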
fqn: podGangName,
pclqs: pcsgPodCliqueInfos,
pcsgTopologyConstraints: pcsgTopologyConstraints,
topologyConstraint: createTopologyPackConstraint(sc, apicommonconstants.KindPodCliqueSet, client.ObjectKeyFromObject(sc.pcs), sc.pcs.Spec.Template.TopologyConstraint),
A collaborator commented on this hunk:
Thanks for fixing this but can we make a separate PR for this? cc @unmarshall

unmarshall and others added 23 commits January 18, 2026 10:09
* Introduced a new condition TopologyLevelsUnavailable in PCS status.
* Added reconciliation code to update the PCS status condition.
* Added missing cluster role to delete KAI topology CR.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
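
A minimal sketch of maintaining that condition with apimachinery's helpers, assuming the PCS status exposes a standard Conditions slice; the reason and message strings here are illustrative, not the operator's actual values.

```go
package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setTopologyLevelsUnavailable flips the TopologyLevelsUnavailable condition
// on a conditions slice; SetStatusCondition handles transition timestamps.
func setTopologyLevelsUnavailable(conds *[]metav1.Condition, unavailable bool, msg string) {
	cond := metav1.Condition{
		Type:    "TopologyLevelsUnavailable",
		Status:  metav1.ConditionFalse,
		Reason:  "TopologyLevelsAvailable",
		Message: msg,
	}
	if unavailable {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "TopologyLevelsUnavailable"
	}
	meta.SetStatusCondition(conds, cond)
}
```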
* Added code to update or remove the condition on PCS.
* Created a utility function for cluster topology, with a unit test.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Upgraded the KAI scheduler version dependency for e2e tests to v0.12.0.
* Changed the polling timeout for e2e tests to 2 minutes.
* Removed installation of the NVIDIA GPU operator as it's not required.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved `synchronizeTopology` in main to clustertopology package.
* Adjusted unit tests for clustertopology.go
* Removed the previously added delete cluster role for KAI Topology
  resource.
* Removed the code to set up the NVIDIA GPU Operator in e2e tests as
  it's not required.
* Increased the poll timeout to 4 minutes.
* Set restartPolicy to Always for the Grove operator deployment.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
  will be set later after requirements are clear.
* Added unit tests for computeExpectedPodGangs function.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved GetClusterTopologyLevels to clustertopology package.
* Added docstring for buildClusterTopology function.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Renamed ClusterTopologyConfiguration to
  TopologyAwareSchedulingConfiguration in operator config.
* Introduced a new condition TopologyLevelsUnavailable on PCS.
* PackDomain field in corev1alpha1 TopologyConstraint is now required.
* When creating ClusterTopology, if the host topology level is not
  defined in TopologyAwareSchedulingConfiguration, the operator sets it
  in ClusterTopology, as this is a required level.
* Adapted PodGang component to set pack constraints at all hierarchy levels.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
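
The host-level defaulting described above, sketched as a pure function over an ordered list of level names; the real ClusterTopology types and the helper name are assumptions for illustration.

```go
package clustertopology

// ensureHostLevel returns the configured topology levels with the required
// "host" level appended if the operator configuration omitted it.
func ensureHostLevel(levels []string) []string {
	for _, l := range levels {
		if l == "host" {
			return levels
		}
	}
	return append(levels, "host")
}
```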
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add topology-aware scheduling tests for multi-clique constraints:
- BP-1: Multiple cliques with different topology constraints
- SP-1: Full 3-level hierarchy with cascading constraints

Changes:
- Add workload7.yaml for BP-1 (rack+block constraints)
- Add workload8.yaml for SP-1 (block->rack->host cascade)
- Implement Test_BP1_MultipleCliquesWithDifferentConstraints
- Implement Test_SP1_FullHierarchyWithCascadingConstraints
- Add helper functions for pod labeling and topology verification
- Enable topology-test profile in e2e cluster setup
- Fix pod label selectors to use correct Grove labels

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
The test now correctly verifies:
- Each PCLQ's 2 pods on the same host (4 cliques total)
- Each PCSG replica's 4 pods in the same rack (2 replicas)
- All 8 pods in the same block (PCS constraint)

This properly tests the cascading constraint hierarchy, where each child
constraint is stricter than its parent (host is stricter than rack,
which is stricter than block).
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
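
A sketch of the placement checks these tests rely on, written with controller-runtime: list a group of pods, then assert their nodes share one value for the topology label of the level under test. The helper name and the label key passed in are illustrative; only the check itself reflects the verification described above.

```go
package e2e

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// verifyPackedAtLevel asserts that every pod matching sel runs on a node
// sharing one value for levelLabel (e.g. all pods in the same rack).
func verifyPackedAtLevel(ctx context.Context, c client.Client, ns string, sel client.MatchingLabels, levelLabel string) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns), sel); err != nil {
		return err
	}
	domain := ""
	for _, p := range pods.Items {
		var node corev1.Node
		if err := c.Get(ctx, client.ObjectKey{Name: p.Spec.NodeName}, &node); err != nil {
			return err
		}
		v := node.Labels[levelLabel]
		if domain == "" {
			domain = v
		} else if v != domain {
			return fmt.Errorf("pod %s is in %s %q, expected %q", p.Name, levelLabel, v, domain)
		}
	}
	return nil
}
```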
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…l, and decode

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… management

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ogy constraints

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… optional test pattern

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ndencies

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…roup name in topology.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…specifications for disaggregated inference and host-level packing

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…pdate Makefile usage

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…logy tests

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…proach

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ment

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ment

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Fix scaled PodGang PCS topology constraints

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Scaled PodGangs (PCSG replicas above MinAvailable) were missing
PCSG-level topology constraints in TopologyConstraintGroupConfigs.
This caused pods in scaled PCSG replicas to be scheduled without
proper topology grouping constraints.

Changes:
- Collect pclqFQNs while building scaled PodGang
- Create TopologyConstraintGroupConfig for PCSG when TAS enabled
- Set pcsgTopologyConstraints in podGangInfo for scaled PodGangs
- Now mirrors the pattern used in base PodGang creation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
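
A condensed sketch of the fix, reusing names from the commit message (pclqFQNs, TopologyConstraintGroupConfig); the type shape and the helper are simplified assumptions, not the actual Grove code.

```go
package podgang

// topologyConstraintGroupConfig is a simplified stand-in for the PodGang
// field that groups cliques under one pack constraint.
type topologyConstraintGroupConfig struct {
	PackDomain string   // topology level to pack within, e.g. "rack"
	PodCliques []string // fully-qualified clique names in the group
}

// pcsgConstraintGroup mirrors the base-PodGang pattern for scaled replicas:
// when TAS is enabled, the clique FQNs collected while building the scaled
// PodGang are grouped under the PCSG-level pack constraint.
func pcsgConstraintGroup(tasEnabled bool, pcsgPackDomain string, pclqFQNs []string) []topologyConstraintGroupConfig {
	if !tasEnabled || len(pclqFQNs) == 0 {
		return nil
	}
	return []topologyConstraintGroupConfig{{
		PackDomain: pcsgPackDomain,
		PodCliques: pclqFQNs,
	}}
}
```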
Add comprehensive E2E test validating topology constraints across:
- 2 PCS replicas creating 6 PodGangs (2 base + 4 scaled)
- 3-level topology hierarchy: PCS (block) → PCSG (rack) → PCLQ (host)
- 20 pods total with proper topology constraint enforcement

This test ensures the topology constraint fix for scaled PodGangs
works correctly across multiple PCS replicas.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
SP5 and SP8 tests were using the non-existent label 'grove.io/podcliquescalinggroupreplica'.
Fixed them to use the two-label filtering approach from SP4:
- grove.io/podcliquescalinggroup (identifies PCSG)
- grove.io/podcliquescalinggroup-replica-index (identifies replica)

Both tests now pass successfully.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
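
The two-label filter from this fix, sketched with controller-runtime; the label keys are the ones named in the commit, while the surrounding helper is illustrative.

```go
package e2e

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// podsOfPCSGReplica lists the pods of one PCSG replica. No single label
// identifies a replica, so both labels are combined.
func podsOfPCSGReplica(ctx context.Context, c client.Client, ns, pcsg, replicaIdx string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := c.List(ctx, pods, client.InNamespace(ns), client.MatchingLabels{
		"grove.io/podcliquescalinggroup":               pcsg,
		"grove.io/podcliquescalinggroup-replica-index": replicaIdx,
	})
	return pods, err
}
```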
SP7 test duplicates BP1 (multiple PCLQs with different constraints).
The combination of PCSG + PCLQ is already validated by SP1 and SP5.
Reduces test count from 17 to 16 with zero coverage loss.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Updated test expectations in TestComputeExpectedPodGangsWithTopologyConstraints
to correctly validate the 3-level constraint hierarchy for scaled PodGangs.

Changes:
- Fixed topologyLevel from rack to zone (PCS-level constraint)
- Added missing pcsgConstraints field for PCSG-level constraints

Both base and scaled PodGangs have a unified 3-level structure with
PCS-level at top, PCLQ-level for PodGroups, and PCSG-level in
TopologyConstraintGroupConfigs.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…tion

Signed-off-by: Ron Kahn <rkahn@nvidia.com>