Skip to content

Conversation

@Ronkahn21
Copy link
Contributor

@Ronkahn21 Ronkahn21 commented Jan 18, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds 5 test scenarios for simple topology constraint patterns in Topology Aware Scheduling (TAS).

Tests Added:

  • SL1_PCSOnlyConstraint: Tests PCS-level constraint inherited by child PCSG and PCLQ resources
  • SL2_PCSGOnlyConstraint: Tests constraint at PCSG level only (no parent PCS constraint)
  • SL3_NoTopologyConstraint: Baseline test with no topology constraints applied
  • PC1_HostLevelConstraint: Tests strictest host-level constraint (all pods on same host)
  • ZL1_ZoneLevelConstraint: Tests zone-level topology constraint

Test Coverage:

  • Constraint behavior at different resource levels (PCS, PCSG, PCLQ)
  • Different topology domains (zone, rack, host, none)
  • Constraint inheritance and override patterns
  • KAI PodGroup structure verification with SubGroups

This PR is part 2 of 4 in the TAS e2e test suite.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Dependencies:

Test Verification:

  • All tests compile successfully with -tags e2e
  • Linter passes with 0 issues
  • Added 398 lines of test code across 5 test functions
  • Added 5 YAML test scenario files

What's Next:

File Summary:

  • Modified: 1 file (topology_test.go - added 5 tests)
  • New: 5 YAML test scenario files

Does this PR introduce a API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

Ronkahn21 added a commit to Ronkahn21/grove that referenced this pull request Jan 18, 2026
Add 5 tests for scaling and hierarchical topology patterns:
- SP1: Full hierarchy with cascading constraints (PCS→PCSG→PCLQ)
- SP2: PCS + PCLQ constraint combination
- SP3: PCSG scaling with topology constraints
- SP5: PCSG + PCLQ without parent PCS constraint
- SP8: Large scaling ratio (6+ replicas)

These tests verify:
- Hierarchical constraint inheritance and overrides
- PCSG-level topology constraint propagation
- Large-scale PCSG replica handling
- KAI PodGroup SubGroup structure with constraints

Builds on PR ai-dynamo#349 (simple level tests).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 marked this pull request as draft January 20, 2026 10:39
@Ronkahn21 Ronkahn21 force-pushed the test/tas-e2e-simple branch from 9975b7b to 96d8fc7 Compare January 21, 2026 22:01
@Ronkahn21 Ronkahn21 marked this pull request as ready for review January 21, 2026 22:23
Copy link
Contributor

@gflarity gflarity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good test coverage for TAS constraint inheritance patterns. Found a few minor issues to address.

Copy link
Contributor

@gflarity gflarity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a couple issues around the documentation.

- Add 4-level topology hierarchy setup (zone/block/rack/host)
- Add KAI Topology verification utilities
- Add topology constraint verification helpers
- Include 2 foundational tests:
  * TI1: Topology infrastructure verification
  * BP1: Multiple cliques with different constraints
- Update dependencies to KAI Scheduler v0.13.0-rc1
- Add Makefile target for selective test execution
- Add topology-test skaffold profile

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add 5 tests for simple topology constraint scenarios:
- SL1: PCS-only constraint (inherited by children)
- SL2: PCSG-only constraint
- SL3: No topology constraints (baseline)
- PC1: Host-level constraint (strictest packing)
- ZL1: Zone-level constraint

These tests verify constraint behavior at different
resource levels (PCS, PCSG, PCLQ) and topology domains
(zone, rack, host, none).

Builds on PR ai-dynamo#348 (infrastructure).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Rename test functions to simplified format (TAS1-TAS7)
- Remove scenario labels from test comments
- Replace hardcoded strings with setup package constants
- Update log messages to use new test names
- Fix PR ai-dynamo#357 semantic change: PCSG inherits PCS constraint when nil
- Remove duplicate topology functions from k8s_clusters.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
When PCSG has nil TopologyConstraint, PCSG parent groups are not
created in KAI PodGroup. PCLQ children are placed directly under
PCS-level constraint instead.

- Test_TAS3: Changed from 5 to 3 SubGroups (removed PCSG parents)
- Test_TAS7: Changed from 4 to 2 SubGroups (removed PCSG parents)
- Updated PCLQ children to have Parent: nil instead of PCSG parents

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Revert e2e/setup/k8s_clusters.go to upstream (remove extra blank line)
- Revert e2e/tests/setup.go to upstream (restore MarkCleanupFailed)
- Revert e2e/utils/topology.go to upstream (remove WaitForPodsReady)
- Make TAS1 identical to upstream (use GetWorkerNodeLabelSelector)
- Make TAS2 identical to upstream (use 28 nodes, not 7)
- Add deployWorkloadAndGetPods helper from upstream
- Remove duplicate deployWorkloadAndGetPods

TAS1 and TAS2 now exactly match upstream/main.
TAS3-TAS7 follow the same pattern using deployWorkloadAndGetPods.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Rename Test_TAS6 to StandalonePCLQOnlyPCSZoneConstraint for clarity
- Add ExpectedSubGroup helper functions to reduce duplication
- Fix Test_TAS3 comment block and typo in error message
- Refactor TAS3-7 tests to use helper functions

All TAS tests pass successfully.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 force-pushed the test/tas-e2e-simple branch from 4dd807e to 2edc57d Compare January 24, 2026 18:03
…ication

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ubgroups

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 merged commit 16bf3c9 into ai-dynamo:main Jan 26, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants