@Ronkahn21
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds the final 4 test scenarios for advanced patterns and edge cases in Topology Aware Scheduling (TAS), completing the e2e test suite.

Tests Added:

  • EC1_InsufficientNodesForConstraint: Tests error handling when too few topology domains exist to satisfy a constraint. Verifies that pod events report the Unschedulable reason.
  • MR1_MultiReplicaWithRackConstraint: Tests multi-replica PCS (2 replicas) with rack-level constraints across all pods.
  • SP4_DisaggregatedInferenceMultiplePCSGs: Tests disaggregated inference pattern with 3 PCSGs (prefill/decode/router) coordinating across topology domains.
  • SP9_MultiReplicaPCSWithThreeLevelHierarchy: Tests the most complex scenario: a multi-replica PCS (2 replicas) with a 3-level hierarchy (PCS → PCSG → PCLQ), creating 6 PodGangs in total.
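
For reviewers less familiar with what a rack-level constraint check amounts to, here is a minimal, self-contained sketch of the idea: given where each pod landed and the labels on each node, confirm that all pods share one topology domain. It is not the helper used in topology_test.go. The label key `example.com/rack`, the function names, and the plain-map inputs are all assumptions made for illustration.

```go
package main

import "fmt"

// rackLabelKey is a hypothetical node label key for the rack domain;
// the real key used by the test cluster may differ.
const rackLabelKey = "example.com/rack"

// domainsUsed returns the set of distinct values of labelKey across the
// nodes that host the given pods. podToNode maps pod name to node name;
// nodeLabels maps node name to that node's labels.
func domainsUsed(podToNode map[string]string, nodeLabels map[string]map[string]string, labelKey string) (map[string]bool, error) {
	domains := make(map[string]bool)
	for pod, node := range podToNode {
		labels, ok := nodeLabels[node]
		if !ok {
			return nil, fmt.Errorf("pod %s: node %q not found", pod, node)
		}
		value, ok := labels[labelKey]
		if !ok {
			return nil, fmt.Errorf("pod %s: node %q has no label %q", pod, node, labelKey)
		}
		domains[value] = true
	}
	return domains, nil
}

// inSingleDomain reports whether every pod sits in the same domain.
// An empty pod set yields false, since there is nothing to verify.
func inSingleDomain(podToNode map[string]string, nodeLabels map[string]map[string]string, labelKey string) (bool, error) {
	domains, err := domainsUsed(podToNode, nodeLabels, labelKey)
	if err != nil {
		return false, err
	}
	return len(domains) == 1, nil
}

func main() {
	nodeLabels := map[string]map[string]string{
		"node-a": {rackLabelKey: "rack-1"},
		"node-b": {rackLabelKey: "rack-1"},
		"node-c": {rackLabelKey: "rack-2"},
	}
	podToNode := map[string]string{"pod-0": "node-a", "pod-1": "node-b"}

	packed, err := inSingleDomain(podToNode, nodeLabels, rackLabelKey)
	if err != nil {
		fmt.Println("verification error:", err)
		return
	}
	fmt.Println("all pods in one rack:", packed)
}
```

In a multi-replica scenario such as MR1, a check like this would presumably run once per replica's pod set rather than across all pods, but that grouping is an assumption about the test design.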

Test Coverage:

  • Error cases with insufficient nodes for constraints
  • Multi-replica PCS behavior with topology constraints
  • Disaggregated inference architectural patterns
  • Complex multi-level hierarchies across scaled resources
  • KAI PodGroup verification for base and scaled PodGangs
  • Pod event verification for scheduling failures
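
As a rough illustration of the pod event verification listed above, the sketch below scans a list of events for a given reason and reports which pods lack it. The `podEvent` struct is a simplified, hypothetical stand-in for a Kubernetes Event, and the function names are invented for this example. Only the reason string `Unschedulable` comes from the description above. How the real test fetches events from the cluster is not shown here.

```go
package main

import "fmt"

// podEvent is a simplified, hypothetical stand-in for a Kubernetes Event
// attached to a pod. Field names here are assumptions for illustration.
type podEvent struct {
	PodName string
	Type    string
	Reason  string
	Message string
}

// hasEventWithReason reports whether any event recorded for podName
// carries exactly the given reason.
func hasEventWithReason(events []podEvent, podName, reason string) bool {
	for _, e := range events {
		if e.PodName == podName && e.Reason == reason {
			return true
		}
	}
	return false
}

// podsMissingReason returns, in input order, the pods that have no event
// with the given reason. An empty result means every pod reported it.
func podsMissingReason(events []podEvent, podNames []string, reason string) []string {
	var missing []string
	for _, name := range podNames {
		if !hasEventWithReason(events, name, reason) {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	events := []podEvent{
		{PodName: "pod-0", Type: "Warning", Reason: "Unschedulable", Message: "no domain satisfies the constraint"},
		{PodName: "pod-1", Type: "Warning", Reason: "Unschedulable", Message: "no domain satisfies the constraint"},
	}
	missing := podsMissingReason(events, []string{"pod-0", "pod-1"}, "Unschedulable")
	fmt.Println("pods without an Unschedulable event:", len(missing))
}
```

Because events are emitted asynchronously, an e2e test would likely wrap a check like this in a polling loop with a timeout. That is a general observation, not a statement about how EC1 is implemented.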

This PR completes the TAS e2e test suite (part 4 of 4).

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Dependencies:

Test Verification:

  • All tests compile successfully with -tags e2e
  • Linter passes with 0 issues
  • Added 592 lines of test code across 4 test functions
  • Added 4 YAML test scenario files

Complete Test Suite Summary:

Total: 16 tests covering infrastructure, simple patterns, scaling, hierarchies, and edge cases

File Summary:

  • Modified: 1 file (topology_test.go - added 4 tests)
  • New: 4 YAML test scenario files

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

- Add 4-level topology hierarchy setup (zone/block/rack/host)
- Add KAI Topology verification utilities
- Add topology constraint verification helpers
- Include 2 foundational tests:
  * TI1: Topology infrastructure verification
  * BP1: Multiple cliques with different constraints
- Update dependencies to KAI Scheduler v0.13.0-rc1
- Add Makefile target for selective test execution
- Add topology-test skaffold profile

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Add 5 tests for simple topology constraint scenarios:
- SL1: PCS-only constraint (inherited by children)
- SL2: PCSG-only constraint
- SL3: No topology constraints (baseline)
- PC1: Host-level constraint (strictest packing)
- ZL1: Zone-level constraint

These tests verify constraint behavior at different
resource levels (PCS, PCSG, PCLQ) and topology domains
(zone, rack, host, none).

Builds on PR ai-dynamo#348 (infrastructure).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Add 5 tests for scaling and hierarchical topology patterns:
- SP1: Full hierarchy with cascading constraints (PCS→PCSG→PCLQ)
- SP2: PCS + PCLQ constraint combination
- SP3: PCSG scaling with topology constraints
- SP5: PCSG + PCLQ without parent PCS constraint
- SP8: Large scaling ratio (6+ replicas)

These tests verify:
- Hierarchical constraint inheritance and overrides
- PCSG-level topology constraint propagation
- Large-scale PCSG replica handling
- KAI PodGroup SubGroup structure with constraints

Builds on PR ai-dynamo#349 (simple level tests).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Add 4 tests for advanced scenarios and edge cases:
- EC1: Insufficient nodes for constraint (error handling)
- MR1: Multi-replica PCS with rack constraint
- SP4: Disaggregated inference with multiple PCSGs
- SP9: Multi-replica PCS with 3-level hierarchy (most complex)

These tests verify:
- Error cases with insufficient topology domains
- Multi-replica PCS scaling behavior
- Disaggregated inference patterns (prefill/decode/router)
- Complex 3-level hierarchies across multiple PCS replicas
- KAI PodGroup verification for scaled PodGangs

Completes TAS e2e test suite (part 4 of 4).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>