Test/tas e2e #313
Open
Ronkahn21 wants to merge 43 commits into ai-dynamo:main from Ronkahn21:test/tas-e2e
Conversation
fqn: podGangName,
pclqs: pcsgPodCliqueInfos,
pcsgTopologyConstraints: pcsgTopologyConstraints,
topologyConstraint: createTopologyPackConstraint(sc, apicommonconstants.KindPodCliqueSet, client.ObjectKeyFromObject(sc.pcs), sc.pcs.Spec.Template.TopologyConstraint),
Collaborator
Thanks for fixing this, but can we make a separate PR for this? cc @unmarshall
* Introduced a new condition TopologyLevelsUnavailable in PCS status. * Added reconciliation code to update the PCS status condition. * Added missing cluster role to delete KAI topology CR. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Added code to update or remove the condition on PCS. * Create utility function for cluster topology with unit test Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
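For illustration, a minimal sketch (not the Grove code) of setting and clearing such a status condition with the apimachinery condition helpers; the Reason and Message strings below are invented for the example:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Conditions slice as it would live on the PodCliqueSet (PCS) status.
	var conditions []metav1.Condition

	// Set (or update in place) the TopologyLevelsUnavailable condition.
	// Reason and Message are hypothetical values for this sketch.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type:    "TopologyLevelsUnavailable",
		Status:  metav1.ConditionTrue,
		Reason:  "ClusterTopologyLevelsMissing",
		Message: "cluster topology levels could not be determined",
	})

	// Remove the condition again once topology levels become available.
	meta.RemoveStatusCondition(&conditions, "TopologyLevelsUnavailable")
	fmt.Println(len(conditions)) // 0
}
```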
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Upgraded the KAI scheduler version dependency for e2e tests to v0.12.0. * Changed the polling timeout for e2e tests to 2 minutes. * Removed the NVIDIA GPU operator installation as it is not required. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved `synchronizeTopology` in main to the clustertopology package. * Adjusted unit tests for clustertopology.go. * Removed the previously added delete cluster role for the KAI Topology resource. * Removed the NVIDIA GPU Operator setup from e2e tests as it is not required. * Increased the poll timeout to 4 minutes. * Set restartPolicy to Always for the Grove operator deployment. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
will be set later after requirements are clear. * Added unit tests for computeExpectedPodGangs function. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved GetClusterTopologyLevels to clustertopology package. * Added docstring for buildClusterTopology function. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Renamed ClusterTopologyConfiguration to TopologyAwareSchedulingConfiguration in the operator config. * Introduced a new condition TopologyLevelsUnavailable on PCS. * The PackDomain field in corev1alpha1 TopologyConstraint is now required. * When creating ClusterTopology, if the host topology level is not defined in TopologyAwareSchedulingConfiguration, the operator sets it in ClusterTopology since it is a required level. * Adapted the PodGang component to set pack constraints at all hierarchy levels. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
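A small illustrative sketch of the host-level fallback described above, with hypothetical type and label names; only the rule that a missing host level gets appended is taken from the commit:

```go
package main

import "fmt"

// topologyLevel is a hypothetical stand-in for one cluster topology level
// (zone, block, rack, host) together with the node label that identifies it.
type topologyLevel struct {
	name      string
	nodeLabel string
}

// ensureHostLevel appends the required host level when the configured
// TopologyAwareSchedulingConfiguration levels do not already include it.
func ensureHostLevel(levels []topologyLevel) []topologyLevel {
	for _, l := range levels {
		if l.name == "host" {
			return levels
		}
	}
	return append(levels, topologyLevel{name: "host", nodeLabel: "kubernetes.io/hostname"})
}

func main() {
	configured := []topologyLevel{
		{name: "block", nodeLabel: "topology.example.com/block"},
		{name: "rack", nodeLabel: "topology.example.com/rack"},
	}
	fmt.Println(ensureHostLevel(configured))
}
```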
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add topology-aware scheduling tests for multi-clique constraints: - BP-1: Multiple cliques with different topology constraints - SP-1: Full 3-level hierarchy with cascading constraints Changes: - Add workload7.yaml for BP-1 (rack+block constraints) - Add workload8.yaml for SP-1 (block->rack->host cascade) - Implement Test_BP1_MultipleCliquesWithDifferentConstraints - Implement Test_SP1_FullHierarchyWithCascadingConstraints - Add helper functions for pod labeling and topology verification - Enable topology-test profile in e2e cluster setup - Fix pod label selectors to use correct Grove labels Signed-off-by: Ron Kahn <rkahn@nvidia.com>
The test now correctly verifies: - Each PCLQ's 2 pods on same host (4 cliques total) - Each PCSG replica's 4 pods in same rack (2 replicas) - All 8 pods in same block (PCS constraint) This properly tests the cascading constraint hierarchy where child constraints (host) are stricter than parent (rack > block) Signed-off-by: Ron Kahn <rkahn@nvidia.com>
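Roughly, the verification described above reduces to grouping pods by the topology domain of the node they landed on and asserting each group spans a single domain; a minimal sketch with made-up helper names:

```go
package main

import "fmt"

// podsShareDomain reports whether every pod in the group landed in the same
// topology domain, given a map from node name to the value of the relevant
// topology label (hostname, rack, or block).
func podsShareDomain(podNodes []string, nodeDomain map[string]string) bool {
	domains := map[string]struct{}{}
	for _, node := range podNodes {
		domains[nodeDomain[node]] = struct{}{}
	}
	return len(domains) <= 1
}

func main() {
	// Two PCLQ pods scheduled onto the same node: the host-level check passes.
	nodeToHost := map[string]string{"node-a": "node-a", "node-b": "node-b"}
	fmt.Println(podsShareDomain([]string{"node-a", "node-a"}, nodeToHost)) // true

	// The same grouping at rack level, with nodes in different racks: fails.
	nodeToRack := map[string]string{"node-a": "rack-1", "node-b": "rack-2"}
	fmt.Println(podsShareDomain([]string{"node-a", "node-b"}, nodeToRack)) // false
}
```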
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…l, and decode Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… management Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ogy constraints Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… optional test pattern Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ndencies Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…roup name in topology.go Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…specifications for disaggregated inference and host-level packing Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…pdate Makefile usage Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…logy tests Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…proach Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ment Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ment Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Fix scaled PodGang PCS topology constraints Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Scaled PodGangs (PCSG replicas above MinAvailable) were missing PCSG-level topology constraints in TopologyConstraintGroupConfigs. This caused pods in scaled PCSG replicas to be scheduled without proper topology grouping constraints. Changes: - Collect pclqFQNs while building scaled PodGang - Create TopologyConstraintGroupConfig for PCSG when TAS enabled - Set pcsgTopologyConstraints in podGangInfo for scaled PodGangs - Now mirrors the pattern used in base PodGang creation Signed-off-by: Ron Kahn <rkahn@nvidia.com>
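A simplified sketch of the pattern this commit describes, using hypothetical type and field names rather than the actual PodGang API: the PCSG-level constraint group is only built for a scaled PodGang when TAS is enabled, covering the collected clique FQNs:

```go
package main

import "fmt"

// TopologyConstraintGroupConfig is a hypothetical stand-in for the PCSG-level
// constraint group attached to a PodGang.
type TopologyConstraintGroupConfig struct {
	PackDomain    string   // topology level the group must be packed into (e.g. "rack")
	PodCliqueFQNs []string // cliques whose pods belong to this group
}

// buildScaledPCSGConstraint mirrors the base-PodGang path: the group is only
// built when topology-aware scheduling is enabled and a pack domain is set.
func buildScaledPCSGConstraint(tasEnabled bool, packDomain string, pclqFQNs []string) *TopologyConstraintGroupConfig {
	if !tasEnabled || packDomain == "" {
		return nil
	}
	return &TopologyConstraintGroupConfig{PackDomain: packDomain, PodCliqueFQNs: pclqFQNs}
}

func main() {
	// FQNs collected while building a scaled PodGang (illustrative values).
	fqns := []string{"inference-0-decode-1-worker", "inference-0-decode-1-leader"}
	fmt.Printf("%+v\n", buildScaledPCSGConstraint(true, "rack", fqns))
}
```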
Add comprehensive E2E test validating topology constraints across: - 2 PCS replicas creating 6 PodGangs (2 base + 4 scaled) - 3-level topology hierarchy: PCS (block) → PCSG (rack) → PCLQ (host) - 20 pods total with proper topology constraint enforcement This test ensures the topology constraint fix for scaled PodGangs works correctly across multiple PCS replicas. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
SP5 and SP8 tests were using non-existent label 'grove.io/podcliquescalinggroupreplica'. Fixed to use two-label filtering approach like SP4: - grove.io/podcliquescalinggroup (identifies PCSG) - grove.io/podcliquescalinggroup-replica-index (identifies replica) Both tests now pass successfully. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
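For reference, a hedged sketch of the two-label filter using controller-runtime; the label keys are the ones named above, while the package name and client plumbing are illustrative only:

```go
package e2e // hypothetical package name

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listPCSGReplicaPods lists the pods belonging to one PodCliqueScalingGroup
// replica by filtering on the PCSG name label and the replica-index label.
func listPCSGReplicaPods(ctx context.Context, c client.Client, namespace, pcsgName, replicaIndex string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := c.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{
			"grove.io/podcliquescalinggroup":               pcsgName,
			"grove.io/podcliquescalinggroup-replica-index": replicaIndex,
		},
	)
	return pods, err
}
```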
SP7 test duplicates BP1 (multiple PCLQs with different constraints). The combination of PCSG + PCLQ is already validated by SP1 and SP5. Reduces test count from 17 to 16 with zero coverage loss. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
5d35d3c to 4a6bda7
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Updated test expectations in TestComputeExpectedPodGangsWithTopologyConstraints to correctly validate the 3-level constraint hierarchy for scaled PodGangs. Changes: - Fixed topologyLevel from rack to zone (PCS-level constraint) - Added missing pcsgConstraints field for PCSG-level constraints Both base and scaled PodGangs have unified 3-level structure with PCS-level at top, PCLQ-level for PodGroups, and PCSG-level in TopologyConstraintGroupConfigs. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…tion Signed-off-by: Ron Kahn <rkahn@nvidia.com>
PR Description
What type of PR is this?
/kind feature
/kind testing
What this PR does / why we need it:
Adds comprehensive end-to-end tests for Topology-Aware Scheduling (TAS) functionality in Grove. The tests validate that Grove's translation mechanism correctly converts user-defined topology constraints (pack domains) to KAI scheduler format, and that pods are placed according to topology constraints across all hierarchy levels (zone, block, rack, host).
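For orientation, a small illustrative sketch (not Grove or KAI code) of the cascading rule these tests exercise: a child's pack domain must sit at or below its parent's in the zone > block > rack > host hierarchy.

```go
package main

import "fmt"

// Hierarchy levels used in the tests, ordered from broadest to narrowest.
var levels = []string{"zone", "block", "rack", "host"}

// isAtLeastAsStrict reports whether the child's pack domain is at or below the
// parent's in the hierarchy, i.e. the child constraint is at least as strict.
func isAtLeastAsStrict(child, parent string) bool {
	idx := func(l string) int {
		for i, v := range levels {
			if v == l {
				return i
			}
		}
		return -1
	}
	return idx(child) >= idx(parent)
}

func main() {
	// PCS packs into a block, PCSG into a rack, PCLQ onto a host: a valid cascade.
	fmt.Println(isAtLeastAsStrict("rack", "block"), isAtLeastAsStrict("host", "rack")) // true true
}
```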
Which issue(s) this PR fixes:
Fixes #305
Special notes for your reviewer:
Does this PR introduce an API change?
No
Release note:
Add comprehensive E2E tests for Topology-Aware Scheduling (TAS) covering translation of topology constraints, infrastructure validation, placement verification, scaling scenarios, and failure handling.
Additional documentation:
E2E test coverage includes:
Test execution:
make e2e-test TEST_PATTERN="^Test_TAS"