feat(validator): self-contained gang scheduling conformance check by dims · Pull Request #184 · NVIDIA/aicr

dims · 2026-02-23T00:55:06Z

Summary

Make the gang-scheduling conformance check fully self-contained by programmatically creating test resources (PodGroup + 2 GPU pods with DRA ResourceClaims) instead of relying on pre-deployed manifests
Add GPU availability pre-flight check via ResourceSlices/ResourceClaims to fail fast when fewer than 2 GPUs are free, avoiding a 5-minute timeout
Add countAvailableGPUs() shared helper in conformance/helpers.go for reuse by other GPU-dependent checks
Add PodGroup create/delete RBAC for the validator service account

Test plan

Unit tests pass with race detector (8 test cases including insufficient GPUs, pod failure, missing deployments/CRDs)
Full conformance test suite passes
CI: lint, unit tests, build
CI: GPU workflow validates gang scheduling end-to-end

Make the gang-scheduling conformance check fully self-contained by programmatically creating test resources instead of relying on pre-deployed manifests. The check now:

1. Verifies KAI scheduler deployments and CRDs (unchanged)
2. Pre-flight: counts free GPUs via ResourceSlices/ResourceClaims and fails fast if fewer than 2 are available
3. Creates a PodGroup with 2 GPU test pods using DRA ResourceClaims
4. Waits for all pods to reach terminal state
5. Validates gang scheduling patterns (kai-scheduler, PodGroup labels, DRA resource claims, pod success)
6. Cleans up all test resources

Adds countAvailableGPUs() helper to conformance/helpers.go for reuse by other GPU-dependent checks.

Remove redundant "Deploy gang scheduling test" and cleanup steps from the GPU training CI workflow since the conformance check now handles this end-to-end.

The GPU Conformance Test (nvkind + H100 x2) workflow was created on PR #180's branch but never merged to main. This adds it with an updated schedule (08:45/20:45 UTC) to maintain a 2h15m gap from the GPU Training Test (06:30/18:30 UTC), ensuring the two H100 x2 jobs don't compete for the same runner.

Schedule layout (all 2x daily, 12h apart):
- T4 Smoke: 06:00 / 18:00 UTC
- H100 Inference: 06:15 / 18:15 UTC
- H100 Training x2: 06:30 / 18:30 UTC
- H100 Conformance: 08:45 / 20:45 UTC (2h15m after training)

Aligned with current CI patterns:
- gpu-snapshot-validate action instead of inline snapshot steps
- Karpenter nodepool.yaml applied after install
- load-versions + setup-build-tools for chainsaw install
- Dockerfile.validator and missing action paths in path triggers
- Step ordering and naming consistent with inference/training
- Removed redundant DRA/gang pre-deploy steps that would exhaust GPU claim capacity before the self-contained conformance checks run inside aicr validate (introduced in PRs #184, #185)

dims requested a review from a team as a code owner February 23, 2026 00:55

github-actions bot added area/validator size/XL labels Feb 23, 2026

dims force-pushed the dims/gang-scheduling-conformance branch from 044d0a1 to d43c901 Compare February 23, 2026 01:01

dims requested a review from a team as a code owner February 23, 2026 01:01

github-actions bot added the area/ci label Feb 23, 2026

dims force-pushed the dims/gang-scheduling-conformance branch 2 times, most recently from ede7e01 to fffb7ce Compare February 23, 2026 01:15

dims force-pushed the dims/gang-scheduling-conformance branch from fffb7ce to 23d3fa0 Compare February 23, 2026 01:32

dims merged commit 2acf5d0 into NVIDIA:main Feb 23, 2026
16 checks passed

dims merged 1 commit into NVIDIA:main from dims:dims/gang-scheduling-conformance
