feat: add DRA and gang scheduling test manifests for CNCF AI conformance #150

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:feat/cncf-ai-conformance-tests on Feb 19, 2026

@yuanchen8911 (Contributor)

Summary

Add test manifests for validating DRA GPU allocation and KAI scheduler gang scheduling,
supporting CNCF AI conformance evidence collection.

Motivation / Context

These manifests provide repeatable test cases for two key CNCF AI conformance areas:
DRA-based GPU allocation and gang scheduling with KAI scheduler.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/eidos, pkg/cli)
  • API server (cmd/eidosd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: Test manifests (tests/manifests/)

Implementation Notes

  • tests/manifests/dra-gpu-test.yaml — Creates namespace, ResourceClaim, and Pod to validate
    DRA GPU allocation via gpu.nvidia.com device class
  • tests/manifests/gang-scheduling-test.yaml — Creates namespace and PyTorchJob (1 Master +
    1 Worker) to validate KAI scheduler gang scheduling via PodGroup creation
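For readers who have not opened the files, the following is a rough sketch of what manifests of this shape typically look like. It is not copied from the merged files: every object name, namespace, image, command, and the queue label is an assumption made for illustration. The `resource.k8s.io/v1` request shape (with `exactly`) assumes a cluster that serves the GA DRA API, and the `kai-scheduler` scheduler name and `kai.scheduler/queue` label assume a default KAI scheduler install.

```yaml
# Illustrative sketch of tests/manifests/dra-gpu-test.yaml (names are hypothetical)
apiVersion: v1
kind: Namespace
metadata:
  name: dra-test
---
apiVersion: resource.k8s.io/v1        # assumption: older clusters serve beta versions with a different request shape
kind: ResourceClaim
metadata:
  name: gpu-claim
  namespace: dra-test
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com   # device class named in the PR description
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-test
  namespace: dra-test
spec:
  restartPolicy: Never
  containers:
  - name: ctr
    image: ubuntu:22.04               # assumption: nvidia-smi is injected by the GPU driver stack
    command: ["nvidia-smi", "-L"]     # lists the GPUs visible to the container
    resources:
      claims:
      - name: gpu                     # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: gpu
    resourceClaimName: gpu-claim
---
# Illustrative sketch of tests/manifests/gang-scheduling-test.yaml (names are hypothetical)
apiVersion: v1
kind: Namespace
metadata:
  name: gang-scheduling-test
---
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: gang-test
  namespace: gang-scheduling-test
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            kai.scheduler/queue: default-queue   # assumption: queue name depends on the cluster setup
        spec:
          schedulerName: kai-scheduler           # assumption: default KAI scheduler name
          containers:
          - name: pytorch                        # PyTorchJob expects the container to be named "pytorch"
            image: pytorch/pytorch:latest        # assumption: any image with PyTorch installed
            command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            kai.scheduler/queue: default-queue
        spec:
          schedulerName: kai-scheduler
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest
            command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
            resources:
              limits:
                nvidia.com/gpu: 1
```

With manifests like these, the DRA check would pass when the ResourceClaim is allocated and the Pod lists a GPU, and the gang scheduling check would pass when a PodGroup is created for the job and the Master and Worker pods are scheduled together.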

Testing

make lint

Manifests tested manually on an H100 EKS cluster.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner February 19, 2026 04:45
@yuanchen8911 yuanchen8911 added the enhancement, size/XS, and area/validator labels and removed the size/XS and enhancement labels Feb 19, 2026

@mchmarny (Member) left a comment

LGTM

@mchmarny mchmarny merged commit 0c53d8e into NVIDIA:main Feb 19, 2026
14 checks passed
@mchmarny mchmarny deleted the feat/cncf-ai-conformance-tests branch February 19, 2026 12:37