refactor: move kai-scheduler and DRA driver to base overlay for CNCF AI conformance #139

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:refactor/move-kai-dra-to-base on Feb 19, 2026

@yuanchen8911 (Contributor) commented Feb 18, 2026

Summary

Move nvidia-dra-driver-gpu and kai-scheduler component definitions from overlay-specific configs into base.yaml.

Motivation / Context

Both training and inference clusters need gang scheduling (kai-scheduler) and DRA support (nvidia-dra-driver-gpu) to meet CNCF AI Conformance requirements. Moving them to base ensures all recipe variants include these components. Overlays retain only environment-specific overrides (EKS: controller affinity/tolerations, Kind: nvidiaDriverRoot).
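
The base-plus-overlay layering described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's recipe engine: the `merge` helper, the dictionary layout, the `enabled` field, and the `"/"` value are all assumptions made for the example. Only the component names and the `nvidiaDriverRoot` override name come from this PR.

```python
import copy

# Hypothetical stand-in for base.yaml: both components are defined once here.
BASE = {
    "kai-scheduler": {"enabled": True},
    "nvidia-dra-driver-gpu": {"enabled": True},
}

# Hypothetical Kind overlay: it carries only the environment-specific override.
# The "/" value is illustrative, not taken from the repository.
KIND_OVERLAY = {
    "nvidia-dra-driver-gpu": {"nvidiaDriverRoot": "/"},
}


def merge(base, overlay):
    """Return a new dict with overlay values layered over base, recursively."""
    result = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


resolved = merge(BASE, KIND_OVERLAY)
```

With this shape, any overlay that says nothing about the two components still resolves to a recipe containing both, which is the property the move to base is meant to guarantee.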

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/eidos, pkg/cli)
  • API server (cmd/eidosd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

  • CUJ1 training test assertion updated to include the two new base components in componentRefs and deploymentOrder
  • Overlay-specific overrides preserved (EKS affinity/tolerations, Kind nvidiaDriverRoot)
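
The kind of check the updated CUJ1 assertion performs might look like the sketch below. It is written in Python for brevity and is not the repository's test code. The `missing_components` helper and the flat list-of-names recipe shape are assumptions; the real `componentRefs` entries are likely richer objects.

```python
# Component names taken from this PR.
REQUIRED = ("kai-scheduler", "nvidia-dra-driver-gpu")


def missing_components(recipe):
    """Return (field, name) pairs for required components absent from the recipe."""
    missing = []
    for field in ("componentRefs", "deploymentOrder"):
        present = set(recipe.get(field, []))
        missing.extend((field, name) for name in REQUIRED if name not in present)
    return missing


# Hypothetical resolved training recipe, simplified to lists of names.
recipe = {
    "componentRefs": ["kai-scheduler", "nvidia-dra-driver-gpu"],
    "deploymentOrder": ["nvidia-dra-driver-gpu", "kai-scheduler"],
}
assert missing_components(recipe) == []
```

Checking both fields catches the case where a component is referenced but never scheduled for deployment, or the reverse.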

Testing

make test
make lint

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 requested a review from a team as a code owner February 18, 2026 03:48
@yuanchen8911 changed the title from "refactor: move kai-scheduler and DRA driver to base overlay" to "refactor: move kai-scheduler and DRA driver to base overlay for CNCF AI conformance" Feb 18, 2026
@yuanchen8911 changed the title from "refactor: move kai-scheduler and DRA driver to base overlay for CNCF AI conformance" to "refactor: move kai-scheduler and DRA driver to base overlay for CNCF AI conformance (WIP)" Feb 18, 2026
@yuanchen8911 force-pushed the refactor/move-kai-dra-to-base branch from 433ee49 to 4bf433b February 18, 2026 22:16
@yuanchen8911 requested a review from a team as a code owner February 18, 2026 22:16
@mchmarny marked this pull request as draft February 19, 2026 00:28
@yuanchen8911 force-pushed the refactor/move-kai-dra-to-base branch from 4bf433b to 44bc354 February 19, 2026 00:32
@yuanchen8911 force-pushed the refactor/move-kai-dra-to-base branch from 44bc354 to cfe03f4 February 19, 2026 00:49
@yuanchen8911 force-pushed the refactor/move-kai-dra-to-base branch from cfe03f4 to 608e50e February 19, 2026 02:56
@github-actions bot removed the area/ci label Feb 19, 2026
@yuanchen8911 marked this pull request as ready for review February 19, 2026 15:25
@yuanchen8911 changed the title from "refactor: move kai-scheduler and DRA driver to base overlay for CNCF AI conformance (WIP)" to "refactor: move kai-scheduler and DRA driver to base overlay for CNCF AI conformance" Feb 19, 2026
Move nvidia-dra-driver-gpu and kai-scheduler component definitions from
dynamo-specific and kind overlays into base.yaml so that both training
and inference clusters include gang scheduling and DRA support, meeting
CNCF AI Conformance requirements. Overlays retain only
environment-specific overrides. Update CUJ1 training test assertion to
include the two new base components.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 force-pushed the refactor/move-kai-dra-to-base branch from a36c779 to b735b14 February 19, 2026 15:29
@mchmarny (Member) left a comment

/lgtm

@mchmarny merged commit ed4973b into NVIDIA:main Feb 19, 2026
16 checks passed
@mchmarny deleted the refactor/move-kai-dra-to-base branch February 19, 2026 16:10