feat(gpu): add NVIDIA GRID v20 driver support for RTX PRO 6000 BSE v6 SKUs #8619

Open
ganeshkumarashok wants to merge 2 commits into Azure:main from ganeshkumarashok:gpu-grid-v20-driver-support

@ganeshkumarashok
Contributor

What

Adds NVIDIA GRID v20 (595.x) driver support, selecting the new aks-gpu-grid-v20 container image for RTX PRO 6000 Blackwell Server Edition v6 SKUs:

  • Standard_NC128ds_xl_RTXPRO6000BSE_v6
  • Standard_NC256ds_xl_RTXPRO6000BSE_v6
  • Standard_NC320ds_xl_RTXPRO6000BSE_v6

All existing GRID SKUs keep using aks-gpu-grid (570.x); the CUDA path is untouched.
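The selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in baker.go: the map names, the helper `driverTypeFor`, the case-insensitive lookup, the example GRID and CUDA SKUs, and the "grid" / "cuda" type strings are all assumptions made for the sketch. Only the three v6 SKU names and the "grid-v20" type string come from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// gridV20Sizes holds the three SKUs named in this PR, lower-cased for lookup.
// The map name is hypothetical.
var gridV20Sizes = map[string]bool{
	"standard_nc128ds_xl_rtxpro6000bse_v6": true,
	"standard_nc256ds_xl_rtxpro6000bse_v6": true,
	"standard_nc320ds_xl_rtxpro6000bse_v6": true,
}

// gridSizes stands in for the existing GRID SKU set.
// The single entry is illustrative only.
var gridSizes = map[string]bool{
	"standard_nv36ads_a10_v5": true,
}

// driverTypeFor checks the v20 set first, then the existing GRID set,
// and otherwise falls back to the CUDA path.
func driverTypeFor(vmSize string) string {
	size := strings.ToLower(vmSize)
	switch {
	case gridV20Sizes[size]:
		return "grid-v20"
	case gridSizes[size]:
		return "grid"
	default:
		return "cuda"
	}
}

func main() {
	for _, s := range []string{
		"Standard_NC128ds_xl_RTXPRO6000BSE_v6",
		"Standard_NV36ads_A10_v5",
		"Standard_NC24ads_A100_v4",
	} {
		fmt.Printf("%s -> %s\n", s, driverTypeFor(s))
	}
}
```

Checking the v20 set before the GRID set means the ordering stays correct even if a v6 SKU were ever added to both sets.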

Changes

  • parts/common/components.json — add aks-gpu-grid-v20 GPUContainerImages entry.
  • pkg/agent/datamodel/gpu_components.go — parse it into NvidiaGridV20DriverVersion / AKSGPUGridV20VersionSuffix; refactor LoadConfig to match on the exact repo name (fixes a latent substring collision: aks-gpu-grid-v20 contains aks-gpu-grid); add RTXPro6000GPUDriverSizes.
  • pkg/agent/baker.go — add useGridV20Drivers(); branch GetGPUDriverVersion / GetAKSGPUImageSHA / GetGPUDriverType on it (checked before grid); driver type string "grid-v20".
  • .github/renovate.json — add aks/aks-gpu-grid-v20 package rule.
  • Unit tests for the new selection paths.
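To illustrate the substring collision mentioned for LoadConfig, here is a small sketch contrasting substring matching with exact repo-name matching. The function names and the `mcr.microsoft.com/aks/<repo>:*` form of the download URL are assumptions for illustration; the real parsing in gpu_components.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyBySubstring shows the latent bug: "aks-gpu-grid-v20" contains
// "aks-gpu-grid", so the first case also captures the v20 image.
func classifyBySubstring(downloadURL string) string {
	switch {
	case strings.Contains(downloadURL, "aks-gpu-grid"):
		return "grid"
	case strings.Contains(downloadURL, "aks-gpu-grid-v20"):
		return "grid-v20" // unreachable for any real input
	}
	return ""
}

// repoName returns the last path segment of an image reference,
// with any ":tag" part removed.
func repoName(downloadURL string) string {
	ref := downloadURL
	if i := strings.LastIndex(ref, "/"); i >= 0 {
		ref = ref[i+1:]
	}
	if i := strings.Index(ref, ":"); i >= 0 {
		ref = ref[:i]
	}
	return ref
}

// classifyExact compares the whole repo name, so the two images
// can no longer be confused.
func classifyExact(downloadURL string) string {
	switch repoName(downloadURL) {
	case "aks-gpu-grid":
		return "grid"
	case "aks-gpu-grid-v20":
		return "grid-v20"
	}
	return ""
}

func main() {
	url := "mcr.microsoft.com/aks/aks-gpu-grid-v20:*" // assumed URL form
	fmt.Println("substring:", classifyBySubstring(url))
	fmt.Println("exact:    ", classifyExact(url))
}
```

With substring matching, the result depends on case order; exact matching removes that dependency.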

Design notes

On Ubuntu the driver image repo is built as mcr.microsoft.com/aks/aks-gpu-${GPU_DRIVER_TYPE} (cse_helpers.sh), so setting the driver type to grid-v20 resolves the new repo automatically.
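The repo resolution can be expressed in a few lines. The real logic lives in shell (cse_helpers.sh); the Go helper below, including its name `driverImageRepo`, is only a hypothetical restatement of the `aks-gpu-${GPU_DRIVER_TYPE}` pattern quoted above.

```go
package main

import "fmt"

// driverImageRepo mirrors the pattern
// mcr.microsoft.com/aks/aks-gpu-${GPU_DRIVER_TYPE} quoted in the PR.
// Hypothetical helper, not part of the repository.
func driverImageRepo(driverType string) string {
	return "mcr.microsoft.com/aks/aks-gpu-" + driverType
}

func main() {
	// The new driver type string resolves to the new repo with no
	// further changes to the Ubuntu install path.
	fmt.Println(driverImageRepo("grid-v20"))
	fmt.Println(driverImageRepo("grid"))
}
```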

Scope is Ubuntu-only by design. RTX PRO 6000 BSE v6 runs on Ubuntu GPU nodes. The non-Ubuntu install paths (Mariner RPM / ACL sysext) do not use the container image and have no v20 packages, so those CSE checks are deliberately left unchanged.

The new image comes from aks-gpu PR #158 (merged).

⚠️ Do not merge yet

aks-gpu-grid-v20 is not yet published to MCR (onboarding is tracked separately). The version tag suffix in components.json (595.58.03-20260101000000) is a placeholder and must be replaced with the real published tag before merge. Until then, nodes on these SKUs would attempt to pull a nonexistent tag.

make generate produces no testdata/manifest diff (no existing scenario uses these SKUs), so the placeholder does not leak into generated snapshots.

Testing

  • go build ./pkg/agent/...
  • go test ./pkg/agent ./pkg/agent/datamodel — pass
  • make validate-components — pass

… SKUs

Select the new aks-gpu-grid-v20 image (NVIDIA GRID 595.x) for
NC_RTXPRO6000BSE_v6 SKUs. All existing GRID SKUs continue to use
aks-gpu-grid (570.x); CUDA path is untouched.

- components.json: add aks-gpu-grid-v20 GPUContainerImages entry.
- gpu_components.go: parse it into NvidiaGridV20DriverVersion /
  AKSGPUGridV20VersionSuffix; refactor LoadConfig to match on the exact
  repo name (fixes a latent substring collision between aks-gpu-grid and
  aks-gpu-grid-v20); add RTXPro6000GPUDriverSizes.
- baker.go: add useGridV20Drivers(); branch GetGPUDriverVersion /
  GetAKSGPUImageSHA / GetGPUDriverType on it (checked before grid),
  driver type "grid-v20".
- renovate.json: add aks/aks-gpu-grid-v20 package rule.
- tests for the new selection paths.

Scope is Ubuntu-only: RTX PRO 6000 BSE v6 runs on Ubuntu GPU nodes, which
build the driver image repo as aks-gpu-${GPU_DRIVER_TYPE}; non-Ubuntu
(Mariner/ACL) install paths do not use the container image and are
deliberately untouched.

NOTE (do not merge yet): aks-gpu-grid-v20 is not yet published to MCR, so
the version tag suffix in components.json is a placeholder and must be
replaced with the real published tag before merge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 2, 2026 00:24
Contributor

Copilot AI left a comment

Pull request overview

Adds a new NVIDIA GRID v20 (595.x) driver selection path for RTX PRO 6000 Blackwell Server Edition v6 NC SKUs by introducing a new GPU driver container image (aks-gpu-grid-v20) and ensuring config parsing doesn’t confuse it with the existing aks-gpu-grid image.

Changes:

  • Add aks-gpu-grid-v20 to GPUContainerImages and parse it into new datamodel globals (version + suffix), using exact repo-name matching to avoid substring collisions.
  • Add SKU-based routing so RTX PRO 6000 BSE v6 sizes use GRID v20 for GetGPUDriverVersion, GetAKSGPUImageSHA, and GetGPUDriverType (new type: grid-v20).
  • Extend Renovate rules and unit tests to cover the new config fields and selection paths.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| parts/common/components.json | Adds new GPU container image entry for aks-gpu-grid-v20 (tag currently a placeholder per PR description). |
| pkg/agent/datamodel/gpu_components.go | Adds v20 config globals and exact repo-name parsing; introduces RTX PRO 6000 BSE v6 SKU map. |
| pkg/agent/datamodel/gpu_components_test.go | Validates v20 config values are populated and correctly formatted. |
| pkg/agent/baker.go | Routes RTX PRO 6000 BSE v6 SKUs to GRID v20 driver/version/type before standard GRID selection. |
| pkg/agent/baker_test.go | Adds unit tests for GRID v20 selection in version/type/image suffix. |
| .github/renovate.json | Adds Renovate package rule for aks/aks-gpu-grid-v20. |

Comment thread pkg/agent/datamodel/gpu_components_test.go
Comment thread pkg/agent/datamodel/gpu_components_test.go
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
