Releases · kubernetes-sigs/dra-driver-nvidia-gpu

12 Jun 21:41

shivamerla

v0.4.1-rc.1

fc7bcf3

v0.4.1-rc.1 Pre-release

What's Changed

build(deps): bump github.com/NVIDIA/nvidia-container-toolkit from 1.19.0 to 1.19.1 in #1161
Remove redundant component labels that caused failures during GitOps-based Helm installs in #1193
Fix nil pointer panic in verifyDisabledVFs for non-SR-IOV GPUs in #1194
Update go-nvlib to v0.11.0 in #1198

Full Changelog: v0.4.0...v0.4.1-rc.1

15 May 13:07

shivamerla

v0.4.0

6910a2b

v0.4.0 Latest

This is the first release of DRA Driver for NVIDIA GPUs as a part of a Kubernetes SIG community at kubernetes-sigs/dra-driver-nvidia-gpu. This release also updates the versioning scheme to Semantic Versioning (SemVer), starting at v0.4.0.

Project move

The DRA Driver for NVIDIA GPUs has moved from the NVIDIA org to kubernetes-sigs as part of its donation to CNCF. The new identifiers are:

| Artifact | Identifier |
| --- | --- |
| Repository | `kubernetes-sigs/dra-driver-nvidia-gpu` |
| Go module | `sigs.k8s.io/dra-driver-nvidia-gpu` |
| Container images | `registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu` |
| Helm chart | `oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu:0.4.0` |

Change to Semantic Versioning with v0.4.0

Starting at v0.4.0, the project follows Semantic Versioning (SemVer). The change makes it easier to build and publish client API bindings for ComputeDomains, and it aligns releases with Go module semantic versioning for importing dependencies.

Refer to the following issues for more details around this change: #988, #715, and #1046.

Helm chart location and name change

Starting with v0.4.0, the Helm chart is published to two locations:

NGC (continuing): https://helm.ngc.nvidia.com/nvidia/charts/dra-driver-nvidia-gpu. Note that this is a different name from the previous release. Refer to the Upgrade section for details on upgrading to the new chart name.
Kubernetes registry (new): oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu.

Users can choose to use either chart.

The Helm chart is also renamed from nvidia-dra-driver-gpu to dra-driver-nvidia-gpu in v0.4.0. Users can keep their existing component names by passing --set nameOverride=nvidia-dra-driver-gpu when upgrading. Refer to the Upgrade section for commands and details about required flags.

Action required

Starting in v0.4.0, the Helm chart follows Semantic Versioning. To upgrade, you must pass --version 0.4.0 when using helm upgrade. See the Upgrade section for details and commands.
If you are switching from the NGC chart to the Kubernetes registry chart, pass --set nameOverride=nvidia-dra-driver-gpu on your first upgrade to keep existing Kubernetes resource names stable. The override is not required for subsequent releases. Refer to the Upgrade section for exact commands.
Users who hit "device cannot be reprepared" after a host reboot prior to v0.4.0 (issue #951) must remove the kubelet plugin checkpoint file manually before upgrading. The new BootID-aware checkpoint format (#1066) only invalidates checkpoints that carry a recorded BootID. Legacy checkpoints written by older versions are assumed valid, so a stale legacy checkpoint is not discarded automatically.
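
As a rough sketch of the first two items, the upgrade to the Kubernetes registry chart might be composed as follows. The chart reference, version, and nameOverride value come from these notes. The release name and namespace are assumptions carried over from earlier releases, and the Upgrade section may list further required steps or flags. The snippet only prints the command so it can be reviewed first.

```shell
# Chart reference and version as stated in these release notes.
CHART="oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu"
VERSION="0.4.0"

# Assumptions: release name and namespace reuse the names from earlier releases.
RELEASE="nvidia-dra-driver-gpu"
NAMESPACE="nvidia-dra-driver-gpu"

# nameOverride keeps existing resource names stable on the first upgrade
# after switching from the NGC chart to the Kubernetes registry chart.
UPGRADE_CMD="helm upgrade -i $RELEASE $CHART --version $VERSION --namespace $NAMESPACE --set nameOverride=nvidia-dra-driver-gpu"

# Print the command for review; to run it, use: eval "$UPGRADE_CMD"
echo "$UPGRADE_CMD"
```

Pinning --version explicitly keeps the upgrade on the intended SemVer release rather than whatever version Helm would resolve by default.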

Feature gate changes

The following feature gates changed in v0.4.0. See pkg/featuregates/featuregates.go for the complete list of gates and their current defaults.

| Feature Gate | Change | Stage | Default | Required K8s Feature Gate | Details and PRs |
| --- | --- | --- | --- | --- | --- |
| DeviceMetadata | New | Alpha | false | None. Driver-side only (KEP-5304). Framework support is Alpha in K8s 1.36+ | #1000 |
| PassthroughSupport | Behavior change | Alpha | false | None beyond core DRA. IOMMUFD backend additionally requires a host kernel with IOMMUFD enabled. | IOMMUFD backend added (#1036), persistence-mode toggling during vfio prep (#1038), plugin startup, GPU tracking, and validation fixes (#994) |
| NVMLDeviceHealthCheck | Behavior change | Alpha | false | DRADeviceTaints (KEP-5055) Alpha (default off) in K8s 1.34 and 1.35, Beta (default on) in K8s 1.36. Informational taints additionally require K8s ≥ 1.35. | Unhealthy GPUs are now retained in the ResourceSlice with a DeviceTaint attached. The v25.12.0 behavior of removing unhealthy GPUs from the slice (which required a driver restart to re-add after recovery) is replaced. Non-fatal XIDs surface as informational taints (#983) |
| IMEXDaemonsWithDNSNames | Behavior change | Beta | true | None | When enabled, `numNodes` in the ComputeDomain API is now optional (#1081) |

This release introduces enhanced validation logic for feature-gate flags:

The driver now enforces mutual exclusivity between PassthroughSupport and NVMLDeviceHealthCheck during startup (#994).
Enabling DeviceMetadata functionality now requires that the PassthroughSupport feature gate is also active (#1000).
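
The two rules above can be checked before deployment with a small pre-flight script. The following is a minimal sketch, not the driver's own validation code: the function names gate_on and check_gates and the comma-separated Name=true input format are hypothetical and chosen for illustration only.

```shell
# gate_on LIST NAME: succeeds if NAME=true appears in the comma-separated LIST.
gate_on() {
  case ",$1," in
    *",$2=true,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_gates LIST: returns non-zero if LIST violates either rule described above.
check_gates() {
  gates="$1"
  if gate_on "$gates" PassthroughSupport && gate_on "$gates" NVMLDeviceHealthCheck; then
    echo "PassthroughSupport and NVMLDeviceHealthCheck are mutually exclusive" >&2
    return 1
  fi
  if gate_on "$gates" DeviceMetadata && ! gate_on "$gates" PassthroughSupport; then
    echo "DeviceMetadata requires PassthroughSupport" >&2
    return 1
  fi
  return 0
}

# Example: a combination that satisfies both rules.
check_gates "DeviceMetadata=true,PassthroughSupport=true" && echo "gate combination accepted"
```

A gate that is absent from the list or set to false is treated as disabled, which matches the Alpha defaults listed in the table.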

New features

Leader election for the compute-domain-controller. This adds high availability when running multiple controller replicas. Disabled by default. Enable by setting controller.replicas: 2 (or more) and controller.leaderElection.enabled: true in your Helm values (#851).
Prometheus metrics. The GPU kubelet plugin, ComputeDomain plugin, and DRA controller now expose optional Prometheus metrics under the nvidia_gpu_dra_* prefix. Enable via controller.metrics.enabled and kubeletPlugin.metrics.enabled (#964).
GPU health taints are now used to track health status. When the NVMLDeviceHealthCheck feature gate is enabled, unhealthy GPUs are tainted via Kubernetes DeviceTaints and remain in the ResourceSlice, replacing the v25.12.0 behavior of removing unhealthy GPUs from the slice (which required a driver restart to re-add them after recovery). Non-fatal XID errors are surfaced as informational taints for observability without affecting scheduling (#983).
IOMMUFD-backed VFIO passthrough. VFIO passthrough now supports the IOMMUFD kernel backend in addition to the legacy IOMMU interface. Workloads opt in via a new VfioDeviceConfig opaque config on the ResourceClaim (apiVersion: resource.nvidia.com/v1beta1) with iommu.backendPolicy: PreferIommuFD; the driver falls back to the legacy backend if IOMMUFD is not available on the node. A companion iommu.enableAPIDevice field controls whether the IOMMU API device is injected into the container. Defaults preserve v25.12.0 behavior — the default vfio.gpu.nvidia.com DeviceClass ships with no config and existing workloads continue to use the legacy backend unchanged. Requires the PassthroughSupport feature gate (Alpha, default off) and, for IOMMUFD, a host kernel with IOMMUFD support enabled (#1036).
Device metadata downward API (KEP-5304). When the DeviceMetadata feature gate is enabled, the kubelet plugin writes a DeviceMetadata JSON file (apiVersion metadata.resource.k8s.io/v1alpha1) per claim and injects it via CDI, exposing device attributes such as pciBusID to the workload (#1000).
numNodes in the ComputeDomain API is now optional. With IMEXDaemonsWithDNSNames=true (default), the field can be omitted. The default value is 0. The field remains deprecated and will be removed in a future release. When running with IMEXDaemonsWithDNSNames=false, set numNodes explicitly. The API server no longer rejects a missing value (#1081).
imagePullSecrets propagation to the ComputeDomain daemon. Secrets configured on the controller are now passed through to dynamically created CD daemon DaemonSet pods, resolving ImagePullBackOff against private registries on Kubernetes 1.35+ (#1033).
Upstream NFD GPU labels for kubelet plugin scheduling. The kubelet plugin DaemonSet now selects nodes using upstream Node Feature Discovery GPU labels in addition to NVIDIA-specific labels, allowing the chart to be used with upstream NFD without overrides (#1122).
ExtendedResources examples. Sample DeviceClasses and Pod specs for ExtendedResources requests are now included (#940).
Kubernetes 1.36 support added.
OpenShift 4.21 support added.
Plugin pods now use a higher startup-probe rate, providing faster readiness on healthy nodes (#872).
The controller now **tolerates `node-ro...
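
The leader election and Prometheus metrics settings listed above could be passed as Helm values roughly as follows. The value names are taken from these notes. The release name, namespace, and chart source are assumptions, and the snippet only prints the resulting command for review.

```shell
# Helm value names as given in the notes; two replicas is the minimum for HA.
HA_FLAGS="--set controller.replicas=2 --set controller.leaderElection.enabled=true"
METRICS_FLAGS="--set controller.metrics.enabled=true --set kubeletPlugin.metrics.enabled=true"

# Assumptions: release name, namespace, and chart source.
RELEASE="nvidia-dra-driver-gpu"
NAMESPACE="nvidia-dra-driver-gpu"
CHART="oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu"

FEATURE_CMD="helm upgrade -i $RELEASE $CHART --version 0.4.0 --namespace $NAMESPACE $HA_FLAGS $METRICS_FLAGS"

# Print the command for review; to run it, use: eval "$FEATURE_CMD"
echo "$FEATURE_CMD"
```

Both features are optional and independent, so either group of flags can be dropped.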

Contributors

dims, RobertNorthard, and 8 other contributors

12 Feb 14:55

jgehrcke

v25.12.0

0882da8

v25.12.0

This release marks the general availability of the GPU allocation plugin.

When upgrading from a previous release, please follow the upgrade instructions below.

Highlights

Dynamic MIG device management (demo, see remarks below).
Significant reduction of the ComputeDomain formation time in large-scale clusters (~10 seconds for domains comprised of thousands of nodes).

Improvements

Added preliminary support for VFIO passthrough devices in the GPU plugin (not enabled by default, set the PassthroughSupport feature gate, see #668).
Added the memory addressingMode as an attribute to announced GPU devices (#717).
Added support for GPU health checks (not enabled by default, use the NVMLDeviceHealthCheck feature gate, see #689).
Enhanced robustness of the ComputeDomain controller when multiple replicas are running, whether deliberately or accidentally (#868).
Made the ComputeDomain kubelet plugin fail fast (crash) on obvious MNNVL fabric configuration errors or degraded fabric health (#844, #865).
Added a networkPolicy parameter to the Helm chart to support clusters with restricted networks (#708).
Tuned binary search paths to widen Linux distribution support (#706).

All commits since last release: v25.8.1...v25.12.0

Upgrades

First, update CRDs by running

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/v25.12.0/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml

Only then update the chart by running helm upgrade -i ... (instead of helm install).

Dynamic MIG

We'd like your feedback on this alpha feature. Use it in a controlled environment and at your own risk, and please report any issues you observe.

Make sure you:

Use H100 GPUs or newer (A100 is not supported in this first release because it does not support enabling/disabling MIG mode freely).
Run Kubernetes (ideally v1.34+) with DRAPartitionableDevices support enabled for API server, scheduler, and kubelet.
Enable the DynamicMIG feature gate when deploying this driver.

Please keep these key concepts in mind:

When the feature gate is enabled, the kubelet plugin assumes full ownership of the MIG configuration on the node it runs on. On startup, it will attempt to tear down any unexpected MIG devices it finds.
Disable any other mechanism that creates, modifies, or manages MIG devices.
Correctly managing MIG state in a long-running cluster requires invasive cleanup routines. Please help us identify bugs in the current implementation and share feedback so we can evolve the cleanup design with confidence.

New feature gates and known limitations

For now, enabling DynamicMIG is mutually exclusive with enabling any of MPSSupport, NVMLDeviceHealthCheck, and PassthroughSupport.
The new fail-fast behavior in the CD kubelet plugin can be disabled with the new CrashOnNVLinkFabricErrors feature gate.
The scalability improvements for ComputeDomains come with a number of architectural changes under the hood. These can be disabled to restore 25.8.x behavior by disabling the ComputeDomainCliques feature gate.

17 Dec 17:06

jgehrcke

v25.8.1

fc69d98

v25.8.1

This release contains bug fixes as well as performance improvements and platform compatibility enhancements.

Fixes

Do not announce the full GPU as allocatable anymore when it is MIG-enabled (#719).
Adjust the startup probe for the GPU plugin to give device discovery more time before crash-cycling (#773, #774).
Fix a checkpoint state management bug in the GPU plugin (#755).

Improvements

Significantly shorten the time it takes to form a large ComputeDomain comprised of O(100) compute nodes (#732).
Conditionally set up service accounts with specific privileges upon Helm chart installation for OpenShift compatibility (thanks to Vitaliy Emporopulo, cf. #569).

Notable changes

The Go binaries are now built with Golang v1.25.5.
The NVIDIA container toolkit dependency was updated to v1.18.0.

All commits since last release: v25.8.0...v25.8.1

20 Oct 20:11

jgehrcke

v25.8.0

2ef1f2a

v25.8.0

This release introduces substantial improvements to the operational ergonomics and fault tolerance of ComputeDomains.

Installation instructions can be found here.

Important: When upgrading from a previous release, please follow the upgrade instructions below.

Warning: If you'd like to run the latest and greatest, please be advised that there is a known DRA issue in Kubernetes 1.34.0 and 1.34.1 -- we recommend using 1.33.x or waiting for 1.34.2 to be released in November 2025 (or applying one of the workarounds mentioned on the issue).

Highlights

Elasticity and fault tolerance: ComputeDomains were always described as following the workload, in terms of node placement. In 25.3.x releases, though, a ComputeDomain remained static after initial workload scheduling: it could not expand or shrink, and it could not incorporate a replacement node upon node failure. Now, a ComputeDomain dynamically responds to workload placement changes at runtime. For example, when a workload pod fails and subsequently gets scheduled to a new node (which was previously not part of the ComputeDomain), the domain now dynamically expands to the new node.
Ergonomics: With ComputeDomains now being elastic, the numNodes parameter for ComputeDomains lost relevance and can always be set to 0. Thus, one no longer needs a priori knowledge of the number of nodes required for a workload when creating a ComputeDomain. For details and caveats, carefully review the current numNodes specification. This field will be removed in a future version of the API provided by this DRA driver.
Scheduling latency improvement: individual workload pods in a ComputeDomain now get released (started) much faster: individual IMEX daemons now come online as soon as they are ready (without waiting for the entire ensemble to be formed). Specifically, an individual workload pod comes online when its corresponding local IMEX daemon is ready. This effectively removes a barrier from the system. Your workload must ensure that Multi-Node NVLink communication is only attempted once all relevant peers are online. To restore the previous barrier behavior, see below.

Improvements

A new allocationMode parameter can be used for the channel specification when creating a ComputeDomain. When set to all, all (currently 2048) IMEX channels get injected into each workload pod (#468, #506). See here for a usage example.
For larger-scale deployments, startup/liveness/readiness probes were adjusted for predictable and fast scheduling, and the kubelet plugins' rollingUpdate.maxUnavailable setting was changed to allow for smooth upgrades (#616).
For long-running deployments, a new cleanup routine was introduced to clean up after partially prepared claims (#672).
Following the principle of least privilege, we have further minimized the privileges assigned to individual components (#666).
A new chart-wide logVerbosity parameter was introduced to control overall verbosity of all driver components (#633).
The GPU index and minor number are no longer advertised; see the discussion in #624 and #563 for details.
Support was added for a validating admission webhook to inspect opaque configs (such as on ResourceClaims, see #461).
More diagnostics output is provided around NVLink clique ID determination (#630).
Individual components now systematically log their startup config (#658), and more output is shown during shutdown (#644).

Fixes

Stderr/out emitted by probes based on nvidia-imex-ctl is now displayed as part of kubectl describe output (#636).
Logs are now flushed on component shutdown (#661).

Other changes

More changes were made that have not yet been explicitly called out above. All pull requests that made it into this milestone are listed here. All commits since v25.3.2 can be reviewed here.

Upgrades

Procedure

We invested in making upgrades from version 25.3.2 of this driver work smoothly, without having to tear down workloads.

Before anything else, upgrade to version 25.3.2. This is required (and can be verified with, for example, helm list -A).

Then, follow this procedure:

Upgrade the CRD:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/v25.8.0/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml

Upgrade the Helm chart by running helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.8.0" --create-namespace --namespace nvidia-dra-driver-gpu [...] (followed by arguments for setting chart parameters as usual).

Restoring IMEX daemon ensemble barrier behavior

With the new IMEX daemon orchestration architecture, workload pods start when their local IMEX daemon is ready -- the underlying IMEX domain may not have fully formed yet.

Previously, all workload pods were released (started) only after the full IMEX domain had formed (according to the numNodes parameter). If your workload relies on this barrier, you can for now --set featureGates.IMEXDaemonsWithDNSNames=false upon Helm chart installation/upgrade. We advise making your workload cooperate with the new mode of operation, though, as the old behavior now has limited support and will eventually be phased out.
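
Combined with the upgrade command from the procedure above, restoring the barrier might look like the sketch below. It assumes the same release name, namespace, and nvidia/ chart repository alias as that command, and it only prints the result for review.

```shell
# Release name, chart, version, and namespace follow the upgrade procedure above;
# the feature gate flag is the one named in this section.
BARRIER_CMD="helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version=25.8.0 --create-namespace --namespace nvidia-dra-driver-gpu --set featureGates.IMEXDaemonsWithDNSNames=false"

# Print the command for review; to run it, use: eval "$BARRIER_CMD"
echo "$BARRIER_CMD"
```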

19 Sep 20:03

jgehrcke

v25.3.2

7020737

v25.3.2

This release ensures compatibility with the recently published Kubernetes 1.34.

Fixes

Dynamically install the correct DeviceClass version to fix compatibility with k8s 1.34 (#533).
Allow for downgrades from versions 25.8.x of this driver (allow for unknown fields when decoding opaque config JSON documents in the device unprepare code path, see #578).

Improvements

Add a readiness probe and relax the aggressiveness of the liveness and startup probes for the ComputeDomain daemon. This gives more time for convergence and fail-over via internal mechanisms before reporting a daemon as unhealthy and letting the kubelet cycle it (#541).
Enhance compatibility for custom IMEX binary paths (#556).
Make the IMEX daemon's command server listen on an interface that is reachable by other nodes in the same overlay network, to allow for individual nvidia-imex-ctl processes to reach all other daemons in the same IMEX domain (with the -N switch, see docs).

Commits since last release: v25.3.1...v25.3.2

12 Sep 14:31

jgehrcke

v25.3.1

42270a3

v25.3.1

Commits since last release: v25.3.0...v25.3.1

Fixes

Fixed kubelet plugin startup failing with [...] device container path [...] already exists on the host in some environments (#479).
Fixed a failing IMEX daemon startup probe in some environments using NVIDIA GPU driver 580+ (#510).
Fixed an edge case in informer cache handling which could result in a ComputeDomain never getting ready (#517).
Fixed a rare issue which could lead to permanently inconsistent ComputeDomain node label state (#518).
Fixed an edge case which prevented IMEX daemon startup because of a bad directory path (#519).
Fixed a rare problem in which a PrepareResources() retry attempt is performed after an authoritative UnprepareResources() action (#520).

Improvements

Added a kubeVersion constraint to the Helm chart, to fail chart installation immediately in environments that do not support DRA (#485).
Expanded logic that prevents rogue code from announcing resources in the name of this DRA driver (#476).

15 Aug 18:46

jgehrcke

v25.3.0

06db1a7

v25.3.0

This release marks the general availability of NVIDIA's DRA Driver for GPUs. It focuses on ComputeDomains, for robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems. Official support for GPU allocation will be part of the next release.

No functional changes were added compared to the last release candidate.

Documentation for this DRA driver can be found here.

For background on how ComputeDomains enable support for MNNVL workloads on Kubernetes (and on NVIDIA GB200 systems in particular), see this doc, this slide deck, and this conference talk.

An outlook on upcoming changes can be found here, and our GitHub milestones provide a more fine-grained view into features planned for upcoming releases.

31 Jul 19:29

jgehrcke

v25.3.0-rc.5

34f9832

v25.3.0-rc.5 Pre-release

Commits since last release: v25.3.0-rc.4...v25.3.0-rc.5.

Fixes

A race condition in the ComputeDomain controller that allowed uncontrolled creation of objects under informer lag was fixed (#440, #441).
The device name used to represent an individual GPU device in a ResourceSlice has been stabilized (#427, #428).
Processing of device requests has been consolidated to not overlap with other DRA drivers (#435).

Improvements

The GPU allocation side of the driver now supports the pcieRoot attribute, allowing for topologically aligned resource allocation across drivers (e.g. for GPU-NIC alignment; see #213, #400, #401, #429).
The nvidiaCDIHookPath Helm chart parameter has been added to allow using a pre-installed nvidia-cdi-hook executable instead of the embedded one (#430).

Notable changes

Updated the bundled NVIDIA Container Toolkit to v1.18.0-rc.2 (see release notes for rc.1, rc.2).
Driver internals are now based on the k8s.io/api/resource/v1 types and use latest components from the upcoming Kubernetes 1.34 release (#429).
Other, minor dependency updates were included (#422, #403, #437).

26 Jun 16:54

jgehrcke

v25.3.0-rc.4

a8d4243

v25.3.0-rc.4 Pre-release

Release notes

Commits since last release: v25.3.0-rc.3...v25.3.0-rc.4. Changes are summarized below.

Fixes

The logic for removing stale ComputeDomain node labels has been fixed and consolidated, which is especially important when workload pods are created and then deleted again in rapid succession (#404).
A ComputeDomain update (an IMEX daemon Pod IP change) now reliably leads to a daemon restart (#407).
The IMEX daemon's liveness probe's stderr is now collected (#407).
The IMEX daemon's log output is now reliably collected, especially around shutdown (#407).
Controller pod and IMEX daemon pods now explicitly run with NVIDIA_VISIBLE_DEVICES=void which addresses various error symptoms in some environments (#402).

Notable changes

Container images are now based on nvcr.io/nvidia/cuda:12.9.1-base-ubi9.
