Releases: NVIDIA/k8s-dra-driver-gpu
v25.8.0
This release introduces substantial improvements to the operational ergonomics and fault tolerance of ComputeDomains.
Installation instructions can be found here.
Important: When upgrading from a previous release, please follow the upgrade instructions below.
Warning: If you'd like to run the latest and greatest, please be advised that there is a known DRA issue in Kubernetes 1.34.0 and 1.34.1 -- we recommend using 1.33.x or waiting for 1.34.2 to be released in November 2025 (or applying one of the workarounds mentioned on the issue).
Highlights
- Elasticity and fault tolerance: ComputeDomains were always described as following the workload, in terms of node placement. In 25.3.x releases, though, a ComputeDomain remained static after initial workload scheduling: it could not expand or shrink, and it could not incorporate a replacement node upon node failure. Now, a ComputeDomain dynamically responds to workload placement changes at runtime. For example, when a workload pod fails and subsequently gets scheduled to a new node (which was previously not part of the ComputeDomain), the domain now dynamically expands to the new node.
- Ergonomics: With ComputeDomains now being elastic, the `numNodes` parameter for ComputeDomains lost relevance and can always be set to `0`. Thus, one no longer needs a priori knowledge of the number of nodes required for a workload when creating a ComputeDomain (see the sketch after this list). For details and caveats, carefully review the current `numNodes` specification. This field will be removed in a future version of the API provided by this DRA driver.
- Scheduling latency improvement: individual workload pods in a ComputeDomain now get released (started) much faster: individual IMEX daemons now come online as soon as they are ready (without waiting for the entire ensemble to be formed). Specifically, an individual workload pod comes online when its corresponding local IMEX daemon is ready. This effectively removes a barrier from the system. Your workload must ensure that Multi-Node NVLink communication is only attempted once all relevant peers are online. To restore the previous barrier behavior, see below.
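To illustrate, here is a minimal, hedged sketch of a ComputeDomain that relies on elasticity instead of a pre-computed node count. The `resource.nvidia.com/v1beta1` API version, the `channel.resourceClaimTemplate` layout, and all names are assumptions for illustration; consult the installation documentation for the authoritative manifest.

```sh
# Hedged sketch (not an authoritative example): create a ComputeDomain with
# numNodes set to 0 so that it follows workload placement dynamically.
# API version, field layout, and names below are assumptions -- verify against
# the CRD and docs shipped with v25.8.0.
kubectl apply -f - <<'EOF'
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: example-compute-domain
spec:
  numNodes: 0                       # no a priori node count required anymore
  channel:
    resourceClaimTemplate:
      name: example-compute-domain-channel   # referenced by workload pods
EOF
```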
Improvements
- A new `allocationMode` parameter can be used for the `channel` specification when creating a ComputeDomain. When set to `all`, all (currently 2048) IMEX channels get injected into each workload pod (#468, #506). See here for a usage example; a hedged sketch also follows this list.
- For larger-scale deployments, startup/liveness/readiness probes were adjusted for predictable and fast scheduling, and the kubelet plugins' `rollingUpdate.maxUnavailable` setting was changed to allow for smooth upgrades (#616).
- For long-running deployments, a new cleanup routine was introduced to clean up after partially prepared claims (#672).
- Following the principle of least privilege, we have further minimized the privileges assigned to individual components (#666).
- A new chart-wide `logVerbosity` parameter was introduced to control the overall verbosity of all driver components (#633).
- The GPU index and minor number are no longer advertised; see the discussion in #624 and #563 for details.
- Support was added for a validating admission webhook to inspect opaque configs (such as on ResourceClaims, see #461).
- More diagnostics output is provided around NVLink clique ID determination (#630).
- Individual components now systematically log their startup config (#658), and more output is shown during shutdown (#644).
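As referenced above, a hedged sketch of the `allocationMode` setting for the channel specification. The exact field placement within the ComputeDomain spec is an assumption based on the description above; the linked usage example is authoritative.

```sh
# Hedged sketch: request that all (currently 2048) IMEX channels be injected
# into each workload pod. The placement of allocationMode under spec.channel
# is an assumption -- verify against the v25.8.0 CRD and the linked example.
kubectl apply -f - <<'EOF'
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: all-channels-domain
spec:
  numNodes: 0
  channel:
    allocationMode: all             # "all" injects every IMEX channel
    resourceClaimTemplate:
      name: all-channels-domain-channel
EOF
```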
Fixes
- Stderr/out emitted by probes based on `nvidia-imex-ctl` is now displayed as part of `kubectl describe` output (#636).
- Logs are now flushed on component shutdown (#661).
Other changes
Further changes were made that are not explicitly called out above. All pull requests that made it into this milestone are listed here. All commits since v25.3.2 can be reviewed here.
Upgrades
Procedure
We invested in making upgrades from version 25.3.2 of this driver work smoothly, without having to tear down workloads.
First, make sure you are running version 25.3.2 of this driver; this is required (and can be verified with, for example, `helm list -A`).
Then, follow this procedure:
- Upgrade the CRD: `kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/v25.8.0/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml`
- Upgrade the Helm chart by running `helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.8.0" --create-namespace --namespace nvidia-dra-driver-gpu [...]` (followed by arguments for setting chart parameters, as usual). A consolidated sketch of both steps follows below.
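Put together, a hedged end-to-end sketch of the procedure above; the `grep`-based version check is illustrative only, and the commands otherwise mirror the steps just listed.

```sh
# Hedged sketch of the upgrade path described above (25.3.2 -> 25.8.0).

# 1) Verify that the currently installed release is at chart version 25.3.2.
helm list -A | grep nvidia-dra-driver-gpu

# 2) Upgrade the ComputeDomain CRD.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/v25.8.0/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml

# 3) Upgrade the Helm chart (append chart parameters via --set as usual).
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version="25.8.0" \
  --create-namespace \
  --namespace nvidia-dra-driver-gpu
```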
Restoring IMEX daemon ensemble barrier behavior
With the new IMEX daemon orchestration architecture, workload pods start when their local IMEX daemon is ready -- the underlying IMEX domain may not have fully formed yet.
Previously, all workload pods were released (started) only after the full IMEX domain had formed (according to the `numNodes` parameter). If your workload relies on this barrier, you can for now pass `--set featureGates.IMEXDaemonsWithDNSNames=false` upon Helm chart installation/upgrade. It is advised to adapt your workload to the new mode of operation, though, as the old behavior now has limited support and will eventually be phased out.
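For example, a hedged sketch of disabling the feature gate during chart installation/upgrade; the release name, namespace, and version mirror the upgrade command earlier in these notes.

```sh
# Hedged sketch: restore the previous barrier behavior (workload pods start
# only once the full IMEX domain has formed) by disabling the feature gate.
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version="25.8.0" \
  --namespace nvidia-dra-driver-gpu \
  --set featureGates.IMEXDaemonsWithDNSNames=false
```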
v25.3.2
This release ensures compatibility with the recently published Kubernetes 1.34.
Fixes
- Dynamically install the correct `DeviceClass` version to fix compatibility with k8s 1.34 (#533).
- Allow for downgrades from versions 25.8.x of this driver (allow for unknown fields when decoding opaque config JSON documents in the device unprepare code path, see #578).
Improvements
- Add a readiness probe and relax the aggressiveness of the liveness and startup probes for the ComputeDomain daemon. This gives more time for convergence and fail-over via internal mechanisms before reporting a daemon as unhealthy and letting the kubelet cycle it (#541).
- Enhance compatibility for custom IMEX binary paths (#556).
- Make the IMEX daemon's command server listen on an interface that is reachable by other nodes in the same overlay network, to allow for individual `nvidia-imex-ctl` processes to reach all other daemons in the same IMEX domain (with the `-N` switch, see docs). A hedged usage sketch follows this list.
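As referenced above, a hedged sketch of querying all daemons in an IMEX domain from inside one of the ComputeDomain daemon pods; the namespace and pod name are placeholders, and only the `-N` switch is taken from the note above.

```sh
# Hedged sketch: run nvidia-imex-ctl with -N inside a ComputeDomain IMEX daemon
# pod to query the status of all daemons in the same IMEX domain.
# Namespace and pod name are placeholders -- substitute values from your cluster.
kubectl -n nvidia-dra-driver-gpu exec <compute-domain-daemon-pod> -- nvidia-imex-ctl -N
```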
Commits since last release: v25.3.1...v25.3.2
v25.3.1
Commits since last release: v25.3.0...v25.3.1
Fixes
- Fixed kubelet plugin startup failing with `[...] device container path [...] already exists on the host` in some environments (#479).
- Fixed a failing IMEX daemon startup probe in some environments using NVIDIA GPU driver 580+ (#510).
- Fixed an edge case in informer cache handling which could result in a ComputeDomain never getting ready (#517).
- Fixed a rare issue which could lead to permanently inconsistent ComputeDomain node label state (#518).
- Fixed an edge case which prevented IMEX daemon startup because of a bad directory path (#519).
- Fixed a rare problem in which a `PrepareResources()` retry attempt is performed after an authoritative `UnprepareResources()` action (#520).
v25.3.0
This release marks the general availability of NVIDIA's DRA Driver for GPUs. It focuses on ComputeDomains, for robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems. Official support for GPU allocation will be part of the next release.
No functional changes were added compared to the last release candidate.
Documentation for this DRA driver can be found here.
For background on how ComputeDomains enable support for MNNVL workloads on Kubernetes (and on NVIDIA GB200 systems in particular), see this doc, this slide deck, and this conference talk.
An outlook on upcoming changes can be found here, and our GitHub milestones provide a more fine-grained view into features planned for upcoming releases.
v25.3.0-rc.5
Commits since last release: v25.3.0-rc.4...v25.3.0-rc.5.
Fixes
- A race condition in the `ComputeDomain` controller was fixed which allowed for uncontrolled creation of objects under informer lag (#440, #441).
- The device name used to represent an individual GPU device in a `ResourceSlice` has been stabilized (#427, #428).
- Processing of device requests has been consolidated to not overlap with other DRA drivers (#435).
Improvements
- The GPU allocation side of the driver now supports the `pcieRoot` attribute, allowing for topologically aligned resource allocation across drivers (e.g. for GPU-NIC alignment, see #213, #400, #401, #429). A hedged sketch follows this list.
- The `nvidiaCDIHookPath` Helm chart parameter has been added to allow for using a pre-installed `nvidia-cdi-hook` executable instead of the embedded one (#430).
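As referenced above, a hedged sketch of what cross-driver alignment on a shared PCIe root could look like in a DRA ResourceClaim. The fully qualified attribute name, the API version, the NIC device class, and the overall constraint layout are assumptions for illustration, not taken from this driver's documentation.

```sh
# Hedged sketch: request a GPU and a NIC constrained to the same PCIe root via
# a DRA matchAttribute constraint. The attribute name, API version, and the NIC
# device class below are assumptions -- consult the respective driver docs.
kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-nic-aligned
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
    - name: nic
      deviceClassName: example-nic.example.com   # placeholder NIC DRA driver class
    constraints:
    - requests: ["gpu", "nic"]
      matchAttribute: resource.kubernetes.io/pcieRoot
EOF
```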
Notable changes
- Updated the bundled NVIDIA Container Toolkit to `v1.18.0-rc.2` (see release notes for rc.1, rc.2).
- Driver internals are now based on the `k8s.io/api/resource/v1` types and use the latest components from the upcoming Kubernetes 1.34 release (#429).
- Other, minor dependency updates were included (#422, #403, #437).
v25.3.0-rc.4
Release notes
Commits since last release: v25.3.0-rc.3...v25.3.0-rc.4. Changes are summarized below.
Fixes
- The logic for removing stale `ComputeDomain` node labels has been fixed and consolidated, which is especially important when workload pods are created and then deleted again in rapid succession (#404).
- A `ComputeDomain` update (an IMEX daemon Pod IP change) was not reliably leading to a daemon restart (#407).
- The IMEX daemon's liveness probe's stderr was not collected (#407).
- The IMEX daemon's log output was not reliably collected, especially around shutdown (#407).
- Controller pod and IMEX daemon pods now explicitly run with `NVIDIA_VISIBLE_DEVICES=void`, which addresses various error symptoms in some environments (#402).
Notable changes
- Container images are now based on `nvcr.io/nvidia/cuda:12.9.1-base-ubi9`.
v25.3.0-rc.3
Release notes
This release is an important milestone towards the general availability of the NVIDIA DRA Driver for GPUs. It focuses on improving support for NVIDIA's Multi-Node NVLink (MNNVL) in Kubernetes by delivering a number of ComputeDomain improvements and bug fixes.
All commits since the last release can be seen here: v25.3.0-rc.2...v25.3.0-rc.3. The changes are summarized below.
For background on how ComputeDomains enable support for MNNVL workloads on Kubernetes (and on NVIDIA GB200 systems in particular), see this doc and this slide deck.
Improvements
- More predictable `ComputeDomain` cleanup semantics: deletion of a `ComputeDomain` is now immediately followed by resource teardown (instead of waiting for workload to complete).
- Troubleshooting improvement: a new init container helps users set a correct value for the `nvidiaDriverRoot` Helm chart variable and overcome common GPU driver setup issues (see the sketch after this list).
- All driver components are now based on the same container image (configurable via Helm chart variable). This removes a dependency on Docker Hub and generally helps with compliance and reliability.
- IMEX daemons orchestrated by a `ComputeDomain` now communicate via Pod IP (using a virtual overlay network instead of `hostNetwork: true`) to improve robustness and security.
- The dependency on a pre-provisioned NVIDIA Container Toolkit was removed.
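As referenced above, a hedged sketch of setting `nvidiaDriverRoot` at install/upgrade time. The value shown is only a common choice for driver-container-based setups (`/` is typical for host-installed drivers); the release and namespace names mirror the commands elsewhere in these notes.

```sh
# Hedged sketch: tell the driver where the NVIDIA GPU driver is installed on the
# host. "/run/nvidia/driver" is a common path when a driver container manages the
# driver; use "/" for drivers installed directly on the host.
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --create-namespace \
  --namespace nvidia-dra-driver-gpu \
  --set nvidiaDriverRoot=/run/nvidia/driver
```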
Fixes
- `ComputeDomain` teardown now works even after a corresponding `ResourceClaim` was removed from the API server (#342).
- Fixed an issue where the IMEX daemon startup probe failed with “family not supported by protocol” (#328).
- Pod labels were adjusted so that e.g. `kubectl logs ds/nvidia-dra-driver-gpu-kubelet-plugin` actually yields plugin logs (#355).
- The IMEX daemon startup probe is now less timing-sensitive (d1f7c).
- Other minor fixes: #321, #334.
Notable changes
- Introduced an IMEX daemon wrapper allowing for more robust and flexible daemon reconfiguration and monitoring.
- Added support for the NVIDIA GPU Driver 580.x releases.
- Added support for the Blackwell GPU architecture in the GPU plugin of the DRA driver.
- The DRA library was updated to `v0.33.0` (cf. changes) for various robustness improvements (such as more reliable rolling upgrades).
Breaking changes
- The `nvidiaCtkPath` Helm chart variable does not need to be provided anymore (see above); doing so now results in an error.
The path forward
ComputeDomains
Future versions of the NVIDIA GPU driver (580+) will include IMEX daemons with support for communicating using DNS names in addition to raw IP addresses. This feature allows us to overcome a number of limitations inherent to the existing ComputeDomain implementation – with no breaking changes to the user-facing API.
Highlights include:
- Removal of the `numNodes` field in the `ComputeDomain` abstraction. Users will no longer need to pre-calculate how many nodes their (static) multi-node workload will ultimately span.
- Support for elastic workloads. The number of pods associated with a multi-node workload will no longer need to be fixed and forced to match the value of the `numNodes` field in the `ComputeDomain` the workload is running in.
- Support for running more than one pod per `ComputeDomain` on a given node. As of now, all pods of a multi-node workload are artificially forced to run on different nodes, even if there are enough GPUs on a single node to service more than one of them. This new feature will remove this restriction.
- Support for running pods from different `ComputeDomain`s on the same node. As of now, only one pod from any multi-node workload is allowed to run on a given node associated with a `ComputeDomain` (even if there are enough GPUs available to service more than one of them). This new feature will remove this restriction.
- Improved tolerance to node failures within an IMEX domain. As of now, if one node of an IMEX domain goes down, the entire workload needs to be shut down and rescheduled. This new feature will allow the failed node to be swapped in-place, without needing to shut down the entire IMEX domain (of course, many types of failures may still require the workloads to restart anyway to explicitly recover from a loss of state).
We also plan on adding improvements to the general debuggability and observability of ComputeDomains, including:
- Proper definition of a set of high-level states that a `ComputeDomain` can be in, to allow for robust automation.
- Export of metrics to allow for monitoring health and performance.
- Actionable error messages and description strings as well as improved component logging for facilitating troubleshooting.
GPUs
The upcoming 25.3.0 release will not include official support for allocating GPUs (only ComputeDomains). In the following release (25.8.0), we will add official support for allocating GPUs. The 25.8.0 release will be integrated with the NVIDIA GPU Operator and will no longer need to be installed as a standalone Helm chart.
Note: The DRA feature in upstream Kubernetes is slated to go GA in August. The 25.8.0 release of the NVIDIA DRA driver for GPUs is planned to coincide with that.
Features we plan to include in the 25.8.0 release:
- GPU selection via complex constraints
- Support for having multiple GPU types per node
- Controlled GPU sharing via ResourceClaims
- User-mediated Time-slicing across a subset of GPUs on a node
- User-mediated MPS sharing across a subset of GPUs on a node
- Allocation of statically partitioned MIG devices
- Custom policies to align multiple resource types (e.g. GPUs, CPUs, and NICs)
Features for future releases in the near term:
- Dynamic allocation of MIG devices
- System-mediated sharing of GPUs via Time-slicing and MPS
- “Management” pods with access to all GPUs / MIG devices without allocating them
- Dynamic swapping of NVIDIA driver with vfio driver depending on intended use of GPU
- Ability to use DRA to allocate GPUs with the “traditional” API (e.g. `nvidia.com/gpu: 2`)