Support VFIO passthrough #668

varunrsekar · 2025-10-11T00:07:52Z

Design

Introduce a new feature gate, PassthroughSupport
Discover GPUs using the github.com/NVIDIA/go-nvlib/nvpci package
Publish them in the node's resourceslice as devices of type vfio with canonicalName gpu-vfio-<idx>
Create a new deviceclass vfio.gpu.nvidia.com
Create a new set of CDI devices in a new file k8s.gpu.nvidia.com-device_vfio.yaml:

k8s.gpu.nvidia.com/device=gpu-vfio-0
k8s.gpu.nvidia.com/device=gpu-vfio-1

A successful NodePrepare removes the device of type gpu with the same PCIBusID from the resourceslice.
A successful NodeUnprepare rediscovers the device of type gpu with the same PCIBusID and adds it back to the resourceslice.
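
The exclusion between the gpu and vfio devices of one physical GPU can be sketched as a filter over the advertised device list. This is a minimal illustration, not the driver's code: the `device` struct, the `advertisable`, `vfioName` and `cdiName` helpers, and the sample PCI addresses are all hypothetical, and treating the exclusion as symmetric (a prepared gpu device also hides its vfio sibling) is an assumption.

```go
package main

import "fmt"

// device is a hypothetical, simplified stand-in for a resourceslice entry.
type device struct {
	Name     string // canonical name, e.g. "gpu-0" or "gpu-vfio-0"
	Type     string // "gpu" or "vfio"
	PCIBusID string
}

// vfioName builds the canonical name described in the design.
func vfioName(idx int) string { return fmt.Sprintf("gpu-vfio-%d", idx) }

// cdiName builds the fully qualified CDI device name for a vfio device.
func cdiName(name string) string { return "k8s.gpu.nvidia.com/device=" + name }

// advertisable returns the devices that may stay in the resourceslice.
// prepared maps a PCI bus ID to the type of the device currently prepared
// on that GPU. A device is dropped when its GPU is prepared as another type.
func advertisable(all []device, prepared map[string]string) []device {
	var out []device
	for _, d := range all {
		if t, ok := prepared[d.PCIBusID]; ok && t != d.Type {
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	all := []device{
		{Name: "gpu-0", Type: "gpu", PCIBusID: "0000:3b:00.0"},
		{Name: vfioName(0), Type: "vfio", PCIBusID: "0000:3b:00.0"},
		{Name: "gpu-1", Type: "gpu", PCIBusID: "0000:af:00.0"},
		{Name: vfioName(1), Type: "vfio", PCIBusID: "0000:af:00.0"},
	}
	// NodePrepare succeeded for the vfio device on the first GPU.
	prepared := map[string]string{"0000:3b:00.0": "vfio"}
	for _, d := range advertisable(all, prepared) {
		fmt.Println(d.Type, d.Name)
	}
	fmt.Println(cdiName(vfioName(0)))
}
```

After NodeUnprepare the entry is removed from `prepared`, so the same filter lets the rediscovered gpu device back into the list.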

GPU Discovery

Invoke nvpci.GetGPUs (github.com/NVIDIA/go-nvlib/nvpci)
Publish all discovered GPUs as devices of type vfio in the resourceslice
Set the parent of the vfio device to the device of type gpu discovered through NVML (the parent is nil if the device is not bound to the nvidia driver)
Publish new CDI file (k8s.gpu.nvidia.com-device_vfio.yaml) with the discovered vfio devices
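
To illustrate what PCI-level discovery involves, here is a stdlib-only sketch that scans sysfs directly. It is not the nvpci package and does not reproduce its API: `isNvidiaGPU`, `discover` and `pciGPU` are hypothetical names. It relies on the NVIDIA PCI vendor ID (0x10de) and the VGA (0x0300) and 3D controller (0x0302) device classes, and reads the bound driver from the `driver` symlink to decide whether an NVML-discovered parent can exist.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

const nvidiaVendorID = 0x10de

// isNvidiaGPU takes the contents of a device's sysfs "vendor" and "class"
// files and reports whether they describe an NVIDIA VGA or 3D controller.
func isNvidiaGPU(vendor, class string) bool {
	v, err := strconv.ParseUint(strings.TrimSpace(vendor), 0, 32)
	if err != nil || v != nvidiaVendorID {
		return false
	}
	c, err := strconv.ParseUint(strings.TrimSpace(class), 0, 32)
	if err != nil {
		return false
	}
	base := c >> 8 // drop the programming-interface byte
	return base == 0x0300 || base == 0x0302
}

// pciGPU is a hypothetical record for one discovered GPU.
type pciGPU struct {
	Address string // PCI address, e.g. "0000:3b:00.0"
	Driver  string // bound kernel driver, empty if unbound
}

// discover lists NVIDIA GPUs found under <sysfsRoot>/bus/pci/devices.
func discover(sysfsRoot string) ([]pciGPU, error) {
	base := filepath.Join(sysfsRoot, "bus/pci/devices")
	entries, err := os.ReadDir(base)
	if err != nil {
		return nil, err
	}
	var gpus []pciGPU
	for _, e := range entries {
		dir := filepath.Join(base, e.Name())
		vendor, errV := os.ReadFile(filepath.Join(dir, "vendor"))
		class, errC := os.ReadFile(filepath.Join(dir, "class"))
		if errV != nil || errC != nil || !isNvidiaGPU(string(vendor), string(class)) {
			continue
		}
		g := pciGPU{Address: e.Name()}
		// The "driver" symlink is absent when no driver is bound.
		if target, err := os.Readlink(filepath.Join(dir, "driver")); err == nil {
			g.Driver = filepath.Base(target)
		}
		gpus = append(gpus, g)
	}
	return gpus, nil
}

func main() {
	gpus, err := discover("/sys")
	if err != nil {
		fmt.Println("discovery failed:", err)
		return
	}
	for i, g := range gpus {
		// Only a GPU bound to the nvidia driver is visible to NVML.
		fmt.Printf("gpu-vfio-%d addr=%s driver=%q nvmlParent=%t\n",
			i, g.Address, g.Driver, g.Driver == "nvidia")
	}
}
```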

NodePrepare

Verify there are no active GPU clients on the device by checking for open FDs against <driver-root>/dev/nvidia<minor> (time out after 60s)
Verify there are no VFs on the GPU
Configure the GPU for passthrough using the newly introduced shell scripts scripts/unbind_from_driver.sh and scripts/bind_to_driver.sh.
Remove the parent GPU device of type gpu in the resourceslice
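
The two pre-checks above can be sketched roughly as follows. This is an assumption-laden illustration rather than the PR's implementation (the commit list mentions fuser for the busy check): `deviceBusy`, `numVFs` and `waitUntilFree` are hypothetical helpers, the busy check compares /proc/<pid>/fd symlink targets with the device path, and the VF check reads the kernel's `sriov_numvfs` attribute. The PCI address and device minor in `main` are placeholders.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

// deviceBusy reports whether any process visible under procRoot holds an
// open file descriptor whose symlink target equals devPath. Assumption:
// the device path appears in /proc exactly as given here.
func deviceBusy(procRoot, devPath string) (bool, error) {
	fds, err := filepath.Glob(filepath.Join(procRoot, "[0-9]*", "fd", "*"))
	if err != nil {
		return false, err
	}
	for _, fd := range fds {
		// Processes may exit during the scan, so unreadable links are skipped.
		if target, err := os.Readlink(fd); err == nil && target == devPath {
			return true, nil
		}
	}
	return false, nil
}

// numVFs returns the number of enabled SR-IOV virtual functions for the
// device. A missing sriov_numvfs file is treated as zero VFs.
func numVFs(sysfsRoot, addr string) (int, error) {
	p := filepath.Join(sysfsRoot, "bus/pci/devices", addr, "sriov_numvfs")
	data, err := os.ReadFile(p)
	if os.IsNotExist(err) {
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

// waitUntilFree polls check until it reports "not busy" or timeout elapses.
func waitUntilFree(check func() (bool, error), timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		busy, err := check()
		if err != nil {
			return err
		}
		if !busy {
			return nil
		}
		if !time.Now().Before(deadline) {
			return fmt.Errorf("device still busy after %s", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	check := func() (bool, error) { return deviceBusy("/proc", "/dev/nvidia0") }
	if err := waitUntilFree(check, 60*time.Second, time.Second); err != nil {
		fmt.Println("not ready for passthrough:", err)
		return
	}
	n, err := numVFs("/sys", "0000:3b:00.0")
	if err != nil || n > 0 {
		fmt.Println("not ready for passthrough: VFs =", n, "err =", err)
		return
	}
	fmt.Println("device is free and has no VFs")
}
```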

NodeUnprepare

Rebind the GPU to the nvidia driver using the newly introduced shell scripts scripts/unbind_from_driver.sh and scripts/bind_to_driver.sh.
Rediscover the parent GPU device (its device minor may have changed) and add it back to the resourceslice
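
The contents of the two scripts are not shown in this thread. As a rough sketch of what a driver switch involves in either direction, the following uses the kernel's generic PCI sysfs attributes (`driver/unbind`, `driver_override`, `drivers_probe`). Whether the scripts use exactly this sequence is an assumption; `sysfsWrite`, `rebindSteps` and `apply` are hypothetical names, and the PCI address is a placeholder. `main` only prints the planned writes and does not touch the system.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sysfsWrite is one planned write of Value into the sysfs file at Path.
type sysfsWrite struct {
	Path  string
	Value string
}

// rebindSteps plans the writes that move the PCI device at addr to driver.
// bound says whether a driver is currently attached and must be unbound first.
func rebindSteps(sysfsRoot, addr, driver string, bound bool) []sysfsWrite {
	dev := filepath.Join(sysfsRoot, "bus/pci/devices", addr)
	var steps []sysfsWrite
	if bound {
		// Detach the device from its current driver.
		steps = append(steps, sysfsWrite{filepath.Join(dev, "driver", "unbind"), addr})
	}
	steps = append(steps,
		// Restrict which driver may claim the device on the next probe.
		sysfsWrite{filepath.Join(dev, "driver_override"), driver},
		// Ask the PCI core to probe the device again.
		sysfsWrite{filepath.Join(sysfsRoot, "bus/pci/drivers_probe"), addr},
	)
	return steps
}

// apply performs the planned writes in order and stops at the first failure.
func apply(steps []sysfsWrite) error {
	for _, s := range steps {
		if err := os.WriteFile(s.Path, []byte(s.Value), 0o200); err != nil {
			return fmt.Errorf("write %q to %s: %w", s.Value, s.Path, err)
		}
	}
	return nil
}

func main() {
	const addr = "0000:3b:00.0" // placeholder PCI address
	fmt.Println("# NodePrepare: nvidia -> vfio-pci")
	for _, s := range rebindSteps("/sys", addr, "vfio-pci", true) {
		fmt.Printf("echo %s > %s\n", s.Value, s.Path)
	}
	fmt.Println("# NodeUnprepare: vfio-pci -> nvidia")
	for _, s := range rebindSteps("/sys", addr, "nvidia", true) {
		fmt.Printf("echo %s > %s\n", s.Value, s.Path)
	}
	// apply(...) is left out of the dry run on purpose; it needs root and /sys read-write.
}
```

Separating the plan from the writes keeps the sequence easy to log and to test without a GPU. Note that the unbind write can block in the kernel while the device is busy, so a real caller would want the pre-checks to pass first.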

Deployment

The kubeletplugin runs as a privileged pod, as root
Mount host paths / (read-only), /sys/ (read-write) and /proc/ (read-write)

Testing

Verified with a KubeVirt VM using the sample resourceclaimtemplate spec from demo/specs/quickstart/gpu-test-vfiopci.yaml
Regression tests with the specs under https://github.com/NVIDIA/k8s-dra-driver-gpu/tree/main/demo/specs/quickstart/:
1. With the feature gate enabled: PASS
2. With the feature gate disabled: PASS

TODO

Use standard attribute name for PCI Bus ID
Use the nvidia-container-toolkit library to generate the CDI device file (pending NVIDIA/nvidia-container-toolkit#315, "Add vfio mode to generate CDI specs for NVIDIA passthrough GPUs")
Active GPU VF enablement/disablement
Verify there are no vGPU/MIG devices before advertising a VFIO device.

varunrsekar · 2025-10-22T17:29:40Z

Updates:

Use chroot to use modprobe from nvidiaDriverRoot
During NodePrepare/NodeUnprepare, sync advertised resources to prevent different device types from being allocatable for the same GPU
Add soft-check for VFs to be disabled before attempting driver unbind
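
For illustration, running modprobe from the driver root could look roughly like the sketch below. The `modprobeCmd` helper is hypothetical, and the `/run/nvidia/driver` root and the `vfio-pci` module name are assumptions chosen for the example; the PR does not state them here. The sketch shells out to the `chroot` utility so that modprobe and its module tree are resolved inside the driver root.

```go
package main

import (
	"fmt"
	"os/exec"
)

// modprobeCmd builds the command that loads module using the modprobe
// binary found inside driverRoot. An empty root or "/" means the host
// root is used directly, so no chroot is needed.
func modprobeCmd(driverRoot, module string) *exec.Cmd {
	if driverRoot == "" || driverRoot == "/" {
		return exec.Command("modprobe", module)
	}
	return exec.Command("chroot", driverRoot, "modprobe", module)
}

func main() {
	cmd := modprobeCmd("/run/nvidia/driver", "vfio-pci")
	// Print instead of running: executing this requires root privileges.
	fmt.Println(cmd.Args)
}
```

A caller would invoke `cmd.CombinedOutput()` and surface the output on failure, since modprobe errors are otherwise hard to diagnose from the plugin logs.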

shivamerla · 2025-10-31T20:37:26Z

LGTM! Thanks @varunrsekar for being patient with this change. We need to follow up on the blockers below before we move this feature out of alpha.

Need a robust way to evict all GPU clients (as pods or systemd services) before performing unbind from the nvidia driver.
Need to identify ways to restart nvidia-persistenced without any side-effects to other GPU workloads.
Need to handle potential blocking unbind calls in the kernel if the device is busy.
Need to ensure we don't set up health monitoring using the NVML event watcher for devices bound to vfio-pci.

cc @klueska to further review supporting this feature behind an alpha feature gate.

use chroot to run modprobe
deadvertise sibling devices on preparation
soft check for VFs before attempting unbind
address review comments
address comments (2)
use fuser to check if gpu is free
remove unnecessary securityContext

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

varunrsekar · 2025-10-31T22:22:06Z

Updates:

Squashed all previous commits
If MIG devices are present, VFIO devices are not advertised.
Updated regression testing in description

github-project-automation bot added this to Planning Board: k8s-dra-driver-gpu Oct 11, 2025

github-project-automation bot moved this to Backlog in Planning Board: k8s-dra-driver-gpu Oct 11, 2025

This was referenced Oct 11, 2025

Support for allocating GPUs in Passthrough-Mode #183 (Closed)

Add vfio mode to generate CDI specs for NVIDIA passthrough GPUs NVIDIA/nvidia-container-toolkit#315 (Open)

varunrsekar force-pushed the vfio-support-1.33 branch 2 times, most recently from 8df3681 to 298704d Compare October 15, 2025 00:32

shivamerla reviewed Oct 15, 2025