@varunrsekar varunrsekar commented Oct 11, 2025

Design

  1. Introduce new featuregate PassthroughSupport
  2. Discover GPUs using github.com/NVIDIA/go-nvlib/nvpci pkg
  3. Publish them in the node's resourceslice as devices of type vfio with canonicalName gpu-vfio-<idx>
  4. Create a new deviceclass vfio.gpu.nvidia.com
  5. Create a new set of CDI devices in a new file k8s.gpu.nvidia.com-device_vfio.yaml:
       k8s.gpu.nvidia.com/device=gpu-vfio-0
       k8s.gpu.nvidia.com/device=gpu-vfio-1
  6. A successful NodePrepare will cause the device of type gpu for the same PCIBusID to be removed from the resourceslice.
  7. A successful NodeUnprepare will cause the device of type gpu for the same PCIBusID to be rediscovered and added back to the resourceslice.
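
As a rough illustration of the naming scheme in the design list, the sketch below derives the canonical name and the fully qualified CDI device name from a device index. The helper names (vfioCanonicalName, vfioCDIDeviceName) are hypothetical and are not taken from the PR; only the gpu-vfio-<idx> pattern and the k8s.gpu.nvidia.com/device kind come from the description.

```go
package main

import "fmt"

// cdiKind is the CDI kind used by the device names listed in the description.
const cdiKind = "k8s.gpu.nvidia.com/device"

// vfioCanonicalName returns the canonical name of the idx-th passthrough GPU.
// Hypothetical helper, for illustration only.
func vfioCanonicalName(idx int) string {
	return fmt.Sprintf("gpu-vfio-%d", idx)
}

// vfioCDIDeviceName returns the fully qualified CDI device name,
// which has the form "<kind>=<name>".
func vfioCDIDeviceName(idx int) string {
	return fmt.Sprintf("%s=%s", cdiKind, vfioCanonicalName(idx))
}

func main() {
	for i := 0; i < 2; i++ {
		fmt.Println(vfioCDIDeviceName(i))
	}
}
```

Running this prints the two CDI device names shown in item 5.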

GPU Discovery

  1. Invoke nvpci.GetGPUs (github.com/NVIDIA/go-nvlib/nvpci)
  2. Publish all discovered GPUs as devices of type vfio in the resourceslice
  3. Set the parent of the vfio device to the device of type gpu discovered from NVML (the parent is nil if the device is not bound to the nvidia driver)
  4. Publish new CDI file (k8s.gpu.nvidia.com-device_vfio.yaml) with the discovered vfio devices

NodePrepare

  1. Verify there are no active gpu clients on the device by checking for open FDs against <driver-root>/dev/nvidia<minor> (timeout after 60s)
  2. Verify there are no VFs on the GPU
  3. Configure the GPU for passthrough using the newly introduced scripts/unbind_from_driver.sh and scripts/bind_to_driver.sh shell scripts.
  4. Remove the parent GPU device of type gpu in the resourceslice

NodeUnprepare

  1. Bind the GPU back to the nvidia driver using the newly introduced scripts/unbind_from_driver.sh and scripts/bind_to_driver.sh shell scripts.
  2. Rediscover the parent GPU device (as its device minor might have changed) and add it back to the resourceslice

Deployment

  1. kubeletplugin is a privileged pod running as root
  2. Mount host paths / (read-only), /sys/ (read-write), and /proc/ (read-write)

Testing

  1. Verified with Kubevirt VM using sample resourceclaimtemplate spec from demo/specs/quickstart/gpu-test-vfiopci.yaml
  2. Regression tests with the specs under https://github.com/NVIDIA/k8s-dra-driver-gpu/tree/main/demo/specs/quickstart/:
    1. With FG enabled: PASS
    2. With FG disabled: PASS

TODO

  1. Use standard attribute name for PCI Bus ID
  2. Use nvidia-container-toolkit library to generate the CDI device file (pending nvidia-container-toolkit#315, "Add vfio mode to generate CDI specs for NVIDIA passthrough GPUs")
  3. Active GPU VF enablement/disablement
  4. Verify there are no vGPU/MIG devices before advertising a VFIO device.


copy-pr-bot bot commented Oct 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@varunrsekar
Author

Updates:

  • Use chroot to use modprobe from nvidiaDriverRoot
  • During NodePrepare/NodeUnprepare, sync advertised resources to prevent different device types from being allocatable for the same GPU
  • Add a soft check that VFs are disabled before attempting the driver unbind

@varunrsekar varunrsekar marked this pull request as ready for review October 31, 2025 18:27
@shivamerla
Contributor

LGTM! Thanks @varunrsekar for being patient with this change. We have to follow up on the blockers below when we move this feature out of alpha.

  • Need a robust way to evict all GPU clients (as pods or systemd services) before performing unbind from the nvidia driver.
  • Need to identify ways to restart nvidia-persistenced without any side-effects to other GPU workloads.
  • Need to handle potential blocking unbind calls in the kernel if the device is busy.
  • Need to ensure we don't set up health monitoring for devices bound to vfio-pci using the NVML event watcher.

cc @klueska to further review supporting this feature behind an alpha feature gate.

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

use chroot to run modprobe

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

deadvertise sibling devices on preparation

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

soft check for VFs before attempting unbind

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

address review comments

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

address comments (2)

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

use fuser to check if gpu is free

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

remove unnecessary securityContext

Signed-off-by: Varun Ramachandra Sekar <[email protected]>
@varunrsekar
Author

Updates:

  • Squashed all previous commits
  • If MIG devices are present, VFIO devices are not advertised.
  • Updated regression testing in description

@klueska klueska added this to the v25.12.0 milestone Nov 5, 2025