
@shivakunv (Contributor) commented Nov 29, 2025

VF-8 Ability to set vGPU host driver options
vGPU host driver options are set via nvidia.ko kernel module parameters in /etc/modprobe.d/nvidia.conf. For example:

options nvidia NVreg_RegistryDwords="RmPVMRL=value"

GPU Operator currently supports setting kernel module parameters for the NVIDIA guest driver via a module.conf file that is passed to GPU Operator through a ConfigMap.

A similar mechanism will be used to support module parameters for the vGPU host kernel driver.
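
For illustration, a minimal sketch of the intended host-side wiring end to end. The ConfigMap and field names mirror the ones used in the testing steps below; the parameter value is the same placeholder as in the example above.

# nvidia.conf carrying a vGPU host driver option (placeholder value)
cat <<'EOF' > nvidia.conf
options nvidia NVreg_RegistryDwords="RmPVMRL=value"
EOF

kubectl create configmap kernel-module-params \
  -n nvidia-gpu-operator \
  --from-file=nvidia.conf=./nvidia.conf

# reference the ConfigMap from the vGPU Manager section of the ClusterPolicy
kubectl patch clusterpolicy cluster-policy --type='json' \
  -p='[{"op":"add","path":"/spec/vgpuManager/kernelModuleConfig","value":{"name":"kernel-module-params"}}]'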

Testing


# git clone https://github.com/NVIDIA/gpu-driver-container.git
# before this PR is merged:
# git fetch --all
# git checkout gpumanagerkernelmodulespec
# build the vgpu-manager image
# copy NVIDIA-Linux-x86_64-570.124.06-vgpu-kvm.run from the SMB server into the vgpu-manager directory
export VGPU_DRIVER_VERSION="570.124.06"
export IMAGE_HOST_NAME="vgpu-manager"
export DIST=ubuntu22.04
export IMAGE_NAME=nvcr.io/ea-cnt/nv_only
# build
VGPU_HOST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${IMAGE_NAME}/${IMAGE_HOST_NAME} make build-vgpuhost-${DIST}
# push
VGPU_HOST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${IMAGE_NAME}/${IMAGE_HOST_NAME} make push-vgpuhost-${DIST}
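
Optional sanity check after the build (assuming the Makefile targets build the image with Docker locally):

# confirm the vGPU Manager image exists with the expected tag
docker images "${IMAGE_NAME}/${IMAGE_HOST_NAME}" --format '{{.Repository}}:{{.Tag}}'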


# on the test system (collossus): enable IOMMU by editing /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt"
sudo update-grub
# reboot
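
After the reboot, a quick check that IOMMU is actually enabled (exact dmesg wording varies by kernel and platform):

# kernel command line should now include the IOMMU parameters
grep -E 'intel_iommu=on|iommu=pt' /proc/cmdline
# look for IOMMU/DMAR initialization messages
sudo dmesg | grep -iE 'iommu|dmar' | head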


# Complete cleanup
kubectl delete namespace nvidia-gpu-operator --wait=true
kubectl delete secret -n nvidia-gpu-operator ngc-secret
kubectl delete clusterrole gpu-operator
kubectl delete clusterrolebinding gpu-operator
kubectl delete clusterpolicy gpu-operator
kubectl delete clusterpolicy cluster-policy
kubectl delete clusterrole gpu-operator-node-feature-discovery
kubectl delete clusterrole gpu-operator-node-feature-discovery-gc
kubectl delete clusterrolebinding gpu-operator-node-feature-discovery
kubectl delete clusterrolebinding gpu-operator-node-feature-discovery-gc
kubectl delete crd clusterpolicies.nvidia.com nvidiadrivers.nvidia.com
helm uninstall -n nvidia-gpu-operator $(helm list -n nvidia-gpu-operator -q)
kubectl delete clusterpolicy --all



export NGC_API_KEY=ngc_api_key
export REGISTRY_SECRET_NAME=ngc-secret
export PRIVATE_REGISTRY=nvcr.io/ea-cnt/nv_only
export VGPU_DRIVER_VERSION=570.124.06


# Wait 30 seconds for cleanup
sleep 30

# Recreate everything
kubectl create namespace nvidia-gpu-operator
kubectl label --overwrite namespace nvidia-gpu-operator pod-security.kubernetes.io/enforce=privileged

kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \
  --docker-server=${PRIVATE_REGISTRY} \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY} \
  -n nvidia-gpu-operator

cat <<EOF > nvidia.conf
options nvidia NVreg_EnableGpuFirmwareLogs=1
EOF


# create params configmap
kubectl create configmap kernel-module-params \
  -n nvidia-gpu-operator \
  --from-file=nvidia.conf=./nvidia.conf
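
Quick check that the ConfigMap carries the expected nvidia.conf content:

kubectl get configmap kernel-module-params -n nvidia-gpu-operator -o yaml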

# Remove the failing vGPU config label
kubectl label node ipp1-0555 nvidia.com/vgpu.config-

# Install with vGPU Manager
helm install gpu-operator nvidia/gpu-operator \
  -n nvidia-gpu-operator \
  --set operator.repository=ghcr.io/nvidia \
  --set operator.version=280c5460 \
  --set driver.enabled=false \
  --set vgpuManager.enabled=true \
  --set vgpuManager.repository=${PRIVATE_REGISTRY} \
  --set vgpuManager.image=vgpu-manager \
  --set vgpuManager.version="${VGPU_DRIVER_VERSION}" \
  --set vgpuManager.imagePullSecrets[0]="${REGISTRY_SECRET_NAME}" \
  --set vgpuManager.kernelModuleConfig.name="kernel-module-params"
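
At this point the operator and node-feature-discovery pods should come up; the vGPU Manager daemonset is only deployed once the node is labeled for vm-vgpu workloads (last steps below):

helm list -n nvidia-gpu-operator
kubectl get pods -n nvidia-gpu-operator -o wide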




# copy config/crd/bases/nvidia.com_clusterpolicies.yaml (from this PR's branch) to the test system

kubectl apply -f nvidia.com_clusterpolicies.yaml

kubectl patch clusterpolicy cluster-policy --type='json' -p='[                                                                           
  {
    "op": "add",
    "path": "/spec/vgpuManager/kernelModuleConfig",
    "value": {
      "name": "kernel-module-params"
    }
  }
]'
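
Verify that the patch landed in the ClusterPolicy:

kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.vgpuManager.kernelModuleConfig}'; echo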


kubectl patch clusterpolicy cluster-policy --type='merge' -p '{"spec":{"sandboxWorkloads":{"enabled":true,"defaultWorkload":"vm-vgpu"}}}'

kubectl label node ipp1-0555 nvidia.com/gpu.workload.config=vm-vgpu
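
With the workload label applied, the vGPU Manager pod should get scheduled on the node. A hedged check follows; the daemonset/pod naming is an assumption, while the container name comes from the daemonset spec:

kubectl get pods -n nvidia-gpu-operator -o wide | grep -i vgpu-manager
# tail the driver container once the pod is running (pod name below is a placeholder)
kubectl logs -n nvidia-gpu-operator nvidia-vgpu-manager-daemonset-xxxxx -c nvidia-vgpu-manager-ctr --tail=50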


cat /proc/driver/nvidia/params | grep EnableGpuFirmwareLogs

The output should show EnableGpuFirmwareLogs set to the updated value (1) from the nvidia.conf ConfigMap.
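
The same check can also be run from inside the vGPU Manager driver container (pod name is a placeholder; the container name nvidia-vgpu-manager-ctr matches the daemonset spec referenced in the review below):

kubectl exec -n nvidia-gpu-operator nvidia-vgpu-manager-daemonset-xxxxx \
  -c nvidia-vgpu-manager-ctr -- grep EnableGpuFirmwareLogs /proc/driver/nvidia/params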

@shivakunv shivakunv self-assigned this Nov 29, 2025
@shivakunv shivakunv force-pushed the gpumanagerkernelmodulespec branch 3 times, most recently from 8468f59 to aa1a34d on November 29, 2025 15:09
@coveralls commented Nov 29, 2025

Coverage Status

coverage: 25.65% (+0.2%) from 25.43% when pulling a54b7e1 on gpumanagerkernelmodulespec into 96351fc on main.

@shivakunv shivakunv force-pushed the gpumanagerkernelmodulespec branch from aa1a34d to a3d34df on November 30, 2025 04:52
@shivakunv shivakunv marked this pull request as ready for review December 1, 2025 03:16
@shivakunv shivakunv marked this pull request as draft December 1, 2025 16:27
@shivakunv shivakunv marked this pull request as ready for review December 1, 2025 16:40
@shivakunv (Contributor, Author) commented:
PTAL:
NVIDIA/gpu-driver-container#512

@shivakunv shivakunv force-pushed the gpumanagerkernelmodulespec branch 4 times, most recently from 2a41292 to 280c546 on December 18, 2025 07:30
@shivakunv shivakunv force-pushed the gpumanagerkernelmodulespec branch from 280c546 to dae7499 on December 18, 2025 15:54
@rajathagasthya (Contributor) commented:

@shivakunv Could you rebase this PR?

@shivakunv shivakunv force-pushed the gpumanagerkernelmodulespec branch from 1ceaafa to b04453a on January 27, 2026 10:13
@shivakunv (Contributor, Author) commented:

> @shivakunv Could you rebase this PR?

Done.

@shivakunv shivakunv force-pushed the gpumanagerkernelmodulespec branch from b04453a to 422e2ff on January 27, 2026 17:49
Comment on lines 2824 to 2908
if config.VGPUManager.KernelModuleConfig != nil && config.VGPUManager.KernelModuleConfig.Name != "" {
    // note: transformVGPUManagerContainer() will have already created a Volume backed by the ConfigMap.
    // Only add a VolumeMount for nvidia-vgpu-manager-ctr.
    volumeMounts, _, err := createConfigMapVolumeMounts(n, config.VGPUManager.KernelModuleConfig.Name, driversDir)
    if err != nil {
        return fmt.Errorf("failed to create ConfigMap VolumeMounts for vGPU manager kernel module configuration: %w", err)
    }
    obj.Spec.Template.Spec.Containers[i].VolumeMounts = append(obj.Spec.Template.Spec.Containers[i].VolumeMounts, volumeMounts...)
}
Review comment (Contributor):
This should be removed. The transformPeerMemoryContainer() function is only relevant for the main driver daemonset, not the vGPU manager daemonset. The vGPU manager daemonset does not install the nvidia-peermem module.

Suggested change: remove the block above entirely (no replacement).

Review comment (Contributor):
Can we add some unit tests to transforms_test.go for TransformVGPUManager()? One of the unit tests can verify that the kernel module config map is getting rendered correctly.

@shivakunv shivakunv force-pushed the gpumanagerkernelmodulespec branch from 8430f36 to e4b5dcd on January 27, 2026 19:48
Signed-off-by: Shiva Kumar (SW-CLOUD) <shivaku@nvidia.com>
@shivakunv shivakunv force-pushed the gpumanagerkernelmodulespec branch from e4b5dcd to a54b7e1 on January 27, 2026 19:52