Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
The nvidia-cuda-validator pod in the nvidia-gpu-operator namespace is stuck in Init:CrashLoopBackOff.
% oc logs nvidia-cuda-validator-4bnbm -c cuda-validation -n nvidia-gpu-operator
Failed to allocate device vector A (error code initialization error)!
[Vector addition of 50000 elements]
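Additional commands that could help narrow this down (the pod name is taken from the log above; these are suggestions, not output collected for this report):
# Events and init-container state for the failing validator pod
oc describe pod nvidia-cuda-validator-4bnbm -n nvidia-gpu-operator
# Logs from the previous (crashed) attempt of the cuda-validation init container
oc logs nvidia-cuda-validator-4bnbm -c cuda-validation -n nvidia-gpu-operator --previous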
To Reproduce
Install the NVIDIA GPU Operator 25.10.0 from OperatorHub on Red Hat OpenShift Container Platform 4.18.22.
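Assuming the operator is installed into the nvidia-gpu-operator namespace, as in this report, the installed version and overall readiness can be checked with:
# Installed operator version (ClusterServiceVersion) in the operator namespace
oc get csv -n nvidia-gpu-operator
# ClusterPolicy created for the operator; its status reflects overall readiness
oc get clusterpolicy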
Expected behavior
The NVIDIA GPU Operator v25.10.0 installs successfully, with all pods in the nvidia-gpu-operator namespace in a Running state.
Environment (please provide the following information):
Bare-metal OpenShift cluster 4.18.22
1 GPU node with L40S GPU cards
Node Feature Discovery Operator 4.18.0 installed
NVIDIA GPU Operator 25.10.0 installed
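The cluster and GPU-node details above can be confirmed with commands such as the following (the node label is the standard NFD PCI label for NVIDIA's vendor ID 10de and assumes the default Node Feature Discovery configuration):
oc version
# GPU nodes as labeled by Node Feature Discovery
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true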
Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- containerd logs: journalctl -u containerd > containerd.log
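For this report, the placeholders above map to the nvidia-gpu-operator namespace and the failing validator pod; the driver pod name below is only an example (list the pods first to find the real name):
kubectl get pods -n nvidia-gpu-operator
kubectl get ds -n nvidia-gpu-operator
kubectl describe pod -n nvidia-gpu-operator nvidia-cuda-validator-4bnbm
kubectl logs -n nvidia-gpu-operator nvidia-cuda-validator-4bnbm --all-containers
# Driver pod name is an example; the actual suffix will differ
kubectl exec nvidia-driver-daemonset-xxxxx -n nvidia-gpu-operator -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log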
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]