nvidia-cuda-validator Pod in nvidia-gpu-operator Namespace Stuck in Init:CrashLoopBackOff on OpenShift 4.18.22 #1848

@KothaHariChandana2000

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
The nvidia-cuda-validator pod in the nvidia-gpu-operator namespace is stuck in Init:CrashLoopBackOff. The cuda-validation init container logs the following:

% oc logs nvidia-cuda-validator-4bnbm -c cuda-validation -n nvidia-gpu-operator
Failed to allocate device vector A (error code initialization error)!
[Vector addition of 50000 elements]

To Reproduce
Install NVIDIA GPU Operator 25.10.0 from OperatorHub on Red Hat OpenShift Container Platform 4.18.22.

Expected behavior
The NVIDIA GPU Operator v25.10.0 installs successfully, with all pods in the nvidia-gpu-operator namespace in a Running state.

Environment (please provide the following information):
Bare metal OpenShift cluster 4.18.22
1 GPU node with L40S GPU cards
Node Feature Discovery Operator 4.18.0 installed
NVIDIA GPU Operator 25.10.0 installed

Information to attach (optional if deemed irrelevant)

  • Kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • Kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log
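
On OpenShift, the checklist above could be run with oc instead of kubectl. The following is a minimal sketch, not an official collection script. The helper names (run, collect_gpu_operator_diagnostics) and the DRY_RUN variable are hypothetical, the pod names are supplied by the caller, and it assumes oc is already logged in to the cluster. The containerd step is left out because OpenShift nodes run CRI-O rather than containerd.

```shell
#!/usr/bin/env bash
# Sketch: gather the diagnostics requested in the checklist above using `oc`.
# Assumptions (not from the original report): `oc` is logged in, and the
# namespace and pod names are passed in by the caller.

# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Usage: collect_gpu_operator_diagnostics NAMESPACE FAILING_POD [DRIVER_POD]
collect_gpu_operator_diagnostics() {
  local ns="$1" pod="$2" driver_pod="${3:-}"
  run oc get pods -n "$ns"
  run oc get ds -n "$ns"
  run oc describe pod -n "$ns" "$pod"
  run oc logs -n "$ns" "$pod" --all-containers
  # nvidia-smi is only attempted when a driver pod name is supplied.
  if [ -n "$driver_pod" ]; then
    run oc exec "$driver_pod" -n "$ns" -c nvidia-driver-ctr -- nvidia-smi
  fi
}
```

For the pod in this report, the call might look like `collect_gpu_operator_diagnostics nvidia-gpu-operator nvidia-cuda-validator-4bnbm DRIVER_POD_NAME`, with DRIVER_POD_NAME replaced by the actual driver pod. Setting DRY_RUN=1 prints the commands without running them.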

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: Refer to the must-gather script for details on the debug data it collects.

This bundle can be submitted to us via email: [email protected]
