Need guidance: sometimes (not always) pytorch can't detect the GPU. Is it pytorch or the nvidia addon? #355

@JPFrancoia

Hi,

I'm in a weird situation. I have containerized a Python job that uses PyTorch and deployed it on my microk8s cluster. All my nodes are running the latest microk8s version (v1.33.0). One of my nodes has a GPU, and I'm deploying my container on this node specifically. Here is the manifest for the job:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: feedoscope-infer-job
spec:
  schedule: "*/30 6-23 * * *"
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        metadata:
          labels:
            app: feedoscope-infer
        spec:
          runtimeClassName: nvidia
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          nodeName: djipey-server
          containers:
            - name: feedoscope-infer
              image: 192.168.0.13:32000/feedoscope:latest
              imagePullPolicy: Always
              command:
                - /bin/sh
                - -c
                - |
                  nvidia-smi
                  python -c 'import torch; print(torch.cuda.is_available())'
                  python -m feedoscope.llm_infer
              volumeMounts:
                - name: models-volume
                  mountPath: /app/saved_models
              resources:
                requests:
                  nvidia.com/gpu: 1
                limits:
                  nvidia.com/gpu: 1
              env:
                - name: NVIDIA_DRIVER_CAPABILITIES
                  value: "all"
                - name: TTRSS_DB_NAME
                  valueFrom:
                    configMapKeyRef:
                      name: ttrss-config
                      key: TTRSS_DB_NAME
                - name: TTRSS_DB_PASS
                  valueFrom:
                    configMapKeyRef:
                      name: ttrss-config
                      key: TTRSS_DB_PASS
                - name: TTRSS_DB_USER
                  valueFrom:
                    configMapKeyRef:
                      name: ttrss-config
                      key: TTRSS_DB_USER
                - name: DATABASE_URL
                  value: "postgresql://$(TTRSS_DB_USER):$(TTRSS_DB_PASS)@db:5432/$(TTRSS_DB_NAME)"
          restartPolicy: Never
          volumes:
            - name: models-volume
              persistentVolumeClaim:
                claimName: models-pvc

Here is how I enabled the nvidia addon:

❯ microk8s enable nvidia --set toolkit.enabled=false        
Infer repository core for addon nvidia
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-19d48f76-1939-be75-94ee-0fffd1f683be)
WARNING: --set is deprecated, please use --gpu-operator-set instead
Error: repository name (nvidia) already exists, please specify a different name
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Deploy NVIDIA GPU operator
Using host GPU driver
W0813 16:29:19.877214 2371802 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0813 16:29:19.883814 2371802 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
NAME: gpu-operator
LAST DEPLOYED: Wed Aug 13 16:29:18 2025
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
Deployed NVIDIA GPU operator

I can confirm that:

  • The pods for the jobs are created on the node djipey-server
  • The image exists, can be pulled, and can start
  • All the resources in the namespace gpu-operator-resources were deployed and ran successfully
  • I have enabled the nvidia addon, see above
  • Most importantly: my job runs correctly from time to time, but not always. This is what's baffling me.

Here are the logs for a failed run (the Python script crashes when the device is not cuda):

Wed Aug 13 15:13:24 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05              Driver Version: 575.64.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   47C    P8             22W /  170W |       1MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
/app/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
{"asctime": "2025-08-13 15:13:33,790", "levelname": "DEBUG", "message": "Using selector: EpollSelector", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "selector_events.py", "lineno": 64, "funcName": "__init__"}
/app/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
{"asctime": "2025-08-13 15:13:33,836", "levelname": "INFO", "message": "Using device: cpu", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "llm_infer.py", "lineno": 48, "funcName": "main"}
{"asctime": "2025-08-13 15:13:33,836", "levelname": "CRITICAL", "message": "GPU not available. Device is 'cpu'. Exiting", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "llm_infer.py", "lineno": 53, "funcName": "main"}
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/feedoscope/llm_infer.py", line 127, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/app/feedoscope/llm_infer.py", line 54, in main
    raise RuntimeError(mes)
RuntimeError: GPU not available. Device is 'cpu'. Exiting

And here are the logs for a successful run:

Wed Aug 13 15:31:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05              Driver Version: 575.64.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   44C    P8             21W /  170W |       1MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
True
{"asctime": "2025-08-13 15:31:36,927", "levelname": "DEBUG", "message": "Using selector: EpollSelector", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "selector_events.py", "lineno": 64, "funcName": "__init__"}
{"asctime": "2025-08-13 15:31:37,048", "levelname": "INFO", "message": "Using device: cuda", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "llm_infer.py", "lineno": 48, "funcName": "main"}

We can see that nvidia-smi works from inside the container every time, whether the job fails or not, and the job does succeed from time to time. So I don't think it's an obvious driver problem. Also, each job gets up to 3 attempts (backoffLimit: 2), and when a job fails, all 3 attempts fail in a row.
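
One way to narrow this down, as a rough sketch rather than a confirmed diagnosis: nvidia-smi talks to the driver through NVML, while torch has to initialize the CUDA driver API, so the first can work while the second fails. The snippet below calls `cuInit` from the CUDA driver library directly via `ctypes` and reports the error name the driver returns, which is more specific than the torch warning. The function name `probe_cuda_driver` is made up for illustration, and the library name `libcuda.so.1` is an assumption about what is exposed inside the container.

```python
import ctypes


def probe_cuda_driver(lib_name: str = "libcuda.so.1") -> tuple[bool, str]:
    """Call cuInit(0) directly and report the CUDA driver's own verdict.

    Hypothetical helper for debugging; lib_name is an assumption about
    which driver library is visible inside the container.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return False, f"could not load {lib_name}: {exc}"

    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    code = libcuda.cuInit(0)
    if code == 0:  # 0 is CUDA_SUCCESS
        return True, "cuInit succeeded"

    # Ask the driver for the symbolic name of the error code, if it can.
    name = ctypes.c_char_p()
    libcuda.cuGetErrorName.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    libcuda.cuGetErrorName.restype = ctypes.c_int
    if libcuda.cuGetErrorName(code, ctypes.byref(name)) == 0 and name.value:
        return False, f"cuInit failed with code {code} ({name.value.decode()})"
    return False, f"cuInit failed with code {code}"


if __name__ == "__main__":
    ok, detail = probe_cuda_driver()
    print(f"CUDA driver usable: {ok} ({detail})")
```

Running this in the job command right after nvidia-smi, before `python -m feedoscope.llm_infer`, would show for a failing pod whether the driver library is missing or whether `cuInit` itself returns an error, and which one.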

At this point, I don't know if it's a problem with the nvidia addon or with torch, and I've run out of ideas. Would you be able to shed some light?

The only thing I can think of is that it feels like I get successful runs just after disabling/enabling the nvidia addon on the node with the GPU. But I'm not even sure about that.
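
If re-enabling the addon really does help, that would point at node-side state that differs between runs. A possible check, sketched under assumptions: CUDA needs the `/dev/nvidia-uvm` device node in addition to the ones nvidia-smi uses, so listing which NVIDIA device nodes are visible inside a failing pod versus a working one could show a difference. The helper name `missing_nvidia_nodes` and the list of expected nodes (a single GPU at index 0, as on this node) are illustrative assumptions, and a missing node is only one of several possible causes.

```python
import os

# Assumed set of device nodes for a single-GPU node (GPU index 0).
EXPECTED_NODES = ("nvidiactl", "nvidia0", "nvidia-uvm")


def missing_nvidia_nodes(dev_dir: str = "/dev", expected=EXPECTED_NODES) -> list[str]:
    """Return the expected NVIDIA device nodes that are absent from dev_dir."""
    return [
        name
        for name in expected
        if not os.path.exists(os.path.join(dev_dir, name))
    ]


if __name__ == "__main__":
    missing = missing_nvidia_nodes()
    if missing:
        print("Missing device nodes:", ", ".join(missing))
    else:
        print("All expected NVIDIA device nodes are present")
```

Comparing this output between a failed and a successful run would either rule this explanation out or give something concrete to report.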
