Hi,
I'm in a weird situation. I have containerized a Python job that uses PyTorch and deployed it on my microk8s cluster. All my nodes are running the latest microk8s version (v1.33.0). One of my nodes has a GPU, and I'm deploying my container on this node specifically. Here is the manifest for the job:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: feedoscope-infer-job
spec:
  schedule: "*/30 6-23 * * *"
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        metadata:
          labels:
            app: feedoscope-infer
        spec:
          runtimeClassName: nvidia
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          nodeName: djipey-server
          containers:
            - name: feedoscope-infer
              image: 192.168.0.13:32000/feedoscope:latest
              imagePullPolicy: Always
              command:
                - /bin/sh
                - -c
                - |
                  nvidia-smi
                  python -c 'import torch; print(torch.cuda.is_available())'
                  python -m feedoscope.llm_infer
              volumeMounts:
                - name: models-volume
                  mountPath: /app/saved_models
              resources:
                requests:
                  nvidia.com/gpu: 1
                limits:
                  nvidia.com/gpu: 1
              env:
                - name: NVIDIA_DRIVER_CAPABILITIES
                  value: "all"
                - name: TTRSS_DB_NAME
                  valueFrom:
                    configMapKeyRef:
                      name: ttrss-config
                      key: TTRSS_DB_NAME
                - name: TTRSS_DB_PASS
                  valueFrom:
                    configMapKeyRef:
                      name: ttrss-config
                      key: TTRSS_DB_PASS
                - name: TTRSS_DB_USER
                  valueFrom:
                    configMapKeyRef:
                      name: ttrss-config
                      key: TTRSS_DB_USER
                - name: DATABASE_URL
                  value: "postgresql://$(TTRSS_DB_USER):$(TTRSS_DB_PASS)@db:5432/$(TTRSS_DB_NAME)"
          restartPolicy: Never
          volumes:
            - name: models-volume
              persistentVolumeClaim:
                claimName: models-pvc
Here is how I enabled the nvidia addon:
❯ microk8s enable nvidia --set toolkit.enabled=false
Infer repository core for addon nvidia
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-19d48f76-1939-be75-94ee-0fffd1f683be)
WARNING: --set is deprecated, please use --gpu-operator-set instead
Error: repository name (nvidia) already exists, please specify a different name
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Deploy NVIDIA GPU operator
Using host GPU driver
W0813 16:29:19.877214 2371802 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0813 16:29:19.883814 2371802 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
NAME: gpu-operator
LAST DEPLOYED: Wed Aug 13 16:29:18 2025
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
Deployed NVIDIA GPU operator
I can confirm that:
- The pods for the jobs are created on the node djipey-server
- The image exists, can be pulled, and can start
- All the resources in the namespace gpu-operator-resources were deployed and ran successfully
- I have enabled the nvidia addon, see above
- Most importantly: my job runs correctly from time to time, but not always. This is what's baffling me.
These are the logs for a failed run (the Python script crashes when the device is not cuda):
Wed Aug 13 15:13:24 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05 Driver Version: 575.64.05 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:01:00.0 Off | N/A |
| 0% 47C P8 22W / 170W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
/app/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
{"asctime": "2025-08-13 15:13:33,790", "levelname": "DEBUG", "message": "Using selector: EpollSelector", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "selector_events.py", "lineno": 64, "funcName": "__init__"}
/app/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
{"asctime": "2025-08-13 15:13:33,836", "levelname": "INFO", "message": "Using device: cpu", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "llm_infer.py", "lineno": 48, "funcName": "main"}
{"asctime": "2025-08-13 15:13:33,836", "levelname": "CRITICAL", "message": "GPU not available. Device is 'cpu'. Exiting", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "llm_infer.py", "lineno": 53, "funcName": "main"}
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/app/feedoscope/llm_infer.py", line 127, in <module>
asyncio.run(main())
File "/usr/local/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/app/feedoscope/llm_infer.py", line 54, in main
raise RuntimeError(mes)
RuntimeError: GPU not available. Device is 'cpu'. Exiting
And here are the logs for a successful run:
Wed Aug 13 15:31:27 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05 Driver Version: 575.64.05 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 44C P8 21W / 170W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
True
{"asctime": "2025-08-13 15:31:36,927", "levelname": "DEBUG", "message": "Using selector: EpollSelector", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "selector_events.py", "lineno": 64, "funcName": "__init__"}
{"asctime": "2025-08-13 15:31:37,048", "levelname": "INFO", "message": "Using device: cuda", "otelTraceID": null, "otelSpanID": null, "otelServiceName": null, "otelTraceSampled": null, "filename": "llm_infer.py", "lineno": 48, "funcName": "main"}
We can see that nvidia-smi always works from inside the container, regardless of whether the job fails or not, and the job does run successfully from time to time, so I don't think it's an obvious driver problem. Also, each job is attempted up to 3 times in a row (backoffLimit: 2), and when a job fails, all 3 attempts fail.
At this point, I don't know whether it's a problem with the nvidia addon or with torch, and I've run out of ideas. Would you be able to shed some light?
The only thing I can think of is that I seem to get successful runs just after disabling and re-enabling the nvidia addon on the node with the GPU, but I'm not even sure about that.
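One way to narrow this down might be to bypass torch and call the CUDA driver API directly from inside the container. nvidia-smi talks to the driver through NVML, while torch goes through libcuda, so one can work while the other fails. The sketch below is only a diagnostic idea, not a fix: the function name `probe_cuda_driver` and the report keys are made up for illustration, and it assumes `libcuda.so.1` is the library name injected into the container. It records which `/dev/nvidia*` device nodes are present and the raw `cuInit` result code, which is more specific than the torch warning. The two logs above also differ in the Persistence-M column (Off in the failed run, On in the successful one); whether that is related is not established here.

```python
import ctypes
import glob
import json
import os


def probe_cuda_driver(lib_name: str = "libcuda.so.1") -> dict:
    """Call cuInit(0) through the CUDA driver API and report what happened.

    Hypothetical helper for debugging; it does not depend on torch.
    """
    report = {
        # Device nodes the container runtime exposed to this container.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "lib_loaded": False,
        "cuInit": None,        # raw CUresult code, 0 means success
        "error_name": None,    # symbolic name of the error, if any
        "device_count": None,
    }

    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        report["error_name"] = f"cannot load {lib_name}: {exc}"
        return report
    report["lib_loaded"] = True

    rc = libcuda.cuInit(0)
    report["cuInit"] = rc
    if rc != 0:
        # Translate the numeric code into its symbolic name.
        name = ctypes.c_char_p()
        if libcuda.cuGetErrorName(rc, ctypes.byref(name)) == 0 and name.value:
            report["error_name"] = name.value.decode()
        return report

    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) == 0:
        report["device_count"] = count.value
    return report


if __name__ == "__main__":
    print(json.dumps(probe_cuda_driver(), indent=2))
```

Running this as an extra step in the container command, right after nvidia-smi, and comparing the output of a failed run with a successful one could show whether the device nodes or the `cuInit` result differ between the two.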