Description
What happened:
The pod contains multiple containers. The annotation entry for a CPU-only container should be either the empty form (;) or the zero-filled form (,,0,0:;), but right now both forms appear in the same annotation:
hami.io/vgpu-devices-allocated: GPU-0aa6b97c-d386-26ba-a94a-b9d27c2e3a71,NVIDIA,1000,0:;;,,0,0:;,,0,0:;,,0,0:;
What you expected to happen:
hami.io/vgpu-devices-allocated: ,,0,0:;GPU-0aa6b97c-d386-26ba-a94a-b9d27c2e3a71,NVIDIA,1000,0:;,,0,0:;,,0,0:;,,0,0:;
or
hami.io/vgpu-devices-allocated: ;GPU-0aa6b97c-d386-26ba-a94a-b9d27c2e3a71,NVIDIA,1000,0:;;;;
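For context, here is a minimal Go sketch (not HAMi's actual encoder; containerAlloc and encode are hypothetical names) that assumes the format visible above: each device is encoded as UUID,Type,MemMB,Cores terminated by ":", and containers are separated by ";". Picking one placeholder style for every CPU-only container produces exactly the two expected values and never a mix of both:

package main

import (
	"fmt"
	"strings"
)

// containerAlloc is a hypothetical per-container allocation, assuming the
// observed encoding of hami.io/vgpu-devices-allocated.
type containerAlloc struct {
	devices []string // e.g. "GPU-0aa6b97c-d386-26ba-a94a-b9d27c2e3a71,NVIDIA,1000,0"
}

// encode builds the annotation value, using the same placeholder for every
// CPU-only container so the empty and zero-filled styles are never mixed.
func encode(allocs []containerAlloc, zeroFilled bool) string {
	var b strings.Builder
	for _, a := range allocs {
		if len(a.devices) == 0 {
			if zeroFilled {
				b.WriteString(",,0,0:")
			}
			b.WriteString(";")
			continue
		}
		for _, d := range a.devices {
			b.WriteString(d)
			b.WriteString(":")
		}
		b.WriteString(";")
	}
	return b.String()
}

func main() {
	// Five containers, only the second one requests a vGPU (as in the repro pod below).
	allocs := []containerAlloc{
		{},
		{devices: []string{"GPU-0aa6b97c-d386-26ba-a94a-b9d27c2e3a71,NVIDIA,1000,0"}},
		{}, {}, {},
	}
	fmt.Println(encode(allocs, true))  // ,,0,0:;GPU-...,NVIDIA,1000,0:;,,0,0:;,,0,0:;,,0,0:;
	fmt.Println(encode(allocs, false)) // ;GPU-...,NVIDIA,1000,0:;;;;
}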
How to reproduce it (as minimally and precisely as possible):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-task-qos-pod-1
spec:
  containers:
  - name: qos-pod-1
    image: pytorch:1.12.1-cuda11.3
    command:
    - sh
    - -c
    - sleep 800000
  - name: qos-pod-2
    image: pytorch:1.12.1-cuda11.3
    command:
    - sh
    - -c
    - sleep 800000
    resources:
      limits:
        nvidia.com/vgpu: 1
        nvidia.com/gpumem: 1000
  - name: qos-pod-3
    image: pytorch:1.12.1-cuda11.3
    command:
    - sh
    - -c
    - sleep 800000
  - name: qos-pod-4
    image: pytorch:1.12.1-cuda11.3
    command:
    - sh
    - -c
    - sleep 800000
  - name: qos-pod-5
    image: pytorch:1.12.1-cuda11.3
    command:
    - sh
    - -c
    - sleep 800000
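Once the pod is scheduled, the resulting annotation can be inspected with, for example: kubectl get pod gpu-task-qos-pod-1 -o yaml | grep vgpu-devices-allocated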
Anything else we need to know?:
- The output of nvidia-smi -a on your host
- Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
- Any relevant kernel output lines from dmesg
Environment:
- HAMi version:
- nvidia driver or other AI device driver version:
- Docker version from docker version
- Docker command, image and tag used
- Kernel version from uname -a
- Others: