vgpu-devices-allocated annotations are inconsistent #991

Open
@ouyangluwei163

Description

What happened:

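The pod contains multiple containers. The annotation entry for a CPU-only container (one that requests no GPU) should be either an empty segment (;) or a zeroed placeholder (,,0,0:;), used consistently. Currently both forms appear in the same annotation: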

hami.io/vgpu-devices-allocated: GPU-0aa6b97c-d386-26ba-a94a-b9d27c2e3a71,NVIDIA,1000,0:;;,,0,0:;,,0,0:;,,0,0:;

What you expected to happen:

hami.io/vgpu-devices-allocated: ,,0,0:;GPU-0aa6b97c-d386-26ba-a94a-b9d27c2e3a71,NVIDIA,1000,0:;,,0,0:;,,0,0:;,,0,0:;

or

hami.io/vgpu-devices-allocated: ;GPU-0aa6b97c-d386-26ba-a94a-b9d27c2e3a71,NVIDIA,1000,0:;;;;
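To make the inconsistency easier to check, here is a minimal Python sketch that parses the annotation values quoted above. It is not HAMi's own decoder. It assumes, based only on the strings in this report, that containers are separated by ";", that each device entry ends with ":", and that an entry has four comma-separated fields (UUID, device type, memory, and a fourth numeric field that is presumably cores). The function names parse_allocated and placeholder_styles are hypothetical.

```python
def _segments(value):
    """Split the annotation value into one raw segment per container."""
    parts = value.split(";")
    # The value ends with ";", so the last split element is an empty tail.
    if parts and parts[-1] == "":
        parts = parts[:-1]
    return parts


def parse_allocated(value):
    """Return one list of (uuid, type, mem, cores) tuples per container.

    Assumed format (inferred from the issue, not from HAMi's source):
    "<uuid>,<type>,<mem>,<cores>:" per device, ";" after each container.
    """
    containers = []
    for segment in _segments(value):
        devices = []
        for entry in segment.split(":"):
            if not entry:
                continue  # nothing between ":" and ";"
            uuid, dev_type, mem, cores = entry.split(",")
            if not uuid:
                continue  # ",,0,0" placeholder, no real device
            devices.append((uuid, dev_type, int(mem), int(cores)))
        containers.append(devices)
    return containers


def placeholder_styles(value):
    """Report which placeholder forms the CPU-only containers use."""
    styles = set()
    for segment in _segments(value):
        if segment == "":
            styles.add("empty")    # written as ";"
        elif segment == ",,0,0:":
            styles.add("zeroed")   # written as ",,0,0:;"
    return styles


GPU = "GPU-0aa6b97c-d386-26ba-a94a-b9d27c2e3a71,NVIDIA,1000,0:"
observed = GPU + ";;,,0,0:;,,0,0:;,,0,0:;"
expected_zeroed = ",,0,0:;" + GPU + ";,,0,0:;,,0,0:;,,0,0:;"
expected_empty = ";" + GPU + ";;;;"

if __name__ == "__main__":
    for name, value in [("observed", observed),
                        ("expected (zeroed)", expected_zeroed),
                        ("expected (empty)", expected_empty)]:
        print(name, parse_allocated(value), placeholder_styles(value))
```

Under these assumptions, the observed value mixes both placeholder forms, while each expected value uses only one. The sketch also shows that the observed value lists the GPU entry in the first container slot, whereas both expected values place it in the second slot, matching the second container in the pod spec.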

How to reproduce it (as minimally and precisely as possible):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-task-qos-pod-1
spec:
  containers:
    - name: qos-pod-1
      image: pytorch:1.12.1-cuda11.3
      command:
        - sh
        - -c
        - sleep 800000
    - name: qos-pod-2
      image: pytorch:1.12.1-cuda11.3
      command:
        - sh
        - -c
        - sleep 800000
      resources:
        limits:
          nvidia.com/vgpu: 1
          nvidia.com/gpumem: 1000
    - name: qos-pod-3
      image: pytorch:1.12.1-cuda11.3
      command:
        - sh
        - -c
        - sleep 800000
    - name: qos-pod-4
      image: pytorch:1.12.1-cuda11.3
      command:
        - sh
        - -c
        - sleep 800000
    - name: qos-pod-5
      image: pytorch:1.12.1-cuda11.3
      command:
        - sh
        - -c
        - sleep 800000

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:

Metadata

Assignees

No one assigned

    Labels

    kind/bug: Something isn't working

