An error occurred while creating an Iluvatar pod #933

Open
@ouyangluwei163

What happened:
An error occurred while creating an Iluvatar pod:

Warning  UnexpectedAdmissionError  8s    kubelet, aio-node67  Allocate failed due to rpc error: code = Unknown desc = Iluvatar node mismatch for pod iluvatar-1(pod-1), pick up:/dev/iluvatar0  predicate: /dev/iluvatar1, which is unexpected
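To make the mismatch in the event above explicit, here is a minimal sketch that pulls the two device paths out of the message. The field labels `pick up:` and `predicate:` are copied from the event; reading them as "the device handed to Allocate" and "the device expected for the pod" is an assumption, not something confirmed by the HAMi source. `parse_mismatch` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the device paths that follow "pick up:" and
# "predicate:" in the kubelet event message (labels taken from the event text).
parse_mismatch() {
  local msg="$1"
  local picked predicted
  # Capture the run of non-space, non-comma characters after each label.
  picked=$(printf '%s\n' "$msg" | sed -n 's/.*pick up:[[:space:]]*\([^[:space:],]*\).*/\1/p')
  predicted=$(printf '%s\n' "$msg" | sed -n 's/.*predicate:[[:space:]]*\([^[:space:],]*\).*/\1/p')
  printf 'picked=%s predicted=%s\n' "$picked" "$predicted"
}

# Message body copied from the event in this issue.
msg='Iluvatar node mismatch for pod iluvatar-1(pod-1), pick up:/dev/iluvatar0  predicate: /dev/iluvatar1, which is unexpected'
parse_mismatch "$msg"
```

For this event the two paths differ (`/dev/iluvatar0` vs `/dev/iluvatar1`), which is what the Allocate call rejects.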

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
On a machine with more than 2 Iluvatar GPUs, create this pod:

apiVersion: v1
kind: Pod
metadata:
  name: iluvatar-1
spec:
  containers:
    - name: pod-1
      image: ubuntu:20.04
      command:
      - bash
      args:
      - -c
      - |
        set -ex
        echo "export LD_LIBRARY_PATH=/usr/local/corex/lib64:$LD_LIBRARY_PATH">> /root/.bashrc
        cp -f /usr/local/iluvatar/lib64/libcuda.* /usr/local/corex/lib64/
        cp -f /usr/local/iluvatar/lib64/libixml.* /usr/local/corex/lib64/
        source /root/.bashrc
        sleep 360000
      resources:
        limits:
          iluvatar.ai/vgpu: "1"
          iluvatar.ai/MR-V100.vCore: "50"
          iluvatar.ai/MR-V100.vMem: "64"
        requests:
          iluvatar.ai/vgpu: "1"
          iluvatar.ai/MR-V100.vCore: "50"
          iluvatar.ai/MR-V100.vMem: "64"

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg
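The items in the list above could be gathered with something like the following sketch. The namespace and label selectors are assumptions about a typical HAMi install and are not taken from this issue, so adjust them for your cluster. `run` and `collect_diagnostics` are hypothetical helper names. By default the script only prints the commands (`DRY_RUN=1`).

```shell
#!/usr/bin/env bash
# Assumptions (not from the issue): HAMi runs in kube-system and its pods carry
# the component labels below. Override via environment variables as needed.
NS="${NS:-kube-system}"
POD="${POD:-iluvatar-1}"
DP_SELECTOR="${DP_SELECTOR:-app.kubernetes.io/component=hami-device-plugin}"
SCHED_SELECTOR="${SCHED_SELECTOR:-app.kubernetes.io/component=hami-scheduler}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

collect_diagnostics() {
  # Pod events, including the UnexpectedAdmissionError warning
  run kubectl describe pod "$POD"
  # Device plugin and scheduler logs (selectors are assumed, see above)
  run kubectl logs -n "$NS" -l "$DP_SELECTOR" --all-containers --tail=200
  run kubectl logs -n "$NS" -l "$SCHED_SELECTOR" --all-containers --tail=200
  # Most recent kubelet log lines first, on the node
  run sudo journalctl -r -u kubelet --no-pager -n 200
  # Kernel messages from the node
  run sudo dmesg
}
```

Calling `collect_diagnostics` prints the five commands; run `DRY_RUN=0 collect_diagnostics` on the affected node (here `aio-node67`) to execute them.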

Environment:

  • HAMi version: v2.5.0
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:

Labels: kind/bug (Something isn't working)