Description
What happened:
An error occurred while creating an Iluvatar pod:
Warning UnexpectedAdmissionError 8s kubelet, aio-node67 Allocate failed due to rpc error: code = Unknown desc = Iluvatar node mismatch for pod iluvatar-1(pod-1), pick up:/dev/iluvatar0 predicate: /dev/iluvatar1, which is unexpected
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
On a machine with more than 2 Iluvatar GPUs, create this pod:
apiVersion: v1
kind: Pod
metadata:
  name: iluvatar-1
spec:
  containers:
  - name: pod-1
    image: ubuntu:20.04
    command:
    - bash
    args:
    - -c
    - |
      set -ex
      echo "export LD_LIBRARY_PATH=/usr/local/corex/lib64:$LD_LIBRARY_PATH">> /root/.bashrc
      cp -f /usr/local/iluvatar/lib64/libcuda.* /usr/local/corex/lib64/
      cp -f /usr/local/iluvatar/lib64/libixml.* /usr/local/corex/lib64/
      source /root/.bashrc
      sleep 360000
    resources:
      limits:
        iluvatar.ai/vgpu: "1"
        iluvatar.ai/MR-V100.vCore: "50"
        iluvatar.ai/MR-V100.vMem: "64"
      requests:
        iluvatar.ai/vgpu: "1"
        iluvatar.ai/MR-V100.vCore: "50"
        iluvatar.ai/MR-V100.vMem: "64"
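The reproduction can be scripted roughly as follows. This is only a sketch: it assumes the manifest above was saved as iluvatar-1.yaml (a hypothetical filename) and that kubectl is configured for the affected cluster. The pod name comes from the manifest; everything else is illustrative.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps, not an official HAMi procedure.
# MANIFEST is a hypothetical filename for the pod spec shown above.
MANIFEST="${MANIFEST:-iluvatar-1.yaml}"
POD_NAME="iluvatar-1"   # metadata.name from the manifest

reproduce() {
  # Create the pod that requests one Iluvatar vGPU.
  kubectl apply -f "$MANIFEST"
  # The UnexpectedAdmissionError warning should show up in the pod's events.
  kubectl describe pod "$POD_NAME"
  kubectl get events --field-selector "involvedObject.name=$POD_NAME"
}

# Run only when kubectl and the manifest are both available.
if command -v kubectl >/dev/null 2>&1 && [ -f "$MANIFEST" ]; then
  reproduce
else
  echo "skipped: kubectl or $MANIFEST not found"
fi
```

Set MANIFEST to another path if the spec is stored elsewhere, for example MANIFEST=/tmp/pod.yaml.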
Anything else we need to know?:
- The output of nvidia-smi -a on your host
- Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
- Any relevant kernel output lines from dmesg
Environment:
- HAMi version: v2.5.0
- nvidia driver or other AI device driver version:
- Docker version from docker version
- Docker command, image and tag used
- Kernel version from uname -a
- Others: