What happened:
A Pod is scheduled onto a GPU card that does not match the GPU scheduler policy on a multi-NUMA GPU node. The Pod is configured with the "spread" policy, but it is scheduled onto a GPU card that already has high usage.
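For context, a minimal sketch of the kind of Pod used to trigger this, written as a `kubectl apply` heredoc. The `hami.io/gpu-scheduler-policy` annotation and the `nvidia.com/gpu` / `nvidia.com/gpumem` resource names are assumptions based on common HAMi usage and may differ from the actual manifest:

```shell
# Sketch of the reproduction Pod. The annotation and resource names below are
# assumptions; adjust them to whatever the real manifest uses.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-spread-test
  annotations:
    # Assumed per-GPU scheduling policy annotation.
    hami.io/gpu-scheduler-policy: "spread"
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1        # assumed HAMi vGPU resource name
        nvidia.com/gpumem: 4096  # assumed per-card memory request, in MiB
EOF
```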
What you expected to happen:
When the GPU scheduler policy is configured as "spread", the Pod should be scheduled onto a GPU card with low usage.
How to reproduce it (as minimally and precisely as possible):
This problem only occurs on multi-NUMA GPU nodes.
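To make the multi-NUMA layout and the per-card usage at scheduling time visible, the following standard `nvidia-smi` and `lscpu` invocations can be run on the affected node:

```shell
# Show the GPU <-> CPU/NUMA affinity matrix of the node.
nvidia-smi topo -m

# Confirm the node actually exposes more than one NUMA node.
lscpu | grep -i numa

# Snapshot per-card memory and compute usage when the Pod gets scheduled.
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv
```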
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host
- Your docker or containerd configuration file (e.g: `/etc/docker/daemon.json`)
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)
- Any relevant kernel output lines from `dmesg`
Environment:
- HAMi version: v2.5.0
- nvidia driver or other AI device driver version:
- Docker version from `docker version`
- Docker command, image and tag used
- Kernel version from `uname -a`
- Others: