Description
What happened: PyTorch distributed training on a single node with multiple GPUs fails with an error; the log is as follows:
What you expected to happen: PyTorch distributed training on a single node with multiple GPUs should run normally.
How to reproduce it (as minimally and precisely as possible):
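No reproduction script was attached, so the following is only a minimal sketch of the kind of job that should exercise the same code path: a plain single-node, multi-GPU DDP run launched with `torchrun` inside a HAMi vGPU-allocated pod. The file name `repro.py`, the 2-GPU process count, and the toy model are assumptions, not details from the report.

```python
# Hypothetical minimal reproduction (assumption: the failure appears with a
# plain single-node multi-GPU DDP job). Launch with:
#   torchrun --standalone --nproc_per_node=2 repro.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets LOCAL_RANK; each process drives one GPU.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = DDP(nn.Linear(16, 16).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # A few dummy steps are enough to trigger NCCL collectives.
    for _ in range(5):
        x = torch.randn(8, 16, device=local_rank)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```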
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host
- Your docker or containerd configuration file (e.g. `/etc/docker/daemon.json`)
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`)
- Any relevant kernel output lines from `dmesg`
Environment:
- HAMi version: v2.4.1
- NVIDIA driver or other AI device driver version:
- Docker version from `docker version`:
- Docker command, image and tag used:
- Kernel version from `uname -a`:
- Others: