Description
What happened: PyTorch distributed training on a single node with multiple GPUs fails with an error; the log is as follows:
What you expected to happen: PyTorch distributed training on a single node with multiple GPUs should run normally.
How to reproduce it (as minimally and precisely as possible):
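No reproduction script was attached, so the following is only a minimal sketch of the kind of job that should exercise the same code path: a plain single-node, multi-GPU DDP run launched with `torchrun` inside a HAMi vGPU-allocated pod. The file name `repro.py`, the 2-GPU process count, and the toy model are assumptions, not details from the report.

```python
# Hypothetical minimal reproduction (assumption: the failure appears with a
# plain single-node multi-GPU DDP job). Launch with:
#   torchrun --standalone --nproc_per_node=2 repro.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets LOCAL_RANK; each process drives one GPU.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = DDP(nn.Linear(16, 16).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # A few dummy steps are enough to trigger NCCL collectives.
    for _ in range(5):
        x = torch.randn(8, 16, device=local_rank)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```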
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host
- Your docker or containerd configuration file (e.g. `/etc/docker/daemon.json`)
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`)
- Any relevant kernel output lines from `dmesg`
Environment:
- HAMi version: v2.4.1
- NVIDIA driver or other AI device driver version:
- Docker version from `docker version`:
- Docker command, image and tag used:
- Kernel version from `uname -a`:
- Others: