unified_lock locked, waiting 1 second #976

Open
@2232729885

What happened: PyTorch distributed training on a single node with multiple GPUs fails with an error. The logs are as follows:

Image

Image

What you expected to happen: PyTorch distributed training on a single node with multiple GPUs runs normally.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

  • The output of nvidia-smi -a on your host

Image

  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs

Image

  • The hami-scheduler container logs

Image

  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version: v2.4.1
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:

Labels: kind/bug
