
Pod is scheduled to a GPU card that does not meet the expectations of the GPU scheduler policy on a multi-NUMA GPU node. #1006

Open
@Kyrie336

Description

What happened:
The Pod is scheduled to a GPU card that does not meet the expectations of the GPU scheduler policy on a multi-NUMA GPU node. The Pod is configured with the "spread" policy, but it is scheduled to a GPU card with high usage.

What you expected to happen:
When the GPU scheduler policy is configured as "spread", the Pod should be scheduled to a GPU card with low usage.

How to reproduce it (as minimally and precisely as possible):
This problem only occurs on multi-NUMA GPU nodes; see the sketch below for a Pod spec that reproduces it.
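
This is a minimal sketch, assuming the per-Pod policy annotation hami.io/gpu-scheduler-policy and the nvidia.com/* resource names from the HAMi v2.x docs; the image and resource values are placeholders, adjust to your deployment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-spread-test
  annotations:
    # Ask the HAMi scheduler to spread vGPU allocations across GPU cards
    hami.io/gpu-scheduler-policy: "spread"
spec:
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1        # one vGPU
          nvidia.com/gpumem: 3000  # ~3000 MiB of device memory per vGPU
```

On a multi-NUMA node with several GPUs, the expectation is that the scheduler picks the least-used card; instead the Pod lands on a card that is already heavily allocated.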

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g. /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version: v2.5.0
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
