
VGPU-Monitor error when allocating multiple GPUs to multiple containers in one Pod #863

Open
@JING21

Description

A Deployment was created with a single Pod containing two containers, both of which request GPU resources. However, the monitoring system is unable to collect the relevant data, and vgpu-monitor reports the following error:

[screenshot: vgpu-monitor error log]

It turns out there are three files under the device plugin's vgpu directory:

[screenshot: listing of the device-plugin vgpu directory]

The related error-handling code is here:

[screenshot: the related error-handling code]

Here is the metrics data. Many metric series are missing, such as Device_memory_desc_of_container and Device_utilization_desc_of_container.

[screenshot: exported metrics output]

What happened:
The vgpu-monitor cannot collect metrics data.
What you expected to happen:
Correct metrics data can be collected from vgpu-monitor.
How to reproduce it (as minimally and precisely as possible):
Create a Pod with multiple containers that each request GPU resources (a sketch manifest is shown below).
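
For reference, a minimal reproduction sketch is below. The container image, resource names (nvidia.com/gpu, nvidia.com/gpumem), and request sizes are assumptions based on a default HAMi setup and may need adjusting for your cluster:

```yaml
# Minimal sketch: a Deployment whose single Pod runs two containers,
# each requesting GPU resources (image and resource values are assumptions).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-container-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: multi-container-gpu
  template:
    metadata:
      labels:
        app: multi-container-gpu
    spec:
      containers:
        - name: cuda-a
          image: nvidia/cuda:12.4.0-base-ubuntu22.04  # assumed image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1        # default HAMi vGPU resource name
              nvidia.com/gpumem: 3000  # MiB; optional HAMi memory limit
        - name: cuda-b
          image: nvidia/cuda:12.4.0-base-ubuntu22.04  # assumed image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
              nvidia.com/gpumem: 3000
```

With HAMi installed, applying a manifest like this and then checking the vgpu-monitor logs should reproduce the reported error.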
Anything else we need to know?:

  • The output of nvidia-smi -a on your host

[screenshot: nvidia-smi -a output]

Environment:

  • HAMi version: 2.5.0
  • nvidia driver or other AI device driver version: 550.54.15
  • CUDA version: 12.4
  • Kernel version (from uname -a): 4.19.0-240.23.36
  • Others:
