Description
A Deployment has been created with a single Pod containing two containers, both of which have requested GPU resources. However, the monitoring system is unable to collect relevant data, and the vgpu-monitor is reporting the following error:
There are 3 files under the device-plugin vgpu directory:
The related error code is here:
Here is the metrics output. Many metric labels are missing, such as Device_memory_desc_of_container and Device_utilization_desc_of_container.
What happened:
The monitor cannot collect metrics data.
What you expected to happen:
We can get correct metrics data from vgpu-monitor.
How to reproduce it (as minimally and precisely as possible):
Create a multi-container Pod that requests GPU resources (see the example manifest below).
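A minimal Deployment sketch along these lines reproduces the setup. The image, object names, and resource keys (nvidia.com/gpu, nvidia.com/gpumem) are assumptions based on a default HAMi configuration and should be adjusted to your cluster:

```yaml
# Hypothetical reproduction manifest: one Pod with two containers,
# each requesting vGPU resources through the HAMi device plugin.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vgpu-two-containers
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vgpu-two-containers
  template:
    metadata:
      labels:
        app: vgpu-two-containers
    spec:
      containers:
        - name: cuda-a
          image: nvidia/cuda:12.4.0-base-ubuntu22.04
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1        # one vGPU for the first container
              nvidia.com/gpumem: 2000  # device memory limit in MiB (optional)
        - name: cuda-b
          image: nvidia/cuda:12.4.0-base-ubuntu22.04
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1        # one vGPU for the second container
              nvidia.com/gpumem: 2000
```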
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host
Environment:
- HAMi version: 2.5.0
- nvidia driver or other AI device driver version: 550.54.15
- CUDA version: 12.4
- Kernel version (from `uname -a`): 4.19.0-240.23.36
- Others: