Description
A Deployment has been created with a single Pod containing two containers, both of which have requested GPU resources. However, the monitoring system is unable to collect relevant data, and the vgpu-monitor is reporting the following error:
There are 3 files under the device-plugin vgpu directory:
The related error code is here:
Here is the metrics output. Many metric labels are missing, such as Device_memory_desc_of_container and Device_utilization_desc_of_container.
What happened:
The monitor cannot collect metrics data.
What you expected to happen:
We can get correct metrics data from vgpu-monitor.
How to reproduce it (as minimally and precisely as possible):
Create a multi-container Pod that requests GPU resources (see the example manifest below).
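A minimal Deployment sketch along these lines reproduces the setup. The image, object names, and resource keys (nvidia.com/gpu, nvidia.com/gpumem) are assumptions based on a default HAMi configuration and should be adjusted to your cluster:

```yaml
# Hypothetical reproduction manifest: one Pod with two containers,
# each requesting vGPU resources through the HAMi device plugin.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vgpu-two-containers
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vgpu-two-containers
  template:
    metadata:
      labels:
        app: vgpu-two-containers
    spec:
      containers:
        - name: cuda-a
          image: nvidia/cuda:12.4.0-base-ubuntu22.04
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1        # one vGPU for the first container
              nvidia.com/gpumem: 2000  # device memory limit in MiB (optional)
        - name: cuda-b
          image: nvidia/cuda:12.4.0-base-ubuntu22.04
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1        # one vGPU for the second container
              nvidia.com/gpumem: 2000
```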
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host
Environment:
- HAMi version: 2.5.0
- nvidia driver or other AI device driver version: 550.54.15
- CUDA version: 12.4
- Kernel version (from `uname -a`): 4.19.0-240.23.36
- Others: