Description
The DCGM exporter changed how it sets the Hostname label in v4. The Grafana dashboard exported by the hardware-observer relies on this label. If its value is localhost (currently the case for the metrics exported by the DCGM snap v4), the dashboard breaks and the graphs show no data.
Problem
The problem is that we always set localhost as the host for the dcgm-exporter to connect to the nv-hostengine service. Reference.
The upstream dcgm-exporter sets the Hostname label to the host passed through the -r argument. See this and this.
Because we always set the argument when starting the exporter, the label will always be set to localhost.
Solution 1
I propose adding a new boolean snap config option, dcgm-exporter-use-hostname. When set, the exporter connects to the output of the hostname command instead of localhost, i.e.:
    if [ -n "$nv_hostengine_port" ]; then
        if [ "$dcgm_exporter_use_hostname" = "true" ]; then
            args+=("-r" "$(hostname):$nv_hostengine_port")
        else
            args+=("-r" "localhost:$nv_hostengine_port")
        fi
    fi
The dcgm-exporter service will still be able to connect to the nv-hostengine service, since the hostname resolves to the local machine in this case, and the label will get the correct value, allowing the dashboard to work correctly.
HOWEVER, this will only work if there is an entry in /etc/hosts:
    ...
    127.0.0.1 {hostname}
    ...
On my local machine, I have this entry, but on the swob machine from Testflinger, there are only the following entries:
    127.0.1.1 swob.maas swob
    127.0.0.1 localhost
    ...
The exporter will fail to start on that machine until I manually add the entry.
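One way to keep the exporter from failing on machines like swob would be to check how the hostname resolves before passing it to -r. The following is only a minimal sketch, assuming (as the observation above suggests) that the hostname has to map to exactly 127.0.0.1 for the connection to succeed. The helper names resolves_to_loopback and pick_remote_host are hypothetical and not part of the snap.

```shell
#!/usr/bin/env bash
# Sketch only: choose the host part of the -r argument for dcgm-exporter.
# resolves_to_loopback and pick_remote_host are hypothetical helper names.

# Succeeds if one of the IPv4 addresses returned for the name is 127.0.0.1.
# Assumption: an address such as 127.0.1.1 is not enough, per the swob case.
resolves_to_loopback() {
    local name="$1"
    getent ahostsv4 "$name" 2>/dev/null \
        | awk '$1 == "127.0.0.1" { found = 1 } END { exit !found }'
}

# Prints the machine hostname when the option is "true" and the hostname
# maps to 127.0.0.1; otherwise prints "localhost" (the current behaviour).
pick_remote_host() {
    local use_hostname="$1" host=""
    if [ "$use_hostname" = "true" ]; then
        host="$(hostname 2>/dev/null)" || host=""
        if [ -n "$host" ] && resolves_to_loopback "$host"; then
            printf '%s\n' "$host"
            return 0
        fi
    fi
    printf '%s\n' "localhost"
}

# Possible use inside the existing block:
#   args+=("-r" "$(pick_remote_host "$dcgm_exporter_use_hostname"):$nv_hostengine_port")
```

With this fallback the exporter would still start on a machine without the /etc/hosts entry, but the label would remain localhost there, so the dashboard problem would not be solved on such hosts.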
Solution 2
Use juju_instance to filter by host in the GPU dashboard. I am not completely sure how it works or whether Juju will set the correct labels; in my testing environments, I only see the label set to localhost:9400.