No GPU metrics are being collected #64

@przemeklal

Description

Logs:

2025-08-11T10:20:23Z systemd[1]: Started Service for snap application dcgm.dcgm-exporter.
2025-08-11T10:20:23Z nv-hostengine[1936766]: DCGM initialized
2025-08-11T10:20:23Z dcgm.nv-hostengine[1936766]: Started host engine version 3.3.8 using port number: 5555
2025-08-11T10:20:23Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:23Z" level=info msg="Starting dcgm-exporter"
2025-08-11T10:20:23Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:23Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
2025-08-11T10:20:23Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:23Z" level=info msg="DCGM successfully initialized!"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Falling back to metric file '/var/snap/dcgm/common/dcgm_metrics.csv'"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 6 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 7 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 48 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 49 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 50 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 51 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 52 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 53 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 54 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 56 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 57 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Initializing system entities of type: GPU"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Initializing system entities of type: NvSwitch"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting NvSwitch metrics; no switches to monitor"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Initializing system entities of type: NvLink"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting NvLink metrics; no switches to monitor"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Initializing system entities of type: CPU"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Initializing system entities of type: CPU Core"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Starting webserver"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Pipeline starting"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Listening on" address="[::]:9400"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
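For triage, the "Not collecting ..." messages in the log above can be pulled out and grouped by entity type. The following is a minimal sketch, not part of dcgm-exporter: the function name `not_collecting_reasons` and the `SAMPLE` constant are hypothetical, and the regular expressions assume the `msg="..."` layout seen in the lines quoted in this report.

```python
import re

# Extracts the msg="..." field from a log line (format as seen in this report).
MSG_RE = re.compile(r'msg="((?:[^"\\]|\\.)*)"')
# Splits 'Not collecting <entity> metrics[;:] <reason>' into its two parts.
REASON_RE = re.compile(r'^Not collecting (.+?) metrics[;:]\s*(.*)$')

# Two lines copied from the log in this report.
SAMPLE = """\
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
"""


def not_collecting_reasons(log_text):
    """Return a list of (entity, reason) pairs, one per 'Not collecting' message."""
    pairs = []
    for line in log_text.splitlines():
        msg_match = MSG_RE.search(line)
        if not msg_match:
            continue  # line has no msg="..." field
        reason_match = REASON_RE.match(msg_match.group(1))
        if reason_match:
            pairs.append((reason_match.group(1), reason_match.group(2)))
    return pairs


if __name__ == "__main__":
    for entity, reason in not_collecting_reasons(SAMPLE):
        print(f"{entity}: {reason}")
```

Feeding it the full log would list every entity type that was skipped next to the reason the exporter gave.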

Version:

dcgm                 3.3.8+snap-96ac85fd53   56     latest/edge    canonical✓  -

No metrics:

# curl localhost:9400/metrics
<empty output>

GPU:

4e:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
62:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
c9:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
de:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)

Ubuntu:

# cat /etc/os-release 
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

NVIDIA packages:

# dpkg -l | grep nvidia | grep ii
ii  libnvidia-compute-570-server:amd64     570.86.15-0ubuntu0.22.04.4              amd64        NVIDIA libcompute package
ii  libnvidia-container-tools              1.17.4-1                                amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64             1.17.4-1                                amd64        NVIDIA container runtime library
ii  nvidia-container-toolkit               1.17.4-1                                amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base          1.17.4-1                                amd64        NVIDIA Container Toolkit Base
ii  nvidia-utils-570-server                570.86.15-0ubuntu0.22.04.4              amd64        NVIDIA Server Driver support binaries

nvidia-smi output:

# nvidia-smi
Mon Aug 11 13:15:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:4E:00.0 Off |                    0 |
| N/A   78C    P0            151W /  350W |   35619MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:62:00.0 Off |                    0 |
| N/A   66C    P0            124W /  350W |   34375MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:C9:00.0 Off |                    0 |
| N/A   55C    P0            111W /  350W |   36863MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    Off |   00000000:DE:00.0 Off |                    0 |
| N/A   60C    P0            117W /  350W |   34375MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          779064      C   /usr/local/bin/python3.11              1240MiB |
|    0   N/A  N/A         2792393      C   /usr/src/.venv/bin/python3            34368MiB |
|    1   N/A  N/A         2792403      C   /usr/src/.venv/bin/python3            34368MiB |
|    2   N/A  N/A          807387      C   /usr/local/bin/python3.11              2484MiB |
|    2   N/A  N/A         2792474      C   /usr/src/.venv/bin/python3            34368MiB |
|    3   N/A  N/A         2792544      C   /usr/src/.venv/bin/python3            34368MiB |
+-----------------------------------------------------------------------------------------+
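Since nvidia-smi works on the host while the exporter reports that NVML doesn't exist, one possible cross-check is to load the NVML shared library directly and ask it for the device count. This is a rough sketch under assumptions: the helper name `nvml_device_count` is hypothetical, the library is assumed to be installed under the usual name `libnvidia-ml.so.1`, and a result obtained on the host says nothing certain about what is visible from inside the snap's confinement.

```python
import ctypes


def nvml_device_count(lib_name="libnvidia-ml.so.1"):
    """Return the GPU count reported by NVML, or None if NVML is unusable."""
    try:
        # Fails with OSError when the dynamic loader cannot find the library.
        nvml = ctypes.CDLL(lib_name)
    except OSError:
        return None
    # NVML functions return 0 (NVML_SUCCESS) on success.
    if nvml.nvmlInit_v2() != 0:
        return None
    try:
        count = ctypes.c_uint(0)
        if nvml.nvmlDeviceGetCount_v2(ctypes.byref(count)) != 0:
            return None
        return count.value
    finally:
        nvml.nvmlShutdown()


if __name__ == "__main__":
    count = nvml_device_count()
    if count is None:
        print("NVML could not be loaded or initialized")
    else:
        print(f"NVML reports {count} device(s)")
```

If this reports a device count on the host, comparing it with the same check run in the environment of the dcgm snap might help narrow down where the library lookup fails.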
