Logs:
2025-08-11T10:20:23Z systemd[1]: Started Service for snap application dcgm.dcgm-exporter.
2025-08-11T10:20:23Z nv-hostengine[1936766]: DCGM initialized
2025-08-11T10:20:23Z dcgm.nv-hostengine[1936766]: Started host engine version 3.3.8 using port number: 5555
2025-08-11T10:20:23Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:23Z" level=info msg="Starting dcgm-exporter"
2025-08-11T10:20:23Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:23Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
2025-08-11T10:20:23Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:23Z" level=info msg="DCGM successfully initialized!"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Falling back to metric file '/var/snap/dcgm/common/dcgm_metrics.csv'"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 6 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 7 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 48 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 49 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 50 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 51 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 52 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 53 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 54 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 56 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=warning msg="Skipping line 57 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Initializing system entities of type: GPU"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Initializing system entities of type: NvSwitch"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting NvSwitch metrics; no switches to monitor"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Initializing system entities of type: NvLink"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Not collecting NvLink metrics; no switches to monitor"
2025-08-11T10:20:24Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:24Z" level=info msg="Initializing system entities of type: CPU"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Initializing system entities of type: CPU Core"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Starting webserver"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Pipeline starting"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="Listening on" address="[::]:9400"
2025-08-11T10:20:25Z dcgm.dcgm-exporter[1936877]: time="2025-08-11T10:20:25Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
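The warnings above list DCGM_FI_PROF_PCIE_TX_BYTES and DCGM_FI_PROF_PCIE_RX_BYTES twice (lines 6/7 and again 56/57). That suggests the fallback CSV contains duplicate entries. Below is a minimal Python sketch that pulls the skipped fields out of the exporter log and flags duplicates. It assumes the log text is available as a string, for example from `snap logs -n=all dcgm.dcgm-exporter`. The helper names are hypothetical.

```python
import re
from collections import defaultdict

# Matches the warning dcgm-exporter logs for each CSV entry it skips, e.g.
#   msg="Skipping line 6 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
SKIP_RE = re.compile(r"Skipping line (\d+) \('([A-Z0-9_]+)'\)")


def skipped_metrics(log_text: str) -> list[tuple[int, str]]:
    """Return (csv_line, field_name) for every 'Skipping line' warning."""
    return [(int(line_no), name) for line_no, name in SKIP_RE.findall(log_text)]


def duplicated_fields(pairs: list[tuple[int, str]]) -> dict[str, list[int]]:
    """Map each field name skipped more than once to its CSV line numbers."""
    lines_by_field: dict[str, list[int]] = defaultdict(list)
    for line_no, name in pairs:
        lines_by_field[name].append(line_no)
    return {name: nums for name, nums in lines_by_field.items() if len(nums) > 1}
```

This only accounts for the skipped profiling (DCP) fields. It does not explain the empty /metrics output, because the log separately reports that GPU metrics are not being collected at all.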
Version:
dcgm 3.3.8+snap-96ac85fd53 56 latest/edge canonical✓ -
No metrics:
# curl localhost:9400/metrics
<empty output>
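The log shows the web server listening on [::]:9400, so an empty body means the endpoint is reachable but no collector is producing samples. Below is a small sketch that uses only the Python standard library. It separates two cases: "exporter unreachable", where urlopen raises, and "exporter up, zero DCGM samples". The URL is taken from the log, and the function names are hypothetical.

```python
import urllib.request

# Listen address taken from the exporter log above ([::]:9400).
METRICS_URL = "http://localhost:9400/metrics"


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text exposition served by the exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def count_dcgm_samples(exposition: str) -> int:
    """Count sample lines whose metric name starts with DCGM_.

    '# HELP' and '# TYPE' lines start with '#', so they are not counted.
    """
    return sum(1 for line in exposition.splitlines() if line.startswith("DCGM_"))
```

On this host, `count_dcgm_samples(fetch_metrics())` would be expected to return 0, which matches the empty curl output above.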
GPU:
4e:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
62:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
c9:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
de:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
Ubuntu:
# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
NVIDIA packages:
# dpkg -l | grep nvidia | grep ii
ii libnvidia-compute-570-server:amd64 570.86.15-0ubuntu0.22.04.4 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.17.4-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.17.4-1 amd64 NVIDIA container runtime library
ii nvidia-container-toolkit 1.17.4-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.17.4-1 amd64 NVIDIA Container Toolkit Base
ii nvidia-utils-570-server 570.86.15-0ubuntu0.22.04.4 amd64 NVIDIA Server Driver support binaries
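The exporter reports that NVML "doesn't exist on this system". However, nvidia-smi, which is built on NVML, works on the host with driver 570.86.15. A quick check is whether the NVML shared library can be loaded and initialised. Below is a sketch using Python's ctypes. It assumes the library name `libnvidia-ml.so.1`. `nvmlInit_v2`, `nvmlDeviceGetCount_v2` and `nvmlShutdown` are NVML C functions, while `nvml_status` is a hypothetical helper.

```python
import ctypes

NVML_SUCCESS = 0  # nvmlReturn_t value for success in nvml.h


def nvml_status(libname: str = "libnvidia-ml.so.1") -> str:
    """Try to load NVML and count GPUs; return a human-readable status line."""
    try:
        nvml = ctypes.CDLL(libname)
    except OSError as exc:
        # The dynamic loader could not find or open the library.
        return f"cannot load {libname}: {exc}"

    rc = nvml.nvmlInit_v2()
    if rc != NVML_SUCCESS:
        return f"nvmlInit_v2 failed with nvmlReturn_t {rc}"

    count = ctypes.c_uint(0)
    rc = nvml.nvmlDeviceGetCount_v2(ctypes.byref(count))
    nvml.nvmlShutdown()
    if rc != NVML_SUCCESS:
        return f"nvmlDeviceGetCount_v2 failed with nvmlReturn_t {rc}"
    return f"NVML OK: {count.value} device(s)"
```

The error is relayed through DCGM, so the process that actually needs NVML is presumably nv-hostengine (the exporter connects to it at localhost:5555). The snap is confined and may resolve libraries differently from the host. A success when this runs on the host therefore does not, by itself, show that nv-hostengine inside the snap can load the library.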
nvidia-smi output:
# nvidia-smi
Mon Aug 11 13:15:05 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:4E:00.0 Off |                    0 |
| N/A   78C    P0            151W /  350W |   35619MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:62:00.0 Off |                    0 |
| N/A   66C    P0            124W /  350W |   34375MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:C9:00.0 Off |                    0 |
| N/A   55C    P0            111W /  350W |   36863MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    Off |   00000000:DE:00.0 Off |                    0 |
| N/A   60C    P0            117W /  350W |   34375MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          779064      C   /usr/local/bin/python3.11              1240MiB |
|    0   N/A  N/A         2792393      C   /usr/src/.venv/bin/python3            34368MiB |
|    1   N/A  N/A         2792403      C   /usr/src/.venv/bin/python3            34368MiB |
|    2   N/A  N/A          807387      C   /usr/local/bin/python3.11              2484MiB |
|    2   N/A  N/A         2792474      C   /usr/src/.venv/bin/python3            34368MiB |
|    3   N/A  N/A         2792544      C   /usr/src/.venv/bin/python3            34368MiB |
+-----------------------------------------------------------------------------------------+