What is the version?
4.4.1-4.5.2 & 4.5.2-4.8.1
What happened?
When curl-ing the dcgm-exporter's metrics endpoint, we get no results for DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE, or DCGM_FI_DEV_XID_ERRORS.
# curl localhost:5000/metrics | grep DCGM_FI_DEV_RETIRED_SBE
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69847    0 69847    0     0  2490k      0 --:--:-- --:--:-- --:--:-- 2526k
# curl localhost:5000/metrics | grep DCGM_FI_DEV_RETIRED_DBE
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69850    0 69850    0     0  1420k      0 --:--:-- --:--:-- --:--:-- 1451k
# curl localhost:5000/metrics | grep DCGM_FI_DEV_XID_ERRORS
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69850    0 69850    0     0  2651k      0 --:--:-- --:--:-- --:--:-- 2728k
NOTE: we verified that the other metrics we request are reported as expected.
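For what it's worth, a quick way to check whether DCGM itself returns these fields (a minimal sketch, assuming dcgmi is available and that field IDs 230 = XID_ERRORS, 390 = RETIRED_SBE, 391 = RETIRED_DBE match your dcgm_fields.h):

# Ask nv-hostengine directly for the three fields, one sample
# (field IDs assumed from dcgm_fields.h: 230, 390, 391)
dcgmi dmon -e 230,390,391 -c 1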
What did you expect to happen?
When curl-ing our metrics endpoint, we expect to see a value for each of the missing metrics for every GPU.
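Concretely, this is the shape of output we expect (label values here are illustrative; the real label set should match the other DCGM_FI_* metrics the exporter already emits for us):

curl -s localhost:5000/metrics | grep -E 'DCGM_FI_DEV_(RETIRED_SBE|RETIRED_DBE|XID_ERRORS)'
# Expected, one line per GPU (values and labels illustrative):
# DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-xxxx",device="nvidia0",modelName="NVIDIA A30",Hostname="internal-a30-1x"} 0
# DCGM_FI_DEV_RETIRED_SBE{gpu="0",UUID="GPU-xxxx",device="nvidia0",modelName="NVIDIA A30",Hostname="internal-a30-1x"} 0
# DCGM_FI_DEV_RETIRED_DBE{gpu="0",UUID="GPU-xxxx",device="nvidia0",modelName="NVIDIA A30",Hostname="internal-a30-1x"} 0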
What is the GPU model?
We've seen this issue on:
- internal-a30-1x
- H100 with 8 GPUs configured
output of nvidia-smi for internal-a30-1x:
nvidia-smi
Wed Mar 25 19:21:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A30 On | 00000000:83:00.0 Off | 0 |
| N/A 30C P0 30W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
output of nvidia-smi for the VM running h100x8:
nvidia-smi
Wed Mar 25 22:52:35 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:00:0A.0 Off | 0 |
| N/A 58C P0 80W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:00:0B.0 Off | 0 |
| N/A 29C P0 72W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:00:0C.0 Off | 0 |
| N/A 29C P0 71W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:00:0D.0 Off | 0 |
| N/A 32C P0 74W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:00:0E.0 Off | 0 |
| N/A 55C P0 75W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:00:0F.0 Off | 0 |
| N/A 30C P0 70W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:00:10.0 Off | 0 |
| N/A 31C P0 73W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:00:11.0 Off | 0 |
| N/A 29C P0 70W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
What is the environment?
We've seen this in both Kubernetes pod and virtual machine environments.
How did you deploy the dcgm-exporter and what is the configuration?
For our Kubernetes environment: we deploy via the nvidia-dcgm Helm chart and configure which metrics to expose with a Kubernetes ConfigMap.
For virtual machines, we build dcgm-exporter from source and configure the metrics to expose via a CSV file passed to dcgm-exporter with the -f argument.
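For reference, a minimal sketch of how the metrics CSV ends up in the Kubernetes deployment (the ConfigMap name and namespace here are illustrative, not the exact values from our chart configuration):

# Create/update the ConfigMap holding the metrics CSV, then reference it
# from the chart values (names and namespace are illustrative)
kubectl create configmap dcgm-metrics \
  --from-file=dcgm-metrics.csv \
  -n gpu-operator \
  --dry-run=client -o yaml | kubectl apply -f -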
How to reproduce the issue?
- Install dcgm-exporter (we reproduced the issue with both the 4.4.1-4.5.2 and 4.5.2-4.8.1 versions)
- Run it like so:
/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcgm-metrics.csv -c 1000 -a 127.0.0.1:5000
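A quick way to compare what the CSV asks for against what the endpoint actually serves (a rough sketch; it assumes the field names in the CSV appear verbatim as metric names in the exposition):

# Field names requested in the CSV (first column, comments and blanks skipped)
awk -F, '/^[[:space:]]*DCGM_FI/ {gsub(/[[:space:]]/, "", $1); print $1}' \
  /etc/dcgm-exporter/dcgm-metrics.csv | sort > requested.txt
# Metric names actually exposed by the exporter
curl -s localhost:5000/metrics | grep -oE '^DCGM_FI_[A-Z_]+' | sort -u > exposed.txt
# Fields requested but not exposed
comm -23 requested.txt exposed.txt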
I've also attached the dcgm-metrics.csv config used to expose the metrics we want to see:
dcgm-metrics.csv
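Since the attachment isn't inline, the rows in question presumably look like the following (format per the exporter's default-counters.csv: DCGM field, Prometheus metric type, help text; the types shown are our assumption):

# Hypothetical excerpt of /etc/dcgm-exporter/dcgm-metrics.csv
cat <<'EOF'
DCGM_FI_DEV_XID_ERRORS,  gauge,   Value of the last XID error encountered.
DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
EOF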
Anything else we need to know?
No response