Missing metrics DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE and DCGM_FI_DEV_XID_ERRORS #646

@aliya-do

Description

What is the version?

4.4.1-4.5.2 & 4.5.2-4.8.1

What happened?

When curling the dcgm-exporter's metrics endpoint, we get no results for DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE, or DCGM_FI_DEV_XID_ERRORS.

# curl localhost:5000/metrics | grep DCGM_FI_DEV_RETIRED_SBE
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69847    0 69847    0     0  2490k      0 --:--:-- --:--:-- --:--:-- 2526k
# curl localhost:5000/metrics | grep DCGM_FI_DEV_RETIRED_DBE
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69850    0 69850    0     0  1420k      0 --:--:-- --:--:-- --:--:-- 1451k
# curl localhost:5000/metrics | grep DCGM_FI_DEV_XID_ERRORS
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69850    0 69850    0     0  2651k      0 --:--:-- --:--:-- --:--:-- 2728k

NOTE: we verified that the other metrics we expect are exposed as expected.
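
As a cross-check (not shown above), these fields can also be queried directly from DCGM with dcgmi dmon. The field IDs below are our assumption based on dcgm_fields.h (230 = DCGM_FI_DEV_XID_ERRORS, 390 = DCGM_FI_DEV_RETIRED_SBE, 391 = DCGM_FI_DEV_RETIRED_DBE):

dcgmi dmon -e 230,390,391 -c 1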

What did you expect to happen?

When curling our metrics endpoint, we expect to see a value for each of these metrics for each GPU.
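
For example, for GPU 0 we would expect lines roughly like the following from the /metrics endpoint (the label set is illustrative of what dcgm-exporter emits for our other metrics, not copied from this host):

DCGM_FI_DEV_RETIRED_SBE{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA A30",Hostname="..."} 0
DCGM_FI_DEV_RETIRED_DBE{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA A30",Hostname="..."} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA A30",Hostname="..."} 0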

What is the GPU model?

We've seen this issue on:

  • internal-a30-1x
  • H100 with 8 GPUs configured

Output of nvidia-smi for internal-a30-1x:

nvidia-smi
Wed Mar 25 19:21:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     On  |   00000000:83:00.0 Off |                    0 |
| N/A   30C    P0             30W /  165W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Output of nvidia-smi for the VM running h100x8:

nvidia-smi
Wed Mar 25 22:52:35 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:00:0A.0 Off |                    0 |
| N/A   58C    P0             80W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:00:0B.0 Off |                    0 |
| N/A   29C    P0             72W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:00:0C.0 Off |                    0 |
| N/A   29C    P0             71W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:00:0D.0 Off |                    0 |
| N/A   32C    P0             74W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:00:0E.0 Off |                    0 |
| N/A   55C    P0             75W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:00:0F.0 Off |                    0 |
| N/A   30C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:00:10.0 Off |                    0 |
| N/A   31C    P0             73W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:00:11.0 Off |                    0 |
| N/A   29C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

What is the environment?

We've seen this issue in both Kubernetes pod and virtual machine environments.

How did you deploy the dcgm-exporter and what is the configuration?

For our Kubernetes environment, we deploy the nvidia-dcgm Helm chart and configure which metrics to expose with a Kubernetes ConfigMap.

For virtual machines, we build dcgm-exporter from source and configure the metrics to expose via a CSV file that is passed to dcgm-exporter with the -f argument.
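
For reference, the entries for these three fields follow dcgm-exporter's usual CSV format (DCGM field name, Prometheus metric type, help text). The lines below mirror the wording used in dcgm-exporter's bundled default-counters.csv and may differ slightly from our attached file:

DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single bit errors.
DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double bit errors.
DCGM_FI_DEV_XID_ERRORS,  gauge,   Value of the last XID error encountered.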

How to reproduce the issue?

  • Install dcgm-exporter (we reproduced the issue with both the 4.4.1-4.5.2 and 4.5.2-4.8.1 versions).
  • Run it like so:
/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcgm-metrics.csv -c 1000 -a 127.0.0.1:5000
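  • Scrape the endpoint and grep for the three fields; no matches are returned (the -s flag just suppresses the curl progress meter shown above):
curl -s 127.0.0.1:5000/metrics | grep -E 'DCGM_FI_DEV_RETIRED_SBE|DCGM_FI_DEV_RETIRED_DBE|DCGM_FI_DEV_XID_ERRORS'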

I've also attached the dcgm-metrics.csv config used to expose the metrics we want to see.

dcgm-metrics.csv

Anything else we need to know?

No response
