Missing metrics DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE and DCGM_FI_DEV_XID_ERRORS #646

@aliya-do

Description

What is the version?

4.4.1-4.5.2 & 4.5.2-4.8.1

What happened?

When curling the dcgm-exporter's metrics endpoint, we get no results for DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE, or DCGM_FI_DEV_XID_ERRORS.

# curl localhost:5000/metrics | grep DCGM_FI_DEV_RETIRED_SBE
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69847    0 69847    0     0  2490k      0 --:--:-- --:--:-- --:--:-- 2526k
# curl localhost:5000/metrics | grep DCGM_FI_DEV_RETIRED_DBE
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69850    0 69850    0     0  1420k      0 --:--:-- --:--:-- --:--:-- 1451k
# curl localhost:5000/metrics | grep DCGM_FI_DEV_XID_ERRORS
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69850    0 69850    0     0  2651k      0 --:--:-- --:--:-- --:--:-- 2728k

NOTE: we verified that the other metrics we expect are exposed as expected.
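
As a cross-check (not shown above), these fields can also be queried directly from DCGM with dcgmi dmon. The field IDs below are our assumption based on dcgm_fields.h (230 = DCGM_FI_DEV_XID_ERRORS, 390 = DCGM_FI_DEV_RETIRED_SBE, 391 = DCGM_FI_DEV_RETIRED_DBE):

dcgmi dmon -e 230,390,391 -c 1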

What did you expect to happen?

When curling our metrics endpoint, we expect to see a value for each of these metrics for each GPU.
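
For example, for GPU 0 we would expect lines roughly like the following from the /metrics endpoint (the label set is illustrative of what dcgm-exporter emits for our other metrics, not copied from this host):

DCGM_FI_DEV_RETIRED_SBE{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA A30",Hostname="..."} 0
DCGM_FI_DEV_RETIRED_DBE{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA A30",Hostname="..."} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA A30",Hostname="..."} 0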

What is the GPU model?

We've seen this issue on:

  • internal-a30-1x
  • H100 with 8 GPUs configured

Output of nvidia-smi for internal-a30-1x:

nvidia-smi
Wed Mar 25 19:21:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     On  |   00000000:83:00.0 Off |                    0 |
| N/A   30C    P0             30W /  165W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Output of nvidia-smi for the VM running h100x8:

nvidia-smi
Wed Mar 25 22:52:35 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:00:0A.0 Off |                    0 |
| N/A   58C    P0             80W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:00:0B.0 Off |                    0 |
| N/A   29C    P0             72W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:00:0C.0 Off |                    0 |
| N/A   29C    P0             71W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:00:0D.0 Off |                    0 |
| N/A   32C    P0             74W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:00:0E.0 Off |                    0 |
| N/A   55C    P0             75W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:00:0F.0 Off |                    0 |
| N/A   30C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:00:10.0 Off |                    0 |
| N/A   31C    P0             73W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:00:11.0 Off |                    0 |
| N/A   29C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

What is the environment?

We've seen this issue in both Kubernetes pod and virtual machine environments.

How did you deploy the dcgm-exporter and what is the configuration?

For our Kubernetes environment, we deploy the nvidia-dcgm Helm chart and configure which metrics to expose with a Kubernetes ConfigMap.

For virtual machines, we build dcgm-exporter from source and configure the metrics to expose via a CSV file that is passed to dcgm-exporter with the -f argument.
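
For reference, the entries for these three fields follow dcgm-exporter's usual CSV format (DCGM field name, Prometheus metric type, help text). The lines below mirror the wording used in dcgm-exporter's bundled default-counters.csv and may differ slightly from our attached file:

DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single bit errors.
DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double bit errors.
DCGM_FI_DEV_XID_ERRORS,  gauge,   Value of the last XID error encountered.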

How to reproduce the issue?

  • Install dcgm-exporter (we reproduced the issue with both the 4.4.1-4.5.2 and 4.5.2-4.8.1 versions).
  • Run it like so:
/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcgm-metrics.csv -c 1000 -a 127.0.0.1:5000
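  • Scrape the endpoint and grep for the three fields; no matches are returned (the -s flag just suppresses the curl progress meter shown above):
curl -s 127.0.0.1:5000/metrics | grep -E 'DCGM_FI_DEV_RETIRED_SBE|DCGM_FI_DEV_RETIRED_DBE|DCGM_FI_DEV_XID_ERRORS'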

I've also attached the dcgm-metrics.csv config used to expose the metrics we want to see.

dcgm-metrics.csv

Anything else we need to know?

No response
