bug(GPU): DCGM metrics not fully reporting on G6/L4 GPUs on AL2023

**What happened**:

After we migrated to AL2023, DCGM stopped reporting the per-process max GPU memory metric. The problem now remains _only_ on G6 (L4 GPU) instances: after we upgraded some of our DCGM libraries on the newer NVIDIA driver, the metric was reported again for T4 and A10 GPUs. On L4, `dcgmi` still shows `Max GPU Memory Used (bytes)` as 0 for a running process:

```text
Successfully retrieved process info for PID: 71081. Process ran on 1 GPUs.
+------------------------------------------------------------------------------+
| GPU ID: 2                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                     *   | Fri Dec 12 00:11:25 2025                |
| End Time                       *   | Still Running                           |
| Total Execution Time (sec)     *   | Still Running                           |
| No. of Conflicting Processes   *   | 0                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 27154                                   |
| Max GPU Memory Used (bytes)    *   | 0                                       |  <------
| SM Clock (MHz)                     | Avg: 1602, Max: 2040, Min: 1440         |
| Memory Clock (MHz)                 | Avg: 6251, Max: 6251, Min: 6251         |
| SM Utilization (%)                 | Avg: 91, Max: 100, Min: 70              |
| Memory Utilization (%)             | Avg: 23, Max: 100, Min: 0               |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 3.02787e-06                             |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | 0                                       |
|        - Board Limit (%)           | 0                                       |
|        - Low Utilization (%)       | 0                                       |
|        - Sync Boost (%)            | 0                                       |
+-----  Process Utilization  --------+-----------------------------------------+
| PID                                | 71081                                   |
|     Avg SM Utilization (%)         | 90                                      |
|     Avg Memory Utilization (%)     | 32                                      |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+
```

**What you expected to happen**:

`Max GPU Memory Used (bytes)` should report a non-zero value in the DCGM process stats for a running CUDA process, as it does on T4 and A10 GPUs.
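
One way to confirm that the process really holds GPU memory (so that a 0 from DCGM is wrong rather than accurate) is to cross-check against `nvidia-smi`. The following is a minimal sketch, not part of our tooling: the function names are hypothetical, and it assumes `nvidia-smi` is on `PATH` and reports `used_memory` in MiB.

```python
import subprocess

# nvidia-smi query for per-process GPU memory (values in MiB with `nounits`).
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Map PID -> total used GPU memory in MiB, summed across GPUs.

    Rows whose fields are not plain integers (for example `[N/A]`) are skipped.
    """
    usage = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        pid, mib = int(parts[0]), int(parts[1])
        usage[pid] = usage.get(pid, 0) + mib
    return usage


def gpu_memory_by_pid():
    """Run nvidia-smi on this node and return the parsed PID -> MiB mapping."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

If `gpu_memory_by_pid()` shows a non-zero value for the PID while `dcgmi stats -p <pid>` shows 0, the gap is on the DCGM side rather than in the workload.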

**How to reproduce it (as minimally and precisely as possible)**:

1. Launch the 1.32 GPU AMI.
2. Install the DCGM packages as described by NVIDIA.
3. Launch a CUDA process.
4. Use `dcgmi` to report metrics from that process: `dcgmi stats -p <pid>`.
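
The last step can be automated so the check is easy to repeat across instance types. This is a rough sketch under a few assumptions: the helper names are hypothetical, the regular expression is based only on the table layout shown in the output above, and `dcgmi` is assumed to be on `PATH`.

```python
import re
import subprocess
from typing import List, Optional

# Matches the "Max GPU Memory Used (bytes)" row as printed in the output above.
_MAX_MEM_ROW = re.compile(
    r"^\|\s*Max GPU Memory Used \(bytes\)\s*\*?\s*\|\s*(\S+)\s*\|",
    re.MULTILINE,
)


def max_gpu_memory_bytes(dcgmi_output: str) -> List[Optional[int]]:
    """Return one entry per matching row; None when the value is not an integer."""
    values = []
    for raw in _MAX_MEM_ROW.findall(dcgmi_output):
        values.append(int(raw) if raw.isdigit() else None)
    return values


def is_reporting(dcgmi_output: str) -> bool:
    """True only if at least one row was found and every row is non-zero."""
    values = max_gpu_memory_bytes(dcgmi_output)
    return bool(values) and all(v is not None and v > 0 for v in values)


def check_pid(pid: int) -> bool:
    """Run `dcgmi stats -p <pid>` and report whether max GPU memory is populated."""
    result = subprocess.run(
        ["dcgmi", "stats", "-p", str(pid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return is_reporting(result.stdout)
```

Calling `check_pid(71081)` on the affected G6 node would return `False` for the output pasted above, since the row parses to 0.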

**Environment**:
- AWS Region: us-east-1
- Instance Type(s): g6(s)
- Cluster Kubernetes version: 1.32
- Node Kubernetes version: 1.32
- AMI Version: v20251108

I also filed this with NVIDIA (issues linked below). It seems like something might be missing in the AL2023 AMI compared to the AL2 image, but that migration and the previous AMIs are black boxes to us.

https://github.com/NVIDIA/DCGM/issues/268
https://github.com/NVIDIA/go-dcgm/issues/106


bug(GPU): DCGM metrics not fully reporting on G6/L4 GPUs on AL2023 #2566
