
bug(GPU): DCGM metrics not fully reporting on G6/L4 GPUs on AL2023 #2566

@sidewinder12s

Description


What happened:

After we migrated to AL2023, DCGM stopped reporting the per-process max GPU memory metric. Once we upgraded some of our DCGM libraries on the newer NVIDIA driver, the metric was reported again for T4 and A10 GPUs, but on G6 (L4 GPU) instances it is still reported as 0:

Successfully retrieved process info for PID: 71081. Process ran on 1 GPUs.
+------------------------------------------------------------------------------+
| GPU ID: 2                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                     *   | Fri Dec 12 00:11:25 2025                |
| End Time                       *   | Still Running                           |
| Total Execution Time (sec)     *   | Still Running                           |
| No. of Conflicting Processes   *   | 0                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 27154                                   |
| Max GPU Memory Used (bytes)    *   | 0                                       |  <------
| SM Clock (MHz)                     | Avg: 1602, Max: 2040, Min: 1440         |
| Memory Clock (MHz)                 | Avg: 6251, Max: 6251, Min: 6251         |
| SM Utilization (%)                 | Avg: 91, Max: 100, Min: 70              |
| Memory Utilization (%)             | Avg: 23, Max: 100, Min: 0               |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 3.02787e-06                             |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | 0                                       |
|        - Board Limit (%)           | 0                                       |
|        - Low Utilization (%)       | 0                                       |
|        - Sync Boost (%)            | 0                                       |
+-----  Process Utilization  --------+-----------------------------------------+
| PID                                | 71081                                   |
|     Avg SM Utilization (%)         | 90                                      |
|     Avg Memory Utilization (%)     | 32                                      |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+

What you expected to happen:

"Max GPU Memory Used (bytes)" should report a non-zero value in the DCGM process metrics.

How to reproduce it (as minimally and precisely as possible):

Launch a g6 instance from the 1.32 GPU AMI
Install the DCGM packages as described by NVIDIA
Launch a CUDA process
Use dcgmi to report metrics for that process: dcgmi stats -p <pid>
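
To check many nodes for this symptom without reading the table by eye, the last step could be scripted. The following is a minimal sketch, not part of the original report: it assumes the `dcgmi stats -p <pid>` output keeps the table layout shown above, and the helper names (`run_dcgmi_stats`, `max_gpu_memory_values`, `is_affected`) are hypothetical.

```python
import re
import subprocess
from typing import List, Optional

# Matches the "Max GPU Memory Used (bytes)" row of the dcgmi table and
# captures the value column. Assumes the row layout shown in this issue.
ROW = re.compile(
    r"\|\s*Max GPU Memory Used \(bytes\)\s*\*?\s*\|\s*(\S+)\s*\|"
)


def run_dcgmi_stats(pid: int) -> str:
    """Run `dcgmi stats -p <pid>` (the command from the repro steps)."""
    result = subprocess.run(
        ["dcgmi", "stats", "-p", str(pid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def max_gpu_memory_values(dcgmi_output: str) -> List[Optional[int]]:
    """Return one value per GPU section; None if the cell is not a number."""
    values: List[Optional[int]] = []
    for match in ROW.finditer(dcgmi_output):
        raw = match.group(1)
        values.append(int(raw) if raw.isdigit() else None)
    return values


def is_affected(dcgmi_output: str) -> bool:
    """True when at least one row was found and every row reports 0."""
    values = max_gpu_memory_values(dcgmi_output)
    return bool(values) and all(v == 0 for v in values)
```

On a node with DCGM installed, `is_affected(run_dcgmi_stats(pid))` would return True for the output pasted above, since the only matching row reports 0.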

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): g6(s)
  • Cluster Kubernetes version: 1.32
  • Node Kubernetes version: 1.32
  • AMI Version: v20251108

I also filed this with NVIDIA (issues linked below). It seems like something might be missing from the AL2023 AMI compared to the AL2 image, but the migration and the previous AMIs are black boxes to us.

NVIDIA/DCGM#268
NVIDIA/go-dcgm#106
