What happened:
After we migrated to AL2023, DCGM stopped reporting the process max GPU memory metric, but only on G6 (L4 GPU) instances. After upgrading some of our DCGM libraries against the newer NVIDIA driver, the metric was reported again for T4 and A10 GPUs, but not on L4s. Example output from a G6 instance:
Successfully retrieved process info for PID: 71081. Process ran on 1 GPUs.
+------------------------------------------------------------------------------+
| GPU ID: 2 |
+====================================+=========================================+
|----- Execution Stats ------------+-----------------------------------------|
| Start Time * | Fri Dec 12 00:11:25 2025 |
| End Time * | Still Running |
| Total Execution Time (sec) * | Still Running |
| No. of Conflicting Processes * | 0 |
+----- Performance Stats ----------+-----------------------------------------+
| Energy Consumed (Joules) | 27154 |
| Max GPU Memory Used (bytes) * | 0 | <------
| SM Clock (MHz) | Avg: 1602, Max: 2040, Min: 1440 |
| Memory Clock (MHz) | Avg: 6251, Max: 6251, Min: 6251 |
| SM Utilization (%) | Avg: 91, Max: 100, Min: 70 |
| Memory Utilization (%) | Avg: 23, Max: 100, Min: 0 |
| PCIe Rx Bandwidth (megabytes) | Avg: N/A, Max: N/A, Min: N/A |
| PCIe Tx Bandwidth (megabytes) | Avg: N/A, Max: N/A, Min: N/A |
+----- Event Stats ----------------+-----------------------------------------+
| Double Bit ECC Errors | 0 |
| PCIe Replay Warnings | 0 |
| Critical XID Errors | 0 |
+----- Slowdown Stats -------------+-----------------------------------------+
| Due to - Power (%) | 3.02787e-06 |
| - Thermal (%) | 0 |
| - Reliability (%) | 0 |
| - Board Limit (%) | 0 |
| - Low Utilization (%) | 0 |
| - Sync Boost (%) | 0 |
+----- Process Utilization --------+-----------------------------------------+
| PID | 71081 |
| Avg SM Utilization (%) | 90 |
| Avg Memory Utilization (%) | 32 |
+----- Overall Health -------------+-----------------------------------------+
| Overall Health | Healthy |
+------------------------------------+-----------------------------------------+
What you expected to happen:
Max GPU Memory Used should be reported in the DCGM process stats, as it is on T4 and A10 GPUs.
How to reproduce it (as minimally and precisely as possible):
1. Launch a node from the 1.32 GPU AMI.
2. Install the DCGM packages as described by NVIDIA.
3. Launch a CUDA process.
4. Use dcgmi to report metrics for that process: `dcgmi stats -p <pid>` (a command sketch follows this list).
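For concreteness, here is a rough shell sketch of those steps on a single node. The package and service names follow NVIDIA's DCGM install docs for AL2023 and may differ by DCGM version; `train.py` is only a placeholder for any CUDA workload.

```bash
# Assumption: NVIDIA's CUDA/DCGM dnf repo is already configured on the node.
# The package name varies by DCGM major version (e.g. datacenter-gpu-manager
# vs datacenter-gpu-manager-4-*); use whatever NVIDIA's docs specify.
sudo dnf install -y datacenter-gpu-manager
sudo systemctl enable --now nvidia-dcgm

# Enable process/job stats watches in the DCGM hostengine.
dcgmi stats --enable

# Run any CUDA workload (placeholder), then query its per-process stats.
python3 train.py &
dcgmi stats --pid $!
```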
Environment:
- AWS Region: us-east-1
- Instance Type(s): g6(s)
- Cluster Kubernetes version: 1.32
- Node Kubernetes version: 1.32
- AMI Version: v20251108
We have also filed this with NVIDIA, but it seems something might be missing in the AL2023 AMI compared to the AL2 image; that migration and the previous AMIs are black boxes to us.
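One diagnostic that might help narrow this down (an assumption on my part, not something we have verified): dcgmi's per-process Max GPU Memory Used appears to rely on driver-level process accounting, so comparing the driver's own accounting output between an AL2 node and an AL2023 G6 node could show whether the gap is in the AMI/driver stack or in DCGM itself.

```bash
# Check whether per-process accounting is enabled in the driver.
nvidia-smi -q -d ACCOUNTING | grep -i "accounting mode"

# If it is disabled, enabling it (root required) and re-running the workload
# tests whether the 0 value comes from missing accounting data.
sudo nvidia-smi -am 1

# Driver-side view of the same per-process max memory statistic.
nvidia-smi --query-accounted-apps=pid,max_memory_usage --format=csv
```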