Description
Prerequisites
- I searched existing issues
- I can reproduce this issue
Feature Description
NVSentinel currently detects GPU failures via XID errors, ECC errors, and temperature thresholds, but there are two categories of silent performance degradation that generate neither XIDs nor health events:
1. PCIe Link Downtraining
When a PCIe link degrades (e.g., Gen5 x16 → Gen3 x8), GPU-to-host bandwidth drops sharply (roughly 64 GB/s down to 8 GB/s in that example). Common causes:
- Signal integrity issues
- Retimer failures
- Improper card seating after maintenance
Detection method: Query `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv,noheader` and compare the current link generation and width against their maximums.
Expected values:
- H100 (P5.48xlarge): Gen5 x16
- A100 (P4d.24xlarge): Gen4 x16
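A minimal sketch of the comparison logic, assuming the CSV output shape of the `nvidia-smi` query above (one line per GPU, `noheader` format); the sample data is illustrative, not captured from real hardware:

```python
def check_pcie_links(csv_output: str) -> list[str]:
    """Flag GPUs whose current PCIe gen/width is below the card's maximum.

    csv_output: output of
      nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,
                 pcie.link.width.current,pcie.link.width.max
                 --format=csv,noheader
    """
    warnings = []
    for idx, line in enumerate(csv_output.strip().splitlines()):
        gen_cur, gen_max, width_cur, width_max = (
            int(field.strip()) for field in line.split(",")
        )
        if gen_cur < gen_max or width_cur < width_max:
            warnings.append(
                f"GPU {idx}: PCIe link downtrained to Gen{gen_cur} x{width_cur} "
                f"(expected Gen{gen_max} x{width_max})"
            )
    return warnings

# Illustrative sample: GPU 0 healthy, GPU 1 downtrained to Gen3 x8
sample = "5, 5, 16, 16\n3, 5, 8, 16"
print(check_pcie_links(sample))
```

Comparing current against the per-GPU reported maximum (rather than a hard-coded table) keeps the check portable across instance types, while the expected values above serve as a sanity check.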
2. Silent Clock Throttling
GPUs can be throttled due to thermal, power, or hardware conditions without generating XID errors. This degrades training throughput silently.
Detection method: Compare `nvidia-smi --query-gpu=clocks.current.graphics,clocks.max.graphics` and check the `clocks_throttle_reasons.active` bitmask.
Important: Idle GPUs (reason `0x0000000000000001`) naturally downclock and should be excluded from alerts.
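A sketch of the throttle check with the idle exclusion, again assuming CSV `noheader,nounits` output of the query above; the 90% clock-ratio threshold is an assumption, not a value taken from NVSentinel:

```python
# NVML's "GPU idle" throttle-reason bit; when it is the only active
# reason, the downclock is benign and should not raise an alert.
IDLE_ONLY = 0x0000000000000001

def check_throttling(csv_output: str, ratio_threshold: float = 0.90) -> list[str]:
    """Flag GPUs running well below max clocks for a non-idle reason.

    csv_output: output of
      nvidia-smi --query-gpu=clocks.current.graphics,clocks.max.graphics,
                 clocks_throttle_reasons.active
                 --format=csv,noheader,nounits
    """
    warnings = []
    for idx, line in enumerate(csv_output.strip().splitlines()):
        cur_s, max_s, reasons_s = (field.strip() for field in line.split(","))
        cur, mx = int(cur_s), int(max_s)
        if int(reasons_s, 16) == IDLE_ONLY:
            continue  # idle downclock: expected, skip
        if mx > 0 and cur / mx < ratio_threshold:
            warnings.append(
                f"GPU {idx}: clocks {cur}/{mx} MHz, active reasons {reasons_s}"
            )
    return warnings

# Illustrative sample: GPU 0 idle (benign), GPU 1 throttled
sample = "210, 1410, 0x0000000000000001\n900, 1410, 0x0000000000000008"
print(check_throttling(sample))
```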
Proposed Solution
Add checks to an existing health monitor (or new monitor) that:
- Periodically queries PCIe link status and flags degradation
- Periodically queries clock ratios and flags non-idle throttling
- Generates health events that flow through the normal NVSentinel remediation pipeline
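To illustrate the last step, a hypothetical event envelope that a check could emit into the remediation pipeline; the field names and JSON schema here are assumptions for illustration only, since NVSentinel's actual health-event format is not shown in this issue:

```python
import json
import time

def make_health_event(gpu_index: int, check: str, detail: str) -> str:
    """Wrap a check finding in a minimal JSON health event.

    The schema (component/gpu/check/detail/timestamp) is hypothetical.
    """
    event = {
        "component": "health-monitor",
        "gpu": gpu_index,
        "check": check,        # e.g. "pcie-downtraining" or "clock-throttling"
        "detail": detail,
        "timestamp": int(time.time()),
    }
    return json.dumps(event)

# Example: wrap a PCIe downtraining finding as an event
print(make_health_event(1, "pcie-downtraining", "Gen3 x8, expected Gen5 x16"))
```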
Workaround
Our standalone fabric-manager-monitor DaemonSet implements both checks. Validated on P4d.24xlarge: all 8 GPUs correctly report Gen4 x16, and idle downclocking (210/1410 MHz) is correctly filtered as benign.
Source: https://github.com/dmvevents/nvsentinel-eks-deployment/tree/master/fabric-manager-monitor
Component
Health Monitor