[Feature]: PCIe link downtraining and GPU clock throttle detection #890

@dmvevents

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Feature Description

NVSentinel currently detects GPU failures via XID errors, ECC errors, and temperature thresholds. Two categories of silent performance degradation generate neither XIDs nor health events, so they currently go undetected:

1. PCIe Link Downtraining

When a PCIe link degrades (e.g., Gen5 x16 → Gen3 x8), GPU-to-host bandwidth drops significantly. This happens due to:

  • Signal integrity issues
  • Retimer failures
  • Post-maintenance seating issues
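
To put a number on the bandwidth drop in the Gen5 x16 → Gen3 x8 example, here is a minimal sketch of the theoretical per-direction link bandwidth. The per-lane rates and encoding overheads are the standard PCIe figures; the function name is illustrative, and real throughput will be lower because of protocol overhead.

```python
# Raw per-lane transfer rate in GT/s for each PCIe generation.
GT_PER_SEC = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}


def pcie_bandwidth_gb_per_s(gen: int, width: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    # Gen1/Gen2 use 8b/10b line encoding; Gen3 and later use 128b/130b.
    efficiency = 8 / 10 if gen <= 2 else 128 / 130
    # One transfer carries one bit per lane; divide by 8 to convert bits to bytes.
    return GT_PER_SEC[gen] * efficiency * width / 8


healthy = pcie_bandwidth_gb_per_s(5, 16)
degraded = pcie_bandwidth_gb_per_s(3, 8)
print(f"Gen5 x16: {healthy:.1f} GB/s")
print(f"Gen3 x8:  {degraded:.1f} GB/s")
print(f"Reduction: {healthy / degraded:.0f}x")
```

Under these assumptions the downtrained link has one eighth of the healthy link's theoretical bandwidth (a 4x loss from the generation drop and a 2x loss from the width drop).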

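Detection method: Compare the current link generation and width against their maximums, using nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv. A current value below its maximum indicates a downtrained link.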

Expected values:

  • H100 (P5.48xlarge): Gen5 x16
  • A100 (P4d.24xlarge): Gen4 x16
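
The comparison could be sketched roughly as follows. This is an illustration, not the fabric-manager-monitor implementation: the `PcieLink` class and the function names are hypothetical, and the `index` field is added to the query so each row can be attributed to a GPU. One caveat worth checking before alerting: the nvidia-smi documentation notes that the current link generation and width may be reduced while the GPU is not in use, so a real check would probably sample under load or require several consecutive readings.

```python
import subprocess
from dataclasses import dataclass

# Query fields; "index" is added here so each row can be tied to a GPU.
QUERY = (
    "index,pcie.link.gen.current,pcie.link.gen.max,"
    "pcie.link.width.current,pcie.link.width.max"
)


@dataclass
class PcieLink:  # hypothetical container, not an NVSentinel type
    index: int
    gen_current: int
    gen_max: int
    width_current: int
    width_max: int

    @property
    def downtrained(self) -> bool:
        # Either a lower generation or a narrower width counts as degraded.
        return self.gen_current < self.gen_max or self.width_current < self.width_max


def parse_pcie_csv(text: str) -> list[PcieLink]:
    """Parse output produced with --format=csv,noheader,nounits."""
    links = []
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 5 or not all(f.isdigit() for f in fields):
            continue  # skip rows such as "[N/A]" that cannot be compared
        links.append(PcieLink(*map(int, fields)))
    return links


def query_pcie_links() -> list[PcieLink]:
    """Run nvidia-smi on the local node and parse the result."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_pcie_csv(result.stdout)


# Illustrative sample rows (made up): GPU 0 is healthy, GPU 1 trained at Gen3 x8.
sample = "0, 4, 4, 16, 16\n1, 3, 4, 8, 16\n"
for link in parse_pcie_csv(sample):
    if link.downtrained:
        print(f"GPU {link.index}: Gen{link.gen_current} x{link.width_current} "
              f"(max Gen{link.gen_max} x{link.width_max})")
```

On a GPU node, `query_pcie_links()` would replace the sample string. Comparing against the driver-reported maximum avoids hard-coding the per-instance expected values, although those could be used as an additional check.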

2. Silent Clock Throttling

GPUs can be throttled due to thermal, power, or hardware conditions without generating XID errors. This degrades training throughput silently.

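Detection method: Compare the current graphics clock against its maximum using nvidia-smi --query-gpu=clocks.current.graphics,clocks.max.graphics,clocks_throttle_reasons.active --format=csv, and inspect the clocks_throttle_reasons.active bitmask to see why the clock is reduced.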

Important: Idle GPUs (throttle reason bitmask 0x0000000000000001, GPU idle) naturally downclock and should be excluded from alerts.
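
A rough sketch of the idle filter, assuming output from nvidia-smi --query-gpu=index,clocks.current.graphics,clocks.max.graphics,clocks_throttle_reasons.active --format=csv,noheader,nounits. The bit values follow the NVML clocks throttle reason constants. The `ClockSample` class, the function names, and the policy of alerting on any non-idle bit are assumptions made for illustration; a real monitor might also treat other reasons (for example an applications clocks setting) as benign.

```python
from dataclasses import dataclass

# NVML clocks throttle reason bits.
GPU_IDLE = 0x1
REASONS = {
    0x1: "gpu_idle",
    0x2: "applications_clocks_setting",
    0x4: "sw_power_cap",
    0x8: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
}


def decode_reasons(mask: int) -> list[str]:
    """Return the names of all known bits set in the bitmask."""
    return [name for bit, name in REASONS.items() if mask & bit]


@dataclass
class ClockSample:  # hypothetical container, not an NVSentinel type
    index: int
    current_mhz: int
    max_mhz: int
    reasons: int

    @property
    def ratio(self) -> float:
        return self.current_mhz / self.max_mhz

    @property
    def alert_reasons(self) -> list[str]:
        # Assumed policy: ignore the idle bit, report everything else.
        return decode_reasons(self.reasons & ~GPU_IDLE)


def parse_clock_csv(text: str) -> list[ClockSample]:
    samples = []
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        try:
            index, current, maximum = (int(f) for f in fields[:3])
            reasons = int(fields[3], 16)  # e.g. "0x0000000000000001"
        except (ValueError, IndexError):
            continue  # skip rows with "[N/A]" or missing fields
        samples.append(ClockSample(index, current, maximum, reasons))
    return samples


# Illustrative sample rows (made up): GPU 0 is idle, GPU 1 is thermally throttled.
sample = "0, 210, 1410, 0x0000000000000001\n1, 1155, 1410, 0x0000000000000040\n"
for s in parse_clock_csv(sample):
    if s.alert_reasons:
        print(f"GPU {s.index}: {s.current_mhz}/{s.max_mhz} MHz "
              f"({s.ratio:.0%}), reasons: {', '.join(s.alert_reasons)}")
```

With this filter, an idle GPU at 210/1410 MHz produces no alert because its only active reason is the idle bit, while the thermally throttled GPU is reported.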

Proposed Solution

Add checks to an existing health monitor (or new monitor) that:

  1. Periodically queries PCIe link status and flags degradation
  2. Periodically queries clock ratios and flags non-idle throttling
  3. Generates health events that flow through the normal NVSentinel remediation pipeline
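
The three steps above could be wired together along these lines. This is only a sketch of the polling shape: `HealthEvent`, `run_monitor`, and the `emit` callback are hypothetical stand-ins, since the actual NVSentinel health event interface is not described in this issue.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class HealthEvent:  # hypothetical event shape, not the NVSentinel schema
    gpu_index: int
    check: str      # e.g. "pcie_downtraining" or "clock_throttle"
    message: str


Check = Callable[[], Iterable[HealthEvent]]


def run_monitor(
    checks: list[Check],
    emit: Callable[[HealthEvent], None],
    interval_s: float = 60.0,
    iterations: Optional[int] = None,   # None means run forever
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run every check each interval and forward any events to emit()."""
    done = 0
    while iterations is None or done < iterations:
        for check in checks:
            try:
                for event in check():
                    emit(event)
            except Exception as exc:
                # One failing check should not stop the others.
                print(f"check {getattr(check, '__name__', check)} failed: {exc}")
        done += 1
        if iterations is None or done < iterations:
            sleep(interval_s)


# Usage with a stand-in check; a real check would wrap the nvidia-smi queries.
def fake_pcie_check() -> list[HealthEvent]:
    return [HealthEvent(1, "pcie_downtraining", "Gen3 x8, expected Gen4 x16")]


collected: list[HealthEvent] = []
run_monitor([fake_pcie_check], collected.append, iterations=2, sleep=lambda s: None)
print(len(collected), "events collected")
```

Injecting `emit` and `sleep` keeps the loop independent of where events are sent, so the same checks could run inside an existing health monitor or a new one. A production version would likely also deduplicate repeated events for the same GPU.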

Workaround

Our standalone fabric-manager-monitor DaemonSet implements both checks. It was validated on P4d.24xlarge: all 8 GPUs correctly report Gen4 x16, and idle downclocking (210/1410 MHz) is correctly filtered as benign.

Source: https://github.com/dmvevents/nvsentinel-eks-deployment/tree/master/fabric-manager-monitor

Component

Health Monitor

Labels

enhancement (New feature or request)