Description
Prerequisites
- I searched existing issues
- I can reproduce this issue
Feature Description
NVSentinel currently detects GPU failures via XID errors, ECC errors, and temperature thresholds, but there are two categories of silent performance degradation that generate neither XIDs nor health events:
1. PCIe Link Downtraining
When a PCIe link degrades (e.g., Gen5 x16 → Gen3 x8), GPU-to-host bandwidth drops sharply (roughly 64 GB/s down to 8 GB/s in that example). Common causes:
- Signal integrity issues
- Retimer failures
- Improper card seating after maintenance
Detection method: Query `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv,noheader` and compare the current link generation and width against their maximums.
Expected values:
- H100 (P5.48xlarge): Gen5 x16
- A100 (P4d.24xlarge): Gen4 x16
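A minimal sketch of the comparison logic, assuming the CSV output shape of the `nvidia-smi` query above (one line per GPU, `noheader` format); the sample data is illustrative, not captured from real hardware:

```python
def check_pcie_links(csv_output: str) -> list[str]:
    """Flag GPUs whose current PCIe gen/width is below the card's maximum.

    csv_output: output of
      nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,
                 pcie.link.width.current,pcie.link.width.max
                 --format=csv,noheader
    """
    warnings = []
    for idx, line in enumerate(csv_output.strip().splitlines()):
        gen_cur, gen_max, width_cur, width_max = (
            int(field.strip()) for field in line.split(",")
        )
        if gen_cur < gen_max or width_cur < width_max:
            warnings.append(
                f"GPU {idx}: PCIe link downtrained to Gen{gen_cur} x{width_cur} "
                f"(expected Gen{gen_max} x{width_max})"
            )
    return warnings

# Illustrative sample: GPU 0 healthy, GPU 1 downtrained to Gen3 x8
sample = "5, 5, 16, 16\n3, 5, 8, 16"
print(check_pcie_links(sample))
```

Comparing current against the per-GPU reported maximum (rather than a hard-coded table) keeps the check portable across instance types, while the expected values above serve as a sanity check.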
2. Silent Clock Throttling
GPUs can be throttled due to thermal, power, or hardware conditions without generating XID errors. This degrades training throughput silently.
Detection method: Compare `nvidia-smi --query-gpu=clocks.current.graphics,clocks.max.graphics` and check the `clocks_throttle_reasons.active` bitmask.
Important: Idle GPUs (reason `0x0000000000000001`) naturally downclock and should be excluded from alerts.
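A sketch of the throttle check with the idle exclusion, again assuming CSV `noheader,nounits` output of the query above; the 90% clock-ratio threshold is an assumption, not a value taken from NVSentinel:

```python
# NVML's "GPU idle" throttle-reason bit; when it is the only active
# reason, the downclock is benign and should not raise an alert.
IDLE_ONLY = 0x0000000000000001

def check_throttling(csv_output: str, ratio_threshold: float = 0.90) -> list[str]:
    """Flag GPUs running well below max clocks for a non-idle reason.

    csv_output: output of
      nvidia-smi --query-gpu=clocks.current.graphics,clocks.max.graphics,
                 clocks_throttle_reasons.active
                 --format=csv,noheader,nounits
    """
    warnings = []
    for idx, line in enumerate(csv_output.strip().splitlines()):
        cur_s, max_s, reasons_s = (field.strip() for field in line.split(","))
        cur, mx = int(cur_s), int(max_s)
        if int(reasons_s, 16) == IDLE_ONLY:
            continue  # idle downclock: expected, skip
        if mx > 0 and cur / mx < ratio_threshold:
            warnings.append(
                f"GPU {idx}: clocks {cur}/{mx} MHz, active reasons {reasons_s}"
            )
    return warnings

# Illustrative sample: GPU 0 idle (benign), GPU 1 throttled
sample = "210, 1410, 0x0000000000000001\n900, 1410, 0x0000000000000008"
print(check_throttling(sample))
```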
Proposed Solution
Add checks to an existing health monitor (or new monitor) that:
- Periodically queries PCIe link status and flags degradation
- Periodically queries clock ratios and flags non-idle throttling
- Generates health events that flow through the normal NVSentinel remediation pipeline
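To illustrate the last step, a hypothetical event envelope that a check could emit into the remediation pipeline; the field names and JSON schema here are assumptions for illustration only, since NVSentinel's actual health-event format is not shown in this issue:

```python
import json
import time

def make_health_event(gpu_index: int, check: str, detail: str) -> str:
    """Wrap a check finding in a minimal JSON health event.

    The schema (component/gpu/check/detail/timestamp) is hypothetical.
    """
    event = {
        "component": "health-monitor",
        "gpu": gpu_index,
        "check": check,        # e.g. "pcie-downtraining" or "clock-throttling"
        "detail": detail,
        "timestamp": int(time.time()),
    }
    return json.dumps(event)

# Example: wrap a PCIe downtraining finding as an event
print(make_health_event(1, "pcie-downtraining", "Gen3 x8, expected Gen5 x16"))
```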
Workaround
Our standalone fabric-manager-monitor DaemonSet implements both checks. Validated on P4d.24xlarge: all 8 GPUs correctly report Gen4 x16, and idle downclocking (210/1410 MHz) is correctly filtered as benign.
Source: https://github.com/dmvevents/nvsentinel-eks-deployment/tree/master/fabric-manager-monitor
Component
Health Monitor