Description
Summary
DcgmHealthWatch::MonitorPower() in modules/health/DcgmHealthWatch.cpp produces false positive DCGM_FR_CLOCK_THROTTLE_POWER warnings on GPUs under normal load. This is because the function uses a zero-threshold check on the cumulative power violation time from nvmlDeviceGetViolationStatus(), meaning even 1 nanosecond of violation triggers a health warning.
Problem
Bug 1: Zero Threshold Check (line 2146)
if (violationTime) { // any non-zero value triggers WARN
    SetResponse(entityGroupId, entityId, DCGM_HEALTH_RESULT_WARN, DCGM_HEALTH_WATCH_POWER, d, response);
}

GPU Boost is designed to push clocks up to the power limit, so brief power violations occur as part of normal clock negotiation. These microsecond-scale violations are not actual power issues, but the zero-threshold check reports them as warnings. As a result, a GPU under load will almost always trigger a power health warning, making this health check effectively meaningless.
Bug 2: abs() on Cumulative Counter (lines 2143-2144)
violationTime = startValue.val.i64 >= endValue.val.i64
                    ? (startValue.val.i64 - endValue.val.i64)
                    : (endValue.val.i64 - startValue.val.i64);

The DCGM_FI_DEV_POWER_VIOLATION field is a monotonically increasing cumulative counter. Taking the absolute difference means counter resets or sample ordering issues also produce false positives.
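One way to avoid the absolute-difference problem is to treat a decreasing counter as "no usable data" rather than as violation time. The sketch below is only an illustration: `CounterDelta` is a hypothetical helper, not an existing DCGM function, and it assumes the counter values are non-negative.

```cpp
#include <cstdint>

// Hypothetical helper (not part of DCGM): time accumulated by a monotonically
// increasing counter between two samples. If the end sample is smaller than
// the start sample, the counter was reset or the samples are out of order,
// so no violation time is attributed to the interval.
constexpr std::int64_t CounterDelta(std::int64_t startValue, std::int64_t endValue)
{
    return (endValue >= startValue) ? (endValue - startValue) : 0;
}

// Normal growth yields the difference; a reset yields 0 instead of a large
// positive number, which is what the absolute difference would produce.
static_assert(CounterDelta(100, 250) == 150, "normal growth");
static_assert(CounterDelta(250, 100) == 0, "reset or reordered samples");
```

Returning 0 on a reset may hide a real violation that occurred in that interval; skipping the interval entirely is another reasonable choice.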
Impact
- nvidia-smi (which uses nvmlDeviceGetCurrentClocksEventReasons) shows no power throttling
- DCGM health check reports DCGM_HEALTH_RESULT_WARN with DCGM_FR_CLOCK_THROTTLE_POWER
- This inconsistency causes unnecessary investigation and erodes trust in DCGM health monitoring
Proposed Fix
Replace the zero-threshold check with a ratio-based threshold:
double elapsedTime = (double)(endValue.timestamp - startValue.timestamp);
if (elapsedTime > 0) {
    // Fraction of the sampling window spent in power violation.
    // Assumes violationTime and the timestamps use the same time unit.
    double violationRatio = (double)violationTime / elapsedTime;
    if (violationRatio > 0.05) { // Only WARN when >5% of time is in violation
        DcgmError d { entityId };
        DCGM_ERROR_FORMAT_MESSAGE(DCGM_FR_CLOCKS_EVENT_POWER, d, entityId);
        SetResponse(entityGroupId, entityId, DCGM_HEALTH_RESULT_WARN, DCGM_HEALTH_WATCH_POWER, d, response);
    }
}

This preserves the high sensitivity of nvmlDeviceGetViolationStatus() while filtering out normal GPU Boost behavior.
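To make the effect of the threshold concrete, the ratio check can be isolated into a small pure function and exercised with sample numbers. This is a sketch under stated assumptions: `ExceedsViolationRatio` is a hypothetical name, both arguments are assumed to be in the same time unit (the caller would convert if the counter and the timestamps differ), and the 5% default is the value proposed above, not a tuned one.

```cpp
#include <cstdint>

// Hypothetical helper (not part of DCGM): true when the share of the sampling
// window spent in power violation exceeds maxRatio.
// Assumption: violationTime and elapsedTime are expressed in the same unit.
constexpr bool ExceedsViolationRatio(std::int64_t violationTime,
                                     std::int64_t elapsedTime,
                                     double maxRatio = 0.05)
{
    return elapsedTime > 0 && violationTime > 0
           && static_cast<double>(violationTime) / static_cast<double>(elapsedTime) > maxRatio;
}

// 1 unit of violation in a 30,000,000-unit window: well under 5%, no warning.
static_assert(!ExceedsViolationRatio(1, 30000000), "momentary violation is ignored");
// 3,000,000 units of violation in the same window: 10%, warning.
static_assert(ExceedsViolationRatio(3000000, 30000000), "sustained violation is reported");
// An empty or inverted window never warns.
static_assert(!ExceedsViolationRatio(500, 0), "no elapsed time");
```

With the current zero-threshold check, the first case would already produce DCGM_HEALTH_RESULT_WARN.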
Environment
- DCGM 4.5.0
- Observed on NVIDIA H100 GPUs with various workloads (vLLM, inference serving)