
MonitorPower health check produces false positive warnings due to zero-threshold violation time check #285

@wkd-woo

Summary

DcgmHealthWatch::MonitorPower() in modules/health/DcgmHealthWatch.cpp produces false positive DCGM_FR_CLOCK_THROTTLE_POWER warnings on GPUs under normal load. The function applies a zero-threshold check to the cumulative power violation time from nvmlDeviceGetViolationStatus(): any change in that counter between two samples, even 1 nanosecond of violation, triggers a health warning.

Problem

Bug 1: Zero Threshold Check (line 2146)

if (violationTime) {  // any non-zero value triggers WARN
    SetResponse(entityGroupId, entityId, DCGM_HEALTH_RESULT_WARN, DCGM_HEALTH_WATCH_POWER, d, response);
}

GPU Boost is designed to push clocks up to the power limit, so brief power violations occur as part of normal clock negotiation. These microsecond-scale violations are not actual power issues, but the zero-threshold check reports them as warnings. As a result, a GPU under load will almost always trigger a power health warning, making this health check effectively meaningless.

Bug 2: abs() on Cumulative Counter (lines 2143-2144)

violationTime = startValue.val.i64 >= endValue.val.i64 
    ? (startValue.val.i64 - endValue.val.i64)
    : (endValue.val.i64 - startValue.val.i64);

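The DCGM_FI_DEV_POWER_VIOLATION field is a monotonically increasing cumulative counter, so the only meaningful delta is endValue - startValue. Taking the absolute difference turns a negative delta, which indicates a counter reset or out-of-order samples, into apparent violation time and produces further false positives.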
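One way to handle the cumulative counter is sketched below. CounterDelta is a hypothetical helper, not an existing DCGM function, and its parameters stand in for the timestamp and val.i64 members of the two samples. It assumes that a decreasing value or non-increasing timestamp means the delta cannot be trusted and should be skipped rather than reported.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not part of DCGM): returns how much a cumulative
// counter increased between two samples, or std::nullopt when the pair
// of samples cannot yield a trustworthy delta.
std::optional<std::int64_t> CounterDelta(std::int64_t startTimestamp,
                                         std::int64_t startValue,
                                         std::int64_t endTimestamp,
                                         std::int64_t endValue)
{
    // Samples are out of order or identical: no valid interval to measure.
    if (endTimestamp <= startTimestamp)
    {
        return std::nullopt;
    }

    // A cumulative counter should never decrease; treat this as a reset.
    if (endValue < startValue)
    {
        return std::nullopt;
    }

    return endValue - startValue;
}
```

With this shape, a caller would raise a warning only when a delta is returned and it exceeds whatever threshold is chosen, and would silently skip the interval otherwise.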

Impact

  • nvidia-smi (which uses nvmlDeviceGetCurrentClocksEventReasons) shows no power throttling
  • DCGM health check reports DCGM_HEALTH_RESULT_WARN with DCGM_FR_CLOCK_THROTTLE_POWER
  • This inconsistency causes unnecessary investigation and erodes trust in DCGM health monitoring

Proposed Fix

Replace the zero-threshold check with a ratio-based threshold:

// Assumes violationTime and the sample timestamps are in the same time unit;
// convert one of them first if they are not.
double elapsedTime = static_cast<double>(endValue.timestamp - startValue.timestamp);
if (elapsedTime > 0) {
    double violationRatio = static_cast<double>(violationTime) / elapsedTime;
    if (violationRatio > 0.05) {  // Only WARN when >5% of time is in violation
        DcgmError d { entityId };
        DCGM_ERROR_FORMAT_MESSAGE(DCGM_FR_CLOCKS_EVENT_POWER, d, entityId);
        SetResponse(entityGroupId, entityId, DCGM_HEALTH_RESULT_WARN, DCGM_HEALTH_WATCH_POWER, d, response);
    }
}

This preserves the high sensitivity of nvmlDeviceGetViolationStatus() while filtering out normal GPU Boost behavior.
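The ratio test can also be isolated into a small pure function so the threshold is easy to unit test. The sketch below is illustrative only: ExceedsViolationRatio and its maxRatio parameter are hypothetical names, the 5% default is taken from the proposal above, and both time arguments are assumed to be already converted to the same unit.

```cpp
#include <cstdint>

// Hypothetical helper (not part of DCGM): returns true when the share of
// the sampling interval spent in power violation is above maxRatio.
// Both time arguments must be expressed in the same unit.
bool ExceedsViolationRatio(std::int64_t violationTime,
                           std::int64_t elapsedTime,
                           double maxRatio = 0.05)
{
    // No usable interval, or no violation recorded: nothing to report.
    if (elapsedTime <= 0 || violationTime <= 0)
    {
        return false;
    }

    double const ratio = static_cast<double>(violationTime) / static_cast<double>(elapsedTime);
    return ratio > maxRatio;
}
```

For example, 1 unit of violation over a 1,000,000-unit interval stays below the threshold, while 60,000 units over the same interval (6%) exceeds it.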

Environment

  • DCGM 4.5.0
  • Observed on NVIDIA H100 GPUs with various workloads (vLLM, inference serving)
