feat(nvml-mock): DCGM / dcgm-exporter mock for Prometheus GPU metrics #370

@giuliocalzo

Description

Priority: High

Context. The simulator exposes NVML values and IB hw_counters/, but nothing speaks the DCGM API. On real clusters, dcgm-exporter is the de facto Prometheus GPU-metrics path and one of the most widely deployed GPU Operator components.

Gap. There is no libdcgm shim, no fake nvidia-dcgm host engine, and no dcgm-exporter surface. GPU monitoring/alerting test paths (DCGM_FI_DEV_*, DCGM_FI_DEV_XID_ERRORS, profiling fields like SM/Tensor activity) cannot be exercised.

Proposed scope.

  • A mock that satisfies dcgm-exporter (either a libdcgm.so shim driven by the existing profiles/dynamic-metrics engine, or a fake nv-hostengine).
  • Map core fields (clocks, power, temp, util, ECC, Xid) to existing NVML/dynamic-metrics values so sysfs/NVML/DCGM stay consistent.
  • E2E: scrape dcgm-exporter on Kind and assert that DCGM_FI_DEV_* values are non-zero and time-varying, and that DCGM_FI_DEV_XID_ERRORS becomes non-zero under failure injection.
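
The E2E assertion in the last bullet could look roughly like the sketch below. It is only an illustration, not a proposed implementation: `parse_scrape` and `check_scrapes` are hypothetical helper names, the two scrape bodies are hand-written samples in Prometheus text exposition format (label sets and values are made up), and "non-zero, time-varying" is interpreted here as "at least one DCGM_FI_DEV_* sample is non-zero and at least one changes between two scrapes". In a real test the two strings would come from fetching the exporter's metrics endpoint twice on the Kind cluster.

```python
import re

# One Prometheus text-exposition sample line: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_scrape(text):
    """Return {(metric_name, gpu_label): value} for DCGM_FI_* samples."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        m = SAMPLE_RE.match(line)
        if not m or not m.group("name").startswith("DCGM_FI_"):
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples[(m.group("name"), labels.get("gpu", ""))] = float(m.group("value"))
    return samples


def check_scrapes(first, second, xid_metric="DCGM_FI_DEV_XID_ERRORS"):
    """Compare two scrapes; return a list of problems (empty means pass)."""
    a, b = parse_scrape(first), parse_scrape(second)
    problems = []
    # Device fields other than the Xid field, taken from the later scrape.
    device = [k for k in b if k[0].startswith("DCGM_FI_DEV_") and k[0] != xid_metric]
    if not device:
        problems.append("no DCGM_FI_DEV_* samples found")
    elif not any(b[k] != 0 for k in device):
        problems.append("all DCGM_FI_DEV_* samples are zero")
    elif not any(k in a and a[k] != b[k] for k in device):
        problems.append("no DCGM_FI_DEV_* sample changed between scrapes")
    # After failure injection, the Xid field should report a non-zero code.
    if not any(v != 0 for k, v in b.items() if k[0] == xid_metric):
        problems.append(f"{xid_metric} is missing or zero")
    return problems


# Hand-written sample scrapes (assumption: illustrative labels and values only).
SCRAPE_1 = """
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-mock-0"} 41
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-mock-0"} 120.5
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-mock-0"} 0
"""
SCRAPE_2 = """
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-mock-0"} 44
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-mock-0"} 131.2
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-mock-0"} 79
"""

if __name__ == "__main__":
    print(check_scrapes(SCRAPE_1, SCRAPE_2))  # before vs. after injection
    print(check_scrapes(SCRAPE_1, SCRAPE_1))  # static mock: should report problems
```

Returning a list of problems rather than raising on the first failure keeps the CI log readable when several expectations break at once, for example when the mock serves static values and failure injection is not wired up.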

Why. Highest-leverage missing piece; unblocks the entire Prometheus/Grafana monitoring stack in CI. Complements failure injection (#328) and dynamic metrics.
