Priority: High
Context. The simulator exposes NVML values and IB hw_counters/, but nothing speaks the DCGM API. On real clusters dcgm-exporter is the de-facto Prometheus GPU-metrics path and one of the most-deployed GPU Operator components.
Gap. No libdcgm shim, no nvidia-dcgm host engine fake, no dcgm-exporter surface. GPU monitoring/alerting test paths (DCGM_FI_DEV_*, DCGM_FI_DEV_XID_ERRORS, profiling fields like SM/Tensor activity) cannot be exercised.
Proposed scope.
- A mock that satisfies dcgm-exporter (either a libdcgm.so shim driven by the existing profiles/dynamic-metrics engine, or a fake nv-hostengine).
- Map core fields (clocks, power, temp, util, ECC, Xid) to existing NVML/dynamic-metrics values so sysfs/NVML/DCGM stay consistent.
- E2E: scrape dcgm-exporter on Kind and assert non-zero, time-varying DCGM_FI_DEV_* and a fired DCGM_FI_DEV_XID_ERRORS under failure injection.
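A rough sketch of what the E2E assertion could look like, assuming the test has already fetched two /metrics bodies from dcgm-exporter a few seconds apart, the second one taken after failure injection. The helper names (parse_dcgm_samples, check_scrapes) and the sample payloads are hypothetical and not part of the simulator. The parser handles only the simple `name{labels} value` form of the Prometheus text format, which is probably enough for a smoke test but is not a full parser.

```python
import re

# One sample line: metric name, optional {labels}, value.
# Simplification: assumes label values never contain '}'.
_SAMPLE = re.compile(r'^(DCGM_FI_[A-Z0-9_]+)(\{[^}]*\})?\s+(\S+)')

XID_METRIC = "DCGM_FI_DEV_XID_ERRORS"


def parse_dcgm_samples(text):
    """Return {(metric, labels): float} for DCGM_FI_* lines in a scrape body."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


def check_scrapes(first, second):
    """Compare two scrape bodies and report the conditions the E2E test cares about."""
    a, b = parse_dcgm_samples(first), parse_dcgm_samples(second)
    dev = [k for k in b if k[0].startswith("DCGM_FI_DEV_")]
    gauges = [k for k in dev if k[0] != XID_METRIC]
    return {
        "has_dev_metrics": bool(dev),
        "nonzero": any(b[k] != 0 for k in gauges),
        "time_varying": any(k in a and a[k] != b[k] for k in gauges),
        "xid_fired": any(b[k] != 0 for k in dev if k[0] == XID_METRIC),
    }


# Illustrative payloads only; the values are made up for this sketch.
SAMPLE_BEFORE = """\
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 61.5
DCGM_FI_DEV_XID_ERRORS{gpu="0"} 0
"""

SAMPLE_AFTER = """\
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 43
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 64.0
DCGM_FI_DEV_XID_ERRORS{gpu="0"} 79
"""
```

Calling check_scrapes(SAMPLE_BEFORE, SAMPLE_AFTER) should report every condition as true, while comparing a scrape against itself should leave time_varying and xid_fired false. In a real test the two bodies would come from HTTP requests against the exporter's service on the Kind cluster.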
Why. Highest-leverage missing piece; unblocks the entire Prometheus/Grafana monitoring stack in CI. Complements failure injection (#328) and dynamic metrics.