Summary
vigil_ready_daemonsets and vigil_expected_daemonsets only register with the Prometheus client after the first node reconcile fires. On low-churn clusters this causes dashboards to show "no data" indefinitely — making a healthy-but-idle Vigil look identical to a broken Vigil.
How this bit us
We have a sandbox cluster (low churn) and a taskworker cluster (high churn) running the same Vigil v0.6.2 binary.
Querying promxy for vigil metrics by cluster:
group by (__name__) ({__name__=~"vigil_.*", cluster="taskworker-cluster"})
→ 9 metric families (includes vigil_ready_daemonsets, vigil_expected_daemonsets)
group by (__name__) ({__name__=~"vigil_.*", cluster="sandbox-cluster"})
→ 7 metric families (missing the two daemonset gauges)
The Vigil dashboard's panels for ready/expected DaemonSets were empty for the sandbox cluster. We assumed Vigil was broken there, spent time investigating, and only after running queries against vigil_tainted_nodes (which existed) did it become clear Vigil was healthy — it just hadn't done any work yet, so the gauges were never registered.
A separate problem (#55) was eventually found, but the empty panels masked it for hours because absence-of-metric was the only signal we had.
Proposed fix
Register the two gauges at controller startup with an initial value of 0 (or NaN), so they always appear in /metrics even before the first reconcile.
This is the standard prometheus-client idiom: pre-register metrics at process start so dashboards show "0" rather than "no data" when the system is idle. Doing so removes the dashboard ambiguity between "Vigil is broken" and "Vigil has nothing to do."
Acceptance
- Fresh Vigil install on a cluster with zero nextdoor.com/initializing-tainted nodes exposes vigil_ready_daemonsets and vigil_expected_daemonsets in /metrics from process start.
- Both metrics report 0 until the first reconcile populates real values.