Initialize vigil_ready_daemonsets and vigil_expected_daemonsets metrics at startup, not on first reconcile #57

@diranged

Summary

vigil_ready_daemonsets and vigil_expected_daemonsets are registered with the Prometheus client only after the first node reconcile fires. On low-churn clusters this causes dashboards to show "no data" indefinitely, so a healthy-but-idle Vigil looks identical to a broken Vigil.

How this bit us

We have a sandbox cluster (low churn) and a taskworker cluster (high churn) running the same Vigil v0.6.2 binary.

Querying promxy for vigil metrics by cluster:

group by (__name__) ({__name__=~"vigil_.*", cluster="taskworker-cluster"})
→ 9 metric families (includes vigil_ready_daemonsets, vigil_expected_daemonsets)

group by (__name__) ({__name__=~"vigil_.*", cluster="sandbox-cluster"})
→ 7 metric families (missing the two daemonset gauges)

The Vigil dashboard's panels for ready/expected DaemonSets were empty for the sandbox cluster. We assumed Vigil was broken there and spent time investigating. Only after querying vigil_tainted_nodes (which existed) did it become clear that Vigil was healthy: it had not reconciled any nodes yet, so the gauges were never registered.

A separate problem (#55) was eventually found, but the empty panels masked it for hours because absence-of-metric was the only signal we had.

Proposed fix

Register the two gauges at controller startup with an initial value of 0 (or NaN), so they always appear in /metrics even before the first reconcile.

This is the standard prometheus-client idiom: pre-register metrics at process start so dashboards show "0" rather than "no data" when the system is idle. It removes the dashboard ambiguity between "Vigil is broken" and "Vigil has nothing to do."
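To make the difference concrete, here is a minimal sketch of lazy versus eager registration. It deliberately uses a toy, stdlib-only registry rather than the real Prometheus client, and it is not Vigil's code: `Registry`, `start_controller`, and `reconcile` are hypothetical names invented for illustration. Only the two metric names come from this issue.

```python
class Registry:
    """Toy stand-in for a metrics registry (hypothetical, not a real client API)."""

    def __init__(self):
        self._gauges = {}

    def register(self, name, initial=0.0):
        # setdefault keeps the current value if the gauge already exists.
        self._gauges.setdefault(name, float(initial))

    def set(self, name, value):
        # Setting a value also creates the gauge, which mimics lazy registration.
        self._gauges[name] = float(value)

    def render(self):
        # Roughly what a scrape of /metrics would see: one "name value" line per gauge.
        return "".join(
            f"{name} {value}\n" for name, value in sorted(self._gauges.items())
        )


DAEMONSET_GAUGES = ("vigil_ready_daemonsets", "vigil_expected_daemonsets")


def start_controller(registry):
    # Proposed behavior: register both gauges at startup with a value of 0.
    for name in DAEMONSET_GAUGES:
        registry.register(name, 0)


def reconcile(registry, ready, expected):
    # A reconcile overwrites the placeholder zeros with real values.
    registry.set("vigil_ready_daemonsets", ready)
    registry.set("vigil_expected_daemonsets", expected)


lazy = Registry()        # behavior described in this issue: nothing until reconcile
eager = Registry()       # proposed behavior
start_controller(eager)

print(repr(lazy.render()))
print(eager.render(), end="")
```

Before any reconcile, the lazy registry renders nothing at all, while the eager one already exposes both gauges at 0. That is the distinction a dashboard needs in order to tell "idle" from "broken".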

Acceptance

  • Fresh Vigil install on a cluster with zero nextdoor.com/initializing-tainted nodes exposes vigil_ready_daemonsets and vigil_expected_daemonsets in /metrics from process start.
  • Both metrics report 0 until the first reconcile populates real values.
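One way to check these criteria is to scrape /metrics right after startup and assert on the text. The sketch below is a rough illustration, not an existing test in the repo. `parse_samples` and `check_startup_metrics` are hypothetical helpers, the parser handles only simple Prometheus text-format lines, and the sample payloads are made up for the example.

```python
REQUIRED = ("vigil_ready_daemonsets", "vigil_expected_daemonsets")


def parse_samples(text):
    """Map metric name -> list of sample values from Prometheus text format."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            # name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            samples.setdefault(name, []).append(float(fields[0]))
    return samples


def check_startup_metrics(text):
    """Return a list of problems; an empty list means both criteria hold."""
    samples = parse_samples(text)
    problems = []
    for name in REQUIRED:
        values = samples.get(name)
        if values is None:
            problems.append(f"{name}: missing")
        elif any(v != 0 for v in values):  # NaN also fails this check
            problems.append(f"{name}: expected 0, got {values}")
    return problems


# Made-up payloads: one with the gauges pre-registered, one without.
HEALTHY = (
    "# TYPE vigil_ready_daemonsets gauge\n"
    "vigil_ready_daemonsets 0\n"
    "# TYPE vigil_expected_daemonsets gauge\n"
    "vigil_expected_daemonsets 0\n"
    "vigil_tainted_nodes 0\n"
)
MISSING = "vigil_tainted_nodes 0\n"

print(check_startup_metrics(HEALTHY))
print(check_startup_metrics(MISSING))
```

Note that this check treats NaN as a failure, which matches the second criterion (report 0). If the fix ends up initializing to NaN instead, the comparison would need to change.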

Labels: enhancement
