Worker count prometheus metric should expose disjoint counters #7461

Open
@fjetter

The current Prometheus metric that exposes the number of workers known to the scheduler is ill-defined.

worker_states = GaugeMetricFamily(
    self.build_name("workers"),
    "Number of workers known by scheduler",
    labels=["state"],
)
worker_states.add_metric(["connected"], len(self.server.workers))
worker_states.add_metric(["saturated"], len(self.server.saturated))
worker_states.add_metric(["idle"], len(self.server.idle))

Its current definition is something like

- connected is the number of all connected workers
- idle/saturated is the number of workers in the respective state

but these sets are not disjoint: saturated and idle are disjoint from each other, but both are subsets of connected, so summing over the labels counts those workers twice.
Ideally, this metric would expose gauges such that summing over all labels yields the total number of workers, similar to how we expose task states:

for state in ALL_TASK_STATES:
    if state != "forgotten":
        tasks.add_metric([state], task_counter.get(state, 0.0))
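Given the subset relationship described above, one possible way to get disjoint labels is to report the idle and saturated counts as they are and add a third bucket for connected workers that are neither. The sketch below only illustrates the arithmetic: the function name and the `"other"` label are hypothetical, and it assumes that idle and saturated really are disjoint subsets of the connected workers.

```python
def disjoint_worker_counts(n_connected: int, n_idle: int, n_saturated: int) -> dict:
    """Split the connected workers into three disjoint buckets.

    Assumes idle and saturated are disjoint subsets of the connected
    workers. The "other" label name is a placeholder, not an existing
    scheduler state.
    """
    n_other = n_connected - n_idle - n_saturated
    if n_other < 0:
        # The assumption above does not hold for these inputs.
        raise ValueError("idle + saturated exceeds the number of connected workers")
    return {"idle": n_idle, "saturated": n_saturated, "other": n_other}


# Example: 10 connected workers, 3 of them idle and 2 saturated.
counts = disjoint_worker_counts(n_connected=10, n_idle=3, n_saturated=2)

# Summing over all labels now yields the total number of workers.
total = sum(counts.values())
```

With labels defined this way, a sum over the `state` label gives the total worker count, which is not the case when `connected` is exposed alongside its own subsets.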

Ideally, the implementation should not iterate over all workers every time the metrics are collected; instead, the scheduler should maintain the counts online.
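A minimal sketch of what maintaining the counts online could look like, using only the standard library. The `WorkerStateCounter` class and its method names are hypothetical and not part of the scheduler; the idea is simply that each worker addition, removal, or state change updates a counter in constant time, so collecting the metric never has to scan all workers.

```python
from collections import Counter


class WorkerStateCounter:
    """Hypothetical helper that tracks how many workers are in each state."""

    def __init__(self):
        self._state_of = {}      # worker address -> current state
        self.counts = Counter()  # state -> number of workers in that state

    def add(self, worker, state):
        if worker in self._state_of:
            raise KeyError(f"worker {worker!r} is already tracked")
        self._state_of[worker] = state
        self.counts[state] += 1

    def transition(self, worker, new_state):
        old_state = self._state_of[worker]
        if old_state == new_state:
            return
        # Constant-time update: move one worker between two buckets.
        self.counts[old_state] -= 1
        self.counts[new_state] += 1
        self._state_of[worker] = new_state

    def remove(self, worker):
        old_state = self._state_of.pop(worker)
        self.counts[old_state] -= 1


tracker = WorkerStateCounter()
tracker.add("tcp://worker-a", "idle")
tracker.add("tcp://worker-b", "idle")
tracker.transition("tcp://worker-a", "saturated")

# At collection time, the per-state counts are read directly, without a scan.
snapshot = dict(tracker.counts)
```

Because every worker is in exactly one state at a time, the per-state counts are disjoint by construction and always sum to the number of tracked workers.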

Labels: diagnostics, good second issue
