Description
The current Prometheus metric that exposes the number of workers known to the scheduler is ill-defined.
distributed/distributed/http/scheduler/prometheus/core.py
Lines 35 to 42 in 3e793f7
Its current definition is something like
- connected is the number of all connected workers
- idle/saturated is the number of workers in the respective state
but these sets are not all disjoint. Saturated and idle are disjoint from each other, but both are subsets of connected, so summing over the labels counts idle and saturated workers twice.
Ideally, this metric would expose gauges such that summing over all labels yields the total number of workers, similar to how we expose task states
distributed/distributed/http/scheduler/prometheus/core.py
Lines 76 to 78 in 3e793f7
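To illustrate the partition being asked for, here is a minimal sketch that turns the three overlapping numbers into mutually exclusive label counts. The function name and the `partially_saturated` label are hypothetical and only serve the example; the actual label names are up for discussion.

```python
from __future__ import annotations


def disjoint_worker_counts(connected: int, idle: int, saturated: int) -> dict[str, int]:
    """Split overlapping totals into mutually exclusive label counts.

    Assumes idle and saturated are disjoint subsets of connected, as
    described in this issue. "partially_saturated" is a hypothetical
    label for workers that are neither idle nor saturated.
    """
    neither = connected - idle - saturated
    if idle < 0 or saturated < 0 or neither < 0:
        raise ValueError("idle and saturated must be disjoint subsets of connected")
    return {
        "idle": idle,
        "saturated": saturated,
        "partially_saturated": neither,
    }


counts = disjoint_worker_counts(connected=10, idle=3, saturated=2)
# Summing over all labels now yields the total number of workers.
assert sum(counts.values()) == 10
```

With labels defined this way, a `sum` over the label dimension in a Prometheus query would give the total worker count without double counting.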
The implementation should ideally not iterate over all workers every time the metrics are collected. Instead, the scheduler should maintain the counts online, updating them whenever a worker is added, removed, or changes state.
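One way the online bookkeeping could look is sketched below. This is not the scheduler's actual code: `WorkerStateCounter`, its method names, and the state strings are all hypothetical, and the sketch assumes every worker is in exactly one state at a time.

```python
from __future__ import annotations

from collections import Counter


class WorkerStateCounter:
    """Hypothetical helper that keeps per-state worker counts up to date."""

    def __init__(self) -> None:
        self._state_of: dict[str, str] = {}  # worker address -> current state
        self.counts: Counter[str] = Counter()  # state -> number of workers

    def add(self, address: str, state: str) -> None:
        if address in self._state_of:
            raise ValueError(f"worker {address!r} is already known")
        self._state_of[address] = state
        self.counts[state] += 1

    def transition(self, address: str, new_state: str) -> None:
        # O(1) update: decrement the old state, increment the new one.
        old_state = self._state_of[address]
        self.counts[old_state] -= 1
        self.counts[new_state] += 1
        self._state_of[address] = new_state

    def remove(self, address: str) -> None:
        old_state = self._state_of.pop(address)
        self.counts[old_state] -= 1

    @property
    def total(self) -> int:
        return len(self._state_of)


tracker = WorkerStateCounter()
tracker.add("tcp://a", "idle")
tracker.add("tcp://b", "idle")
tracker.add("tcp://c", "saturated")
tracker.transition("tcp://a", "saturated")
tracker.remove("tcp://b")

# Collecting metrics only reads the counters; no iteration over workers.
assert sum(tracker.counts.values()) == tracker.total
```

Each update is constant time, so the cost of collecting the metric would no longer depend on the number of workers.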