Description
If a cluster hasn't run any work yet, the scheduler's adaptive target will initially recommend only 1 worker, regardless of how many tasks are queued on the scheduler:
@gen_cluster(
    client=True,
    nthreads=[],
    config={"distributed.scheduler.default-task-durations": {"inc": 1}},
)
async def test_adaptive_target_empty_cluster(c, s):
    assert s.adaptive_target() == 0

    f = c.submit(inc, -1)
    await async_wait_for(lambda: s.tasks, timeout=5)
    assert s.adaptive_target() == 1

    fs = c.map(inc, range(1000))
    await async_wait_for(lambda: len(s.tasks) == len(fs) + 1, timeout=5)
    print(s.total_occupancy)
>   assert s.adaptive_target() > 1
E   AssertionError: assert 1 > 1
The scheduler's adaptive target is based on its total_occupancy. But occupancy is only updated once tasks are scheduled (moved into processing). So if there are no workers, no tasks can be scheduled, and occupancy remains 0 even with tons of tasks in unrunnable.
I would expect the total_occupancy to also include the expected runtime of all unrunnable/queued tasks. That would usually result in faster scale-up from zero. Some deployment systems can be quite slow to scale: you might wait a few minutes to get 1 worker, only then realize you need more, and wait a few minutes again. It would be better to ask for more up front.
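To make the difference concrete, here is a toy model of an occupancy-based recommendation. It is only a sketch, not the scheduler's actual code: `toy_adaptive_target`, its parameters, and the 5 second target duration are assumptions made for illustration, and it ignores the other limits the real calculation applies.

```python
import math


def toy_adaptive_target(
    processing_durations,
    unrunnable_durations,
    n_workers,
    target_duration=5.0,  # assumed value, for illustration only
    count_unrunnable=False,
):
    """Hypothetical, simplified worker recommendation based on occupancy."""
    # Occupancy only counts tasks that are already assigned to a worker.
    occupancy = sum(processing_durations)
    if count_unrunnable:
        # Proposed behavior: also count the expected runtime of waiting tasks.
        occupancy += sum(unrunnable_durations)

    target = math.ceil(occupancy / target_duration)

    # Floor: with waiting work and no workers, ask for at least one worker.
    if unrunnable_durations and n_workers == 0:
        target = max(1, target)
    return target


# 1001 tasks with an expected duration of 1s each, and no workers yet,
# so nothing can be in processing.
waiting = [1.0] * 1001
current = toy_adaptive_target([], waiting, n_workers=0)
proposed = toy_adaptive_target([], waiting, n_workers=0, count_unrunnable=True)
print(current, proposed)  # 1 201
```

In this model the current behavior recommends 1 worker only because of the floor, while counting the waiting tasks asks for enough workers to finish the backlog within the target duration.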
This is what ensures we at least get one worker, otherwise we'd never scale up at all:
distributed/distributed/scheduler.py, lines 7287 to 7288 in 16748b7