
Poor adaptive target for empty clusters #6962

Open
@gjoseph92


If a cluster hasn't run any work yet, the scheduler recommends only 1 worker initially, regardless of how many tasks are queued on it:

@gen_cluster(
    client=True,
    nthreads=[],
    config={"distributed.scheduler.default-task-durations": {"inc": 1}},
)
async def test_adaptive_target_empty_cluster(c, s):
    assert s.adaptive_target() == 0

    f = c.submit(inc, -1)
    await async_wait_for(lambda: s.tasks, timeout=5)
    assert s.adaptive_target() == 1

    fs = c.map(inc, range(1000))
    await async_wait_for(lambda: len(s.tasks) == len(fs) + 1, timeout=5)
    print(s.total_occupancy)
  > assert s.adaptive_target() > 1
E   AssertionError: assert 1 > 1

The scheduler's adaptive target is derived from its total_occupancy. But occupancy is only updated once tasks are scheduled (moved into processing). So if there are no workers, no tasks can be scheduled, and occupancy stays at 0 even with a large number of tasks in unrunnable.
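To make the mechanism concrete, here is a toy model of the behavior described above. It is not the scheduler's actual code: the name `toy_adaptive_target`, its parameters, and the 5-second `target_duration` are assumptions for illustration. Only the dependence on occupancy and the one-worker floor come from this issue.

```python
import math


def toy_adaptive_target(total_occupancy, n_unrunnable, n_workers, target_duration=5.0):
    """Simplified stand-in for an occupancy-based target (hypothetical helper)."""
    # Workers needed so the already-scheduled work finishes in ~target_duration seconds.
    cpu = math.ceil(total_occupancy / target_duration)
    # Floor: with pending tasks and no workers, ask for at least one worker.
    if n_unrunnable and not n_workers:
        cpu = max(1, cpu)
    return cpu


# With no workers, nothing reaches processing, so occupancy stays at 0
# no matter how many tasks are waiting, and only the floor applies.
print(toy_adaptive_target(total_occupancy=0.0, n_unrunnable=1, n_workers=0))
print(toy_adaptive_target(total_occupancy=0.0, n_unrunnable=1001, n_workers=0))
```

In this model both calls return the same value, which mirrors the failing assertion in the test.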

I would expect total_occupancy to also include the expected runtime of all unrunnable/queued tasks. That would usually give faster scale-up from zero. Some deployment systems can be quite slow to scale: you might wait a few minutes to get 1 worker, only then realize you need more, and wait a few more minutes for the rest. It would be better to ask for more up front.
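A rough sketch of what that could look like, under stated assumptions: `proposed_target` is a hypothetical function, the list of per-task durations stands in for whatever estimate the scheduler has for waiting tasks (for example the 1 s default configured for `inc` in the test), and the 5-second `target_duration` is an assumed value. This is an illustration of the idea, not a patch.

```python
import math


def proposed_target(processing_occupancy, waiting_durations, target_duration=5.0):
    """Hypothetical target that also counts tasks not yet assigned to a worker."""
    # Expected runtime (seconds) of tasks still waiting in unrunnable/queued.
    waiting = sum(waiting_durations)
    # Treat waiting work as if it were already part of the occupancy.
    total = processing_occupancy + waiting
    return math.ceil(total / target_duration)


# The test's scenario: 1001 `inc` tasks at an assumed 1 s each, no workers yet,
# so nothing is processing and all of the work is waiting.
print(proposed_target(processing_occupancy=0.0, waiting_durations=[1.0] * 1001))
```

Under these assumptions the target reflects the size of the backlog right away, instead of only after the first worker arrives and tasks move into processing.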

This is what ensures we get at least one worker; otherwise we'd never scale up at all:

if self.unrunnable and not self.workers:
    cpu = max(1, cpu)


Labels: adaptive, enhancement, scheduling
