Skip to content

gather(direct=True) may fall back to direct=False #7993

Open
@crusaderky

Description

@crusaderky

The current implementation for

  • Client(...).gather(..., direct=True)
  • Client(direct_to_workers=True).gather(...)
  • get_client().gather(...) (worker clients)

is that, if anything bad happens on the direct connection to the workers, the keys that failed to be fetched directly are fetched through the scheduler (direct=False) instead.

I believe that the intent of this design is to build resilience when the user specifies direct=True while in fact they have a firewall preventing comms.

First, I think even the original intent is not a good idea to begin with - most firewalls will just silently drop the packets, thus causing the user to wait ~30 seconds for timeout. If the workflow takes e.g. ~2 minutes to compute, the user may not realise there's a substantial slowdown.

Second, there are many cases where a key will fail to be gathered on the first try - notably, the who_has dict can fall out of sync with the scheduler, or a worker's GIL can be temporarily clogged. This may cause a workflow that gathers huge amounts of data from workers to client to run fine most times and then, once in a while, kill off the scheduler due to memory pressure. This is particularly true when the client mounts vastly more RAM than the scheduler - which is not an uncommon configuration, particularly when worker_clients are involved.

I suggest we drop this feature completely. If workers are permanently unreachable with direct=True due to network configuration, gather(direct=True) should just not work. Note that the default is direct=False.

CC @fjetter @hendrikmakait

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions