Description
The current implementation for
Client(...).gather(..., direct=True)
Client(direct_to_workers=True).gather(...)
get_client().gather(...)
(worker clients)
is that, if anything bad happens on the direct connection to the workers, the keys that failed to be fetched directly are fetched through the scheduler (direct=False)
instead.
I believe that the intent of this design is to build resilience when the user specifies direct=True while in fact they have a firewall preventing comms.
First, I think even the original intent is not a good idea to begin with - most firewalls will just silently drop the packets, thus causing the user to wait ~30 seconds for timeout. If the workflow takes e.g. ~2 minutes to compute, the user may not realise there's a substantial slowdown.
Second, there are many cases where a key will fail to be gathered on the first try - notably, the who_has dict can fall out of sync with the scheduler, or a worker's GIL can be temporarily clogged. This may cause a workflow that gathers huge amounts of data from workers to client to run fine most times and then, once in a while, kill off the scheduler due to memory pressure. This is particularly true when the client mounts vastly more RAM than the scheduler - which is not an uncommon configuration, particularly when worker_clients are involved.
I suggest we drop this feature completely. If workers are permanently unreachable with direct=True due to network configuration, gather(direct=True) should just not work. Note that the default is direct=False.