Description
Follow-up from #6223
At the moment, retire_workers waits indefinitely, until either (see the simplified sketch after this list):
- all in-memory tasks also exist somewhere else (worker transitions from closing_gracefully to closing), or
- there are no other workers on the cluster that can accept the unique data, because all of them are either paused or in closing_gracefully state themselves (worker transitions from closing_gracefully back to running).
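Very roughly, the waiting logic can be pictured as the sketch below. This is an illustration only, not the actual scheduler/AMM implementation; the two callables are hypothetical stand-ins for the real checks.

```python
import asyncio
from typing import Callable


async def wait_for_retirement(
    has_unique_data: Callable[[], bool],   # hypothetical: worker still holds keys that exist nowhere else
    has_recipients: Callable[[], bool],    # hypothetical: some worker is running, not paused, not retiring
    poll: float = 2.0,
) -> bool:
    """Return True when the worker may close, False if retirement is aborted."""
    while True:
        if not has_unique_data():
            return True   # closing_gracefully -> closing
        if not has_recipients():
            return False  # closing_gracefully -> running
        await asyncio.sleep(poll)  # give the AMM time to replicate keys elsewhere
```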
There is an implicit timeout for the case where the Active Memory Manager tries to replicate keys from the retiring worker to other workers, but the recipients are unresponsive and the scheduler hasn't noticed yet: the AMM will repeat the replication request every 2 seconds, until the unresponsive recipient finally hits its timeout and is removed. Note that this behaviour has changed very recently (#6200). Once the recipient has been removed, the AMM will try replicating towards a different recipient, or abort retirement if no more recipients are available.
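As a hedged sketch only (none of these names are real AMM internals), the retry behaviour described above looks roughly like this:

```python
import asyncio
from typing import Callable, Set


async def replicate_with_fallback(
    candidates: Set[str],
    request_replication: Callable[[str], None],  # hypothetical: ask recipient to acquire the key
    replicated_on: Callable[[str], bool],        # hypothetical: key now in memory on recipient?
    still_in_cluster: Callable[[str], bool],     # hypothetical: recipient not yet removed?
    interval: float = 2.0,
) -> bool:
    """Keep nudging one recipient until the key is replicated; if the recipient
    is removed from the cluster, fall back to another one or give up."""
    while candidates:
        recipient = next(iter(candidates))
        while still_in_cluster(recipient):
            if replicated_on(recipient):
                return True
            request_replication(recipient)
            await asyncio.sleep(interval)  # AMM iteration interval
        candidates.discard(recipient)  # recipient hit its timeout and was removed
    return False  # no recipients left: abort retirement
```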
Notably, the above system will not protect you in the case of:
- A deadlocked recipient which, however, is still regularly sending heartbeats to the scheduler
- Yet to be discovered deadlocks within the AMM. In particular, in early development stages I faced a lot of problems with tasks "ping-ponging" between workers at each AMM iteration, and fixed them only through fairly delicate logic.
It would be healthy to have a timeout, e.g. 60s, in retire_workers to mitigate this.
Such a timeout should be scheduler-side.
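For illustration only, a naive scheduler-side version could simply wrap the wait in a fixed deadline; `wait_for_retirement` here is the hypothetical helper from the sketch above, not a real distributed API:

```python
import asyncio
from typing import Awaitable, Callable


async def retire_with_hard_timeout(
    wait_for_retirement: Callable[[], Awaitable[bool]],
    timeout: float = 60.0,
) -> bool:
    """Abort retirement if it does not complete within `timeout` seconds."""
    try:
        return await asyncio.wait_for(wait_for_retirement(), timeout)
    except asyncio.TimeoutError:
        return False  # give up; the worker transitions back to running
```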
It is important to consider the use case of a worker that is retired while it holds tens of GBs of spilled data, and the spill disk is not very performant. This can easily mean that retirement takes much longer than 60s. #5999 will aggravate this use case, since it will extend the mechanism that chokes the outgoing comms of a paused worker to workers in closing_gracefully state which are above the pause threshold.
To deal with this, I would suggest defining the timeout as "the number of unique keys did not decrease compared to 60 seconds ago" (see the sketch below).
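A minimal sketch of that idea, assuming a hypothetical `count_unique_keys` callable that returns the number of in-memory keys held exclusively by the retiring worker (again, not an actual distributed API):

```python
import asyncio
from typing import Callable


async def wait_with_progress_timeout(
    count_unique_keys: Callable[[], int],  # hypothetical: keys held only by the retiring worker
    no_progress_timeout: float = 60.0,
    poll: float = 2.0,
) -> bool:
    """Abort retirement only if the number of unique keys has not decreased in
    the last `no_progress_timeout` seconds; a slow but steadily draining worker
    is never timed out."""
    loop = asyncio.get_running_loop()
    best = count_unique_keys()
    last_progress = loop.time()
    while best > 0:
        await asyncio.sleep(poll)
        current = count_unique_keys()
        if current < best:
            best = current
            last_progress = loop.time()  # progress made: reset the clock
        elif loop.time() - last_progress > no_progress_timeout:
            return False  # stuck: abort retirement, worker goes back to running
    return True  # all unique keys are replicated elsewhere; worker can close
```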