Description
I've investigated a cluster that deadlocked after work-stealing failed in move_task_confirm
with the following traceback:
distributed.stealing - ERROR - <TaskState ('rechunk-getitem-getitem-getitem-1bd003f53f0630ef705ff016830a2c8f', 0, 1, 0) released>
Traceback (most recent call last):
  File "/opt/coiled/env/lib/python3.10/site-packages/distributed/stealing.py", line 380, in move_task_confirm
    victim.remove_from_processing(ts)
  File "/opt/coiled/env/lib/python3.10/site-packages/distributed/scheduler.py", line 771, in remove_from_processing
    self.processing.remove(ts)
KeyError: <TaskState ('rechunk-getitem-getitem-getitem-1bd003f53f0630ef705ff016830a2c8f', 0, 1, 0) released>
From what I understand, stealing had already progressed far into confirmation by the time it failed: it had checked that the request was up to date, that the worker had indeed confirmed the request (by checking the worker status), and that the task was currently stealable.
After looking into this for a while, I have not been able to identify the root cause, so I'm leaving this here in case it ever comes up again.
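For anyone who hits this later, the immediate failure in the traceback can be mimicked in isolation. The sketch below is a toy model, not the code in distributed: `ToyWorkerState` and `confirm_steal` are hypothetical names, and it assumes `processing` behaves like a plain Python set (which the `KeyError` from `.remove` suggests). It only shows that confirmation raises if the task has already left the victim's `processing` collection; it says nothing about how the scheduler ends up in that state.

```python
class ToyWorkerState:
    """Hypothetical stand-in for the scheduler-side record of a worker."""

    def __init__(self, address):
        self.address = address
        # Assumption: tasks assigned to this worker are tracked in a set.
        self.processing = set()

    def remove_from_processing(self, task):
        # set.remove raises KeyError when the element is absent.
        self.processing.remove(task)


def confirm_steal(victim, thief, task):
    """Toy confirmation step: move `task` from the victim to the thief."""
    victim.remove_from_processing(task)
    thief.processing.add(task)


victim = ToyWorkerState("victim")
thief = ToyWorkerState("thief")
task = "rechunk-task"  # placeholder key
victim.processing.add(task)

# Simulate the task leaving the victim's bookkeeping (for example because it
# was released) after the steal was requested but before it was confirmed.
victim.processing.discard(task)

try:
    confirm_steal(victim, thief, task)
    outcome = "moved"
except KeyError:
    outcome = "KeyError"

print(outcome)
```

In this toy model the thief never receives the task, because the exception is raised before `thief.processing.add(task)` runs.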
Environment:
- Dask version: 2024.7.1
- Python version: 3.10.12