Scheduler deadlocked after stealing failed in move_task_confirm #8787

Open
@hendrikmakait

I've investigated a cluster that deadlocked after work-stealing failed in move_task_confirm with the following traceback:

distributed.stealing - ERROR - <TaskState ('rechunk-getitem-getitem-getitem-1bd003f53f0630ef705ff016830a2c8f', 0, 1, 0) released>
  Traceback (most recent call last):
    File "/opt/coiled/env/lib/python3.10/site-packages/distributed/stealing.py", line 380, in move_task_confirm
      victim.remove_from_processing(ts)
    File "/opt/coiled/env/lib/python3.10/site-packages/distributed/scheduler.py", line 771, in remove_from_processing
      self.processing.remove(ts)
  KeyError: <TaskState ('rechunk-getitem-getitem-getitem-1bd003f53f0630ef705ff016830a2c8f', 0, 1, 0) released>

From what I understand, stealing had already progressed far into confirmation by the time it failed: it had checked that the request was up to date, that the worker had indeed confirmed the request (by checking the worker status), and that the task was currently stealable.

After looking into this for a while, I have not been able to identify the root cause, so I'm leaving this here in case it ever comes up again.
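
For readers who hit the same traceback, the failure mode can be sketched in isolation. The snippet below is a simplified model, not the actual `distributed` code: `FakeTask`, `FakeWorker`, and `reproduce` are hypothetical stand-ins, and the step that drops the task from the victim's processing set is an assumption made only to trigger the error (the real cause is unknown). It shows that a set-style `remove` raises `KeyError` when the task is no longer tracked as processing on the victim, which matches the exception type in the log, where the task is reported as `released`.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so instances can live in a set
class FakeTask:
    """Hypothetical stand-in for a scheduler-side task record."""
    key: str
    state: str = "processing"


@dataclass
class FakeWorker:
    """Hypothetical stand-in for a scheduler-side worker record."""
    processing: set = field(default_factory=set)

    def remove_from_processing(self, ts):
        # set.remove raises KeyError if the element is absent
        self.processing.remove(ts)


def reproduce():
    victim = FakeWorker()
    ts = FakeTask("example-key")
    victim.processing.add(ts)

    # Assumption for illustration only: some other transition releases the
    # task and untracks it on the victim before the steal is confirmed.
    victim.processing.discard(ts)
    ts.state = "released"

    # The confirmation step then tries to untrack the task a second time.
    try:
        victim.remove_from_processing(ts)
    except KeyError as exc:
        return exc
    return None


if __name__ == "__main__":
    print(repr(reproduce()))
```

The sketch only demonstrates how the `KeyError` arises once the victim's bookkeeping and the steal request disagree. It does not explain how the real scheduler reached that state.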

Environment:

  • Dask version: 2024.7.1
  • Python version: 3.10.12

Metadata


  • Labels: bug (Something is broken), deadlock (The cluster appears to not make any progress)
