
Issues with tasks completing on workers after being released and re-submitted #7356


Description

@gjoseph92

#7348 fixed a deadlock where a task gets released by the client and then re-submitted, but the worker completes the task and reports it to the scheduler before it hears that the task was cancelled.

That fix was specific to the queued state, but I'd think the problem (and possible deadlocks) is broader than that. More fundamentally, there's a consistency issue whenever (see the sketch after this list):

  1. client cancels a task
  2. the scheduler forgets the task immediately, without waiting for confirmation that workers have also forgotten it
  3. the same task key is submitted again
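
For illustration, a minimal sketch of that sequence from the client's perspective (the functions and key here are made up; this shows the race pattern, not a guaranteed reproducer):

```python
from distributed import Client

client = Client()

# 1. Client submits "x"; a worker starts working on it
fut = client.submit(sum, [1, 2], key="x")

# 2. Client releases "x"; the scheduler forgets it right away, before the
#    worker has acknowledged the free-keys message
fut.release()

# 3. The same key is submitted again (possibly with a different function)
fut2 = client.submit(sum, [3, 4], key="x")

# Meanwhile, the original worker may finish the first "x" and report
# "task-finished" to the scheduler, where "x" now refers to a different task.
```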

Some theoretical possibilities (I haven't come up with tests to create them yet, but I imagine they're slight variations of the test added in #7348):

  1. The task has resource restrictions, so it's in no-worker instead of queued. Maybe this causes a deadlock?
  2. The task is processing instead of queued. When re-submitted, the scheduler picks a different worker for it. Then the message from the original worker that the task is finished arrives. This would trigger a RuntimeError on the scheduler from the task completing on an unexpected worker (a rough test sketch for this case follows the list).
  3. The task and its dependencies were cancelled, then resubmitted, so the task is now waiting. Then the task-finished message arrives, triggering the waiting->memory transition @crusaderky just added in #7205 (Edge and impossible transitions to memory). (The docstring of transition_waiting_memory does not mention this case FWIW, so I don't think this is intended, though the behavior would be okay.)
  4. After being cancelled and resubmitted, the task runs on a new worker and errs. Then the task-finished message from the original worker arrives. This would trigger an impossible erred->memory transition.
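
Possibility 2 is probably the easiest to turn into a test. A rough sketch of what such a test might look like (hypothetical; attribute access like a.state.tasks depends on the distributed version, and this isn't verified to actually trigger the error):

```python
import asyncio

from distributed import Event
from distributed.utils_test import gen_cluster, inc


@gen_cluster(client=True, nthreads=[("127.0.0.1", 1)] * 2)
async def test_resubmit_while_processing(c, s, a, b):
    ev = Event()

    def blocked(ev):
        ev.wait()
        return 1

    # 1. "x" starts processing on worker a and blocks there
    fut = c.submit(blocked, ev, key="x", workers=[a.address])
    while "x" not in a.state.tasks:
        await asyncio.sleep(0.01)

    # 2. Client releases "x"; the scheduler forgets it before worker a confirms
    fut.release()

    # 3. The same key is resubmitted and assigned to worker b
    fut2 = c.submit(inc, 1, key="x", workers=[b.address])

    # 4. Worker a finally finishes the old "x" and reports it to the scheduler,
    #    which now thinks "x" is processing on b -> unexpected-worker RuntimeError
    await ev.set()
    assert await fut2 == 2
```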

A couple of ways to address this:

  1. Just don't let the scheduler eagerly forget about tasks when a client asks it to; instead, wait for confirmation from workers. client_releases_keys would send a free-keys message to workers, and only forget the task once the workers confirm they've forgotten it. (We can transition the task to released immediately, just not forgotten.) This ensures the scheduler maintains a consistent view of the workers' state.
  2. On each TaskState, store the value of transition_counter when it was created, to use for disambiguating re-submitted tasks. Send this to workers as part of the task spec. Whenever workers send messages regarding a key, they also include this transition_counter value. If the value doesn't match what the scheduler has on record for that task (i.e. it's been forgotten and re-submitted in the interim), we know the message is stale, and we can ignore it and just tell the worker to release that key. (We'd have to be careful about consistency issues, though: we don't want to accidentally tell the worker to release a newer version of the key, if it has that too.) A toy sketch of this check follows the list.
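
For option 2, a toy model of the scheduler-side staleness check might look like the following (illustrative Python only, not distributed's actual Scheduler code; names like run_id and free_stale_key are made up):

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    key: str
    run_id: int             # transition_counter captured when this TaskState was created
    state: str = "released"


class MiniScheduler:
    """Toy model of the staleness check; not distributed's Scheduler."""

    def __init__(self):
        self.transition_counter = 0
        self.tasks: dict[str, TaskState] = {}

    def new_task(self, key: str) -> TaskState:
        self.transition_counter += 1
        ts = TaskState(key, run_id=self.transition_counter)
        self.tasks[key] = ts
        return ts

    def handle_task_finished(self, key: str, run_id: int, worker: str) -> None:
        ts = self.tasks.get(key)
        if ts is None or ts.run_id != run_id:
            # Stale message: the key was forgotten and re-submitted since the
            # worker started this run. Ignore the result and ask only that
            # worker to drop that specific run of the key.
            self.free_stale_key(worker, key, run_id)
            return
        ts.state = "memory"

    def free_stale_key(self, worker: str, key: str, run_id: int) -> None:
        ...  # would send {"op": "free-keys", "keys": [key], "run_id": run_id} to `worker`
```

Including the run_id in the free-keys message is what would keep the worker from accidentally dropping a newer incarnation of the same key, per the caveat above.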

cc @fjetter @crusaderky @hendrikmakait

Labels: bug, deadlock, scheduler
