Description
Use case 1
A task in flight fails to unpickle when it lands; this triggers a GatherDepFailureEvent.
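For illustration only (neither the function nor the class below exists in distributed; the names are made up), a payload that reproduces this mode just needs to pickle cleanly on the sending worker and raise when deserialized on the receiving one:

```python
import pickle


def _raise_on_unpickle():
    # Invoked by pickle on the receiving side; stands in for any payload
    # whose deserialization blows up after the network transfer.
    raise RuntimeError("deserialization failed on the receiving worker")


class FailsToUnpickle:
    """Hypothetical payload: pickling succeeds, unpickling raises."""

    def __reduce__(self):
        return (_raise_on_unpickle, ())


blob = pickle.dumps(FailsToUnpickle())  # succeeds on the sending worker
try:
    pickle.loads(blob)  # raises on the receiving worker -> GatherDepFailureEvent
except RuntimeError as exc:
    print(f"unpickle failed: {exc}")
```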
Use case 2
A task in flight unpickles successfully when it lands; this triggers a GatherDepSuccessEvent.
The task is larger than 60% of memory_limit, so it is spilled to disk immediately.
However, pickling it back for the spill fails.
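Again purely as a sketch (these names are made up, not part of distributed), a payload for this mode has to survive one pickle/unpickle round trip (the network transfer) and then refuse to be pickled again, which is exactly what the immediate spill attempts:

```python
import pickle


class FailsToSpill:
    """Hypothetical payload: the original instance pickles fine (so it can be
    transferred between workers), but the copy produced by unpickling raises
    when pickled again, e.g. when the receiving worker immediately spills it."""

    def __init__(self, forbid_pickle: bool = False):
        self.forbid_pickle = forbid_pickle

    def __reduce__(self):
        if self.forbid_pickle:
            raise RuntimeError("cannot pickle this object again (spill fails)")
        # The reconstructed copy on the receiving worker refuses re-pickling
        return (FailsToSpill, (True,))


landed = pickle.loads(pickle.dumps(FailsToSpill()))  # network transfer: OK
try:
    pickle.dumps(landed)  # immediate spill to disk: fails
except RuntimeError as exc:
    print(f"spill failed: {exc}")
```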
Use case 3
The network stack raises, as shown in #6877 / #6875.
This is a bug in the network stack and should be fixed there.
Expected behaviour
The task is marked as erred on the worker.
All waiters are taken care of (rescheduled?)
The scheduler is informed.
This is where things get complicated: the scheduler receives an error event for a task that is in memory somewhere else.
Does it need to try fetching it again, incrementing the failure counter?
What happens to the task's dependents when the scheduler finally transitions it from memory to error (a transition that today does not exist)?
Actual behaviour
If validate=True, the worker shuts itself down with @fail_hard:
for ts_wait in ts.waiting_for_data:
assert self.tasks[ts_wait.key] is ts_wait
> assert ts_wait.state in WAITING_FOR_DATA, ts_wait
E AssertionError: <TaskState 'x' error>
If validate=False, the worker sends {op: task-erred} to the scheduler. Unsure what happens next, considering that the task is in memory for the scheduler. This is untested.
#6703 introduces tests for both use cases (xfailed):
test_worker_memory.py::test_workerstate_fail_to_pickle_flight
test_worker_state_machine.py::test_gather_dep_failure
Note that there are no integration tests with the scheduler; they still need to be added.
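As a starting point, such an integration test could look roughly like the sketch below. It is untested; the payload class and the test name are hypothetical (only gen_cluster and Client.submit are real), and the final assertion encodes just one plausible reading of the expected behaviour, since the exact outcome is one of the open questions above.

```python
import pytest

from distributed.utils_test import gen_cluster


def _raise_on_unpickle():
    raise RuntimeError("deserialization failed on the receiving worker")


class FailsToUnpickle:
    """Same hypothetical payload as in use case 1: pickles fine, unpickling raises."""

    def __reduce__(self):
        return (_raise_on_unpickle, ())


def consume(x):
    return None


@pytest.mark.xfail(reason="scheduler handling of task-erred for an in-memory key is undefined")
@gen_cluster(client=True, nthreads=[("127.0.0.1", 1), ("127.0.0.1", 1)])
async def test_gather_dep_unpickle_failure_integration(c, s, a, b):
    # x lands in memory on worker a; y forces worker b to fetch it, the fetched
    # data fails to deserialize, and b emits a GatherDepFailureEvent.
    x = c.submit(FailsToUnpickle, key="x", workers=[a.address])
    y = c.submit(consume, x, key="y", workers=[b.address])

    # Loose expectation: the client eventually sees an error instead of hanging
    # or losing the worker. Whether the scheduler should instead retry the
    # fetch (the key is still in memory on a) is an open question.
    with pytest.raises(Exception):
        await y
```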