Problem
There is a race condition in P2P that causes tasks to log compute failures on the worker even though those tasks will get restarted later on and then succeed. This happens when:
- A worker involved in the P2P operation is removed
- We restart the P2P operation on the scheduler and schedule the messages to be sent to the workers
- A task on worker A is not cancelled yet, but its RPC calls fail because the remote worker B has already closed the shuffle run, throwing a P2PConsistencyError
- The task raises the P2PConsistencyError and fails while still seen as executing by worker A, which causes the error to get logged.
Solution
Instead of failing directly on a P2PConsistencyError, the task could double-check with the scheduler whether its shuffle run is still supposed to be active. If not, it could instead silently succeed, as the result will get rejected by the scheduler as outdated.
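
A minimal sketch of what this check could look like, not the actual implementation. Everything here is hypothetical and defined locally so the example runs on its own: the `P2PConsistencyError` class is a stand-in for the real exception, and `FakeScheduler`, `get_active_run_id`, `run_p2p_task`, and the `STALE` sentinel are illustrative names that do not exist in `distributed`.

```python
import asyncio


class P2PConsistencyError(RuntimeError):
    """Local stand-in for the real exception (assumption for this sketch)."""


class FakeScheduler:
    """Hypothetical scheduler stub that knows the active run ID per shuffle."""

    def __init__(self, active_runs):
        self.active_runs = dict(active_runs)

    async def get_active_run_id(self, shuffle_id):
        # Returns None if the shuffle is unknown to the scheduler.
        return self.active_runs.get(shuffle_id)


# Placeholder result for a task whose shuffle run has been superseded.
STALE = object()


async def run_p2p_task(scheduler, shuffle_id, run_id, do_transfer):
    """Run a P2P task body, tolerating consistency errors from stale runs."""
    try:
        return await do_transfer()
    except P2PConsistencyError:
        # Double-check with the scheduler before treating this as a failure.
        active_run_id = await scheduler.get_active_run_id(shuffle_id)
        if active_run_id == run_id:
            # The run is still current, so the error is genuine.
            raise
        # The run was restarted: finish quietly; the scheduler is expected
        # to reject this outdated result anyway.
        return STALE


async def failing_transfer():
    raise P2PConsistencyError("remote worker already closed the shuffle run")


def demo():
    # The scheduler has already restarted the shuffle as run 2.
    scheduler = FakeScheduler({"shuffle-1": 2})
    # A task from the old run 1 finishes quietly instead of failing.
    stale = asyncio.run(run_p2p_task(scheduler, "shuffle-1", 1, failing_transfer))
    # A task from the current run 2 still surfaces the error.
    try:
        asyncio.run(run_p2p_task(scheduler, "shuffle-1", 2, failing_transfer))
        current_raised = False
    except P2PConsistencyError:
        current_raised = True
    return stale is STALE, current_raised
```

The extra round trip to the scheduler only happens on the error path, so tasks in a healthy shuffle run would not pay for it.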