Skip to content

P2P tasks log compute failures even if they are later restarted #8679

Open
@hendrikmakait

Description

@hendrikmakait

Problem

There is a race condition in P2P that causes tasks to log compute failures on the worker even though those tasks will get restarted later on and then succeed. This happens when:

  1. A worker involved in the P2P operation is removed
  2. We restart the P2P operation on the scheduler and schedule the messages to be sent to the workers
  3. A task on worker A is not cancelled yet, but its RPC calls fail because the remote worker B has already closed the shuffle run, throwing a P2PConsistencyError
  4. The task raises the P2PConsistencyError and fails while still seen as executing by worker A, which causes the error to get logged.

Solution

Instead of failing directly on a P2PConsistencyError, the task could double-check with the scheduler whether its shuffle run is still supposed to be active. If not, it could instead silently succeed as the result will get rejected by the scheduler as outdated.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementImprove existing functionality or make things work bettershuffle

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions