Restart P2P on OSError: [Errno 28] No space left on device instead of failing if the cluster has grown since we started #8674

Open
@hendrikmakait

The idea here is similar to #8673:

Since P2P fixes the set of involved workers during the initialization of a shuffle run, we don't benefit from workers who join the cluster afterward. This is particularly important because P2P can't succeed if the sum of available disk space across all involved workers is smaller than the size of the (serialized) data.

Even if we don't hit the heuristic suggested in #8673, I think we should restart a P2P operation if the disk buffer on an involved worker encounters an OSError: [Errno 28] No space left on device and the worker count has grown since we started. We should add a circuit breaker to this, similar to the suspicious_count, so that we don't keep restarting on errors that are genuinely caused by inhomogeneous partitions (or by the cluster refusing to scale to a sufficient size).
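
To make the proposal concrete, here is a minimal sketch of the restart decision. It is not taken from the distributed codebase: `ShuffleRunState`, `should_restart`, `record_restart`, and the `max_restarts` limit are hypothetical names chosen for illustration, and how the current worker count is obtained is left open.

```python
import errno
from dataclasses import dataclass


@dataclass
class ShuffleRunState:
    """Hypothetical bookkeeping for one P2P shuffle (not a distributed class)."""

    workers_at_start: int  # cluster size when the current run was initialized
    restarts: int = 0  # restarts already triggered by ENOSPC


def should_restart(
    exc: BaseException,
    state: ShuffleRunState,
    current_workers: int,
    max_restarts: int = 3,  # assumed circuit-breaker limit
) -> bool:
    """Return True if the shuffle should be restarted instead of failed."""
    # Only react to "No space left on device" (errno 28 on Linux).
    if not (isinstance(exc, OSError) and exc.errno == errno.ENOSPC):
        return False
    # Circuit breaker: stop retrying after too many attempts, e.g. when
    # partitions are inhomogeneous or the cluster cannot scale far enough.
    if state.restarts >= max_restarts:
        return False
    # A restart can only help if workers joined since the run was initialized.
    return current_workers > state.workers_at_start


def record_restart(state: ShuffleRunState, current_workers: int) -> None:
    """Update the bookkeeping after a restart has been triggered."""
    state.restarts += 1
    state.workers_at_start = current_workers
```

Resetting `workers_at_start` on each restart means a further restart requires the cluster to grow again, while the `restarts` counter bounds the total number of attempts regardless of growth.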

Labels: adaptive, enhancement, shuffle
