Restart P2P on `OSError: [Errno 28] No space left on device` instead of failing if the cluster has grown since we started #8674

The idea here is similar to https://github.com/dask/distributed/issues/8673:

Since P2P fixes the set of involved workers during the initialization of a shuffle run, we don't benefit from workers who join the cluster afterward. This is particularly important because P2P can't succeed if the sum of available disk space across all involved workers is smaller than the size of the (serialized) data.
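To make the disk-space constraint concrete, here is a minimal sketch of the necessary condition described above. The function name `p2p_can_fit` and its inputs are hypothetical and not part of `distributed`; how free disk space per worker would actually be obtained is left open.

```python
def p2p_can_fit(free_disk_by_worker: dict[str, int], serialized_nbytes: int) -> bool:
    """Return True if the participating workers have enough disk in aggregate.

    Hypothetical helper: `free_disk_by_worker` maps a worker address to its
    free disk space in bytes, `serialized_nbytes` is the size of the
    serialized shuffle data. This is only a necessary condition; skewed
    partitions can still fill up a single worker's disk.
    """
    return sum(free_disk_by_worker.values()) >= serialized_nbytes


# Two workers fixed at shuffle start, 10 GB of free disk each, 25 GB of data
GB = 10**9
at_start = {"tcp://w1": 10 * GB, "tcp://w2": 10 * GB}
print(p2p_can_fit(at_start, 25 * GB))

# A worker that joined later would only help if the run were restarted
after_scale_up = {**at_start, "tcp://w3": 10 * GB}
print(p2p_can_fit(after_scale_up, 25 * GB))
```

With the worker set fixed at initialization, the first case cannot succeed no matter how far the cluster scales up afterward.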

Even if we don't hit the heuristic suggested in https://github.com/dask/distributed/issues/8673, I think we should restart a P2P operation if the disk buffer on an involved worker encounters an `OSError: [Errno 28] No space left on device` and the worker count has grown since we started. We should add a circuit breaker to this, similar to the `suspicious_count`, so that we don't restart indefinitely when the error is genuinely caused by inhomogeneous partitions (or because the cluster refuses to scale to a sufficient size).
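A rough sketch of the decision logic I have in mind. `RestartPolicy`, `max_restarts`, and the worker-count arguments are hypothetical names for illustration only, not existing `distributed` APIs; where this would hook into the shuffle run is left open.

```python
import errno
from dataclasses import dataclass


@dataclass
class RestartPolicy:
    """Hypothetical circuit breaker for out-of-disk restarts of a P2P run."""

    max_restarts: int = 3  # assumed limit, analogous in spirit to a suspicious count
    restarts: int = 0

    def should_restart(self, exc: BaseException, workers_at_start: int, workers_now: int) -> bool:
        # Only react to "No space left on device"
        if not (isinstance(exc, OSError) and exc.errno == errno.ENOSPC):
            return False
        # Circuit breaker: stop restarting after too many attempts, e.g. when
        # partitions are inhomogeneous or the cluster never gets large enough
        if self.restarts >= self.max_restarts:
            return False
        # Restarting only helps if the cluster has grown since the run started
        if workers_now <= workers_at_start:
            return False
        self.restarts += 1
        return True


policy = RestartPolicy(max_restarts=2)
disk_full = OSError(errno.ENOSPC, "No space left on device")
print(policy.should_restart(disk_full, workers_at_start=4, workers_now=6))
```

Once `max_restarts` is exhausted, the original `OSError` would be raised to the user as it is today.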
