Description
The idea here is similar to #8673:
Since P2P fixes the set of involved workers during the initialization of a shuffle run, we don't benefit from workers who join the cluster afterward. This is particularly important because P2P can't succeed if the sum of available disk space across all involved workers is smaller than the size of the (serialized) data.
Even if we don't hit the heuristic suggested in #8673, I think we should restart a P2P operation if the disk buffer on an involved worker encounters a OSError: [Errno 28] No space left on device
and the worker count has grown since we started. We should add a circuit-breaker to this similar to the suspicious_count
to avoid errors that are genuinely caused by inhomogeneous partitions (or because the cluster refuses to scale to a sufficient size).