Description
Apologies if this has already been requested, or is clearly impossible for some reason. My Dask knowledge isn't super deep.
I know that OSErrors, which can occur when a disk fills up, are handled fairly gracefully when spilling:
distributed/distributed/spill.py
Lines 134 to 137 in 81774d4
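Roughly, the pattern there is: catch the OSError from the disk write, log it, and keep the data in memory rather than failing anything. A minimal standalone sketch of that pattern (names and signature are mine for illustration, not the actual `spill.py` code -- see the permalink above for the real lines):

```python
import logging

logger = logging.getLogger(__name__)


def spill_to_disk(storage: dict, key: str, value: bytes, path: str) -> None:
    """Hypothetical spill helper illustrating the graceful-degradation
    pattern: an OSError (e.g. ENOSPC from a full disk) is logged and
    swallowed, and the value simply stays in memory."""
    try:
        with open(path, "wb") as fh:
            fh.write(value)
    except OSError:
        # Disk full or a similar worker-local problem: log it, keep the
        # data in memory, and carry on. No task ever sees the error.
        logger.error("Spill to disk failed; keeping data in memory", exc_info=True)
        storage[key] = value
```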
However, I am frequently running into OSErrors during the shuffle operation, here:
distributed/distributed/shuffle/_disk.py
Lines 179 to 180 in 81774d4
These do not appear to be handled well -- they are treated as if they were an error in the task itself and surfaced to me, when really I would like the task to be rerun elsewhere, since the problem is local to one worker. Even killing the worker in question and letting Dask recompute the necessary data would be more graceful.
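What I have in mind is something like the sketch below (all names are hypothetical -- I don't know the P2P shuffle internals well enough to say where this would actually go): classify ENOSPC-style OSErrors as worker-local and let them trigger a retry elsewhere, instead of surfacing them as task errors.

```python
import errno
import logging
import pathlib

logger = logging.getLogger(__name__)


class WorkerLocalDiskError(OSError):
    """Hypothetical marker exception: the failure is specific to this
    worker's disk, so the work should be retried on another worker
    rather than surfaced to the user as a task error."""


def write_shuffle_frames(path: pathlib.Path, frames: list[bytes]) -> None:
    # Hypothetical wrapper around the raw write in shuffle/_disk.py.
    try:
        with path.open("ab") as fh:
            for frame in frames:
                fh.write(frame)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            # Disk full: a worker-local condition, not a bug in the task.
            logger.error("Shuffle write failed on this worker", exc_info=True)
            raise WorkerLocalDiskError(*e.args) from e
        raise
```

I'm aware `distributed.Reschedule` exists for tasks that want to be rerun elsewhere, but I don't know whether it applies here, since the shuffle's disk writes may not happen on the task thread.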
This is a frequent annoyance when I run large dataframe operations (dataframes with a few hundred million rows and ~15 string columns) on a cluster with unpredictable disk capacity constraints (which is a separate issue, but not one I would expect to bubble up like this).
I can provide more details, such as a stack trace, if this is unexpected or should already work -- but I don't see any sign in the code that this is a bug; it looks more like a missing feature.