
Gracefully handle unevenly distributed disk space during P2P #8433

Open
@zmbc

Description


Apologies if this has already been requested, or is clearly impossible for some reason. My Dask knowledge isn't super deep.

I know that OSErrors, which can occur due to a disk being full, are handled pretty gracefully when spilling:

except OSError:
    # Typically, this is a disk full error
    logger.error("Spill to disk failed; keeping data in memory", exc_info=True)
    raise HandledError()

However, I am frequently running into OSErrors during the shuffle operation, here:

with open(self.directory / str(id), mode="ab") as f:
    f.writelines(frames)

It does not appear that these are handled well: they are treated as an error in the task itself and surfaced to me, when I would really like the task to be rerun elsewhere, since the problem is local to one worker. Even killing the worker in question and letting Dask recompute the necessary data would be more graceful. A rough sketch of the kind of handling I have in mind is below.
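This is only a hypothetical sketch, not distributed's actual API: the idea is to catch a disk-full OSError at the point where shuffle shards are written and re-raise it as a dedicated, worker-local error that the scheduler could treat as retryable. The P2POutOfDiskError name and write_shards helper are made up for illustration.

import errno


class P2POutOfDiskError(OSError):
    """Hypothetical marker: this worker cannot accept more shuffle shards."""


def write_shards(path, frames):
    # Same append-to-file write as in the disk buffer, but a full disk
    # becomes a distinguishable, worker-local condition instead of being
    # surfaced as a failure of the user's task.
    try:
        with open(path, mode="ab") as f:
            f.writelines(frames)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            raise P2POutOfDiskError(*e.args) from e
        raise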

This is a frequent annoyance for me when running large dataframe operations (dataframes with a few hundred million rows and ~15 string columns) on a cluster with unpredictable disk capacity constraints (which is a separate issue, but one I would not expect to bubble up like this). The rough shape of the workload is sketched below.
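For context, something along these lines, assuming a recent dask version where the shuffle method can be selected via config; the paths, column name, and sizes are illustrative only:

import dask
import dask.dataframe as dd

# Use the P2P shuffle implementation (config key assumed available in
# recent dask releases).
dask.config.set({"dataframe.shuffle.method": "p2p"})

# Illustrative input: a few hundred million rows, ~15 string columns.
df = dd.read_parquet("s3://my-bucket/input/")
shuffled = df.shuffle(on="some_key_column")
shuffled.to_parquet("s3://my-bucket/output/")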

I can provide more details, such as a stack trace, if this is unexpected or should already work. But I don't see any signs in the code that this is a bug; it looks more like a missing feature.

Metadata

    Labels

    enhancement (Improve existing functionality or make things work better)
    shuffle
