
Automatically using Dask for large datasets when necessary #4095

@Zethson

Coming out of a discussion with @ilan-gold.

Idea

We keep noticing that many users, even technical ones, are not aware of how to analyze larger-than-memory datasets. There are multiple ways to improve this, including workshops and more prominent tutorials, but there might also be a technical solution. This is inspired a bit by @timtreis's spatialdata-plot fallback, where large datasets automatically use datashader to get the job done.

Analogously, we could calculate the dataset size and the estimated memory requirements. If we suspect that memory is insufficient, we could automatically (and transparently) create a Dask array to perform the computation out-of-core (OOC). In such cases, we'd likely need to return a Dask array, which could confuse users. I'd suggest enabling this behavior by default with an option to opt out, but we could of course start with it being opt-in while we try it out. I'd argue that many users don't understand, and don't want to deal with, what it takes to make large-dataset processing possible; if we can do it for them, this might be a cheap win.
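
A minimal sketch of what such a size check could look like, assuming a dense array and a caller-supplied memory budget. The names `estimated_bytes`, `should_use_dask`, `maybe_as_dask` and the `safety_factor` default are hypothetical and not part of scanpy or anndata; only `dask.array.from_array` is an existing Dask call.

```python
import math


def estimated_bytes(shape, itemsize):
    """Bytes needed to hold a dense array of this shape in memory."""
    return math.prod(shape) * itemsize


def should_use_dask(shape, itemsize, available_bytes, safety_factor=3.0):
    """Return True if the estimated peak memory exceeds the budget.

    safety_factor is an assumed guess at how many copies of the data an
    algorithm holds at its peak; real algorithms would need their own value.
    """
    return estimated_bytes(shape, itemsize) * safety_factor > available_bytes


def maybe_as_dask(x, available_bytes, safety_factor=3.0):
    """Wrap an array-like (e.g. an on-disk dataset) lazily if it looks too large."""
    if not should_use_dask(x.shape, x.dtype.itemsize, available_bytes, safety_factor):
        return x
    import dask.array as da  # imported lazily so small datasets never need Dask

    # Let Dask pick chunk sizes; the result is lazy and nothing is loaded yet.
    return da.from_array(x, chunks="auto")
```

With a 16 GiB budget, `should_use_dask((1_000_000, 30_000), 4, 16 * 2**30)` would trigger the fallback, while a 1,000 x 2,000 matrix would stay in memory. Sparse matrices would need an estimate based on the number of stored values rather than the shape.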

I'll omit an API draft for now because this would just be a new setting for the settings class.

I'm primarily suggesting this for scanpy, but if this is worth considering, we should harmonize the behavior across all of scverse. Therefore, I'm kindly asking for feedback from all of @scverse/core-devs.

Caveats

  1. Ideally, we should of course complete the Dask support across all of scanpy's algorithms.
  2. @ilan-gold mentioned that we don't retain the file handle of anndata objects but this could be changed.
  3. I don't know how easy it is to estimate available memory on Slurm jobs. There might be some hidden complexity here. I also don't know how difficult it is to estimate the required peak memory for the respective algorithms.
  4. We might need to also estimate chunk size then.
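
For caveats 3 and 4, a rough sketch under stated assumptions: it assumes Slurm exposes the job's memory request through the `SLURM_MEM_PER_NODE` or `SLURM_MEM_PER_CPU` environment variables as integer megabytes, and that `SLURM_CPUS_ON_NODE` gives the allocated CPU count. Whether a given cluster sets these, and whether cgroup limits differ from them, would need checking. The helper names and the 128 MiB chunk target are hypothetical.

```python
import os


def slurm_memory_bytes(environ=os.environ):
    """Memory granted to the job on this node, or None outside Slurm.

    Assumes Slurm reports both variables as integer megabytes.
    """
    per_node = environ.get("SLURM_MEM_PER_NODE")
    if per_node is not None:
        return int(per_node) * 2**20
    per_cpu = environ.get("SLURM_MEM_PER_CPU")
    cpus = environ.get("SLURM_CPUS_ON_NODE")
    if per_cpu is not None and cpus is not None:
        return int(per_cpu) * int(cpus) * 2**20
    return None


def rows_per_chunk(n_cols, itemsize, target_chunk_bytes=128 * 2**20):
    """Number of full rows that fit into one chunk of roughly the target size."""
    return max(1, target_chunk_bytes // (n_cols * itemsize))
```

For a float32 matrix with 30,000 columns and the default target, `rows_per_chunk(30_000, 4)` gives 1118 rows per chunk. Estimating the peak memory of each algorithm is a separate problem that this sketch does not address.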

Maybe catching OOM errors better and pointing users to our Dask tutorials is already good enough, but I think this is worth discussing.
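
The lighter-weight alternative could be sketched as a decorator that catches Python's `MemoryError` and re-raises it with a pointer to the tutorials. The decorator name, the hint text, and the `densify` stand-in are hypothetical. Note that this only helps when Python itself raises `MemoryError`; a process killed by the kernel or by the scheduler for exceeding its limit cannot be caught this way.

```python
import functools

# Hypothetical hint text; the real wording and tutorial link are up to the maintainers.
DASK_HINT = "The dataset may not fit into memory; see the Dask tutorials for out-of-core processing."


def hint_on_oom(func):
    """Re-raise MemoryError from func with a hint about out-of-core processing."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except MemoryError as err:
            # Chain the original error so the traceback stays informative.
            raise MemoryError(f"{func.__name__} ran out of memory. {DASK_HINT}") from err

    return wrapper


@hint_on_oom
def densify(n_rows, n_cols):
    """Stand-in for a memory-hungry function; here it always fails."""
    raise MemoryError("unable to allocate array")
```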
