A basic default ChunkManager for arrays that report their own chunks #8733

Open
@hmaarrfk

Description

Is your feature request related to a problem?

I'm creating duck arrays for various file-backed data structures of mine that are naturally "chunked", i.e. different parts of the array may live in completely different files.

Given these "chunks" and the "strides", algorithms can better decide how to iterate over the data in a convenient manner.

For example, an MP4 file's chunks may be defined as being delimited by I frames, while images stored in a TIFF may be delimited by a page.

So for me, chunks are not so useful for parallel computing, but more for computing locally and choosing the appropriate way to iterate through large arrays (TB of uncompressed data).
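As a rough illustration of the kind of iteration I mean, here is a minimal sketch that turns a dask-style `chunks` tuple (one tuple of chunk lengths per axis) into index slices, so data can be visited one chunk at a time. The helper name `chunk_slices` is hypothetical and not part of xarray.

```python
from itertools import accumulate, product


def chunk_slices(chunks):
    """Yield one tuple of slices per chunk, in row-major order.

    `chunks` is assumed to be in normalized form: a tuple with one entry
    per axis, each entry being a tuple of chunk lengths along that axis.
    """
    per_axis = []
    for lengths in chunks:
        # Running totals give the exclusive end index of each chunk.
        stops = list(accumulate(lengths))
        starts = [0] + stops[:-1]
        per_axis.append([slice(a, b) for a, b in zip(starts, stops)])
    # The Cartesian product of per-axis slices enumerates every chunk.
    yield from product(*per_axis)


# Example: 5 frames split into groups of 2, 2 and 1, each frame 3 wide.
for index in chunk_slices(((2, 2, 1), (3,))):
    print(index)
```

Indexing the array with each yielded tuple (`arr[index]`) would then touch exactly one chunk, e.g. one group of frames or one TIFF page.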

Describe the solution you'd like

I think a default chunk manager could simply implement compute as np.asarray, and act as a catch-all for any chunked array that no other chunk manager claims.

Advanced users could then go in and implement their own chunk manager, but I was unable to use my duck arrays that included a chunks property because they weren't associated with any chunk manager.

Something as simple as:

diff --git a/xarray/core/parallelcompat.py b/xarray/core/parallelcompat.py
index c009ef48..bf500abb 100644
--- a/xarray/core/parallelcompat.py
+++ b/xarray/core/parallelcompat.py
@@ -681,3 +681,26 @@ class ChunkManagerEntrypoint(ABC, Generic[T_ChunkedArray]):
         cubed.store
         """
         raise NotImplementedError()
+
+
+class DefaultChunkManager(ChunkManagerEntrypoint):
+    def __init__(self) -> None:
+        self.array_cls = None
+
+    def is_chunked_array(self, data: Any) -> bool:
+        return is_duck_array(data) and hasattr(data, "chunks")
+
+    def chunks(self, data: T_ChunkedArray) -> T_NormalizedChunks:
+        return data.chunks
+
+    def compute(self, *data: T_ChunkedArray | Any, **kwargs) -> tuple[np.ndarray, ...]:
+        return tuple(np.asarray(d) for d in data)
+
+    def normalize_chunks(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def from_array(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def apply_gufunc(self, *args, **kwargs):
+        raise NotImplementedError()
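
The fallback behaviour in the diff above can be tried outside xarray. The following is only a sketch under stated assumptions: `FileBackedArray`, `is_chunked` and `compute` are hypothetical stand-ins (the last two mirror `is_chunked_array` and `compute` from the diff), and the in-memory "pages" stand in for data that would really be read from separate files.

```python
import numpy as np


class FileBackedArray:
    """Hypothetical duck array whose chunks correspond to separate pages."""

    def __init__(self, pages):
        # Each page stands in for data that would be loaded lazily from disk.
        self._pages = [np.asarray(p) for p in pages]
        self.dtype = self._pages[0].dtype
        n_rows = sum(p.shape[0] for p in self._pages)
        self.shape = (n_rows,) + self._pages[0].shape[1:]
        self.ndim = len(self.shape)

    @property
    def chunks(self):
        # One tuple of chunk lengths per axis; only axis 0 is split here.
        first_axis = tuple(p.shape[0] for p in self._pages)
        return (first_axis,) + tuple((n,) for n in self.shape[1:])

    def __array__(self, dtype=None, copy=None):
        # np.asarray() ends up here; materialize everything in memory.
        out = np.concatenate(self._pages, axis=0)
        return out if dtype is None else out.astype(dtype)


def is_chunked(data):
    # Simplified stand-in for the is_chunked_array check in the diff.
    return hasattr(data, "__array__") and hasattr(data, "chunks")


def compute(*data):
    # Same idea as DefaultChunkManager.compute: just convert to NumPy.
    return tuple(np.asarray(d) for d in data)


arr = FileBackedArray([np.zeros((2, 3)), np.ones((1, 3))])
(loaded,) = compute(arr)
print(is_chunked(arr), arr.chunks, loaded.shape)
```

Nothing here schedules work in parallel; the chunks are only reported, which is all the default manager would need.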

Describe alternatives you've considered

I created my own chunk manager, with my own chunk manager entry point.

Kinda tedious...

Additional context

It seems that this is related to: #7019
