Description
Currently, if we create a virtual reference that points at an entire object, it is stored internally as
`(path=s3://bucket/file.whatever, offset=0, length=<full_length_of_object>)`, where `<full_length_of_object>` has to be determined at parse time.
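For concreteness, such a whole-object reference might look something like this as a (kerchunk-style) chunk manifest entry; the key and values here are purely illustrative:

```python
# Purely illustrative manifest entry for a whole-object reference.
# `length` must currently hold the full object size, which has to be
# looked up at parse time.
manifest_entry = {
    "0.0": {
        "path": "s3://bucket/file.whatever",
        "offset": 0,
        "length": 1_234_567,  # <full_length_of_object>, discovered at parse time
    }
}
```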
Determining that length leads to a big inefficiency in the `ZarrParser`, which currently does an O(n_chunks) iteration over all chunks in the Zarr array just to discover the size of every chunk object. It would be great to be able to skip this iteration.
VirtualiZarr/virtualizarr/parsers/zarr.py, line 81 at c67dcc1:

```python
size = await zarr_array.store.getsize(scalar_key)
```
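In other words, the parser currently has to do something like the following for every chunk (a simplified sketch, not the actual implementation; `build_manifest_entries`, `chunk_keys`, and the path construction are stand-ins):

```python
# Simplified sketch of the current behaviour, not the real parser code:
# one store round-trip per chunk, purely to fill in `length`.
async def build_manifest_entries(zarr_array, chunk_keys, bucket="bucket"):
    entries = {}
    for key in chunk_keys:
        size = await zarr_array.store.getsize(key)  # O(n_chunks) getsize requests
        entries[key] = {
            "path": f"s3://{bucket}/{key}",  # illustrative path construction
            "offset": 0,
            "length": size,
        }
    return entries
```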
It is apparently possible to issue an HTTP range request for a whole object without specifying the object's length (see the allowed Range header syntax; `Range: bytes=0-` requests everything from byte 0 to the end). So if we instead stored the virtual reference as, e.g.,
`(path=s3://bucket/file.whatever, offset=0, length=-1)`, and our chunk-fetching IO implementations (i.e. ManifestStore / Icechunk / fsspec) knew what to do with this, then we could skip getting the object size entirely, saving O(n_chunks) getsize requests when virtualizing any (un-sharded) Zarr store.
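A minimal sketch of how an fsspec-based reader could interpret such a sentinel, assuming we pick `length=-1` to mean "read to the end of the object" (`read_virtual_ref` is a hypothetical helper, not an existing VirtualiZarr/Icechunk API):

```python
import fsspec

# Hypothetical helper, not an existing API: translate a virtual reference
# into a byte-range read. length == -1 means "the rest of the object",
# which corresponds to an open-ended range request (Range: bytes=<offset>-),
# so no prior getsize()/HEAD call is needed.
def read_virtual_ref(path: str, offset: int, length: int) -> bytes:
    fs, urlpath = fsspec.core.url_to_fs(path)
    # fsspec's cat_file treats end=None as "read to the end of the file"
    end = None if length == -1 else offset + length
    return fs.cat_file(urlpath, start=offset, end=end)
```

Whatever sentinel is chosen, ManifestStore and Icechunk would need to agree on it so that the value round-trips through the manifest unchanged.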