
More efficient way to encode "fetch this whole chunk" #850

@TomNicholas


Currently, if we create a virtual reference that points at an entire object, it is stored internally as

(path=s3://bucket/file.whatever, offset=0, length=<full_length_of_object>)

where <full_length_of_object> has to be determined at parse time.

This leads to a big inefficiency in the ZarrParser, which currently does an O(n_chunks) iteration over all chunks in the Zarr array to discover the sizes of every chunk object. It would be great to be able to skip this iteration.

size = await zarr_array.store.getsize(scalar_key)
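To make the cost concrete, here is a minimal sketch of that per-chunk pattern. It is not the actual ZarrParser code: `CountingStore` and `build_refs_with_sizes` are hypothetical names, and the store is an in-memory stand-in that only counts how many size lookups are made.

```python
import asyncio


class CountingStore:
    """Hypothetical stand-in for a Zarr store; it only counts size lookups."""

    def __init__(self, sizes):
        self.sizes = sizes
        self.getsize_calls = 0

    async def getsize(self, key):
        # Against real object storage, each call would be a network round trip.
        self.getsize_calls += 1
        return self.sizes[key]


async def build_refs_with_sizes(store, prefix, keys):
    """Build (path, offset, length) references with one size lookup per chunk."""
    refs = {}
    for key in keys:
        size = await store.getsize(key)
        refs[key] = (f"{prefix}/{key}", 0, size)
    return refs


keys = [f"c/{i}" for i in range(4)]
store = CountingStore({key: 100 for key in keys})
refs = asyncio.run(build_refs_with_sizes(store, "s3://bucket/array", keys))
```

The number of `getsize` calls grows linearly with the number of chunks, which is the iteration this issue would like to avoid.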

It is apparently possible to issue an HTTP range request for a whole object without specifying the length of the object, using an open-ended range such as Range: bytes=0- (see allowed range header syntax). So if we instead stored the virtual reference as, e.g.,

(path=s3://bucket/file.whatever, offset=0, length=-1)

and our chunk-fetching IO implementations (i.e. ManifestStore/Icechunk/fsspec) knew how to interpret this, then we could skip fetching the object sizes entirely and avoid O(n_chunks) size lookups when virtualizing any (un-sharded) Zarr store.
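As a rough sketch of what "knowing what to do with this" could look like on the reading side, the function below maps an (offset, length) pair to an HTTP Range header value. The names `WHOLE_OBJECT` and `range_header` are hypothetical, and the -1 sentinel is the convention proposed in this issue, not something the existing readers are known to support.

```python
WHOLE_OBJECT = -1  # hypothetical sentinel: "read from offset to the end of the object"


def range_header(offset: int, length: int) -> str:
    """Return an HTTP Range header value for a virtual reference.

    HTTP byte ranges are inclusive, so a read of `length` bytes starting at
    `offset` ends at byte `offset + length - 1`. An open-ended range
    ("bytes=<offset>-") asks for everything from `offset` to the end,
    so the object size is not needed.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if length == WHOLE_OBJECT:
        return f"bytes={offset}-"
    if length <= 0:
        raise ValueError("length must be positive or the WHOLE_OBJECT sentinel")
    return f"bytes={offset}-{offset + length - 1}"


whole = range_header(0, WHOLE_OBJECT)
partial = range_header(10, 5)
```

With offset=0 and the sentinel length, the header requests the whole object without a prior size lookup; references with explicit lengths keep their current behaviour.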
