Can we optimize reading of archival data containing small chunks by pretending that files are zarr shards?
Maximum read throughput from S3 is achieved for objects above a certain size (maybe a few MB? The exact threshold depends on various implementation details, and is lower in Rust than in Python). Archival data often has chunks smaller than that size but files larger than it. If we could read e.g. one whole TIFF/HDF5 file in one request, then read throughput from the virtual store could sidestep the bottleneck of the per-request overhead of reading many tiny chunks.
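To make the intuition concrete, here's a back-of-envelope cost model (the latency and bandwidth numbers are illustrative assumptions, not measurements): with many tiny requests the total time is dominated by per-request latency, while one large request is bandwidth-bound.

```python
# Naive serial-request cost model: per-request latency + transfer time.
# latency_s and bandwidth_bps are made-up but plausible S3-ish numbers.

def fetch_time_s(n_requests: int, total_bytes: int,
                 latency_s: float = 0.03, bandwidth_bps: float = 100e6) -> float:
    return n_requests * latency_s + total_bytes / bandwidth_bps

file_bytes = 64 * 2**20        # one hypothetical 64 MiB file...
chunk_bytes = 32 * 2**10       # ...containing 32 KiB chunks
n_chunks = file_bytes // chunk_bytes  # 2048 chunks

per_chunk = fetch_time_s(n_chunks, file_bytes)  # 2048 tiny GETs
one_shot = fetch_time_s(1, file_bytes)          # a single big GET
print(f"{n_chunks} chunk reads: {per_chunk:.1f}s; one file read: {one_shot:.2f}s")
```

Under these (assumed) numbers the single read wins by roughly two orders of magnitude; real clients pipeline requests concurrently, which narrows but doesn't eliminate the gap.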
Whether this would actually be faster overall than just fetching chunks individually depends on the density of chunks within our "shard" (which can be visualized with @maxrjones' tool). If the chunks are tightly packed, then the single shard request would fetch only data it needs, and this would be a strict improvement. But if they are too spaced out for any reason (e.g. because the file-shard contains multiple variables with interleaved chunks, which HDF5 definitely does), you might end up fetching too much data that isn't relevant for that array, and it could be less efficient overall.
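The density argument can be sketched as a one-line calculation over hypothetical byte ranges: for one array's chunks, a single ranged GET must span from the first chunk's start to the last chunk's end, and only part of that span is useful.

```python
# Sketch: fraction of a single ranged read that is useful data for one
# array. chunk_ranges are hypothetical (offset, nbytes) pairs.

def useful_fraction(chunk_ranges: list[tuple[int, int]]) -> float:
    start = min(off for off, _ in chunk_ranges)
    end = max(off + n for off, n in chunk_ranges)
    wanted = sum(n for _, n in chunk_ranges)
    return wanted / (end - start)

# Tightly packed chunks: everything fetched is useful.
print(useful_fraction([(0, 100), (100, 100), (200, 100)]))  # 1.0
# Interleaved with another variable's chunks: 40% of the bytes are waste.
print(useful_fraction([(0, 100), (200, 100), (400, 100)]))  # 0.6
```

The break-even point is wherever the wasted bytes' transfer time exceeds the saved per-request latency.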
Sharding is Zarr's implementation of internal chunking, where one shard is stored as one object but contains many chunks, along with the shard index, which tells you where exactly inside the shard each chunk is. Zarr v3 implements sharding as a codec, so in theory maybe we could change our parsers to return a codec chain containing the ShardingCodec, with the shard index pointing at the internal TIFF/HDF5 chunks?
Unfortunately I don't think that will work, because where would we put the shard index?
- In native zarr v3 it is part of the object, but in virtual stores we can't change the object. So does it have to go in memory? Does that then mean it needs to get serialized into icechunk separately somehow? It's annoying that sharding is specified as a codec rather than as a first-class part of the zarr spec...
- We can't treat the file itself as the shard either, because the ShardingCodec assumes a particular format for the index within the object, which won't match pre-existing archival TIFF/HDF5 files.
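One alternative worth sketching: keep the "index" outside the object entirely, as an in-memory manifest mapping chunk keys to (file, offset, nbytes) triples (essentially what virtual chunk manifests already record), grouped so a reader could coalesce all chunks from one file into a single ranged GET. All paths and offsets here are hypothetical:

```python
from collections import defaultdict

# Hypothetical virtual-manifest entries: chunk key -> (path, offset, nbytes)
manifest = {
    "temp/c/0/0": ("data.h5", 512, 4096),
    "temp/c/0/1": ("data.h5", 4608, 4096),
    "temp/c/1/0": ("other.h5", 0, 4096),
}

# Group chunks by backing file so each file needs only one ranged request.
by_file: dict[str, list[tuple[str, int, int]]] = defaultdict(list)
for key, (path, off, n) in manifest.items():
    by_file[path].append((key, off, n))

for path, chunks in sorted(by_file.items()):
    start = min(off for _, off, _ in chunks)
    end = max(off + n for _, off, n in chunks)
    print(f"{path}: one GET of bytes {start}-{end - 1} covers {len(chunks)} chunks")
```

This sidesteps the ShardingCodec's on-disk index format entirely, at the cost of the coalescing logic living in the store/reader rather than in a codec.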
This just leaves me thinking that zarr should not have implemented sharding as a codec...
cc @sharkinsspatial @jsignell