To evaluate this we should account for multiple factors:
Compression
- Is there a point at which dense compression (e.g., blosc) handling the repeated 0's overtakes sparse storage in either on-disk size (i.e., smaller) or read throughput, especially given that we can't block-align sparse due to the file format? (See the size-comparison sketch after this list.)
- Is the same compression method (e.g., blosc) that works well for dense data also good for the sparse data itself (beyond the fact that sparse data is already "compressed")?
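As a rough probe of the first question, here is a minimal sketch comparing the blosc-compressed size of a dense block against blosc-compressed CSR components. The matrix shape, densities, and blosc settings (zstd, level 5, byte shuffle) are illustrative assumptions, not project defaults, and read throughput would need a separate timing harness:

```python
import numpy as np
import scipy.sparse as sp
from numcodecs import Blosc

rng = np.random.default_rng(0)
codec = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)

def compressed_nbytes(arr: np.ndarray) -> int:
    """Blosc-compressed size of a contiguous array, in bytes."""
    return len(codec.encode(np.ascontiguousarray(arr)))

for density in (0.001, 0.01, 0.1):
    csr = sp.random(10_000, 2_000, density=density, format="csr",
                    dtype=np.float32, random_state=rng)
    dense = csr.toarray()

    dense_size = compressed_nbytes(dense)
    # Compress each CSR component separately, roughly mirroring a
    # sparse-on-zarr layout that stores data/indices/indptr as separate arrays.
    sparse_size = sum(compressed_nbytes(x)
                      for x in (csr.data, csr.indices, csr.indptr))
    print(f"density={density:.3f}  dense+blosc={dense_size/1e6:.2f} MB  "
          f"sparse+blosc={sparse_size/1e6:.2f} MB")
```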
Sharding
- What is the impact of using very tiny shards? E.g., https://github.com/zarrs/zarr_benchmarks?tab=readme-ov-file#standalone-2 shows that reading tiny shards is more performant than reading chunks.
- Does making these tiny shards block-aligned outperform sparse, even if the sparse layout also uses tiny shards?
- Do we need to use https://zarrs-python.readthedocs.io/en/stable/, and at what point does it become necessary for good performance with sharding? (A setup sketch follows this list.)
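To make the shard experiments concrete, below is a minimal sketch of creating a dense array with tiny shards and optionally routing reads through zarrs-python. It assumes zarr-python >= 3; the store path and the shard/chunk shapes are placeholders to tune, not recommended values:

```python
import numpy as np
import zarr

# Optional: route decoding through the Rust codec pipeline from zarrs-python,
# which is where the large sharded-read speedups are reported.
# import zarrs  # noqa: F401
# zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

store = zarr.storage.LocalStore("tiny_shards.zarr")
arr = zarr.create_array(
    store=store,
    shape=(100_000, 2_000),
    dtype="float32",
    shards=(1_000, 2_000),  # a "tiny" shard: 1k rows per shard object
    chunks=(100, 2_000),    # inner chunks that can be read individually
    overwrite=True,
)
arr[:1_000] = np.random.default_rng(0).random((1_000, 2_000), dtype=np.float32)

# A row-slice read, i.e. the access pattern a training dataloader would issue.
batch = arr[0:100]
print(batch.shape)
```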
Densification
- What is the impact of densification as an operation on this tradeoff? We can't rely on using sparse matrices as input to models and thus need to densify.
- Does GPU densification help with this? (Definitely yes, but we should understand it better; see the next point.)
- When is the best time to densify: batch-by-batch, or within prefetching? I.e., is densifying a large quantity of data at once better than densifying small quantities repeatedly? (See the sketch after this list.)
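As a starting point for that last question, here is a minimal sketch contrasting densifying one large prefetched block versus densifying batch-by-batch. It assumes SciPy CSR blocks coming off storage and PyTorch for the GPU step; the shapes, density, and batch size are illustrative assumptions:

```python
import numpy as np
import scipy.sparse as sp
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
block = sp.random(4_096, 2_000, density=0.01, format="csr", dtype=np.float32)
batch_size = 256

def to_torch_csr(csr) -> torch.Tensor:
    """Wrap a SciPy CSR matrix as a torch sparse CSR tensor (no value copy)."""
    return torch.sparse_csr_tensor(
        torch.from_numpy(csr.indptr), torch.from_numpy(csr.indices),
        torch.from_numpy(csr.data), size=csr.shape,
    )

# Strategy A: densify the whole prefetched block once on the GPU, then split
# it into training batches.
dense_block = to_torch_csr(block).to(device).to_dense()
batches_a = list(dense_block.split(batch_size))

# Strategy B: densify batch-by-batch, paying the transfer and kernel-launch
# overhead on every small slice.
batches_b = [
    to_torch_csr(block[i:i + batch_size]).to(device).to_dense()
    for i in range(0, block.shape[0], batch_size)
]

assert torch.allclose(batches_a[0], batches_b[0])
```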
There will likely be interplay along all of these axes within the context of this sparse/dense tradeoff.