Description
We need a baseline benchmark in terms of both performance and randomness. This will likely come in the form of MappedCollection, where rows are fetched completely independently across stores, i.e., perfect randomness. In that case we can likely assume the sampling is perfectly random, although having a hard number on, e.g., entropy would be helpful. Also, the performance of MappedCollection for this task is likely optimal, i.e., for generating perfectly random samples across a set of datasets.
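As a sketch of how we might put a hard number on randomness, one option is the mean Shannon entropy of the store-label mix within each fetched batch (the function names, batch size, and store count below are all hypothetical, not part of any existing API):

```python
import math
import random
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy (bits) of a multiset of labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mean_batch_entropy(labels, batch_size=64):
    """Average entropy of the store-label mix within each batch.

    A perfectly shuffled interleaving of N equally sized stores approaches
    log2(N) bits per batch; a reader that drains one store at a time
    scores near 0.
    """
    batches = [labels[i:i + batch_size] for i in range(0, len(labels), batch_size)]
    return sum(entropy_bits(b) for b in batches) / len(batches)

random.seed(0)
shuffled = [random.randrange(4) for _ in range(10_000)]  # i.i.d. draws across 4 stores
sequential = sorted(shuffled)                            # one store drained at a time
print(mean_batch_entropy(shuffled))    # close to log2(4) = 2 bits
print(mean_batch_entropy(sequential))  # close to 0 bits
```

This would give a single comparable number per loader configuration, with MappedCollection expected to sit near the log2(N) ceiling.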
The baseline MappedCollection benchmark will need to be re-run under any "recommended" settings that could be incorporated into the underlying stores/reader software:
- Shard size
- Sparse vs. Dense
- Usage of https://zarrs-python.readthedocs.io/en/stable/
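A minimal sketch of what the re-run matrix over those settings might look like (the specific shard sizes and reader names are hypothetical placeholders, to be decided by the benchmark itself):

```python
import itertools

# Hypothetical sweep grid; the concrete values are TBD.
SHARD_SIZES = [1_000, 10_000, 100_000]   # rows per shard
LAYOUTS = ["sparse", "dense"]
READERS = ["zarr-python", "zarrs-python"]

def configs():
    """Yield one benchmark configuration per combination of settings."""
    for shard, layout, reader in itertools.product(SHARD_SIZES, LAYOUTS, READERS):
        yield {"shard_size": shard, "layout": layout, "reader": reader}

n_configs = sum(1 for _ in configs())
print(n_configs)  # 3 * 2 * 2 = 12 configurations
```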
Lastly:
- What is the difference between MappedCollection and our implementation when ours is set to "fetch individual rows instead of chunks"? Are we doing something special (other than block fetching) that is actually faster than MappedCollection for fetching individual rows, i.e., have we accidentally implemented a more efficient mapped loader?
The big question, likely, is at what shard size (i.e., how small) MappedCollection could become as performant as our implementation here (at the same total data size). I suspect "never," given that we batch requests, but this is TBD.
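The "never" intuition can be made concrete with a toy latency model in which every request pays a fixed overhead plus a per-row cost; all constants below are made up for illustration and would need to be measured:

```python
def fetch_time_ms(n_rows, batch_size, per_request_overhead_ms=5.0, per_row_ms=0.01):
    """Toy cost model: total time = (number of requests) * fixed overhead
    + (rows fetched) * per-row cost.

    With batch_size=1 (per-row fetching, as MappedCollection does), the
    fixed overhead is paid n_rows times; batching amortizes it, which is
    why shrinking shards alone may never close the gap.
    """
    n_requests = -(-n_rows // batch_size)  # ceiling division
    return n_requests * per_request_overhead_ms + n_rows * per_row_ms

row_by_row = fetch_time_ms(100_000, batch_size=1)
batched = fetch_time_ms(100_000, batch_size=1024)
print(row_by_row, batched)  # per-row fetching is dominated by request overhead
```

Under this model the per-row reader's cost is bounded below by `n_rows * per_request_overhead_ms` regardless of shard size, so the crossover point would only exist if request overhead were negligible.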