Description
We need a baseline benchmark in terms of both performance and randomness. This will likely come in the form of MappedCollection, where rows are fetched completely independently across stores, i.e., perfect randomness. In that case we can likely assume the sampling is perfectly random, although having a hard number on, e.g., entropy would be helpful. Also, the performance of MappedCollection for this task is likely optimal, i.e., for generating perfectly random samples across a set of datasets.
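As a sketch of how we might put a hard number on randomness, one option is the mean Shannon entropy of the store-label mix within each fetched batch (the function names, batch size, and store count below are all hypothetical, not part of any existing API):

```python
import math
import random
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy (bits) of a multiset of labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mean_batch_entropy(labels, batch_size=64):
    """Average entropy of the store-label mix within each batch.

    A perfectly shuffled interleaving of N equally sized stores approaches
    log2(N) bits per batch; a reader that drains one store at a time
    scores near 0.
    """
    batches = [labels[i:i + batch_size] for i in range(0, len(labels), batch_size)]
    return sum(entropy_bits(b) for b in batches) / len(batches)

random.seed(0)
shuffled = [random.randrange(4) for _ in range(10_000)]  # i.i.d. draws across 4 stores
sequential = sorted(shuffled)                            # one store drained at a time
print(mean_batch_entropy(shuffled))    # close to log2(4) = 2 bits
print(mean_batch_entropy(sequential))  # close to 0 bits
```

This would give a single comparable number per loader configuration, with MappedCollection expected to sit near the log2(N) ceiling.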
The baseline MappedCollection benchmark will need to be re-run under any "recommended" settings that could be incorporated into the underlying stores/reader software:
- Shard size
- Sparse vs. Dense
- Usage of https://zarrs-python.readthedocs.io/en/stable/
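A minimal sketch of what the re-run matrix over those settings might look like (the specific shard sizes and reader names are hypothetical placeholders, to be decided by the benchmark itself):

```python
import itertools

# Hypothetical sweep grid; the concrete values are TBD.
SHARD_SIZES = [1_000, 10_000, 100_000]   # rows per shard
LAYOUTS = ["sparse", "dense"]
READERS = ["zarr-python", "zarrs-python"]

def configs():
    """Yield one benchmark configuration per combination of settings."""
    for shard, layout, reader in itertools.product(SHARD_SIZES, LAYOUTS, READERS):
        yield {"shard_size": shard, "layout": layout, "reader": reader}

n_configs = sum(1 for _ in configs())
print(n_configs)  # 3 * 2 * 2 = 12 configurations
```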
Lastly:
- What is the difference between MappedCollection and our implementation when ours is set to "fetch individual rows instead of chunks"? Are we doing something special (other than block fetching) that is actually faster than MappedCollection for fetching individual rows, i.e., have we accidentally implemented a more efficient mapped loader?
The big question, likely, is at what shard size (i.e., how small) MappedCollection could become as performant as our implementation here (at the same total data size). I suspect "never," given that we batch requests, but this is TBD.
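The "never" intuition can be made concrete with a toy latency model in which every request pays a fixed overhead plus a per-row cost; all constants below are made up for illustration and would need to be measured:

```python
def fetch_time_ms(n_rows, batch_size, per_request_overhead_ms=5.0, per_row_ms=0.01):
    """Toy cost model: total time = (number of requests) * fixed overhead
    + (rows fetched) * per-row cost.

    With batch_size=1 (per-row fetching, as MappedCollection does), the
    fixed overhead is paid n_rows times; batching amortizes it, which is
    why shrinking shards alone may never close the gap.
    """
    n_requests = -(-n_rows // batch_size)  # ceiling division
    return n_requests * per_request_overhead_ms + n_rows * per_row_ms

row_by_row = fetch_time_ms(100_000, batch_size=1)
batched = fetch_time_ms(100_000, batch_size=1024)
print(row_by_row, batched)  # per-row fetching is dominated by request overhead
```

Under this model the per-row reader's cost is bounded below by `n_rows * per_request_overhead_ms` regardless of shard size, so the crossover point would only exist if request overhead were negligible.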