Baseline Benchmarks #5

@ilan-gold

We need a baseline benchmark for both performance and randomness. This will likely take the form of MappedCollection, where rows are fetched completely independently across stores, i.e., perfect randomness. In that case we can likely just assume perfect randomness, although having a hard number on, e.g., entropy would be helpful. Also, the performance of MappedCollection for this task, i.e., generating perfectly random samples across a set of datasets, is likely optimal.
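One way to put a hard number on randomness is the Shannon entropy of the dataset-of-origin labels within each batch. The sketch below is only an illustration of that idea: `label_entropy` and `mean_batch_entropy` are hypothetical helper names, and it assumes each sampled row can be tagged with the id of the dataset it came from. It does not use or depend on the MappedCollection API.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    counts = Counter(labels)
    # p * log2(1/p) summed over observed labels
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def mean_batch_entropy(dataset_ids, batch_size):
    """Average per-batch entropy over all full batches, in sampling order.

    `dataset_ids` is the sequence of dataset-of-origin ids in the order
    the loader emitted the rows (an assumption of this sketch).
    """
    batches = [
        dataset_ids[i:i + batch_size]
        for i in range(0, len(dataset_ids), batch_size)
    ]
    full = [b for b in batches if len(b) == batch_size]
    return sum(label_entropy(b) for b in full) / len(full)


# Rows emitted dataset by dataset: every batch comes from one dataset.
blocked = [0] * 4 + [1] * 4
# Rows perfectly interleaved across the two datasets.
interleaved = [0, 1] * 4

print(mean_batch_entropy(blocked, batch_size=4))      # 0.0 bits
print(mean_batch_entropy(interleaved, batch_size=4))  # 1.0 bits
```

For equally sized datasets the upper bound is log2 of the number of datasets, so a baseline loader and a block-fetching loader could be compared by how close their mean batch entropy gets to that bound.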

The baseline MappedCollection will need to be re-run given any "recommended" settings that it itself could incorporate into the underlying stores/reader software:

Lastly:

  • What is the difference between MappedCollection and our implementation when our implementation is set to "fetch individual rows instead of chunks"? Are we doing something special (other than block fetching) that is actually faster than MappedCollection for fetching individual rows, i.e., have we implemented a more efficient mapped loader by accident?

The big question, likely, is at what shard size (i.e., how small) MappedCollection would potentially become as performant as our implementation here (at the same shard size). I suspect "never" given that we batch requests, but this is TBD.
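As a rough way to reason about the shard-size question before measuring anything, one can count reads rather than time them. The sketch below is a toy model, not a description of either implementation: it assumes a store chunked along rows with `chunk_size` rows per chunk, treats per-row fetching as one chunk read per requested row, and treats batched fetching as one read per distinct chunk. `chunks_touched` and `read_counts` are hypothetical names.

```python
import random


def chunks_touched(row_indices, chunk_size):
    """Distinct chunk ids covering the requested rows (assumed row-chunked layout)."""
    return {i // chunk_size for i in row_indices}


def read_counts(row_indices, chunk_size):
    """Return (per_row_reads, grouped_reads) for one batch of row indices."""
    per_row = len(row_indices)  # one chunk read for every requested row
    grouped = len(chunks_touched(row_indices, chunk_size))  # each chunk read once
    return per_row, grouped


rng = random.Random(0)
rows = rng.sample(range(100_000), k=256)  # one batch of distinct random rows

for chunk_size in (1, 32, 1024, 16_384):
    per_row, grouped = read_counts(rows, chunk_size)
    print(f"chunk_size={chunk_size:>6}: per-row reads={per_row}, grouped reads={grouped}")
```

In this model the two counts coincide once `chunk_size` reaches 1, since every distinct row is its own chunk. Real timings could still differ (for example if batched requests are issued concurrently), which is why the actual benchmark is needed.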
