Add aggregate statistics for datasets. #397

@yarikoptic

Description

ATM we have

[screenshot: current per-repository statistics]

so we have them only per repository. In principle, while extracting metadata with metalad_core we also extract info about the included subdatasets:

[screenshot: metalad_core extract with subdataset (hasPart) entries]

We would not necessarily have information about that specific version (I guess that is what is in @id), but we could have up-to-date info about that identifier (e.g. `"identifier": "datalad:77e45fa8-6810-4287-a1e9-22bd7225eee7"`) and thus estimate aggregate sizes. It would be very useful to know/show those at the higher level, not unlike what we have on https://datasets.datalad.org/, where the full size of every subdataset is known by aggregating information up the hierarchy. Then for that datasets.datalad.org entry https://registry.datalad.org/overview/?query=url%3A%22https%3A%2F%2Fdatasets.datalad.org%2F.git%22&sort=keys-desc it should show that it is probably about 10 levels in height, over a PB in size, and covers hundreds of millions of files.
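To make the idea concrete, here is a minimal sketch of estimating an aggregate size from a metalad_core-style record. The record shape (`contentbytesize`, `hasPart` entries carrying a `datalad:<uuid>` identifier) and the `lookup_size` helper are assumptions for illustration, not the registry's actual API:

```python
# Hypothetical sketch: add a repository's own size to the sizes of its
# subdatasets, looked up by DataLad dataset ID (the "identifier" field).

def estimate_aggregate_size(record: dict, lookup_size) -> int:
    """Estimate the total size of a dataset plus its subdatasets.

    `record` is assumed to be a metalad_core-style extract with a
    'contentbytesize' and a 'hasPart' list whose entries carry an
    'identifier' like 'datalad:<uuid>'. `lookup_size(uuid)` returns the
    known aggregate size for that dataset ID, or None if unknown.
    """
    total = record.get("contentbytesize", 0)
    for part in record.get("hasPart", []):
        ident = part.get("identifier", "")
        if ident.startswith("datalad:"):
            sub_size = lookup_size(ident.removeprefix("datalad:"))
            if sub_size is not None:
                total += sub_size
    return total
```

Note that looking up by identifier (rather than by @id version) is exactly the "estimate" caveat above: the subdataset size may correspond to a different version than the one actually registered.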

I think such functionality should primarily be triggered by the metalad_core extractor, which would start aggregating the aggregates of those hasPart entries for which aggregates exist. But we could also have some cron jobs that trigger recomputes based on some criteria...

To bootstrap we would need a job that first goes through the leaves -- the entries whose metalad_core metadata has no hasPart at all -- and computes their aggregate (just their own values). It should then go through all entries not yet computed, computing whenever information about some of their aggregates is known.
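The bootstrap pass above could be sketched roughly as follows. The data model (a dict of dataset ID to its own size and subdataset IDs) is hypothetical:

```python
# Leaves-first bootstrap: aggregate datasets with no (known) parts first,
# then repeat passes, aggregating any dataset whose known parts are done.

def bootstrap_aggregates(datasets: dict) -> dict:
    """datasets: id -> {'size': int, 'parts': [ids]} (assumed shape).
    Returns id -> aggregate size, computed leaves-first."""
    agg = {}
    pending = set(datasets)
    while pending:
        progressed = False
        for ds_id in list(pending):
            parts = datasets[ds_id]["parts"]
            # aggregate once every known part has its own aggregate
            if all(p in agg for p in parts if p in datasets):
                agg[ds_id] = datasets[ds_id]["size"] + sum(
                    agg[p] for p in parts if p in datasets
                )
                pending.discard(ds_id)
                progressed = True
        if not progressed:  # cycle or unresolved parts: stop
            break
    return agg
```

Note this naive hierarchical sum counts a subdataset once per parent that references it, which is precisely the duplicate-counting concern raised below.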

In addition to actual aggregated values, we might need extra fields

  • agg_uptodate: bool - true if up to date for the version
  • agg_ratio: float - ratio of aggregates available, with 1.0 meaning all are available and computed
  • agg_precise_ratio: float - for how many we know the aggregates exactly by version (@id) rather than by identifier
  • agg_ts: timestamp of when aggregation was last attempted
  • agg_height: int - max of the agg_heights of the aggregated subdatasets + 1, i.e. how deep the hierarchy below is. For leaves - 0. For something like the
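A hypothetical record holding these extra fields could look like the following; names match the bullets above, the types are my assumptions:

```python
# Sketch of an aggregation-status record; not the registry's actual schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AggStatus:
    agg_uptodate: bool        # aggregates are current for this version
    agg_ratio: float          # fraction of sub-aggregates available (1.0 = all)
    agg_precise_ratio: float  # fraction known exactly by version (@id),
                              # not merely by identifier
    agg_ts: datetime          # when aggregation was last attempted
    agg_height: int           # max subdataset agg_height + 1; 0 for leaves
```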

When a new version of a repo is analyzed and metalad_core metadata is extracted, agg_uptodate is set to False and we trigger aggregation for that repository, which should aggregate and set all of those fields accordingly, with agg_uptodate set to True upon any kind of completion.

Showstopper? There should be a disclaimer that such an aggregate size computation might count the same subdataset multiple times if it is used at multiple nodes in the hierarchy. E.g. it would count ///repronim/containers multiple times across all of the derivatives, and across their raw BIDS datasets too. To account for such duplicates we would need to establish tracking and propagate it to the top level, which might be too heavy to be feasible... Maybe it could be some kind of "graph" aggregation, instead of the "hierarchical" one proposed above, if we create a full graph of relations and then compute across it for every node!
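The "graph" aggregation idea could be sketched as a reachability computation: collect the set of unique dataset IDs reachable from a node and sum each size once, so a shared subdataset like ///repronim/containers is only counted a single time. Same hypothetical data model as above:

```python
# Deduplicated aggregation over the relation graph: each reachable
# dataset contributes its size exactly once, regardless of how many
# parents reference it.

def graph_aggregate(datasets: dict, root: str) -> int:
    """datasets: id -> {'size': int, 'parts': [ids]} (assumed shape).
    Returns the deduplicated total size reachable from `root`."""
    seen, stack = set(), [root]
    while stack:
        ds_id = stack.pop()
        if ds_id in seen or ds_id not in datasets:
            continue
        seen.add(ds_id)
        stack.extend(datasets[ds_id]["parts"])
    return sum(datasets[d]["size"] for d in seen)
```

The trade-off versus the hierarchical scheme: this avoids double counting, but the per-node result can no longer be computed from the children's aggregates alone, so it is heavier to maintain incrementally.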

@candleindark how hard do you think it would be to accomplish (if we forget about the showstopper, or if you research computing across the graph)? I think it might be a nice task overall to guide spec-kit design/implementation, trying it on a smaller collection of "hierarchical" URLs to ensure correct operation before going for full-scale deployment. But I'm not yet sure we should, given the showstopper.

Labels: enhancement (New feature or request)