Description
ATM we have size information only per repository. In principle, while extracting metadata using metalad_core we also extract info about the included subdatasets. We would not necessarily have information about that specific version (I guess that is what is in `@id`), but we could have up-to-date info about that identifier (e.g. `"identifier": "datalad:77e45fa8-6810-4287-a1e9-22bd7225eee7"`) and thus estimate aggregate sizes. It would be very useful to know/show those at the higher level, not unlike https://datasets.datalad.org/ showing the full size for every subdataset by aggregating information up the hierarchy. Then for that datasets.datalad.org entry, https://registry.datalad.org/overview/?query=url%3A%22https%3A%2F%2Fdatasets.datalad.org%2F.git%22&sort=keys-desc should show that it is probably about 10 levels in height, over a PB in size, and covers hundreds of millions of files.
I think such functionality should primarily be triggered when running the metalad_core extractor, which would start aggregating the aggregates of the hasPart entries for which aggregates exist. But we could also have some cron jobs which would trigger recomputes based on some criteria...
To bootstrap we would need a job which would first go through the leaves -- the entries whose metalad_core record has no hasPart at all -- and compute their aggregate (their own size). It should then go through all entries for which the aggregate is not yet computed, and compute it whenever some information about their subdatasets' aggregates is known.
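The leaf-first bootstrap pass could be sketched roughly like this. The `records` structure, field names, and the `bootstrap_aggregates` helper are all made up for illustration, not the actual datalad-registry schema:

```python
def bootstrap_aggregates(records):
    """Compute aggregate sizes bottom-up, starting from the leaves.

    `records` maps a dataset identifier to its own (non-aggregated) size
    and the identifiers of its subdatasets (the hasPart entries).
    """
    agg = {}
    pending = set(records)
    while pending:
        progressed = False
        for ds_id in list(pending):
            parts = records[ds_id].get("hasPart", [])
            # Only subdatasets we have records for can contribute; unknown
            # ones would lower agg_ratio rather than block computation.
            known = [p for p in parts if p in records]
            # Ready once every known subdataset already has an aggregate
            # (leaves have no parts, so they are ready immediately).
            if all(p in agg for p in known):
                agg[ds_id] = records[ds_id]["size"] + sum(agg[p] for p in known)
                pending.discard(ds_id)
                progressed = True
        if not progressed:  # a cycle, or nothing left computable
            break
    return agg


sizes = bootstrap_aggregates({
    "a": {"size": 10, "hasPart": ["b", "c"]},
    "b": {"size": 5, "hasPart": []},
    "c": {"size": 7, "hasPart": ["b"]},  # "b" reused: counted twice in "a"
})
```

Note that `"b"` contributes to both `"c"` and `"a"`, so it is counted twice in `"a"`'s total -- exactly the duplication problem raised in the showstopper below.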
In addition to the actual aggregated values, we might need extra fields:

- `agg_uptodate: bool` -- true if up to date for the version
- `agg_ratio: float` -- ratio of aggregates available, with `1.0` meaning all are available and computed
- `agg_precise_ratio: float` -- for how many we know aggregates exactly to the version (by `@id`) and not just by `identifier`
- `agg_ts: timestamp` -- when aggregation was last attempted
- `agg_height: int` -- max of the `agg_height`s of the subdataset aggregates + 1, i.e. how deep the hierarchy below is. For leaves -- 0. For something like the datasets.datalad.org superdataset it would be around 10.
When a new version of a repo is analyzed and metalad_core is extracted, agg_uptodate is set to False, and we trigger aggregation for that repository, which should aggregate and set all of those fields accordingly, with agg_uptodate set to True upon any type of completion.
Showstopper? There should be a disclaimer that such an aggregate size computation might count the same subdataset multiple times if it is used at multiple nodes in the expanded hierarchy. E.g. it would count ///repronim/containers multiple times across all of the derivatives, and their raw BIDS datasets too. To account for such duplicates we would need to establish tracking and propagate it to the top level, which might be too heavy to be feasible... Maybe it could be more of a "graph" aggregation instead of the "hierarchical" one proposed above: we create a full graph of relations and then compute across it for every node!
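The "graph" alternative would deduplicate by identifier before summing. A minimal sketch, using a made-up `records` mapping (dataset identifier to own size and hasPart identifiers) purely for illustration:

```python
def graph_aggregate(records, root):
    """Sum the size of every dataset reachable from `root`, counting each
    identifier exactly once, so shared subdatasets are not double-counted."""
    seen, stack = set(), [root]
    while stack:
        ds_id = stack.pop()
        if ds_id in seen or ds_id not in records:
            continue  # already counted, or no metadata for this identifier
        seen.add(ds_id)
        stack.extend(records[ds_id].get("hasPart", []))
    return sum(records[d]["size"] for d in seen)


records = {
    "a": {"size": 10, "hasPart": ["b", "c"]},
    "b": {"size": 5, "hasPart": []},
    "c": {"size": 7, "hasPart": ["b"]},  # "b" shared, but counted only once
}
```

Here a plain hierarchical sum for `"a"` would give 27 (counting `"b"` twice), while the graph version gives 22. The trade-off is that this requires walking the reachable set per node instead of reusing the children's precomputed totals.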
@candleindark how hard do you think it would be to accomplish (if we forget about the showstopper; or if you research computing across the graph)? I think it might be a nice task overall to guide a spec-kit design/implementation, and to try on a smaller collection of "hierarchical" URLs to ensure correct operation before going for full-scale deployment. But I am not yet sure we should, given the showstopper.