Add aggregate statistics for datasets. #397

@yarikoptic

Description

ATM we have

[screenshot: current per-repository statistics]

so we have them only per repository. In principle, while extracting metadata with metalad_core we also extract info about the included subdatasets:

[screenshot: metalad_core extract with subdataset (hasPart) entries]

We would not necessarily have information about that specific version (I guess that is what is in @id), but we could have up-to-date info about that identifier (e.g. `"identifier": "datalad:77e45fa8-6810-4287-a1e9-22bd7225eee7"`) and thus estimate aggregate sizes. It would be very useful to know/show those at the higher level, not unlike what we have on https://datasets.datalad.org/, where the full size of every subdataset is known by aggregating information up the hierarchy. Then for that datasets.datalad.org entry https://registry.datalad.org/overview/?query=url%3A%22https%3A%2F%2Fdatasets.datalad.org%2F.git%22&sort=keys-desc it should show that it is probably about 10 levels in height, over a PB in size, and covers hundreds of millions of files.
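To make the idea concrete, here is a minimal sketch of estimating an aggregate size from a metalad_core-style record. The record shape (`contentbytesize`, `hasPart` entries carrying a `datalad:<uuid>` identifier) and the `lookup_size` helper are assumptions for illustration, not the registry's actual API:

```python
# Hypothetical sketch: add a repository's own size to the sizes of its
# subdatasets, looked up by DataLad dataset ID (the "identifier" field).

def estimate_aggregate_size(record: dict, lookup_size) -> int:
    """Estimate the total size of a dataset plus its subdatasets.

    `record` is assumed to be a metalad_core-style extract with a
    'contentbytesize' and a 'hasPart' list whose entries carry an
    'identifier' like 'datalad:<uuid>'. `lookup_size(uuid)` returns the
    known aggregate size for that dataset ID, or None if unknown.
    """
    total = record.get("contentbytesize", 0)
    for part in record.get("hasPart", []):
        ident = part.get("identifier", "")
        if ident.startswith("datalad:"):
            sub_size = lookup_size(ident.removeprefix("datalad:"))
            if sub_size is not None:
                total += sub_size
    return total
```

Note that looking up by identifier (rather than by @id version) is exactly the "estimate" caveat above: the subdataset size may correspond to a different version than the one actually registered.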

I think such functionality should primarily be triggered by the metalad_core extractor, which would start aggregating the aggregates of those hasPart entries for which aggregates exist. But we could also have some cron jobs that trigger recomputes based on some criteria...

To bootstrap we would need a job that first goes through the leaves -- the entries whose metalad_core metadata has no hasPart at all -- and computes their aggregate (just their own values). It should then go through all entries not yet computed, computing whenever information about some of their aggregates is known.
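The bootstrap pass above could be sketched roughly as follows. The data model (a dict of dataset ID to its own size and subdataset IDs) is hypothetical:

```python
# Leaves-first bootstrap: aggregate datasets with no (known) parts first,
# then repeat passes, aggregating any dataset whose known parts are done.

def bootstrap_aggregates(datasets: dict) -> dict:
    """datasets: id -> {'size': int, 'parts': [ids]} (assumed shape).
    Returns id -> aggregate size, computed leaves-first."""
    agg = {}
    pending = set(datasets)
    while pending:
        progressed = False
        for ds_id in list(pending):
            parts = datasets[ds_id]["parts"]
            # aggregate once every known part has its own aggregate
            if all(p in agg for p in parts if p in datasets):
                agg[ds_id] = datasets[ds_id]["size"] + sum(
                    agg[p] for p in parts if p in datasets
                )
                pending.discard(ds_id)
                progressed = True
        if not progressed:  # cycle or unresolved parts: stop
            break
    return agg
```

Note this naive hierarchical sum counts a subdataset once per parent that references it, which is precisely the duplicate-counting concern raised below.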

In addition to actual aggregated values, we might need extra fields

  • agg_uptodate: bool - true if up to date for the version
  • agg_ratio: float - ratio of aggregates available, with 1.0 meaning all are available and computed
  • agg_precise_ratio: float - for how many we know the aggregates exactly by version (@id) rather than by identifier
  • agg_ts: timestamp of when aggregation was last attempted
  • agg_height: int - max of the agg_heights of the aggregated subdatasets + 1, i.e. how deep the hierarchy below is. For leaves - 0. For something like the
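A hypothetical record holding these extra fields could look like the following; names match the bullets above, the types are my assumptions:

```python
# Sketch of an aggregation-status record; not the registry's actual schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AggStatus:
    agg_uptodate: bool        # aggregates are current for this version
    agg_ratio: float          # fraction of sub-aggregates available (1.0 = all)
    agg_precise_ratio: float  # fraction known exactly by version (@id),
                              # not merely by identifier
    agg_ts: datetime          # when aggregation was last attempted
    agg_height: int           # max subdataset agg_height + 1; 0 for leaves
```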

When a new version of a repo is analyzed and metalad_core metadata is extracted, agg_uptodate is set to False and we trigger aggregation for that repository, which should aggregate and set all of those fields accordingly, with agg_uptodate set to True upon any kind of completion.

Showstopper? There should be a disclaimer that such an aggregate size computation might count the same subdataset multiple times if it is used at multiple nodes in the hierarchy. E.g. it would count ///repronim/containers multiple times across all of the derivatives, and across their raw BIDS datasets too. To account for such duplicates we would need to establish tracking and propagate it to the top level, which might be too heavy to be feasible... Maybe it could be some kind of "graph" aggregation, instead of the "hierarchical" one proposed above, if we create a full graph of relations and then compute across it for every node!
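The "graph" aggregation idea could be sketched as a reachability computation: collect the set of unique dataset IDs reachable from a node and sum each size once, so a shared subdataset like ///repronim/containers is only counted a single time. Same hypothetical data model as above:

```python
# Deduplicated aggregation over the relation graph: each reachable
# dataset contributes its size exactly once, regardless of how many
# parents reference it.

def graph_aggregate(datasets: dict, root: str) -> int:
    """datasets: id -> {'size': int, 'parts': [ids]} (assumed shape).
    Returns the deduplicated total size reachable from `root`."""
    seen, stack = set(), [root]
    while stack:
        ds_id = stack.pop()
        if ds_id in seen or ds_id not in datasets:
            continue
        seen.add(ds_id)
        stack.extend(datasets[ds_id]["parts"])
    return sum(datasets[d]["size"] for d in seen)
```

The trade-off versus the hierarchical scheme: this avoids double counting, but the per-node result can no longer be computed from the children's aggregates alone, so it is heavier to maintain incrementally.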

@candleindark how hard do you think it would be to accomplish (if we forget about the showstopper, or if you research computing across the graph)? I think it might be a nice task overall to guide spec-kit design/implementation, trying it on a smaller collection of "hierarchical" URLs to ensure correct operation before going for full-scale deployment. But I'm not yet sure we should, given the showstopper.

Labels: enhancement (New feature or request)