Support ingestion of new metadata source format

#482 will provide the specification for the new source format. We then need a new set of tools/scripts that ingest metadata deposited in that format and output records compliant with the `datalad-catalog` schema, i.e. ready to be added with `datalad catalog-add`. The new specification will allow multiple metadata files/formats per dataset-version, and the tools need to account for this.

`datalad-catalog`, the SFB1451 catalog, and the ABCD-J catalog all have existing functionality that contributes in some way to a similar goal. It is worth investigating these to see which parts can be reused.

### Extractors

An extractor understands a particular metadata format (e.g. `datacite.yml`), reads a metadata file in that format, extracts the information, and outputs it (usually) as JSON, often via `datalad-metalad`.

Existing examples include:
- extractors shipped with and dependent on `datalad-metalad` (via `datalad meta-extract`): https://github.com/datalad/datalad-metalad/tree/master/datalad_metalad/extractors; most often used are `metalad-studyminimeta` and `metalad-core` (dataset and file-level)
- metalad-compatible extractors used in SFB1451: https://github.com/mslw/datalad-wackyextra/tree/main/datalad_wackyextra/extractors (including cff and citations)
- metalad-compatible extractors used for BIDS datasets: https://github.com/datalad/datalad-neuroimaging/blob/master/datalad_neuroimaging/extractors/bids_dataset.py
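To make the extractor role concrete, here is a minimal stdlib-only sketch. It is not the `datalad-metalad` API: the function name, the `example_json` extractor name, and the envelope field names are assumptions chosen for illustration, and the input is a hypothetical JSON metadata file rather than a real format like `datacite.yml`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def extract_json_metadata(path, dataset_id, dataset_version):
    """Read one metadata file and wrap its content in a generic record.

    Hypothetical helper: the envelope keys below are assumptions,
    not a guaranteed match for datalad-metalad output.
    """
    # Parse the raw metadata file (assumed to be JSON for this sketch)
    content = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        "type": "dataset",
        "dataset_id": dataset_id,
        "dataset_version": dataset_version,
        "extractor_name": "example_json",  # hypothetical extractor name
        "extractor_version": "0.0.1",
        "extraction_time": datetime.now(timezone.utc).isoformat(),
        # The format-specific payload is passed through untouched
        "extracted_metadata": content,
    }
```

The point is only the shape: the format-specific content stays untouched inside an envelope that identifies the dataset-version and the extractor that produced it.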

### Translators

Translators take `datalad-metalad` output and translate it into a format compatible with the `datalad-catalog` schema. They inherit from a base translator class and, for the purposes of the `datalad catalog-translate` command, use a common procedure for matching a specific translator to a specific metadata record. Some translators use `jq` bindings to do the translation; others use pure Python. Examples:

- translators shipped with `datalad-catalog`: https://github.com/datalad/datalad-catalog/tree/main/datalad_catalog/translators (including translators for `core`, `studyminimeta`, `bids_dataset`, `datacite_gin`, all or most based on jq)
- translators used in SFB1451: https://github.com/mslw/datalad-wackyextra/tree/main/datalad_wackyextra/translators (including cff and citations, and python-based translation for those shipped with `datalad-catalog`)
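The match-then-translate pattern described above could look roughly like the following. This is a self-contained sketch and not the actual `datalad-catalog` base class: `TranslatorSketch`, `ExampleJsonTranslator`, `find_translator`, the `example_json` source name, and the input/output field names are all hypothetical.

```python
from abc import ABC, abstractmethod


class TranslatorSketch(ABC):
    """Hypothetical base class; not the real datalad-catalog class."""

    @classmethod
    @abstractmethod
    def match(cls, source_name, source_version):
        """Return True if this translator handles the given source."""

    @abstractmethod
    def translate(self, record):
        """Return a catalog-style dict built from an extracted record."""


class ExampleJsonTranslator(TranslatorSketch):
    @classmethod
    def match(cls, source_name, source_version):
        # Match on extractor name only; a real matcher may check versions too
        return source_name == "example_json"

    def translate(self, record):
        meta = record["extracted_metadata"]
        # Output keys are assumed to resemble the catalog schema
        return {
            "type": "dataset",
            "dataset_id": record["dataset_id"],
            "dataset_version": record["dataset_version"],
            "name": meta.get("title", ""),
            "description": meta.get("description", ""),
        }


def find_translator(record, translators):
    """Return an instance of the first translator that matches the record."""
    for cls in translators:
        if cls.match(record["extractor_name"], record["extractor_version"]):
            return cls()
    raise LookupError(f"no translator for {record['extractor_name']!r}")
```

Pure-Python translation is used here only to keep the sketch dependency-free; a `jq`-based translator would swap the body of `translate` for a `jq` program.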

### Standalone extraction+translation scripts

Some standalone scripts have been created to be independent of both `datalad-metalad` extraction functionality and the `datalad-catalog` translation functionality. These are used in the ABCD-J catalog pipeline, specifically:
- https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/extractors/catalog_core.py
- https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/extractors/catalog_runprov.py

### `datalad-catalog` helpers

`datalad-catalog` ships with functions that help to construct catalog-ready records. They are located in https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/schema_utils.py, and are used by both the SFB1451 and ABCD-J catalog pipelines when constructing/translating records to be added to a catalog.

---

From my POV:
We should have a set of tools that make the creation of catalog-ready records from "raw" metadata formats as simple as possible. The concept of a "reader" was conceived in previous discussions, with the idea that there would be a reader for each metadata format deposited per dataset-version in the source specification. The reader would do everything necessary to get from the ingestion state to the catalog-ready state. It would be supported by helper functionality living inside `datalad-catalog`, and would not depend on external packages/extensions. This is pretty much what the standalone scripts mentioned above do, but perhaps with some wrapper functionality that gives them a common interface?
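As a starting point for discussion, a common reader interface might be as small as a registry keyed by file name. Everything in this sketch is hypothetical (the `READERS` registry, the `reader` decorator, the `example.json` file name, and the output keys); it only illustrates how several metadata files per dataset-version could be folded into one record without external dependencies.

```python
import json
from pathlib import Path

# Hypothetical registry mapping a metadata file name to its reader function
READERS = {}


def reader(filename):
    """Decorator that registers a reader for one metadata file name."""
    def register(func):
        READERS[filename] = func
        return func
    return register


@reader("example.json")  # hypothetical metadata file name
def read_example_json(path):
    content = json.loads(path.read_text(encoding="utf-8"))
    # Return only the record fields this format can contribute
    return {"name": content.get("title", "")}


def read_dataset_version(directory, dataset_id, dataset_version):
    """Merge the output of all known readers into one dataset record."""
    record = {
        "type": "dataset",
        "dataset_id": dataset_id,
        "dataset_version": dataset_version,
    }
    for path in sorted(Path(directory).iterdir()):
        func = READERS.get(path.name)
        if func is None:
            continue  # no reader registered for this file; skip it
        # Later files overwrite earlier ones on key conflicts
        record.update(func(path))
    return record
```

How conflicts between files should be resolved (overwrite, merge, or report) is left open here and would need to follow whatever the source specification decides.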

@mih @mslw curious to hear your thoughts here. (@mslw please also add updates if I missed or misrepresented any relevant functionality from the SFB1451 pipeline)






