Description
#482 will provide the new source format specification. We then need a new set of tools/scripts to ingest metadata deposited in that format and output it in a format compliant with the datalad-catalog schema, i.e. ready to be datalad catalog-added. The new specification will allow multiple files/formats of metadata per dataset-version, and the tools need to account for this.
datalad-catalog, the SFB1451 catalog, and the ABCD-J catalog all have existing functionality that in some way contribute to achieving a similar goal. It is worth investigating these to see which parts can be reused.
Extractors
An extractor understands a particular metadata format (e.g. datacite.yml), reads such a metadata file, extracts the information, and outputs it (usually) in JSON format, often via datalad-metalad.
Existing examples include:
- extractors shipped with and dependent on `datalad-metalad` (via `datalad meta-extract`): https://github.com/datalad/datalad-metalad/tree/master/datalad_metalad/extractors; most often used are `metalad-studyminimeta` and `metalad-core` (dataset- and file-level)
- metalad-compatible extractors used in SFB1451: https://github.com/mslw/datalad-wackyextra/tree/main/datalad_wackyextra/extractors (including cff and citations)
- metalad-compatible extractors used for BIDS datasets: https://github.com/datalad/datalad-neuroimaging/blob/master/datalad_neuroimaging/extractors/bids_dataset.py
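As a rough sketch of what an extractor does, independent of any particular framework (function and field names below are illustrative, not the actual metalad extractor API):

```python
import json

# Hypothetical minimal extractor sketch: takes an already-parsed
# datacite.yml mapping and emits a JSON metadata record. Field names
# ("extractor_name", "extracted_metadata") are illustrative only.
def extract_datacite(metadata: dict) -> str:
    record = {
        "type": "dataset",
        "extractor_name": "datacite_sketch",  # illustrative name
        "extracted_metadata": {
            "title": metadata.get("title"),
            "authors": metadata.get("authors", []),
            "description": metadata.get("description"),
        },
    }
    return json.dumps(record)

# A parsed datacite.yml would look roughly like this:
raw = {"title": "Demo dataset", "authors": [{"name": "Doe, J."}]}
print(extract_datacite(raw))
```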
Translators
Translators take datalad-metalad output and translate it into a datalad-catalog-schema compatible format. They inherit from a base translator class and, for the purposes of the datalad catalog-translate method, use a common procedure for matching a specific translator to a specific metadata record. Some translators use jq bindings to do the translation; others use pure Python. Examples:
- translators shipped with `datalad-catalog`: https://github.com/datalad/datalad-catalog/tree/main/datalad_catalog/translators (including translators for `core`, `studyminimeta`, `bids_dataset`, `datacite_gin`, all or most based on jq)
- translators used in SFB1451: https://github.com/mslw/datalad-wackyextra/tree/main/datalad_wackyextra/translators (including cff and citations, and python-based translation for those shipped with `datalad-catalog`)
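The match-then-translate pattern could be sketched in pure Python like this (class and method names are assumptions for illustration, not the actual datalad-catalog base translator API):

```python
# Illustrative pure-Python translator sketch. A real translator would
# match on extractor name/version and follow the catalog schema exactly;
# the field names here are placeholders.
class DataciteTranslator:
    @classmethod
    def match(cls, record: dict) -> bool:
        # decide whether this translator handles the given metadata record
        return record.get("extractor_name") == "datacite_sketch"

    def translate(self, record: dict) -> dict:
        src = record.get("extracted_metadata", {})
        return {
            "type": "dataset",
            "name": src.get("title"),
            "authors": src.get("authors", []),
        }

# Common procedure: try each known translator until one matches
record = {
    "extractor_name": "datacite_sketch",
    "extracted_metadata": {"title": "Demo", "authors": []},
}
translators = [DataciteTranslator]
matched = next(t for t in translators if t.match(record))
print(matched().translate(record))
```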
Standalone extraction+translation scripts
Some standalone scripts have been created to be independent of both datalad-metalad extraction functionality and the datalad-catalog translation functionality. These are used in the ABCD-J catalog pipeline, specifically:
- https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/extractors/catalog_core.py
- https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/extractors/catalog_runprov.py
datalad-catalog helpers
datalad-catalog ships with functions that help to construct catalog-ready records. They are located in https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/schema_utils.py, and used by both SFB1451 and ABCD-J catalog pipelines when constructing/translating records to be added to a catalog.
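For illustration, a catalog-ready dataset record assembled by hand looks roughly like the sketch below; the actual helper names in schema_utils differ, and the required fields come from the datalad-catalog schema (the layout here is an assumption):

```python
import json
import uuid

# Hand-rolled sketch of a minimal catalog-ready dataset record; in
# practice the schema_utils helpers fill in schema-mandated fields.
def make_dataset_record(dataset_id: str, dataset_version: str, name: str) -> dict:
    return {
        "type": "dataset",
        "dataset_id": dataset_id,
        "dataset_version": dataset_version,
        "name": name,
    }

rec = make_dataset_record(str(uuid.uuid4()), "abc123", "Demo dataset")
print(json.dumps(rec))
```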
From my POV:
We should have a set of tools that make the creation of catalog-ready records from "raw" metadata formats as simple as possible. The concept of a "reader" was conceived in previous discussions, with the idea that there would be a reader for each metadata format deposited per dataset-version in the source specification. The reader would do everything necessary to get from the ingested state to the catalog-ready state; it would be supported by helper functionality living inside datalad-catalog, and would not depend on external packages/extensions. Pretty much what the abovementioned standalone scripts do, but perhaps with some wrapper functionality that makes it a common interface?
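To make the "reader" idea concrete, a common interface could look something like this design sketch (none of this exists in datalad-catalog; all names are hypothetical):

```python
from abc import ABC, abstractmethod
from pathlib import Path

# Hypothetical reader interface: one reader per deposited metadata
# format, each going all the way from ingested file to catalog-ready
# record, with a shared dispatch mechanism.
class MetadataReader(ABC):
    @classmethod
    @abstractmethod
    def supports(cls, path: Path) -> bool:
        """Can this reader handle the deposited file?"""

    @abstractmethod
    def read(self, path: Path) -> dict:
        """Return a catalog-ready record for the given file."""

def get_reader(path: Path, readers: list) -> MetadataReader:
    """Common interface: pick the first reader that supports the file."""
    for r in readers:
        if r.supports(path):
            return r()
    raise ValueError(f"no reader for {path}")

# Toy reader to demonstrate dispatch
class DataciteReader(MetadataReader):
    @classmethod
    def supports(cls, path: Path) -> bool:
        return path.name == "datacite.yml"

    def read(self, path: Path) -> dict:
        return {"type": "dataset", "name": path.stem}

reader = get_reader(Path("datacite.yml"), [DataciteReader])
print(type(reader).__name__)
```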
@mih @mslw curious to hear your thoughts here. (@mslw please also add updates if I missed or misrepresented any relevant functionality from the SFB1451 pipeline)