Support ingestion of new metadata source format #483

@jsheunis

#482 will provide the new source format specification. We then need a new set of tools/scripts that ingest metadata deposited in that format and output it in a format compliant with the datalad-catalog schema, i.e. ready to be added with datalad catalog-add. The new specification will allow multiple metadata files/formats per dataset-version, and the tools need to account for this.

datalad-catalog, the SFB1451 catalog, and the ABCD-J catalog all have existing functionality that in some way contributes to achieving a similar goal. It is worth investigating these to see which parts can be reused.

Extractors

An extractor understands a particular metadata format (e.g. datacite.yml), reads such a metadata file, extracts the information, and outputs it, usually in JSON format and often via datalad-metalad.

Existing examples include:

Translators

Translators take datalad-metalad output and translate it into a format compatible with the datalad-catalog schema. They inherit from a base translator class and, for the purposes of the datalad catalog-translate command, use a common procedure for matching a specific translator to a specific metadata record. Some translators use jq bindings to do the translation; others use pure Python. Examples:
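To make the matching idea concrete, here is a minimal pure-Python sketch. It is not the actual datalad-catalog base class: the class names, the `match`/`translate` method names, and the "datacite_sketch" extractor name are all hypothetical, and the record keys are only assumed to resemble datalad-metalad output.

```python
from abc import ABC, abstractmethod


class SketchTranslator(ABC):
    """Hypothetical base class; not the real datalad-catalog translator API."""

    @abstractmethod
    def match(self, source_name: str, source_version: str) -> bool:
        """Return True if this translator can handle the given source."""

    @abstractmethod
    def translate(self, record: dict) -> dict:
        """Turn one extracted record into a catalog-style record."""


class DataciteSketchTranslator(SketchTranslator):
    def match(self, source_name, source_version):
        # Assumption: translators are selected by extractor name only.
        return source_name == "datacite_sketch"

    def translate(self, record):
        extracted = record["extracted_metadata"]
        # Output field names are placeholders, not a validated schema record.
        return {
            "type": record["type"],
            "dataset_id": record["dataset_id"],
            "dataset_version": record["dataset_version"],
            "name": extracted.get("title", ""),
        }


def translate_record(record, translators):
    """Apply the first translator whose match() accepts the record."""
    for translator in translators:
        if translator.match(record["extractor_name"], record["extractor_version"]):
            return translator.translate(record)
    raise LookupError(f"no translator for {record['extractor_name']!r}")
```

The point of the sketch is only the shape: every translator answers "is this record mine?" and the calling code stays the same regardless of whether a translator uses jq or pure Python internally.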

Standalone extraction+translation scripts

Some standalone scripts have been created to be independent of both datalad-metalad extraction functionality and the datalad-catalog translation functionality. These are used in the ABCD-J catalog pipeline, specifically:

datalad-catalog helpers

datalad-catalog ships with functions that help to construct catalog-ready records. They are located in https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/schema_utils.py and are used by both the SFB1451 and ABCD-J catalog pipelines when constructing/translating records to be added to a catalog.


From my POV:
We should have a set of tools that make the creation of catalog-ready records from "raw" metadata formats as simple as possible. The concept of a "reader" was conceived in previous discussions, with the idea that there would be a reader for each metadata format deposited per dataset-version in the source specification. The reader would do all that's necessary to get from the ingestion state to the catalog-ready state. It would be supported by helper functionality living inside datalad-catalog and would not depend on external packages/extensions. This is pretty much what the above-mentioned standalone scripts do, but perhaps with some wrapper functionality that gives them a common interface?
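A rough sketch of what such a common reader interface could look like, purely for discussion. Everything here is an assumption: the registry, the `reader` decorator, the `DatasetVersion` type, the "basic.json" file name, and the output field names are hypothetical and do not reflect the #482 specification or existing datalad-catalog code.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict


@dataclass(frozen=True)
class DatasetVersion:
    """Identifies the dataset-version a metadata file was deposited for."""
    dataset_id: str
    dataset_version: str


# Hypothetical registry: metadata file name -> reader function.
READERS: Dict[str, Callable[[Path, DatasetVersion], dict]] = {}


def reader(filename: str):
    """Register a function as the reader for one metadata file name."""
    def register(func):
        READERS[filename] = func
        return func
    return register


@reader("basic.json")
def read_basic_json(path: Path, dsv: DatasetVersion) -> dict:
    raw = json.loads(path.read_text())
    # Field names are placeholders, not a validated catalog record.
    return {
        "type": "dataset",
        "dataset_id": dsv.dataset_id,
        "dataset_version": dsv.dataset_version,
        "name": raw.get("title", ""),
        "description": raw.get("description", ""),
    }


def read_dataset_version(directory, dsv: DatasetVersion) -> list:
    """Run every registered reader on the files deposited for one dataset-version."""
    records = []
    for path in sorted(Path(directory).iterdir()):
        func = READERS.get(path.name)
        if func is None:
            continue  # no reader registered for this file; skip it
        records.append(func(path, dsv))
    return records
```

With something like this, supporting multiple metadata files per dataset-version comes down to registering one reader per format, and the wrapper collects one record per recognized file.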

@mih @mslw curious to hear your thoughts here. (@mslw please also add updates if I missed or misrepresented any relevant functionality from the SFB1451 pipeline)
