Directories ingestion #173

@MatthMig

An older facility may have millions of files that could potentially be ingested by SciCat, which raises several obstacles to adopting SciCat there. One major feature that would ease adoption is the ability to ingest directories. For that we would need strong selectors, and possibly a way to apply multiple schema files to the same data file. The goal is to make the ingestor easier for the facility to maintain.
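To make the selector idea concrete, here is a minimal sketch of how a directory walk could map each file to one or more schema files. The names `SelectorRule`, `match_schemas`, and `select_files`, as well as the glob-based matching, are assumptions made for illustration and are not taken from the existing ingestor.

```python
import fnmatch
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectorRule:
    """Hypothetical rule: files matching `pattern` are processed with `schema`."""
    pattern: str  # glob pattern, matched against the path relative to the root
    schema: str   # name of a schema file (illustrative only)


def match_schemas(rel_path, rules):
    """Return the schemas of every rule whose pattern matches rel_path, in rule order."""
    return [r.schema for r in rules if fnmatch.fnmatch(rel_path, r.pattern)]


def select_files(root, rules):
    """Walk `root` and return {relative_path: [schema, ...]} for each matched file."""
    selected = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Use forward slashes so patterns look the same on every platform.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            schemas = match_schemas(rel, rules)
            if schemas:  # files matched by no rule are left out
                selected[rel] = schemas
    return selected
```

Because every matching rule contributes its schema, a single data file can be associated with several schema files. Note that `fnmatch` lets `*` match across `/`, so a pattern like `*.h5` selects files at any depth.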

Directory ingestion is a complex feature. I have done a rough implementation in these commits:

For OrigDataBlocks:

For selector:

It lacks several major features and use cases. Since my ingestor takes one month to ingest 4 years of data, we need a strong mechanism to keep the connection to the SciCat backend reliable. We should also heavily optimize the whole thing to reduce run time. It would also be better to have a separate service that checks what has and has not been ingested, because with millions of files it is impossible to know whether some files were skipped by mistake.
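As an illustration of the two needs above, the following sketch shows one possible way to retry a backend call with exponential backoff, plus a simple reconciliation check between files on disk and files known to the catalogue. The function names, the choice of `ConnectionError`, and the default values are assumptions. A real ingestor would plug in its own HTTP client and its own way of listing ingested paths.

```python
import time


def call_with_retries(func, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on ConnectionError wait base_delay * 2**attempt, then retry.

    The last failure is re-raised so the caller can record the file as not ingested.
    """
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def find_missing(files_on_disk, ingested_paths):
    """Return the sorted paths present on disk but absent from the catalogue."""
    return sorted(set(files_on_disk) - set(ingested_paths))
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting. The set difference in `find_missing` stays cheap even for millions of paths, as long as both listings fit in memory.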

I have also run into a problem with efficiently adding new files to the dataFileList of an OrigDataBlock. An OrigDataBlock can sometimes reach thousands of files, so the update can be slow unless there is an endpoint that lets the backend insert the new files and perform the appropriate checks.
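The kind of check such an endpoint might perform can be sketched as follows: append only the files whose path is not already in the block's file list, and keep the total size in sync. This is a hypothetical illustration. The function name `append_files` and the dictionary layout (`dataFileList`, `path`, `size`) are assumptions about the data shape, not the backend's actual implementation.

```python
def append_files(datablock, new_files):
    """Append entries to datablock["dataFileList"], skipping paths already present.

    Returns the number of files actually added. The dict layout is assumed.
    """
    # Build the set once so each duplicate check is O(1), not a list scan.
    known = {f["path"] for f in datablock["dataFileList"]}
    added = 0
    for f in new_files:
        if f["path"] in known:
            continue  # already recorded, do not duplicate
        datablock["dataFileList"].append(f)
        known.add(f["path"])
        datablock["size"] += f["size"]  # keep the block's total size consistent
        added += 1
    return added
```

Doing this on the backend would let the ingestor send only the new entries instead of re-uploading a list of thousands of files for each addition.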
