Directories ingestion

In an old facility, we can have millions of files that could potentially be ingested by SciCat, which leads to several issues when a facility tries to adopt SciCat. One big feature that would ease adoption is the possibility to ingest directories. For that we need strong selectors, and maybe also a way to apply multiple schema files to the same data file. The purpose is to make the ingestor easier to maintain for the facility.
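To make the selector idea concrete, here is a minimal sketch of what I have in mind, not the current ingestor behaviour. It assumes selectors are plain glob patterns and that one file may match several schema files; the `SELECTORS` table, the schema file names and `schemas_for` are all hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical selector table: (glob pattern, schema file name).
# Order matters only for the order of the returned list.
SELECTORS = [
    ("*/raw/*.hdf", "raw_schema.json"),
    ("*.hdf", "common_schema.json"),
    ("*/logs/*.txt", "logs_schema.json"),
]


def schemas_for(path: str, selectors=SELECTORS) -> list[str]:
    """Return every schema whose pattern matches the path, in table order."""
    posix = PurePosixPath(path).as_posix()
    # fnmatchcase is case-sensitive on every platform; note that "*" also
    # matches "/" here, so "*.hdf" matches files in any subdirectory.
    return [schema for pattern, schema in selectors if fnmatchcase(posix, pattern)]
```

With this table, `schemas_for("2021/run1/raw/a.hdf")` returns both the raw and the common schema, which is the "multiple schema files on the same data file" case.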

Directory ingestion is a complex feature. I have done a quick and dirty implementation in these commits:
- https://github.com/SciCatProject/scicat-ingestor/pull/141/commits/b60107b8a4109434d44641407d1f838480b4e3f1
- https://github.com/SciCatProject/scicat-ingestor/pull/141/commits/eaebfd45448c00e79ba72f94c96ce9065d6c5d2c

For OrigDataBlocks:
- https://github.com/SciCatProject/scicat-ingestor/pull/141/commits/e72aeea1f636def75634156b862695b0293acdbf
- https://github.com/SciCatProject/scicat-ingestor/pull/141/commits/86458eae45b7c14b9d37b9685a9488805b79d674

For selector:
- Related issue: https://github.com/SciCatProject/scicat-ingestor/issues/172
- https://github.com/SciCatProject/scicat-ingestor/pull/141/commits/349d745a10013166ec75b779f2f9705cd589ae01

It lacks several major features and use cases. As my ingestor takes one month to ingest 4 years of data, we need a strong mechanism to keep the connection with the SciCat backend reliable. We should also heavily optimize the whole thing to reduce run time. It would also be better to have another service that checks what has been ingested and what has not, because with millions of files it is impossible to know whether some files were skipped by error.
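As a rough sketch of the two mechanisms mentioned above, assuming a backend call that raises `ConnectionError` when the connection drops and a list of already ingested relative paths obtained from somewhere (for example a backend query): `with_retries` and `find_missing` are hypothetical helpers, not part of scicat-ingestor.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")


def find_missing(root: Path, ingested: Iterable[str]) -> list[str]:
    """List files under `root` whose relative path is not in `ingested`."""
    known = set(ingested)
    on_disk = (p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
    return sorted(path for path in on_disk if path not in known)
```

The check is a plain set difference, so it stays cheap even with millions of paths, as long as the list of ingested paths can be fetched in bulk.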

I have also run into an issue with OrigDataBlock when trying to efficiently add new files to dataFileList: an OrigDataBlock can sometimes reach thousands of files, so the update can take long unless there is an endpoint that lets the backend insert the new files and do the appropriate checks.
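A client-side sketch of the kind of check I mean, assuming each dataFileList entry is a dict with at least a `path` key: `new_entries` and `batches` are hypothetical helpers, and how each batch would be sent to the backend is left open.

```python
from itertools import islice
from typing import Iterable, Iterator


def new_entries(existing: Iterable[dict], candidates: Iterable[dict]) -> list[dict]:
    """Keep candidates whose path is not already listed, dropping duplicates."""
    seen = {entry["path"] for entry in existing}
    fresh = []
    for entry in candidates:
        if entry["path"] not in seen:
            seen.add(entry["path"])
            fresh.append(entry)
    return fresh


def batches(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

Looking paths up in a set avoids rescanning the whole list for every new file, and sending the result in batches keeps each request small. A backend endpoint doing the same check would still be preferable, since it would not require downloading the full list first.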

Directories ingestion #173
