
Conversation

@MatthMig

Description

To ingest the hundreds of TB of ILL data, I needed to make several changes.
Some changes in the code are specific to ILL and should probably be handled through advanced configuration schemes.

I've also Dockerized the application and created a corresponding service in our local SciCatLive repo.

Changes

  • Dockerization of the application
  • Addition of temporary SciCat login and logout endpoints to ease long ingestion sessions; we've also added the associated functions
  • Implementation of a few new selector keywords (see the sketch after this list):
    • contains, which checks whether an element contains a substring
    • dirname, which checks whether one of the following properties applies to one of the individual directory names that make up the file path
    • possibility of applying multiple schemes to a file; this enables factoring by having one scheme file for all instruments and another specific to an instrument, with both applying to the instrument concerned, which allows better factorization of scheme files
    • better `or` and `and` selector management, with the possibility of nesting them
    • possibility for a path entry to hold several candidate paths; sometimes the same variable is not stored under the same path across our files, and there is no logic we can use to locate it reliably, so we need to try each path
  • possibility of patching datasets to add new files; we need an endpoint to append new files to a dataset, because currently the append is done by fetching the whole list of files and re-uploading it with the addition, which scales poorly as datasets grow
  • patching of origdatablocks
  • automatic instrument creation in the database
  • automatic sample creation in the database
  • automatic proposal creation in the database
  • datafile existence check
  • instrument lookup by name
  • proposal lookup by ID
  • sample lookup by ID
  • dataset lookup by sample ID
  • possibility to ingest whole directories, instead of having to do one run per file
  • addition of unit retrieval
  • addition of a function to handle ILL special cases; this should later be done through configuration files
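
To make the new selector keywords concrete, here is a minimal sketch of how such selectors could be evaluated. The `matches` function and the dictionary layout of the selectors are hypothetical illustrations, not the ingestor's actual API:

```python
from pathlib import PurePosixPath
from typing import Any


def matches(selector: dict[str, Any], value: str) -> bool:
    """Evaluate a selector against a value, supporting 'contains',
    'dirname', and nested 'and'/'or' (the layout here is assumed)."""
    if "and" in selector:
        return all(matches(s, value) for s in selector["and"])
    if "or" in selector:
        return any(matches(s, value) for s in selector["or"])
    if "contains" in selector:
        return selector["contains"] in value
    if "dirname" in selector:
        # Apply the nested selector to each directory name in the path.
        parts = PurePosixPath(value).parent.parts
        return any(matches(selector["dirname"], part) for part in parts)
    raise ValueError(f"Unknown selector: {selector}")


# Example: match files that have 'raw' in a directory name and '.nxs' in the path.
sel = {"and": [{"dirname": {"contains": "raw"}}, {"contains": ".nxs"}]}
print(matches(sel, "/data/instrumentA/raw/exp0001.nxs"))  # True
```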

I know these changes are numerous and breaking. They are changes that ILL required to be made to SciCat, and I implemented them my way, but I'm now looking into how we can adapt these functionalities into something properly mergeable into the core.

@YooSunYoung
Contributor

YooSunYoung commented Apr 28, 2025

@MatthMig
Thank you for the PR! Sorry for the late reply.

@nitrosx and I will try to support your needs...!

I heard you are going to have a meeting about this rather soon.
I will post some questions here in the meantime.

Login/Logout

It seems like you need an ingestor that can process multiple files at once.
Is it something that you want to run regularly...?
Or was it just a one-time thing that you needed to do for existing datasets?

The question is mostly because of the authentication.
We typically don't want to have passwords in configuration files.
If it's running regularly, it should use a long-life token instead.
If it's just a quick one-time run with username and password, then it'd be better for the ingestor to ask for the username and password via a prompt and delete them from memory as soon as it no longer needs them.

It wouldn't be too much work, but we can also just use the token from the SciCat frontend, so I'm not sure it's worth implementing and maintaining.
Is anything keeping the ILL ingestor from using a token?
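
For reference, a minimal sketch of the prompt-based approach. It assumes a SciCat local-auth endpoint at /api/v3/auth/login and an access_token field in the response; both should be checked against the backend version in use:

```python
import getpass

import requests


def fetch_token(base_url: str) -> str:
    """Prompt for credentials, exchange them for a token, and discard them.
    Endpoint path and response field name are assumptions."""
    username = input("SciCat username: ")
    password = getpass.getpass("SciCat password: ")
    try:
        response = requests.post(
            f"{base_url}/api/v3/auth/login",
            json={"username": username, "password": password},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["access_token"]
    finally:
        del username, password  # do not keep credentials in memory
```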

Deployment

I see the Dockerfile and configuration files for deployment as well.
We typically store them in a separate repository, as they can contain institute-specific values.
They should also be accessible from the deployment tools.
What kind of deployment/CI-CD tool do you use...?
For example, do you prefer GitHub or GitLab...?

I can make a template repository for either platform that also has CI tests to validate the configuration files.

Comment on lines +62 to +66
```python
try:
    # Try custom format "dd-MMM-yy HH:mm:ss"
    return datetime.datetime.strptime(value, "%d-%b-%y %H:%M:%S").replace(tzinfo=datetime.UTC).isoformat()
except ValueError:
    return None
```
Contributor

I'll make it configurable so that we don't have to hard-code it.
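
For example, a minimal sketch of what the configurable version might look like, assuming the format strings come from an ingestor config entry (the DATETIME_FORMATS name is hypothetical):

```python
import datetime

# Hypothetical config entry listing the accepted timestamp formats.
DATETIME_FORMATS = ["%d-%b-%y %H:%M:%S", "%Y-%m-%dT%H:%M:%S"]


def parse_timestamp(value: str) -> str | None:
    """Try each configured format in turn; return an ISO 8601 UTC string,
    or None if no format matches."""
    for fmt in DATETIME_FORMATS:
        try:
            return (
                datetime.datetime.strptime(value, fmt)
                .replace(tzinfo=datetime.UTC)
                .isoformat()
            )
        except ValueError:
            continue
    return None
```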

```python
    variable_recipe: NexusFileMetadataVariable, h5file: h5py.File
) -> Any:
    if "*" in variable_recipe.path:  # Selectors are used
    """Retrieve values from file, with unit support and multi possible paths support."""
```
Contributor

Allowing multiple paths might also be useful for ESS...?

Discussion Point

But it wasn't clear whether we just want to select one of the paths or make a list of all the values from the paths, and how we would handle some edge cases.

Author

This is legacy. The issue was that in some files a field was under /entry0/instrument/name, while in others it was under /entry0/{name of the instrument}/name. I now handle it differently, as there were too many edge cases: I've hard-coded it in the ingestor using a succession of if-else branches, because it would have been far too complicated to manage using only the scheme files.
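
For what it's worth, a tiny sketch of the try-each-path fallback, in case it is still useful for the scheme-file route (the helper name is hypothetical):

```python
from typing import Any

import h5py


def read_first_existing(h5file: h5py.File, paths: list[str]) -> Any | None:
    """Return the value at the first candidate path present in the file,
    e.g. ["/entry0/instrument/name", "/entry0/<instrument name>/name"]."""
    for path in paths:
        if path in h5file:
            return h5file[path][...]
    return None
```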

```python
        return h5file[path][...].item().decode(encoding)

) -> tuple[Any, str | None]:
    """Retrieve both value and unit (if available) from an HDF5 dataset."""
```
Contributor

Question: What do you want to do with the unit...?

Author

We want to get values together with their associated units from the HDF5 file; for example, when we read a time, we want to check in the file whether it is in seconds, minutes, or something else.
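
A minimal sketch of reading a value together with its unit, assuming the NeXus convention of a "units" attribute on the dataset (the function name is illustrative):

```python
from typing import Any

import h5py


def read_value_and_unit(h5file: h5py.File, path: str) -> tuple[Any, str | None]:
    """Read a dataset's value and its 'units' attribute (NeXus convention),
    e.g. to learn whether a time is stored in seconds or minutes."""
    dataset = h5file[path]
    value = dataset[...]
    unit = dataset.attrs.get("units")
    if isinstance(unit, bytes):  # attributes may be stored as bytes
        unit = unit.decode()
    return value, unit
```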

@MatthMig
Author

MatthMig commented May 9, 2025


Login/Logout

Yes, we have multiple terabytes of data that we'd like to put into the data catalog, so a complete ingestion of past data going back to 2021 takes a month. I therefore ran into an issue with the token lifespan, which was one hour. Since this merge request, I've made a new service in SciCatLive for the ingestor with a config.json file that contains the credentials. But for sure, the best solution would be a long-life token. For our institute, a quick one-time run never really seems useful, as we have too much data; it would only make sense if we spotted that a specific file was missing from a dataset, and given the volume of data we produce, it is very unlikely anyone would spot that.

But I guess it is much more likely that someone spots a missing dataset (for ILL, a dataset would be the data acquired for a sample within a proposal). In that case, we'd ingest the directory again, which should take only a few minutes in the worst case.

Deployment

I did not remove the configuration file; it is only there so I could run the ingestor on the host before I dockerized it, and it is not really important. Concerning the Dockerfile, I thought it was better to have it here so we can build and push the image from this repo and then pull it from the SciCatLive service via the associated docker compose; I think you've done it the same way for the frontend and backend services.

Currently I do not use any CI/CD tool, and I've got no idea who will continue the project after my contract ends in a few months.
Our institute prefers GitLab, but I guess GitHub would be fine too. In any case, I've decided to push our forked version to a GitLab repo, and the same for our custom Docker images.

Additional changes since the initial draft pull request

Here is an overview of the additional changes I've made since this pull request was opened; this code is not currently pushed to your repo, only to our internal one.

Performance improvement

When patching a dataset to add a new file, there was no endpoint to simply append a file to the data files list, so in the current pull request you can see that I fetch the whole list from the backend, append the new file in the ingestor, and push everything again, which leads to poor performance. To improve this, I've added an endpoint to directly append the data file, and I call it in the patch_scicat_origdatablock function.
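
A rough sketch of what such a call could look like from the ingestor side; the route is the custom one added in the ILL fork, so the URL shape shown here is a guess, not an upstream SciCat API:

```python
import requests


def append_datafile(base_url: str, token: str, origdatablock_id: str, datafile: dict) -> None:
    """Append a single file entry to an origdatablock in one request,
    instead of re-uploading the whole dataFileList. The route below is
    a guess at the custom endpoint's shape."""
    response = requests.patch(
        f"{base_url}/api/v3/origdatablocks/{origdatablock_id}/datafiles",
        json=datafile,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
```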

Online Ingestor

My management wanted to do the online ingestion using RabbitMQ, so I've made changes to add the possibility of using RabbitMQ. They also wanted the message sent to be the data that we ingest directly, because your current implementation has the problem that it requests the file from the institute's systems, which would double the workload on our file management system: we would write the file and then access it again right away to read it. To reduce the workload, at acquisition time we create the file and simultaneously send the data with RabbitMQ in JSON format to the ingestor, so the ingestor can ingest the data without having to access our data file servers. I have therefore also factored the ingestion process so that the offline and online paths just extract variables and values and call the same function, written in a common file.
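
A minimal sketch of the consumer side with pika; the queue name and the placeholder for the shared ingestion routine are assumptions for illustration:

```python
import json

import pika  # RabbitMQ client library


def on_message(channel, method, properties, body) -> None:
    """Parse the JSON payload sent at acquisition time and ingest it
    directly, without touching the data file servers."""
    metadata = json.loads(body)
    # ... extract variables/values and call the shared ingestion routine ...
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="ingestion", durable=True)  # queue name is assumed
channel.basic_consume(queue="ingestion", on_message_callback=on_message)
channel.start_consuming()
```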

@YooSunYoung
Contributor

Just so you know,
We are working on the issues and commits you kindly created
but it is taking time since we have to make sure these changes will not affect the operation at ESS.

I'm sorry that I'm not keeping up faster...! We don't have much time allocated to this project at the moment...

@MatthMig
Author


Okay, thanks! Anyway, I've recently started working on the upstream ingestor too. I'll try to implement as many of the features I suggested as I can myself; I hope it will help the project!

