
Conversation

@MatthMig

Description

To ingest the hundreds of TB of ILL data, I needed to make several changes.
Some changes in the code are specific to ILL and should probably be handled through advanced configuration schemes.

I've also Dockerized the application and created a corresponding service in our local SciCatLive repo.

Changes

  • Dockerization of the application
  • Addition of temporary SciCat login and logout endpoints to ease long ingestion sessions; we've also added the associated functions
  • Implementation of a few new selector keywords (see the sketch after this list):
    • contains, which checks whether an element contains a substring
    • dirname, which checks whether one of the following properties applies to one of the individual directory names that make up the file path
    • possibility of applying multiple schemes to a file; this enables factoring by having one scheme file for all instruments and another specific to an instrument, with both applying to the instrument concerned, which allows better factorization of scheme files
    • better `or` and `and` selector management, with the possibility of nesting them
    • possibility for a path entry to hold several candidate paths; sometimes the same variable is not stored under the same path across our files, and there is no logic we can use to locate it reliably, so we need to try each path
  • possibility of patching datasets to add new files; we need an endpoint to append new files to a dataset, because currently the append is done by fetching the whole list of files and re-uploading it with the addition, which scales poorly as datasets grow
  • patching of origdatablocks
  • automatic instrument creation in the database
  • automatic sample creation in the database
  • automatic proposal creation in the database
  • datafile existence check
  • instrument lookup by name
  • proposal lookup by ID
  • sample lookup by ID
  • dataset lookup by sample ID
  • possibility to ingest whole directories, instead of having to do one run per file
  • addition of unit retrieval
  • addition of a function to handle ILL special cases; this should later be done through configuration files
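
To make the new selector keywords concrete, here is a minimal sketch of how such selectors could be evaluated. The `matches` function and the dictionary layout of the selectors are hypothetical illustrations, not the ingestor's actual API:

```python
from pathlib import PurePosixPath
from typing import Any


def matches(selector: dict[str, Any], value: str) -> bool:
    """Evaluate a selector against a value, supporting 'contains',
    'dirname', and nested 'and'/'or' (the layout here is assumed)."""
    if "and" in selector:
        return all(matches(s, value) for s in selector["and"])
    if "or" in selector:
        return any(matches(s, value) for s in selector["or"])
    if "contains" in selector:
        return selector["contains"] in value
    if "dirname" in selector:
        # Apply the nested selector to each directory name in the path.
        parts = PurePosixPath(value).parent.parts
        return any(matches(selector["dirname"], part) for part in parts)
    raise ValueError(f"Unknown selector: {selector}")


# Example: match files that have 'raw' in a directory name and '.nxs' in the path.
sel = {"and": [{"dirname": {"contains": "raw"}}, {"contains": ".nxs"}]}
print(matches(sel, "/data/instrumentA/raw/exp0001.nxs"))  # True
```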

I know these changes are numerous and breaking. They are changes that ILL required to be made to SciCat, and I implemented them my way, but I'm now looking into how we can adapt these functionalities into something properly mergeable into the core.

@YooSunYoung
Contributor

YooSunYoung commented Apr 28, 2025

@MatthMig
Thank you for the PR! Sorry for the late reply.

@nitrosx and I will try to support your needs...!

I heard you are going to have a meeting about this rather soon.
I will post some questions here in the meantime.

Login/Logout

It seems like you need an ingestor that can process multiple files at once.
Is it something that you want to run regularly...?
Or was it just a one-time thing that you needed to do for existing datasets?

The question is mostly because of the authentication.
We typically don't want to have passwords in configuration files.
If it's running regularly, it should use a long-life token instead.
If it's just a quick one-time run with username and password, then it'd be better for the ingestor to ask for the username and password via a prompt and delete them from memory as soon as it no longer needs them.

It wouldn't be too much work, but we can also just use the token from the SciCat frontend, so I'm not sure it's worth implementing and maintaining.
Is anything keeping the ILL ingestor from using a token?
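
For reference, a minimal sketch of the prompt-based approach. It assumes a SciCat local-auth endpoint at /api/v3/auth/login and an access_token field in the response; both should be checked against the backend version in use:

```python
import getpass

import requests


def fetch_token(base_url: str) -> str:
    """Prompt for credentials, exchange them for a token, and discard them.
    Endpoint path and response field name are assumptions."""
    username = input("SciCat username: ")
    password = getpass.getpass("SciCat password: ")
    try:
        response = requests.post(
            f"{base_url}/api/v3/auth/login",
            json={"username": username, "password": password},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["access_token"]
    finally:
        del username, password  # do not keep credentials in memory
```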

Deployment

I see the Dockerfile and configuration files for deployment as well.
We typically store them in a separate repository, as they can contain institute-specific values.
They should also be accessible from the deployment tools.
What kind of deployment/CI-CD tool do you use...?
For example, do you prefer GitHub or GitLab...?

I can make a template repository for either platform that also has CI tests to validate the configuration files.

Comment on lines +62 to +66
```python
try:
    # Try custom format "dd-MMM-yy HH:mm:ss"
    return datetime.datetime.strptime(value, "%d-%b-%y %H:%M:%S").replace(tzinfo=datetime.UTC).isoformat()
except ValueError:
    return None
```
Contributor

I'll make it configurable so that we don't have to hard-code it.
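
For example, a minimal sketch of what the configurable version might look like, assuming the format strings come from an ingestor config entry (the DATETIME_FORMATS name is hypothetical):

```python
import datetime

# Hypothetical config entry listing the accepted timestamp formats.
DATETIME_FORMATS = ["%d-%b-%y %H:%M:%S", "%Y-%m-%dT%H:%M:%S"]


def parse_timestamp(value: str) -> str | None:
    """Try each configured format in turn; return an ISO 8601 UTC string,
    or None if no format matches."""
    for fmt in DATETIME_FORMATS:
        try:
            return (
                datetime.datetime.strptime(value, fmt)
                .replace(tzinfo=datetime.UTC)
                .isoformat()
            )
        except ValueError:
            continue
    return None
```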

```python
    variable_recipe: NexusFileMetadataVariable, h5file: h5py.File
) -> Any:
    if "*" in variable_recipe.path:  # Selectors are used
    """Retrieve values from file, with unit support and multi possible paths support."""
```
Contributor

Allowing multiple paths might also be useful for ESS...?

Discussion Point

But it wasn't clear whether we just want to select one of the paths or make a list of all the values from the paths, and how we would handle some edge cases.

Author

This is legacy. The issue was that in some files a field was under /entry0/instrument/name, while in others it was under /entry0/{name of the instrument}/name. I now handle it differently, as there were too many edge cases: I've hard-coded it in the ingestor using a succession of if-else branches, because it would have been far too complicated to manage using only the scheme files.
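
For what it's worth, a tiny sketch of the try-each-path fallback, in case it is still useful for the scheme-file route (the helper name is hypothetical):

```python
from typing import Any

import h5py


def read_first_existing(h5file: h5py.File, paths: list[str]) -> Any | None:
    """Return the value at the first candidate path present in the file,
    e.g. ["/entry0/instrument/name", "/entry0/<instrument name>/name"]."""
    for path in paths:
        if path in h5file:
            return h5file[path][...]
    return None
```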

```python
        return h5file[path][...].item().decode(encoding)

) -> tuple[Any, str | None]:
    """Retrieve both value and unit (if available) from an HDF5 dataset."""
```
Contributor

Question: What do you want to do with the unit...?

Author

We want to get values together with their associated units from the HDF5 file; for example, when we read a time, we want to check in the file whether it is in seconds, minutes, or something else.
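
A minimal sketch of reading a value together with its unit, assuming the NeXus convention of a "units" attribute on the dataset (the function name is illustrative):

```python
from typing import Any

import h5py


def read_value_and_unit(h5file: h5py.File, path: str) -> tuple[Any, str | None]:
    """Read a dataset's value and its 'units' attribute (NeXus convention),
    e.g. to learn whether a time is stored in seconds or minutes."""
    dataset = h5file[path]
    value = dataset[...]
    unit = dataset.attrs.get("units")
    if isinstance(unit, bytes):  # attributes may be stored as bytes
        unit = unit.decode()
    return value, unit
```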

@MatthMig
Author

MatthMig commented May 9, 2025


Login/Logout

Yes, we have multiple terabytes of data that we'd like to put into the data catalog, so a complete ingestion of past data going back to 2021 takes a month. I therefore ran into an issue with the token lifespan, which was one hour. Since this merge request, I've made a new service in SciCatLive for the ingestor with a config.json file that contains the credentials. But for sure, the best solution would be a long-life token. For our institute, a quick one-time run never really seems useful, as we have too much data; it would only make sense if we spotted that a specific file was missing from a dataset, and given the volume of data we produce, it is very unlikely anyone would spot that.

But I guess it is much more likely that someone spots a missing dataset (for ILL, a dataset would be the data acquired for a sample within a proposal). In that case, we'd ingest the directory again, which should take only a few minutes in the worst case.

Deployment

I did not remove the configuration file; it is only there so I could run the ingestor on the host before I dockerized it, and it is not really important. Concerning the Dockerfile, I thought it was better to have it here so we can build and push the image from this repo and then pull it from the SciCatLive service via the associated docker compose; I think you've done it the same way for the frontend and backend services.

Currently I do not use any CI/CD tool, and I've got no idea who will continue the project after my contract ends in a few months.
Our institute prefers GitLab, but I guess GitHub would be fine too. In any case, I've decided to push our forked version to a GitLab repo, and the same for our custom Docker images.

Additional changes since the initial draft pull request

Here is an overview of the additional changes I've made since this pull request was opened; this code is not currently pushed to your repo, only to our internal one.

Performance improvement

When patching a dataset to add a new file, there was no endpoint to simply append a file to the data files list, so in the current pull request you can see that I fetch the whole list from the backend, append the new file in the ingestor, and push everything again, which leads to poor performance. To improve this, I've added an endpoint to directly append the data file, and I call it in the patch_scicat_origdatablock function.
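
A rough sketch of what such a call could look like from the ingestor side; the route is the custom one added in the ILL fork, so the URL shape shown here is a guess, not an upstream SciCat API:

```python
import requests


def append_datafile(base_url: str, token: str, origdatablock_id: str, datafile: dict) -> None:
    """Append a single file entry to an origdatablock in one request,
    instead of re-uploading the whole dataFileList. The route below is
    a guess at the custom endpoint's shape."""
    response = requests.patch(
        f"{base_url}/api/v3/origdatablocks/{origdatablock_id}/datafiles",
        json=datafile,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
```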

Online Ingestor

My management wanted to do the online ingestion using RabbitMQ, so I've made changes to add the possibility of using RabbitMQ. They also wanted the message sent to be the data that we ingest directly, because your current implementation has the problem that it requests the file from the institute's systems, which would double the workload on our file management system: we would write the file and then access it again right away to read it. To reduce the workload, at acquisition time we create the file and simultaneously send the data with RabbitMQ in JSON format to the ingestor, so the ingestor can ingest the data without having to access our data file servers. I have therefore also factored the ingestion process so that the offline and online paths just extract variables and values and call the same function, written in a common file.
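
A minimal sketch of the consumer side with pika; the queue name and the placeholder for the shared ingestion routine are assumptions for illustration:

```python
import json

import pika  # RabbitMQ client library


def on_message(channel, method, properties, body) -> None:
    """Parse the JSON payload sent at acquisition time and ingest it
    directly, without touching the data file servers."""
    metadata = json.loads(body)
    # ... extract variables/values and call the shared ingestion routine ...
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="ingestion", durable=True)  # queue name is assumed
channel.basic_consume(queue="ingestion", on_message_callback=on_message)
channel.start_consuming()
```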

@YooSunYoung
Contributor

Just so you know,
We are working on the issues and commits you kindly created
but it is taking time since we have to make sure these changes will not affect the operation at ESS.

I'm sorry that I'm not keeping up faster...! We don't have much time allocated to this project at the moment...

@MatthMig
Author


Okay, thanks! Anyway, I've recently started working on the upstream ingestor too. I'll try to implement as many of the features I suggested as I can myself; I hope it will help the project!

