ILL suggestion #141
Conversation
…tion-test Add integration test GitHub action.
…ame-for-UO-query fix: renamed proposal organisation field to insttution_id
…ample or datafile
…lection performance
…nt}, and to use /entry0/title or /entry0/experimentTitle
…at 1 instead of 0
@MatthMig @nitrosx and I will try to support your needs...! I heard you are going to have a meeting about this rather soon.

Login/Logout
It seems like you need an ingestor that can process multiple files at once. The question is mostly about authentication. It wouldn't be too much work, but we could also just use the token from the SciCat frontend, so I'm not sure it's worth implementing and maintaining.

Deployment
I see the Dockerfile and configuration files for deployment as well. I can make a template repository for either platform that also has CI tests to validate the configuration files.
```python
try:
    # Try custom format "dd-MMM-yy HH:mm:ss"
    return datetime.datetime.strptime(value, "%d-%b-%y %H:%M:%S").replace(tzinfo=datetime.UTC).isoformat()
except ValueError:
    return None
```
I'll make it configurable so that we don't have to hard-code it.
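A minimal sketch of what the configurable variant could look like, assuming the format string comes from the ingestor's configuration; the function name and the `date_format` parameter are illustrative, not from this PR:

```python
import datetime

def parse_timestamp(value: str, date_format: str = "%d-%b-%y %H:%M:%S") -> str | None:
    """Parse a timestamp with a configurable format; return None if it doesn't match."""
    try:
        # The default keeps the custom "dd-MMM-yy HH:mm:ss" format seen above,
        # but a different format can be injected from configuration.
        parsed = datetime.datetime.strptime(value, date_format)
        # datetime.UTC requires Python >= 3.11, matching the snippet above.
        return parsed.replace(tzinfo=datetime.UTC).isoformat()
    except ValueError:
        return None
```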
```python
    variable_recipe: NexusFileMetadataVariable, h5file: h5py.File
) -> Any:
    """Retrieve values from file, with unit support and support for multiple possible paths."""
    if "*" in variable_recipe.path:  # Selectors are used
```
Allowing multiple paths
might also be useful for ESS...?
Discussion point: it wasn't clear whether we just want to select one of the paths or build a list of all values from the paths, and how we should handle some edge cases. A sketch of one possible selector resolution follows below.
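For reference, a sketch of one way to resolve such "*" selectors against an HDF5 file; `resolve_paths` is a hypothetical helper, and note that fnmatch's `*` also matches across `/`, which is exactly the kind of edge case mentioned above:

```python
import fnmatch
import h5py

def resolve_paths(h5file: h5py.File, pattern: str) -> list[str]:
    """Return every object path matching a pattern like '/entry0/*/name'."""
    matches: list[str] = []

    def visit(name: str) -> None:
        # h5py passes names without a leading slash, so normalise before matching.
        if fnmatch.fnmatch("/" + name, pattern):
            matches.append("/" + name)

    h5file.visit(visit)
    return matches
```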
This is legacy. The issue was that in some files a field was under /entry0/instrument/name while in others it was under /entry0/{name of the instrument}/name. I now handle it differently, since there were too many edge cases: I've hard-coded it in the ingestor using a succession of if-else branches (sketched below), because managing it only through the scheme files would have been far too complicated.
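A rough sketch of that if-else fallback, assuming ILL-style NeXus layouts; the helper name and the encoding are assumptions:

```python
import h5py

def read_instrument_name(h5file: h5py.File) -> str | None:
    """Try the fixed path first, then fall back to /entry0/{name of the instrument}/name."""
    # Case 1: the field sits under the conventional /entry0/instrument/name path.
    if "entry0/instrument/name" in h5file:
        return h5file["entry0/instrument/name"][...].item().decode("utf-8")
    # Case 2 (legacy files): the group is named after the instrument itself.
    for node in h5file["entry0"].values():
        if isinstance(node, h5py.Group) and "name" in node:
            return node["name"][...].item().decode("utf-8")
    return None
```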
```python
    return h5file[path][...].item().decode(encoding)

# ...

) -> tuple[Any, str | None]:
    """Retrieve both value and unit (if available) from an HDF5 dataset."""
```
Question: What do you want to do with the unit...?
We want to get values together with their associated units from the HDF5 file. For example, when we read a time we want to check in the file whether it is in seconds, minutes, or another unit.
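In h5py that typically means reading the dataset's attributes. A minimal sketch, assuming the NeXus convention of a `units` attribute (the helper name is illustrative):

```python
import h5py

def read_value_and_unit(h5file: h5py.File, path: str) -> tuple[object, str | None]:
    """Return a scalar dataset's value together with its declared unit, if any."""
    dataset = h5file[path]
    value = dataset[...].item()
    # NeXus files conventionally declare the unit in a "units" attribute,
    # e.g. "s" or "min" for a time value.
    unit = dataset.attrs.get("units")
    if isinstance(unit, bytes):
        unit = unit.decode("utf-8")
    return value, unit
```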
Login/Logout
Yes, we have multiple terabytes of data that we'd like to put in the data catalog, so a complete ingestion of past data going back to 2021 takes a month. I therefore ran into an issue with the token lifespan, which was one hour. Since this merge request I've made a new service in SciCatLive for the ingestor, with a config.json file that contains the credentials. But for sure, the best solution would be a long-lived token.

For our institute, quick one-time runs never seem useful, as we have too much data. They would only make sense if we spotted that a specific file was missing from a dataset, and given the amount of data we produce, it is very unlikely that anyone would notice that. It is much more likely that someone notices a missing dataset (for the ILL, a dataset is the data acquired for one sample under one proposal). In that case we'd ingest the directory again, which should take only a few minutes in the worst case.

Deployment
The configuration file is only there so I could run the ingestor on the host before I dockerized it; it is not really important. Concerning the Dockerfile, I thought it was better to keep it here so we can build and push the image from this repo and then pull it from the SciCatLive service through the associated docker compose; I think you've done it the same way for the frontend and backend services. Currently I don't use any CI/CD tool, and I have no idea who will continue the project after my contract ends in a few months.

Additional changes since the initial draft pull request
Here is an overview of additional changes I've made since the pushed pull request. This code is not currently pushed to your repo, only to our internal one.

Performance improvement
When patching a dataset to add a new file, there was no endpoint to simply append a file to the data files list. So, in the current pull request, I fetch everything from the backend, append the new file in the ingestor, and push everything back, which leads to poor performance. To improve this, I've added an endpoint that directly appends the data file, and I call it in the patch_scicat_origdatablock function (see the first sketch after this comment).

Online ingestor
My management wanted the online ingestion to use RabbitMQ, so I've made changes to add that option. They also wanted the message itself to carry the data we ingest, because your current implementation makes a request to the institute's database to get the file, which would double the workload on our file management system: we would write the file and immediately access it again to read it. To reduce the workload, at acquisition time we create the file and simultaneously send the data in JSON format to the ingestor via RabbitMQ, so the ingestor can ingest it without touching our data file servers (see the second sketch after this comment). I have therefore also factored out the ingestion process so that the offline and online paths both just extract variables and values and call the same function in a common file.
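To make the two changes above concrete, here is a hedged sketch of the append call, assuming the custom ILL endpoint is a plain POST of a single datafile entry; the URL path, payload shape, and function name are my assumptions, not the actual SciCat API:

```python
import requests

def append_datafile(base_url: str, token: str, origdatablock_id: str, datafile: dict) -> None:
    """Append one datafile to an origdatablock instead of re-uploading the whole list."""
    # Hypothetical endpoint: the real route added at the ILL may differ.
    response = requests.post(
        f"{base_url}/origdatablocks/{origdatablock_id}/datafiles",
        json=datafile,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
```

And a sketch of the acquisition side publishing the data to the ingestor over RabbitMQ with pika; the queue name and message shape are assumptions:

```python
import json
import pika

# Publish the freshly acquired metadata so the ingestor never has to
# re-read the file from the storage servers.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="ingestion", durable=True)

message = {
    "dataset": {"datasetName": "example"},         # placeholder fields
    "datafiles": [{"path": "/data/example.nxs"}],  # placeholder path
    "scientificMetadata": {},
}
channel.basic_publish(
    exchange="",
    routing_key="ingestion",
    body=json.dumps(message).encode("utf-8"),
    properties=pika.BasicProperties(delivery_mode=2),  # make the message persistent
)
connection.close()
```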
…atus for origdatablocks is ignored in current backend version
Just so you know, I'm sorry that I'm not keeping up faster...! We don't have much time allocated to this project at the moment...
Okay, thanks! Anyway, I've recently started working on the upstream ingestor too, and I'll try to implement as many of the features I suggested as I can. I hope it will help the project!
Description
To ingest the hundreds of terabytes of data at the ILL, I needed to make several changes.
Some changes in the code are specific to the ILL and should probably be made using advanced configuration schemes.
I've also dockerised the application and made a corresponding service in our local SciCatLive repo.
Changes
I know these changes are extensive and breaking. They are changes the ILL required of SciCat, and I implemented them my way; I'm now looking into how to adapt these functionalities into something properly mergeable into the core.