Description
The engine repository contains a lot of data files, for instance the CSV files used as expected values for the GSIM tests (they have been there from the beginning) or the .onnx files used by the models based on machine learning technologies (a relatively new addition). This causes the repository to grow constantly. Already, a lot of time is spent downloading the repository, both by final users and by our tests, and that gets worse at each release. In particular, 99% of the users do not need 99% of the data, but right now they are forced to download everything, including what they do not need. Docker images are also large. In the future we expect more .onnx files to be included, and much larger than the current ones, so the problem will get worse, even though most people will not use the models requiring such files.

How can we solve that? Using git support for large files (Git LFS) makes things more complicated and it is very expensive (every time a collaborator or a CI/CD runner like GitHub Actions clones the repo or pulls updates, it consumes your bandwidth quota), so it is not a solution. A solution could be to move the data files to a public HTTP server, saving them in directories containing the engine version number (i.e. something like https://downloads.openquake.org/engine-3.23/gsim), and then change the installer to download only the files needed for the current version, possibly with parameters like light or full depending on how much data the user wants to download.
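As a rough illustration, a minimal sketch of what such an installer helper could look like, assuming the versioned directory scheme above; the manifest names (light.txt / full.txt) and the download_data() function are hypothetical, not existing engine code:

```python
import os
import urllib.request

BASE_URL = "https://downloads.openquake.org/engine-{version}"


def download_data(version, kind="light", dest="~/openquake-data"):
    """Download the data files listed in the <kind>.txt manifest for <version>."""
    dest = os.path.expanduser(dest)
    base = BASE_URL.format(version=version)
    # the manifest is assumed to be a plain-text list of relative paths,
    # e.g. "gsim/expected/some_gsim.csv" or "ml_models/some_model.onnx"
    with urllib.request.urlopen(f"{base}/{kind}.txt") as resp:
        relpaths = resp.read().decode().split()
    for relpath in relpaths:
        fname = os.path.join(dest, relpath)
        os.makedirs(os.path.dirname(fname), exist_ok=True)
        urllib.request.urlretrieve(f"{base}/{relpath}", fname)
    return dest


# for instance: download_data("3.23", kind="full")
```

With something like this, a default installation would fetch only the light set, while developers and CI runs needing the test expectations or the large .onnx models could request the full set.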
A preliminary step for such a change would be to move the data files to a sensible location in the current repository; for instance, we could introduce a package openquake.data for that purpose. That would greatly improve the organization of the repository and make the installation easier, avoiding errors like the ones we had recently, when the PyPI package was missing data files required for the GSIMs (see #11056).
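A sketch, assuming a hypothetical openquake.data package, of how the rest of the engine could locate data files through a single entry point instead of hardcoded relative paths; the get_data_path() helper and the example path are illustrative only:

```python
from importlib.resources import files


def get_data_path(relpath):
    """Return the filesystem path of a data file shipped in openquake.data."""
    return str(files("openquake.data").joinpath(relpath))


# for instance: get_data_path("gsim/expected/some_gsim.csv")
```

Centralizing the lookup like this would also make it easier to later redirect it to files downloaded from the HTTP server rather than files bundled in the package.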