This repository is meant for anyone who wants to fine-tune a global ML-based streamflow prediction model on local data. More details are in our paper (submitted to the AI for Climate Change workshop at ICLR 2025).
First, you must download the Caravan dataset from here. There are more details in the Caravan GitHub repository. The dataset is ~100GB in size.
Clone the repository by navigating to the folder on your local computer that you want to work in and typing (in the terminal):
git clone url/to/this/repo
Then, move the Caravan data into a folder in your local project. Whichever folder you choose, make sure to set the data_dir: field in the config file to that folder when you run the fine-tuning script below.
Create the right Python environment by running:
conda env create -f environment.yml
To run this command you will need Anaconda or Miniconda, which can be installed here. If you prefer using virtualenv or another environment manager, you can just copy the requirements from the environment.yml file.
To activate your environment (necessary for next step), run:
conda activate finetuning
Finally, you will likely also need to run (again, in the workspace folder):
export PYTHONPATH=.
To fine-tune a pre-trained model on the basins of your choosing, run:
python scripts/finetune.py config=path/to/config/file
See the configs folder for an example config file. It is in your config file that you can make all the choices regarding your fine-tuning run. Things to set in your config file:
- base_model_path: This is the path to your pre-trained model. If you do not have one, don't worry! If you leave this field empty, the code will automatically download a pre-trained model from Hugging Face (link to model repo) onto your computer. After it is downloaded, set base_model_path to the path where the model was saved (pretrained_models/caravan_base), so you don't need to re-download the model each time.
- The hyperparameters for your fine-tuning sweep.
- Which basins to fine-tune on (by picking the basin_file). You can edit the standard basin file in assets/basins.txt to include exactly the basins you want. You can find all the basins in Caravan (and since you have the Caravan dataset on your computer, you can readily print out all the basins).
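As a rough illustration of printing out the available basins, here is a minimal Python sketch. It assumes that your local Caravan copy stores one CSV per basin under timeseries/csv/<source>/<basin_id>.csv and that a basin file lists one basin ID per line; both are assumptions, so check them against your download and against assets/basins.txt. The function names are hypothetical and not part of this repository.

```python
from pathlib import Path


def list_basin_ids(data_dir):
    """Return the sorted basin IDs found under a local Caravan copy.

    Assumption: each basin has one file at
    <data_dir>/timeseries/csv/<source>/<basin_id>.csv.
    """
    csv_root = Path(data_dir) / "timeseries" / "csv"
    return sorted(path.stem for path in csv_root.glob("*/*.csv"))


def write_basin_file(basin_ids, basin_file):
    """Write one basin ID per line (assumed basin-file format)."""
    Path(basin_file).write_text("\n".join(basin_ids) + "\n")
```

For example, `ids = list_basin_ids("path/to/Caravan")` followed by `write_basin_file(ids[:10], "my_basins.txt")` would write the first ten IDs to a new basin file that you could then point basin_file at.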
The outputs from the fine-tuning sweep will end up in the output/ folder. This folder will contain 3 subfolders:
- finetuned_models, containing the best fine-tuned models for each basin.
- runs, containing all the attempted fine-tuned models for each basin (as many as max_evals in your config, per basin).
- sweeps, containing a .pkl file for every basin you fine-tune on, with information regarding the hyperparameter sweep.
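If you want to look at the sweep files directly, a minimal sketch like the one below may help. It only assumes that the sweeps folder sits at output/sweeps and holds standard Python pickle files. The structure of the stored objects is not described here, so inspect each one (for instance with type()) before relying on any field. The helper name load_sweeps is hypothetical.

```python
import pickle
from pathlib import Path


def load_sweeps(sweeps_dir="output/sweeps"):
    """Load every .pkl file in the sweeps folder.

    Returns a dict mapping each file name (without extension) to the
    unpickled object. What that object contains is not assumed here.
    """
    sweeps = {}
    for pkl_path in sorted(Path(sweeps_dir).glob("*.pkl")):
        with open(pkl_path, "rb") as f:
            sweeps[pkl_path.stem] = pickle.load(f)
    return sweeps
```

Run this inside the finetuning environment: unpickling needs the classes that created the files to be importable. As with any pickle, only load files that you produced yourself.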
When you are done fine-tuning, you can analyze your results by running the analysis.ipynb notebook in the analysis folder. This notebook contains only the most rudimentary display of the results, and there are far more things you may want to explore, such as the relationship between fine-tuning improvements and hydrological attributes of the basins.
In the steps described above, you are limited to fine-tuning on the 22,732 basins that are already in the Caravan dataset. If you have your own basins (with accompanying catchment delineations and streamflow data), you can format them to fit into the Caravan dataset by following the tutorials here. You can then either contribute them to the Caravan dataset itself (it's very easy), or you can just keep them as a local copy without having to publicize your data. Either way, you can then add your basin ids into the basins.txt file just as any other basin in Caravan, and then run the same fine-tuning procedure as above. Voilà!
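Adding your own basin IDs to the basin file can be done by hand, but a small helper keeps the file free of duplicates. The sketch below assumes the basin file lists one basin ID per line; the function name and the example ID are hypothetical.

```python
from pathlib import Path


def add_basins(basin_file, new_ids):
    """Append basin IDs to a basin file, skipping IDs already listed.

    Assumption: the file holds one basin ID per line.
    Returns the list of IDs that were actually added.
    """
    path = Path(basin_file)
    existing = path.read_text().split() if path.exists() else []
    seen = set(existing)
    added = []
    for basin_id in new_ids:
        if basin_id not in seen:
            seen.add(basin_id)
            added.append(basin_id)
    # Rewrite the file with the original IDs first, then the new ones.
    path.write_text("\n".join(existing + added) + "\n")
    return added
```

For example, `add_basins("assets/basins.txt", ["mybasins_001"])` would append the made-up ID mybasins_001 unless it is already in the file.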
If you have any comments or problems with the repository, feel free to open an issue!