MSAmodeling: Python and R scripts to run the random forest and additive model of Pernov et al. (2025)
Git repository: https://github.com/SwissDataScienceCenter/msamodeling. Note that the GitLab repository linked in the paper (https://gitlab.renkulab.io/arcticnap/msamodeling) is defunct as of January 2026.
This repository accompanies the paper Pernov et al. (2025). The Python and R scripts fit a random forest (scripts under /RF) and an additive model (under /AM) to methanesulfonic acid (MSA) aerosol concentrations measured at four Arctic stations (Alert, Gruvebadet, Thule, and Utqiagvik), as well as at two fictional pan-Arctic sites formed by merging the four (all stations full, denoted ASF, and all stations, denoted AS). Both models use engineered features to model the MSA concentration in selected years (training set, with the _tr suffix in data files) and then predict it in held-out years (test set, _te suffix). The processed data, comprising both training and test sets and both MSA target measurements and engineered features, are organized by site under /Data. The Python and R scripts load data via relative paths and assume they are run from the repository's root directory, i.e., from /msamodeling.
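As a minimal sketch of the _tr/_te naming convention described above, the helper below groups the files of one site directory into training and test sets by looking only at the end of each file stem. The function name split_by_suffix and the example file names are hypothetical and not part of the repository; the actual file names and formats under /Data may differ.

```python
from pathlib import Path


def split_by_suffix(data_dir):
    """Group files in data_dir into training (_tr) and test (_te) sets.

    Only the file stem is inspected, so the file extension does not matter.
    Files without either suffix are ignored.
    """
    groups = {"train": [], "test": []}
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file():
            continue
        if path.stem.endswith("_tr"):
            groups["train"].append(path.name)
        elif path.stem.endswith("_te"):
            groups["test"].append(path.name)
    return groups
```

Called from the repository's root directory, split_by_suffix("Data/Alert") would list the training and test files of one site, assuming the site folders are named after the stations.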
To run the Random Forest baselines, change into the ./RF/ folder. Create a Python environment with conda env create -f environment.yml and activate it with conda activate msamod. Then run the bash script run_cv_rf.sh to cross-validate the Random Forest model and obtain the CV results.
Once the cross-validation has finished, create the summary statistics by running the script ./scripts/run_summary_feature_groups.py.
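The fold layout used by run_cv_rf.sh is not described here. As an illustration only, and assuming folds that mirror the year-based training/test split, the sketch below holds out one calendar year per fold. The function name leave_one_year_out and the sample years are made up for this example.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_idx, val_idx) for each distinct year.

    years: one entry per observation, giving the year it was measured in.
    """
    for held_out in sorted(set(years)):
        train_idx = [i for i, y in enumerate(years) if y != held_out]
        val_idx = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train_idx, val_idx


if __name__ == "__main__":
    sample_years = [2015, 2015, 2016, 2017, 2017]  # made-up values
    for year, train_idx, val_idx in leave_one_year_out(sample_years):
        print(year, train_idx, val_idx)
```

Splitting by whole years rather than by random rows keeps observations from the same season out of both the training and the validation part of a fold.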
To fit the Additive Model, first install the required R packages (assuming R version >= 4.0.0 is already installed, either system-wide or within an activated conda environment) and compile the underlying C++ code (assuming an appropriate compiler is available). To do so, cd to the /AM directory in a terminal and run Rscript src/prep.r. This needs to be done only once per system/environment. Then execute run_fss_am.sh to run the forward stepwise variable selection (fss) procedure on the training set. The R log is written to /logs and the outputs are saved in /outputs. The fss procedure is run for each station separately; choose the station by uncommenting the corresponding line in run_fss_am.sh (Alert by default). The log and output files all get a prefix matching the specified station, unless this prefix is overridden in the corresponding config file.
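For readers unfamiliar with forward stepwise selection, the following is a generic Python sketch of the greedy idea, not the R/C++ implementation behind run_fss_am.sh. The selection criterion, the stopping rule, and all names (forward_stepwise, toy_score, the feature names) are assumptions made for illustration.

```python
def forward_stepwise(candidates, score, max_features=None):
    """Greedy forward selection.

    candidates: iterable of feature names.
    score: callable taking a list of feature names and returning a number,
           where lower is better (e.g. a validation error).
    Returns the selected features in order of inclusion.
    """
    remaining = list(candidates)
    selected = []
    best = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every model that adds exactly one remaining feature.
        trial_best, feature = min((score(selected + [f]), f) for f in remaining)
        if trial_best >= best:
            break  # no candidate improves the score, so stop
        selected.append(feature)
        remaining.remove(feature)
        best = trial_best
    return selected


def toy_score(features):
    """Made-up criterion: useful features lower the error, each feature costs 0.5."""
    useful = {"temperature": 3.0, "sea_ice": 2.0}
    return 10.0 - sum(useful.get(f, 0.0) for f in features) + 0.5 * len(features)


if __name__ == "__main__":
    print(forward_stepwise(["wind", "temperature", "sea_ice"], toy_score))
```

In a real setting, score would refit the model on the training set for each candidate feature set and return a criterion such as a cross-validated error.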
This is MSAmodeling version 0.1, the initial release.
Pernov, J. B., Aeberhard, W. H., Volpi, M., Harris, E., Hohermuth, B., Ishino, S., Henne, S., Im, U., Quinn, P. K., Upchurch, L. M., and Schmale, J. (2025). Data-driven modeling of environmental factors influencing Arctic methanesulfonic acid aerosol concentrations. Atmospheric Chemistry and Physics 25 (12), 6497–6537. DOI: 10.5194/acp-25-6497-2025