MSAmodeling: Python and R scripts to run the random forest and additive model of Pernov et al. (2025)

Git repository: https://github.com/SwissDataScienceCenter/msamodeling. Note that the GitLab repository linked in the paper (https://gitlab.renkulab.io/arcticnap/msamodeling) is defunct as of January 2026.

This repository accompanies the paper Pernov et al. (2025). The Python and R scripts fit a random forest (scripts under /RF) and an additive model (under /AM) to methanesulfonic acid (MSA) aerosol concentrations measured at four Arctic stations (Alert, Gruvebadet, Thule, and Utqiagvik), as well as at two fictional pan-Arctic sites formed by merging the four (all stations full, denoted ASF, and all stations, denoted AS). Both models use engineered features to model the MSA concentration in given years (training set, with the _tr suffix in data files) and then predict it in held-out years (test set, _te suffix). The processed data, comprising the training and test sets with both the MSA target measurements and the engineered features, are organized by site under /Data. The Python and R scripts load data with relative paths and assume they are run from the repository's root directory, i.e., from /msamodeling.
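The layout described above can be checked before running anything. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes that /Data holds one subfolder per site and that the _tr and _te markers sit at the end of each file name, just before the extension. Adjust it if the actual file names differ.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Top-level folders named in this README.
EXPECTED_DIRS = ("RF", "AM", "Data")


def is_repo_root(path: Path) -> bool:
    """Return True if `path` contains the RF, AM and Data folders."""
    return all((path / name).is_dir() for name in EXPECTED_DIRS)


def split_files_by_suffix(site_dir: Path) -> Dict[str, List[Path]]:
    """Group the files under `site_dir` by the _tr / _te suffix of their stem."""
    groups: Dict[str, List[Path]] = {"train": [], "test": []}
    for file in sorted(site_dir.rglob("*")):
        if not file.is_file():
            continue
        if file.stem.endswith("_tr"):
            groups["train"].append(file)
        elif file.stem.endswith("_te"):
            groups["test"].append(file)
    return groups


def count_train_test_files(root: Path) -> Dict[str, Tuple[int, int]]:
    """Map each site folder under Data to (number of _tr files, number of _te files)."""
    if not is_repo_root(root):
        raise FileNotFoundError(f"{root} does not look like the repository root")
    counts: Dict[str, Tuple[int, int]] = {}
    # Assumption: every subfolder of Data corresponds to one site.
    for site_dir in sorted(p for p in (root / "Data").iterdir() if p.is_dir()):
        groups = split_files_by_suffix(site_dir)
        counts[site_dir.name] = (len(groups["train"]), len(groups["test"]))
    return counts
```

Calling count_train_test_files(Path.cwd()) from the repository root should then list every site with its number of training and test files. A FileNotFoundError means the working directory is not the root.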

Random Forest baselines

To run the Random Forest baselines, change into the ./RF/ folder. Create a Python environment with conda env create -f environment.yml and activate it with conda activate msamod. Then run the bash script run_cv_rf.sh to cross-validate the Random Forest model and obtain the CV results.

Once the cross-validation has finished, create the summary statistics by running the script ./scripts/run_summary_feature_groups.py.

Additive Model

To fit the Additive Model, first install the required R packages (this assumes R version >= 4.0.0 is already installed, either system-wide or within an activated conda environment) and compile the underlying C++ code (this assumes an appropriate compiler is available). To do so, open a terminal, cd to the /AM directory, and run Rscript src/prep.r. This needs to be done only once per system or environment. Then execute run_fss_am.sh to run the forward stepwise variable selection (fss) procedure on the training set. The R log is written to /logs and the outputs are saved in /outputs. The fss procedure is run for each station separately; choose the station by uncommenting the corresponding line in run_fss_am.sh (Alert by default). The log and output files all get a prefix matching the specified station, unless this prefix is overridden in the corresponding config file.
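For readers unfamiliar with forward stepwise selection, the general idea can be sketched in a few lines of Python. This is a generic greedy illustration, not the R/C++ procedure in /AM. The function name, the min_gain threshold, the convention that a higher score is better, and the toy feature names are all assumptions made for the example; the selection criterion and stopping rule actually used are described in Pernov et al. (2025).

```python
from typing import Callable, List, Sequence, Tuple


def forward_stepwise_select(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedily add the feature that most improves `score` (higher is better)."""
    selected: List[str] = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current selection.
        trials = [(score(selected + [feature]), feature) for feature in remaining]
        top_score, top_feature = max(trials, key=lambda trial: trial[0])
        if top_score - best <= min_gain:
            break  # no remaining candidate improves the criterion enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy criterion with made-up contributions; a real criterion would refit the
# model on the training set for each candidate subset.
toy_gain = {"feature_a": 3.0, "feature_b": 1.0, "feature_c": -0.5}


def toy_score(features: List[str]) -> float:
    return sum(toy_gain[f] for f in features)


chosen, final_score = forward_stepwise_select(list(toy_gain), toy_score)
# chosen == ["feature_a", "feature_b"], final_score == 4.0
```

In this toy run the search stops after two steps, because adding feature_c would lower the score.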

Version history

This is MSAmodeling version 0.1. This is the initial release.

References

Pernov, J. B., Aeberhard, W. H., Volpi, M., Harris, E., Hohermuth, B., Ishino, S., Henne, S., Im, U., Quinn, P. K., Upchurch, L. M., and Schmale, J. (2025). Data-driven modeling of environmental factors influencing Arctic methanesulfonic acid aerosol concentrations. Atmospheric Chemistry and Physics 25 (12), 6497–6537. DOI: 10.5194/acp-25-6497-2025
