Use Paramspace to automate the file naming scheme based on wildcards #40

Open
wants to merge 89 commits into
base: main
56a403e
Switch from snake_case to kebab-case for config params
kelly-sovacool Jan 18, 2023
754329d
Quick proof of concept for paramspace (#36)
kelly-sovacool Jan 18, 2023
00346d7
Config key names need to be valid Python
kelly-sovacool Jan 20, 2023
8461a3c
WIP: test paramspace.instance in R & permuations
kelly-sovacool Jan 20, 2023
63dd26c
Get dot product of config lists for paramspace
kelly-sovacool Jan 25, 2023
b734d3a
Merge branch 'main' into paramspace
kelly-sovacool Jan 26, 2023
6306e35
Fix wildcards
kelly-sovacool Jan 26, 2023
63249bb
Remove old test code
kelly-sovacool Jan 28, 2023
279e7dd
Move un-pre-processed data to 'data/'
kelly-sovacool Jan 28, 2023
8561b1c
Implement 'exclude_param_keys' in config
kelly-sovacool Jan 28, 2023
6110b0f
Doc 'exclude_param_keys' & hardwrap lines
kelly-sovacool Jan 28, 2023
3436877
Move config & paramspace to separate rule/modules
kelly-sovacool Jan 28, 2023
58bdfc4
Get instance patterns without seed wildcard
kelly-sovacool Jan 28, 2023
8edca31
Create separate functions to tweak wildcards in paramspace
kelly-sovacool Jan 28, 2023
4c245b5
Move paramspace-related functions to separate script
kelly-sovacool Jan 28, 2023
57b8f82
Use mambaforge for deps & test files in workflow/scripts/
kelly-sovacool Jan 28, 2023
50800e2
Write get_paramspace_from_config()
kelly-sovacool Jan 28, 2023
afbb45b
Use set_default() for more DRY code
kelly-sovacool Jan 28, 2023
f3496c7
Merge branch 'main' into paramspace
kelly-sovacool Jan 29, 2023
e6b5e2a
Fix indent
kelly-sovacool Jan 29, 2023
2aec70f
Fix typo for find_feature_importance
kelly-sovacool Jan 29, 2023
7bb8255
Use 'include' instead of 'import' for snakedeploy
kelly-sovacool Jan 29, 2023
c93109e
Merge 7bb825509d44f8bd641ace5509ceecd7fd8d1e7f into 49b327ac71efe1306…
kelly-sovacool Jan 29, 2023
f6cd873
🎨 Style Python & Snakemake code 🐍
github-actions[bot] Jan 29, 2023
773e3f6
🐳 Update Dockerfile
github-actions[bot] Jan 29, 2023
791473a
Ignore results & figures from test runs
kelly-sovacool Jan 29, 2023
09b75b4
Switch 'ml_method' to 'method' for compatibility in run_ml()
kelly-sovacool Jan 29, 2023
5fcaf69
No need to specify inner_join(by)
kelly-sovacool Jan 29, 2023
07f9380
Renamed 'ml_method' to 'method' for compatibility in run_ml()
kelly-sovacool Jan 29, 2023
90f7e8f
Encode dataset wildcard in aggregated results filenames
kelly-sovacool Jan 29, 2023
7415cbb
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Jan 29, 2023
7e60ea1
Fix filenames for example report
kelly-sovacool Jan 29, 2023
1888f47
Merge 7e60ea100b8b6c9620d073e7f6a12bb1edf8bf28 into 49b327ac71efe1306…
kelly-sovacool Jan 29, 2023
6842ef9
🎨 Style Python & Snakemake code 🐍
github-actions[bot] Jan 29, 2023
2c28be0
Get method,seed,kfold from wildcards not params
kelly-sovacool Jan 29, 2023
89f9c3d
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Jan 29, 2023
0c8629a
Remove params that are redundant with wildcards
kelly-sovacool Jan 29, 2023
24eff57
Merge 0c8629a1fec183c681d02b76fd520e5eed808f98 into 49b327ac71efe1306…
kelly-sovacool Jan 29, 2023
e05e36c
Silence snakemake linter warning about abs paths
kelly-sovacool Jan 29, 2023
8ef170e
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Jan 29, 2023
020e813
Merge 8ef170ed90d2fd62ac36ece607eee2a26017ce8b into 49b327ac71efe1306…
kelly-sovacool Jan 29, 2023
a1f3f89
New rule to write the paramspace to csv
kelly-sovacool Jan 29, 2023
679489e
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Jan 29, 2023
6d3a656
Merge 679489e28713613fecbb56ce3fd65d5d916e03f2 into 49b327ac71efe1306…
kelly-sovacool Jan 29, 2023
fa09085
🎨 Style Python & Snakemake code 🐍
github-actions[bot] Jan 29, 2023
3ac6da8
Move write_paramspace code to script to satisfy linter
kelly-sovacool Jan 29, 2023
1e30f77
Merge branches 'paramspace' and 'paramspace' of https://github.com/Sc…
kelly-sovacool Jan 29, 2023
0667b61
Merge 1e30f77bf755ae9a1b336b141dce6fb395c8d18f into 49b327ac71efe1306…
kelly-sovacool Jan 29, 2023
79bfce4
🎨 Style Python & Snakemake code 🐍
github-actions[bot] Jan 29, 2023
d688ae3
Include shell magic for conda env
kelly-sovacool Jan 29, 2023
96fb339
Merge d688ae3d86fd2758d8d273aee3921587bd5265ea into 49b327ac71efe1306…
kelly-sovacool Jan 29, 2023
5ffec5b
Switch from pytest-parallel to pytest-xdist
kelly-sovacool Jan 30, 2023
070fbcb
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Jan 30, 2023
d852355
Merge 070fbcba2af9ed652ffc2362c91112cb734b551c into 49b327ac71efe1306…
kelly-sovacool Jan 30, 2023
2262d62
Use smk env for test action
kelly-sovacool Jan 30, 2023
8b4fb8c
Merge 2262d628c368f1c6b6c9dd0fda481cfd2df28f8d into 49b327ac71efe1306…
kelly-sovacool Jan 30, 2023
a72de4a
🐳 Update Dockerfile
github-actions[bot] Jan 30, 2023
48c7b35
Note pandas dependency
kelly-sovacool Jan 30, 2023
2c1c768
Merge branch 'main' into paramspace
kelly-sovacool Jan 30, 2023
a7dd11a
Merge branch 'main' into paramspace
kelly-sovacool Jan 30, 2023
59c79ff
Merge a7dd11ae6c75feb75185fa0ff2e5a8da3c0f0ed7 into e6ec7d1e00ae7203c…
kelly-sovacool Jan 30, 2023
7df13d3
Fill in dataset wildcard for example report
kelly-sovacool Jan 30, 2023
37175ac
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Jan 30, 2023
dd1a42c
Merge 37175ac99b4c1d547dfe7f58cd261cf428c13228 into e6ec7d1e00ae7203c…
kelly-sovacool Jan 30, 2023
2333b1e
Add option to use a custom paramspace csv file
kelly-sovacool Jan 30, 2023
67c3d4a
Link to snakemake docs on paramspace
kelly-sovacool Jan 30, 2023
f337938
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Jan 30, 2023
a5c4d4e
Merge f33793860ab9e71255faadf2af910f652dbdf26c into e6ec7d1e00ae7203c…
kelly-sovacool Jan 30, 2023
78b6c12
🎨 Style Python & Snakemake code 🐍
github-actions[bot] Jan 30, 2023
8a0d346
Tweak medium config
kelly-sovacool Jan 31, 2023
2eb5d42
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Feb 1, 2023
f85889c
Merge branch 'main' into paramspace
kelly-sovacool Feb 1, 2023
7cc242c
Merge f85889c76ea7f145b2d6fa1c030027e8b965f1fb into 20323a3a0aa4f76dc…
kelly-sovacool Feb 1, 2023
b75bf46
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Feb 1, 2023
911a343
Fix typo
kelly-sovacool Feb 1, 2023
5cabc22
Merge 911a343ad123cc4919bb095d4ef171e9f2755783 into 20323a3a0aa4f76dc…
kelly-sovacool Feb 1, 2023
0b49896
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Feb 1, 2023
f2e28c1
Merge 0b4989688d46cdbed894d5e3ecdf200d0853ac7b into 20323a3a0aa4f76dc…
kelly-sovacool Feb 2, 2023
b22a22d
Separate rule to copy hp plots for example report
kelly-sovacool Feb 2, 2023
c218f14
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Feb 2, 2023
0342604
Merge c218f1479524ca2c9d3c9228bf1d217968428c19 into 20323a3a0aa4f76dc…
kelly-sovacool Feb 2, 2023
e202855
🎨 Style Python & Snakemake code 🐍
github-actions[bot] Feb 2, 2023
4c6b528
Fix ml_method->method key name for config defaults
kelly-sovacool Feb 2, 2023
f6ea6c7
Merge branch 'paramspace' of https://github.com/SchlossLab/mikropml-s…
kelly-sovacool Feb 2, 2023
c1ea2a8
Update example report
kelly-sovacool Feb 2, 2023
3f91d0d
Try rf & 100 seeds on GHA
kelly-sovacool Feb 2, 2023
8b76cbc
Merge 3f91d0d1540a9efe32325af1777ea1e6599ffccd into 20323a3a0aa4f76dc…
kelly-sovacool Feb 2, 2023
8a3c446
Print wildcard pattern onstart
kelly-sovacool Feb 6, 2023
2ace7d3
Merge 8a3c44623a6dd7681bb49eabd771406e2b98b82a into 20323a3a0aa4f76dc…
kelly-sovacool Feb 6, 2023
18 changes: 11 additions & 7 deletions .github/workflows/tests.yml
@@ -21,13 +21,13 @@ jobs:
with:
persist-credentials: false
fetch-depth: 0
- uses: actions/setup-python@v4
- uses: conda-incubator/setup-miniconda@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pytest pytest-parallel
python-version: 3.11
miniforge-variant: Mambaforge
miniforge-version: latest
activate-environment: smk
environment-file: workflow/envs/smk.yml
- name: Lint workflow
uses: snakemake/[email protected]
with:
@@ -42,4 +42,8 @@ jobs:
args: "archive --forceall --cores 2 --use-conda --conda-frontend mamba --conda-cleanup-pkgs cache --show-failed-logs --all-temp --configfile config/test.yaml"
# - name: Test with pytest
# run: |
# pytest --workers 2 .tests/
# pytest -n 2 .tests/
- name: Test with pytest
shell: bash -el {0}
run: |
pytest -n 2 workflow/scripts/
4 changes: 2 additions & 2 deletions .gitignore
@@ -11,7 +11,7 @@ results/*/runs
!.tests/
__pycache__/
.DS_Store
figures/otu*
results/otu*
figures/dataset*
results/dataset*
report_otu*
*.zip
13 changes: 8 additions & 5 deletions Dockerfile
@@ -1,6 +1,6 @@
FROM condaforge/mambaforge:latest
LABEL io.github.snakemake.containerized="true"
LABEL io.github.snakemake.conda_env_hash="6aa289536136aae2d34bac6dce9ce47d037da888ed09e2c8ada989c90ef10658"
LABEL io.github.snakemake.conda_env_hash="a57a1be27a188ebf9bb5feda054b3c8e501423ae80bcd6c24c221ca36de41d15"

# Step 1: Retrieve conda environments

@@ -42,21 +42,24 @@ COPY workflow/envs/mikropml.yml /conda-envs/3f83a46ff5ea715a12fde6ee46136b0b/env

# Conda environment:
# source: workflow/envs/smk.yml
# prefix: /conda-envs/457b7b75191d44b96e5086432876e333
# prefix: /conda-envs/bbc262640c3353e62cad877627dd3174
# name: smk
# channels:
# - conda-forge
# - bioconda
# dependencies:
# - pandas
# - pytest
# - pytest-xdist
# - snakemake=7
# - snakedeploy
# - zip
RUN mkdir -p /conda-envs/457b7b75191d44b96e5086432876e333
COPY workflow/envs/smk.yml /conda-envs/457b7b75191d44b96e5086432876e333/environment.yaml
RUN mkdir -p /conda-envs/bbc262640c3353e62cad877627dd3174
COPY workflow/envs/smk.yml /conda-envs/bbc262640c3353e62cad877627dd3174/environment.yaml

# Step 2: Generate conda environments

RUN mamba env create --prefix /conda-envs/b42323b0ffd5d034544511c9db1bdead --file /conda-envs/b42323b0ffd5d034544511c9db1bdead/environment.yaml && \
mamba env create --prefix /conda-envs/3f83a46ff5ea715a12fde6ee46136b0b --file /conda-envs/3f83a46ff5ea715a12fde6ee46136b0b/environment.yaml && \
mamba env create --prefix /conda-envs/457b7b75191d44b96e5086432876e333 --file /conda-envs/457b7b75191d44b96e5086432876e333/environment.yaml && \
mamba env create --prefix /conda-envs/bbc262640c3353e62cad877627dd3174 --file /conda-envs/bbc262640c3353e62cad877627dd3174/environment.yaml && \
mamba clean --all -y
55 changes: 41 additions & 14 deletions config/README.md
@@ -1,22 +1,49 @@
# General configuration
# Additional Dependencies

To configure this workflow, modify [`config/config.yaml`](/config/config.yaml) according to your needs.
Besides snakemake, you will also need `pandas` to run this workflow:

`mamba install pandas`

# General Configuration

To configure this workflow, modify [`config/config.yaml`](/config/config.yaml)
according to your needs.

**Configuration options:**

- `dataset_csv`: the path to the dataset as a csv file.
- `dataset_name`: a short name to identify the dataset.
- `outcome_colname`: column name of the outcomes or classes for the dataset. If blank, the first column of the dataset will be used as the outcome and all other columns are features.
- `ml_methods`: list of machine learning methods to use. Must be [supported by mikropml or caret](http://www.schlosslab.org/mikropml/articles/introduction.html#the-methods-we-support).
- `dataset`: a short name to identify the dataset. The csv file for your
dataset is assumed to be located at `data/{dataset}.csv`.
The dataset should contain one outcome column with all other columns as
features for machine learning.
- `outcome_colname`: column name of the outcomes or classes for the dataset.
If blank, the first column of the dataset will be used as the outcome and
all other columns are features.
- `ml_methods`: list of machine learning methods to use. Must be
[supported by mikropml or caret](http://www.schlosslab.org/mikropml/articles/introduction.html#the-methods-we-support).
- `kfold`: k number for k-fold cross validation during model training.
- `ncores`: the number of cores to use for `preprocess_data()`, `run_ml()`, and `get_feature_importance()`. Do not exceed the number of cores you have available.
- `nseeds`: the number of different random seeds to use for training models with `run_ml()`. This will result in `nseeds` different train/test splits of the dataset.
- `find_feature_importance`: whether to calculate feature importances with permutation tests (`true` or `false`). If `false`, the plot in the report will be blank.
- `hyperparams`: override the default model hyperparameters set by mikropml for each ML method (optional). Leave this blank if you'd like to use the defaults. You will have to set these if you wish to use an ML method from caret that we don't officially support.

We also provide [`config/test.yaml`](/config/test.yaml), which uses a smaller dataset so
you can first make sure the workflow runs without error on your machine
before using your own dataset and custom parameters.
- `ncores`: the number of cores to use for `preprocess_data()`, `run_ml()`,
and `get_feature_importance()`. Do not exceed the number of cores you have available.
- `nseeds`: the number of different random seeds to use for training models
with `run_ml()`. This will result in `nseeds` different train/test splits
of the dataset.
- `find_feature_importance`: whether to calculate feature importances with
permutation tests (`true` or `false`). If `false`, the plot in the report
will be blank.
- `hyperparams`: override the default model hyperparameters set by mikropml
for each ML method (optional). Leave this blank if you'd like to use the
defaults. You will have to set these if you wish to use an ML method from
caret that we don't officially support.
- `paramspace_csv`: if you'd like to use a custom csv file to build the
[paramspace](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration) for `run_ml`, specify the path to the csv file here. If `None`, then the
paramspace will be built based on the parameters in the configfile.
- `exclude_param_keys`: keys in the configfile to exclude from the parameter
space. All keys in the configfile not listed in `exclude_param_keys` will be
included as wildcards for `run_ml` and other rules. This option is ignored
if `paramspace_csv` is not `None`.
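The `exclude_param_keys` behavior described above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the function name `build_paramspace_rows`, the `start_seed=100` default, and the use of a full cross product over list-valued keys are all assumptions (the seed range is chosen only to match the values seen in `config/custom-paramspace.csv`).

```python
from itertools import product


def build_paramspace_rows(config, start_seed=100):
    """Hypothetical helper: expand a config dict into one row per parameter combination."""
    excluded = set(config.get("exclude_param_keys") or [])
    params = {}
    for key, value in config.items():
        if key in excluded:
            continue
        # Scalars are treated as single-element lists so every key can be crossed.
        params[key] = value if isinstance(value, list) else [value]
    # Assumption: `nseeds` is expanded into consecutive seeds starting at `start_seed`.
    nseeds = config.get("nseeds", 1)
    params["seed"] = list(range(start_seed, start_seed + nseeds))
    keys = sorted(params)
    return [dict(zip(keys, combo)) for combo in product(*(params[k] for k in keys))]


# Mirrors config/config.yaml from this PR.
config = {
    "dataset": "otu_large",
    "outcome_colname": "dx",
    "method": ["glmnet", "rf"],
    "kfold": 5,
    "ncores": 8,
    "nseeds": 10,
    "find_feature_importance": True,
    "hyperparams": None,
    "paramspace_csv": None,
    "exclude_param_keys": [
        "exclude_param_keys",
        "outcome_colname",
        "ncores",
        "nseeds",
        "find_feature_importance",
        "hyperparams",
        "paramspace_csv",
    ],
}

rows = build_paramspace_rows(config)
print(len(rows))  # 2 methods x 10 seeds
print(rows[0])
```

With the default config, only `dataset`, `kfold`, `method`, and the derived `seed` remain as columns, which are the same columns used in the custom paramspace csv.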

We also provide [`config/test.yaml`](/config/test.yaml), which uses a smaller
dataset so you can first make sure the workflow runs without error on your
machine before using your own large dataset and custom parameters.

The default and test config files are suitable for initial testing,
but we recommend using more cores (if available) and
17 changes: 17 additions & 0 deletions config/config-gha.yml
@@ -0,0 +1,17 @@
dataset: otu_large
outcome_colname: dx
method:
- glmnet
- rf
kfold: 5
ncores: 4
nseeds: 100
find_feature_importance: false
exclude_param_keys:
- exclude_param_keys
- outcome_colname
- ncores
- nseeds
- find_feature_importance
- hyperparams
- paramspace_csv
16 changes: 12 additions & 4 deletions config/config.yaml
@@ -1,11 +1,19 @@
dataset_csv: data/processed/otu-large.csv
dataset_name: otu-large
dataset: otu_large
outcome_colname: dx
ml_methods:
method:
- glmnet
- rf
kfold: 5
ncores: 8
nseeds: 10
find_feature_importance: true
hyperparams:
hyperparams:
paramspace_csv:
exclude_param_keys:
- exclude_param_keys
- outcome_colname
- ncores
- nseeds
- find_feature_importance
- hyperparams
- paramspace_csv
11 changes: 11 additions & 0 deletions config/custom-paramspace.csv
@@ -0,0 +1,11 @@
dataset,kfold,method,seed
otu_large,5,glmnet,100
otu_large,5,glmnet,101
otu_large,5,glmnet,102
otu_large,5,glmnet,103
otu_large,5,glmnet,104
otu_large,5,rf,105
otu_large,5,rf,106
otu_large,5,rf,107
otu_large,5,rf,108
otu_large,5,rf,109
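To show how rows of a paramspace csv like this one could map to the directory layout seen in the example report (`dataset-otu_large/kfold-5/method-glmnet/...`), here is a small stdlib-only sketch. The helper names `wildcard_pattern` and `instance_paths` are hypothetical, and the `-` separator is inferred from the report paths in this PR rather than taken from the workflow code; the actual workflow relies on Snakemake's paramspace support instead.

```python
import csv
import io

# Two rows taken from config/custom-paramspace.csv.
CSV_TEXT = """dataset,kfold,method,seed
otu_large,5,glmnet,100
otu_large,5,rf,105
"""


def wildcard_pattern(columns, sep="-"):
    """Hypothetical helper: build a path pattern with one '{wildcard}' per column."""
    return "/".join(f"{col}{sep}{{{col}}}" for col in columns)


def instance_paths(csv_text, sep="-"):
    """Return the wildcard pattern and one filled-in path per csv row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pattern = wildcard_pattern(reader.fieldnames, sep)
    return pattern, [pattern.format(**row) for row in reader]


pattern, paths = instance_paths(CSV_TEXT)
print(pattern)
for path in paths:
    print(path)
```

Each csv column becomes one wildcard, so adding or removing a column changes the file naming scheme without editing any rule by hand.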
19 changes: 19 additions & 0 deletions config/custom-paramspace.yaml
@@ -0,0 +1,19 @@
dataset: otu_large
outcome_colname: dx
method:
- glmnet
- rf
kfold: 5
ncores: 8
nseeds: 10
find_feature_importance: true
hyperparams:
paramspace_csv: 'config/custom-paramspace.csv'
exclude_param_keys:
- exclude_param_keys
- outcome_colname
- ncores
- nseeds
- find_feature_importance
- hyperparams
- paramspace_csv
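The README in this PR states that `exclude_param_keys` is ignored when `paramspace_csv` is set. A minimal sketch of that precedence, assuming a hypothetical helper named `paramspace_source` (not a function from this PR):

```python
def paramspace_source(config):
    """Hypothetical helper: report where the paramspace would come from."""
    csv_path = config.get("paramspace_csv")
    if csv_path:
        # A custom csv takes precedence; exclude_param_keys is not consulted.
        return ("csv", csv_path)
    excluded = set(config.get("exclude_param_keys") or [])
    # Otherwise every config key that is not excluded becomes a wildcard.
    return ("config", sorted(key for key in config if key not in excluded))


custom = {
    "dataset": "otu_large",
    "method": ["glmnet", "rf"],
    "kfold": 5,
    "nseeds": 10,
    "paramspace_csv": "config/custom-paramspace.csv",
    "exclude_param_keys": ["exclude_param_keys", "nseeds", "paramspace_csv"],
}
default = dict(custom, paramspace_csv=None)

print(paramspace_source(custom))
print(paramspace_source(default))
```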
9 changes: 0 additions & 9 deletions config/glmnet.yaml

This file was deleted.

14 changes: 10 additions & 4 deletions config/robust.yaml
@@ -1,7 +1,6 @@
dataset_csv: data/processed/otu-large.csv
dataset_name: otu-large
dataset: otu_large
outcome_colname: dx
ml_methods:
method:
- glmnet
- rf
- rpart2
@@ -26,4 +25,11 @@ hyperparams:
- 42
- 83
- 166

exclude_param_keys:
- exclude_param_keys
- outcome_colname
- ncores
- nseeds
- find_feature_importance
- hyperparams
- paramspace_csv
13 changes: 10 additions & 3 deletions config/test.yaml
@@ -1,7 +1,6 @@
dataset_csv: data/processed/otu-micro.csv
dataset_name: otu-micro
dataset: otu_micro
outcome_colname: dx
ml_methods:
method:
- glmnet
kfold: 2
ncores: 4
@@ -18,3 +17,11 @@ hyperparams:
- 0.1
- 1
- 10
exclude_param_keys:
- exclude_param_keys
- outcome_colname
- ncores
- nseeds
- find_feature_importance
- hyperparams
- paramspace_csv
File renamed without changes.
File renamed without changes.
File renamed without changes.
Binary file removed figures/example/hp_performance_glmnet.png
Binary file not shown.
Binary file removed figures/example/hp_performance_rf.png
Binary file not shown.
2 changes: 1 addition & 1 deletion quick-start.md
@@ -18,7 +18,7 @@
1. If you don't have conda/mamba yet, we recommend installing
[Mambaforge](https://mamba.readthedocs.io/en/latest/installation.html).

1. Create a conda environment with snakemake installed:
1. Create a conda environment with snakemake and pandas installed:

``` sh
mamba env create -f workflow/envs/smk.yml
6 changes: 3 additions & 3 deletions report-example.md
@@ -1,6 +1,6 @@
---
title: "ML Results"
date: "2023-01-31"
date: "2023-02-02"
output:
html_document:
keep_md: true
@@ -16,7 +16,7 @@ output:

Machine learning algorithm(s) used: glmnet and rf.
Models were trained with 10 different random
partitions of the otu-large dataset into training and
partitions of the otu_large dataset into training and
testing sets using 5-fold cross validation.
See [config/config.yaml](config/config.yaml)
for the full configuration.
@@ -33,7 +33,7 @@ for the full configuration.

## Hyperparameter Performance

<img src="figures/example/hp_performance_glmnet.png" width="80%" /><img src="figures/example/hp_performance_rf.png" width="80%" />
<img src="figures/example/dataset-otu_large/kfold-5/method-glmnet/hp_performance.png" width="80%" /><img src="figures/example/dataset-otu_large/kfold-5/method-rf/hp_performance.png" width="80%" />

## Feature Importance
