SchlossLab · kelly-sovacool · Jan 18, 2023 · Jan 18, 2023 · Jan 20, 2023 · Jan 20, 2023
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -21,13 +21,13 @@ jobs:
         with:
           persist-credentials: false
           fetch-depth: 0
-      - uses: actions/setup-python@v4
+      - uses: conda-incubator/setup-miniconda@v2
         with:
-          python-version: ${{ matrix.python-version }}
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install pytest pytest-parallel
+          python-version: "3.11"
+          miniforge-variant: Mambaforge
+          miniforge-version: latest
+          activate-environment: smk
+          environment-file: workflow/envs/smk.yml
       - name: Lint workflow
         uses: snakemake/[email protected]
         with:
@@ -42,4 +42,8 @@ jobs:
           args: "archive --forceall --cores 2 --use-conda --conda-frontend mamba --conda-cleanup-pkgs cache  --show-failed-logs --all-temp --configfile config/test.yaml"
 #      - name: Test with pytest
 #        run: |
-#          pytest --workers 2 .tests/
+#          pytest -n 2 .tests/
+      - name: Test with pytest
+        shell: bash -el {0}
+        run: |
+          pytest -n 2 workflow/scripts/
diff --git a/.gitignore b/.gitignore
@@ -11,7 +11,7 @@ results/*/runs
 !.tests/
 __pycache__/
 .DS_Store
-figures/otu*
-results/otu*
+figures/dataset*
+results/dataset*
 report_otu*
 *.zip
diff --git a/Dockerfile b/Dockerfile
@@ -1,6 +1,6 @@
 FROM condaforge/mambaforge:latest
 LABEL io.github.snakemake.containerized="true"
-LABEL io.github.snakemake.conda_env_hash="6aa289536136aae2d34bac6dce9ce47d037da888ed09e2c8ada989c90ef10658"
+LABEL io.github.snakemake.conda_env_hash="a57a1be27a188ebf9bb5feda054b3c8e501423ae80bcd6c24c221ca36de41d15"
 
 # Step 1: Retrieve conda environments
 
@@ -42,21 +42,24 @@ COPY workflow/envs/mikropml.yml /conda-envs/3f83a46ff5ea715a12fde6ee46136b0b/env
 
 # Conda environment:
 #   source: workflow/envs/smk.yml
-#   prefix: /conda-envs/457b7b75191d44b96e5086432876e333
+#   prefix: /conda-envs/bbc262640c3353e62cad877627dd3174
 #   name: smk
 #   channels:
 #     - conda-forge
 #     - bioconda
 #   dependencies:
+#     - pandas
+#     - pytest
+#     - pytest-xdist
 #     - snakemake=7
 #     - snakedeploy
 #     - zip
-RUN mkdir -p /conda-envs/457b7b75191d44b96e5086432876e333
-COPY workflow/envs/smk.yml /conda-envs/457b7b75191d44b96e5086432876e333/environment.yaml
+RUN mkdir -p /conda-envs/bbc262640c3353e62cad877627dd3174
+COPY workflow/envs/smk.yml /conda-envs/bbc262640c3353e62cad877627dd3174/environment.yaml
 
 # Step 2: Generate conda environments
 
 RUN mamba env create --prefix /conda-envs/b42323b0ffd5d034544511c9db1bdead --file /conda-envs/b42323b0ffd5d034544511c9db1bdead/environment.yaml && \
     mamba env create --prefix /conda-envs/3f83a46ff5ea715a12fde6ee46136b0b --file /conda-envs/3f83a46ff5ea715a12fde6ee46136b0b/environment.yaml && \
-    mamba env create --prefix /conda-envs/457b7b75191d44b96e5086432876e333 --file /conda-envs/457b7b75191d44b96e5086432876e333/environment.yaml && \
+    mamba env create --prefix /conda-envs/bbc262640c3353e62cad877627dd3174 --file /conda-envs/bbc262640c3353e62cad877627dd3174/environment.yaml && \
     mamba clean --all -y
diff --git a/config/README.md b/config/README.md
@@ -1,22 +1,49 @@
-# General configuration
+# Additional Dependencies
 
-To configure this workflow, modify [`config/config.yaml`](/config/config.yaml) according to your needs.
+Besides snakemake, you will also need `pandas` to run this workflow:
+
+`mamba install pandas`
+
+# General Configuration
+
+To configure this workflow, modify [`config/config.yaml`](/config/config.yaml) 
+according to your needs.
 
 **Configuration options:**
 
-  - `dataset_csv`: the path to the dataset as a csv file. 
-  - `dataset_name`: a short name to identify the dataset.
-  - `outcome_colname`: column name of the outcomes or classes for the dataset. If blank, the first column of the dataset will be used as the outcome and all other columns are features.
-  - `ml_methods`: list of machine learning methods to use. Must be [supported by mikropml or caret](http://www.schlosslab.org/mikropml/articles/introduction.html#the-methods-we-support).
+  - `dataset`: a short name to identify the dataset. The csv file for your 
+    dataset is assumed to be located at `data/{dataset}.csv`.
+    The dataset should contain one outcome column with all other columns as
+    features for machine learning.
+  - `outcome_colname`: column name of the outcomes or classes for the dataset. 
+    If blank, the first column of the dataset will be used as the outcome and 
+    all other columns are features.
+  - `method`: list of machine learning methods to use. Must be 
+    [supported by mikropml or caret](http://www.schlosslab.org/mikropml/articles/introduction.html#the-methods-we-support).
   - `kfold`: k number for k-fold cross validation during model training. 
-  - `ncores`: the number of cores to use for `preprocess_data()`, `run_ml()`, and `get_feature_importance()`. Do not exceed the number of cores you have available.
-  - `nseeds`: the number of different random seeds to use for training models with `run_ml()`. This will result in `nseeds` different train/test splits of the dataset.
-  - `find_feature_importance`: whether to calculate feature importances with permutation tests (`true` or `false`). If `false`, the plot in the report will be blank. 
-  - `hyperparams`: override the default model hyperparameters set by mikropml for each ML method (optional). Leave this blank if you'd like to use the defaults. You will have to set these if you wish to use an ML method from caret that we don't officially support. 
-
-We also provide [`config/test.yaml`](/config/test.yaml), which uses a smaller dataset so 
-you can first make sure the workflow runs without error on your machine 
-before using your own dataset and custom parameters.
+  - `ncores`: the number of cores to use for `preprocess_data()`, `run_ml()`, 
+    and `get_feature_importance()`. Do not exceed the number of cores you have available.
+  - `nseeds`: the number of different random seeds to use for training models 
+    with `run_ml()`. This will result in `nseeds` different train/test splits 
+    of the dataset.
+  - `find_feature_importance`: whether to calculate feature importances with 
+    permutation tests (`true` or `false`). If `false`, the plot in the report 
+    will be blank. 
+  - `hyperparams`: override the default model hyperparameters set by mikropml 
+    for each ML method (optional). Leave this blank if you'd like to use the 
+    defaults. You will have to set these if you wish to use an ML method from 
+    caret that we don't officially support. 
+  - `paramspace_csv`: if you'd like to use a custom csv file to build the 
+    [paramspace](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration) for `run_ml`, specify the path to the csv file here. If left blank (`None`), the
+    paramspace will be built from the parameters in the configfile.
+  - `exclude_param_keys`: keys in the configfile to exclude from the parameter 
+    space. All keys in the configfile not listed in `exclude_param_keys` will be
+    included as wildcards for `run_ml` and other rules. This option is ignored
+    if `paramspace_csv` is set.
+
+We also provide [`config/test.yaml`](/config/test.yaml), which uses a smaller 
+dataset so you can first make sure the workflow runs without error on your
+machine before using your own large dataset and custom parameters.
 
 The default and test config files are suitable for initial testing,
 but we recommend using more cores (if available) and

diff --git a/config/config-gha.yml b/config/config-gha.yml
@@ -0,0 +1,17 @@
+dataset: otu_large
+outcome_colname: dx
+method:
+ - glmnet
+ - rf
+kfold: 5
+ncores: 4
+nseeds: 100
+find_feature_importance: false
+exclude_param_keys:
+ - exclude_param_keys
+ - outcome_colname
+ - ncores
+ - nseeds
+ - find_feature_importance
+ - hyperparams
+ - paramspace_csv
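
Reviewer note: the README change in this PR says that every config key not listed in `exclude_param_keys` becomes a wildcard. The sketch below illustrates how a config like `config/config-gha.yml` could expand into parameter combinations under that rule. It is a minimal illustration, not the workflow's actual code: `build_paramspace` and `instance_path` are hypothetical helpers, and the sorted `key-value` path layout is an assumption chosen only to mirror the example figure paths in this PR (`dataset-otu_large/kfold-5/method-glmnet`). Seeds are left out because this diff does not show how `nseeds` maps to seed values.

```python
from itertools import product


def build_paramspace(config):
    """Return one dict per parameter combination (hypothetical helper)."""
    excluded = set(config.get("exclude_param_keys") or [])
    # Keep only non-excluded keys and wrap scalar values in a list so that
    # every remaining key contributes one axis to the Cartesian product.
    params = {
        key: value if isinstance(value, list) else [value]
        for key, value in sorted(config.items())
        if key not in excluded
    }
    keys = list(params)
    return [dict(zip(keys, combo)) for combo in product(*params.values())]


def instance_path(combo):
    """Join key-value pairs into a directory-style path (assumed layout)."""
    return "/".join(f"{key}-{value}" for key, value in combo.items())


# Mirrors config/config-gha.yml from this PR.
config = {
    "dataset": "otu_large",
    "outcome_colname": "dx",
    "method": ["glmnet", "rf"],
    "kfold": 5,
    "ncores": 4,
    "nseeds": 100,
    "find_feature_importance": False,
    "exclude_param_keys": [
        "exclude_param_keys",
        "outcome_colname",
        "ncores",
        "nseeds",
        "find_feature_importance",
        "hyperparams",
        "paramspace_csv",
    ],
}

combos = build_paramspace(config)
paths = [instance_path(combo) for combo in combos]
for path in paths:
    print(path)
```

Only `dataset`, `kfold`, and `method` survive the exclusion list here, so this config yields two combinations, one per ML method.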
diff --git a/config/config.yaml b/config/config.yaml
@@ -1,11 +1,19 @@
-dataset_csv: data/processed/otu-large.csv
-dataset_name: otu-large
+dataset: otu_large
 outcome_colname: dx
-ml_methods:
+method:
  - glmnet
  - rf
 kfold: 5
 ncores: 8
 nseeds: 10
 find_feature_importance: true
-hyperparams:
+hyperparams:
+paramspace_csv:
+exclude_param_keys:
+ - exclude_param_keys
+ - outcome_colname
+ - ncores
+ - nseeds
+ - find_feature_importance
+ - hyperparams
+ - paramspace_csv
diff --git a/config/custom-paramspace.csv b/config/custom-paramspace.csv
@@ -0,0 +1,11 @@
+dataset,kfold,method,seed
+otu_large,5,glmnet,100
+otu_large,5,glmnet,101
+otu_large,5,glmnet,102
+otu_large,5,glmnet,103
+otu_large,5,glmnet,104
+otu_large,5,rf,105
+otu_large,5,rf,106
+otu_large,5,rf,107
+otu_large,5,rf,108
+otu_large,5,rf,109
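
Reviewer note: unlike a parameter space derived from the configfile, this CSV lists each combination explicitly, so seeds 100 to 104 run only with `glmnet` and seeds 105 to 109 only with `rf`. The sketch below reads the same rows with the standard library to make that grouping visible. It is only an illustration under stated assumptions: `read_paramspace` is a hypothetical helper, and the workflow itself presumably loads the file with `pandas` rather than `csv`.

```python
import csv
import io

# Same content as config/custom-paramspace.csv in this PR.
CSV_TEXT = """dataset,kfold,method,seed
otu_large,5,glmnet,100
otu_large,5,glmnet,101
otu_large,5,glmnet,102
otu_large,5,glmnet,103
otu_large,5,glmnet,104
otu_large,5,rf,105
otu_large,5,rf,106
otu_large,5,rf,107
otu_large,5,rf,108
otu_large,5,rf,109
"""


def read_paramspace(handle):
    """Read one dict per CSV row; all values stay strings (hypothetical helper)."""
    return list(csv.DictReader(handle))


rows = read_paramspace(io.StringIO(CSV_TEXT))

# Group seeds by method to show that the rows are not a full Cartesian product.
seeds_by_method = {}
for row in rows:
    seeds_by_method.setdefault(row["method"], []).append(row["seed"])

for method, seeds in seeds_by_method.items():
    print(method, seeds)
```

A full product of two methods and ten seeds would give 20 rows; the CSV defines 10, which is the kind of custom layout `paramspace_csv` is meant to allow.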
diff --git a/config/custom-paramspace.yaml b/config/custom-paramspace.yaml
@@ -0,0 +1,19 @@
+dataset: otu_large
+outcome_colname: dx
+method:
+ - glmnet
+ - rf
+kfold: 5
+ncores: 8
+nseeds: 10
+find_feature_importance: true
+hyperparams:
+paramspace_csv: 'config/custom-paramspace.csv'
+exclude_param_keys:
+ - exclude_param_keys
+ - outcome_colname
+ - ncores
+ - nseeds
+ - find_feature_importance
+ - hyperparams
+ - paramspace_csv
diff --git a/config/glmnet.yaml b/config/glmnet.yaml
diff --git a/config/robust.yaml b/config/robust.yaml
@@ -1,7 +1,6 @@
-dataset_csv: data/processed/otu-large.csv
-dataset_name: otu-large
+dataset: otu_large
 outcome_colname: dx
-ml_methods:
+method:
  - glmnet
  - rf
  - rpart2
@@ -26,4 +25,11 @@ hyperparams:
       - 42
       - 83
       - 166
-
+exclude_param_keys:
+ - exclude_param_keys
+ - outcome_colname
+ - ncores
+ - nseeds
+ - find_feature_importance
+ - hyperparams
+ - paramspace_csv
diff --git a/config/test.yaml b/config/test.yaml
@@ -1,7 +1,6 @@
-dataset_csv: data/processed/otu-micro.csv
-dataset_name: otu-micro
+dataset: otu_micro
 outcome_colname: dx
-ml_methods:
+method:
  - glmnet
 kfold: 2
 ncores: 4
@@ -18,3 +17,11 @@ hyperparams:
       - 0.1
       - 1
       - 10
+exclude_param_keys:
+ - exclude_param_keys
+ - outcome_colname
+ - ncores
+ - nseeds
+ - find_feature_importance
+ - hyperparams
+ - paramspace_csv
diff --git a/data/processed/otu-large.csv → data/otu_large.csv b/data/processed/otu-large.csv → data/otu_large.csv
diff --git a/data/processed/otu-micro.csv → data/otu_micro.csv b/data/processed/otu-micro.csv → data/otu_micro.csv
diff --git a/data/processed/otu-mini-bin.csv → data/otu_mini_bin.csv b/data/processed/otu-mini-bin.csv → data/otu_mini_bin.csv
diff --git a/figures/example/dataset-otu_large/kfold-5/method-glmnet/hp_performance.png b/figures/example/dataset-otu_large/kfold-5/method-glmnet/hp_performance.png
diff --git a/figures/example/dataset-otu_large/kfold-5/method-rf/hp_performance.png b/figures/example/dataset-otu_large/kfold-5/method-rf/hp_performance.png
diff --git a/figures/example/hp_performance_glmnet.png b/figures/example/hp_performance_glmnet.png
diff --git a/figures/example/hp_performance_rf.png b/figures/example/hp_performance_rf.png
diff --git a/quick-start.md b/quick-start.md
@@ -18,7 +18,7 @@
     1. If you don't have conda/mamba yet, we recommend installing
        [Mambaforge](https://mamba.readthedocs.io/en/latest/installation.html).
 
-    1. Create a conda environment with snakemake installed:
+    1. Create a conda environment with snakemake and pandas installed:
 
        ``` sh
        mamba env create -f workflow/envs/smk.yml

diff --git a/report-example.md b/report-example.md
@@ -1,6 +1,6 @@
 ---
 title: "ML Results"
-date: "2023-01-31"
+date: "2023-02-02"
 output:
   html_document:
     keep_md: true
@@ -16,7 +16,7 @@ output:
 
 Machine learning algorithm(s) used: glmnet and rf.
 Models were trained with 10 different random
-partitions of the otu-large dataset into training and
+partitions of the otu_large dataset into training and
 testing sets using 5-fold cross validation.
 See [config/config.yaml](config/config.yaml) 
 for the full configuration.
@@ -33,7 +33,7 @@ for the full configuration.
 
 ## Hyperparameter Performance
 
-<img src="figures/example/hp_performance_glmnet.png" width="80%" /><img src="figures/example/hp_performance_rf.png" width="80%" />
+<img src="figures/example/dataset-otu_large/kfold-5/method-glmnet/hp_performance.png" width="80%" /><img src="figures/example/dataset-otu_large/kfold-5/method-rf/hp_performance.png" width="80%" />
 
 ## Feature Importance