Skip to content

Enhancement improved snakemake functionality#474

Open
adamcantor22 wants to merge 151 commits into
masterfrom
Enhancement-ImprovedSnakemakeFunctionality
Open

Enhancement improved snakemake functionality#474
adamcantor22 wants to merge 151 commits into
masterfrom
Enhancement-ImprovedSnakemakeFunctionality

Conversation

@adamcantor22

@adamcantor22 adamcantor22 commented Mar 12, 2025

Copy link
Copy Markdown
Member

Pull Request Template for MMEDS

This is currently the branch in use for production analyses which is not ideal given it has not been reviewed. Better late than never.

What has changed

This PR primarily focuses on improving the quality of the LEfSe snakemake workflow, and has so far saved us enormous amounts of time in the process. Detection of possible pairwise comparisons given input tables and metadata categories is fully automatic and robust against categories with insufficient values. Also included in this PR are several new analyses that will be incorporated into snakemake workflows once the future feature table upload update is complete. These analyses include differential abundance testing using benchdamic, automatic taxonomic barplots, and formatting of humann3 outputs to use in the humann_barplot pipeline.

Also incorporates small update made by Kevin to add picrust2 pipeline as a new workflow

Checklist of pre-requisites

  • Does the code run?
  • Does the code follow the repository style?
  • Is the code tested?

How to use the feature

Start a new analysis of type 'lefse' on an existing study, use the resulting config_file.yaml to specify tables and variables for the analysis to use.

With mmeds conda env active:
To generate taxa barplot use Rscript $REPO/mmeds/tools/plot_taxa_barplot.R.
To run benchdamic use Rscript $REPO/mmeds/tools/run_benchdamic.R.
To format humann3 table use format_humann.py

Additional notes:

The structure of the repo with respect to /scripts and /tools is getting slightly confusing with respect to what should go where. Would be open to hearing suggestions for better structuring.

@adamcantor22 adamcantor22 changed the title Enhancement improved snakemake functionality TEST PR Enhancement improved snakemake functionality Feb 5, 2026
"""

metadata = pd.read_csv("tables/qiime_mapping_file.tsv", sep='\t', header=[0], skiprows=[1])
metadata = pd.read_csv("tables/qiime_mapping_file.tsv", sep='\t', header=[0], skiprows=[1], dtype='str')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're ok to get rid of the brackets around [0] and use header=0, skiprows=1

filtered_metadata = metadata.loc[metadata["#SampleID"].isin(table_df.columns)]
for var in vars:
categories = list(filtered_metadata[var].unique())
categories = [c for c in categories if str(c) != "nan"]

@kbpi314 kbpi314 Feb 9, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if tables with this case can reach this stage of processing but will this be ok with other non-"nan" representations of NA?

splits = pairwise_splits(wildcards, "lefse", config["classes"])
formatted_splits = []
for s in splits:
# Replace occurrences where class==subclass with subclass="NA", which is the default behavior, this handles the issue at the DAG level

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps replace this comment with the string of what the expected file name/string being parsed as separated should look like - something like # separated = lefse_plot.{feature_table}.{var}-{cat1}-or-{cat2}.{var}.NA though I forget the exact string that belongs here

Comment on lines +119 to +122
"""
Studies from MSQ past id 90 require no golay error correction, all others runs require
rev-comp mapping barcodes. This is a poor generalization and will need to be improved in the future.
"""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually will change into a user-inputted parameter prior to the analysis step in the user-defined config file
-was initially confusing because wasn't sure where an MSQ-containing string was being parsed; seems like a folder is being read/scanned and then these params are being determined from that as opposed to centralized in a config

Comment on lines 10 to 11
conda:
"qiime2-2020.8.0"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is qiime2-2025.4 or qiime2-2023.9 etc available now?

Comment on lines 9 to 10
conda:
"qiime2-2023.9"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a different env and thus qzas etc will be a different VERSION - might make compatibility issues down the road? maybe move everything to 2023.9 or later

"--p-n-jobs {threads} "
"--output-dir {output}"

rule alpha_rarefaction_phylogenetic:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it this is without your q2-boots true rarefaction method :c

parser <- add_argument(parser, "output-file", nargs=1, help="Taxa barplot output")
parser <- add_argument(parser, "--category", help="Optional metadata category by which to separate plot sections")
parser <- add_argument(parser, "--sort", default="top", help="Options for sorting of columns. Choose from ['top' (default), 'all', 'dominant']")
parser <- add_argument(parser, "--colors", default=1, help="Options for ordering of colorscheme. Choose from [1, 2, 3]")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1,2,3] does not seem to match [1,2,4] shown in code below lines 58-67

Comment thread mmeds/tools/R_utils.R
# Running strictly, remove more
while(i > 0) {
# Remove numeric components or short modifiers to species annotations
if (grepl("^[0-9]+$", lvl_split[i]) | (str_length(lvl_split[i]) < 4 & !lvl_split[i] %in% c("d", "k", "p", "c", "o", "f", "g"))) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need 's' here for species if we're anticipating handling of t__ / strain level tables?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do still need it, yeah, but this will probably break with a strain-level annotation... will have to think about how to handle

Comment thread mmeds/tools/R_utils.R
return(fold_df)
}

get_qval_on_pval_scale <- function(plot_mat, qval_thresh) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth adding a comment or two for these functions to clarify or explain what they do

Comment on lines +47 to +52
features_vec <- str_replace_all(features_vec, ";", "\\.")
features_vec <- str_replace_all(features_vec, " - ", "-")
features_vec <- str_replace_all(features_vec, " \\/ ", "_")
features_vec <- str_replace_all(features_vec, " ", "_")
features_vec <- str_replace_all(features_vec, "\\|", "\\.")
features_vec <- str_replace_all(features_vec, ",|\\(|\\)|\\:", "")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how often we use this may be worth bundling into a 'remove_special_char' function in R_utils

conda:
"picrust2"
shell:
"picrust2_pipeline.py "

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does picrust2_pipeline.py live? should it be in tools? Not sure I see it anywhere

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hahaha you wrote this! picrust2_pipeline.py is in the /bin of the picrust2 env, it's a built-in

Comment thread www_files/app.py
# Not testing if running on Web01
testing = not (gethostname() == 'web01')
sys.path.append("/sc/arion/projects/MMEDS/.modules/mmeds_test/lib/python3.9/site-packages")
import pandas as pd

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this import has to go after this sys.path call and the rest of the imports because of a pandas-lib-location-specific reason?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is related to the attempt to bring the website back online, so it is still in flux. but the concept here is that when on the web node, it needs to specifically be told where to look for python packages for some reason

@@ -0,0 +1,4 @@
tables:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if pc2 rules don't require a qmf - what is the purpose of the qiime_mapping_file in picrust2/tables? is there a test of pc2 / lefse combined pipeline?

@kbpi314 kbpi314 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much awaited!

@adamcantor22

Copy link
Copy Markdown
Member Author

@kbpi314 excellent review! will work on these comments and get back to you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants