Enhancement improved snakemake functionality#474
Conversation
…emente-lab/mmeds-meta into Enhancement-SnakemakeAnalysis
…emente-lab/mmeds-meta into Enhancement-SnakemakeAnalysis
Enhancement kb
| """ | ||
|
|
||
| metadata = pd.read_csv("tables/qiime_mapping_file.tsv", sep='\t', header=[0], skiprows=[1]) | ||
| metadata = pd.read_csv("tables/qiime_mapping_file.tsv", sep='\t', header=[0], skiprows=[1], dtype='str') |
There was a problem hiding this comment.
I think you're ok to get rid of the brackets around [0] and use header=0, skiprows=1
| filtered_metadata = metadata.loc[metadata["#SampleID"].isin(table_df.columns)] | ||
| for var in vars: | ||
| categories = list(filtered_metadata[var].unique()) | ||
| categories = [c for c in categories if str(c) != "nan"] |
There was a problem hiding this comment.
not sure if tables with this case can reach this stage of processing but will this be ok with other non-"nan" representations of NA?
| splits = pairwise_splits(wildcards, "lefse", config["classes"]) | ||
| formatted_splits = [] | ||
| for s in splits: | ||
| # Replace occurrences where class==subclass with subclass="NA", which is the default behavior, this handles the issue at the DAG level |
There was a problem hiding this comment.
perhaps replace this comment with the string of what the expected file name/string being parsed as separated should look like - something like # separated = lefse_plot.{feature_table}.{var}-{cat1}-or-{cat2}.{var}.NA though I forget the exact string that belongs here
| """ | ||
| Studies from MSQ past id 90 require no golay error correction, all others runs require | ||
| rev-comp mapping barcodes. This is a poor generalization and will need to be improved in the future. | ||
| """ |
There was a problem hiding this comment.
Eventually will change into a user-inputted parameter prior to the analysis step in the user-defined config file
-was initially confusing because wasn't sure where an MSQ-containing string was being parsed; seems like a folder is being read/scanned and then these params are being determined from that as opposed to centralized in a config
| conda: | ||
| "qiime2-2020.8.0" |
There was a problem hiding this comment.
Is qiime2-2025.4 or qiime2-2023.9 etc available now?
| conda: | ||
| "qiime2-2023.9" |
There was a problem hiding this comment.
this is a different env and thus qzas etc will be a different VERSION - might make compatibility issues down the road? maybe move everything to 2023.9 or later
| "--p-n-jobs {threads} " | ||
| "--output-dir {output}" | ||
|
|
||
| rule alpha_rarefaction_phylogenetic: |
There was a problem hiding this comment.
I take it this is without your q2-boots true rarefaction method :c
| parser <- add_argument(parser, "output-file", nargs=1, help="Taxa barplot output") | ||
| parser <- add_argument(parser, "--category", help="Optional metadata category by which to separate plot sections") | ||
| parser <- add_argument(parser, "--sort", default="top", help="Options for sorting of columns. Choose from ['top' (default), 'all', 'dominant']") | ||
| parser <- add_argument(parser, "--colors", default=1, help="Options for ordering of colorscheme. Choose from [1, 2, 3]") |
There was a problem hiding this comment.
[1,2,3] does not seem to match [1,2,4] shown in code below lines 58-67
| # Running strictly, remove more | ||
| while(i > 0) { | ||
| # Remove numeric components or short modifiers to species annotations | ||
| if (grepl("^[0-9]+$", lvl_split[i]) | (str_length(lvl_split[i]) < 4 & !lvl_split[i] %in% c("d", "k", "p", "c", "o", "f", "g"))) { |
There was a problem hiding this comment.
do we need 's' here for species if we're anticipating handling of t__ / strain level tables?
There was a problem hiding this comment.
we do still need it, yeah, but this will probably break with a strain-level annotation... will have to think about how to handle
| return(fold_df) | ||
| } | ||
|
|
||
| get_qval_on_pval_scale <- function(plot_mat, qval_thresh) { |
There was a problem hiding this comment.
worth adding a comment or two for these functions to clarify or explain what they do
| features_vec <- str_replace_all(features_vec, ";", "\\.") | ||
| features_vec <- str_replace_all(features_vec, " - ", "-") | ||
| features_vec <- str_replace_all(features_vec, " \\/ ", "_") | ||
| features_vec <- str_replace_all(features_vec, " ", "_") | ||
| features_vec <- str_replace_all(features_vec, "\\|", "\\.") | ||
| features_vec <- str_replace_all(features_vec, ",|\\(|\\)|\\:", "") |
There was a problem hiding this comment.
Depending on how often we use this may be worth bundling into a 'remove_special_char' function in R_utils
| conda: | ||
| "picrust2" | ||
| shell: | ||
| "picrust2_pipeline.py " |
There was a problem hiding this comment.
Where does picrust2_pipeline.py live? should it be in tools? Not sure I see it anywhere
There was a problem hiding this comment.
hahaha you wrote this! picrust2_pipeline.py is in the /bin of the picrust2 env, it's a built-in
| # Not testing if running on Web01 | ||
| testing = not (gethostname() == 'web01') | ||
| sys.path.append("/sc/arion/projects/MMEDS/.modules/mmeds_test/lib/python3.9/site-packages") | ||
| import pandas as pd |
There was a problem hiding this comment.
this import has to go after this sys.path call and the rest of the imports because of a pandas-lib-location-specific reason?
There was a problem hiding this comment.
yes, this is related to the attempt to bring the website back online, so it is still in flux. but the concept here is that when on the web node, it needs to specifically be told where to look for python packages for some reason
| @@ -0,0 +1,4 @@ | |||
| tables: | |||
There was a problem hiding this comment.
if pc2 rules don't require a qmf - what is the purpose of the qiime_mapping_file in picrust2/tables? is there a test of pc2 / lefse combined pipeline?
|
@kbpi314 excellent review! will work on these comments and get back to you |
Pull Request Template for MMEDS
This is currently the branch in use for production analyses which is not ideal given it has not been reviewed. Better late than never.
What has changed
This PR primarily focuses on improving the quality of the LEfSe snakemake workflow, and has so far saved us enormous amounts of time in the process. Detection of possible pairwise comparisons given input tables and metadata categories is fully automatic and robust against categories with insufficient values. Also included in this PR are several new analyses that will be incorporated into snakemake workflows once the future feature table upload update is complete. These analyses include differential abundance testing using benchdamic, automatic taxonomic barplots, and formatting of humann3 outputs to use in the humann_barplot pipeline.
Also incorporates small update made by Kevin to add picrust2 pipeline as a new workflow
Checklist of pre-requisites
How to use the feature
Start a new analysis of type 'lefse' on an existing study, use the resulting config_file.yaml to specify tables and variables for the analysis to use.
With mmeds conda env active:
To generate taxa barplot use
Rscript $REPO/mmeds/tools/plot_taxa_barplot.R.To run benchdamic use
Rscript $REPO/mmeds/tools/run_benchdamic.R.To format humann3 table use
format_humann.pyAdditional notes:
The structure of the repo with respect to
/scriptsand/toolsis getting slightly confusing with respect to what should go where. Would be open to hearing suggestions for better structuring.