Seqc_pipeline¶
+-
+
- Seqc_proc +
diff --git a/docs/modules/analysis/FBA/index.html b/docs/modules/analysis/FBA/index.html index 1fe5774d8f..e8eda81fd0 100644 --- a/docs/modules/analysis/FBA/index.html +++ b/docs/modules/analysis/FBA/index.html @@ -182,6 +182,7 @@
This function merges & prunes the FBA solutions between two runs of the +optimiseRxnMultipleWBMs.m function. Reaction fluxes, FBA statistics, & +shadow prices get therefore concatenated. Note that in case the sample +filenames differ from the standard, the regular expression needs to be adapted.
+[dietInfo, dietGrowthStats] = ensureHMfeasibility (hmDirectory, Diet)
+INPUTS +directory1 [char array] Directory to flux solutions from the first run +directory2 [char array] Directory to flux solutions from the second run +resultdirectory [char array] Directory to empty folder where the combined fluxes
+++will be saved.
+
OPTIONAL INPUT +set_regexp [char array] Specifying alternative regular expression in case
+++the sample filenames are different from their style +than the standard optimiseRxnMultipleWBMs.m output.
+
Authors
+Tim Hensen, 2024
modified by Jonas Widder, 10/2024 & 11/2024 (function can now also merge dirs +with unequal number of samples + added set_regexp option)
Some strains in AGORA2 only have one strain. These strains are not moved to the +pan model folder, resulting in some soecies not being captured in the +models. By converting the strain reconstructions with only a single +strain to the species folder, you can solve this problem.
+completeSpeciesFolder (strainPath, panSpeciesPath)
+strainPath Path to folder with strain reconstructions
panSpeciesPath Path to folder with pan species models
FOUR FUNCTIONS:
++++
+- +
Retrieve metabolite IDs corresponding to the given metabolite names AND/OR
- +
+
+- Retrieve metabolite names corresponding to a given metabolite ID or some
- +
reactions
+- +
+
+- Convert metabolite tranport reactions e.g., DM_glc_d[bc] to D-Glucose or
- +
EX_glc_D[c] to metabolite name e.g., D-Glucose.
+- +
+
+- Suggest similar names for metabolite names provided that are
- +
not found in the data base
+
metNames - EITHER – Cell array of metabolite names (strings) or metabolilte transport +reactions (can be mixed)for which IDs are required or flag indicating to skip step (0).
metIDs - Cell array of metabolite IDs (strings) – required or flag indicating to skip step (0).
suggestSimilar - Flag indicating whether to generate suggestions (1) or not (0)
foundVMHIDs - Cell array of metabolite IDs corresponding to the input names.
foundMetNames - Cell array of metabolite Names corresponsing to the input IDs
similarMets - Cell array of possible matches for each unfound metabolite name, – when searching for names.
Other requirements: COBRA toolbox installation (and paths set)
+EXAMPLE OF USE:
+VMHIDs = {‘DM_gam[bc]’; ‘malttr’}; +metNames = {‘D-glucose’, ‘fructose’, ‘carbon’}; +% metNames = false; +% VMHIDs = false; +% suggestSimilar = false; +suggestSimilar = true;
+% [foundVMHIDs, foundMetNames, similarMets] = convertVMHIDName(metNames,VMHIDs, suggestSimilar);
+Author: - Anna Sheehy & Tim Hensen - 18/07/2024
+Dimensionality reduction of high-dimensional measures (e.g. microbiome relative abundances +or reaction relative abundances) by RPCA following data preprocessing OR beta-diversity measures by PCoA, +with the aim to:
+++1. Find whether there are general differences between groups of a metadata variable of interest +(e.g. disease vs ctrl status), in case variable is categorical. +2. Identify variables from the measures (e.g. microbial taxa, reactions) +which contribute the most to the first principle component (PC1) of RPCA & +therefore its explained variance. +3. Perform linear regression on PC1 ~ metadata variable (e.g. Sex, disease vs Ctrl status) +to find metadata variables which might be important confounders in +follow-up analysis in case they are significantly correlated & +explain a lot of the variance of PC1 from RPCA/PCoA.
+
measuresTable – [table] Contains high-dimensional measures (e.g. microbiome relative abundances +or reaction relative abundances), with columns = samples & +rows = measured groups (e.g. taxa/reactions).
metadataTable – [table] Contains metadata information for samples (e.g. +sex), with columns = variables (e.g. Sex) & rows = samples.
varOfInterest – [string] Variable (e.g. Sex or disease status) +contained in metadata.
results_path – [string] Directory path, where results should be stored +(figures & statistical results in spreadsheet format).
varargin
numLoadings – [numeric] Number of PC loadings which shall be +displayed in plot of PC strongest feature +contributions. +Defaults to 15 loadings.
inputDataType – [chars/string] Specify whether data input is of type “abundance” or +“betaDiversityMatrix”, which results in alternative processing routes +(the input is treated case-insensitive). +Defaults to “abundance”.
PCofInterest – [numeric] Principle component/principle coordinate of +interest, which analysis will be performed on. +Defaults to PC 1.
In form of tables & plots into dir at results_path location.
+Authors
+Jonas Widder, 12/2024 & 01/2025
Download and unpack agora2 +INPUT +directory Directory indicating where to donwload AGORA2
+OUTPUT +AGORA2_dir Directory to AGORA2 folder
+Author: Tim Hensen, 2024
+Filters a table with metabolites for their presence in selected +compartment(s) of the unpersonalized Harvey & Harvetta WBM models and returns +both the present & absent metabolites in seperate tables. +This process ensures that all metabolites of interest are actually present +in the models & fluxes can be calculated for.
+This function finds the optimal number of workers for the HM models being investigated +INPUT: +modelPath Path to folder with COBRA models +OPTIONAL INPUT +subSetSize Size of the random subset of models used for testing
+OUTPUT +fig Figure showing the average speedup factor for each tested
+++configuration of workers.
+
Create lookup file for checking which reactions and metabolites are +present in which AGORA2 strains
+OUTPUT +lookupFilePath Path to the generated lookup file
+Authors: Tim Hensen, 2024
+Create lookup file for checking which reactions and metabolites are +present in which AGORA2 models
+OUTPUT +lookupFilePath Path to the generated lookup file
+Authors: Tim Hensen, 2024
+Generates stacked bar plots from relative abundances of taxa for single or multiple samples.
+input_relAbundances – [table] Contains taxa and their relative abundances for +all samples. Requires column ‘Taxon’ and one or more +sample columns.
saveDir – [chars/string] Path to the directory where the +stacked bar plot should be saved.
Jonas Widder, 12/2024 & 01/2025
Title: Directory disk use calculator +Author: Wiley Barton +Modified code sources:
++++
+- assistance and reference from a generative AI model [ChatGPT](https://chatgpt.com/)
- +
clean-up and improved readability
+
Last Modified: 2025.01.29 +Part of: Persephone Pipeline
+This function determines the size of a selected directory
+repoPathSeqC (char) : Path to the SeqC repository
outputPathSeqC (char) : Path for SeqC output
fileIDSeqC (char) : Unique identifier for file processing
procKeepSeqC (logical) : Keep all files (true/false)
maxMemSeqC (int) : Maximum memory allocation for SeqC
maxCpuSeqC (int) : Maximum CPU allocation for SeqC
maxProcSeqC (int) : Maximum processes for SeqC
debugSeqC (logical) : Enable debug mode (true/false)
…
+MATLAB
Docker installed and accessible in the system path
======================================================================================================#
+Create lookup file for checking which reactions and metabolites are +present in which AGORA2 taxa
+OUTPUT +lookupFilePath Path to the generated lookup file
+Authors: Tim Hensen, 2024
+getVMHID - Retrieve metabolite IDs corresponding to the given metabolite names.
+mets - Cell array of metabolite names (strings)
suggest - Flag indicating whether to generate suggestions (1) or not (0)
metIDs - Cell array of metabolite IDs corresponding to the input names.
suggestedMets - Cell array of possible matches for each unfound metabolite name.
Example
+metaboliteNames = {‘glucose’, ‘fructose’}; +[metIDs, suggestedMets] = getVMHID(metaboliteNames, 1);
+Other requirements: COBRA toolbox installation and initialisation
+Author: - Anna Sheehy - 16/07/2024
+Function for obtaining statistics on AGORA2 mapping
+INPUT +rawPath: path to the unfiltered microbiome data +marsPath: path to mapped microbiome data +saveDir: path to folder where the results are saved
+This function applies constraints to the whole-body metabolic model +metabolite concentrations have to be given in uM +organ weights have to be given in g +Please note that reaction specific constraints are applied at the end of +the function, which have been derived from the literature.
+function modelConstraint = physiologicalConstraintsHMDBbased(model,IndividualParameters, ExclList, Type, InputData, Biofluid, setDefault,ExclMet,ExclMetAbbr)
+INPUT +model model structure +IndividualParameters Structure containing physiological parameters,
+++as generated in standardPhysiolDefaultParameters
+
should be assigned to
+‘Parsed_hmdbConc.xlsx’ or ‘direct’). If +‘direct’ InputData must be provided
+metabolites, 2nd to data points (will be set as lb and ub)
+‘bc’,’u’,’csf’
+then a default concentration ranges will be used to calculate the constraints (default: 1) +Note that the default metabolite concentration +ranges are specified in IndividualParameters +for the different biofluid compartments.
+excluded from the constraint application (default: 0)
+excluded
+OUTPUT +modelConstraint model structure with updated constraints
+Ines Thiele, 2015-2019
+Based on MARS mapping input, this function generates a plot which +visualizes how much of an effect the addition of currently unmapped taxa +to the microbiome community model would have in terms of read coverage, +starting from the most abundant taxa.
+mars_preprocessedInput – [table] MARS output “preprocessed_input” which +contains read counts per pre-mapped taxa.
absentTaxa_abundanceMetrics – [table] MARS output listing all +unmapped taxa together with summary statistics on their relative +abundance across samples (mean relative abundance is of importance for +the function).
readCounts – [table] Original data table containing +read counts per taxa.
results_path – [string] Directory path, where results should be stored (figure).
numAbsentTaxaToInvestigate – [numerical] Number of unmapped taxa +whose effect should be tested for & plotted. +Optional, defaults to the full list of all unmapped taxa.
Authors
+Jonas Widder, 11/2024 & 01/2025
Filters regression results from moderation analysis for significantly +correlating metabolites fluxes/bacterial taxa. Then stratifies the filtered +flux/rel. abundances data for the moderator & performs new statistical analysis +on the stratified data. +Notes: The moderator needs to be categorical.
+data – [Table] Processed flux/relative abundances data.
metadata – [Table] Metadata containing ID & pot. additional variables +(confounders, moderators)
formula – [String] Regression formula in Wilkinson notation.
regressionResults – [Struct] Structure containing tables for flux & +rel. abundances regression results.
moderationThreshold_usePValue – [Boolean] Cutoff threshold being either +FDR or pValue. +Default = true.
moderationThreshold – [Numerical] Cutoff threshold for maximal FDR value from +moderation analysis a metabolite/bacterial taxa needs to +pass that it will be included in subsequent +analysis of stratified fluxes/taxa. +Default = 0.05 (5%).
saveDir – [Character array] Path to working directory.
statResults – [Struct] Structure containing tables for +regression results for moderator stratified data +of significant hits from initial from moderation +analysis regressions. +Will be empty, if regression does not contain Flux +or relative abundance.
+Jonas Widder, 11/2024
+This function prunes FBA solution results obtained in +optimiseRxnMultipleWBM.m and saves the slimmed down solution results in a +new folder. The function first creates a new folder and generates paths +for the flux results in that folder. Then, only the following data is +loaded: ‘rxns’,’ID’,’sex’,’f’, and’stat’. If microbiome data was available: +‘speciesBIO’,’shadowPriceBIO’, and ‘relAbundances’. Then, the solutions +are saved to the new paths.
+INPUT +FBAsolutionDir Character array with path to FBA solutions.
+OUTPUT +smallFBAsolutionPaths Path to slimmed down FBA results
+AUTHOR: Tim Hensen, October 2024
+analyseWBMs predicts the maximal fluxes for a list of user-defined +reactions (rxnList). All predicted are further described in +analyseWBMsol.m.
+[FBA_results, pathsToFilesForStatistics] = analyseWBMs (mWBMPath, fluxPath, rxnList)
+INPUTS: +mWBMPath Path (character array) to WBMs +fluxPath Path to directory where the results will be stored +rxnList Cell array of VMH metabolites to investigate.
+++Example: rxnList = {‘DM_trp_L[bc], DM_met_L[bc],’Brain_trp_L[c], +Heart_met_L[x]}. Note that demand reactions are +automatically added if they are not present in the models.
+
OPTIONAL INPUTS +numWorkersOptimization Number of workers that will perform FBA in parallel. Note
+++that more workers does not necessarily make the function +faster. It is generally not recommended to set numWorkersOptimization +equal to the number of available cores (see: +feature(‘numCores’)) as linear solvers can already +support multi-core linear optimisation, thus resulting in +unnessessary overhead. On computers with 8 cores or less, it +is recommended to set numWorkersOptimization to 1. On a computer with +36 cores, an optimal configuration of numWorkersOptimization=6 was +found.
+
, and .w vectors are stored in the result. Default = true. +It is recommended to set saveFullRes
+paramFluxProcessing Structured array with optional parameters:
+++.NumericalRounding defines how much the predicted flux values are +rounded. A defined value of 1e-6 means that a flux value of +2 + 2.3e-8 is rounded to 2. A flux value of 0 + 1e-15 would be rounded to +exactly zero. This rounding factor will also be applied to the shadow +price values. If microbiome relative abundance data is provided, the same +rounding factor will be applied to the relative abundance data.
+Default parameterisation: +- paramFluxProcessing.NumericalRounding = 1e-6;
+Example: +- paramFluxProcessing.NumericalRounding = 1e-6;
+paramFluxProcessing.NumericalRounding = 1e-6;
+.RxnRemovalCutoff defines the minimal number of samples for which a +unique reaction flux could be obtained, before removing the reaction for +further analysis. This parameter can be expressed as +* fraction: the fraction of samples with unique values, +* SD: the standard deviation across samples, and +* count: the counted number of unique values. If microbiome relative +abundance data is provided, the same removal cutoff factor will be +applied to the relative abundance data.
+Default parameterisation: +- paramFluxProcessing.RxnRemovalCutoff = {‘fraction’,0.1};
+Examples: +- paramFluxProcessing.RxnRemovalCutoff = {‘fraction’,0.1}; +- paramFluxProcessing.RxnRemovalCutoff = {‘SD’,1}; +- paramFluxProcessing.RxnRemovalCutoff = {‘count’,30};
+paramFluxProcessing.RxnRemovalCutoff = {‘fraction’,0.1};
+.RxnEquivalenceThreshold defines the minimal threshold of when +functionally identical flux values are predicted, and are thus part of +the same linear pathways. The threshold for functional equivalence is +expressed as the R2 (r-squared) value after performing a simple linear +regression between two reactions.
+Default parameterisation: +- paramFluxProcessing.RxnEquivalenceThreshold = 0.999;
+Example: +- paramFluxProcessing.RxnEquivalenceThreshold = 0.999;
+paramFluxProcessing.RxnEquivalenceThreshold = 0.999;
+.fluxMicrobeCorrelationType defines the method for correlating the +predicted fluxes with microbial relative abundances. Note that this +metric is not used if mWBMs are not present. The available correlation +types are: +* regression_r2: the R2 (r-squared) value from pairwised linear regression on the +predicted fluxes against microbial relative abundances. +* spearman_rho: the correlation coefficient, rho obtained from pairwise +Spearman nonparametric correlations between predicted fluxes and +microbial relative abundances.
+Default parameterisation: +- paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’;
+Examples: +- paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’; +- paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘spearman_rho’;
+
results will be saved (Default = pwd)
+‘tomlab_cplex’, ‘gurobi’, ‘mosek’
+OUTPUT +results Structured array with FBA fluxes, solver statistics, and
+++paths to the flux table and raw flux results
+
analyseWBMsol loads the FBA solutions produced in +analyseWBMs.m, prepares the FBA results for further +analyses, and produces summary statistics into the flux results.
+PART 1: Loading the flux solutions and corresponding shadow prices
PART 2: Converting the microbial biomass shadow prices
associated with each optimised reaction flux to human readable +tables and producing statistics on microbial influences on the flux +results. +- PART 3: Isolate the microbial component of the fluxes by +subtracting the germ-free fluxes in a sex specific manner. In this +part, descriptive statistics are also obtained on the fluxes across +samples and metabolites are removed based on a user defined +threshold (see thresholds input) +- PART 4: In this part, microbe-flux associations defined by the +species biomass shadow prices (PART 2) are further quantified by +performing Spearman correlations on the fluxes and microbe relative +abundances. +- PART 5: Reactions with identical or near identical fluxes across +samples (user defined) are grouped and collapsed into a single +result. +- PART 6: Save all results
+[FBA_results, pathsToFilesForStatistics] = analyseWBMsol (fluxPath,paramFluxProcessing, fluxAnalysisPath)
+INPUTS: +fluxPath Character array with path to .mat files produced in
+++optimiseRxnMultipleWBM.m
+
paramFluxProcessing Structured array with optional parameters:
+++.numericalRounding defines how much the predicted flux values are +rounded. A defined value of 1e-6 means that a flux value of +2 + 2.3e-8 is rounded to 2. A flux value of 0 + 1e-15 would be rounded to +exactly zero. This rounding factor will also be applied to the shadow +price values. If microbiome relative abundance data is provided, the same +rounding factor will be applied to the relative abundance data.
+Default parameterisation: +- paramFluxProcessing.numericalRounding = 1e-6;
+Example: +- paramFluxProcessing.numericalRounding = 1e-6;
+paramFluxProcessing.numericalRounding = 1e-6;
+.rxnRemovalCutoff defines the minimal number of samples for which a +unique reaction flux could be obtained, before removing the reaction for +further analysis. This parameter can be expressed as +* fraction: the fraction of samples with unique values, +* SD: the standard deviation across samples, and +* count: the counted number of unique values. If microbiome relative +abundance data is provided, the same removal cutoff factor will be +applied to the relative abundance data.
+Default parameterisation: +- paramFluxProcessing.rxnRemovalCutoff = {‘fraction’,0.1};
+Examples: +- paramFluxProcessing.rxnRemovalCutoff = {‘fraction’,0.1}; +- paramFluxProcessing.rxnRemovalCutoff = {‘SD’,1}; +- paramFluxProcessing.rxnRemovalCutoff = {‘count’,30};
+paramFluxProcessing.rxnRemovalCutoff = {‘fraction’,0.1}; +RxnEquivalenceThreshold +.rxnEquivalenceThreshold defines the minimal threshold of when +functionally identical flux values are predicted, and are thus part of +the same linear pathways. The threshold for functional equivalence is +expressed as the R2 (r-squared) value after performing a simple linear +regression between two reactions.
+Default parameterisation: +- paramFluxProcessing.rxnEquivalenceThreshold = 0.999;
+Example: +- paramFluxProcessing.rxnEquivalenceThreshold = 0.999;
+paramFluxProcessing.rxnEquivalenceThreshold = 0.999;
+.fluxMicrobeCorrelationMetric defines the method for correlating the +predicted fluxes with microbial relative abundances. Note that this +metric is not used if mWBMs are not present. The available correlation +types are: +* regression_r2: the R2 (r-squared) value from pairwised linear regression on the +predicted fluxes against microbial relative abundances. +* spearman_rho: the correlation coefficient, rho obtained from pairwise +Spearman nonparametric correlations between predicted fluxes and +microbial relative abundances.
+Default parameterisation: +- paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’;
+Examples: +- paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’; +- paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘spearman_rho’;
+
Character array with path to directory where all +results will be saved.
+Tim Hensen, July 2024
Jonas Widder, 11/2024 (small bugfixes)
This function creates personalised host-microbiome WBM models (mWBMs) by joining +microbiome community models with unpersonalised WBM models in a +sex-specific manner. The models are parameterised on a predefined diet.
+[modelStats, summaryStats, dietInfo, dietGrowthStats] = createBatchMWBM (mgpipePath, mWBMPath, metadataPath)
+INPUTS +mgpipePath: Path to microbiome community models created by the
+++microbiome modelling toolbox.
+
saved
+OPTIONAL INPUTS +Diet Diet option: ‘EUAverageDiet’ (default) +numWorkersCreation Number of cores used for parallelisation.
+++Default = 4.
+
ensureHMfeasibility.m function and check if +the generated models can grow on the diet. +Default = true.
+This function supports one user adapted male and one user adapted female +WBM or personalised germ free WBMs for each sample with a microbiome +community model. Default is empty. If wbmDirectory is empty, +Harvey/Harvetta version 1.04c are used.
+OUTPUTS +modelStats Table with summary statistics on the generated WBMs:
+++gender, number of reactions, metabolites,constrainsts, +and taxa.
+
statistics in the modelStats variable.
+models can grow on the given diet.
+dietGrowthStats Table with statistics on
+Authors: Tim Hensen, 2023, 2024
+createForestPlot generates a forest plot to display confidence intervals for estimates.
+createForestPlot (estimates, ci, names, pValues, plotTitle, xTitle, hideLegend)
+estimates - Vector of estimates (e.g., effect sizes or log fold changes)
ci - Matrix of confidence intervals [n x 2] with lower and upper bounds in columns.
names - Cell array of labels for each data point.
pValues - Vector of p-values corresponding to each estimate.
plotTitle - String, title of the plot.
xTitle - String, label for the x-axis (typically “Effect Size” or “Log Fold Change”)
hideLegend - Logical, if true, hides the legend (default is false)
This function creates personalised host-microbiome WBM models by joining +microbiome community models with unpersonalised WBM models. The models +are parameterised on a predefined diet.
+Example: modelHM = createMWBM(microbiota_model, WBM_model, ‘EUAverageDiet’)
+INPUTS +microbiota model: Microbiome community model created by the
+++microbiome modelling toolbox.
+
model
+Diet Diet option: ‘EUAverageDiet’ (default)
+Output +modelHM: Personalised host-microbiome WBM model
+Tim Hensen, expanded WBM parameterisation July 2024 +Tim Hensen, improved mWBM annotation November 2024
+createVulcanoPlot generates a volcano plot to visualize the relationship +between regression estimates and p-values.
+createVulcanoPlot (estimates, pValues, names, plotTitle, xTitle, yTitle)
+estimates - [vector] Regression estimates for each data point (e.g., effect sizes or log fold changes)
pValues - [vector] p-values corresponding to each estimate.
names - [cell array] Labels or names for each data point.
plotTitle - [string] Title of the plot.
xTitle - [string] Label for the x-axis (e.g., “Effect Size” or “Log Fold Change”)
yTitle - [string] Label for the y-axis (e.g., “-log10(p-value)”)
None - This function generates a volcano plot in the current figure.
+Example
+‘Volcano Plot’, ‘Log Fold Change’, ‘-log10(p-value)’)
+This function checks for the WBMs in the specified mWBMPath to grow on the +given diet, finds a diet that makes all WBMs feasible and propagates this +diet onto the models. The functions works as follows: +1) Load the WBMs from the directory and test which models are feasible on +the diet. +2) If any WBM is not feasible, collect all infeasible WBMs, open all diet +reactions, and test if the WBMs can grow on any diet. If one or more WBMs +cannot grow when opening all diet reactions, the statistics are +collected and the function stops. The user will need to debug the WBMs +manually. +3) If all previously infeasible WBMs can grow when all diet reactions are +opened,missing diet components are searched using getMissingDietModelHM. +If getMissingDietModelHM cannot find a diet after one function call, this +function is run again until a hard limit of 10 iterations is reached. If +this limit is reached, no feasible diet can be found for all models and +the user will need to manually debug the infeasible models. If missing +diet components are found in one WBMs, these component are automatically propagated +to the other WBMs infeasible on the original diet before searching for +more missing diet components +4) If a new diet is found for which all previously infeasible WBMs can grow, +the updated diet is propagated to all models in the mWBMPath. +5) The updated diet is stored as an output variable for this function and +for each model, its feasibility on the original and updated diet is +recorded.
+[dietInfo, dietGrowthStats] = ensureHMfeasibility (mWBMPath, Diet)
+INPUTS +mWBMPath Path to directory where the HM models are saved +Diet Diet option: ‘EUAverageDiet’ (default)
+OUTPUTS +dietInfo Structured variable containing the missing diet
+++components and the updated diet propagated to all models. +This variable is empty if no updated diet could be found.
+
the diet, which WBMs could not grow on any diet, +and which WBMs could not grow on an updated diet.
+Authors
+Tim Hensen, 2024
The code is manually tested and altered with bug fixes. This function +performs identical to running q = mafdr(p, ‘BHFDR’, true) +using the Matlab bioinformatics toolbox.
+FDR_BH Performs Benjamini-Hochberg FDR correction on a vector of p-values.
+++q = FDR_BH(p) returns the FDR-corrected p-values q for the input vector p.
++
+- Input:
- +
p - a vector of p-values (assumed to be in the interval [0, 1])
+- Output:
- +
q - a vector of FDR corrected p-values, in the same order as p.
+- The function implements the procedure described in:
- +
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery +rate: a practical and powerful approach to multiple testing.
+- Example:
- +
pvals = [0.01, 0.04, 0.03, 0.002, 0.07]; +qvals = fdr_bh(pvals);
+
This function inputs a list of p values and outputs the variables +pValueGroups and colourGroups. A maximum of three groups are defined, +FDR<0.05, P<0.05, and P>0.05.
+[groupsToPlot, colours] = findPvalCategories (pValues)
+INPUT +pValues
+OUTPUTS +groupsToPlot Table that categorises the indices of the p-value list. +colours Cell array with numerical encodings for the
+++category-associated colours.
+
Generates stacked bar plots from mean relative abundances of phyla pre- & +post-mapping to a model-database in MARS.
+input_relAbundances_preMapping – [chars/string] Path to table which contains phyla taxa and their +mean relative abundances across samples pre-mapped to a model-database +(original dataset). This spreadsheet is a standard output generated in MARS.
input_relAbundances_postMapping – [chars/string] Path to table which contains phyla taxa and their +mean relative abundances across samples post-mapped to a model-database +(filtered dataset). This spreadsheet is a standard output generated in MARS.
saveDir – [chars/string] Path to the +directory where the stacked bar plot should be saved.
mappingDatabase_name – [chars/string] The name of the model-database used for +mapping in MARS, which will be displayed in the title of the graph. +Defaults to ‘’.
Jonas Widder, 12/2024 & 01/2025
Checking for MATLAB toolbox dependencies.
Processing and validating the metadata file.
Cross-referencing metadata with microbiome data (if provided).
Setting up directory structures for Persephone results.
[initialised, paths, statToolboxInstalled] = initPersephone (resultPath, metadataPath, readsTablePath)
+resultPath - (char)
metadataPath - (char) Path to the metadata file (.csv or .xlsx)
readsTablePath - (char)
initialised - (logical)
paths - (struct)
statToolboxInstalled - (logical)
updatedMetadataPath - (char)
MATLAB R2020b or newer.
Parallel Computing Toolbox (mandatory).
Statistics and Machine Learning Toolbox (recommended for analysis).
Notes
+Metadata must contain a column of sample IDs and a column of sex information.
Metadata and reads table will be synchronized to ensure consistent samples across analyses.
Any discrepancies in variable naming (e.g., “ID”, “Sex”) are automatically resolved.
Processed metadata is saved to an updated file.
Example
+resultPath = ‘./results’; +metadataPath = ‘./data/metadata.csv’; +readsTablePath = ‘./data/readsTable.csv’; +[initialised, paths, statToolboxInstalled] = initPersephone(resultPath, metadataPath, readsTablePath);
+Tim Hensen, January 2025
Bram Nap, July 2024: Automatic creation of output folders
This function loads the smallest possible combination of WBM model fields +needed to perform FBA. Loading the minimal model can +decrease loading times by 6X.
+Author: Tim Hensen, November 2024
+This functions performs linear or logistic regressions on either flux +data or microbial relative abundances. The function supports control +variables & moderators in the regression formulae.
+results = performRegressions (data,metadata,formula)
+INPUT +data: Table with flux or microbiome data. The first column must be the ID
+++column.
+
metadata: Metadata table. The first column must be the ID column. +formula: Must contain either “Flux or “relative_abundance” as predictor. +exponentiateLogOdds: Logical variable on if a logistic regression estimate +should be exponentiately to obtain the odds ratios.
+The performStatistics function is part of the PSCM toolbox. This +function performs regression analyses on processed fluxes and gut +microbiome relative abundance data in large cohort studies. This function needs a +minimal sample size of 50 samples. This function performs different +statistical analyses depending on the user defined inputs.
+If the response variable is binary and no confounders are given, +wilcoxon tests will be performed for the fluxes and microbiome +abundances. +If the response variable is binary and confounders are given, multiple +logistic regressions will be performed. +If the response variable is continuous and no confounders are given, +simple linear regressions will be performed. +If the response variable is continuous and confounders are given, +multiple linear regressions will be performed.
+The fluxes are z-transformed before statistical testing. The microbiome +relative abundances are log transformed before testing. +If a moderator variable is given and a regression is +performed, a moderation analysis on the moderator with the predictor will be performed.
+performStatistics (statPath, pathToProcessedFluxes, pathToWbmRelAbundances, metadataPath, response, confounders, moderator, microbeCutoff,threshold)
+INPUT +statPath Path (character array) to working directory +pathToProcessedFluxes Path to processed flux data +pathToWbmRelAbundances Path to microbial relative abundances +metadataPath Path to metadata file +response Name of response variables (char or string)
+OPTIONAL INPUTS +confounders Cell array with the names of confounding variables to
+++be included. Default = empty.
+
microbe needs to present to be analysed. Default = 0.1. (10%)
+This function takes a table of physiological parameters for an individual or multiple individuals +(listed in inputs) and adjusts the paramteres of a provided WBM or Harvey/Harvetta to +create a personalised WBM
+The calculation of physiological parameteres will be performed based on +the available data e.g., if the user provides Cardiac output, this given +value is assigned, if the user provides stroked volume and no CO, CO is +calculated based on SV. If neither values are provided, CO will be +calculated based on XXX. All details of parameter calculation are +detailed in the personalised model output and in the excel file provided +by this function.
+INPUTS
+REQUIRED: +metadata Can be a cell array or a table or a path to an
+++excel file. In any case, this should contain +a list with a minimum of three column and option +for additional columns for each individual for which a model should be created:
+
This function takes a table of metabolomic parameters for an individual or multiple individuals +(listed in inputs) and adjusts the paramteres of a provided WBM or Harvey/Harvetta to +create a personalised WBM
+The calculation of metabolomic parameteres will be performed based on +the available data
+INPUTS
+REQUIRED: +metabolomicParamters Can be a cell array or a table or a path to an
+++excel file. In any case, this should contain +a list with a minimum of three columns and option +for additional columns for each individual for which a model should be created:
+|”compartment” |”blood” |”urine” | +|”unit” |”mg/dL” | “mg/dL” | +|”Individual1” |180 |50 | +|”Individual2” |150 |66 |
+Sex must be provided for each model. The compartment must be specified and must be +one of the following: csf, u, blParameters that can be personalised include: +All of those metabolites which are found in +the +*for each the default unit used by the model +is provided. For x, y, z- unit conversion +will be performed for all known common units of +measurement.
+
OPTIONAL: +WBM/(s) User can provide a WBM (Whole Body Metabolic Model) or a path to multiple models.
+++If no model is provided, either Harvey or Harvetta will be loaded from the COBRA +toolbox, depending on sex in provided physiological data. +If multiple models are provided, the model ID must +match the model name provided in the physiological data.
+
other outputs. +Default = current directory
+file from the COBRA toolbox (default = +EUAverageDiet) +** This function can also be run in isolation +but if a user wants to personalise both +physiological and metabolomic- this function +should alway be run and NOT +persWBMmetabolomics in isolation!
+OUTPUTS
+(stored as “persModelName.mat”). All updated +paramteres are described in +model.IndividualisedParameters
+adjustments. When a WBM or multiple WBMs have +been given as inputs, the controlWBMs are +exact copies of those. When Harvey or +Harvetta are used, controlWBM is copies of +Harvey/Harvetta.
+parameter and how it was calculated
+author: Anna Sheehy November 2024 +Step One: Read in the available data, check all data is valid +Define the input parser
+This function loads a metadata file into a MATLAB table. It accounts for: +- Preserving the full variable names by disabling automatic truncation. +- Capturing variable units if provided in the second row of the file. +- Storing the captured units in the ‘VariableUnits’ property of the table.
+metadata = readMetadataForPersephone (metadataPath)
+metadataPath – A string specifying the path to the metadata file +(in CSV or XLSX format).
+metadata – A MATLAB table containing the processed metadata. +Variable units, if present in the second row of the file, +are stored in the table’s ‘VariableUnits’ property.
+Notes
+If variable names exceed 64 characters, they are truncated, but the +true variable names are preserved in the ‘VariableDescriptions’ property.
Warnings related to variable name truncation are suppressed to avoid +unnecessary console output.
AUTHORS
+Tim Hensen, January 2025
+This function integrates MATLAB with the MARS Python repository to process +microbiome taxonomy and read abundance data. The MARS pipeline maps +joined microbiome data to a metabolic model database for downstream analysis.
+pythonPath char with path to the Python executable (e.g., Anaconda installation)
marsRepoPath char with path to the directory containing the MARS pipeline repository.
cutoffMARS double with relative abundance cutoff threshold below which relative abundances – are treated as zero. Defaul is 1e-6.
outputExtensionMARS char with desired file format for output files (e.g., ‘csv’, ‘txt’)
flagLoneSpecies logical that indicates if genus name is omitted in species names – (e.g., “copri” instead of “Prevotella copri”).
taxaSplit char with delimiter used to separate taxonomic levels.
removeCladeExtensionsFromTaxa Logical which specifies whether clade name extensions should be removed – from microbiome taxa for compatibility with AGORA2 models.
whichModelDatabase char that specifies the metabolic model database to use. – Options: ‘AGORA2’, ‘APOLLO’, ‘full_db’, ‘user_db’ (requires ‘userDatabase_path’).
userDatabase_path char with path to a user-defined model database file, required only if ‘whichModelDatabase’ is set to ‘user_db’.
sample_read_counts_cutoff double variable. Threshold for read counts below which samples are excluded. – Only applies if ‘readsTablePath’ contains absolute read counts.
readsTablePath char with path to the file containing microbiome taxonomy and read abundance data.
taxaTable char with path to the file containing taxonomic assignments. – Required only if taxonomy is not included in ‘readsTablePath’.
The function does not return variables but writes processed results
to the specified output directory in the MARS pipeline.
Python and the MARS repository must be installed and accessible.
COBRA Toolbox for MATLAB must be properly set up if further +integration with COBRA models is required.
Notes
+Ensure ‘pythonPath’ is set to the correct Python executable version +and environment containing the required MARS dependencies.
If ‘readsTablePath’ already includes taxonomy assignments, ‘taxaTable’ is optional.
modified by Jonas Widder, January 2025 (added generateStackedBarPlot_PhylumMARScoverage.m)
+Performs non-parametric statistical tests (Wilcoxon or Kruskal-Wallis) on metabolic data
+resultTable = runNonparametricTests (data, metadata, predictor, response)
+data – (table) m x n table containing: +* m samples +* n reactions/taxa with flux or abundance values +* First column must contain sample IDs
metadata – (table) Sample metadata containing: +* Sample IDs in first column +* Response variable in separate column
predictor – (char) Name of predictor variable (‘Flux’ or ‘relative_abundance’)
response – (char) Name of response variable in metadata
resultTable – (table) Statistical results with columns: +* name: Reaction/Taxa identifier +* groups: Cell array of group names +* n: Array of group sizes +* statistic: Test statistic (Wilcoxon or Kruskal-Wallis) +* p-value: Unadjusted p-value +* FDR: Benjamini-Hochberg adjusted p-value +* effectSize: Effect size (r for Wilcoxon, η² for Kruskal-Wallis) +* confidence: 95% confidence interval (only for binary comparisons)
+Example
+% Binary comparison (Wilcoxon test) +resultTable = runNonparametricTests(fluxData, metadata, ‘Flux’, ‘Disease’)
+% Multiple group comparison (Kruskal-Wallis test) +resultTable = runNonparametricTests(abundanceData, metadata, ‘relative_abundance’, ‘Treatment’)
+Note
+Automatically selects appropriate test based on number of groups
Data is automatically normalized before testing
Missing values (NaN) are removed before analysis
Multiple testing correction uses Benjamini-Hochberg FDR
Minimum group size of 3 samples is required for valid testing
This pipeline runs throught the entire process of mapping taxonomy +assigned reads to AGORA2/APOLLO models to create relative abundance files +through the MARS pipeline. The relative panSpecies abundance file is then +used to create microbiome models through mgpipe (Microbiome Modelling +Toolbox PMID:xx). Before HM models are created there is an option to +personalise the WBMs with physiological or metabolomic data through the +xx function. The HM creation can then either connect the microbiomes with +base-WBMs based on the sex associated with the sample ID in the metadata +file or connect it with a personalised WBM. Based on a list of reactions, +the HM models will be solved and the flux results will be processed and +various analyses such as shadow price and regression are performed. Then +based on the make-up of the study used to run this pipeline, +mixed-effect linear regression modelling can be used for various response +variables while taking into account the effect of predictors. All +response and predictor variables need to be present in the metadata file. +By default the code will do perform two runs where in one flux data is +added as an additional predictor and in the other the relative abundance +of panSpecies models. Finally the results are visualised in various +plots. The user can indicate which parts of the pipeline should be run +with various “flags” described in the input variables below. If the code +or matlab stops unexpectedly, parts of pipeline that are already +performed will be skipped. If the entire pipeline should be rerun - +delete the progress.mat file. If only certain parts of the pipeline +should be rerun, open the progress.mat file and alter change the value of +the relevant flags to false or 0.
+Most of the inputs are optional as the pipeline is designed that parts of +the analysis could be skipped. This however means that some of the +optional variables are actually required variables in order to run a +certain part of the pipeline. Behind the input description indication if +the variable is required is given.
+fullPipeline()
+Inputs: +:Required inputs: * resultPath – A string with the path to the directory where all
+++++results should be stored
++
+- +
metadataPath – A string with the path to the metadata file.
solver – String, indicates which solver is to be used for solving WBMs. +OPTIONAL, defaults to GLPK. It is highly recommended to use gurobi or cplex
+of sequencing data is performed by SeqC.
+SeqC repository is located.
+of SeqC is stored.
+sample IDs for FASTQ files (e.g., sample_id.txt).
+retained (e.g., post-QC FASTQ). If false, only the final +MARS output is kept, and all intermediary content is deleted +once SeqC completes.
+allowed in gigabytes.
+allowed for processing.
+processes allowed.
+messages should be included in the log.
+MARS should be run. OPTIONAL, Defaults to true
+repository is stored. REQUIRED
+the path is described in .txt. REQUIRED
+taxonomic assignments. OPTIONAL if the +taxonomic assignments are already in the +readsTablePath. REQUIRED if not.
+to either OTUs or taxonomic assignments. +REQUIRED
+the results. OPTIONAL, defaults to [resultPath, +filesep, ‘ResultMARS’].
+are considered 0. OPTIONAL, defaults to 1e-6.
+saved outputs. OPTIONAL, defaults to ‘csv’.
+the name of the species e.g. Prevotella copri. +If genus name is in species name set to false. +Otherwise set to true. OPTIONAL, defaults to +false.
+taxonomic levels. OPTIONAL, defaults to ‘; ‘.
+MARS descriptive statistics should be run. +OPTIONAL, defaults to true
+reads and taxonomic assignments are combined. +Could be the same file as readsTablePath. OPTIONAL, +default to [Ask TIMHuls to Create].
+species file, an output from MARS. OPTIONAL, +defaults to [outputPathMARS, filesep, +‘present’, filesep, ‘present_species.’, +outputExtensionMARS] if MARS is run with this +pipeline. If run online, the path needs to be +specified explicitly.
+statistics of the MARS output should be stored. +OPTIONAL, default to [outputPathMARS, filesep, +‘Analysis’].
+MgPipe/microbiome model creation should be run. +Defaults to true. OPTIONAL, defaults to true
+panSpecies models. REQUIRED
+species file, an output from MARS. OPTIONAL, +defaults to [outputPathMARS, filesep, +‘present’, filesep, ‘present_species.’, +outputExtensionMARS] if MARS is run with this +pipeline. If run online, the path needs to be +specified explicitly.
+netUptake are calculated for the microbiome +models via fastFVA. OPTIONAL, defaults to false
+used to create microbiome models. OPTIONAL, +default to use all available cores.
+MgPipe should be saved. OPTIONAL, defaults to +[resultPath, filesep, ‘ResultMgPipe’];
+WBM personalisation should be run. OPTIONAL, +defaults to false.
+personalised WBM are stored. OPTIONAL, defaults +to [resultPath, filesep, ‘personalisedWBMs’].
+to constrain WBM models with. OPTIONAL, defaults +to EUAverageDiet.
+HM creation and descriptive statistics should +be run. OPTIONAL, defaults to true
+community microbiome models generated from +MgPipe are stored. OPTIONAL, if this pipeline +is used to run MgPipe it defaults to +[mgpipeDir, filesep, ‘Diet’]. If MgPipe was +used outside of this function state the path +explicitly.
+OPTIONAL, defaults to [resultPath, filesep, +‘HMmodels’].
+to constrain HM models with. OPTIONAL, defaults +to EUAverageDiet.
+used to create HM WBMs. OPTIONAL, +default to use all available cores.
++++
+- paths.fba.flagFBA: Boolean, indicates if Part 6 of the pipeline:
- +
FBA should be run. OPTIONAL, defaults to true
+- hmDir: String with the path to the directory with the
- +
models to use. OPTIONAL, if HM models were made +it defaults to hmDirectory used in part 4. If +HM models were created outside this function, +the path has to be explicitly stated.
+- fluxDir: String with the path where the flux results
- +
should be stored. OPTIONAL, defaults to +[resultPath, filesep, ‘fluxResults’].
+- rxnList: Character array contain reaction IDs for
- +
reactions that are to be optimised. If the user +wants to solve non-existing demand reactions, +simply add DM_ infront of the desired +metabolite and add to the rxnList variable. +REQUIRED.
+- mgpipeDir: String with the path to the relative abundance
- +
species file, an output from MARS. OPTIONAL, +defaults to [outputPathMARS, filesep, +‘present’, filesep, ‘present_species.’, +outputExtensionMARS] if MARS is run with this +pipeline. If run online, the path needs to be +specified explicitly.
+- thresholds: 1x3 numerical vector. OPTIONAL, defaults to
- +
[90, 90, 0.999]
++
+- +
thresholds(1): “metabolite removal threshold”
This threshold indicates the maximum allowable +number of duplicate flux results between +samples expressed in the percentage of total +samples. Reactions that exceed this threshold +will be removed.
++
+- +
thresholds(2): “Microbe presence threshold”
This threshold indicates the maximum allowable +number of samples where a microbe is absent +expressed by the percentage of total samples. +Microbes that exceed this threshold will be +removed from the analysis.
++
+- +
thresholds(3): “Reaction grouping threshold”
This threshold indicates the minimal pairwise +Spearman correlation between fluxes across the +population where reactions are grouped and +handled as one result. +ADD TEXT 4th input for std dev.
+
++used to solve the WBM models. OPTIONAL, +default to 1.
+
.w vectors are stored in the result. OPTIONAL, +defaults to true. It is recommended to set +saveFullRes.
+flux results are stored. OPTIONAL, defaults to +[fluxDir, filesep, ‘fluxAnalysis’]
+statistical analysis should be run. OPTIONAL, +defaults to true
+statistical analyses and the figures should be +saved. OPTIONAL, defaults to [resultPath, +filesep, ‘statisticsResults’].
+analysed flux results. OPTIONAL, defaults to +[resultPath, filesep, ‘fluxResults’, filesep, +‘fluxAnalysis’].
+panSpecies file generated by MARS. OPTIONAL, +defaults to [resultPath, filesep, resultMars, +filesep, present,filesep, ‘present_species.csv’]
+found in the metdata file that the user wants +to use as response for statistical analysis. +REQUIRED
+found in the metadata file that are used as +confounders in the statistical analyses. +OPTIONAL, defaults to an empty cell array.
+will be performed. A moderation analysis can +only be performed if confounders. OPTIOANL, +defaults to false.
+samples a microbe needs to present in to be to +be analysed. The same variable as thresholds(2) +OPTIONAL, defaults to 0.1
+of the ‘processed_fluxes_Thr_x_xxx file, where +the threshold value is x_xxx. The same value as +thresholds (3). OPTIONAL, defaults to 0.999.
+Outputs:
+fullPipeline(resultPath,metadataPath,’marsRepoPath’, marsRepoPath,… +‘pythonPath’, pythonPath, ‘OTUTable’, OTUTable, ‘readsTablePath’ readsTablePath,… +‘panModelsPath’, panModelsPath, ‘rxnList’, rxnList, ‘response’, response)
+Example (MARS is run online and is skipped in the this pipeline):
+Title: SeqC as Flux Pipeline MATLAB Wrapper +Author: Wiley Barton +Modified code sources:
+++matlab script structure: Tim Hensen, runMars.m, 2025.01 +assistance and reference from a generative AI model [ChatGPT](https://chatgpt.com/)
+++clean-up and improved readability
+
Last Modified: 2025.02.09 +Part of: Persephone Pipeline
+This function builds and runs the SeqC Docker image from a MATLAB environment. +It ensures necessary databases exist and coordinates execution with the MARS pipeline.
+repoPathSeqC (char) : Path to the SeqC repository
outputPathSeqC (char) : Path for SeqC output
fileIDSeqC (char) : Unique identifier for file processing
procKeepSeqC (logical) : Keep all files (true/false)
maxMemSeqC (int) : Maximum memory allocation for SeqC
maxCpuSeqC (int) : Maximum CPU allocation for SeqC
maxProcSeqC (int) : Maximum processes for SeqC
debugSeqC (logical) : Enable debug mode (true/false)
…
+MATLAB
Docker installed and accessible in the system path
Determine Operating System
+Validates the inputs provided in the paths structure for the Persephone +pipeline. This function ensures that all required fields, flags, and +parameters are present, appropriately formatted, and meet the expected +constraints. Validation is performed across multiple pipeline components, +including MARS, mgPipe, WBM personalization, HM creation, FBA, and statistics.
+validated = validatePersephoneInput (paths)
+paths Structure containing all required input fields for the Persephone – pipeline. Fields include: +- General flags and paths (e.g., flagSeqC, resultPath) +- Component-specific flags and parameters (e.g., flagMars, +marsRepoPath, flagMgPipe, metadataPath, etc.)
+validated Logical scalar indicating whether the validation succeeded – (true) or failed (false).
+Notes
+Ensure all required flags and fields in the paths structure are defined +prior to invoking this function. Undefined or mismatched fields will +result in validation failure.
This function is designed to be modular and extendable for future +additions to the Persephone pipeline.
Tim Hensen, January 2025
+See also: validateattributes, validatestring, COBRA Toolbox
+