In-house scripts used in the paper "Multi-Omics Integration Can be Used to Rescue Metabolic Information for Some of the Dark Region of the Pseudomonas putida Proteome"
The scripts in this section do not depend on the results of any other scripts in this repository. They are listed in the order in which they were run.
names.txt: A mapping between the internally used mass spec data filenames and the filenames from the online repositories.
valid_defects.txt: A list of mass defects corresponding to biologically plausible PTMs, used to filter the ANN-SoLo output.
SLT01_coEx-processDecoyLibrary.py: Removes from the NIST mouse spectral library any peptides that could also be generated by P. putida. This script requires the NIST .msp file and Pseudomonas_putida_KT2440_110.faa.
SLT01_spectraST-commands.txt: The specific commands used when processing spectral libraries with SpectraST. These commands are run on the output of the philosopher pipeline and the filtered NIST mouse spectral library from above.
SLT01_coEx-annsoloToFlashLFQ.py: Processes the ANN-SoLo output, applies FDR control, filters implausible PTMs, and then formats the output for quantification by FlashLFQ. The script runs FlashLFQ in the smithchemwisc/flashlfq:1.0.3 Docker container as a subprocess. This script requires a concatenated target + decoy spectral library file, the .mztab outputs of ANN-SoLo, valid_defects.txt, and names.txt.
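The PTM filtering step described for SLT01_coEx-annsoloToFlashLFQ.py can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: it assumes valid_defects.txt holds one mass value per line, and the function names, the 0.01 Da tolerance, and the example values are all hypothetical.

```python
from bisect import bisect_left


def load_valid_defects(lines):
    """Parse one mass value per line, skipping blanks and '#' comments (assumed format)."""
    values = []
    for line in lines:
        text = line.strip()
        if text and not text.startswith("#"):
            values.append(float(text))
    return sorted(values)


def is_plausible(mass_shift, valid_defects, tol=0.01):
    """Return True if mass_shift is within tol (Da) of any value in the sorted list."""
    i = bisect_left(valid_defects, mass_shift)
    # Only the values just below and just above the insertion point can be closest.
    neighbours = valid_defects[max(i - 1, 0):i + 1]
    return any(abs(mass_shift - v) <= tol for v in neighbours)


# Illustrative values only; the real list lives in valid_defects.txt.
defects = load_valid_defects(["15.9949", "", "79.9663", "# comment", "0.0"])
shifts = [15.9951, 42.5, 79.97, -0.002]
kept = [s for s in shifts if is_plausible(s, defects)]
print(kept)
```

Sorting the list once and checking only the two neighbouring values keeps each lookup cheap when many identifications are screened.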
The scripts in this section do not depend on the results of any other scripts in this repository. They are listed in the order in which they were run.
SLT01_MAFFTalign.py: Runs MAFFT on all orthogroups. This script requires a .faa file.
SLT01_corrEvoRenameMSAgenes.py: Renames genes in the MSAs to their species for use in Pargenes. This script requires the MSA files from MAFFT and a .faa file.
SLT01_corrEvoPargenesInput.py: Cleans and formats the MSAs for Pargenes. This script requires the processed MSAs generated above.
SLT01_corrEvoFixedTopo-scheduler.py: Runs RAxML-ng on all orthogroups with the species tree as a topological constraint. This script requires the output directory from running Pargenes with the optional ASTRAL call.
SLT01_corrEvoTreeCMP-cleanNewick2.py: Processes the Newick outputs of the topologically constrained RAxML-ng runs to work with TreeCMP. This script requires a directory of the original .faa files and a directory of the Newick outputs from SLT01_corrEvoFixedTopo-scheduler.py.
SLT01_corrEvoTreeCMP-fullComparison.py: Breaks the TreeCMP inputs into blocks and runs TreeCMP on those blocks in parallel to do an all-vs-all comparison of trees. This script requires a directory of processed Newick files from SLT01_corrEvoTreeCMP-cleanNewick2.py.
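The blocking idea behind SLT01_corrEvoTreeCMP-fullComparison.py can be illustrated with a short sketch. It shows only how an all-vs-all comparison can be split into independent jobs; it does not call TreeCMP, and the function names, block size, and file names are hypothetical.

```python
from itertools import combinations_with_replacement


def make_blocks(items, block_size):
    """Split a list into consecutive blocks of at most block_size items."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


def block_jobs(blocks):
    """Return (i, j) block-index pairs with i <= j, so each unordered pair is covered once."""
    return list(combinations_with_replacement(range(len(blocks)), 2))


def pairs_in_job(blocks, i, j):
    """List the tree pairs that a single job is responsible for."""
    if i == j:
        a = blocks[i]
        return [(a[x], a[y]) for x in range(len(a)) for y in range(x + 1, len(a))]
    return [(p, q) for p in blocks[i] for q in blocks[j]]


# Hypothetical file names standing in for the processed Newick files.
trees = [f"OG{n:04d}.newick" for n in range(7)]
blocks = make_blocks(trees, 3)
jobs = block_jobs(blocks)
all_pairs = [pair for i, j in jobs for pair in pairs_in_job(blocks, i, j)]
print(len(blocks), len(jobs), len(all_pairs))
```

Because the jobs share no state, each (i, j) pair could be handed to a separate worker process.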
The scripts in this section are listed in the order in which they were run. Note that this section covers the within-species structural similarity analysis that feeds into the guilt-by-association (GBA) model, not the between-species structural similarity analysis.
SLT01_trimAF.py: Trims low-confidence regions at the termini of proteins as a preprocessing step for alignment. This script requires the proteins.dill file from SLT01_baselineDataSetup.py in the GBA integration section and a directory of .pdb files from the AlphaFold predicted protein structure database.
SLT01_TMalignScheduler.py: Runs TM-align on all P. putida proteins. This script requires the directory of cleaned .pdb files from SLT01_trimAF.py.
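As an illustration of the terminal trimming performed before alignment, the sketch below removes low-confidence residues from both ends of a per-residue confidence profile while leaving internal residues untouched. The threshold of 70 and the function name are assumptions made for this example; the cutoff and logic used by SLT01_trimAF.py may differ.

```python
def trim_termini(plddt, threshold=70.0):
    """Return (start, end) slice bounds that drop low-confidence residues at the termini only."""
    start = 0
    while start < len(plddt) and plddt[start] < threshold:
        start += 1
    end = len(plddt)
    while end > start and plddt[end - 1] < threshold:
        end -= 1
    return start, end


# Hypothetical per-residue confidence scores for an eight-residue protein.
scores = [35.2, 48.9, 82.1, 91.0, 55.0, 88.4, 61.3, 40.0]
start, end = trim_termini(scores)
core = scores[start:end]
print(start, end, core)
```

The low-scoring residue in the middle of the example (55.0) is retained because only the termini are trimmed.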
The scripts in this section are dependent on the results of the previous sections. They are listed in the order in which they were run.
names_seq_input.csv: Data from UniProt.
test_set.txt: The proteins held out as a test set for the prediction model. Created by SLT01_termCentric-filterSimilarities.py.
SLT01_baselineDataSetup.py: Parses go.obo into a list of Python objects and a networkx graph, collects initial annotation information, and makes a list of protein objects. These data structures are pickled with dill for use by other scripts in the analysis. This script requires the Gene Ontology description file go.obo, Pseudomonas_putida_KT2440_110.faa, names_seq_input.csv, proteins.dat from BioCyc, GO annotations from the Pseudomonas Genome Database, and the results from running NetGO2.0.
SLT01_makeEdgelist.py: Formats raw similarity data into edgelists. This script requires the output of SLT01_coEx-annsoloToFlashLFQ.py, the output of SLT01_corrEvoTreeCMP-fullComparison.py, the results of running a Diamond all-vs-all search, the output of Rockhopper, the output of InterProScan and SignalP 5.0, STRINGdb data, and the output of SLT01_TMalignScheduler.py.
SLT01_combineEdgelist.py: Combines individual edgelist files for use by the first within-species model. This script requires the output of SLT01_makeEdgelist.py as well as proteins.dill and terms.dill from SLT01_baselineDataSetup.py.
SLT01_termCentric-filterSimilarities.py: Trains and runs inference for the initial protein-protein similarity model. This script requires the output of SLT01_combineEdgelist.py as well as proteins.dill and terms.dill from SLT01_baselineDataSetup.py.
SLT01_termCentric-inputData.py: Formats the data for the term transfer model. This script requires the output of SLT01_termCentric-filterSimilarities.py as well as proteins.dill and terms.dill from SLT01_baselineDataSetup.py.
SLT01_termCentric-semiSupervisedModel.py: Fits the semi-supervised term transfer model. This script requires the output of SLT01_termCentric-inputData.py.
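To make the edgelist steps (SLT01_makeEdgelist.py and SLT01_combineEdgelist.py) more concrete, here is one way several per-source edgelists could be merged into a single table with one row per protein pair. The data layout, function name, source names, and scores are hypothetical and chosen only for illustration; the repository's scripts may organize the data differently.

```python
from collections import defaultdict


def combine_edgelists(edgelists):
    """Merge {source_name: [(protein_a, protein_b, score), ...]} into one row per protein pair."""
    names = sorted(edgelists)
    table = defaultdict(dict)
    for name, edges in edgelists.items():
        for a, b, score in edges:
            key = tuple(sorted((a, b)))  # treat edges as undirected
            table[key][name] = score
    # None marks a pair that is absent from a given source.
    rows = {pair: [feats.get(n) for n in names] for pair, feats in table.items()}
    return names, rows


# Illustrative identifiers and scores only.
edgelists = {
    "coexpression": [("PP_0001", "PP_0002", 0.81), ("PP_0002", "PP_0003", 0.12)],
    "structure": [("PP_0002", "PP_0001", 0.64)],
}
columns, rows = combine_edgelists(edgelists)
print(columns)
print(rows)
```

Sorting each pair before using it as a key means that (A, B) and (B, A) from different sources end up in the same row.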
The scripts in this section are listed in the order in which they were run. They only depend on SLT01_baselineDataSetup.py.
SLT01_RUPEEwebdriver.py: Automates searching P. putida proteins with the online RUPEE search tool. This script requires a directory of .pdb files downloaded from the AlphaFold predicted protein structure database that have been processed by SLT01_trimAF.py.
SLT01_RUPEEmapNamesAndcollectInputData.py: Maps names from PDB hits to UniProt IDs. This script requires the output of SLT01_RUPEEwebdriver.py.
SLT01_RUPEEpairwiseAlignments.py: Runs NWalign on the hits and collects pairwise similarity information. This script requires the output of SLT01_RUPEEmapNamesAndcollectInputData.py and Pseudomonas_putida_KT2440_110.faa.
SLT01_RUPEEsemisupervisedInput.py: Collects the input information and formats it for the term transfer model. This script requires the output of SLT01_RUPEEpairwiseAlignments.py and all outputs of SLT01_baselineDataSetup.py.
SLT01_RUPEEsemisupervisedModel.py: Fits the semi-supervised random forest model. This script requires the output of SLT01_RUPEEsemisupervisedInput.py.
All of the following scripts were run after all of the above scripts.
The scripts in this section are listed in the order in which they were run.
SLT01_GOenrichmentModel.stan: The statistical model for the GO enrichment analysis in Figure 5. This model is run by SLT01_GOenrichmentModel-driver.py.
SLT01_GOenrichmentModel-driver.py: Formats data and runs SLT01_GOenrichmentModel.stan. This script requires the outputs of SLT01_baselineDataSetup.py, SLT01_termCentric-semiSupervisedModel.py, and SLT01_RUPEEsemisupervisedModel.py.
SLT01_GOexpectationAnalysis.stan: The statistical model for the predicted term count model in Figure 6. This model is run by SLT01_GOexpectationAnalysis-driver.py.
SLT01_GOexpectationAnalysis-driver.py: Formats data and runs SLT01_GOexpectationAnalysis.stan. This script requires the outputs of SLT01_baselineDataSetup.py, SLT01_termCentric-semiSupervisedModel.py, and SLT01_RUPEEsemisupervisedModel.py.
The scripts in this section are listed in the order in which they were run.
SLT01_InterproEnrichmentModel.stan: The statistical model for assessing InterPro term enrichments among PUFs. Used for Figure S3. This model is run by SLT01_InterproEnrichmentModel-driver.py.
SLT01_InterproEnrichmentModel-driver.py: Formats the input data and runs SLT01_InterproEnrichmentModel.stan. This script requires proteins.dill from SLT01_baselineDataSetup.py and the output of InterProScan and SignalP 5.0.
The scripts in this section are listed in the order in which they were run.
SLT01_alphafoldPredsSummaryMetrics.py: Collects mean pLDDT scores for AlphaFold predictions. This script requires a directory of .pdb files downloaded from the AlphaFold predicted protein structure database that have been processed by SLT01_trimAF.py.
SLT01_paperFigures.py: Generates the underlying plots for all figures used in the paper. This script requires proteins.dill and terms.dill from SLT01_baselineDataSetup.py, the initial gene ontology annotation files, the results of SLT01_termCentric-filterSimilarities.py, SLT01_termCentric-semiSupervisedModel.py, SLT01_alphafoldPredsSummaryMetrics.py, SLT01_GOenrichmentModel-driver.py, SLT01_GOexpectationAnalysis-driver.py, SLT01_InterproEnrichmentModel.stan, and the output of Proteinortho.
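For readers unfamiliar with how a mean pLDDT score of the kind collected by SLT01_alphafoldPredsSummaryMetrics.py can be obtained, the sketch below reads it from PDB-format text. AlphaFold database files store the per-residue pLDDT in the B-factor field of each ATOM record. Averaging over C-alpha atoms is an assumption made here, and both function names are hypothetical; atom_line exists only to build correctly aligned example records.

```python
def mean_plddt(pdb_text):
    """Average the B-factor field (columns 61-66) over C-alpha ATOM records."""
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return sum(scores) / len(scores) if scores else float("nan")


def atom_line(serial, name, res, resseq, bfactor):
    """Build a fixed-width ATOM record with dummy coordinates (helper for this example only)."""
    return "ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, res, resseq, 0.0, 0.0, 0.0, 1.0, bfactor)


# Two residues, each with a backbone N and a C-alpha atom.
example = "\n".join([
    atom_line(1, " N  ", "MET", 1, 90.0),
    atom_line(2, " CA ", "MET", 1, 90.0),
    atom_line(3, " N  ", "ALA", 2, 50.0),
    atom_line(4, " CA ", "ALA", 2, 50.0),
])
result = mean_plddt(example)
print(result)
```

Counting one atom per residue avoids weighting residues with more atoms more heavily than smaller ones.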