stavis1/Pputida_PUF_predictions_paper

In-house scripts used in the paper "Multi-Omics Integration Can be Used to Rescue Metabolic Information for Some of the Dark Region of the Pseudomonas putida Proteome"

Within species Guilt by Association model

coexpression

The scripts in this section do not depend on the results of any other scripts in this repository. They are listed in the order in which they were run.

names.txt: A mapping between the internally used mass spec data filenames and the filenames from the online repositories

valid_defects.txt: A list of mass defects corresponding to plausibly biological PTMs, used to filter the ANN-SoLo output

SLT01_coEx-processDecoyLibrary.py: Removes from the NIST mouse spectral library any peptides that could also be generated by P. putida. This script requires the NIST .msp file and Pseudomonas_putida_KT2440_110.faa.
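The filtering idea can be illustrated with a minimal sketch. It assumes a tryptic digestion rule (cleave after K or R, except before P), treats I and L as indistinguishable, and works on plain peptide strings. The function names, the digestion parameters, and the omission of .msp parsing are illustrative assumptions, not details taken from SLT01_coEx-processDecoyLibrary.py.

```python
import re


def tryptic_peptides(protein, missed_cleavages=2, min_len=6):
    """Yield tryptic peptides of a protein sequence (assumed digestion rule)."""
    # Cut after K or R unless the next residue is P (needs Python 3.7+).
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    for i in range(len(pieces)):
        for j in range(i, min(i + missed_cleavages + 1, len(pieces))):
            peptide = "".join(pieces[i:j + 1])
            if len(peptide) >= min_len:
                yield peptide


def filter_library(library_peptides, proteome, **digest_kwargs):
    """Drop library peptides that the proteome could also produce."""
    host = set()
    for protein in proteome:
        for peptide in tryptic_peptides(protein, **digest_kwargs):
            host.add(peptide.replace("I", "L"))  # I and L are isobaric
    return [p for p in library_peptides if p.replace("I", "L") not in host]


# Toy data: one made-up protein sequence and three made-up library peptides.
proteome = ["MKTAYIAKQRLLEDK"]
library = ["TAYLAK", "PEPTIDEK", "LLEDK"]
kept = filter_library(library, proteome, missed_cleavages=0, min_len=5)
print(kept)
```

Collapsing I to L before comparing errs on the side of removing too much, which is the safer direction for a library that is meant to contain no P. putida peptides.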

SLT01_spectraST-commands.txt: The specific commands used when processing spectral libraries with SpectraST. These commands are run on the output of the Philosopher pipeline and the filtered NIST mouse spectral library from above.

SLT01_coEx-annsoloToFlashLFQ.py: Processes the ANN-SoLo output, applies FDR control, and filters implausible PTMs, then formats the output for quantification by FlashLFQ. The script runs FlashLFQ in the smithchemwisc/flashlfq:1.0.3 Docker container as a subprocess. This requires a concatenated target + decoy spectral library file, the .mztab outputs of ANN-SoLo, valid_defects.txt, and names.txt.
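As a rough illustration of the two filters named above, the sketch below computes target-decoy q-values and checks a precursor mass difference against a list of allowed values. The tuple layout, the function names, the 1% q-value cut-off, the 0.01 Da tolerance, and the example mass values are all assumptions made for the example; the actual file formats and thresholds are defined in the script itself.

```python
def q_values(psms):
    """psms: list of (score, is_decoy); a higher score means a better match.

    Returns one q-value per PSM, in the input order.
    """
    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)
    fdrs, targets, decoys = [], 0, 0
    for i in order:
        if psms[i][1]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # A q-value is the lowest FDR at which the PSM would still be accepted,
    # so take a running minimum from the worst score upwards.
    q = [0.0] * len(psms)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        q[order[rank]] = running_min
    return q


def is_plausible_shift(mass_shift, valid_defects, tol=0.01):
    """True if the mass difference is within tol of any allowed value."""
    return any(abs(mass_shift - d) <= tol for d in valid_defects)


# Toy PSMs: (score, is_decoy, precursor mass difference in Da).
psms = [(10.0, False, 0.0), (9.0, False, 15.995),
        (8.0, True, 79.966), (7.0, False, 123.456)]
valid = [0.0, 15.995, 79.966]  # stand-in for the contents of valid_defects.txt
q = q_values([(score, decoy) for score, decoy, _ in psms])
accepted = [p for p, qv in zip(psms, q)
            if qv <= 0.01 and not p[1] and is_plausible_shift(p[2], valid)]
print(len(accepted))
```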

evolutionary_correlation

The scripts in this section do not depend on the results of any other scripts in this repository. They are listed in the order in which they were run.

SLT01_MAFFTalign.py: Runs MAFFT on all orthogroups. This script requires a .faa file.

SLT01_corrEvoRenameMSAgenes.py: Renames genes in the MSAs to their species for use in Pargenes. This script requires the MSA files from MAFFT and a .faa file.

SLT01_corrEvoPargenesInput.py: Cleans and formats the MSAs for Pargenes. This script requires the processed MSAs generated above.

SLT01_corrEvoFixedTopo-scheduler.py: Runs RAxML-ng on all orthogroups with the species tree as a topological constraint. This script requires the output directory from running Pargenes with the optional ASTRAL call.

SLT01_corrEvoTreeCMP-cleanNewick2.py: Processes the Newick outputs of the topologically constrained RAxML-ng runs to work with TreeCMP. This script requires a directory of the original .faa files and a directory of the Newick outputs from SLT01_corrEvoFixedTopo-scheduler.py.

SLT01_corrEvoTreeCMP-fullComparison.py: Breaks the TreeCMP inputs into blocks and runs TreeCMP on those blocks in parallel to do an all-vs-all comparison of trees. This script requires a directory of processed Newick files from SLT01_corrEvoTreeCMP-cleanNewick2.py.
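The blocking strategy used for the all-vs-all comparison can be sketched as follows. This is only an illustration of how a list can be split into blocks so that every unordered pair is covered exactly once. The function names, the block size, and the file names are hypothetical, and no TreeCMP call is shown because its invocation is specific to the script.

```python
from itertools import combinations, combinations_with_replacement, product


def make_blocks(items, block_size):
    """Split a list into consecutive blocks of at most block_size items."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


def block_jobs(n_blocks):
    """Every pair of block indices (a, b) with a <= b is one independent job."""
    return list(combinations_with_replacement(range(n_blocks), 2))


def pairs_in_job(blocks, a, b):
    """Tree pairs one job compares: within a block, or across two blocks."""
    if a == b:
        return list(combinations(blocks[a], 2))
    return list(product(blocks[a], blocks[b]))


# Toy example: seven placeholder tree files in blocks of three.
trees = ["tree_%d.nwk" % i for i in range(7)]
blocks = make_blocks(trees, 3)
jobs = block_jobs(len(blocks))
all_pairs = [pair for a, b in jobs for pair in pairs_in_job(blocks, a, b)]
print(len(jobs), len(all_pairs))
```

Because the jobs share no state, each one could be handed to a separate worker process.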

structural_similarity

The scripts in this section are listed in the order in which they were run. Note that this section covers the within-species structural similarity analysis that feeds into the guilt-by-association model, not the between-species structural similarity analysis.

SLT01_trimAF.py: Trims low-confidence regions at the termini of proteins as a preprocessing step for alignment. This script requires the proteins.dill file from SLT01_baselineDataSetup.py in the GBA integration section and a directory of .pdb files from the AlphaFold predicted protein structure database.

SLT01_TMalignScheduler.py: Runs TM-align on all P. putida proteins. This script requires the directory of cleaned .pdb files from SLT01_trimAF.py.
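To make the trimming step in SLT01_trimAF.py more concrete, here is a minimal sketch of one way to remove low-confidence termini. It relies on the fact that AlphaFold .pdb files store the per-residue pLDDT in the B-factor column. The threshold of 70, the rule of keeping everything between the first and last confident residue, and the function names are assumptions for illustration and may differ from what the script does.

```python
def residue_plddt(pdb_lines):
    """Map residue number -> pLDDT, read from the B-factor column of ATOM records."""
    scores = {}
    for line in pdb_lines:
        if line.startswith("ATOM"):
            # Fixed-width PDB columns: resSeq is 23-26, tempFactor is 61-66.
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def trim_termini(pdb_lines, threshold=70.0):
    """Drop ATOM records before the first and after the last confident residue.

    Non-ATOM lines are kept. Returns an empty list if no residue reaches the
    threshold. Assumes a single-chain model.
    """
    scores = residue_plddt(pdb_lines)
    confident = [res for res, score in scores.items() if score >= threshold]
    if not confident:
        return []
    first, last = min(confident), max(confident)
    return [line for line in pdb_lines
            if not line.startswith("ATOM")
            or first <= int(line[22:26]) <= last]
```

Only the termini are trimmed in this sketch: a low-confidence loop between two confident residues is kept, so the chain stays contiguous for the aligner.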

GBA_integration

The scripts in this section are dependent on the results of the previous sections. They are listed in the order in which they were run.

names_seq_input.csv: Data from UniProt

test_set.txt: The proteins held out as a test set for the prediction model. Created by SLT01_termCentric-filterSimilarities.py

SLT01_baselineDataSetup.py: Parses go.obo into a list of Python objects and a networkx graph, collects the initial annotation information, and builds a list of protein objects. These data structures are pickled with dill for use by other scripts in the analysis. This script requires the Gene Ontology description file go.obo, Pseudomonas_putida_KT2440_110.faa, names_seq_input.csv, proteins.dat from BioCyc, GO annotations from the Pseudomonas Genome Database, and the results from running NetGO2.0.
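For readers unfamiliar with the OBO format, the sketch below shows the kind of parsing involved. It uses only the standard library and plain dictionaries in place of the networkx graph and dill pickles that the script produces, and it reads only the id, name, and is_a fields. The function names and the three-term excerpt are illustrative assumptions.

```python
def parse_obo(text):
    """Parse [Term] stanzas of an OBO file into {id: {"name": ..., "is_a": [...]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Only [Term] stanzas are kept; [Typedef] and others are skipped.
            current = {"name": "", "is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key == "id":
                terms[value] = current
            elif key == "name":
                current["name"] = value
            elif key == "is_a":
                # "is_a: GO:0008150 ! biological_process" -> keep the ID only
                current["is_a"].append(value.split(" ! ")[0])
    return terms


def ancestors(terms, term_id):
    """All terms reachable from term_id by following is_a edges."""
    seen, stack = set(), [term_id]
    while stack:
        for parent in terms.get(stack.pop(), {"is_a": []})["is_a"]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy excerpt in OBO syntax.
example = """format-version: 1.2

[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0008152
name: metabolic process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0009058
name: biosynthetic process
is_a: GO:0008152 ! metabolic process
"""
terms = parse_obo(example)
print(sorted(ancestors(terms, "GO:0009058")))
```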

SLT01_makeEdgelist.py: Formats raw similarity data into edgelists. This script requires the output of SLT01_coEx-annsoloToFlashLFQ.py, the output of SLT01_corrEvoTreeCMP-fullComparison.py, the results of running a Diamond all-vs-all search, the output of Rockhopper, the output of InterProScan and SignalP 5.0, STRINGdb data, and the output of SLT01_TMalignScheduler.py.

SLT01_combineEdgelist.py: Combines the individual edgelist files for use by the first within-species model. This script requires the output of SLT01_makeEdgelist.py as well as proteins.dill and terms.dill from SLT01_baselineDataSetup.py.
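One plausible way to combine several edgelists is to key each edge on its unordered protein pair and collect one score per evidence type. The sketch below does that. The in-memory layout, the evidence names, the scores, and the protein identifiers are placeholders, and the on-disk format used by SLT01_combineEdgelist.py is not shown here.

```python
def combine_edgelists(edgelists):
    """Merge {evidence_name: [(protein_a, protein_b, score), ...]} into
    {(protein_a, protein_b): {evidence_name: score}} with sorted pair keys."""
    combined = {}
    for name, edges in edgelists.items():
        for a, b, score in edges:
            pair = tuple(sorted((a, b)))  # (A, B) and (B, A) are the same edge
            combined.setdefault(pair, {})[name] = score
    return combined


# Placeholder data: two evidence types over three proteins.
edgelists = {
    "coexpression": [("PP_0001", "PP_0002", 0.8)],
    "structure": [("PP_0002", "PP_0001", 0.5), ("PP_0003", "PP_0001", 0.2)],
}
combined = combine_edgelists(edgelists)
print(combined[("PP_0001", "PP_0002")])
```

Pairs that lack a given evidence type simply have no entry for it, which leaves the handling of missing values to the downstream model.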

SLT01_termCentric-filterSimilarities.py: Trains and runs inference for the initial protein-protein similarity model. This script requires the output of SLT01_combineEdgelist.py as well as proteins.dill and terms.dill from SLT01_baselineDataSetup.py

SLT01_termCentric-inputData.py: Formats the data for the term transfer model. This script requires the output of SLT01_termCentric-filterSimilarities.py as well as proteins.dill and terms.dill from SLT01_baselineDataSetup.py.

SLT01_termCentric-semiSupervisedModel.py: Fits the semi-supervised term transfer model. This script requires the output of SLT01_termCentric-inputData.py.
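The script-to-script dependencies listed for the within-species model can be checked mechanically. The sketch below encodes only the requirements stated in this README (external tools and data files are left out) and asks the standard library for one valid run order. It needs Python 3.9 or later for graphlib, and it is a reading aid rather than a description of how the authors launched the pipeline.

```python
from graphlib import TopologicalSorter

# script -> scripts whose output it requires, as stated in this README
requires = {
    "SLT01_trimAF.py": {"SLT01_baselineDataSetup.py"},
    "SLT01_TMalignScheduler.py": {"SLT01_trimAF.py"},
    "SLT01_makeEdgelist.py": {
        "SLT01_coEx-annsoloToFlashLFQ.py",
        "SLT01_corrEvoTreeCMP-fullComparison.py",
        "SLT01_TMalignScheduler.py",
    },
    "SLT01_combineEdgelist.py": {
        "SLT01_makeEdgelist.py",
        "SLT01_baselineDataSetup.py",
    },
    "SLT01_termCentric-filterSimilarities.py": {
        "SLT01_combineEdgelist.py",
        "SLT01_baselineDataSetup.py",
    },
    "SLT01_termCentric-inputData.py": {
        "SLT01_termCentric-filterSimilarities.py",
        "SLT01_baselineDataSetup.py",
    },
    "SLT01_termCentric-semiSupervisedModel.py": {
        "SLT01_termCentric-inputData.py",
    },
}

# static_order() raises CycleError if the stated requirements are circular.
run_order = list(TopologicalSorter(requires).static_order())
for script in run_order:
    print(script)
```

Note that SLT01_baselineDataSetup.py has to run before SLT01_trimAF.py even though it is listed in a later section.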

Between Species Structural Similarity Model

structural_similarity_model

The scripts in this section are listed in the order in which they were run. They only depend on SLT01_baselineDataSetup.py.

SLT01_RUPEEwebdriver.py: Automates searching P. putida proteins with the online RUPEE search tool. This script requires a directory of .pdb files downloaded from the AlphaFold predicted protein structure database that have been processed by SLT01_trimAF.py.

SLT01_RUPEEmapNamesAndcollectInputData.py: Maps names from PDB hits to UniProt IDs. This script requires the output of SLT01_RUPEEwebdriver.py.

SLT01_RUPEEpairwiseAlignments.py: Runs NWalign on the hits and collects pairwise similarity information. This script requires the output of SLT01_RUPEEmapNamesAndcollectInputData.py and Pseudomonas_putida_KT2440_110.faa.

SLT01_RUPEEsemisupervisedInput.py: Collects the input information and formats it for the term transfer model. This script requires the output of SLT01_RUPEEpairwiseAlignments.py and all outputs of SLT01_baselineDataSetup.py.

SLT01_RUPEEsemisupervisedModel.py: Fits the semi-supervised random forest model. This script requires the output of SLT01_RUPEEsemisupervisedInput.py.

Other scripts

All of the following scripts were run after all of the above scripts.

GO_enrichment

The scripts in this section are listed in the order in which they were run.

SLT01_GOenrichmentModel.stan: The statistical model for the GO enrichment analysis in Figure 5. This script is run by SLT01_GOenrichmentModel-driver.py.

SLT01_GOenrichmentModel-driver.py: Formats data and runs the above model. This script requires the outputs of SLT01_baselineDataSetup.py, SLT01_termCentric-semiSupervisedModel.py, and SLT01_RUPEEsemisupervisedModel.py.

SLT01_GOexpectationAnalysis.stan: The statistical model for the predicted term count model in Figure 6. This script is run by SLT01_GOexpectationAnalysis-driver.py.

SLT01_GOexpectationAnalysis-driver.py: Formats data and runs the above model. This script requires the outputs of SLT01_baselineDataSetup.py, SLT01_termCentric-semiSupervisedModel.py, and SLT01_RUPEEsemisupervisedModel.py.

interpro_analysis

The scripts in this section are listed in the order in which they were run.

SLT01_InterproEnrichmentModel.stan: The statistical model for assessing InterPro term enrichments among PUFs. Used for Figure S3. This script is run by SLT01_InterproEnrichmentModel-driver.py.

SLT01_InterproEnrichmentModel-driver.py: Formats the input data and runs SLT01_InterproEnrichmentModel.stan. This script requires proteins.dill from SLT01_baselineDataSetup.py and the output of InterProScan and SignalP 5.0.

paper_figures

The scripts in this section are listed in the order in which they were run.

SLT01_alphafoldPredsSummaryMetrics.py: Collects mean pLDDT scores for AlphaFold predictions. This script requires a directory of .pdb files downloaded from the AlphaFold predicted protein structure database that have been processed by SLT01_trimAF.py.

SLT01_paperFigures.py: Generates the underlying plots for all figures used in the paper. This script requires proteins.dill and terms.dill from SLT01_baselineDataSetup.py, the initial gene ontology annotation files, the results of SLT01_termCentric-filterSimilarities.py, SLT01_termCentric-semiSupervisedModel.py, SLT01_alphafoldPredsSummaryMetrics.py, SLT01_GOenrichmentModel-driver.py, SLT01_GOexpectationAnalysis-driver.py, SLT01_InterproEnrichmentModel.stan, and the output of Proteinortho.
