-
Notifications
You must be signed in to change notification settings - Fork 5
Workflows
There are four main workflows within metabolisHMM:
- single-marker-phylogeny
- create-genome-phylogeny
- summarize-metabolism
- search-custom-markers
Each workflow is written as an independent python script, and the two visualization workflows (summarize-metabolism and search-custom-markers) also run the make-heatmap.R script to produce heatmap visualizations of marker presence/absence. For each workflow, you can see the usage and optional arguments with for example python single-marker-phylogeny.py or python single-marker-phylogeny -h. Although all the workflows together provide a toolkit for summarizing and visualizing metabolic marker phylogenies and presence/absence heatmaps, they are treated as separate workflows. Therefore, when running a different workflow, you should make a different output directory corresponding to that workflow's specific results. If an output directory already exists, say from a previous workflow run, you will get an error that the directory already exists.
Additionally each workflow will reformat the FASTA file headers for each genome so as not to cause downstream problems. Raw genome FASTA files ending in .fna will go through gene prediction with Prodigal, or genomes that you have already performed functional annotation for and have files ending in .faa can be used as the starting point. In either case, the reformatted FASTA files are in the genomes/ folder of your output directory that you designated. If you have genome names with dashes or hyphens in them, those need to be changed to underscores or taken out completely. For example you can change all FASTA files in a directory to replace - with _ with rename 's/-/_/g' *. This is important for how results are parsed in downstream steps.
If you have a lot of genomes or markers to run through, or choose the option to construct the tree with RAxML, you can always run the workflow in the background for example with nohup python single-marker-phylogeny.py <all your options> & so that the workflow will keep running even if your connection to the remote server is closed or your desktop falls asleep.
Say you have a marker of interest, such as one of the steps in the nitrogen fixation pathway, and you want to understand the distribution and phylogeny of that marker in your set of genomes. To get a list of your genomes that have the specific marker and a resulting phylogeny of that marker for your genomes, run python single-marker-phylogeny.py.
The usage for single-marker-phylogeny.py looks as such:
usage: single-marker-phylogeny.py [-h] [--input INPUT] [--output OUTPUT]
[--marker MARKER] [--list LIST]
[--phylogeny PHY] [--threads THREADS]
Create phylogeny of single marker
required arguments:
--input INPUT Directory where genomes to be screened are held
--output OUTPUT Directory to store results and intermediate files
--marker MARKER Location of single marker to run analysis on
--phylogeny PHY fastree or raxml, choose one
optional arguments:
--list LIST Output list of hits locus tags
--threads THREADS number of threads for tree making
You are required to provide the input and name of your desired output directory, the specific marker you want to run the analysis on, and the phylogenetic tool to construct the phylogenetic tree. The options for FastTree and RAxML are given. It is good practice to run FastTree for preliminary results to see how the tree looks at first and identify any spurious branches, but for best results, it is best to use RAxML to have bootstrap support. If you are on a server with multiple CPUs, you can add the --threads # option, where # = the number of threads you want to use, where this is only taken into account for constructing the tree. If you are on a local desktop, FastTree will run quickly, but you are limited by the number of threads you can use for RAxML (usually 4 on a standard laptop), so keep this in mind for computational requirements and time needed to complete the run.
The create-genome-phylogeny workflow uses a set of ribosomal protein HMM markers to make a ribosomal tree of a given set of genomes. This is useful if you want to compare the phylogeny of the output for a single marker in the above workflow with the true ribosomal phylogeny of the set of genomes to make evolutionary comparisons based on tree topology.
The usage for create-genome-phylogeny.py is:
usage: create-genome-phylogeny.py [-h] [--input INPUT] [--output OUTPUT]
[--domain DOMAIN] [--phylogeny PHY]
[--threads THREADS] [--loci LOCI]
Create ribosomal phylogenies using specific ribosomal markers for archaea
and/or bacteria
required arguments:
--input INPUT Directory where genomes to be screened are held
--output OUTPUT Directory to store results and intermediate files
--domain DOMAIN archaea, bacteria, all
--phylogeny PHY fastree, raxml
optional arguments:
--threads THREADS Optional: number of threads for calculating a tree using
RAxML. This is not taken into account using Fastree
--loci LOCI Output genomes with less than x number of loci. By
default prints genomes that have less than 12 ribosomal
loci markers.
As with the single-marker-phylogeny workflow, you must provide the input and output directories, and which program you want to use to construct the phylogenetic tree (fastree or raxml). The other required argument for this workflow is the domain that your genomes belong to. Specific ribosomal HMM profiles are specific to bacterial genomes, and therefore need to be left out if you have archaeal genomes. If you have either all bacteria or all archaea, enter the domain for which your genomes belong to. If you have both bacteria and archaea for which you want to construct a/multiple ribosomal trees for, you have two options: either select the all option to run the markers that are suited for both bacteria and archaea, or run the entire workflow separately for your bacterial and archaeal genomes.
The number of threads is optional, and will only be used if you choose to run the tree with RAxML. Note that the ribosomal tree will take much longer to construct than those for single markers, and running RAxML may not be feasible on a local desktop, or even a remote server with usually a max of 16 CPUs depending on how many genomes you have. A good alternative is the CIPRES server that you can manually run RAxML on a given alignment. All intermediate files such as the concatenated alignment are provided in the out/ directory within the output directory that you specified.
Both phylogeny workflows will provide the HMM outputs, FASTA files of hits for each marker, single alignments for each marker, the concatenated alignment, and the tree file in the out/ and results/ folders of your output directory.
The summarize-metabolism workflow summarizes the metabolic capabilities of a given set of genomes covering the carbon, nitrogen, sulfur, hydrogen, and oxygen cycles. These are the markers provided in the metabolic-markers directory that were mostly curated by Anantharaman et al. 2016.