-
Notifications
You must be signed in to change notification settings - Fork 5
Workflows
There are four main workflows within metabolisHMM:
Each workflow is written as an independent python script, and for each workflow, you can see the usage and optional arguments with for example single-marker-phylogeny or single-marker-phylogeny -h. Although all the workflows together provide a toolkit for summarizing and visualizing metabolic marker phylogenies and presence/absence heatmaps, they are treated as separate workflows. Therefore, when running a different workflow, you should make a different output directory corresponding to that workflow's specific results. If an output directory already exists, say from a previous workflow run, you will get an error that the directory already exists.
Additionally each workflow will reformat the FASTA file headers for each genome so as not to cause downstream problems. Raw genome FASTA files ending in .fna will go through gene prediction with Prodigal, or genomes that you have already performed functional annotation for and have files ending in .faa can be used as the starting point. In either case, the reformatted FASTA files are in the genomes/ folder of your output directory that you designated. If you have genome names with dashes or hyphens in them, those need to be changed to underscores or taken out completely. For example you can change all FASTA files in a directory to replace - with _ with rename 's/-/_/g' *. This is important for how results are parsed in downstream steps.
If you have a lot of genomes or markers to run through, or choose the option to construct the tree with RAxML, you can always run the workflow in the background for example with nohup single-marker-phylogeny <all your options> & so that the workflow will keep running even if your connection to the remote server is closed or your desktop falls asleep.
Importantly if you are using your own custom-made HMM markers in the single-marker-phylogeny or search-custom-markers workflows, you need to include a TC (threshold/trusted cutoff) score in the HMM file. For more information on making your own HMMs and including a TC score, see this page of the Wiki.
Say you have a marker of interest, such as one of the steps in the nitrogen fixation pathway, and you want to understand the distribution and phylogeny of that marker in your set of genomes. To get a list of your genomes that have the specific marker and a resulting phylogeny of that marker for your genomes, run single-marker-phylogeny.
The usage for single-marker-phylogeny looks as such:
usage: single-marker-phylogeny [-h] [--input INPUT] [--output OUTPUT]
[--marker MARKER] [--list LIST]
[--phylogeny PHY] [--threads THREADS]
[--kofam KOFAM] [--ko_list KOLIST]
[--ribo_tree RIBO] [--domain DOMAIN]
[--loci LOCI] [--metadata METADATA]
[--names NAMES] [--itol_file ITOL]
Create phylogeny of single marker
required arguments:
--input INPUT Directory where genomes to be screened are held
--output OUTPUT Directory to store results and intermediate files
--marker MARKER Location of single marker to run analysis on
--phylogeny PHY fastree or raxml, choose one
optional arguments:
--list LIST Output list of hits locus tags
--threads THREADS number of threads for tree making
--kofam KOFAM Use KEGG HMMs from the KofamKOALA set. Options = ON or
OFF
--ko_list KOLIST Point to location of the KofamKoala ko_list file if
using the KofamKOALA KEGG HMMs
corresponding ribosomal tree arguments:
--ribo_tree RIBO Make corresponding ribosomal phylogeny of genomes
containing hits of the provided single marker.
--domain DOMAIN If constructing corresponding ribosomal tree, select
the domain for which your hits belong to. Options:
bacteria, archaea, all
--loci LOCI Output genomes with less than x number of loci. By
default prints genomes that have less than 12 ribosomal
loci markers.
metadata output files for ITOL:
--metadata METADATA Option for outputting ITOL formatted metadata files. ON
or OFF
--names NAMES Provided .csv formatted metadata file of filenames and
corresponding taxonomical or group names
--itol_file ITOL Output iTOL formatted metadata file for changing leaf
labels to taxonomical or group names
You are required to provide the input and name of your desired output directory, the specific marker you want to run the analysis on, and the phylogenetic tool to construct the phylogenetic tree. The options for FastTree and RAxML are given. It is good practice to run FastTree for preliminary results to see how the tree looks at first and identify any spurious branches, but for best results, it is best to use RAxML to have bootstrap support. If you are on a server with multiple CPUs, you can add the --threads # option, where # = the number of threads you want to use, where this is only taken into account for constructing the tree. If you are on a local desktop, FastTree will run quickly, but you are limited by the number of threads you can use for RAxML (usually 4 on a standard laptop), so keep this in mind for computational requirements and time needed to complete the run.
You can also add a number of optional arguments concerning creating metadata files to use in iTOL or corresponding ribosomal tree of your single marker hits. You can create a corresponding ribosomal tree of your hits to compare directly against by turning the ribo_tree option ON, and providing the domain option for your genomes, such as bacteria, archaea, or all, as described in the create-genome-phylogeny workflow. If you want corresponding metadata for all your genomes depending on names that you have provided (such as phylum level names), you can provide the necessary metadata arguments.
The create-genome-phylogeny workflow uses a set of ribosomal protein HMM markers to make a ribosomal tree of a given set of genomes. This is useful if you want to compare the phylogeny of the output for a single marker in the above workflow with the true ribosomal phylogeny of the set of genomes to make evolutionary comparisons based on tree topology.
The usage for create-genome-phylogeny is:
usage: create-genome-phylogeny [-h] [--input INPUT] [--output OUTPUT]
[--domain DOMAIN] [--phylogeny PHY]
[--threads THREADS] [--loci LOCI]
[--metadata METADATA] [--names NAMES]
[--itol_file ITOL]
Create ribosomal phylogenies using specific ribosomal markers for archaea
and/or bacteria
required arguments:
--input INPUT Directory where genomes to be screened are held
--output OUTPUT Directory to store results and intermediate files
--domain DOMAIN archaea, bacteria, all
--phylogeny PHY fastree, raxml
optional arguments:
--threads THREADS Optional: number of threads for calculating a tree
using RAxML. This is not taken into account using
Fastree
--loci LOCI Output genomes with less than x number of loci. By
default prints genomes that have less than 12 ribosomal
loci markers.
metadata output files for ITOL:
--metadata METADATA Option for outputting ITOL formatted metadata files. ON
or OFF
--names NAMES Provided .csv formatted metadata file of filenames and
corresponding taxonomical or group names
--itol_file ITOL Output iTOL formatted metadata file for changing leaf
labels to taxonomical or group names
As with the single-marker-phylogeny workflow, you must provide the input and output directories, and which program you want to use to construct the phylogenetic tree (fastree or raxml). The other required argument for this workflow is the domain that your genomes belong to. Specific ribosomal HMM profiles are specific to bacterial genomes, and therefore need to be left out if you have archaeal genomes. If you have either all bacteria or all archaea, enter the domain for which your genomes belong to. If you have both bacteria and archaea for which you want to construct a/multiple ribosomal trees for, you have two options: either select the all option to run the markers that are suited for both bacteria and archaea, or run the entire workflow separately for your bacterial and archaeal genomes.
The number of threads is optional, and will only be used if you choose to run the tree with RAxML. Note that the ribosomal tree will take much longer to construct than those for single markers, and running RAxML may not be feasible on a local desktop, or even a remote server with usually a max of 16 CPUs depending on how many genomes you have. A good alternative is the CIPRES server that you can manually run RAxML on a given alignment. All intermediate files such as the concatenated alignment are provided in the out/ directory within the output directory that you specified.
Both phylogeny workflows will provide the HMM outputs, FASTA files of hits for each marker, single alignments for each marker, the concatenated alignment, and the tree file in the out/ and results/ folders of your output directory.
The other functionality of metabolisHMM is creating heatmap visualizations of curated and custom HMM marker results. The summarize-metabolism workflow summarizes the metabolic capabilities of a given set of genomes covering the carbon, nitrogen, sulfur, hydrogen, and oxygen cycles. These are the markers provided in the metabolic-markers directory that were mostly curated by Anantharaman et al. 2016. The usage is:
usage: summarize-metabolism [-h] [--input GENOMEDIR] [--output OUTPUT]
[--metadata METADATA] [--summary OUTFILE]
[--heatmap HEATOUT] [--aggregate AGG]
[--plotting PLOTTING] [--ordering ORDER]
[--group_list GROUPS]
Create a summary of metabolic markers spanning biogeochemical cycles
required arguments:
--input GENOMEDIR Directory where genomes to be screened are held
--output OUTPUT Directory to store results and intermediate files
--metadata METADATA Metadata file with taxonomical classifications or
groups associated with genome file names. Needs to be a
.csv file with the genome/filename (everything before
.fna) and the group name.
optional arguments:
--summary OUTFILE Summary file of metabolic marker statistics in CSV
format
--heatmap HEATOUT Summary heatmap of metabolic markers in PDF format. If
you provide a custom name, it must end in .pdf
--aggregate AGG Aggregate metadata names by group = ON, visualize each
genome individually = OFF
--plotting PLOTTING Option to turn plotting on or off depending on
if you want just the raw statistics to perform your own
plotting, or keep the default plotting settings. Options= ON,OFF.
plotting arguments:
--ordering ORDER Turn ON or OFF to control custom ordering of rows in
heatmap. By default is OFF to list in alphabetical
ordering.
--group_list GROUPS Provide .txt file of ordering of groups if turn the
aggregating option on to list the rows in custom order
instead of the default alphabetical ordering.
The required inputs are your genome directory, your desired output directory, and a set of metadata for your genomes. Your metadata file should be a CSV formatted file, with the first column the names of your genome excluding the extension of the filename (no .fna or .faa) and the second column the taxonomical name or group that you have identified your genome/MAG belonging to, or any other special identifier. For example, I have a list of genomes with the following names, and have listed their corresponding phyla:
bacteria00190,Actinobacteria
bacteria00193,Deltaproteobacteria
bacteria00203,Bacteroidetes
bacteria00229,Deltaproteobacteria
bacteria01060v2,Deltaproteobacteria
bacteria23257,Deltaproteobacteria
bacteria23258,Bacteroidetes
bacteria23259,Chloroflexi
bacteria23260,Deltaproteobacteria
bacteria23263,Deltaproteobacteria
Do not give the CSV file a header or specified column names, that is taken care of in downstream steps. If you do not have taxonomical names or particular names for each of your genomes, just provide the name of each file twice, such as:
bacteria00190,bacteria00190
bacteria00193,bacteria00193
bacteria00203,bacteria00203
bacteria00229,bacteria00229
bacteria01060v2,bacteria01060v2
bacteria23257,bacteria23257
bacteria23258,bacteria23258
bacteria23259,bacteria23259
bacteria23260,bacteria23260
bacteria23263,bacteria23263
By default the workflow will automatically name the resulting summary CSV file and heatmap in PDF format. If you want to change the name or location of these files, provide that in the --summary and --heatmap arguments, respectively, keeping in mind that each, respectively, needs to end in .csv and .pdf.
The --aggregate option is used if you want to summarize groups of genomes by a given common identifier. Say you have lots of genomes all belonging to different phyla as in the example above. If you want to visualize a large heatmap of approximately 100 genomes in different phyla, but want to make cross-phyla comparisions while keeping within phyla calculations, turn --aggregate ON to group by a specific identifier. See the Examples part of the Wiki to see how aggregating will change the heatmap.
The search-custom-markers workflow will take any set of HMM profiles (with threshold cutoff scores provided either within the HMM file itself or an external file such as the KofamKOALA exec file) in the order you provide in a text file, search for those markers, and output the summary and heatmap visualization of the presence/absence of all markers across all genomes.
The usage for search-custom-markers is:
usage: search-custom-markers [-h] [--input GENOMEDIR] [--output OUTPUT]
[--markers_dir MARKERDIR]
[--markers_list MARKERLIST] [--metadata METADATA]
[--summary OUTFILE] [--heatmap HEATOUT]
[--kofam KOFAM] [--ko_list KOLIST]
[--aggregate AGG] [--plotting PLOTTING]
[--ordering ORDER] [--group_list GROUPS]
Search custom directory of HMMs
required arguments:
--input GENOMEDIR Directory where genomes to be screened are held
--output OUTPUT Directory to store results and intermediate files
--markers_dir MARKERDIR
Directory where custom markers are held
--markers_list MARKERLIST
Ordered list of markers to run custom search on
--metadata METADATA Metadata file with taxonomical classifications or
groups associated with genome file names
optional arguments:
--summary OUTFILE Output statistics of custom marker searches in .csv
format
--heatmap HEATOUT Summary heatmap of metabolic markers in PDF format. If
you provide a custom name, it must end in .pdf
kofam database arguments:
--kofam KOFAM Use KofamKOALA HMM distributions. By default is OFF,
options = ON,OFF
--ko_list KOLIST Point to location of the KofamKoala ko_list file if
using the KofamKOALA KEGG HMMs
plotting arguments:
--aggregate AGG Aggregate metadata names by group = ON, visualize each
genome individually = OFF
--plotting PLOTTING Option to turn plotting on or off depending on
if you want just the raw statistics to perform your
own plotting, or keep the default plotting settings.By default option
is ON. Options= ON,OFF.
--ordering ORDER Turn ON or OFF to control custom ordering of rows in
heatmap. By default is OFF to list in alphabetical
ordering.
--group_list GROUPS Provide .txt file of ordering of groups if turn the
aggregating option on to list the rows in custom order
instead of the default alphabetical ordering.
Again, you are required to provide the input and desired output directories. Additionally for this workflow, you will need to provide the directory that these custom or external markers are. DO NOT put them in the metabolic-markers folder, unless you are not planning on running the summarize-metabolism workflow, as their presence will mess things up with the other workflow. Instead, put them in a different directory, and point to that directory with the --markers_dir argument. Additionally, you will need to provide a list of the markers in that directory, in the order that you want them to appear in the heatmap in a .txt file with the --markers_list argument. Again, provide the --metadata file as formatted in the summarize-metabolism workflow.
As with the summarize-metabolism workflow, the workflow will by default write out the summary table and heatmap to the results/ folder. If you want to change the names or locations of these output files, provide them to the --summary and --heatmap arguments, respectively, and they need to end in .csv or .pdf, respectively.
metabolisHMM can also run, parse, and visualize sets of HMMs provided through KofamKOALA if you have the database and corresponding auxiliary files downloaded. With the --kofam argument, you will need to provide the path to the exec file, which has the threshold cutoffs for each curated HMM.
You can optionally turn the plotting function off if you just want to have the heatmap summary results in the raw format and perform your own plotting. You can also specify the order of the rows by turning ON the ordering argument, and then providing a text file of the order you want your rows of individual genomes or aggregated groups in the heatmap to be ordered in, depending on whether or not you have the --aggregate option ON or OFF. This way, you can put your groups or individual genomes in taxonomical order to match up with a tree of your choosing.