Skip to content

Workflows

Elizabeth McDaniel edited this page Sep 14, 2020 · 15 revisions

There are four main workflows within metabolisHMM:

  1. single-marker-phylogeny
  2. create-genome-phylogeny
  3. summarize-metabolism
  4. search-custom-markers

Each workflow is written as an independent python script, and for each workflow, you can see the usage and optional arguments with for example single-marker-phylogeny or single-marker-phylogeny -h. Although all the workflows together provide a toolkit for summarizing and visualizing metabolic marker phylogenies and presence/absence heatmaps, they are treated as separate workflows. Therefore, when running a different workflow, you should make a different output directory corresponding to that workflow's specific results. If an output directory already exists, say from a previous workflow run, you will get an error that the directory already exists.

Additionally each workflow will reformat the FASTA file headers for each genome so as not to cause downstream problems. Raw genome FASTA files ending in .fna will go through gene prediction with Prodigal, or genomes that you have already performed functional annotation for and have files ending in .faa can be used as the starting point. In either case, the reformatted FASTA files are in the genomes/ folder of your output directory that you designated. If you have genome names with dashes or hyphens in them, those need to be changed to underscores or taken out completely. For example you can change all FASTA files in a directory to replace - with _ with rename 's/-/_/g' *. This is important for how results are parsed in downstream steps.

If you have a lot of genomes or markers to run through, or choose the option to construct the tree with RAxML, you can always run the workflow in the background for example with nohup single-marker-phylogeny <all your options> & so that the workflow will keep running even if your connection to the remote server is closed or your desktop falls asleep.

Importantly if you are using your own custom-made HMM markers in the single-marker-phylogeny or search-custom-markers workflows, you need to include a TC (threshold/trusted cutoff) score in the HMM file. For more information on making your own HMMs and including a TC score, see this page of the Wiki.

Single marker phylogeny

Say you have a marker of interest, such as one of the steps in the nitrogen fixation pathway, and you want to understand the distribution and phylogeny of that marker in your set of genomes. To get a list of your genomes that have the specific marker and a resulting phylogeny of that marker for your genomes, run single-marker-phylogeny.

The usage for single-marker-phylogeny looks as such:

usage: single-marker-phylogeny [-h] [--input INPUT] [--output OUTPUT]
                               [--marker MARKER] [--list LIST]
                               [--phylogeny PHY] [--threads THREADS]
                               [--kofam KOFAM] [--ko_list KOLIST]
                               [--ribo_tree RIBO] [--domain DOMAIN]
                               [--loci LOCI] [--metadata METADATA]
                               [--names NAMES] [--itol_file ITOL]

Create phylogeny of single marker

required arguments:
  --input INPUT        Directory where genomes to be screened are held
  --output OUTPUT      Directory to store results and intermediate files
  --marker MARKER      Location of single marker to run analysis on
  --phylogeny PHY      fastree or raxml, choose one

optional arguments:
  --list LIST          Output list of hits locus tags
  --threads THREADS    number of threads for tree making
  --kofam KOFAM        Use KEGG HMMs from the KofamKOALA set. Options = ON or
                       OFF
  --ko_list KOLIST     Point to location of the KofamKoala ko_list file if
                       using the KofamKOALA KEGG HMMs

corresponding ribosomal tree arguments:
  --ribo_tree RIBO     Make corresponding ribosomal phylogeny of genomes
                       containing hits of the provided single marker.
  --domain DOMAIN      If constructing corresponding ribosomal tree, select
                       the domain for which your hits belong to. Options:
                       bacteria, archaea, all
  --loci LOCI          Output genomes with less than x number of loci. By
                       default prints genomes that have less than 12 ribosomal
                       loci markers.

metadata output files for ITOL:
  --metadata METADATA  Option for outputting ITOL formatted metadata files. ON
                       or OFF
  --names NAMES        Provided .csv formatted metadata file of filenames and
                       corresponding taxonomical or group names
  --itol_file ITOL     Output iTOL formatted metadata file for changing leaf
                       labels to taxonomical or group names

You are required to provide the input and name of your desired output directory, the specific marker you want to run the analysis on, and the phylogenetic tool to construct the phylogenetic tree. The options for FastTree and RAxML are given. It is good practice to run FastTree for preliminary results to see how the tree looks at first and identify any spurious branches, but for best results, it is best to use RAxML to have bootstrap support. If you are on a server with multiple CPUs, you can add the --threads # option, where # = the number of threads you want to use, where this is only taken into account for constructing the tree. If you are on a local desktop, FastTree will run quickly, but you are limited by the number of threads you can use for RAxML (usually 4 on a standard laptop), so keep this in mind for computational requirements and time needed to complete the run.

You can also add a number of optional arguments concerning creating metadata files to use in iTOL or corresponding ribosomal tree of your single marker hits. You can create a corresponding ribosomal tree of your hits to compare directly against by turning the ribo_tree option ON, and providing the domain option for your genomes, such as bacteria, archaea, or all, as described in the create-genome-phylogeny workflow. If you want corresponding metadata for all your genomes depending on names that you have provided (such as phylum level names), you can provide the necessary metadata arguments.

Create genome phylogeny

The create-genome-phylogeny workflow uses a set of ribosomal protein HMM markers to make a ribosomal tree of a given set of genomes. This is useful if you want to compare the phylogeny of the output for a single marker in the above workflow with the true ribosomal phylogeny of the set of genomes to make evolutionary comparisons based on tree topology.

The usage for create-genome-phylogeny is:

usage: create-genome-phylogeny [-h] [--input INPUT] [--output OUTPUT]
                               [--domain DOMAIN] [--phylogeny PHY]
                               [--threads THREADS] [--loci LOCI]
                               [--metadata METADATA] [--names NAMES]
                               [--itol_file ITOL]

Create ribosomal phylogenies using specific ribosomal markers for archaea
and/or bacteria

required arguments:
  --input INPUT        Directory where genomes to be screened are held
  --output OUTPUT      Directory to store results and intermediate files
  --domain DOMAIN      archaea, bacteria, all
  --phylogeny PHY      fastree, raxml

optional arguments:
  --threads THREADS    Optional: number of threads for calculating a tree
                       using RAxML. This is not taken into account using
                       Fastree
  --loci LOCI          Output genomes with less than x number of loci. By
                       default prints genomes that have less than 12 ribosomal
                       loci markers.

metadata output files for ITOL:
  --metadata METADATA  Option for outputting ITOL formatted metadata files. ON
                       or OFF
  --names NAMES        Provided .csv formatted metadata file of filenames and
                       corresponding taxonomical or group names
  --itol_file ITOL     Output iTOL formatted metadata file for changing leaf
                       labels to taxonomical or group names

As with the single-marker-phylogeny workflow, you must provide the input and output directories, and which program you want to use to construct the phylogenetic tree (fastree or raxml). The other required argument for this workflow is the domain that your genomes belong to. Specific ribosomal HMM profiles are specific to bacterial genomes, and therefore need to be left out if you have archaeal genomes. If you have either all bacteria or all archaea, enter the domain for which your genomes belong to. If you have both bacteria and archaea for which you want to construct a/multiple ribosomal trees for, you have two options: either select the all option to run the markers that are suited for both bacteria and archaea, or run the entire workflow separately for your bacterial and archaeal genomes.

The number of threads is optional, and will only be used if you choose to run the tree with RAxML. Note that the ribosomal tree will take much longer to construct than those for single markers, and running RAxML may not be feasible on a local desktop, or even a remote server with usually a max of 16 CPUs depending on how many genomes you have. A good alternative is the CIPRES server that you can manually run RAxML on a given alignment. All intermediate files such as the concatenated alignment are provided in the out/ directory within the output directory that you specified.

Both phylogeny workflows will provide the HMM outputs, FASTA files of hits for each marker, single alignments for each marker, the concatenated alignment, and the tree file in the out/ and results/ folders of your output directory.

Summarize metabolism

The other functionality of metabolisHMM is creating heatmap visualizations of curated and custom HMM marker results. The summarize-metabolism workflow summarizes the metabolic capabilities of a given set of genomes covering the carbon, nitrogen, sulfur, hydrogen, and oxygen cycles. These are the markers provided in the metabolic-markers directory that were mostly curated by Anantharaman et al. 2016. The usage is:

usage: summarize-metabolism [-h] [--input GENOMEDIR] [--output OUTPUT]
                            [--metadata METADATA] [--summary OUTFILE]
                            [--heatmap HEATOUT] [--aggregate AGG]
                            [--plotting PLOTTING] [--ordering ORDER]
                            [--group_list GROUPS]

Create a summary of metabolic markers spanning biogeochemical cycles

required arguments:
  --input GENOMEDIR    Directory where genomes to be screened are held
  --output OUTPUT      Directory to store results and intermediate files
  --metadata METADATA  Metadata file with taxonomical classifications or
                       groups associated with genome file names. Needs to be a
                       .csv file with the genome/filename (everything before
                       .fna) and the group name.

optional arguments:
  --summary OUTFILE    Summary file of metabolic marker statistics in CSV
                       format
  --heatmap HEATOUT    Summary heatmap of metabolic markers in PDF format. If
                       you provide a custom name, it must end in .pdf
  --aggregate AGG      Aggregate metadata names by group = ON, visualize each
                       genome individually = OFF
  --plotting PLOTTING  Option to turn plotting on or off depending on
                       if you want just the raw statistics to perform your own
                       plotting, or keep the default plotting settings. Options= ON,OFF.

plotting arguments:
  --ordering ORDER     Turn ON or OFF to control custom ordering of rows in
                       heatmap. By default is OFF to list in alphabetical
                       ordering.
  --group_list GROUPS  Provide .txt file of ordering of groups if turn the
                       aggregating option on to list the rows in custom order
                       instead of the default alphabetical ordering.

The required inputs are your genome directory, your desired output directory, and a set of metadata for your genomes. Your metadata file should be a CSV formatted file, with the first column the names of your genome excluding the extension of the filename (no .fna or .faa) and the second column the taxonomical name or group that you have identified your genome/MAG belonging to, or any other special identifier. For example, I have a list of genomes with the following names, and have listed their corresponding phyla:

bacteria00190,Actinobacteria
bacteria00193,Deltaproteobacteria
bacteria00203,Bacteroidetes
bacteria00229,Deltaproteobacteria
bacteria01060v2,Deltaproteobacteria
bacteria23257,Deltaproteobacteria
bacteria23258,Bacteroidetes
bacteria23259,Chloroflexi
bacteria23260,Deltaproteobacteria
bacteria23263,Deltaproteobacteria

Do not give the CSV file a header or specified column names, that is taken care of in downstream steps. If you do not have taxonomical names or particular names for each of your genomes, just provide the name of each file twice, such as:

bacteria00190,bacteria00190
bacteria00193,bacteria00193
bacteria00203,bacteria00203
bacteria00229,bacteria00229
bacteria01060v2,bacteria01060v2
bacteria23257,bacteria23257
bacteria23258,bacteria23258
bacteria23259,bacteria23259
bacteria23260,bacteria23260
bacteria23263,bacteria23263

By default the workflow will automatically name the resulting summary CSV file and heatmap in PDF format. If you want to change the name or location of these files, provide that in the --summary and --heatmap arguments, respectively, keeping in mind that each, respectively, needs to end in .csv and .pdf.

The --aggregate option is used if you want to summarize groups of genomes by a given common identifier. Say you have lots of genomes all belonging to different phyla as in the example above. If you want to visualize a large heatmap of approximately 100 genomes in different phyla, but want to make cross-phyla comparisions while keeping within phyla calculations, turn --aggregate ON to group by a specific identifier. See the Examples part of the Wiki to see how aggregating will change the heatmap.

Search custom markers

The search-custom-markers workflow will take any set of HMM profiles (with threshold cutoff scores provided either within the HMM file itself or an external file such as the KofamKOALA exec file) in the order you provide in a text file, search for those markers, and output the summary and heatmap visualization of the presence/absence of all markers across all genomes.

The usage for search-custom-markers is:

usage: search-custom-markers [-h] [--input GENOMEDIR] [--output OUTPUT]
                             [--markers_dir MARKERDIR]
                             [--markers_list MARKERLIST] [--metadata METADATA]
                             [--summary OUTFILE] [--heatmap HEATOUT]
                             [--kofam KOFAM] [--ko_list KOLIST]
                             [--aggregate AGG] [--plotting PLOTTING]
                             [--ordering ORDER] [--group_list GROUPS]

Search custom directory of HMMs

required arguments:
  --input GENOMEDIR     Directory where genomes to be screened are held
  --output OUTPUT       Directory to store results and intermediate files
  --markers_dir MARKERDIR
                        Directory where custom markers are held
  --markers_list MARKERLIST
                        Ordered list of markers to run custom search on
  --metadata METADATA   Metadata file with taxonomical classifications or
                        groups associated with genome file names

optional arguments:
  --summary OUTFILE     Output statistics of custom marker searches in .csv
                        format
  --heatmap HEATOUT     Summary heatmap of metabolic markers in PDF format. If
                        you provide a custom name, it must end in .pdf

kofam database arguments:
  --kofam KOFAM         Use KofamKOALA HMM distributions. By default is OFF,
                        options = ON,OFF
  --ko_list KOLIST      Point to location of the KofamKoala ko_list file if
                        using the KofamKOALA KEGG HMMs

plotting arguments:
  --aggregate AGG       Aggregate metadata names by group = ON, visualize each
                        genome individually = OFF
  --plotting PLOTTING   Option to turn plotting on or off depending on
                        if you want just the raw statistics to perform your
                        own plotting, or keep the default plotting settings.By default option
                        is ON. Options= ON,OFF.
  --ordering ORDER      Turn ON or OFF to control custom ordering of rows in
                        heatmap. By default is OFF to list in alphabetical
                        ordering.
  --group_list GROUPS   Provide .txt file of ordering of groups if turn the
                        aggregating option on to list the rows in custom order
                        instead of the default alphabetical ordering.

Again, you are required to provide the input and desired output directories. Additionally for this workflow, you will need to provide the directory that these custom or external markers are. DO NOT put them in the metabolic-markers folder, unless you are not planning on running the summarize-metabolism workflow, as their presence will mess things up with the other workflow. Instead, put them in a different directory, and point to that directory with the --markers_dir argument. Additionally, you will need to provide a list of the markers in that directory, in the order that you want them to appear in the heatmap in a .txt file with the --markers_list argument. Again, provide the --metadata file as formatted in the summarize-metabolism workflow.

As with the summarize-metabolism workflow, the workflow will by default write out the summary table and heatmap to the results/ folder. If you want to change the names or locations of these output files, provide them to the --summary and --heatmap arguments, respectively, and they need to end in .csv or .pdf, respectively.

metabolisHMM can also run, parse, and visualize sets of HMMs provided through KofamKOALA if you have the database and corresponding auxiliary files downloaded. With the --kofam argument, you will need to provide the path to the exec file, which has the threshold cutoffs for each curated HMM.

You can optionally turn the plotting function off if you just want to have the heatmap summary results in the raw format and perform your own plotting. You can also specify the order of the rows by turning ON the ordering argument, and then providing a text file of the order you want your rows of individual genomes or aggregated groups in the heatmap to be ordered in, depending on whether or not you have the --aggregate option ON or OFF. This way, you can put your groups or individual genomes in taxonomical order to match up with a tree of your choosing.

Clone this wiki locally