Skip to content

Workflows

Elizabeth McDaniel edited this page Sep 17, 2019 · 15 revisions

There are four main workflows within metabolisHMM:

  1. single-marker-phylogeny
  2. create-genome-phylogeny
  3. summarize-metabolism
  4. search-custom-markers

Each workflow is written as an independent python script, and the two visualization workflows (summarize-metabolism and search-custom-markers) also run the make-heatmap.R script to produce heatmap visualizations of marker presence/absence. For each workflow, you can see the usage and optional arguments with for example python single-marker-phylogeny.py or python single-marker-phylogeny -h. Although all the workflows together provide a toolkit for summarizing and visualizing metabolic marker phylogenies and presence/absence heatmaps, they are treated as separate workflows. Therefore, when running a different workflow, you should make a different output directory corresponding to that workflow's specific results. If an output directory already exists, say from a previous workflow run, you will get an error that the directory already exists.

Additionally each workflow will reformat the FASTA file headers for each genome so as not to cause downstream problems. Raw genome FASTA files ending in .fna will go through gene prediction with Prodigal, or genomes that you have already performed functional annotation for and have files ending in .faa can be used as the starting point. In either case, the reformatted FASTA files are in the genomes/ folder of your output directory that you designated. If you have genome names with dashes or hyphens in them, those need to be changed to underscores or taken out completely. For example you can change all FASTA files in a directory to replace - with _ with rename 's/-/_/g' *. This is important for how results are parsed in downstream steps.

If you have a lot of genomes or markers to run through, or choose the option to construct the tree with RAxML, you can always run the workflow in the background for example with nohup python single-marker-phylogeny.py <all your options> & so that the workflow will keep running even if your connection to the remote server is closed or your desktop falls asleep.

Single marker phylogeny

Say you have a marker of interest, such as one of the steps in the nitrogen fixation pathway, and you want to understand the distribution and phylogeny of that marker in your set of genomes. To get a list of your genomes that have the specific marker and a resulting phylogeny of that marker for your genomes, run python single-marker-phylogeny.py.

The usage for single-marker-phylogeny.py looks as such:

usage: single-marker-phylogeny.py [-h] [--input INPUT] [--output OUTPUT]
                                  [--marker MARKER] [--list LIST]
                                  [--phylogeny PHY] [--threads THREADS]

Create phylogeny of single marker

required arguments:
  --input INPUT      Directory where genomes to be screened are held
  --output OUTPUT    Directory to store results and intermediate files
  --marker MARKER    Location of single marker to run analysis on
  --phylogeny PHY    fastree or raxml, choose one

optional arguments:
  --list LIST        Output list of hits locus tags
  --threads THREADS  number of threads for tree making

You are required to provide the input and name of your desired output directory, the specific marker you want to run the analysis on, and the phylogenetic tool to construct the phylogenetic tree. The options for FastTree and RAxML are given. It is good practice to run FastTree for preliminary results to see how the tree looks at first and identify any spurious branches, but for best results, it is best to use RAxML to have bootstrap support. If you are on a server with multiple CPUs, you can add the --threads # option, where # = the number of threads you want to use, where this is only taken into account for constructing the tree. If you are on a local desktop, FastTree will run quickly, but you are limited by the number of threads you can use for RAxML (usually 4 on a standard laptop), so keep this in mind for computational requirements and time needed to complete the run.

Create genome phylogeny

The create-genome-phylogeny workflow uses a set of ribosomal protein HMM markers to make a ribosomal tree of a given set of genomes. This is useful if you want to compare the phylogeny of the output for a single marker in the above workflow with the true ribosomal phylogeny of the set of genomes to make evolutionary comparisons based on tree topology.

The usage for create-genome-phylogeny.py is:

usage: create-genome-phylogeny.py [-h] [--input INPUT] [--output OUTPUT]
                                  [--domain DOMAIN] [--phylogeny PHY]
                                  [--threads THREADS] [--loci LOCI]

Create ribosomal phylogenies using specific ribosomal markers for archaea
and/or bacteria

required arguments:
  --input INPUT      Directory where genomes to be screened are held
  --output OUTPUT    Directory to store results and intermediate files
  --domain DOMAIN    archaea, bacteria, all
  --phylogeny PHY    fastree, raxml

optional arguments:
  --threads THREADS  Optional: number of threads for calculating a tree using
                     RAxML. This is not taken into account using Fastree
  --loci LOCI        Output genomes with less than x number of loci. By
                     default prints genomes that have less than 12 ribosomal
                     loci markers.

As with the single-marker-phylogeny workflow, you must provide the input and output directories, and which program you want to use to construct the phylogenetic tree (fastree or raxml). The other required argument for this workflow is the domain that your genomes belong to. Specific ribosomal HMM profiles are specific to bacterial genomes, and therefore need to be left out if you have archaeal genomes. If you have either all bacteria or all archaea, enter the domain for which your genomes belong to. If you have both bacteria and archaea for which you want to construct a/multiple ribosomal trees for, you have two options: either select the all option to run the markers that are suited for both bacteria and archaea, or run the entire workflow separately for your bacterial and archaeal genomes.

The number of threads is optional, and will only be used if you choose to run the tree with RAxML. Note that the ribosomal tree will take much longer to construct than those for single markers, and running RAxML may not be feasible on a local desktop, or even a remote server with usually a max of 16 CPUs depending on how many genomes you have. A good alternative is the CIPRES server that you can manually run RAxML on a given alignment. All intermediate files such as the concatenated alignment are provided in the out/ directory within the output directory that you specified.

Both phylogeny workflows will provide the HMM outputs, FASTA files of hits for each marker, single alignments for each marker, the concatenated alignment, and the tree file in the out/ and results/ folders of your output directory.

Summarize metabolism

The summarize-metabolism workflow summarizes the metabolic capabilities of a given set of genomes covering the carbon, nitrogen, sulfur, hydrogen, and oxygen cycles. These are the markers provided in the metabolic-markers directory that were mostly curated by Anantharaman et al. 2016. The usage is:

usage: summarize-metabolism.py [-h] [--input GENOMEDIR] [--output OUTPUT]
                               [--summary OUTFILE] [--heatmap HEATOUT]
                               [--metadata METADATA] [--aggregate AGG]

Create a summary of metabolic markers spanning biogeochemical cycles

required arguments:
  --input GENOMEDIR    Directory where genomes to be screened are held
  --output OUTPUT      Directory to store results and intermediate files
  --metadata METADATA  Metadata file with taxonomical classifications or
                       groups associated with genome file names. Needs to be a
                       .csv file with the genome/filename (everything before
                       .fna) and the group name.

optional arguments:
  --summary OUTFILE    Summary file of metabolic marker statistics in CSV
                       format
  --heatmap HEATOUT    Summary heatmap of metabolic markers in PDF format. If
                       you provide a custom name, it must end in .pdf
  --aggregate AGG      Aggregate metadata names by group = ON, visualize each
                       genome individually = OFF

The required inputs are your genome directory, your desired output directory, and a set of metadata for your genomes. Your metadata file should be a CSV formatted file, with the first column the names of your genome excluding the extension of the filename (no .fna or .faa) and the second column the taxonomical name or group that you have identified your genome/MAG belonging to, or any other special identifier. For example, I have a list of genomes with the following names, and have listed their corresponding phyla:

bacteria00190,Actinobacteria
bacteria00193,Deltaproteobacteria
bacteria00203,Bacteroidetes
bacteria00229,Deltaproteobacteria
bacteria01060v2,Deltaproteobacteria
bacteria23257,Deltaproteobacteria
bacteria23258,Bacteroidetes
bacteria23259,Chloroflexi
bacteria23260,Deltaproteobacteria
bacteria23263,Deltaproteobacteria

Do not give the CSV file a header or specified column names, that is taken care of in downstream steps. If you do not have taxonomical names or particular names for each of your genomes, just provide the name of each file twice, such as:

bacteria00190,bacteria00190
bacteria00193,bacteria00193
bacteria00203,bacteria00203
bacteria00229,bacteria00229
bacteria01060v2,bacteria01060v2
bacteria23257,bacteria23257
bacteria23258,bacteria23258
bacteria23259,bacteria23259
bacteria23260,bacteria23260
bacteria23263,bacteria23263

By default the workflow will automatically name the resulting summary CSV file and heatmap in PDF format. If you want to change the name or location of these files, provide that in the --summary and --heatmap arguments, respectively, keeping in mind that each, respectively, needs to end in .csv and .pdf.

The --aggregate option is used if you want to summarize groups of genomes by a given common identifier. Say you have lots of genomes all belonging to different phyla as in the example above. If you want to visualize a large heatmap of approximately 100 genomes in different phyla, but want to make cross-phyla comparsions while keeping within phyla calculations, turn --aggregate ON to group by a specific identifier. See the Examples part of the Wiki to see how aggregating will change the heatmap.

Search custom markers

The search-custom-markers workflow will take any set of HMM profiles (with threshold cutoff scores provided either within the HMM file itself or an external file such as the KofamKOALA exec file) in the order you provide in a text file, search for those markers, and output the summary and heatmap visualization of the presence/absence of all markers across all genomes.

The usage for search-custom-markers.py is:

usage: search-custom-markers.py [-h] [--input GENOMEDIR] [--output OUTPUT]
                                [--markers_dir MARKERDIR]
                                [--markers_list MARKERLIST]
                                [--summary OUTFILE] [--heatmap HEATOUT]
                                [--metadata METADATA] [--aggregate AGG]
                                [--kofam KOFAM]

Search custom directory of HMMs

required arguments:
  --input GENOMEDIR     Directory where genomes to be screened are held
  --output OUTPUT       Directory to store results and intermediate files
  --markers_dir MARKERDIR
                        Directory where custom markers are held
  --markers_list MARKERLIST
                        Ordered list of markers to run custom search on
  --metadata METADATA   Metadata file with taxonomical classifications or
                        groups associated with genome file names

optional arguments:
  --summary OUTFILE     Output statistics of custom marker searches in .csv
                        format
  --heatmap HEATOUT     Summary heatmap of metabolic markers in PDF format. If
                        you provide a custom name, it must end in .pdf
  --aggregate AGG       Aggregate metadata names by group = ON, visualize each
                        genome individually = OFF
  --kofam KOFAM         Path to KofamKOALA description list with threshold
                        cutoff scores to use with HMM profiles

Again, you are required to provide the input and desired output directories. Additionally for this workflow, you will need to provide the directory that these custom or external markers are. DO NOT put them in the metabolic-markers folder, unless you are not planning on running the summarize-metabolism workflow, as their presence will mess things up with the other workflow. Instead, put them in a different directory, and point to that directory with the --markers_dir argument. Additionally, you will need to provide a list of the markers in that directory, in the order that you want them to appear in the heatmap in a .txt file with the --markers_list argument.

As with the summarize-metabolism workflow, the workflow will by default write out the summary table and heatmap to the results/ folder. If you want to change the names or locations of these output files, provide them to the --summary and --heatmap arguments, respectively, and they need to end in .csv or .pdf, respectively.

metabolisHMM can also run, parse, and visualize sets of HMMs provided through KofamKOALA if you have the database and corresponding auxiliary files downloaded. With the --kofam argument, you will need to provide the path to the exec file, which has the threshold cutoffs for each curated HMM.

Clone this wiki locally