Commit a8ed790

Jon Palmer authored and committed
updates to docs
1 parent f63fe9e commit a8ed790

5 files changed

+130
-6
lines changed

docs/annotate.rst

+76
@@ -4,4 +4,80 @@
44
Functional annotation
55
================================
66

7+
After your genome has gone through the gene prediction module and you have gene models that pass NCBI specs, the next step is to add functional annotation to the protein-coding genes. Funannotate accomplishes this using several curated databases and is run using the :code:`funannotate annotate` command.
8+
9+
Funannotate will parse the protein-coding models from the annotation and identify Pfam domains, CAZymes, secreted proteins, proteases (MEROPS), and BUSCO groups. If you provide the script with InterProScan5 data via :code:`--iprscan`, funannotate will also generate additional annotation: InterPro terms, GO terms, and fungal transcription factors. If Eggnog-mapper is installed locally or you pass Eggnog-mapper results via :code:`--eggnog`, then Eggnog annotations and COGs will be added to the functional annotation. The scripts will also combine UniProtKB/SwissProt searches with the (optional) Eggnog-mapper searches to generate gene names and product descriptions.
10+
11+
InterProScan5 and Eggnog-mapper are two functional annotation pipelines whose output funannotate can parse; however, due to their large database sizes, they are not run directly. The exception is Eggnog-mapper: if :code:`emapper.py` is installed, it will be run automatically during the functional annotation process. Because InterProScan5 is Linux only, it must be run outside funannotate and the results passed to the script. If you are on Mac, I've included a method to run InterProScan5 using Docker, and the :code:`funannotate predict` output will tell you how to run this script. Alternatively, you can run the InterProScan5 search remotely using the :code:`funannotate remote` command.
12+
13+
Phobius and SignalP will be run automatically if they are installed (i.e. in the PATH); however, Phobius will not run on Mac. If you are on Mac, you can run Phobius with the :code:`funannotate remote` script.
14+
15+
If you are annotating a fungal genome, you can run secondary metabolite gene cluster prediction using antiSMASH. This can be done on the webserver by submitting your GBK file from predict (predict_results/yourGenome.gbk), or alternatively you can submit from the command line using :code:`funannotate remote`. Of course, if you are on Linux you can install antiSMASH locally and run it that way as well. The annotated GBK file is fed back to this script with the :code:`--antismash` option.
16+
17+
Similarly to :code:`funannotate predict`, the output from :code:`funannotate annotate` is written to the output/annotate_results folder. The output files are:
18+
19+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
20+
| **File Name** | **Description** |
21+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
22+
| Basename.gbk | Annotated Genome in GenBank Flat File format |
23+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
24+
| Basename.contigs.fsa | Multi-fasta file of contigs, split at gaps (use for NCBI submission) |
25+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
26+
| Basename.agp | AGP file; showing linkage/location of contigs (use for NCBI submission) |
27+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
28+
| Basename.tbl | NCBI tbl annotation file (use for NCBI submission) |
29+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
30+
| Basename.sqn | NCBI Sequin genome file (use for NCBI submission) |
31+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
32+
| Basename.scaffolds.fa | Multi-fasta file of scaffolds |
33+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
34+
| Basename.proteins.fa | Multi-fasta file of protein coding genes |
35+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
36+
| Basename.transcripts.fa | Multi-fasta file of transcripts (mRNA) |
37+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
38+
| Basename.discrepency.report.txt | tbl2asn summary report of annotated genome |
39+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
40+
| Basename.annotations.txt | TSV file of all annotations added to genome (e.g. for import into Excel) |
41+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
42+
| Gene2Products.must-fix.txt | TSV file of Gene Name/Product deflines that failed to pass tbl2asn checks and must be fixed |
43+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
44+
| Gene2Products.need-curating.txt | TSV file of Gene Name/Product deflines that need to be curated |
45+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
46+
| Gene2Products.new-names-passed.txt | TSV file of Gene Name/Product deflines that passed tbl2asn but are not in Gene2Products database. Please submit a PR with these. |
47+
+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
48+
49+
.. code-block:: none
50+
51+
Usage: funannotate annotate <arguments>
52+
version: 1.0.1
53+
54+
Description: Script functionally annotates the results from funannotate predict. It pulls
55+
annotation from PFAM, InterPro, EggNog, UniProtKB, MEROPS, CAZyme, and GO ontology.
56+
57+
Required: -i, --input Folder from funannotate predict
58+
or
59+
--genbank Genome in GenBank format
60+
-o, --out Output folder for results
61+
or
62+
--gff Genome GFF3 annotation file
63+
--fasta Genome in multi-fasta format
64+
-s, --species Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
65+
-o, --out Output folder for results
66+
67+
Optional: --sbt NCBI submission template file. (Recommended)
68+
-a, --annotations Custom annotations (3 column tsv file)
69+
--eggnog Eggnog-mapper annotations file (if NOT installed)
70+
--antismash antiSMASH secondary metabolism results (GBK file from output)
71+
--iprscan InterProScan5 XML file
72+
--phobius Phobius pre-computed results (if phobius NOT installed)
73+
--isolate Isolate name
74+
--strain Strain name
75+
--busco_db BUSCO models. Default: dikarya
76+
-t, --tbl2asn Additional parameters for tbl2asn. Example: "-l paired-ends"
77+
-d, --database Path to funannotate database. Default: $FUNANNOTATE_DB
78+
--force Force over-write of output folder
79+
--cpus Number of CPUs to use. Default: 2
80+
81+
ENV Vars: If not specified at runtime, will be loaded from your $PATH
82+
--AUGUSTUS_CONFIG_PATH
783
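As a rough illustration of how the options above fit together, the following sketch assembles a :code:`funannotate annotate` call that passes pre-computed InterProScan5 and antiSMASH results. All file and folder names are hypothetical placeholders, and the choice of options is only an assumption about a typical run; the flags themselves are taken from the usage text above. The command is only printed unless :code:`RUN=1` is set.

```shell
# Hypothetical inputs; substitute your own paths.
predict_dir="output_folder"            # folder produced by funannotate predict
iprscan_xml="iprscan_results.xml"      # InterProScan5 XML computed outside funannotate
antismash_gbk="antismash_results.gbk"  # annotated GBK file returned by antiSMASH
sbt_file="template.sbt"                # NCBI submission template

# Build the command from flags listed in the usage text.
cmd="funannotate annotate -i $predict_dir --iprscan $iprscan_xml --antismash $antismash_gbk --sbt $sbt_file --cpus 6"

# Always show the command; execute it only when RUN=1 is set.
echo "$cmd"
if [ "${RUN:-0}" = "1" ]; then
    eval "$cmd"
fi
```

Printing the command before running it mirrors the :code:`echo $cmd; eval $cmd` pattern used in sample_data/run_unit_tests.sh.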

docs/compare.rst

+22-1
@@ -3,5 +3,26 @@
33

44
Comparative genomics
55
================================
6-
6+
A typical workflow in a genomics project is to compare your newly sequenced, assembled, and annotated genome to other organisms. The impetus behind :code:`funannotate compare` was that there was previously no easy way to compare multiple genomes. Funannotate stores all annotation in GenBank flat file format. While some people dislike this format because it is difficult to parse with standard Unix tools, its main advantage is that the annotation is stored in a standardized format and can be retrieved in the same way for each genome. GFF3 is the common output of many annotation tools; however, it does not work well for functional annotation because all of the "information" is stored in a single column. At any rate, :code:`funannotate compare` takes either folders containing "funannotated" genomes or GBK files and produces stats, graphs, CSV files, a phylogeny, etc., all summarized in HTML format.
7+
8+
.. code-block:: none
9+
10+
Usage: funannotate compare <arguments>
11+
version: 1.0.1
12+
13+
Description: Script does light-weight comparative genomics between funannotated genomes. Output
14+
is graphs, phylogeny, CSV files, etc --> visualized in web-browser.
15+
16+
Required: -i, --input List of funannotate genome folders or GBK files
17+
18+
Optional: -o, --out Output folder name. Default: funannotate_compare
19+
-d, --database Path to funannotate database. Default: $FUNANNOTATE_DB
20+
--cpus Number of CPUs to use. Default: 2
21+
--run_dnds Calculate dN/dS ratio on all orthologs. [estimate,full]
22+
--go_fdr P-value for FDR GO-enrichment. Default: 0.05
23+
--heatmap_stdev Cut-off for heatmap. Default: 1.0
24+
--num_orthos Number of Single-copy orthologs to use for RAxML. Default: 500
25+
--bootstrap Number of bootstrap replicates to run with RAxML. Default: 100
26+
--outgroup Name of species to use for RAxML outgroup. Default: no outgroup
27+
--proteinortho ProteinOrtho5 POFF results.
728
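A minimal sketch of a comparison run, assuming that :code:`-i` accepts a space-separated list mixing funannotate output folders and GBK files. The genome names below are hypothetical; the flags and default values come from the usage text above. The command is only printed unless :code:`RUN=1` is set.

```shell
# Hypothetical inputs: two funannotate output folders and one GenBank file.
genomes="genome1 genome2 genome3.gbk"
out_dir="funannotate_compare"   # same as the documented default

# Build the command; --num_orthos and --bootstrap are shown at their documented defaults.
cmd="funannotate compare -i $genomes -o $out_dir --cpus 6 --num_orthos 500 --bootstrap 100"

# Always show the command; execute it only when RUN=1 is set.
echo "$cmd"
if [ "${RUN:-0}" = "1" ]; then
    eval "$cmd"
fi
```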

docs/homebrew.rst

+1-1
@@ -7,7 +7,7 @@ While it seems that homebrew-science is being at least partially deprecated (or
77

88
.. code-block:: none
99
10-
brew tap homebrew/science && brew tap nextgenusfs/tap && brew update
10+
brew tap brewsci/science && brew tap nextgenusfs/tap && brew update
1111
brew install funannotate
1212
1313
This will automatically install most of the dependencies as well as the most current release of funannotate. Follow the instructions from homebrew, which are:

docs/predict.rst

+30-3
@@ -128,17 +128,44 @@ Evidence Modeler builds consensus gene models and in addition to providing EVM w
128128
funannotate predict -i mygenome.fa -o output_folder -s "Aspergillus nidulans"
129129
--pasa_gff mypasamodels.gff3:8 --other_gff prediction.gff3:5
130130
131-
**Submitting to NCBI, what should I know?**
131+
Submitting to NCBI, what should I know?
132+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
132133

133134
Funannotate will produce NCBI/GenBank submission-ready output; however, there are a few things you should do if you plan on submitting to NCBI.
134135

135136
1. Get a locus_tag number for your genome.
136-
You do this by starting a WGS genome submission and either specifying a locus tag or one will be assigned to you. The default in funannotate is to use FUN_.
137+
You do this by starting a WGS genome submission and either specifying a locus tag or having one assigned to you. The default in funannotate is to use "FUN".
137138

138139
2. Pre-submission inquiry of unannotated genome.
139140
If you are new to genome assembly/annotation submission, be aware that your assembly will have to undergo some quality checks before being accepted by NCBI. Sometimes this results in you having to update your assembly, i.e. remove contigs, split contigs where you have adapter contamination, etc. If you have already done your annotation and then have to make these changes, it can be very difficult. Instead, you can start your WGS submission and request that the GenBank curators do a quality check on your assembly, then fix any problems before generating annotation with funannotate.
140141

141142
3. Generate an SBT template file. https://submit.ncbi.nlm.nih.gov/genbank/template/submission/
142143

143-
144+
Explanation of the outputs:
145+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
146+
The output of :code:`funannotate predict` is written to the output/predict_results folder, which contains:
147+
148+
+---------------------------------+----------------------------------------------+
149+
| **File Name** | **Description** |
150+
+---------------------------------+----------------------------------------------+
151+
| Basename.gbk | Annotated Genome in GenBank Flat File format |
152+
+---------------------------------+----------------------------------------------+
153+
| Basename.tbl | NCBI tbl annotation file |
154+
+---------------------------------+----------------------------------------------+
155+
| Basename.gff3 | Genome annotation in GFF3 format |
156+
+---------------------------------+----------------------------------------------+
157+
| Basename.scaffolds.fa | Multi-fasta file of scaffolds |
158+
+---------------------------------+----------------------------------------------+
159+
| Basename.proteins.fa | Multi-fasta file of protein coding genes |
160+
+---------------------------------+----------------------------------------------+
161+
| Basename.transcripts.fa | Multi-fasta file of transcripts (mRNA) |
162+
+---------------------------------+----------------------------------------------+
163+
| Basename.discrepency.report.txt | tbl2asn summary report of annotated genome |
164+
+---------------------------------+----------------------------------------------+
165+
| Basename.error.summary.txt | tbl2asn error summary report |
166+
+---------------------------------+----------------------------------------------+
167+
| Basename.validation.txt | tbl2asn genome validation report |
168+
+---------------------------------+----------------------------------------------+
169+
170+
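One way to confirm that a prediction run produced every file listed in the table above is a small shell check such as the sketch below. The function name :code:`check_predict_outputs` is hypothetical and is not part of funannotate; the file suffixes are copied from the table, and the base name is whatever prefix your run used.

```shell
# Hypothetical helper (not part of funannotate): report which of the
# documented predict outputs are missing from a results folder.
# Usage: check_predict_outputs <predict_results_dir> <basename>
check_predict_outputs() {
    results_dir=$1
    base=$2
    missing=0
    for suffix in gbk tbl gff3 scaffolds.fa proteins.fa transcripts.fa \
                  discrepency.report.txt error.summary.txt validation.txt; do
        if [ ! -e "$results_dir/$base.$suffix" ]; then
            echo "missing: $results_dir/$base.$suffix"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected file is present.
    [ "$missing" -eq 0 ]
}

# Example call (paths are placeholders):
#   check_predict_outputs output_folder/predict_results Basename || echo "incomplete run"
```

The function returns a non-zero status when any file is absent, so it can gate a later :code:`funannotate annotate` step in a script.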
144171

sample_data/run_unit_tests.sh

+1-1
@@ -45,7 +45,7 @@ echo $cmd; eval $cmd
4545
#test RNAseq modules
4646
cmd='funannotate train -i genome6.fasta -l genome6_R1.fq.gz -r genome6_R2.fq.gz --stranded RF --species "Rubeus macgubis" --cpus 6 -o genome6'
4747
echo $cmd; eval $cmd
48-
cmd='funannotate predict -i genome6 --transcript_evidence genome6/training/funannotate_train.trinity-GG.fasta --rna_bam genome6/training/funannotate_train.coordSorted.bam --pasa_gff genome6/training/funannotate_train.pasa.gff3 -o genome6 -s "Rubeus macgubis" --cpus 6'
48+
cmd='funannotate predict -i genome6.fasta --transcript_evidence genome6/training/funannotate_train.trinity-GG.fasta --rna_bam genome6/training/funannotate_train.coordSorted.bam --pasa_gff genome6/training/funannotate_train.pasa.gff3 -o genome6 -s "Rubeus macgubis" --cpus 6'
4949
echo $cmd; eval $cmd
5050
cmd='funannotate update -i genome6 --cpus 6'
5151
echo $cmd; eval $cmd
