updates to docs

Jon Palmer · Jon Palmer · commit a8ed7904068c · 2018-01-09T17:42:16.000-05:00
diff --git a/docs/annotate.rst b/docs/annotate.rst
@@ -4,4 +4,80 @@
 Functional annotation
 ================================
  
+After your genome has gone through the gene prediction module and you have gene models that pass NCBI specs, the next step is to add functional annotation to the protein-coding genes. Funannotate accomplishes this using several curated databases and is run using the :code:`funannotate annotate` command.
+
+Funannotate will parse the protein-coding models from the annotation and identify Pfam domains, CAZymes, secreted proteins, proteases (MEROPS), and BUSCO groups.  If you provide InterProScan5 results via :code:`--iprscan`, funannotate will also generate additional annotation: InterPro terms, GO terms, and fungal transcription factors. If Eggnog-mapper is installed locally or you pass Eggnog results via :code:`--eggnog`, then Eggnog annotations and COGs will be added to the functional annotation.  The scripts will also parse UniProtKB/SwissProt searches, along with the optional Eggnog-mapper searches, to generate gene names and product descriptions.
+
+InterProScan5 and Eggnog-mapper are two functional annotation pipelines whose results can be parsed by funannotate; however, due to their large database sizes, they are not run directly.  If :code:`emapper.py` (Eggnog-mapper) is installed, it will be run automatically during the functional annotation process. Because InterProScan5 is Linux-only, it must be run outside funannotate and the results passed to the script. If you are on Mac, I've included a method to run InterProScan5 using Docker, and the :code:`funannotate predict` output will tell you how to run this script.  Alternatively, you can run the InterProScan5 search remotely using the :code:`funannotate remote` command.
+
+Phobius and SignalP will be run automatically if they are installed (i.e. in the PATH); however, Phobius will not run on Mac.  If you are on Mac, you can run Phobius with the :code:`funannotate remote` script.
+
+If you are annotating a fungal genome, you can run secondary metabolite gene cluster prediction using antiSMASH.  You can do this on the webserver by submitting your GBK file from predict (predict_results/yourGenome.gbk), or alternatively you can submit from the command line using :code:`funannotate remote`.  Of course, if you are on Linux you can install antiSMASH locally and run it that way as well.  The annotated GBK file is fed back to this script with the :code:`--antismash` option.
+
+Similarly to :code:`funannotate predict`, the output from :code:`funannotate annotate` is written to the output/annotate_results folder. The output files are:
+
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| **File Name**                      | **Description**                                                                                                                  |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.gbk                       | Annotated Genome in GenBank Flat File format                                                                                     |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.contigs.fsa               | Multi-fasta file of contigs, split at gaps (use for NCBI submission)                                                             |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.agp                       | AGP file; showing linkage/location of contigs (use for NCBI submission)                                                          |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.tbl                       | NCBI tbl annotation file (use for NCBI submission)                                                                               |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.sqn                       | NCBI Sequin genome file (use for NCBI submission)                                                                                |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.scaffolds.fa              | Multi-fasta file of scaffolds                                                                                                    |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.proteins.fa               | Multi-fasta file of protein coding genes                                                                                         |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.transcripts.fa            | Multi-fasta file of transcripts (mRNA)                                                                                           |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.discrepency.report.txt    | tbl2asn summary report of annotated genome                                                                                       |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Basename.annotations.txt           | TSV file of all annotations added to genome. (e.g. import into Excel)                                                            |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Gene2Products.must-fix.txt         | TSV file of Gene Name/Product deflines that failed to pass tbl2asn checks and must be fixed                                      |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Gene2Products.need-curating.txt    | TSV file of Gene Name/Product deflines that need to be curated                                                                   |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+| Gene2Products.new-names-passed.txt | TSV file of Gene Name/Product deflines that passed tbl2asn but are not in Gene2Products database. Please submit a PR with these. |
++------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
+
+.. code-block:: none
+
+    Usage:       funannotate annotate <arguments>
+    version:     1.0.1
+
+    Description: Script functionally annotates the results from funannotate predict.  It pulls
+                 annotation from PFAM, InterPro, EggNog, UniProtKB, MEROPS, CAZyme, and GO ontology.
+    
+    Required:    -i, --input        Folder from funannotate predict
+              or
+                 --genbank          Genome in GenBank format
+                 -o, --out          Output folder for results
+              or   
+                 --gff              Genome GFF3 annotation file
+                 --fasta            Genome in multi-fasta format
+                 -s, --species      Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
+                 -o, --out          Output folder for results
+
+    Optional:    --sbt              NCBI submission template file. (Recommended)
+                 -a, --annotations  Custom annotations (3 column tsv file)
+                 --eggnog           Eggnog-mapper annotations file (if NOT installed)
+                 --antismash        antiSMASH secondary metabolism results (GBK file from output)
+                 --iprscan          InterProScan5 XML file
+                 --phobius          Phobius pre-computed results (if phobius NOT installed)
+                 --isolate          Isolate name
+                 --strain           Strain name
+                 --busco_db         BUSCO models. Default: dikarya
+                 -t, --tbl2asn      Additional parameters for tbl2asn. Example: "-l paired-ends"
+                 -d, --database     Path to funannotate database. Default: $FUNANNOTATE_DB
+                 --force            Force over-write of output folder
+                 --cpus             Number of CPUs to use. Default: 2
+
+    ENV Vars:  If not specified at runtime, will be loaded from your $PATH  
+                 --AUGUSTUS_CONFIG_PATH
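
As an illustration of how the optional inputs described above fit together, here is a minimal dry-run sketch in POSIX shell. It only uses flags that appear in the usage text (:code:`-i`, :code:`--cpus`, :code:`--iprscan`, :code:`--eggnog`, :code:`--antismash`). The helper name :code:`build_annotate_cmd` and every folder and file name are hypothetical, and the sketch prints the command instead of running it.

```shell
# Dry-run sketch: assemble a `funannotate annotate` command from flags shown
# in the usage text. Folder and file names below are hypothetical.
build_annotate_cmd() (
    predict_dir="$1"
    cpus="$2"
    shift 2
    cmd="funannotate annotate -i ${predict_dir} --cpus ${cpus}"
    # Remaining arguments are "--flag=path" pairs; a pair is appended only
    # when the file actually exists.
    for pair in "$@"; do
        flag="${pair%%=*}"
        path="${pair#*=}"
        if [ -f "$path" ]; then
            cmd="${cmd} ${flag} ${path}"
        fi
    done
    printf '%s\n' "$cmd"
)

# Prints the command rather than running it.
build_annotate_cmd genome6 6 \
    --iprscan=genome6/iprscan.xml \
    --eggnog=genome6/eggnog.emapper.annotations \
    --antismash=genome6/antismash.gbk
```

Pre-computed results that are absent are simply left out, so the same call works whether or not InterProScan5, Eggnog-mapper, or antiSMASH results are available. Paths containing spaces are not handled by this sketch.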
 
diff --git a/docs/compare.rst b/docs/compare.rst
@@ -3,5 +3,26 @@
 
 Comparative genomics
 ================================
- 
+A typical workflow in a genomics project is to compare your newly sequenced/assembled/annotated genome to other organisms. The impetus behind :code:`funannotate compare` was that there was previously no way to easily compare multiple genomes. Funannotate stores all annotation in GenBank flat file format. While some people don't like this format because it is difficult to parse with standard unix tools, its main advantage is that the annotation can be stored in a standardized format and retrieved in the same way for each genome. GFF3 is the common output of many annotation tools; however, it doesn't work well for functional annotation because all of the "information" is stored in a single column.  At any rate, :code:`funannotate compare` can take either folders containing "funannotated" genomes or GBK files; the output is stats, graphs, CSV files, phylogeny, etc., all summarized in HTML format.
+
+.. code-block:: none
+    
+    Usage:       funannotate compare <arguments>
+    version:     1.0.1
+
+    Description: Script does light-weight comparative genomics between funannotated genomes.  Output
+                 is graphs, phylogeny, CSV files, etc --> visualized in web-browser.  
+    
+    Required:    -i, --input         List of funannotate genome folders or GBK files
+
+    Optional:    -o, --out           Output folder name. Default: funannotate_compare
+                 -d, --database      Path to funannotate database. Default: $FUNANNOTATE_DB
+                 --cpus              Number of CPUs to use. Default: 2
+                 --run_dnds          Calculate dN/dS ratio on all orthologs. [estimate,full]
+                 --go_fdr            P-value for FDR GO-enrichment. Default: 0.05
+                 --heatmap_stdev     Cut-off for heatmap. Default: 1.0
+                 --num_orthos        Number of Single-copy orthologs to use for RAxML. Default: 500
+                 --bootstrap         Number of bootstrap replicates to run with RAxML. Default: 100
+                 --outgroup          Name of species to use for RAxML outgroup. Default: no outgroup
+                 --proteinortho      ProteinOrtho5 POFF results.
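
The usage text above can be turned into a small dry-run helper. This is a sketch under assumptions: it assumes the list passed to :code:`-i` is space-separated, and it treats a folder as "funannotated" when it contains an annotate_results subfolder. The helper name :code:`build_compare_cmd` and all folder names are hypothetical.

```shell
# Dry-run sketch: keep only folders that look annotated, then print a
# `funannotate compare` command. Assumes -i takes a space-separated list.
build_compare_cmd() (
    out_dir="$1"
    cpus="$2"
    shift 2
    inputs=""
    for dir in "$@"; do
        # Assumption: an annotated genome folder holds annotate_results/.
        if [ -d "${dir}/annotate_results" ]; then
            inputs="${inputs} ${dir}"
        fi
    done
    if [ -z "$inputs" ]; then
        echo "no annotated genome folders found" >&2
        exit 1
    fi
    printf 'funannotate compare -i%s -o %s --cpus %s\n' "$inputs" "$out_dir" "$cpus"
)

# Example call (hypothetical folders), printing the command only:
#   build_compare_cmd funannotate_compare 2 genome1 genome2 genome3
```

Because the function body runs in a subshell, the :code:`exit 1` only ends the helper and returns a non-zero status to the caller.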
 
diff --git a/docs/homebrew.rst b/docs/homebrew.rst
@@ -7,7 +7,7 @@ While it seems that homebrew-science is being at least partially deprecated (or
 
 .. code-block:: none
 
-    brew tap homebrew/science && brew tap nextgenusfs/tap && brew update
+    brew tap brewsci/science && brew tap nextgenusfs/tap && brew update
     brew install funannotate
 
 This will automatically install most of the dependencies as well as the most current release of funannotate. Follow the instructions from homebrew, which are:
diff --git a/docs/predict.rst b/docs/predict.rst
@@ -128,17 +128,44 @@ Evidence Modeler builds consensus gene models and in addition to providing EVM w
     funannotate predict -i mygenome.fa -o output_folder -s "Aspergillus nidulans"
         --pasa_gff mypasamodels.gff3:8 --other_gff prediction.gff3:5
         
-**Submitting to NCBI, what should I know?**
+Submitting to NCBI, what should I know?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Funannotate will produce NCBI/GenBank submission-ready output; however, there are a few things you should do if you plan to submit to NCBI.
 
     1. Get a locus_tag number for your genome.
-        You do this by starting a WGS genome submission and either specifying a locus tag or one will be assigned to you. The default in funannotate is to use FUN_. 
+        You do this by starting a WGS genome submission and either specifying a locus tag or having one assigned to you. The default in funannotate is to use "FUN".
         
     2. Pre-submission inquiry of unannotated genome.
         If you are new to genome assembly/annotation submission, be aware that your assembly will have to undergo some quality checks before being accepted by NCBI. Sometimes this results in you having to update your assembly, e.g. removing contigs or splitting contigs where you have adapter contamination. If you have already done your annotation and then have to make these changes, it can be very difficult. Instead, you can start your WGS submission and request that the GenBank curators do a quality check on your assembly and fix any problems prior to generating annotation with funannotate.
     
     3. Generate an SBT template file: https://submit.ncbi.nlm.nih.gov/genbank/template/submission/
     
-    
+Explanation of the outputs:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The output of :code:`funannotate predict` is written to the output/predict_results folder, which contains:
+
++---------------------------------+----------------------------------------------+
+| **File Name**                   | **Description**                              |
++---------------------------------+----------------------------------------------+
+| Basename.gbk                    | Annotated Genome in GenBank Flat File format |
++---------------------------------+----------------------------------------------+
+| Basename.tbl                    | NCBI tbl annotation file                     |
++---------------------------------+----------------------------------------------+
+| Basename.gff3                   | Genome annotation in GFF3 format             |
++---------------------------------+----------------------------------------------+
+| Basename.scaffolds.fa           | Multi-fasta file of scaffolds                |
++---------------------------------+----------------------------------------------+
+| Basename.proteins.fa            | Multi-fasta file of protein coding genes     |
++---------------------------------+----------------------------------------------+
+| Basename.transcripts.fa         | Multi-fasta file of transcripts (mRNA)       |
++---------------------------------+----------------------------------------------+
+| Basename.discrepency.report.txt | tbl2asn summary report of annotated genome   |
++---------------------------------+----------------------------------------------+
+| Basename.error.summary.txt      | tbl2asn error summary report                 |
++---------------------------------+----------------------------------------------+
+| Basename.validation.txt         | tbl2asn genome validation report             |
++---------------------------------+----------------------------------------------+
+
+
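
A quick way to confirm that a predict run produced the core files in the table above is to check for them before moving on to annotation. The following is a minimal sketch, not part of funannotate: the helper name :code:`check_predict_outputs` is hypothetical, and "Basename" is assumed to be whatever prefix your run used.

```shell
# Sketch: report which core predict outputs are missing or empty.
# Arguments: the predict_results folder and the file basename.
check_predict_outputs() (
    results_dir="$1"
    base="$2"
    missing=0
    for suffix in gbk tbl gff3 scaffolds.fa proteins.fa transcripts.fa; do
        f="${results_dir}/${base}.${suffix}"
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected file is present and non-empty.
    [ "$missing" -eq 0 ]
)

# Example call (hypothetical paths):
#   check_predict_outputs output/predict_results Basename
```

The tbl2asn report files are left out of the check here, since which reports are worth gating on is a judgment call for each project.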
 
diff --git a/sample_data/run_unit_tests.sh b/sample_data/run_unit_tests.sh
@@ -45,7 +45,7 @@ echo $cmd; eval $cmd
 #test RNAseq modules
 cmd='funannotate train -i genome6.fasta -l genome6_R1.fq.gz -r genome6_R2.fq.gz --stranded RF --species "Rubeus macgubis" --cpus 6 -o genome6'
 echo $cmd; eval $cmd
-cmd='funannotate predict -i genome6 --transcript_evidence genome6/training/funannotate_train.trinity-GG.fasta --rna_bam genome6/training/funannotate_train.coordSorted.bam --pasa_gff genome6/training/funannotate_train.pasa.gff3 -o genome6 -s "Rubeus macgubis" --cpus 6'
+cmd='funannotate predict -i genome6.fasta --transcript_evidence genome6/training/funannotate_train.trinity-GG.fasta --rna_bam genome6/training/funannotate_train.coordSorted.bam --pasa_gff genome6/training/funannotate_train.pasa.gff3 -o genome6 -s "Rubeus macgubis" --cpus 6'
 echo $cmd; eval $cmd
 cmd='funannotate update -i genome6 --cpus 6'
 echo $cmd; eval $cmd