Command line options
Commands are passed as parameters on the command line and select the task that the program runs.
The help options list can be printed on the console via:
# help for general options
phabox2 --help
# help for specific options
phabox2 --task [task] -h
# Example:
phabox2 --task phamer -h
phabox2 --task phagcn -h
The options are also listed below for reference.
The following parameters are common when running phabox2:
--task
Select a program to run:
end_to_end || Run phamer, phagcn, phatyp, phavip, and cherry once (default)
phamer || Virus identification
phagcn || Taxonomy classification
phatyp || Lifestyle prediction
cherry || Host prediction
phavip || Protein annotation
contamination || Contamination/provirus detection
votu || vOTU grouping (ANI-based or AAI-based)
tree || Build phylogenetic trees based on marker genes
--dbdir
Path of downloaded phabox2 database directory (required)
--outpth
Root path for the output folder (required)
All the results, including intermediate files and final predictions, are stored in this folder.
--contigs
Path of the input FASTA file (required)
--proteins
FASTA file of predicted proteins. (optional)
--midfolder
Midfolder for intermediate files. (optional)
This folder will be created within the --outpth to store intermediate files.
--len
Filter the length of contigs || default: 3000
Contigs shorter than this value are not processed.
--threads
Number of threads to use || default: all available threads
Please note that the end_to_end task runs phamer, phagcn, cherry, phatyp, and phavip together, so each task's options can also be used with end_to_end.
In addition, contigs predicted as non-virus or with low confidence are not used in the subsequent taxonomy, host, and lifestyle prediction tasks.
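As a minimal sketch of how the common parameters fit together, the snippet below assembles an end_to_end invocation and prints it before running. The paths (phabox_db_v2, contigs.fa, phabox_out) and the thread count are placeholders, not values from this documentation; only the flags come from the list above.

```shell
# Placeholder paths: replace with your own database, input, and output locations.
dbdir="phabox_db_v2"
contigs="contigs.fa"
outpth="phabox_out"

# Build the argument list with `set --` so paths containing spaces stay intact.
set -- phabox2 --task end_to_end \
    --dbdir "$dbdir" \
    --outpth "$outpth" \
    --contigs "$contigs" \
    --len 3000 \
    --threads 8

# Print the command for inspection (a dry run).
preview="$*"
echo "$preview"

# Uncomment to execute once the preview looks right:
# "$@"
```

Printing the command first makes it easy to check the task name and paths before starting a long run.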
The following parameters will be used in specific tasks:
Usage: phabox2 --task phamer [options]
In-task options:
--reject
Reject sequences in which the percentage of proteins aligned to known phages is smaller than this value.
Default: 10
Range from 0 to 20
If the proportion is too low, the prediction for downstream analysis will be unreliable.
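To illustrate the rule described for --reject, the following sketch computes the aligned-protein percentage for a made-up contig and compares it with the default threshold. The protein counts are hypothetical, and this is only an illustration of the stated rule; how PhaMer itself counts proteins or rounds the percentage is not described here.

```shell
# Hypothetical contig: 40 predicted proteins, 3 of which align to known phages.
total_proteins=40
aligned_proteins=3
reject=10   # default value of --reject

# Integer percentage (shell arithmetic truncates: 3*100/40 -> 7).
pct=$(( aligned_proteins * 100 / total_proteins ))

# The contig is rejected when the percentage is smaller than the threshold.
if [ "$pct" -lt "$reject" ]; then
    verdict="rejected"
else
    verdict="kept"
fi
echo "aligned: ${pct}% (threshold ${reject}%) -> ${verdict}"
```

With these numbers the contig falls below the default threshold; lowering --reject would let such contigs through at the cost of less reliable downstream predictions.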
Usage: phabox2 --task phagcn [options]
In-task options:
The options below are used to generate a network for virus-virus connections. The current parameters are optimized for ICTV 2024 and are highly accurate for grouping genus-level vOTUs. Before changing them, make sure you fully understand what they do.
--aai
Average amino acid identity || default: 75 || range from 0 to 100
--share
Minimum shared number of proteins || default: 15 || range from 0 to 100
--pcov
Protein-based coverage || default: 80 || range from 0 to 100
--draw
Draw network examples for the query virus relationship. || default: N || Y or N
--draw plots sub-networks containing the query virus. We use it to generate visualizations for our web server.
However, it only plots the 10 largest sub-networks, so we do not recommend it for general use.
We have provided the complete network for visualization (the network_edges.tsv and network_nodes.tsv files);
please check it out via: here
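The sketch below shows how the network options could be passed to the phagcn task. It sets each option to its documented default, so it should behave like a run without them; the paths are placeholders and not part of this documentation.

```shell
# Placeholder paths: replace with your own.
dbdir="phabox_db_v2"
contigs="contigs.fa"
outpth="phabox_out"

# Network options written out explicitly at their documented defaults.
set -- phabox2 --task phagcn \
    --dbdir "$dbdir" \
    --outpth "$outpth" \
    --contigs "$contigs" \
    --aai 75 \
    --share 15 \
    --pcov 80 \
    --draw N

# Dry run: print the command instead of executing it.
preview="$*"
echo "$preview"

# Uncomment to execute:
# "$@"
```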
Usage: phabox2 --task cherry [options]
In-task options:
The options below are used to generate a network for virus-virus connections. The current parameters are optimized for ICTV 2024 and are highly accurate for grouping genus-level vOTUs. Before changing them, make sure you fully understand what they do.
--aai
Average amino acid identity || default: 75 || range from 0 to 100
--share
Minimum shared number of proteins || default: 15 || range from 0 to 100
--pcov
Protein-based coverage || default: 80 || range from 0 to 100
--draw
Draw network examples for the query virus relationship. || default: N || Y or N
--draw plots sub-networks containing the query virus. We use it to generate visualizations for our web server.
However, it only plots the 10 largest sub-networks, so we do not recommend it for general use.
We have provided the complete network for visualization (the network_edges.tsv and network_nodes.tsv files);
please check it out via: here
The options below are used to predict hosts based on MAGs.
--bfolder
Path to the folder that contains MAGs || default: None
--magonly
Only predict hosts based on the provided MAGs || default: N || Y or N
Y will only predict the host based on the provided MAGs
N will predict the host based on the MAGs and the reference database
The options below are used to align contigs to CRISPRs.
--cpident
Alignment identity for CRISPRs || default: 90 || range from 90 to 100
--ccov
Alignment coverage for CRISPRs || default: 90 || range from 0 to 100
--blast
BLAST program for CRISPRs || default: blastn || blastn or blastn-short
blastn-short gives more sensitive results but takes longer to run.
The options below are used to align contigs to MAGs to find prophages.
--prolen
Minimum alignment length for prophage || default: 1000 || range from 0 to 100000
The option below is used as a GTDB-based taxonomy annotation filter for k-mer-based prediction.
--bgtdb
Path to the GTDB tsv file of the MAGs || default: None
The default parameters are optimized for predicting prokaryotic hosts of viruses with 98% accuracy (data from the NCBI RefSeq database). Before changing them, make sure you fully understand what they do.
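As a rough sketch, a cherry run that predicts hosts from user-provided MAGs only and uses the more sensitive CRISPR search might be assembled as below. The folder name mags, the other paths, and the --cpident value of 95 are illustrative assumptions; the flags and their allowed values are taken from the options above.

```shell
# Placeholder paths: replace with your own.
dbdir="phabox_db_v2"
contigs="contigs.fa"
outpth="phabox_out"
mags="mags"   # hypothetical folder containing MAG files

set -- phabox2 --task cherry \
    --dbdir "$dbdir" \
    --outpth "$outpth" \
    --contigs "$contigs" \
    --bfolder "$mags" \
    --magonly Y \
    --cpident 95 \
    --ccov 90 \
    --blast blastn-short \
    --prolen 1000

# Dry run: print the command instead of executing it.
preview="$*"
echo "$preview"

# Uncomment to execute:
# "$@"
```

With --magonly N instead, hosts would be predicted from both the MAGs and the reference database.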
Usage: phabox2 --task phatyp [options]
In-task options:
There are no additional options for lifestyle prediction; only the general options apply.
Please note that running the end_to_end, phamer, phagcn, phatyp, or cherry task automatically runs phavip. The output files are the same.
Usage: phabox2 --task phavip [options]
Usage: phabox2 --task end_to_end [options]
In-task options:
The end_to_end task allows skipping PhaMer (virus identification).
If users already have viral contigs as input, they can run the end_to_end task with --skip Y to skip virus identification.
--skip
Whether to skip virus identification (PhaMer) || default: N || Y or N
However, please note that the default is --skip N. With --skip N, a log message tells the user when PhaMer detected no viruses and the remaining pipelines in the end_to_end task were stopped.
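For input that is already known to be viral, the --skip option described above could be used roughly as follows. The file name viral_contigs.fa and the other paths are placeholders.

```shell
# Placeholder paths: replace with your own.
dbdir="phabox_db_v2"
contigs="viral_contigs.fa"   # hypothetical file of contigs already known to be viral
outpth="phabox_out"

# --skip Y bypasses PhaMer and passes all contigs to the downstream tasks.
set -- phabox2 --task end_to_end \
    --dbdir "$dbdir" \
    --outpth "$outpth" \
    --contigs "$contigs" \
    --skip Y

# Dry run: print the command instead of executing it.
preview="$*"
echo "$preview"

# Uncomment to execute:
# "$@"
```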
Usage: phabox2 --task contamination [options]
In-task options:
--sensitive
Use sensitive mode when searching for prokaryotic genes || default: N || Y or N
Y gives more sensitive results but takes longer to run.
Usage: phabox2 --task votu [options]
In-task options:
--mode
Clustering mode, ANI-based or AAI-based || default: ANI || ANI or AAI
AAI-based options:
--aai
Average amino acid identity for AAI-based genus grouping || default: 75 || range from 0 to 100
--pcov
Protein-level coverage for AAI-based genus grouping || default: 80 || range from 0 to 100
--share
Minimum number of shared proteins for AAI-based genus grouping || default: 15 || range from 0 to 100
ANI-based options:
--ani
Alignment identity for ANI-based clustering || default: 95 || range from 0 to 100
--tcov
Alignment coverage for ANI-based clustering || default: 85 || range from 0 to 100
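The two clustering modes take different option sets. The sketch below prepares one command per mode using the documented defaults; the paths and the separate output folder names are placeholders chosen for illustration.

```shell
# Placeholder paths: replace with your own.
dbdir="phabox_db_v2"
contigs="contigs.fa"

# ANI-based clustering (the default mode) with its two thresholds.
set -- phabox2 --task votu \
    --dbdir "$dbdir" --outpth "votu_ani_out" --contigs "$contigs" \
    --mode ANI --ani 95 --tcov 85
ani_preview="$*"

# AAI-based clustering with its three thresholds.
set -- phabox2 --task votu \
    --dbdir "$dbdir" --outpth "votu_aai_out" --contigs "$contigs" \
    --mode AAI --aai 75 --pcov 80 --share 15
aai_preview="$*"

# Dry run: print both commands instead of executing them.
echo "$ani_preview"
echo "$aai_preview"
```

Writing the two runs to separate output folders keeps their results from overwriting each other.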
Usage: phabox2 --task tree [options]
In-task options:
--marker
A list of markers used to generate the tree || default: terl portal
You can choose more than one of the markers below to generate the tree.
The marker genes were obtained from RefSeq 2024:
endolysin || 91% of prokaryotic viruses have endolysin
holin || 75% of prokaryotic viruses have holin
head || 77% of prokaryotic viruses have major head
portal || 84% of prokaryotic viruses have portal
terl || 92% of prokaryotic viruses have terminase large subunit
Using combinations of these markers can improve the accuracy of the tree
but will decrease the number of sequences in the tree.
--mcov
Alignment coverage for matching marker genes || default: 50 || range from 0 to 100
--mpident
Alignment identity for matching marker genes || default: 25 || range from 0 to 100
--msa
Whether to run MSA || default: N || Y or N
Y will run MSA for the marker genes using mafft,
but this takes longer to run.
--msadb
Whether to run MSA with the database || default: Y || Y or N
Y will run MSA on the detected marker genes together with the database
N will run MSA on the detected marker genes without the database
--tree
Whether to build a tree || default: N || Y or N
Y will generate the tree based on the marker genes using FastTree,
but this takes longer to run.
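A tree-building run with the default marker pair might be assembled as sketched below. This assumes that multiple markers are given as space-separated values after --marker, as the default "terl portal" suggests, and it enables both --msa and --tree because whether --tree Y requires --msa Y is not stated here. The paths are placeholders.

```shell
# Placeholder paths: replace with your own.
dbdir="phabox_db_v2"
contigs="contigs.fa"
outpth="phabox_out"

# Assumption: markers are passed as separate, space-separated arguments.
set -- phabox2 --task tree \
    --dbdir "$dbdir" \
    --outpth "$outpth" \
    --contigs "$contigs" \
    --marker terl portal \
    --msa Y \
    --tree Y

# Dry run: print the command instead of executing it.
preview="$*"
echo "$preview"

# Uncomment to execute:
# "$@"
```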