TAPER is a tool that looks for errors in small species-specific stretches of the multiple sequence alignments. TAPER uses a two-dimensional outlier detection algorithm that seeks to find what parts of a sequence are outliers given the overall level of divergence of the sequence to others and the level of the divergence of the region in the alignment. Importantly, TAPER adjusts its null expectations per site and species, and in doing so, it attempts to distinguish the real heterogeneity (signal) from errors (noise).
A preprint of the method is available here: https://www.biorxiv.org/content/10.1101/2020.11.30.405589v1
- Zhang, C., Zhao, Y., Braun, E. L., & Mirarab, S. (2020). TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution. bioRxiv.
Please download Julia here and unzip the file using the following command.
tar -xvzf julia*.tar.gz
Luckily, Julia doesn't require installation on most platforms. You just download it and you have the tool.
In the following, we use JULIA_FOLDER for the place you have copied the julia package. Change it to the actual place you used.
JULIA_FOLDER/bin/julia correction_multi.jl -h-
For AA input, you can use the default version for all flags:
JULIA_FOLDER/bin/julia correction_multi.jl INPUTNAME > OUTPUTNAME-
This will detect erronous sequences and will replace them with
Xin the output. -
Note that an
Xcharacter in the input is treated the same way as dash (see-m) -
For example,
cdto the directory containingcorrection_multi.jlJULIA_FOLDER/bin/julia correction_multi.jl sample_inputs/3.fasta > sample_inputs/3.out.fasta
-
-
For DNA input, you (probably) want to mask erroneous regions with
Ninstead ofXand also treatN(instead ofX) as a dash. To do this, you can use-m(to set the output masking letter) and-a(to set the ANY character). Thus, you would run:JULIA_FOLDER/bin/julia correction_multi.jl -m N -a N INPUTNAME > OUTPUTNAME-
For example,
cdto the directory containingcorrection_multi.jlJULIA_FOLDER/bin/julia correction_multi.jl -m N -a N sample_inputs/dna.fasta > sample_inputs/dna.out.fasta
-
You need a LIST file that includes the name of input and output files. Run:
JULIA_FOLDER/bin/julia correction_multi.jl -l LISTLIST should look like this:
path_to_input_file_1
path_to_output_of_input_file_1
path_to_input_file_2
path_to_output_of_input_file_2
...
For every two lines in LIST:
- The first line should be the path to an input file
- The second line should be the path to the output of the previous input file
For example, cd to the directory containing correction_multi.jl and run
JULIA_FOLDER/bin/julia correction_multi.jl -l sample_inputs/files2run.txtThe following are all the available commands of correction_multi.
JULIA_FOLDER/bin/julia correction_multi.jl [-l] [-m MASK] [-a ANY] [-c CUTOFF] input-l tells the program to run multiple files together.
- By default, without
-lcommand, yourinputshould be a file in FASTA format and the program prints the output onto the standard output. - You can try
JULIA_FOLDER/bin/julia correction_multi.jl sample_inputs/3.fastaAlso, you can redirect your output to a file. Try:
JULIA_FOLDER/bin/julia correction_multi.jl sample_inputs/3.fasta > sample_inputs/3.out.fasta- To run multiple files at a time, you need a
LISTfile that includes the name of input and output files.- The
LISTfile should look like this:For every two lines inpath_to_input_file_1 path_to_output_of_input_file_1 path_to_input_file_2 path_to_output_of_input_file_2 ...LIST:- The first line should be the path to a input file
- The second line should be the path to the output of the previous input file
- You can refer to
sample_inputs/files2run.txtfor formating. - To run multiple FASTA files, your
inputshould be theLISTand you should add-lcommand. For example, to run the three FASTA files insample_inputs/files2run.txt, try:JULIA_FOLDER/bin/julia correction_multi.jl -l sample_inputs/files2run.txt
- The
TAPER will mask each position that it believed to be erroneous with a specific MASK character. By default, it uses X, which makes sense for amino acid sequences. You can change the MASK character to most characters (including - and *). For example, if your input file contains DNA reads, it makes sense to mask with N. For example, try:
JULIA_FOLDER/bin/julia correction_multi.jl -m N sample_inputs/dna.fastaYou may want TAPER to ignore a specific ANY character in the input file and treat it as if it is a -. By default, the ANY character is X which makes sense for amino acid sequences. You can change the ANY character to most characters (including - and *). For example, if your input file contains DNA reads, it makes sense to ignore N. For example, try:
JULIA_FOLDER/bin/julia correction_multi.jl -a N sample_inputs/dna.fastaA cutoff (make sense to > 1) to control aggressiveness of error detection. The lower the cutoff is, the more aggressive the TAPER becomes. Values below 1 are allowed but discouraged. The default cutoff is 3. For example, if you want TAPER be more precautious, you can try:
JULIA_FOLDER/bin/julia correction_multi.jl -c 10 sample_inputs/3.fasta > sample_inputs/3.out.fastaLoad the list of k, p, q, and L from the input parameter file. Please refer to the paper for usage for the meaning of these parameters and multi-k setting. Very roughly speaking,
kis the size of the kmer.pis the maximum proportion of a site that could be removed.qis the maximum proportion of a sequence that could be removed.Lis a hard upper bound on the length of sequences that will be removed. The input parameter file should contain a list of dictonaries with key words k, p, q, and L (in Julia format, see below). TAPER removes the union of all the parameters described in the list.
By default, the input parameter is set to be:
[Dict("k"=>5, "p"=>0.25, "q"=>0.1, "L"=>30),
Dict("k"=>9, "p"=>0.25, "q"=>0.25, "L"=>54),
Dict("k"=>17, "p"=>0.1, "q"=>0.5, "L"=>Inf)]
This means that TAPER is run with k equals to 5, 9, and 17, each with a different p and q. Smaller k values are only used to catch short errors (<=30 and <=54, respectively for k=5 and k=9). You can change these settings.
- To get more aggresive filtering, you can increase
p(to allow removal of more than 1/4 of a column) orq(to allow removal of more than half of a sequence). - To make the filtering less aggressive, you can lower
porqor lowerLvalues.
However, it does seem a bit hard to optimize all these parameters. Thus, we suggest that most users simply use the defaults or use -c to adjust aggressiveness. If you adjust these parameters, some experimentation is needed.
While we recommend the default multi-k version, you can have a single-k version by having only one k value in the -p parameter.
For example, you can run:
JULIA_FOLDER/bin/julia correction_multi.jl -p sample_inputs/parameter.txt sample_inputs/3.fasta