Apologies and Warnings: results generated by versions prior to 0.2.0 had incorrect sequence to ID correspondence. Sincere apologies.
OrthoSLC is a pipeline that performs Reciprocal Best Blast Hit (RBBH) Single Linkage Clustering (V1.0.0 onwards also allows mcl) to obtain orthologous genes, and generates core and accessory genes as the final output.
- Readme OrthoSLC (1.0.0)
- Easy Run
- You can run each step independently (check commandline_template.sh)
- Step 1 Annotated genome information preparation
- Step 2 FASTA dereplication
- Step 3 Pre-clustering using all dereplicated FASTAs and non-redundant genome generation
- Step 4 Reciprocal Blast
- Step 5 query binning
- Step 6 Filtering and binning
- Step 7 Reciprocal Best find
- Step 8 Clustering
- Step 9 Write clusters into FASTA
- OrthoSLC ToolKit (OthoSLC_TK)
OrthoSLC ToolKit (OthoSLC_TK, at the bottom of this page): From V1.0.0 onwards, we provide OthoSLC_TK to assist with downstream tasks such as MSA, construction of a single-copy core genome, and calculation of all pairwise SNP counts.
OrthoSLC is:
- lightweight,
- fast:
- 90 E.coli annotated genomes, 10 threads to final cluster -> <150 s;
- 976 E.coli annotated genomes (unique genes -> ~500M), 30 threads -> ~22 min;
- 1200 E.coli annotated genomes (highly diverse, from six different lineages, unique genes -> ~900M), 30 threads -> ~70 min;
- 3050 E.coli annotated genomes (from different lineages, unique genes -> 1.2G), 30 threads -> ~220 min, peak memory < 15G with speed trade off;
- convenient to install;
- and independent of database management systems (e.g., MySQL, Redis, SQLite)
Download:
$ git clone https://github.com/JJChenCharly/OrthoSLC
$ cd OrthoSLC
Caveat:
- The pipeline is currently available for Linux-like systems only.
- From V1.0.0 onwards, the "all-in-one" Jupyter notebook interface OrthoSLC_Python_jupyter_interface.ipynb no longer guarantees the same output as the executable binaries. Users should employ the executables in bins/
Requirement and Run:
- Python3 (newest stable release suggested, or > V3.7)
- C++17 or higher (required only for compiling); users may also directly use the pre-compiled binaries by granting execute access
- NCBI Blast+ (2.12 or higher suggested)
- mcl (optional but highly suggested; apt-get install mcl)
You may use pre-compiled:
$ cd OrthoSLC
$ chmod a+x bins/*
Or use install.sh to compile, like the following:
$ cd OrthoSLC
$ chmod a+x install.sh
$ ./install.sh src/ bins/
The pipeline starts with annotated genomes, and can produce clusters of gene IDs and FASTA files for each cluster.
Therefore, with:
- blastn and mcl in $PATH;
- one FASTA file for each strain, with many DNA sequences in each FASTA;
you can easily try:
- 90 E.coli genomes and 10 threads (time: <150s).
# args: path to bins / your output directory / your input directory /
# threads / larger -> lower mem but a bit slower / input file format
bash ./commandline_template.sh ./bins \
./test_output \
./test_inputs \
10 \
60 \
fasta
- ~3000 E.coli genomes (from six lineages, unique genes -> 1.2G) on 30 threads:
(Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz, time: ~215 min, peak mem <15G). Peak mem can be reduced to <5G at the cost of ~10 min more total time, by increasing 500 to a larger value.
bash ./commandline_template.sh ./bins \
./test_output \
./test_inputs \
30 \
500 \
ffn
Note that this pipeline is recommended for sub-species-level single-copy core genome construction, since RBBH may not work well for tasks such as finding human-microbe orthologs.
The programs use A simple C++ Thread Pool implementation; sincere thanks to its contributors.
Bug report:
The pipeline starts with annotated genomes in FASTA format (DNA is accepted as the default input).
Step 1 needs the path to the directory of annotated FASTA files as input; your input folder should have the Prokka/PGAP output structure:
$ tree ./test_inputs
./test_inputs
├── strain_A
│  ├── strain_A.err
│  ├── strain_A.faa
│  ├── strain_A.ffn
│  ├── strain_A.fna
...
├── strain_B
│  ├── strain_B.err
│  ├── strain_B.faa
│  ├── strain_B.ffn
...
Or a simple layout:
$ tree ./test_inputs
./test_inputs
├── K12_MG1655.fasta
├── UTI89.fasta
├── strain_A.fasta
├── strain_B.fasta
With -f, the program looks for all files with the given extension. It generates a header-less, tab-separated table in which the
- first column is a short ID (strainID-GeneID);
- second column is the strain name (file name or parent folder name);
- third column is the absolute path.
$ python3 ./bins/Step1_preWalk.py -h
Usage: python Step1_preWalk.py -i input/ -o output/ [options...]
-i or --input_path ---> <dir> path/to/input/directory
-o or --output_path --> <txt> path/to/output_file
-f or --format -------> <str> file extension to look for (e.g. 'sqn', 'ffn')
-h or --help ---------> display this information
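For example, a minimal invocation matching the simple layout above (the output file name is illustrative; it is reused by later examples in this README):
python3 ./bins/Step1_preWalk.py \
-i ./test_inputs \
-o ./test_output/Step1_op.txt \
-f fasta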
Step 2 removes sequence duplication within each genome (e.g., copies of tRNA, some CDS). This dereplication is equivalent to 100% identity clustering, to obtain single copies.
Step 2 requires the tab-separated table output by Step 1 as input, and a directory (-o) for the dereplicated files.
Since V1.0.0, by specifying -c or --copy_info_path, the program will record the IDs of copies (identical sequences); the output file is a TSV where each row is one set of identical copies separated by \t.
Note! Dereplication in Step 2 simply records identical sequences; it does not consider read coverage of regions. Hence, annotation of a complete circular or near-complete genome assembly is suggested. What I usually do is let TellSeq handle my genome extract; I consistently obtain E.coli assemblies with N50 > 4.5 Mbp (not complete) for 300 to 350 RMB per sample (< ~50$).
$ ./bins/Step2_simple_derep -h
Usage: Step2_simple_derep -i input_file -o output/ [options...]
-i or --input_path ------> <txt> path/to/file/output/by/Step1
-o or --output_path -----> <dir> path/to/output/directory
-c or --copy_info_path --> <txt> path/to/output/copy_info.txt
-u or --thread_number ---> <int> thread number, default: 1
-h or --help ------------> display this information
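For example (the -c file name and output directory are illustrative; S2_op_dereped matches the paths used further down):
./bins/Step2_simple_derep \
-i ./test_output/Step1_op.txt \
-o ./test_output/S2_op_dereped \
-c ./test_output/S2_copy_info.txt \
-u 10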
After dereplication, users should carefully check the sizes of the dereplicated FASTA files. If a FASTA file with a very low number of sequences is processed together with all the rest, the final "core" clusters will be heavily affected and may bias your analysis.
grep -c ">" $wd/S2_op_dereped/* | sort -t: -k2rn
Since core genome construction is similar to intersection construction, it is recommended to remove very small dereplicated FASTA files BEFORE THE NEXT STEP, e.g., remove all dereplicated E.coli genomes with a file size lower than 2.5 MB, as most should be larger than 4 MB.
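A minimal sketch of such a pre-filter, assuming the dereplicated files sit in $wd/S2_op_dereped and that 2.5 MB is a sensible cutoff for your species (adjust to your genome size):
# list dereplicated FASTA files smaller than 2.5 MB for review
find $wd/S2_op_dereped/ -type f -size -2500k -print
# after review, remove them (uncomment to actually delete)
# find $wd/S2_op_dereped/ -type f -size -2500k -delete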
Step 3 performs 100% identity clustering on all dereplicated FASTAs generated in Step 2, to further dereplicate across all sequences.
The program of Step 3 takes the dereplicated FASTA files made in Step 2 as input, and produces:
- a FASTA dereplicated across all sequences, used as the reciprocal BLAST query;
- the length of each non-redundant sequence;
- pre-clustered results (one row per cluster, tab-separated);
- each genome with redundancy removed, for makeblastdb;
- gene ID information: a tab-delimited table whose first column is the generated short ID and whose second column is the description line of the original FASTA.
You may run Step3_pre_cluster like the following:
Usage: Step3_pre_cluster -i dereped_dir/ -d dereped_cated.fasta -n nr_genome/ -l seq_len.txt -p pre_cluster.txt [options...]
-i or --input_path -----> <dir> path/to/directory of dereplicated FASTA from Step2
-d or --derep_fasta ----> <fasta> path/to/output/dereplicated concatenated FASTA
-n or --nr_genomes -----> <dir> path/to/directory/of/output/non-redundant/genomes
-l or --seq_len_info ---> <txt> path/to/output/sequence_length_table
-m or --id_info --------> <txt> path/to/output/id_info_table
-p or --pre_cluster ----> <txt> path/to/output/pre_clustered_file
-u or --thread_number --> <int> thread number, default: 1
-h or --help -----------> display this information
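For example (output names other than S3_op_pre_cluster.txt are illustrative):
./bins/Step3_pre_cluster \
-i ./test_output/S2_op_dereped \
-d ./test_output/S3_op_dereped_cated.fasta \
-n ./test_output/S3_op_nr_genomes \
-l ./test_output/S3_op_seq_len.txt \
-m ./test_output/S3_op_id_info.txt \
-p ./test_output/S3_op_pre_cluster.txt \
-u 10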
Step 4 carries out the reciprocal BLAST using NCBI Blast+.
The pipeline will assist you to:
- create a BLAST db for each non-redundant genome using makeblastdb;
- use blastn or blastp to align the dereplicated concatenated FASTA against each of the dbs just made and get tabular output.
In case you have installed BLAST but have not exported it to $PATH, you need to provide the path to your BLAST binary. You may use whereis blastn or whereis makeblastdb to get the full path.
To create the databases for BLAST, users should provide the path to the directory containing all non-redundant FASTA files made in Step 3, and a path to an output directory where the BLAST databases will be stored.
You may run Step4_makeblastdb.py like the following:
Usage: python Step4_makeblastdb.py -i input/ -o output/ [options...]
options:
-i or --input_path -----------> <dir> path/to/input/directory of nr_genomes from Step 3
-o or --output_path ----------> <dir> path/to/output/directory
-c or --path_to_makeblastdb --> <cmd_path> path/to/makeblastdb, default: makeblastdb
-u or --thread_number --------> <int> thread number, default: 1
-t or --dbtype ---------------> <str> -dbtype <String, 'nucl', 'prot'>, default: nucl
-h or --help -----------------> display this information
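For example, assuming makeblastdb is in $PATH and using illustrative directory names:
python3 ./bins/Step4_makeblastdb.py \
-i ./test_output/S3_op_nr_genomes \
-o ./test_output/S4_op_dbs \
-t nucl \
-u 10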
To perform the reciprocal BLAST, users should provide the path to the dereplicated concatenated FASTA produced by Step 3, the path to the directory of databases made by makeblastdb, and a path to an output directory where the BLAST tabular output will be stored.
This step helps you achieve all-vs-all BLAST with task distribution and automatic thread allocation, which reduces memory usage and increases overall speed.
You can run Step4_reciprocal_blast.py like the following:
Usage: python Step4_reciprocal_blast.py -i query.fasta -o output/ -d directory_of_dbs/ [options...]
options:
-i or --query -------------> <fasta> path/to/dereped_cated.fasta from Step 3
-d or --dir_to_dbs --------> <dir> path/to/directory/of/dbs by makeblastdb
-o or --output_path -------> <dir> path/to/output/directory
-c or --path_to_blast -----> <cmd_path> path/to/blastn or blastp, default: 'blastn'
-e or --e_value -----------> <float> blast E value, default: 1e-5
-s or --strand ------------> <str> select from <'both', 'minus', 'plus'>, default: plus
-u or --blast_thread_num --> <int> blast thread number, default: 1
-f or --outfmt ------------> <str> specify blast output format if needed, unspecified means `'6 qseqid sseqid pident score evalue'` as default
-t or --blastp_task ------> <str> select from <'blastp' 'blastp-fast' 'blastp-short'>, unspecified means `'blastp'` as default
-T or --blastn_task ------> <str> select from <'blastn' 'blastn-short' 'dc-megablast' 'megablast' 'rmblastn' >, unspecified means `'megablast'` as default
-h or --help --------------> display this information
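For example, with blastn in $PATH and illustrative directory names:
python3 ./bins/Step4_reciprocal_blast.py \
-i ./test_output/S3_op_dereped_cated.fasta \
-d ./test_output/S4_op_dbs \
-o ./test_output/S4_op_blast_results \
-c blastn \
-e 1e-5 \
-u 10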
Step 5 applies hash binning to place all occurrences of a query into the same file, to facilitate filtering in the next step.
You can run Step5_query_binning like the following:
Usage: Step5_query_binning -i input/ -o output/ [options...]
-i or --input_path -----> <dir> path/to/input/directory of blast outputs
-o or --output_path ----> <dir> path/to/output/directory
-u or --thread_number --> <int> thread number, default: 1
-L or --bin_level ------> <int> binning level, an integer 0 < L <= 9999, default: 10
-k or --no_lock_mode ---> <on/off> select to turn no lock mode <on> or <off>, default: off
-h or --help -----------> display this information
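For example (directory names are illustrative):
./bins/Step5_query_binning \
-i ./test_output/S4_op_blast_results \
-o ./test_output/S5_op_bins \
-L 10 \
-u 10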
Set bin level:
According to the number of genomes to analyze, the user should provide a binning level, which sets how many bins are used.
Our suggestion is not to set the bin level too high, especially when fewer than 200 genomes participate; in that case, a bin level from 10 to 100 is usually the most efficient.
As tested, an analysis of 30 genomes has 30 BLAST outputs after Step 4:
- a bin level of 10 takes 7 seconds to finish,
- a bin level of 100 takes 10 seconds to finish,
- a bin level of 1000 takes 24 seconds to finish.
When to set a high bin level:
Simply put, when you have a really large number of genomes and not enough memory (e.g., more than 1000 genomes and less than 10 GB of memory), increase -L to 200~300.
For example, if the BLAST output for 1000 genomes reaches 15 GB in size and the bin level is set to 10, there will be 10 bins to evenly distribute the data. On average, each bin will contain 1.5 GB of data, which may be too memory-intensive to process in Step 6 (which requires approximately 1.5 GB of memory per bin). However, if the number of bins is increased to 100, the size of each bin is reduced to 100-200 MB, which facilitates Step 6 parallelization.
No lock mode:
We provide a no-lock mode in all steps that apply hash binning, to speed up the process. It allows users to turn off the mutex lock that otherwise protects multi-threaded writes to files. In our tests, the program generated files without data corruption when multi-threading with no lock (data corruption was rarely observed; the possibility of corruption may vary between computing platforms).
Step 6 filters the BLAST output and applies hash binning, to prepare for reciprocal best finding.
Step 6 requires as input the path to the directory of query binning output (Step 5), plus the sequence length and pre-cluster information output by Step 3.
You can run Step6_filter_n_bin like the following:
Usage: Step6_filter_n_bin -i input/ -o output/ -s seq_len_info.txt [options...]
-i or --input_path --------> <dir> path/to/input/directory from Step 5
-o or --output_path -------> <dir> path/to/output/directory
-s or --seq_len_path ------> <txt> path/to/seq_len_info.txt from Step 3
-p or --pre_cluster_path --> <txt> path/to/pre_cluster.txt from Step 3
-L or --bin_level ---------> <int> binning level, an integer 0 < L <= 9999, default: 10
-r or --length_limit ------> <float> length difference limit, 0 < r <= 1, default: 0.8
-m or --similarity_limit --> <float> similarity limit, 0 < m <= 100, default: 80
-w or --weight ------------> <str> select from <'none', 'pident', 'bitscore', 'evalue'> as the weight for mcl clustering
-k or --no_lock_mode ------> <on/off> select to turn no lock mode <on> or <off>, default: off
-u or --thread_number -----> <int> thread number, default: 1
-h or --help --------------> display this information
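For example (directory names are illustrative; -w none is chosen here so the result can later be clustered with Step8_SLC):
./bins/Step6_filter_n_bin \
-i ./test_output/S5_op_bins \
-o ./test_output/S6_op_bins \
-s ./test_output/S3_op_seq_len.txt \
-p ./test_output/S3_op_pre_cluster.txt \
-r 0.8 \
-m 80 \
-w none \
-L 10 \
-u 10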
The pipeline will carry out the following treatment of the BLAST output:
- Paralog removal: if the query and subject are from the same strain, the hit is skipped, so as to remove paralogs.
- Length ratio filtering: within a hit with query length $Q$ and subject length $S$, the ratio of the two lengths should fall within a range, following L. Salichos et al: $r \le \min(Q, S) / \max(Q, S) \le 1$, where $r$ is set by -r or --length_limit (default 0.8). If this condition is not met, the hit is removed from the analysis.
- Similarity filtering: keep only pairs with pident > -m.
- Non-best-hit removal:
  - Identical sequences are always regarded as best hits.
  - If a query has more than one subject hit, only the query-subject pair with the highest score is kept.
  - If pairs have the same score, the pair whose query and subject are of more similar length is kept.
- Sorting and binning: for every kept hit, its query and subject are sorted using the Python or C++ built-in sort. This is needed because in a single BLAST output file only the "single-direction best hit" can be obtained; its "reciprocal best hit" exists only in another file, which makes "reciprocal finding" difficult. However, if a query $a$ and its best subject hit $b$ pass the filters above and form $(a, b)$, and in the meantime its reciprocal hit $(b, a)$ from another file is sorted into $(a, b)$, then both occurrences of $(a, b)$ generate the same hash value. The last few digits of this hash value allow us to bin them into the same new file. Therefore, after this binning, "reciprocal finding" turns into "duplicate finding" within a single file.
- In Step4_reciprocal_blast.py, the program by default generates qseqid sseqid pident score evalue as the output columns. With -w, you may choose 'none' or one of 'pident', 'bitscore', 'evalue' to keep as the mcl clustering weight. If you would like to employ Step8_SLC for clustering, you have to select 'none'.
Set bin level:
According to the number of genomes to analyze, the user should provide a binning level, which sets how many bins are used (see the explanation under Step 5).
This step finds reciprocal best hits. In Step 6, query-subject pairs were binned into different files according to their hash value; therefore, a pair and its reciprocal counterpart are guaranteed to land in the same file, and reciprocal best finding becomes duplicate finding within each file.
In addition, Step 7 also performs hash binning after a reciprocal best hit is confirmed. Query-subject pairs are binned by the hash value of the query ID, which puts pairs with common elements into the same bin to assist faster clustering in the next step.
Step 7 requires the path to the directory of bins output by Step 6, and a path to the output directory.
You can run Step7_RBF like the following:
Usage: Step7_RBF -i input/ -o output/ [options...]
-i or --input_path -----> <dir> path/to/input/directory from Step 6
-o or --output_path ----> <dir> path/to/output/directory
-u or --thread_number --> <int> thread number, default: 1
-k or --no_lock_mode ---> <on/off> select to turn no lock mode <on> or <off>, default: off
-L or --bin_level ------> <int> binning level, an integer 0 < L <= 9999, default: 10
-h or --help -----------> display this information
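For example (directory names are illustrative; S7_op matches the Step 8 examples below):
./bins/Step7_RBF \
-i ./test_output/S6_op_bins \
-o ./test_output/S7_op \
-L 10 \
-u 10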
Set bin level:
According to the number of genomes to analyze, the user should provide a binning level, which sets how many bins are used (see the explanation under Step 5).
This step carries out single linkage clustering on the output of Step 7. Users may perform "multi-step-to-final" or "one-step-to-final" clustering by adjusting the compression_size parameter. In the output files, each row is a cluster (terminated by "\n") and gene IDs are separated by "\t".
When a large number of genomes participate in the analysis, it could be memory-intensive to reach the final cluster in a single step. The pipeline provides an alternative that relieves this pressure by reaching the final cluster in multiple steps. For example, if compression_size = 5 is provided, the program will perform clustering on 5 files at a time and shrink the number of output files by a factor of 5.
Note before starting:
The user must specify the path to the pre-cluster file produced in Step 3 when running the LAST step of multi-step-to-final clustering, or when running direct one-step-to-final clustering.
You can run Step8_SLC like the following:
Usage: Step8_SLC -i input/ -o output/ [options...]
options:
-i or --input_path --------> <dir> path/to/input/directory from Step 7
-o or --output_path -------> <dir> path/to/output/directory
-u or --thread_number -----> <int> thread number, default: 1
-p or --pre_cluster_path --> <txt> path/to/output/pre_clustered file from Step 3
-S or --compression_size --> <int> compression size, default: 10, 'all' means one-step-to-final
-h or --help --------------> display this information
Example run with the multi-step-to-final approach:
E.g., there are 1,000 files generated by Step 7.
Set -S or --compression_size to 1
bin_dir="bin_directory"
wd="working_dir"
time $bin_dir/Step8_SLC \
-i $wd"/S8_op" \
-o $wd"/SLC_1" \
-u 16 \
-S 1
The above commands will perform clustering on every 1 file, and generate 1,000 files in $wd"/SLC_1".
Set -S or --compression_size to 5
time $bin_dir/Step8_SLC \
-i $wd"/SLC_1" \
-o $wd"/SLC_2" \
-u 16 \
-S 5
This will perform clustering on every 5 files, and generate 200 files in $wd"/SLC_2".
Set -S or --compression_size to 10
time $bin_dir/Step8_SLC \
-i $wd"/SLC_2" \
-o $wd"/SLC_3" \
-u 16 \
-S 10
This will perform clustering on every 10 files, and generate 20 files in $wd"/SLC_3".
Finally, set -S or --compression_size to all, and provide the pre_cluster.txt made in Step 3 using -p or --pre_cluster_path:
time $bin_dir/Step8_SLC \
-i $wd"/SLC_3" \
-o $wd"/Final_cluster" \
-p $wd"/S3_op_pre_cluster.txt" \
-u 16 \
-S all
This will perform clustering on all files at once, and give the final single cluster file in $wd"/Final_cluster".
You can run Step8_MCL.py like the following:
optional arguments:
-h, --help show this help message and exit
-c , --path_to_mcl <cmd_path> path/to/mcl
-i , --input_path <dir> path/to/input/directory from Step 7
-o , --output_path <dir> path/to/output/directory
-p , --pre_cluster_path
<txt> path/to/pre_cluster.txt from Step 3
-u , --mcl_thread_num
<int> number of threads for mcl
This step concatenates all files of the Step 7 output, feeds them to mcl via stdin, and fuses the pre-cluster result with the mcl output.
You may also provide any other arguments that mcl accepts, like:
time python3 ./bins/Step8_MCL.py \
-c mcl \
-i $wd"/S7_op" \
-p $wd"/S3_op_pre_cluster.txt" \
-o $wd"/Final_cluster" \
-u 16 \
-I 1.5 --abc # mcl args
In Step 9, the program generates a FASTA file for each cluster. Providing the final cluster file generated by Step 8 as input, the program writes 3 types of clusters into 3 separate directories.
Notably, the genomes that
- were dereplicated in Step 2,
- were not removed for having too small a genome size, and
- participated in all processes up to this step,
are used to separate the 3 types of clusters. Their count can be obtained with:
find ./test_op/S2_op_dereped/ -type f | wc -l
- In directory accessory_cluster (clusters not shared by all genomes), FASTA files are written for clusters that do not have genes from all genomes in the analysis. For example, with 100 genomes in the analysis, a cluster with fewer than 100 genes will have its FASTA output here. Also, if a cluster has >= 100 genes but all of these genes come from fewer than 100 genomes, its FASTA will be in this directory.
- In directory strict_core, a given cluster has exactly 1 gene from every genome analyzed. Such clusters have their FASTA files here.
- In directory surplus_core, a given cluster has at least 1 gene from every genome analyzed, and some genomes have more than 1 gene in the cluster. Such clusters have their FASTA files here.
This step also requires the concatenated FASTA made in Step 3 as input.
You may run Step9_write_clusters like this:
Usage: Step9_write_clusters -i input_path -o output/ -f concatenated.fasta [options...]
options:
-i or --input_path --------> <dir> path/to/input/final_cluster_file from Step 8
-o or --output_path -------> <dir> path/to/output/directory
-f or --fasta_path --------> <fasta> path/to/dereped_cated_fasta from Step 3
-m or --id_info_path ------> <txt> path/to/output/id_info_table from Step 3
-p or --pre_cluster_path --> <txt> path/to/pre_clustered_file from Step 3
-c or --total_count -------> <int> number of genomes to analyze
-a or --pct_threshold -----> <float> only write accessory clusters shared by >=n% (0<=n<100) of genomes, default: 0
-t or --cluster_type ------> <txt> select from < accessory / strict / surplus >, separate by ',', all types if not specified
-u or --thread_number -----> <int> thread number, default: 1
-h or --help --------------> display this information
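For example, for the 90-genome test set (the final cluster file name inside Final_cluster and the id_info/FASTA names are illustrative, matching the Step 3 example above):
./bins/Step9_write_clusters \
-i ./test_output/Final_cluster/final_cluster.txt \
-o ./test_output/S9_write_fasta \
-f ./test_output/S3_op_dereped_cated.fasta \
-m ./test_output/S3_op_id_info.txt \
-p ./test_output/S3_op_pre_cluster.txt \
-c 90 \
-u 10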
To run OthoSLC_TK, you may also need an aligner such as kalign or mafft (used in the examples below).
OthoSLC_TK assists you to:
- do MSA within each cluster;
- concatenate aligned clusters into a single-copy core genome (as phy or another format) so you can run RAxML, IQ-TREE, or whatever;
- generate SNP counts of the core genome between all pairs of strains.
kalign_bin='Path/to/kalign'
python3 ./bins/TK_kalign.py \
-c $kalign_bin \
-i ./test_output/S9_write_fasta/strict_core/ \
-o path/to/kalign_op \
-u 10 --type dna
or:
mafft_bin='Path/to/mafft'
python3 ./bins/TK_mafft.py \
-c $mafft_bin \
-i ./test_output/S9_write_fasta/strict_core/ \
-o path/to/mafft_op \
-u 10 --maxiterate 1000
The output has each FASTA record labelled with the original strain name.
python3 ./bins/TK_AlnConcat.py \
-i path/to/kalign \
-o path/to/BUG_core.phy \
-T ./test_output/Step1_op.txt \
-f phylip-relaxedNow path/to/BUG_core.phy is ready for RAxML, or IQ-TREE!
python3 ./bins/TK_SNPmat.py \
-i path/to/kalign \
-o path/to/snp_count.csv \
-T ./test_output/Step1_op.txt \
-u 10
The SNP matrix workflow converts aligned sequences into NumPy-based fractional one-hot vectors and reports L2 distances for every strain pair, allowing ambiguity-aware SNP summaries without introducing extra heavy dependencies.
