Commit 95607f7

RabbitTClust v.2.2.0, support incremental clustering by --append option
1 parent a3c324c commit 95607f7

13 files changed, +508 -88 lines changed

README.md

Lines changed: 34 additions & 57 deletions
@@ -1,12 +1,12 @@
 ![RabbitTClust](rabbittclust.png)

-# `RabbitTClust v.2.1.0`
+# `RabbitTClust v.2.2.0`
 RabbitTClust is a fast and memory-efficient genome clustering tool based on sketch-based distance estimations.
 It enables processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms.
 RabbitTClust supports classical single-linkage hierarchical (clust-mst) and greedy incremental clustering (clust-greedy) algorithms for different scenarios.

 ## Installation
-RabbitTClust version 2.1.0 can only support 64-bit Linux Systems.
+`RabbitTClust v.2.2.0` can only support 64-bit Linux Systems.

 The detailed update information for this version, as well as the version history, can be found in the [`version_history`](version_history/history.md) document.

@@ -15,45 +15,12 @@ The detailed update information for this version, as well as the version history
 * c++14
 * [zlib](https://zlib.net/)

-### Compile and install automatically
+### Compile and install
 ```bash
 git clone --recursive https://github.com/RabbitBio/RabbitTClust.git
 cd RabbitTClust
 ./install.sh
 ```
-
-### Compile and install manually
-```bash
-git clone --recursive https://github.com/RabbitBio/RabbitTClust.git
-cd RabbitTClust
-
-#make rabbitSketch library
-cd RabbitSketch &&
-mkdir -p build && cd build &&
-cmake -DCXXAPI=ON -DCMAKE_INSTALL_PREFIX=. .. &&
-make -j8 && make install &&
-cd ../../ &&
-
-#make rabbitFX library
-cd RabbitFX &&
-mkdir -p build && cd build &&
-cmake -DCMAKE_INSTALL_PREFIX=. .. &&
-make -j8 && make install &&
-cd ../../ &&
-
-#compile the clust-greedy
-mkdir -p build && cd build &&
-cmake -DUSE_RABBITFX=ON -DUSE_GREEDY=ON .. &&
-make -j8 && make install &&
-cd ../ &&
-
-#compile the clust-mst
-cd build &&
-cmake -DUSE_RABBITFX=ON -DUSE_GREEDY=OFF .. &&
-make -j8 && make install &&
-cd ../
-```
-
 ## Usage
 ```bash
 # clust-mst, minimum-spanning-tree-based module for RabbitTClust
@@ -62,65 +29,75 @@ Options:
   -h,--help Print this help message and exit
   -t,--threads INT set the thread number, default all CPUs of the platform
   -m,--min-length UINT set the filter minimum length (minLen), genome length less than minLen will be ignore, default 10,000
-  -c,--containment INT use AAF distance with containment coefficient, set the containCompress, the sketch size is in proportion with 1/containCompress
-  -k,--kmer-size INT set the kmer size
+  -c,--containment INT use AAF distance with containment coefficient, set the containCompress, the sketch size is in proportion with 1/containCompress -k,--kmer-size INT set the kmer size
   -s,--sketch-size INT set the sketch size for Jaccard Index and Mash distance, default 1000
-  -l,--inputlist input is genome list, one genome per line
+  -l,--list input is genome list, one genome per line
   -e,--no-save not save the intermediate files, such as sketches or MST
   -d,--threshold FLOAT set the distance threshold for clustering
   -F,--function TEXT set the sketch function, such as MinHash, KSSD, default MinHash
   -o,--output TEXT REQUIRED set the output name of cluster result
-  -i,--input TEXT set the input file
+  -i,--input TEXT Excludes: --append
+      set the input file, single FASTA genome file (without -l option) or genome list file (with -l option)
   --presketched TEXT clustering by the pre-generated sketch files rather than genomes
   --premsted TEXT clustering by the pre-generated mst files rather than genomes for clust-mst
+  --append TEXT Excludes: --input
+      append genome file or file list with the pre-generated sketch or MST files

 # clust-greedy, greedy incremental clustering module for RabbitTClust
 Usage: ./clust-greedy [OPTIONS]
 Options:
   -h,--help Print this help message and exit
   -t,--threads INT set the thread number, default all CPUs of the platform
   -m,--min-length UINT set the filter minimum length (minLen), genome length less than minLen will be ignore, default 10,000
-  -c,--containment INT use AAF distance with containment coefficient, set the containCompress, the sketch size is in proportion with 1/containCompress
-  -k,--kmer-size INT set the kmer size
+  -c,--containment INT use AAF distance with containment coefficient, set the containCompress, the sketch size is in proportion with 1/containCompress -k,--kmer-size INT set the kmer size
   -s,--sketch-size INT set the sketch size for Jaccard Index and Mash distance, default 1000
-  -l,--inputlist input is genome list, one genome per line
+  -l,--list input is genome list, one genome per line
   -e,--no-save not save the intermediate files, such as sketches or MST
   -d,--threshold FLOAT set the distance threshold for clustering
   -F,--function TEXT set the sketch function, such as MinHash, KSSD, default MinHash
   -o,--output TEXT REQUIRED set the output name of cluster result
-  -i,--input TEXT set the input file
+  -i,--input TEXT Excludes: --append
+      set the input file, single FASTA genome file (without -l option) or genome list file (with -l option)
   --presketched TEXT clustering by the pre-generated sketch files rather than genomes
+  --append TEXT Excludes: --input
+      append genome file or file list with the pre-generated sketch or MST files
 ```

 ## Example:
 ```bash
-#input is a file list, one genome path per line:
+# input is a file list, one genome path per line:
 ./clust-mst -l -i bact_refseq.list -o bact_refseq.mst.clust
 ./clust-greedy -l -i bact_genbank.list -o bact_genbank.greedy.clust

-#input is a single genome file in FASTA format, one genome as a sequence:
+# input is a single genome file in FASTA format, one genome as a sequence:
 ./clust-mst -i bacteria.fna -o bacteria.mst.clust
 ./clust-greedy -i bacteria.fna -o bacteria.greedy.clust

-#the sketch size (reciprocal of sampling proportion), kmer size, and distance threshold can be specified by -s (-c), -k, and -d options.
+# the sketch size (reciprocal of sampling proportion), kmer size, and distance threshold can be specified by -s (-c), -k, and -d options.
 ./clust-mst -l -k 21 -s 1000 -d 0.05 -i bact_refseq.list -o bact_refseq.mst.clust
 ./clust-greedy -l -k 21 -c 1000 -d 0.05 -i bact_genbank.list -o bact_genbank.greedy.clust


-#for redundancy detection with clust-greedy, input is a genome file list:
-#use -d to specify the distance threshold corresponding to various degrees of redundancy.
+# for redundancy detection with clust-greedy, input is a genome file list:
+# use -d to specify the distance threshold corresponding to various degrees of redundancy.
 ./clust-greedy -d 0.001 -l -i bacteria.list -o bacteria.out

-#for last running of clust-mst, it generated a folder name in year_month_day_hour-minute-second format, such as 2023_05_06_08-49-15.
-#this folder contains the sketch, mst files.
-#for generator cluster from exist MST with a distance threshold of 0.045:
-./clust-mst -d 0.045 --premsted 2023_05_06_08-49-15 -o bacteria.mst.d.045.clust
-#for generator cluster from exist sketches files of clust-mst with a distance threshold of 0.045:
-./clust-mst -d 0.045 --presketched 2023_05_06_08-49-15 -o bacteria.mst.d.045.clust
+# v.2.1.0 or later
+# for last running of clust-mst, it generated a folder name in year_month_day_hour-minute-second format, such as 2023_05_06_08-49-15.
+# this folder contains the sketch, mst files.
+# for generator cluster from exist MST with a distance threshold of 0.045:
+./clust-mst -d 0.045 --premsted 2023_05_06_08-49-15/ -o bact_refseq.mst.d.045.clust
+# for generator cluster from exist sketches files of clust-mst with a distance threshold of 0.045:
+./clust-mst -d 0.045 --presketched 2023_05_06_08-49-15/ -o bact_refseq.mst.d.045.clust

-#for generator cluster from exist sketches of clust-greedy with a distance threshold of 0.001:
+# for generator cluster from exist sketches of clust-greedy with a distance threshold of 0.001:
 # folder 2023_05_06_08-49-15 contains the sketch files.
-./clust-greedy -d 0.001 --presketched 2023_05_06_08-49-15 -o bact_genbank.greedy.d.001.clust
+./clust-greedy -d 0.001 --presketched 2023_05_06_09-37-23/ -o bact_genbank.greedy.d.001.clust
+
+# v.2.2.0 or later
+# for generator cluster from exist part sketches (presketch_A_dir) and append genome set (genome_B.list) to incrementally clustering
+./clust-mst --presketched 2023_05_06_08-49-15/ -l --append genome_B.list -o append_refseq.mst.clust
+./clust-mst --presketched 2023_05_06_09-37-23/ -l --append genome_B.list -o append_genbank.greedy.clust
 ```
 ## Output
 The output file is in a CD-HIT output format and is slightly different when running with varying input options (*-l* and *-i*).

src/MST.cpp

Lines changed: 6 additions & 4 deletions
@@ -171,11 +171,12 @@ vector<int> getNoiseNode(vector<PairInt> densePairArr, int alpha){
     return noiseArr;
 }

-vector<EdgeInfo> modifyMST(vector<SketchInfo>& sketches, int sketch_func_id, int threads, int** &denseArr, int denseSpan, uint64_t* &aniArr, string prefixName, double threshold){
+
+
+vector<EdgeInfo> modifyMST(vector<SketchInfo>& sketches, int start_index, int sketch_func_id, int threads, int** &denseArr, int denseSpan, uint64_t* &aniArr){
     //int denseSpan = 10;
     double step = 1.0 / denseSpan;

-
     //double step = threshold / denseSpan;
     //cerr << "the threshold is: " << threshold << endl;
     //cerr << "the step is: " << step << endl;
@@ -211,13 +212,14 @@ vector<EdgeInfo> modifyMST(vector<SketchInfo>& sketches, int sketch_func_id, int
     //int N = sketches.size();
     uint64_t totalCompNum = (uint64_t)N * (uint64_t)(N-1)/2;
     uint64_t percentNum = totalCompNum / 100;
-    //cerr << "the percentNum is: " << percentNum << endl;
+    cerr << "---the percentNum is: " << percentNum << endl;
+    cerr << "---the start_index is: " << start_index << endl;
     uint64_t percentId = 0;
     #pragma omp parallel for num_threads(threads) schedule (dynamic)
     for(id = 0; id < sketches.size() - tailNum; id+=subSize){
         int thread_id = omp_get_thread_num();
         for(int i = id; i < id+subSize; i++){
-            for(int j = i+1; j < sketches.size(); j++){
+            for(int j = max(i+1, start_index); j < sketches.size(); j++){
                 double tmpDist;
                 if(sketch_func_id == 0)
                 {
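
The changed inner loop starts `j` at `max(i+1, start_index)`, so any pair whose two members both precede `start_index` is skipped. A minimal sketch of that iteration pattern follows. It assumes (the call site is not part of this diff) that previously sketched genomes occupy indices `[0, start_index)` and appended genomes follow them; `count_compared_pairs` and `expected_pairs` are hypothetical helpers, not RabbitTClust functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Count the (i, j) pairs visited by a loop shaped like the one in the diff:
//   for i in [0, n): for j in [max(i + 1, start_index), n)
// Assumption: indices below start_index are the pre-existing sketches.
std::uint64_t count_compared_pairs(int n, int start_index) {
    assert(0 <= start_index && start_index <= n);
    std::uint64_t pairs = 0;
    for (int i = 0; i < n; i++) {
        for (int j = std::max(i + 1, start_index); j < n; j++) {
            pairs++;  // the real code computes a sketch distance here
        }
    }
    return pairs;
}

// Closed form: all pairs minus the pairs among the first start_index items.
std::uint64_t expected_pairs(int n, int start_index) {
    assert(0 <= start_index && start_index <= n);
    std::uint64_t all = (std::uint64_t)n * (std::uint64_t)(n - 1) / 2;
    std::uint64_t old_only =
        (std::uint64_t)start_index * (std::uint64_t)(start_index - 1) / 2;
    return all - old_only;
}
```

With `n = 5` and `start_index = 3` the loop visits 7 pairs instead of all 10; with `start_index = 0` it reduces to the original all-pairs loop.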

src/MST.h

Lines changed: 3 additions & 1 deletion
@@ -44,7 +44,9 @@ std::vector<EdgeInfo> kruskalAlgorithm(std::vector<EdgeInfo>graph, int vertices)

 vector<EdgeInfo> generateMST(vector<SketchInfo>& sketches, string sketchFunc, int threads);

-vector<EdgeInfo> modifyMST(vector<SketchInfo>& sketches, int sketch_func_id, int threads, int** &denseArr, int denseSpan, uint64_t* &aniArr, string prefixName, double threshold);
+vector<EdgeInfo> append_MST(vector<SketchInfo>& pre_sketches, vector<SketchInfo>& append_sketches, int sketch_func_id, int threads, int ** &denseArr, int denseSpan, uint64_t* &aniArr);
+
+vector<EdgeInfo> modifyMST(vector<SketchInfo>& sketches, int start_index, int sketch_func_id, int threads, int** &denseArr, int denseSpan, uint64_t* &aniArr);

 std::vector<EdgeInfo> generateForest(std::vector <EdgeInfo> mst, double threshhold);
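
The header declares `append_MST`, but its body is not shown in this diff. One standard way to extend an MST when vertices are added is to run Kruskal's algorithm over the old MST edges plus every edge that touches an appended vertex. This is offered only as an illustration, not as the project's implementation. It works because an edge between two old vertices that was rejected earlier is still the heaviest edge on some cycle, so (ties aside) it cannot enter the new tree. `Edge`, `DisjointSet`, `kruskal`, `append_to_mst`, and `total_weight` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct Edge {  // hypothetical stand-in for EdgeInfo
    int u;
    int v;
    double dist;
};

// Union-find with path halving.
struct DisjointSet {
    std::vector<int> parent;
    explicit DisjointSet(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }
    bool unite(int a, int b) {
        a = find(a);
        b = find(b);
        if (a == b) return false;
        parent[a] = b;
        return true;
    }
};

// Plain Kruskal: sort by distance, keep edges that join two components.
std::vector<Edge> kruskal(std::vector<Edge> edges, int vertices) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.dist < b.dist; });
    DisjointSet ds(vertices);
    std::vector<Edge> mst;
    for (const Edge& e : edges) {
        assert(e.u >= 0 && e.u < vertices && e.v >= 0 && e.v < vertices);
        if (ds.unite(e.u, e.v)) mst.push_back(e);
    }
    return mst;
}

// new_edges must contain every edge with at least one appended endpoint.
std::vector<Edge> append_to_mst(const std::vector<Edge>& old_mst,
                                const std::vector<Edge>& new_edges,
                                int vertices) {
    std::vector<Edge> candidates(old_mst);
    candidates.insert(candidates.end(), new_edges.begin(), new_edges.end());
    return kruskal(candidates, vertices);
}

double total_weight(const std::vector<Edge>& edges) {
    double sum = 0.0;
    for (const Edge& e : edges) sum += e.dist;
    return sum;
}
```

The candidate set has `(old vertices - 1)` old edges plus the new edges, so the old-versus-old distances never need to be recomputed or re-sorted.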

src/MST_IO.cpp

Lines changed: 20 additions & 6 deletions
@@ -26,15 +26,30 @@ void loadDense(int** &denseArr, string folderPath, int& denseSpan, int& genome_n
     cerr << "-----read the dense file from: " << file_dense << endl;
 }

+void loadANI(string folderPath, uint64_t* &aniArr, int sketch_func_id){
+    if(sketch_func_id != 0 && sketch_func_id != 1){
+        cerr << "ERROR: saveANI(), save ANI can only support MinHash and KSSD functions" << endl;
+        return;
+    }
+    string file_ani = folderPath + '/' + "mst.ani";
+    FILE* fp_ani = fopen(file_ani.c_str(), "r");
+    if(!fp_ani){
+        cerr << "ERROR: saveANI(), cannot open file: " << file_ani << endl;
+        exit(1);
+    }
+    aniArr = new uint64_t[101];
+    fread(aniArr, sizeof(uint64_t), 101, fp_ani);
+    fclose(fp_ani);
+    cerr << "-----read the ani file from: " << file_ani << endl;
+}

-bool loadMST(string folderPath, vector<SketchInfo>& sketches, vector<EdgeInfo>& mst)
+void loadMST(string folderPath, vector<EdgeInfo>& mst)
 {
-    bool sketch_by_file = load_genome_info(folderPath, "mst", sketches);
     //load the mst edge
     string file_mst = folderPath + '/' + "edge.mst";
     FILE* fp_mst = fopen(file_mst.c_str(), "r");
     if(!fp_mst){
-        cerr << "ERROR: saveMST(), cannot open the file: " << file_mst << endl;
+        cerr << "ERROR: loadMST(), cannot open the file: " << file_mst << endl;
         exit(1);
     }
     size_t mst_size;
@@ -51,7 +66,6 @@ bool loadMST(string folderPath, vector<SketchInfo>& sketches, vector<EdgeInfo>&
     }
     fclose(fp_mst);
     cerr << "-----read the mst file from " << file_mst << endl;
-    return sketch_by_file;
 }

 void printResult(vector<vector<int>>& cluster, vector<SketchInfo>& sketches, bool sketchByFile, string outputFile)
@@ -88,7 +102,7 @@ void printResult(vector<vector<int>>& cluster, vector<SketchInfo>& sketches, boo

 }

-void saveMST(vector<SketchInfo> sketches, vector<EdgeInfo> mst, string folderPath, bool sketchByFile){
+void saveMST(vector<SketchInfo>& sketches, vector<EdgeInfo>& mst, string folderPath, bool sketchByFile){
     save_genome_info(sketches, folderPath, "mst", sketchByFile);
     string file_mst = folderPath + '/' + "edge.mst";
     FILE* fp_mst = fopen(file_mst.c_str(), "w+");
@@ -137,7 +151,7 @@ void saveANI(string folderPath, uint64_t* aniArr, int sketch_func_id){
     }
     fwrite(aniArr, sizeof(uint64_t), 101, fp_ani);
     fclose(fp_ani);
-    cerr << "-----save the ani file into: " << folderPath << endl;
+    cerr << "-----save the ani file into: " << file_ani << endl;
 }


src/MST_IO.h

Lines changed: 3 additions & 11 deletions
@@ -11,22 +11,14 @@ struct ClusterInfo{
     int id;
     uint64_t length;
 };
-//void loadDense(int** &denseArr, string inputFile, int denseSpan, vector<SketchInfo> sketches);
-
-//bool loadMSTs(string inputFile, string inputFile1, vector<SketchInfo>& sketches, vector<EdgeInfo>& mst);

 void printResult(std::vector<std::vector<int>>& clusterOrigin, std::vector<SketchInfo>& sketches, bool sketchByFile, string outputFile);

-bool loadMST(string folderPath, vector<SketchInfo>& sketches, vector<EdgeInfo>& mst);
+void loadMST(string folderPath, vector<EdgeInfo>& mst);
 void loadDense(int** &denseArr, string folderPath, int& denseSpan, int& genome_number);
-//void saveMST(string folderPath, string inputFile, string sketchFunc, bool isContainment, int containCompress, vector<SketchInfo> sketches, vector<EdgeInfo> mst, bool sketchByFile, int sketchSize, int kmerSize);
-//
-//void saveDense(string folderPath, string prefixName, int** denseArr, int denseSpan, vector<SketchInfo> sketches);
-//
-//void saveANI(string folderPath, string prefixName, uint64_t* aniArr, string sketchFunc);
-
+void loadANI(string folderPath, uint64_t* &aniArr, int sketch_func_id);

-void saveMST(vector<SketchInfo> sketches, vector<EdgeInfo> mst, string folderPath, bool sketchByFile);
+void saveMST(vector<SketchInfo>& sketches, vector<EdgeInfo>& mst, string folderPath, bool sketchByFile);
 void saveDense(string folderPath, int** denseArr, int denseSpan, int genome_number);
 void saveANI(string folderPath, uint64_t* aniArr, int sketch_func_id);

src/SketchInfo.h

Lines changed: 2 additions & 0 deletions
@@ -35,6 +35,8 @@ struct SketchInfo{
 };


+bool cmpGenomeSize(SketchInfo s1, SketchInfo s2);
+bool cmpSeqSize(SketchInfo s1, SketchInfo s2);

 void calSize(bool sketchByFile, string inputFile, int threads, uint64_t minLen, uint64_t &maxSize, uint64_t& minSize, uint64_t& averageSize);
 bool sketchSequences(string inputFile, int kmerSize, int sketchSize, int minLen, string sketchFunc, bool isContainment, int containCompress, vector<SketchInfo>& sketches, int threads);

src/Sketch_IO.cpp

Lines changed: 25 additions & 1 deletion
@@ -9,12 +9,36 @@ bool cmpIndex(SketchInfo s1, SketchInfo s2){
     return s1.id < s2.id;
 }

+void read_sketch_parameters(string folder_path, int& sketch_func_id, int& kmer_size, bool& is_containment, int& contain_compress, int& sketch_size, int& half_k, int& half_subk, int& drlevel){
+    string hash_file = folder_path + '/' + "hash.sketch";
+    FILE * fp_hash = fopen(hash_file.c_str(), "r");
+    if(!fp_hash){
+        cerr << "ERROR: read_sketch_parameters(), cannot open the file: " << hash_file << endl;
+        exit(1);
+    }
+    fread(&sketch_func_id, sizeof(int), 1, fp_hash);
+    if(sketch_func_id == 0){
+        fread(&kmer_size, sizeof(int), 1, fp_hash);
+        fread(&is_containment, sizeof(bool), 1, fp_hash);
+        if(is_containment)
+            fread(&contain_compress, sizeof(int), 1, fp_hash);
+        else
+            fread(&sketch_size, sizeof(int), 1, fp_hash);
+    }
+    else if(sketch_func_id == 1){
+        fread(&half_k, sizeof(int), 1, fp_hash);
+        fread(&half_subk, sizeof(int), 1, fp_hash);
+        fread(&drlevel, sizeof(int), 1, fp_hash);
+    }
+    fclose(fp_hash);
+}
+
 void save_genome_info(vector<SketchInfo>& sketches, string folderPath, string type, bool sketchByFile){
     assert(type == "sketch" || type == "mst");
     string info_file = folderPath + '/' + "info." + type;
     FILE * fp_info = fopen(info_file.c_str(), "w+");
     if(!fp_info){
-        cerr << "ERROR: saveSketches(), cannot open the file: " << info_file << endl;
+        cerr << "ERROR: save_genome_info(), cannot open the file: " << info_file << endl;
         exit(1);
     }
     fwrite(&sketchByFile, sizeof(bool), 1, fp_info);
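
`read_sketch_parameters` reads a header whose layout depends on its first field. MinHash (`sketch_func_id == 0`) stores the k-mer size, a containment flag, and then either `contain_compress` or `sketch_size`. KSSD (`sketch_func_id == 1`) stores `half_k`, `half_subk`, and `drlevel`. The function as committed ignores the return value of `fread`, so a truncated file would go unnoticed. The sketch below round-trips the same field order with checked reads. `SketchParams`, `read_one`, `write_one`, `write_params`, and `read_params` are hypothetical names, and the writer is an assumed mirror of the reader, since the body of `saveSketches` is not shown here.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical container for the fields read_sketch_parameters fills in.
struct SketchParams {
    int sketch_func_id = 0;  // 0: MinHash, 1: KSSD (values taken from the diff)
    int kmer_size = 0;
    bool is_containment = false;
    int contain_compress = 0;
    int sketch_size = 0;
    int half_k = 0;
    int half_subk = 0;
    int drlevel = 0;
};

// Read or write exactly one raw value; false signals a short read or write.
template <typename T>
bool read_one(std::FILE* fp, T& value) {
    assert(fp != nullptr);
    return std::fread(&value, sizeof(T), 1, fp) == 1;
}

template <typename T>
bool write_one(std::FILE* fp, const T& value) {
    assert(fp != nullptr);
    return std::fwrite(&value, sizeof(T), 1, fp) == 1;
}

bool write_params(const std::string& path, const SketchParams& p) {
    std::FILE* fp = std::fopen(path.c_str(), "wb");
    if (!fp) return false;
    bool ok = write_one(fp, p.sketch_func_id);
    if (p.sketch_func_id == 0) {
        ok = ok && write_one(fp, p.kmer_size) && write_one(fp, p.is_containment);
        ok = ok && (p.is_containment ? write_one(fp, p.contain_compress)
                                     : write_one(fp, p.sketch_size));
    } else if (p.sketch_func_id == 1) {
        ok = ok && write_one(fp, p.half_k) && write_one(fp, p.half_subk) &&
             write_one(fp, p.drlevel);
    } else {
        ok = false;  // unknown sketch function id
    }
    std::fclose(fp);
    return ok;
}

bool read_params(const std::string& path, SketchParams& p) {
    std::FILE* fp = std::fopen(path.c_str(), "rb");
    if (!fp) return false;
    bool ok = read_one(fp, p.sketch_func_id);
    if (ok && p.sketch_func_id == 0) {
        ok = read_one(fp, p.kmer_size) && read_one(fp, p.is_containment);
        ok = ok && (p.is_containment ? read_one(fp, p.contain_compress)
                                     : read_one(fp, p.sketch_size));
    } else if (ok && p.sketch_func_id == 1) {
        ok = read_one(fp, p.half_k) && read_one(fp, p.half_subk) &&
             read_one(fp, p.drlevel);
    } else {
        ok = false;  // short read or unknown sketch function id
    }
    std::fclose(fp);
    return ok;
}
```

Returning `bool` instead of calling `exit(1)` lets a caller decide whether a damaged sketch folder is fatal. The format is raw native-endian memory, so files written this way are only portable between machines with the same `int` and `bool` representation.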

src/Sketch_IO.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33
#include "SketchInfo.h"
44
#include "common.hpp"
55

6+
void read_sketch_parameters(string folder_path, int& sketch_func_id, int& kmer_size, bool& is_containment, int& contain_compress, int& sketch_size, int& half_k, int& half_subk, int& drlevel);
67
void save_genome_info(vector<SketchInfo>& sketches, string folderPath, string type, bool sketchByFile);
78
void saveSketches(vector<SketchInfo>& sketches, string folderPath, bool sketchByFile, string sketchFunc, bool isContainment, int containCompress, int sketchSize, int kmerSize);
89

src/main.cpp

Lines changed: 24 additions & 3 deletions
@@ -95,8 +95,10 @@ int main(int argc, char * argv[]){
 #ifndef GREEDY_CLUST
     auto option_premsted = app.add_option("--premsted", folder_path, "clustering by the pre-generated mst files rather than genomes for clust-mst");
 #endif
+    auto option_append = app.add_option("--append", inputFile, "append genome file or file list with the pre-generated sketch or MST files");

     option_output->required();
+    option_append->excludes(option_input);

     CLI11_PARSE(app, argc, argv);

@@ -132,22 +134,41 @@ int main(int argc, char * argv[]){

 #ifndef GREEDY_CLUST
 //======clust-mst=========================================================================
-    if(*option_premsted){
+    if(*option_premsted && !*option_append){
         clust_from_mst(folder_path, outputFile, threshold, threads);
         return 0;
     }
+    if(*option_append && !*option_presketched && !*option_premsted){
+        cerr << "ERROR option --append, option --presketched or --premsted needed" << endl;
+        return 1;
+    }
+    if(*option_append && (*option_premsted || *option_presketched)){
+        append_clust_mst(folder_path, inputFile, outputFile, sketchByFile, minLen, noSave, threshold, threads);
+        return 0;
+    }
 //======clust-mst=========================================================================
+#else
+//======clust-greedy======================================================================
+    if(*option_append && !*option_presketched){
+        cerr << "ERROR option --append, option --presketched needed" << endl;
+        return 1;
+    }
+    if(*option_append && *option_presketched){
+        append_clust_greedy(folder_path, inputFile, outputFile, sketchByFile, minLen, noSave, threshold, threads);
+        return 0;
+    }
+//======clust-greedy======================================================================
 #endif

-    if(*option_presketched){
+    if(*option_presketched && !*option_append){
         clust_from_sketches(folder_path, outputFile, threshold, threads);
         return 0;
     }

     if(!tune_parameters(sketchByFile, isSetKmer, inputFile, threads, minLen, isContainment, isJaccard, kmerSize, threshold, containCompress, sketchSize)){
         return 1;
     }
-
+
     clust_from_genomes(inputFile, outputFile, sketchByFile, kmerSize, sketchSize, threshold,sketchFunc, isContainment, containCompress, minLen, folder_path, noSave, threads);

     return 0;
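
For `clust-mst`, the updated `main` now selects one of five outcomes from three flags. The sketch below restates that branch order as a pure function so the flag combinations can be checked in isolation. `MstMode` and `choose_clust_mst_mode` are hypothetical names; the real `main` calls the clustering routines directly and does not return an enum.

```cpp
#include <cassert>  // only needed by callers that assert on the result

// Hypothetical labels for the outcomes of the clust-mst branch in main().
enum class MstMode {
    FromMst,       // clust_from_mst
    AppendError,   // --append without --presketched or --premsted
    Append,        // append_clust_mst
    FromSketches,  // clust_from_sketches
    FromGenomes    // tune_parameters + clust_from_genomes
};

// Mirrors the order of the if-statements shown in the diff.
MstMode choose_clust_mst_mode(bool premsted, bool presketched, bool append) {
    if (premsted && !append) return MstMode::FromMst;
    if (append && !presketched && !premsted) return MstMode::AppendError;
    if (append && (premsted || presketched)) return MstMode::Append;
    if (presketched && !append) return MstMode::FromSketches;
    return MstMode::FromGenomes;
}
```

Because the `--premsted` check comes first, passing both `--premsted` and `--presketched` without `--append` resolves to the MST path, as it did before this commit.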
