Commit e6baf7d

agudys and aziele authored
Add the deduplicate command; improve verbosity; add more tests
Co-authored-by: aziele <[email protected]>
1 parent 8807e88 commit e6baf7d

23 files changed: +15004 -4537 lines changed

.github/workflows/large.yml

Lines changed: 115 additions & 4 deletions
@@ -4,7 +4,118 @@ on:
   workflow_dispatch:
 
 jobs:
-  dummy:
-    name: dummy
-    runs-on: echo
-
+
+  ########################################################################################
+  checkout:
+    name: Checkout
+    runs-on: [self-hosted, vclust, x64_linux, large]
+
+    steps:
+    - name: clean
+      run: rm -rf ${{ github.workspace }}/*
+    - uses: actions/checkout@v4
+      with:
+        submodules: recursive
+    - name: Get tags
+      run: |
+        cd ./3rd_party/clusty/libs/igraph
+        git fetch --prune --unshallow
+        echo exit code $?
+        git tag --list
+      continue-on-error: true
+
+  ########################################################################################
+  download-release:
+    name: Download release
+    needs: checkout
+    strategy:
+      matrix:
+        compiler: [14]
+    runs-on: [self-hosted, vclust, x64_linux, large]
+
+    steps:
+    # - name: clean
+    # run: rm -rf ${{ github.workspace }}/*
+    # - uses: robinraju/[email protected]
+    # with:
+    # latest: true
+    # tarBall: true
+    # extract: true
+    # token: ${{ secrets.MY_TOKEN }}
+    # - name: download
+    # run: ./.github/workflows/github-release-downloader.sh refresh-bio vclust-dev "x64_linux.tar.gz"
+    - name: make
+      run: gmake -j32 CXX=g++-${{matrix.compiler}} CC=gcc-${{matrix.compiler}} PLATFORM=avx2 LEIDEN=true STATIC_LINK=true
+    - name: print info
+      run: python3 vclust.py info
+
+  ########################################################################################
+  ani:
+    name: ANI calculation
+    needs: download-release
+    strategy:
+      fail-fast: false
+      matrix:
+        dataset: [ICTV, IMGVR]
+        include:
+        - dataset: ICTV
+          variant_name: full
+          prefilter_args: '-k 25 --min-ident 0.7 --min-kmers 20'
+          align_args: '--out-tani 0.70'
+        - dataset: IMGVR_HQ
+          variant_name: full
+          prefilter_args: '-k 25 --min-ident 0.95 --min-kmers 20 --batch-size 1000000'
+          align_args: '--out-ani 0.95 --out-qcov 0.85'
+        - dataset: IMGVR
+          variant_name: fraction_02
+          prefilter_args: '-k 25 --min-ident 0.95 --min-kmers 4 --kmers-fraction 0.2 --batch-size 2000000'
+          align_args: '--out-ani 0.95 --out-qcov 0.85'
+    env:
+      INPUT_DIR: ../../../../vclust/input
+      TEMP_DIR: ../../../../vclust/temp
+
+    runs-on: [self-hosted, vclust, x64_linux, large]
+
+    steps:
+    - name: prefilter
+      run: /usr/bin/time -v ./vclust.py prefilter -t 32 -i ${INPUT_DIR}/${{ matrix.dataset }}.fna.gz -o ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.filter ${{ matrix.prefilter_args }}
+    - name: prefilter md5
+      run: md5sum ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.filter
+    - name: align
+      run: /usr/bin/time -v ./vclust.py align -t 32 -i ${INPUT_DIR}/${{ matrix.dataset }}.fna.gz -o ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.ani.tsv --filter ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.filter ${{ matrix.align_args }}
+    - name: align md5
+      run: md5sum ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.ani.tsv ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.ani.ids.tsv
+
+  ########################################################################################
+  clustering:
+    name: clustering
+    needs: ani
+    strategy:
+      fail-fast: false
+      matrix:
+        dataset: [ICTV, IMGVR, IMGVR_HQ]
+        algo_name: [single, complete, set-cover, uclust, cd-hit, leiden_07, leiden_10]
+        include:
+        - {dataset: ICTV, variant_name: full, args: '--metric tani --tani 0.95'}
+        - {dataset: IMGVR, variant_name: fraction_02, args: '--metric ani --ani 0.95 --qcov 0.85'}
+        - {dataset: IMGVR_HQ, variant_name: full, args: '--metric ani --ani 0.95 --qcov 0.85'}
+        - {algo_name: single, algo_cmd: single}
+        - {algo_name: complete, algo_cmd: complete}
+        - {algo_name: set-cover, algo_cmd: set-cover}
+        - {algo_name: uclust, algo_cmd: uclust}
+        - {algo_name: cd-hit, algo_cmd: cd-hit}
+        - {algo_name: leiden_07, algo_cmd: 'leiden --leiden-resolution 0.7'}
+        - {algo_name: leiden_10, algo_cmd: 'leiden --leiden-resolution 1.0'}
+
+    env:
+      INPUT_DIR: ../../../../vclust/input
+      TEMP_DIR: ../../../../vclust/temp
+
+    runs-on: [self-hosted, vclust, x64_linux, large]
+
+    steps:
+    - name: cluster
+      run: /usr/bin/time -v ./vclust.py cluster -i ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.ani.tsv --ids ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.ani.ids.tsv -o ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.${{ matrix.algo_name }}.clusty --algorithm ${{ matrix.algo_cmd }} ${{ matrix.args }}
+    - name: md5
+      run: md5sum ${TEMP_DIR}/${{ matrix.dataset }}.${{ matrix.variant_name }}.${{ matrix.algo_name }}.clusty
+
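Reviewer note: the `ani` and `clustering` jobs rely on how `strategy.matrix.include` merges extra keys into the base combinations, which is easy to misread in the diff. The sketch below is a simplified Python model of that merge, not GitHub's implementation. The function name `expand_matrix` is hypothetical, and the `prefilter_args`/`align_args` keys are left out for brevity. An `include` entry is merged into every original combination whose axis values it does not overwrite. If it fits none, it becomes a new combination.

```python
from itertools import product


def expand_matrix(axes, include):
    """Simplified model of GitHub Actions matrix expansion with `include`."""
    # Cartesian product of the declared axes gives the original combinations.
    originals = [dict(zip(axes, values)) for values in product(*axes.values())]
    extras = []
    for entry in include:
        matched = False
        for combo in originals:
            # The entry applies only if it would not overwrite an axis value.
            if all(combo[k] == v for k, v in entry.items() if k in axes):
                combo.update(entry)
                matched = True
        if not matched:
            # No original combination fits: the entry becomes a new job.
            extras.append(dict(entry))
    return originals + extras


# "ani" job: two declared datasets, three include entries.
ani_jobs = expand_matrix(
    {"dataset": ["ICTV", "IMGVR"]},
    [
        {"dataset": "ICTV", "variant_name": "full"},
        {"dataset": "IMGVR_HQ", "variant_name": "full"},
        {"dataset": "IMGVR", "variant_name": "fraction_02"},
    ],
)

# "clustering" job: 3 datasets x 7 algorithms.
algo_cmds = {
    "single": "single",
    "complete": "complete",
    "set-cover": "set-cover",
    "uclust": "uclust",
    "cd-hit": "cd-hit",
    "leiden_07": "leiden --leiden-resolution 0.7",
    "leiden_10": "leiden --leiden-resolution 1.0",
}
cluster_jobs = expand_matrix(
    {"dataset": ["ICTV", "IMGVR", "IMGVR_HQ"], "algo_name": list(algo_cmds)},
    [
        {"dataset": "ICTV", "variant_name": "full",
         "args": "--metric tani --tani 0.95"},
        {"dataset": "IMGVR", "variant_name": "fraction_02",
         "args": "--metric ani --ani 0.95 --qcov 0.85"},
        {"dataset": "IMGVR_HQ", "variant_name": "full",
         "args": "--metric ani --ani 0.95 --qcov 0.85"},
    ]
    + [{"algo_name": name, "algo_cmd": cmd} for name, cmd in algo_cmds.items()],
)


def cluster_output(job):
    """File name written by the `cluster` step for one matrix combination."""
    return f"{job['dataset']}.{job['variant_name']}.{job['algo_name']}.clusty"


for job in cluster_jobs[:3]:
    print(cluster_output(job), "|", job["algo_cmd"], job["args"])
```

Under this model the `ani` job runs three combinations, because the `IMGVR_HQ` entry matches no declared dataset and is added as a new one. The `clustering` job runs 21, each carrying `variant_name`, `args`, and `algo_cmd`.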

README.md

Lines changed: 4 additions & 2 deletions
@@ -80,15 +80,17 @@ The Vclust documentation is available on the [GitHub Wiki](https://github.com/re
    2. Prefilter
    3. Align
    4. Cluster
+   5. Deduplicate
 5. [Optimizing sensitivity and resource usage](https://github.com/refresh-bio/vclust/wiki/5-Optimizing-sensitivity-and-resource-usage)
 6. [Use cases](https://github.com/refresh-bio/vclust/wiki/6-Use-cases)
    1. Classify viruses into species and genera following ICTV standards
    2. Assign viral contigs into vOTUs following MIUViG standards
    3. Dereplicate viral contigs into representative genomes
-   4. Calculate pairwise similarities between all-versus-all genomes
-   5. Process large dataset of diverse virus genomes (IMG/VR)
+   4. Process large dataset of diverse virus genomes (IMG/VR)
+   5. Deduplicate (remove duplicate sequences) between and within multiple datasets
    6. Process large dataset of highly redundant virus genomes
    7. Cluster plasmid genomes into pOTUs
+   8. Calculate pairwise similarities between all-versus-all genomes
 7. [FAQ: Frequently Asked Questions](https://github.com/refresh-bio/vclust/wiki/7-FAQ:-Frequently-Asked-Questions)
 
 
example/README.txt

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+This dataset comprises bacteriophage genome sequences with simulated mutations relative to the reference sequence. Mutations include substitutions (sn), deletions (del), insertions (ins), duplications (dup), inversions (inv), and translocations (tl). These modified sequences (.alt*) have known true total ANI (tANI) values compared to the reference.
+
+ref_id     alt_id          ref_len  alt_len  tani     alt_summary
+NC_010807  NC_010807.alt1  38815    38815    0.99753  sn;inv;tl
+NC_010807  NC_010807.alt2  38815    40555    0.98985  sn;dup
+NC_010807  NC_010807.alt3  38815    39891    0.98414  sn;ins;tl
+NC_005091  NC_005091.alt1  57455    57455    0.97161  sn;inv;tl
+NC_005091  NC_005091.alt2  57455    63696    0.96707  sn;dup;tl
+NC_025457  NC_025457.alt1  42654    41066    0.80607  sn;del;ins;dup;inv
+NC_025457  NC_025457.alt2  42654    64164    0.75921  sn;del;ins;dup;inv;tl
+NC_002486  NC_002486.alt   45636    45636    1.00000  tl
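Reviewer note: since the table lists known true tANI values, it can serve as ground truth when checking any tool's output. The sketch below shows one possible way to load it and measure the largest deviation. It assumes whitespace-separated columns, as rendered in this diff. The helper names `load_truth` and `max_abs_error` are hypothetical and not part of vclust, and `measured` stands for tANI values from whatever tool is being checked.

```python
TRUTH_TABLE = """\
ref_id alt_id ref_len alt_len tani alt_summary
NC_010807 NC_010807.alt1 38815 38815 0.99753 sn;inv;tl
NC_010807 NC_010807.alt2 38815 40555 0.98985 sn;dup
NC_010807 NC_010807.alt3 38815 39891 0.98414 sn;ins;tl
NC_005091 NC_005091.alt1 57455 57455 0.97161 sn;inv;tl
NC_005091 NC_005091.alt2 57455 63696 0.96707 sn;dup;tl
NC_025457 NC_025457.alt1 42654 41066 0.80607 sn;del;ins;dup;inv
NC_025457 NC_025457.alt2 42654 64164 0.75921 sn;del;ins;dup;inv;tl
NC_002486 NC_002486.alt 45636 45636 1.00000 tl
"""


def load_truth(text):
    """Parse the ground-truth table into a list of typed records."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        rec = dict(zip(header, line.split()))
        rec["ref_len"] = int(rec["ref_len"])
        rec["alt_len"] = int(rec["alt_len"])
        rec["tani"] = float(rec["tani"])
        rec["alt_summary"] = rec["alt_summary"].split(";")
        rows.append(rec)
    return rows


def max_abs_error(truth_rows, measured):
    """Largest |true tANI - measured tANI| over all pairs.

    `measured` is a hypothetical dict mapping (ref_id, alt_id) to the tANI
    reported by the tool under test.
    """
    return max(
        abs(r["tani"] - measured[(r["ref_id"], r["alt_id"])])
        for r in truth_rows
    )


truth = load_truth(TRUTH_TABLE)
# Sanity check: comparing the table against itself gives zero error.
identity = {(r["ref_id"], r["alt_id"]): r["tani"] for r in truth}
print(len(truth), max_abs_error(truth, identity))
```

The table can also be checked for internal consistency with the same records: every row whose `alt_len` equals `ref_len` lists only substitutions, inversions, or translocations.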

example/datasets/README.txt

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+Duplicate sequences (identical sequences)
+
+refseq.fna genbank.fna other
+NC_002486.1 = AB044554.1
+NC_005091.2 = AY357582.2 = AY357582.2_duplicate
+NC_010807.1 = EU547803.1 = NC_010807.1_duplicate
+NC_025457.1 = KJ473423.1
+MN428048.1 = MN428048.1_revcomp
+MK937595.1
+Mushuvirus = Mushuvirus_copy
+
+
+7 unique sequences
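Reviewer note: the `_revcomp` entry suggests that a sequence and its reverse complement are meant to count as duplicates. That reading is an assumption based on the name alone. The sketch below shows one common way to group such duplicates, by hashing a strand-independent canonical form of each sequence. It is not vclust's `deduplicate` implementation. The function names and the toy records are hypothetical, and only unambiguous A/C/G/T bases are handled.

```python
import hashlib
from collections import defaultdict

# Complement table for unambiguous DNA bases only (assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_key(seq):
    """Hash that is identical for a sequence and its reverse complement."""
    forward = seq.upper()
    # Pick the lexicographically smaller strand as the canonical form.
    canonical = min(forward, reverse_complement(forward))
    return hashlib.sha256(canonical.encode()).hexdigest()


def group_duplicates(records):
    """Group (name, sequence) records that share a canonical key."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[canonical_key(seq)].append(name)
    return list(groups.values())


# Toy records, not taken from the example dataset.
records = [
    ("refseq_1", "ATGCGTAC"),
    ("genbank_1", "atgcgtac"),          # same sequence, lower case
    ("genbank_2", "GGGTTTAA"),
    ("genbank_2_revcomp", "TTAAACCC"),  # reverse complement of genbank_2
    ("unique_1", "ACACACAC"),
]
groups = group_duplicates(records)
for members in groups:
    print(" = ".join(members))
print(len(groups), "unique sequences")
```

Hashing the canonical strand keeps memory proportional to the number of sequences rather than their total length, at the cost of a negligible collision risk.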
