makepaddedseqdb issue #530

@schmigle

I'm trying to build a pipeline around foldseek easy-search, and I'd like it to run on the GPU, both to generate the structural (3Di) sequences with ProstT5 and to speed up the comparisons. However, the padding step does not seem to work. I first created a database:
foldseek createdb db_concat.faa foldseek_db --prostt5-model ~/prostt5 --gpu 1 --threads 64

Then I ran the following command, for which I've also provided output:

foldseek easy-search benchmark_all_00/positive_query.faa benchmark_all_00/db/foldseek_db foldseek_test.tsv tmp --threads 64 --gpu 1 --prostt5-model ~/prostt5
easy-search benchmark_all_00/positive_query.faa benchmark_all_00/db/foldseek_db foldseek_test.tsv tmp --threads 64 --gpu 1 --prostt5-model /people/stey877/prostt5 

MMseqs Version:                    	10.941cd33
Seq. id. threshold                 	0
Coverage threshold                 	0
Coverage mode                      	0
Max reject                         	2147483647
Max accept                         	2147483647
Add backtrace                      	false
TMscore threshold                  	0
TMscore threshold mode             	0
TMalign hit order                  	0
TMalign fast                       	1
Preload mode                       	0
Threads                            	64
Verbosity                          	3
LDDT threshold                     	0
Sort by structure bit score        	1
Alignment type                     	2
Exact TMscore                      	0
Substitution matrix                	aa:3di.out,nucl:3di.out
Alignment mode                     	3
Alignment mode                     	0
E-value threshold                  	10
Min alignment length               	0
Seq. id. mode                      	0
Alternative alignments             	0
Max sequence length                	65535
Compositional bias                 	1
Compositional bias                 	1
Gap open cost                      	aa:10,nucl:10
Gap extension cost                 	aa:1,nucl:1
Compressed                         	0
Seed substitution matrix           	aa:3di.out,nucl:3di.out
Sensitivity                        	9.5
k-mer length                       	6
Target search mode                 	0
k-score                            	seq:2147483647,prof:2147483647
Max results per query              	1000
Split database                     	0
Split mode                         	2
Split memory limit                 	0
Diagonal scoring                   	true
Exact k-mer matching               	0
Mask residues                      	0
Mask residues probability          	0.999995
Mask lower case residues           	1
Mask lower letter repeating N times	6
Minimum diagonal score             	30
Selected taxa                      	
Spaced k-mers                      	1
Spaced k-mer pattern               	
Local temporary path               	
Use GPU                            	1
Use GPU server                     	0
Wait for GPU server                	600
Prefilter mode                     	0
Exhaustive search mode             	false
Search iterations                  	1
Remove temporary files             	true
MPI runner                         	
Force restart with latest tmp      	false
Cluster search                     	0
Path to ProstT5                    	/people/stey877/prostt5
Chain name mode                    	0
Createdb extraction mode           	0
Interface distance threshold       	8
Write mapping file                 	0
Mask b-factor threshold            	0
Coord store mode                   	2
Write lookup file                  	1
Input format                       	0
File Inclusion Regex               	.*
File Exclusion Regex               	^$
Alignment format                   	0
Format alignment output            	query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                    	false
Report mode                        	2
Greedy best hits                   	false

createdb benchmark_all_00/positive_query.faa tmp/12571839706662637167/query --gpu 1 --prostt5-model /people/stey877/prostt5 --chain-name-mode 0 --db-extraction-mode 0 --distance-threshold 8 --write-mapping 0 --mask-bfactor-threshold 0 --coord-store-mode 2 --write-lookup 1 --input-format 0 --file-include '.*' --file-exclude '^$' --threads 64 -v 3 

Converting sequences

Time for merging to query_h: 0h 0m 0s 198ms
Time for merging to query: 0h 0m 0s 234ms
Database type: Aminoacid
CUDA0
CPU
[=================================================================] 100.00% 96 3s 239ms      
Time for merging to query_ss: 0h 0m 0s 448ms
Time for merging to query_ss_tmp: 0h 0m 0s 405ms
Time for processing: 0h 0m 6s 900ms
Create directory tmp/12571839706662637167/search_tmp
search tmp/12571839706662637167/query benchmark_all_00/db/foldseek_db tmp/12571839706662637167/result tmp/12571839706662637167/search_tmp --threads 64 --alignment-mode 3 -s 9.5 -k 6 --gpu 1 --remove-tmp-files 1 

ungappedprefilter tmp/12571839706662637167/query_ss benchmark_all_00/db/foldseek_db_ss tmp/12571839706662637167/search_tmp/10720767736467567852/pref --sub-mat 'aa:3di.out,nucl:3di.out' -c 0 -e 1.79769e+308 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --min-ungapped-score 30 --max-seqs 1000 --db-load-mode 0 --gpu 1 --gpu-server 0 --gpu-server-wait-timeout 600 --prefilter-mode 0 --threads 64 --compressed 0 -v 3 

Database foldseek_db_ss is not a valid GPU database
Please call: makepaddedseqdb foldseek_db_ss foldseek_db_ss_pad
Error: Ungapped prefilter matching step died
Error: Search died

Running makepaddedseqdb manually on the _ss database, as the message suggests, fails as well:

foldseek makepaddedseqdb foldseek_db_ss foldseek_db_ss_padded
makepaddedseqdb foldseek_db_ss foldseek_db_ss_padded 

MMseqs Version:          	10.941cd33
Substitution matrix      	aa:3di.out,nucl:3di.out
Mask residues            	0
Mask residues probability	0.999995
Write lookup file        	1
Threads                  	32
Verbosity                	3
Cluster search           	0

Database foldseek_db_ss needs header information

In addition, I thought easy-search handled padding on its own (cf. #399).

If I run the padding on the entire database prefix, rather than on just the *_ss files, it runs successfully, but the easy-search error stays the same. It also produces some empty files, plus one odd file that displays as a large blank area; wc -l reports 22 lines, so I assume it contains 22 line breaks or something similar.
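To make the "empty" and "blank" files easier to describe, the output files can be summarised by byte size, newline count, and a short byte dump. This is a minimal sketch that uses only standard tools (`wc`, `head`, `od`). The `DB_PREFIX` variable and the `inspect_db_files` helper are names invented for this example, and the default prefix is a placeholder for whatever prefix the padding step wrote.

```shell
#!/usr/bin/env bash
# Summarise every file that shares a database prefix so that empty or
# whitespace-only files stand out.
# DB_PREFIX is a placeholder (assumption): set it to the padded database prefix.
DB_PREFIX="${DB_PREFIX:-foldseek_db_pad}"

inspect_db_files() {
    local prefix="$1" f bytes lines
    for f in "$prefix"*; do
        # Skip the unexpanded glob pattern and anything that is not a regular file.
        [ -f "$f" ] || continue
        bytes=$(wc -c < "$f" | tr -d ' ')
        lines=$(wc -l < "$f" | tr -d ' ')
        printf '%s\t%s bytes\t%s newlines\n' "$f" "$bytes" "$lines"
        # Dump the first 32 bytes; runs of \0 or \n explain a file that looks blank.
        head -c 32 "$f" | od -An -c
    done
}

inspect_db_files "$DB_PREFIX"
```

A file reported with 0 bytes is truly empty, while a file with a non-zero size whose dump shows only `\n` or `\0` is binary or whitespace content that a text viewer renders as blank space.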

Any help would be much appreciated! I installed foldseek through Mamba, and that package was last updated in January, but I saw that makepaddeddb.sh was updated in August. Could that be related to this problem?

Edit: I downloaded the AVX2 GPU version, since AVX2 is supported on my HPC:

grep -m1 -o 'avx2' /proc/cpuinfo
avx2

I re-ran createdb and makepaddedseqdb:

foldseek createdb db_concat_01.faa foldseek_db --prostt5-model ~/prostt5 --gpu 1
createdb db_concat_01.faa foldseek_db --prostt5-model /people/stey877/prostt5 --gpu 1 

MMseqs Version:             	d6204679ceef8a559be2e7a92e89760e31fbc21a
Use GPU                     	1
Path to ProstT5             	/people/stey877/prostt5
Chain name mode             	0
Model name mode             	0
Createdb extraction mode    	0
Interface distance threshold	10
Write mapping file          	0
Write Foldcomp              	0
Mask b-factor threshold     	0
Coord store mode            	2
Write lookup file           	1
Input format                	0
File Inclusion Regex        	.*
File Exclusion Regex        	^$
Threads                     	256
Verbosity                   	3

Converting sequences
[304] 0s 324ms
Sort single files in 0h 0m 1s 142ms
Merge all files 0h 0m 0s 386ms
Database type: Aminoacid
CUDA0
CPU
[=================================================================] 100.00% 348 11s 137ms    
Time for merging to foldseek_db_ss: 0h 0m 2s 75ms
Time for merging to foldseek_db_ss_tmp: 0h 0m 2s 57ms
Time for processing: 0h 0m 21s 233ms

foldseek makepaddedseqdb foldseek_db foldseek_db_pad
makepaddedseqdb foldseek_db foldseek_db_pad 

MMseqs Version:          	d6204679ceef8a559be2e7a92e89760e31fbc21a
Substitution matrix      	aa:3di.out,nucl:3di.out
Mask residues            	0
Mask residues probability	0.999995
Write lookup file        	1
Threads                  	256
Verbosity                	3
Cluster search           	0

lndb foldseek_db_h foldseek_db_pad_tmp_ss_h 

Time for processing: 0h 0m 0s 18ms
lndb foldseek_db_ss foldseek_db_pad_tmp_ss 

Time for processing: 0h 0m 0s 14ms
makepaddedseqdb foldseek_db_pad_tmp_ss foldseek_db_pad_ss --sub-mat 'aa:3di.out,nucl:3di.out' --score-bias 0 --mask 0 --mask-prob 0.999995 --mask-lower-case 1 --mask-n-repeat 6 --write-lookup 1 --threads 256 -v 3 

[=================================================================] 100.00% 348 0s 43ms     
Time for merging to foldseek_db_pad_ss: 0h 0m 1s 552ms
Time for merging to foldseek_db_pad_ss_h: 0h 0m 1s 845ms
Time for processing: 0h 0m 8s 481ms
rmdb foldseek_db_pad_tmp_ss 

Time for processing: 0h 0m 0s 8ms
rmdb foldseek_db_pad_tmp_ss_h 

Time for processing: 0h 0m 0s 7ms
renamedbkeys foldseek_db_pad_ss.gpu_mapping1 foldseek_db foldseek_db_pad --subdb-mode 1 --threads 256 -v 3 

Time for merging to foldseek_db_pad: 0h 0m 0s 10ms
Time for merging to foldseek_db_pad_h: 0h 0m 0s 92ms
Time for processing: 0h 0m 0s 183ms
foldseek_db_pad_h exists and will be overwritten
renamedbkeys foldseek_db_pad_ss.gpu_mapping1 foldseek_db_h foldseek_db_pad_h --subdb-mode 1 --threads 256 -v 3 

Time for merging to foldseek_db_pad_h: 0h 0m 0s 11ms
Time for processing: 0h 0m 0s 26ms

As the log shows, makepaddedseqdb now runs on the full database prefix and does process the _ss files, although calling it directly on foldseek_db_ss still fails with the same "needs header information" message.

When I now attempt to run foldseek easy-search, I get a new error:

foldseek easy-search db_concat_01.faa foldseek_db test.m8 tmp --prostt5-model ~/prostt5 --gpu 1 
Create directory tmp
easy-search db_concat_01.faa foldseek_db test.m8 tmp --prostt5-model /people/stey877/prostt5 --gpu 1 

MMseqs Version:                    	d6204679ceef8a559be2e7a92e89760e31fbc21a
TMscore threshold                  	0
TMscore threshold mode             	0
LDDT threshold                     	0
Sort by structure bit score        	1
Alignment type                     	2
Exact TMscore                      	0
Substitution matrix                	aa:3di.out,nucl:3di.out
Add backtrace                      	false
Alignment mode                     	3
Alignment mode                     	0
E-value threshold                  	10
Seq. id. threshold                 	0
Min alignment length               	0
Seq. id. mode                      	0
Alternative alignments             	0
Coverage threshold                 	0
Coverage mode                      	0
Max sequence length                	65535
Compositional bias                 	1
Compositional bias scale           	1
Max reject                         	2147483647
Max accept                         	2147483647
Preload mode                       	0
Gap open cost                      	aa:10,nucl:10
Gap extension cost                 	aa:1,nucl:1
Threads                            	256
Compressed                         	0
Verbosity                          	3
Seed substitution matrix           	aa:3di.out,nucl:3di.out
Sensitivity                        	9.5
k-mer length                       	6
Target search mode                 	0
k-score                            	seq:2147483647,prof:2147483647
Max results per query              	1000
Split database                     	0
Split mode                         	2
Split memory limit                 	0
Diagonal scoring                   	true
Exact k-mer matching               	0
Mask residues                      	0
Mask residues probability          	0.999995
Mask lower case residues           	1
Mask lower letter repeating N times	6
Minimum diagonal score             	30
Selected taxa                      	
Spaced k-mers                      	1
Spaced k-mer pattern               	
Local temporary path               	
Use GPU                            	1
Use GPU server                     	0
Wait for GPU server                	600
Prefilter mode                     	0
TMalign hit order                  	0
TMalign fast                       	1
MultiDomain Mode                   	1
Mask profile                       	1
Profile E-value threshold          	0.1
Global sequence weighting          	false
Allow deletions                    	false
Filter MSA                         	1
Use filter only at N seqs          	0
Maximum seq. id. threshold         	0.9
Minimum seq. id.                   	0.0
Minimum score per column           	-20
Minimum coverage                   	0
Select N most diverse seqs         	1000
Pseudo count mode                  	0
Profile output mode                	0
Cluster search                     	0
Exhaustive search mode             	false
Search iterations                  	1
Remove temporary files             	true
Force restart with latest tmp      	false
MPI runner                         	
Path to ProstT5                    	/people/stey877/prostt5
Chain name mode                    	0
Model name mode                    	0
Createdb extraction mode           	0
Interface distance threshold       	10
Write mapping file                 	0
Write Foldcomp                     	0
Mask b-factor threshold            	0
Coord store mode                   	2
Write lookup file                  	1
Input format                       	0
File Inclusion Regex               	.*
File Exclusion Regex               	^$
Alignment format                   	0
Format alignment output            	query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                    	false
Report mode                        	2
Greedy best hits                   	false

createdb db_concat_01.faa tmp/12495472389025653413/query --gpu 1 --prostt5-model /people/stey877/prostt5 --chain-name-mode 0 --model-name-mode 0 --db-extraction-mode 0 --distance-threshold 10 --write-mapping 0 --write-foldcomp 0 --mask-bfactor-threshold 0 --coord-store-mode 2 --write-lookup 1 --input-format 0 --file-include '.*' --file-exclude '^$' --threads 256 -v 3 

Converting sequences
[304] 0s 456ms
Sort single files in 0h 0m 0s 846ms
Merge all files 0h 0m 0s 482ms
Database type: Aminoacid
CUDA0
CPU
[=================================================================] 100.00% 348 10s 429ms    
Time for merging to query_ss: 0h 0m 2s 182ms
Time for merging to query_ss_tmp: 0h 0m 1s 888ms
Time for processing: 0h 0m 20s 923ms
Create directory tmp/12495472389025653413/search_tmp
search tmp/12495472389025653413/query foldseek_db tmp/12495472389025653413/result tmp/12495472389025653413/search_tmp --alignment-mode 3 -s 9.5 -k 6 --gpu 1 --remove-tmp-files 1 

ungappedprefilter tmp/12495472389025653413/query_ss foldseek_db_ss tmp/12495472389025653413/search_tmp/16285961332583313377/pref --sub-mat 'aa:3di.out,nucl:3di.out' -c 0 -e 1.79769e+308 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --min-ungapped-score 30 --max-seqs 1000 --db-load-mode 0 --gpu 1 --gpu-server 0 --gpu-server-wait-timeout 600 --prefilter-mode 0 --threads 256 --compressed 0 -v 3 

terminate called after throwing an instance of 'thrust::THRUST_200500_750_800_860_890_900_NS::system::system_error'
  what():  __copy:: D->D: failed: cudaErrorMisalignedAddress: misaligned address
tmp/12495472389025653413/search_tmp/16285961332583313377/structuresearch.sh: line 53: 573305 Aborted                 (core dumped) $RUNNER "$MMSEQS" ungappedprefilter "${QUERY_PREFILTER}" "${TARGET_PREFILTER}${INDEXEXT}" "${TMP_PATH}/pref" ${UNGAPPEDPREFILTER_PAR}
Error: Ungapped prefilter matching step died
Error: Search died
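Note that the easy-search call above still passes foldseek_db as the target, and its ungappedprefilter step opens foldseek_db_ss rather than the foldseek_db_pad_ss written by makepaddedseqdb. For comparison, here is a sketch of the same search pointed at the padded prefix. It assumes the padded prefix is what the GPU prefilter expects, which the first error message suggests but which is not verified here. `OUT`, `TMP_DIR`, and the `padded_search` helper are hypothetical names for this example, and only flags that already appear in the commands above are used.

```shell
#!/usr/bin/env bash
# Sketch only: search against the padded prefix produced by
# `foldseek makepaddedseqdb foldseek_db foldseek_db_pad`.
# OUT and TMP_DIR are hypothetical names (assumptions), not taken from the logs.
QUERY="${QUERY:-db_concat_01.faa}"
TARGET_PAD="${TARGET_PAD:-foldseek_db_pad}"
OUT="${OUT:-test_pad.m8}"
TMP_DIR="${TMP_DIR:-tmp_pad}"
PROSTT5="${PROSTT5:-$HOME/prostt5}"

padded_search() {
    # First argument: "run" executes the command, anything else only prints it.
    local mode="${1:-print}"
    set -- foldseek easy-search "$QUERY" "$TARGET_PAD" "$OUT" "$TMP_DIR" \
        --prostt5-model "$PROSTT5" --gpu 1
    echo "$*"
    if [ "$mode" = "run" ]; then
        "$@"
    fi
}

# Print the command by default; call `padded_search run` to execute it.
padded_search print
```

Printing the command first makes it easy to confirm that the target is foldseek_db_pad before spending GPU time on the run.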
