Hi,
I'm building a custom DB from a large set of genome files. I'm specifying the TaxonId of each sequence using NCBI-style accession2taxid tab-separated files. When I build the DB, it appears that some sequences are not being given a rank. Specifically, the output of metacache build indicates that 262383 targets remain unranked. Does this mean that MetaCache placed sequences in the DB that do not have a taxonomic assignment (i.e. an associated TaxonId)? Is there any way to determine which sequences remain unranked?
Thanks,
Donovan
>metacache build gtdb_r207_db ./gtdb_reps/genomes/ -taxonomy ./gtdb_taxonomy/R207 -reset-taxa -taxpostmap accn_to_taxid.tsv
Building new database 'gtdb_r207_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 401816 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version 2.2.3 (20220708)
database version 20200820
------------------------------------------------
sequence type mc::char_sequence
target id type unsigned int 32 bits
target limit 4294967295
------------------------------------------------
window id type unsigned int 32 bits
window limit 4294967295
window length 127
window stride 112
------------------------------------------------
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16
------------------------------------------------
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
------------------------------------------------
Processing reference sequences.
Added 8302685 reference sequences in 14348.7 s
targets 8302685
ranked targets 0
taxa in tree 401816
------------------------------------------------
buckets 964032481
bucket size max: 254 mean: 46.7784 +/- 61.2667 <> 2.0026
features 518555067
dead features 0
locations 24257165919
------------------------------------------------
8302685 targets are unranked.
Try to map sequences to taxa using 'accn_to_taxid.tsv' (381 MB)
262383 targets remain unranked.
Writing database to file ... Writing database metadata to file 'gtdb_r207_db.meta' ... done.
Writing database part to file 'gtdb_r207_db.cache0' ... done.
done.
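One way to narrow down which sequences might be unranked, independent of MetaCache itself, is to compare the FASTA headers in the genome directory against the mapping file. The following is a minimal Python sketch under several assumptions that are not confirmed by the log above: the mapping file follows the NCBI accession2taxid layout (accession, accession.version, taxid, gi, with a header line), the accession is the first whitespace-delimited token of each FASTA header, and the genome files use common FASTA extensions. The function names are hypothetical, and MetaCache may extract accessions from headers differently, so the result is only an approximation of its behavior.

```python
import csv
import gzip
from pathlib import Path

# Assumed FASTA file endings; adjust to match the actual genome files.
FASTA_ENDINGS = (".fa", ".fna", ".fasta", ".fa.gz", ".fna.gz", ".fasta.gz")


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def load_mapped_accessions(mapping_path):
    """Collect accessions from an accession2taxid-style TSV.

    Assumes columns: accession, accession.version, taxid, gi.
    Both accession columns are stored so either form can match.
    """
    mapped = set()
    with open_text(mapping_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip short rows and the header line.
            if len(row) < 3 or row[0] == "accession":
                continue
            mapped.add(row[0])
            mapped.add(row[1])
    return mapped


def fasta_accessions(fasta_path):
    """Yield the first whitespace-delimited token of each FASTA header."""
    with open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def find_unmapped(genome_dir, mapping_path):
    """Return (file, accession) pairs whose accession is not in the mapping."""
    mapped = load_mapped_accessions(mapping_path)
    unmapped = []
    for fasta in sorted(Path(genome_dir).rglob("*")):
        if not fasta.is_file() or not fasta.name.endswith(FASTA_ENDINGS):
            continue
        for accn in fasta_accessions(fasta):
            versionless = accn.rsplit(".", 1)[0]
            if accn not in mapped and versionless not in mapped:
                unmapped.append((str(fasta), accn))
    return unmapped
```

Calling `find_unmapped("./gtdb_reps/genomes/", "accn_to_taxid.tsv")` with the paths from the command above would list the headers that have no entry in the mapping file. This check does not cover sequences that are mapped to a TaxonId missing from the taxonomy, so the count it reports need not match the 262383 figure exactly.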