Skip to content

Unranked targets when building custom DB #35

@donovan-h-parks

Description

@donovan-h-parks

Hi,

I'm building a custom DB from a large set of genome files. I'm indicating the TaxonId of each sequence using the NCBI-style accession2taxid tab-separated files. When I build the DB, it appears some sequences are not being given a rank. Specifically, the output of metacache build indicates 262383 targets remain unranked. Does this mean that MetaCache identified and placed sequenced in the DB that do not have a taxonomic assignment (i.e. associated TaxonId)? Is there any way to determine which sequences remain unranked?

Thanks,
Donovan

>metacache build gtdb_r207_db ./gtdb_reps/genomes/ -taxonomy ./gtdb_taxonomy/R207 -reset-taxa -taxpostmap accn_to_taxid.tsv

Building new database 'gtdb_r207_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 401816 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.2.3 (20220708)
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned int 32 bits
target limit         4294967295
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Processing reference sequences.
Added 8302685 reference sequences in 14348.7 s
targets              8302685
ranked targets       0
taxa in tree         401816
------------------------------------------------
buckets              964032481
bucket size          max: 254 mean: 46.7784 +/- 61.2667 <> 2.0026
features             518555067
dead features        0
locations            24257165919
------------------------------------------------
8302685 targets are unranked.
Try to map sequences to taxa using 'accn_to_taxid.tsv' (381 MB)
262383 targets remain unranked.
Writing database to file ... Writing database metadata to file 'gtdb_r207_db.meta' ... done.
Writing database part to file 'gtdb_r207_db.cache0' ... done.
done.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions