Guidance for working with large reference data sets

Hi,

I am trying to build a database from RefSeq and GenBank genomes.
The total size of the ~1.9 million compressed genomes is ~8.5T. Since the data set contains many genomes, some of which have extremely long chromosomes, I built MetaCache with:
`make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"`

1. What will the peak memory consumption during the build be when partitioning into partitions of size x MB?
At the moment I am running `${p2mc}/metacache-partition-genomes ${p2g} 1600000`, since I have at most ~1.9T of memory available.
Is this partition size reasonable?
Does it matter for the partition size calculation whether the genomes are compressed or not?

2. Would it be beneficial for build time and memory consumption to create more, smaller partitions instead of fewer, larger ones? There was a similar question in https://github.com/muellan/metacache/issues/33#issuecomment-1174742729, with the advice to build fewer partitions to keep the merging time in check.
Should I then try to find the maximum partition size that will fit into memory during building?
Since I am partitioning anyway, do I actually need to compile with `uint64_t`, or can I check the largest partition by sequence count and see whether I can get away with `uint32_t`?

3. Would you expect performance differences between querying a single db, a few partitions, and many partitions, using the merge functionality for the latter two?

4. I chose `-kmerlen 20` based on Figure SF1 from the publication. Would you advise against this in favor of the default value of 16, for example to keep computational resource demand and query speed at a reasonable level?
Should other sketching parameters be adjusted as well for a value of 20, or are the defaults fine?

5. Since the reference data set is large, should some of the advanced options be set or adjusted, e.g. `-remove-overpopulated-features`? If so, what values should be chosen, based on the reference data?
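
To make the `uint32_t` part of question 2 concrete, this is roughly the check I have in mind; a minimal sketch, not MetaCache's own logic. The function names, the example counts, and the window stride of 109 are assumptions chosen for illustration, and the sketch ignores any ID values MetaCache may reserve internally.

```python
# Largest value each unsigned type can represent (standard C/C++ limits).
UINT_MAX = {
    "uint16_t": 2**16 - 1,
    "uint32_t": 2**32 - 1,
    "uint64_t": 2**64 - 1,
}

def smallest_uint_type(count):
    """Return the narrowest unsigned type whose maximum is >= count."""
    for name, limit in UINT_MAX.items():
        if count <= limit:
            return name
    raise ValueError("count does not fit into 64 bits")

def windows_per_sequence(seq_len, win_stride):
    """Rough upper bound on the number of windows: ceil(seq_len / win_stride)."""
    return -(-seq_len // win_stride)

# Hypothetical numbers for the largest partition (NOT measured values).
sequences_in_largest_partition = 250_000
longest_sequence_bp = 1_000_000_000
assumed_win_stride = 109  # assumption, not taken from the MetaCache docs

print("target id type:", smallest_uint_type(sequences_in_largest_partition))
print("window id type:", smallest_uint_type(
    windows_per_sequence(longest_sequence_bp, assumed_win_stride)))
```

If both results came out as `uint32_t` for every partition, I would expect the narrower types to be sufficient, but please correct me if the limits work differently.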

I would be very grateful for any guidance you have.

Best

Oskar

Guidance for working with large reference data sets #37

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Guidance for working with large reference data sets #37

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions