Hi,
I am trying to build a database from RefSeq and GenBank genomes.
The total size of the ~1.9 million compressed genomes is ~8.5 TB. Since the data set contains many genomes, some of which have extremely long chromosomes, I built MetaCache with:
make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"
- What will the peak memory consumption during build be when partitioning into partitions of x MB?
  At the moment, I am running `${p2mc}/metacache-partition-genomes ${p2g} 1600000`, since I have at most ~1.9 TB of memory available.
  Is this partition size reasonable?
  Does it matter for the partition size calculation whether the genomes are compressed or not?
- Would it be beneficial for build time and memory consumption to create more, smaller partitions instead of fewer, larger ones? There was a similar question in "Merging results from querying partitioned database" #33 (comment), with the advice to build fewer partitions to keep the merging time in check.
  Should I then try to find the maximum partition size that will fit into memory during building?
  Since I am partitioning anyway, do I actually need to compile with `uint64_t`, or could I check the largest partition, sequence count-wise, and see if I can get away with `uint32_t`?
- Would you expect performance differences between querying a single database, a few partitions, and many partitions, using the merge functionality for the latter two?
- I chose `-kmerlen 20` based on Figure SF1 from the publication. Would you advise against this in favor of the default value of 16, for example to keep the computational resource demand and query speed at a reasonable level?
  Should other sketching parameters be adjusted as well for a value of 20, or are the defaults fine?
- Since the reference data set is large, should some of the advanced options be set or adjusted, e.g. `-remove-overpopulated-features`? If so, what values should be chosen, based on the reference data?
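To make the partition and `uint32_t` questions more concrete, here is a rough Python sketch of how I am currently reasoning about them. It rests on several assumptions that I would like to have confirmed: that the size argument of `metacache-partition-genomes` is given in MB, that the on-disk (compressed) file sizes are what counts, and that each FASTA record consumes one target ID. The helper names are my own and are not part of MetaCache.

```python
import gzip
import math
from pathlib import Path


def min_partitions(total_mb: float, partition_mb: float) -> int:
    """Lower bound on the number of partitions.

    Assumption: both sizes are in MB. Real packing of whole genome
    files into partitions may need more than this lower bound.
    """
    return math.ceil(total_mb / partition_mb)


def count_fasta_records(path) -> int:
    """Count FASTA header lines ('>') in a plain or gzip-compressed file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return sum(1 for line in handle if line.startswith(b">"))


def fits_unsigned(count: int, bits: int = 32) -> bool:
    """True if `count` IDs fit into an unsigned integer of `bits` bits.

    Assumption: one value is kept spare as a conservative margin,
    in case the tool reserves an ID internally.
    """
    return count < 2**bits


if __name__ == "__main__":
    # ~8.5 TB of compressed genomes, partitions of 1,600,000 MB.
    print(min_partitions(8_500_000, 1_600_000))
```

With my numbers this gives a lower bound of 6 partitions. Summing `count_fasta_records` over the files of the largest partition and passing the result to `fits_unsigned` would tell me whether the sequence count alone rules out `uint32_t`; it says nothing about the window ID or k-mer types.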
I would be very grateful for any guidance you have.
Best
Oskar