Guidance for working with large reference data sets #37

@ohickl

Hi,

I am trying to build a database from RefSeq and GenBank genomes.
The total size of the ~1.9 million compressed genomes is ~8.5T. Since the data set contains many genomes, some of which have extremely long chromosomes, I built MetaCache with:
make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"

  1. What will the peak memory consumption during build be when partitioning into partitions of size x M?
    At the moment, I am running ${p2mc}/metacache-partition-genomes ${p2g} 1600000, since I have at most ~1.9T of memory available.
    Is this partition size reasonable?
    Does it matter for the partition size calculation whether the genomes are compressed or not?

  2. Would it be beneficial for build time and memory consumption to create more, smaller partitions instead of fewer, larger ones? There was a similar question in Merging results from querying partitioned database #33 (comment), where the advice was to build fewer partitions to keep the merging time in check.
    Should I then try to find the maximum partition size that still fits into memory during building?
    Since I am partitioning anyway, do I actually need to compile with uint64_t, or could I check the partition with the most sequences and see whether I can get away with uint32_t?

  3. Would you expect performance differences between querying a single db, few partitions, and many partitions, using the merge functionality with the latter two?

  4. I chose -kmerlen 20 based on Figure SF1 from the publication. Would you advise against this in favor of the default value of 16, for example to keep computational resource demand and query speed at a reasonable level?
    Should other sketching parameters be adjusted as well for a value of 20, or are the defaults fine?

  5. Since the reference data set is large, should some of the advanced options be set or adjusted, e.g. -remove-overpopulated-features? If so, what values should be chosen based on the reference data?
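
The sequence-count check mentioned in point 2 could be sketched roughly as follows. This is only an illustration under several assumptions that are not confirmed anywhere above: that each partition ends up in its own sub-directory of a common root, that the genomes are plain or gzip-compressed FASTA files with the listed suffixes, and that the number of sequences in a partition is what MC_TARGET_ID_TYPE has to hold. The function names and the directory layout are hypothetical and not part of MetaCache.

```python
import gzip
from pathlib import Path

UINT32_MAX = 2**32 - 1  # largest value a uint32_t can hold

# Assumed file suffixes; adjust to whatever the download actually uses.
FASTA_SUFFIXES = (".fa", ".fna", ".fasta", ".fa.gz", ".fna.gz", ".fasta.gz")


def count_sequences(fasta_path):
    """Count FASTA records (lines starting with '>') in a plain or gzipped file."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    with opener(fasta_path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def sequences_per_partition(root):
    """Map each sub-directory of `root` (assumed: one per partition)
    to the total number of sequences in its FASTA files."""
    counts = {}
    for part_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[part_dir.name] = sum(
            count_sequences(f)
            for f in part_dir.rglob("*")
            if f.is_file() and f.name.endswith(FASTA_SUFFIXES)
        )
    return counts


def fits_uint32(counts):
    """True if the largest per-partition sequence count fits into 32 bits."""
    return max(counts.values(), default=0) <= UINT32_MAX
```

Calling sequences_per_partition on the partition root and passing the result to fits_uint32 would answer the sequence-count part of the question only. It says nothing about MC_WINDOW_ID_TYPE, which presumably depends on the length of the longest sequence rather than on the number of sequences, and decompressing ~8.5T of data just to count headers will take a while.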

I would be very grateful for any guidance you have.

Best

Oskar
