
parallelize reading from a PGEN file #253

Open
@aryarm

Description


We should be able to parallelize reading by dividing the file into chunks and reading each chunk in a separate thread or process.

Multi-threading will be harder to implement than multi-processing because of the Python GIL, and might therefore require a compiled extension to Python. Multi-processing, on the other hand, will be slower and will probably require copying each chunk into the larger array.
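Here's a minimal sketch of the multiprocessing idea, assuming pgenlib's `PgenReader`/`read_range` API and the test PGEN path from the py-spy command further down; it's not the actual implementation. Note the copy of each returned chunk into the full array, which is the overhead mentioned above:

```python
import numpy as np
from multiprocessing import Pool
import pgenlib

PGEN = b"temp/data/variant/77.pgen"  # hypothetical test file from this issue

def _read_chunk(bounds):
    start, end = bounds
    pgen = pgenlib.PgenReader(PGEN)  # each process needs its own handle
    chunk = np.empty((end - start, pgen.get_raw_sample_ct()), dtype=np.int8)
    pgen.read_range(start, end, chunk)
    pgen.close()
    return start, chunk  # the chunk gets pickled back to the parent

def read_multiprocess(num_cpus=4, chunk_size=512):
    pgen = pgenlib.PgenReader(PGEN)
    n_vars, n_samps = pgen.get_variant_ct(), pgen.get_raw_sample_ct()
    pgen.close()
    out = np.empty((n_vars, n_samps), dtype=np.int8)
    bounds = [(s, min(s + chunk_size, n_vars)) for s in range(0, n_vars, chunk_size)]
    with Pool(num_cpus) as pool:
        for start, chunk in pool.imap_unordered(_read_chunk, bounds):
            out[start : start + len(chunk)] = chunk  # the extra copy
    return out
```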

@d-laub and I have had some very productive chats about this and are working on a strategy 💪


progress: #273

still to do:

  • figure out why tests are failing
  • check that the max memory usage of the new code doesn't exceed that of the old code across a range of chunk sizes and numbers of CPUs. (The memory we use will be roughly the sum of the memory used by each chunk/process, so we could theoretically have too many chunks in flight at once.) We might need to adjust chunk_size down in that case.
    • to execute py-spy, for example:
      py-spy record --subprocesses -o py-spy-flamegraph.svg -- python -c 'from haptools.data import GenotypesPLINK; GenotypesPLINK("temp/data/variant/77.pgen", num_cpus=4, chunk_size=1).read()'
      
    • we should be using memray for the memory profiling instead, though (see the memray sketch after this list)
    • consider whether to set chunk_size automatically based on the available memory (sketched after this list)
  • use shared read-only memory instead of passing globals? (see the shared-memory sketch after this list)
  • fix failing nogil tests in plink-ng
  • try multithreading instead of multiprocessing and benchmark with the nogil build from aryarm/plink-ng#1 (feat: disable GIL for pgenlib methods that read from PGENs); a threaded sketch follows this list
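For the memray item above, a possible invocation (assuming memray's `run`/`flamegraph` subcommands and its `--follow-fork` flag for tracking child processes; `read_pgen.py` is a hypothetical helper script containing the same two lines as the py-spy command):

```
memray run --native --follow-fork -o memray-out.bin read_pgen.py
memray flamegraph memray-out.bin  # writes an HTML report with peak memory stats
```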
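For setting chunk_size automatically, a rough sketch (not haptools code) that budgets num_cpus in-flight chunks against available memory; psutil and the per-variant byte estimate are assumptions for illustration:

```python
import numpy as np
import psutil

def pick_chunk_size(num_samples: int, num_cpus: int, dtype=np.int8, safety=0.5) -> int:
    """Pick a chunk_size so that num_cpus chunks in flight fit in memory."""
    bytes_per_variant = num_samples * np.dtype(dtype).itemsize
    budget = psutil.virtual_memory().available * safety  # leave headroom
    return max(1, int(budget // (num_cpus * bytes_per_variant)))
```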
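For the shared-memory item, a minimal sketch in which workers attach to one shared output buffer and write their rows directly, so nothing is pickled back or copied per chunk. The pgenlib calls and the int8 genotype layout are assumptions; this is not the actual implementation:

```python
import numpy as np
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory
import pgenlib

PGEN = b"temp/data/variant/77.pgen"  # hypothetical test file from this issue

def _read_chunk(args):
    shm_name, shape, start, end = args
    shm = SharedMemory(name=shm_name)
    out = np.ndarray(shape, dtype=np.int8, buffer=shm.buf)
    pgen = pgenlib.PgenReader(PGEN)  # each process needs its own handle
    pgen.read_range(start, end, out[start:end])
    pgen.close()
    del out  # release the buffer view before closing the mapping
    shm.close()

def read_shared(num_cpus=4, chunk_size=512):
    pgen = pgenlib.PgenReader(PGEN)
    n_vars, n_samps = pgen.get_variant_ct(), pgen.get_raw_sample_ct()
    pgen.close()
    shape = (n_vars, n_samps)
    shm = SharedMemory(create=True, size=n_vars * n_samps)  # int8: 1 byte/entry
    try:
        tasks = [
            (shm.name, shape, s, min(s + chunk_size, n_vars))
            for s in range(0, n_vars, chunk_size)
        ]
        with Pool(num_cpus) as pool:
            pool.map(_read_chunk, tasks)
        # copy so the result outlives the shared block
        result = np.ndarray(shape, dtype=np.int8, buffer=shm.buf).copy()
    finally:
        shm.close()
        shm.unlink()
    return result
```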
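And a threaded variant for the last item: the threads share the output array directly, but this only scales if pgenlib releases the GIL during reads, which is what aryarm/plink-ng#1 experiments with. One reader per thread, since PgenReader isn't assumed to be thread-safe:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import pgenlib

PGEN = b"temp/data/variant/77.pgen"  # hypothetical test file from this issue

def read_threaded(num_threads=4, chunk_size=512):
    pgen = pgenlib.PgenReader(PGEN)
    n_vars, n_samps = pgen.get_variant_ct(), pgen.get_raw_sample_ct()
    pgen.close()
    out = np.empty((n_vars, n_samps), dtype=np.int8)

    def work(start):
        end = min(start + chunk_size, n_vars)
        reader = pgenlib.PgenReader(PGEN)  # one handle per thread
        reader.read_range(start, end, out[start:end])  # disjoint rows: no locking needed
        reader.close()

    with ThreadPoolExecutor(num_threads) as pool:
        list(pool.map(work, range(0, n_vars, chunk_size)))
    return out
```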
