
parallelize reading from a PGEN file #253

Open
@aryarm

Description


We should be able to parallelize reading by dividing the file into chunks and reading each chunk in a separate thread or process.

Multi-threading will be harder to implement than multi-processing because of the Python GIL, and might therefore require a compiled extension to Python. Multi-processing, on the other hand, will be slower and will probably require copying each chunk into the larger array.
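Here's a minimal sketch of the multiprocessing idea, assuming pgenlib's `PgenReader`/`read_range` API and the test PGEN path from the py-spy command further down; it's not the actual implementation. Note the copy of each returned chunk into the full array, which is the overhead mentioned above:

```python
import numpy as np
from multiprocessing import Pool
import pgenlib

PGEN = b"temp/data/variant/77.pgen"  # hypothetical test file from this issue

def _read_chunk(bounds):
    start, end = bounds
    pgen = pgenlib.PgenReader(PGEN)  # each process needs its own handle
    chunk = np.empty((end - start, pgen.get_raw_sample_ct()), dtype=np.int8)
    pgen.read_range(start, end, chunk)
    pgen.close()
    return start, chunk  # the chunk gets pickled back to the parent

def read_multiprocess(num_cpus=4, chunk_size=512):
    pgen = pgenlib.PgenReader(PGEN)
    n_vars, n_samps = pgen.get_variant_ct(), pgen.get_raw_sample_ct()
    pgen.close()
    out = np.empty((n_vars, n_samps), dtype=np.int8)
    bounds = [(s, min(s + chunk_size, n_vars)) for s in range(0, n_vars, chunk_size)]
    with Pool(num_cpus) as pool:
        for start, chunk in pool.imap_unordered(_read_chunk, bounds):
            out[start : start + len(chunk)] = chunk  # the extra copy
    return out
```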

@d-laub and I have had some very productive chats about this and are working on a strategy 💪


progress: #273

still to do:

  • figure out why tests are failing
  • check that the max memory usage of the new code doesn't exceed that of the old code across a range of chunk sizes and numbers of CPUs. (The memory we use will be roughly the sum of the memory used by each chunk/process, so we could theoretically have too many chunks in flight at once.) We might need to adjust chunk_size down in that case.
    • to execute py-spy, for example:
      py-spy record --subprocesses -o py-spy-flamegraph.svg -- python -c 'from haptools.data import GenotypesPLINK; GenotypesPLINK("temp/data/variant/77.pgen", num_cpus=4, chunk_size=1).read()'
      
    • we should be using memray for the memory profiling instead, though (see the memray sketch after this list)
    • consider whether to set chunk_size automatically based on the available memory (sketched after this list)
  • use shared read-only memory instead of passing globals? (see the shared-memory sketch after this list)
  • fix failing nogil tests in plink-ng
  • try multithreading instead of multiprocessing and benchmark with the nogil build from aryarm/plink-ng#1 (feat: disable GIL for pgenlib methods that read from PGENs); a threaded sketch follows this list
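For the memray item above, a possible invocation (assuming memray's `run`/`flamegraph` subcommands and its `--follow-fork` flag for tracking child processes; `read_pgen.py` is a hypothetical helper script containing the same two lines as the py-spy command):

```
memray run --native --follow-fork -o memray-out.bin read_pgen.py
memray flamegraph memray-out.bin  # writes an HTML report with peak memory stats
```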
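For setting chunk_size automatically, a rough sketch (not haptools code) that budgets num_cpus in-flight chunks against available memory; psutil and the per-variant byte estimate are assumptions for illustration:

```python
import numpy as np
import psutil

def pick_chunk_size(num_samples: int, num_cpus: int, dtype=np.int8, safety=0.5) -> int:
    """Pick a chunk_size so that num_cpus chunks in flight fit in memory."""
    bytes_per_variant = num_samples * np.dtype(dtype).itemsize
    budget = psutil.virtual_memory().available * safety  # leave headroom
    return max(1, int(budget // (num_cpus * bytes_per_variant)))
```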
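For the shared-memory item, a minimal sketch in which workers attach to one shared output buffer and write their rows directly, so nothing is pickled back or copied per chunk. The pgenlib calls and the int8 genotype layout are assumptions; this is not the actual implementation:

```python
import numpy as np
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory
import pgenlib

PGEN = b"temp/data/variant/77.pgen"  # hypothetical test file from this issue

def _read_chunk(args):
    shm_name, shape, start, end = args
    shm = SharedMemory(name=shm_name)
    out = np.ndarray(shape, dtype=np.int8, buffer=shm.buf)
    pgen = pgenlib.PgenReader(PGEN)  # each process needs its own handle
    pgen.read_range(start, end, out[start:end])
    pgen.close()
    del out  # release the buffer view before closing the mapping
    shm.close()

def read_shared(num_cpus=4, chunk_size=512):
    pgen = pgenlib.PgenReader(PGEN)
    n_vars, n_samps = pgen.get_variant_ct(), pgen.get_raw_sample_ct()
    pgen.close()
    shape = (n_vars, n_samps)
    shm = SharedMemory(create=True, size=n_vars * n_samps)  # int8: 1 byte/entry
    try:
        tasks = [
            (shm.name, shape, s, min(s + chunk_size, n_vars))
            for s in range(0, n_vars, chunk_size)
        ]
        with Pool(num_cpus) as pool:
            pool.map(_read_chunk, tasks)
        # copy so the result outlives the shared block
        result = np.ndarray(shape, dtype=np.int8, buffer=shm.buf).copy()
    finally:
        shm.close()
        shm.unlink()
    return result
```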
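And a threaded variant for the last item: the threads share the output array directly, but this only scales if pgenlib releases the GIL during reads, which is what aryarm/plink-ng#1 experiments with. One reader per thread, since PgenReader isn't assumed to be thread-safe:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import pgenlib

PGEN = b"temp/data/variant/77.pgen"  # hypothetical test file from this issue

def read_threaded(num_threads=4, chunk_size=512):
    pgen = pgenlib.PgenReader(PGEN)
    n_vars, n_samps = pgen.get_variant_ct(), pgen.get_raw_sample_ct()
    pgen.close()
    out = np.empty((n_vars, n_samps), dtype=np.int8)

    def work(start):
        end = min(start + chunk_size, n_vars)
        reader = pgenlib.PgenReader(PGEN)  # one handle per thread
        reader.read_range(start, end, out[start:end])  # disjoint rows: no locking needed
        reader.close()

    with ThreadPoolExecutor(num_threads) as pool:
        list(pool.map(work, range(0, n_vars, chunk_size)))
    return out
```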
