We should be able to parallelize by dividing the file into chunks and reading each chunk in a separate thread or process.
Multi-threading will be harder to implement than multi-processing because of the Python GIL and thus might require a compiled extension to Python, but multi-processing will be slower and will probably require copying each chunk into the larger array.
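As a rough sketch of the chunked multi-processing idea (not haptools' actual implementation): `read_variant_chunk` below is a hypothetical stand-in for the real PGEN-reading code, and the parent process copies each returned chunk into a preallocated array, which is exactly the copying overhead mentioned above.

```python
# Minimal sketch of chunked reading with multiple processes; read_variant_chunk
# is a hypothetical placeholder for the real PGEN-reading code (e.g. via pgenlib).
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import numpy as np


def read_variant_chunk(path, start, end, num_samples):
    # placeholder: return zeros so the sketch runs end-to-end
    return np.zeros((end - start, num_samples), dtype=np.uint8)


def read_parallel(path, num_variants, num_samples, chunk_size, num_cpus):
    # preallocate the full matrix; each worker process reads one chunk and the
    # parent copies it into place
    out = np.empty((num_variants, num_samples), dtype=np.uint8)
    starts = list(range(0, num_variants, chunk_size))
    ends = [min(s + chunk_size, num_variants) for s in starts]
    worker = partial(read_variant_chunk, path, num_samples=num_samples)
    with ProcessPoolExecutor(max_workers=num_cpus) as pool:
        for start, chunk in zip(starts, pool.map(worker, starts, ends)):
            out[start:start + chunk.shape[0]] = chunk
    return out


if __name__ == "__main__":
    print(read_parallel("temp/data/variant/77.pgen", 100, 10, chunk_size=25, num_cpus=4).shape)
```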
@d-laub and I have had some very productive chats about this and are working on a strategy 💪
progress: #273
still to do:
- figure out why tests are failing
- check that max memory usage of the new code doesn't exceed that of the old code across a range of chunk sizes and num CPUs (because the memory we use will be equivalent to the sum of the memory used by each chunk/process, and we could theoretically have too many chunks running at once). We might need to adjust `chunk_size` down, in that case
  - to execute py-spy, for example: `py-spy record --subprocesses -o py-spy-flamegraph.svg -- python -c 'from haptools.data import GenotypesPLINK; GenotypesPLINK("temp/data/variant/77.pgen", num_cpus=4, chunk_size=1).read()'`
  - we should be using memray, though, since it profiles memory rather than CPU time (see the memray commands after this list)
- consider whether to set `chunk_size` automatically based on the available memory (see the sketch after this list)
- use shared read-only memory instead of passing globals? (see the shared-memory sketch after this list)
- fix failing nogil tests in plink-ng
- try multithreading instead of multiprocessing and benchmark with nogil from aryarm/plink-ng#1 ("feat: disable GIL for pgenlib methods that read from PGENs"); see the threading sketch after this list
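For the memray item above, the commands might look something like this (the driver-script filename is made up; `--follow-fork` is there to also track memory in the forked worker processes):

```sh
# write a tiny driver script, then profile it with memray
echo 'from haptools.data import GenotypesPLINK; GenotypesPLINK("temp/data/variant/77.pgen", num_cpus=4, chunk_size=1).read()' > profile_read.py
memray run --follow-fork -o memray-output.bin profile_read.py
memray flamegraph memray-output.bin
```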
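For the `chunk_size` item, one possible heuristic sketched with psutil; the per-genotype byte count and the safety factor are assumptions, not measurements:

```python
# Sketch: pick a default chunk_size from the memory currently available,
# assuming each of the num_cpus workers holds one (chunk_size, num_samples)
# uint8 chunk in memory at the same time.
import psutil


def auto_chunk_size(num_samples, num_cpus, bytes_per_genotype=1, safety_factor=0.5):
    budget = psutil.virtual_memory().available * safety_factor
    per_variant = num_samples * bytes_per_genotype  # bytes per variant in one chunk
    return max(1, int(budget // (per_variant * num_cpus)))


print(auto_chunk_size(num_samples=5000, num_cpus=4))
```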
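For the shared-memory item, one interpretation is to put read-only inputs (or the output matrix) in a `multiprocessing.shared_memory` block that every worker attaches to by name, instead of relying on globals being inherited or pickled. A small sketch with made-up shapes and data:

```python
# Sketch: share a read-only NumPy array (e.g. the variant indices each worker
# needs) via multiprocessing.shared_memory instead of module-level globals.
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

import numpy as np

N = 1000  # placeholder size of the shared array


def worker(shm_name, start, end):
    # attach to the existing shared block and view it as a read-only array
    shm = shared_memory.SharedMemory(name=shm_name)
    variant_idxs = np.ndarray((N,), dtype=np.uint32, buffer=shm.buf)
    variant_idxs.flags.writeable = False
    total = int(variant_idxs[start:end].sum())  # stand-in for real per-chunk work
    shm.close()
    return total


def main():
    data = np.arange(N, dtype=np.uint32)  # placeholder read-only data
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)[:] = data
    try:
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = pool.map(worker, [shm.name] * 4,
                               range(0, N, 250), range(250, N + 1, 250))
            print(sum(results))
    finally:
        shm.close()
        shm.unlink()


if __name__ == "__main__":
    main()
```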
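For the multithreading item: if the pgenlib read calls release the GIL (which is what aryarm/plink-ng#1 is about), a thread pool could do the same chunked read without pickling or copying between processes, since threads can write straight into the shared output array. Again a sketch with a hypothetical chunk reader:

```python
# Sketch: same chunked read as the multi-processing version, but with threads.
# This only pays off if the underlying PGEN-reading call releases the GIL.
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def read_chunk_into(out, path, start, end):
    # hypothetical stand-in for a GIL-releasing read of variants [start, end)
    out[start:end] = 0


def read_parallel_threads(path, num_variants, num_samples, chunk_size, num_cpus):
    out = np.empty((num_variants, num_samples), dtype=np.uint8)
    starts = list(range(0, num_variants, chunk_size))
    ends = [min(s + chunk_size, num_variants) for s in starts]
    with ThreadPoolExecutor(max_workers=num_cpus) as pool:
        # each thread fills its slice of the output array in place
        list(pool.map(read_chunk_into, [out] * len(starts),
                      [path] * len(starts), starts, ends))
    return out


print(read_parallel_threads("temp/data/variant/77.pgen", 100, 10, 25, 4).shape)
```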