
speeding up the transform command #92

@aryarm

Description


Transforming a set of haplotypes can take a while, especially if you have many haplotypes. A quick benchmark via the bench_transform.py script seems to indicate that it scales linearly with the number of samples, the max number of alleles in each haplotype, and the number of haplotypes.
(plot from bench_transform.py: transform runtime grows roughly linearly with the number of samples, the max number of alleles per haplotype, and the number of haplotypes)

@s041629 wants to transform roughly 21,000 haplotypes for 500,000 samples, where the max number of alleles in a haplotype is 115. Currently (as of f45742f), the Haplotypes.transform() function takes about 10-25 minutes of wall time to accomplish this task. (The exact time varies with how willing the kernel is to prioritize our job on the node.)

For now, I think this will be fast enough, but we might want to brainstorm some ideas for the future. In any case, I can't really justify spending much more time on this right now (I've already spent quite a bit), so I'm going to take a break and solicit suggestions until I feel energized enough to tackle it again.

Details

The vast majority of the time seems to be spent in this for-loop.

for i in range(len(self.data)):
hap_gts.data[:, i] = np.all(equality_arr[:, idxs[i]], axis=1)

Unfortunately, I can't use broadcasting to speed up this loop because the size of idxs[i] might vary on each iteration. And I can't pad the array to make broadcasting work because I quickly run out of memory for arrays of that shape. Ideally, I would like to find a solution that doesn't require multi-threading or the installation of new dependencies besides those we have already.
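To make the constraint concrete, here is a toy, self-contained version of the ragged gather (the names mirror the snippet above, but the sizes are made up), along with the back-of-the-envelope memory cost of padding at @s041629's scale:

```python
import numpy as np

# toy version of the bottleneck: each haplotype indexes a different
# number of allele columns, so the gather is ragged and can't be
# expressed as one big broadcasted slice
arr = np.ones((3, 5, 2), dtype=np.bool_)        # (samples, alleles, strands)
idxs = [np.array([0, 2]), np.array([1, 3, 4])]  # lengths differ per haplotype

out = np.empty((3, len(idxs), 2), dtype=np.bool_)
for i in range(len(idxs)):
    out[:, i] = np.all(arr[:, idxs[i]], axis=1)

# padding idxs to a common length would require materializing a
# (samples, haplotypes, max_alleles, strands) boolean array; at
# 500,000 x 21,000 x 115 x 2 that's roughly 2.4 TB
```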

Steps to reproduce

A basic test

This test uses just 5 samples, 4 variants, and 3 haplotypes. It's a sanity check to make sure the Haplotypes.transform() function works and that none of the code will fail.

pytest -sx tests/test_data.py::TestHaplotypes::test_haps_transform

A benchmarking test

To benchmark the function and see how it scales with the number of samples, the max number of alleles in each haplotype, and the number of haplotypes, you can use the bench_transform.py script as so. It will create a plot like the one I've included above.

tests/bench_transform.py --name 'hap-loop' --reps 5 --archive archive.pickle \
--default-variants 115 --default-samples 5000 --default-haplotypes 20 \
--intervals-variants 1 80 4 --intervals-samples 1 3200 160 --intervals-haplotypes 10 91 4 -o plot.png

A pure numpy-based test of the for-loop

The tests above require that you install the dev setup for haptools. If you'd like to reproduce just the problematic for-loop for @s041629's situation in pure numpy you can do something like this:

import numpy as np

num_samps, num_alleles, num_haps, max_alleles = 500000, 6200, 21000, 115

# each haplotype indexes between 1 and max_alleles allele columns
idxs = [np.random.randint(0, num_alleles, np.random.randint(1, max_alleles + 1)) for j in range(num_haps)]
# stand-in for equality_arr, with shape (samples, alleles, strands)
arr = np.random.choice([True, False], size=num_samps*num_alleles*2).reshape((num_samps, num_alleles, 2))
output = np.empty((num_samps, num_haps, 2), dtype=np.bool_)

for i in range(num_haps):
    output[:, i] = np.all(arr[:, idxs[i]], axis=1)
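At full size, the arrays above need tens of GB of RAM, so for a quick local check it may help to time a scaled-down version of the same loop (the roughly 1/1000th sizes here are arbitrary):

```python
import time
import numpy as np

# the same per-haplotype loop at roughly 1/1000th scale
num_samps, num_alleles, num_haps, max_alleles = 500, 620, 210, 115
rng = np.random.default_rng(0)
idxs = [rng.integers(0, num_alleles, rng.integers(1, max_alleles + 1)) for _ in range(num_haps)]
arr = rng.choice([True, False], size=(num_samps, num_alleles, 2))
output = np.empty((num_samps, num_haps, 2), dtype=np.bool_)

start = time.perf_counter()
for i in range(num_haps):
    output[:, i] = np.all(arr[:, idxs[i]], axis=1)
elapsed = time.perf_counter() - start
print(f"transform loop: {elapsed:.3f}s")
```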

A full, long-running test

If requested, I can provide files that reproduce the entire setup so that you can execute the transform command with them. The files are a bit too large to attach here.

Some things I've already tried

  1. Adding a dimension to arr (aka equality_arr) of size equal to the number of haplotypes and then performing np.all() on the other dimension. This takes longer and requires far too much memory. See fac79b0.
  2. Using numba to JIT-compile the for-loop. This took longer for some reason, and I'm still not sure why. Also, I'm averse to installing new dependencies. See 5afcf81.
  3. Looping over the allele dimension instead of the haplotype dimension (since num_alleles < num_haps, at least in this situation). This was a clever strategy from Ryan Eveloff. Strangely, this also took longer than looping over the haplotypes. I'm not sure why. See 335e5f7.
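For reference, here is a minimal standalone sketch of what strategy 3 amounts to (this is my reconstruction, not the exact code from 335e5f7): AND each allele's column into every haplotype that contains that allele, which is correct because AND is idempotent.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samps, num_alleles, num_haps, max_alleles = 40, 10, 8, 5
idxs = [rng.integers(0, num_alleles, rng.integers(1, max_alleles + 1)) for _ in range(num_haps)]
arr = rng.choice([True, False], size=(num_samps, num_alleles, 2))

# baseline: the per-haplotype loop from above
expected = np.empty((num_samps, num_haps, 2), dtype=np.bool_)
for i in range(num_haps):
    expected[:, i] = np.all(arr[:, idxs[i]], axis=1)

# alternative: loop over alleles instead, AND-ing each allele's column
# into every haplotype that contains it (duplicate alleles within a
# haplotype are harmless since AND is idempotent)
output = np.ones((num_samps, num_haps, 2), dtype=np.bool_)
for a in range(num_alleles):
    haps = [i for i in range(num_haps) if a in idxs[i]]
    output[:, haps] &= arr[:, [a]]

assert np.array_equal(output, expected)
```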

Potential ideas

  1. Try taking advantage of the fact that some of the haplotypes share alleles (i.e. some haplotypes are subsets of other haplotypes). I could construct some sort of tree and then iterate through it with a dynamic programming approach instead. In theory, this should shave a log factor off the time complexity somewhere. A back-of-the-envelope analysis of the .hap file Matteo gave me indicates that such a tree would have at least 1,000 leaves.
  2. Talk to Tara! She might have had a similar issue.
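As a rough illustration of idea 1, one could memoize partial ANDs over sorted allele prefixes, so haplotypes that share a prefix reuse the shared work. This is only a sketch under made-up sizes, not the tree I have in mind: the cache itself costs memory, and the savings depend on how nested the real haplotypes are.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samps, num_alleles, num_haps = 50, 20, 30
# sorted tuples so that haplotypes sharing a prefix hit the same cache keys
idxs = [
    tuple(sorted(rng.choice(num_alleles, size=rng.integers(1, 6), replace=False)))
    for _ in range(num_haps)
]
arr = rng.choice([True, False], size=(num_samps, num_alleles, 2))

cache = {(): np.ones((num_samps, 2), dtype=np.bool_)}

def prefix_all(alleles):
    # AND of arr[:, a] over all a in alleles, reusing shared prefixes
    if alleles not in cache:
        cache[alleles] = prefix_all(alleles[:-1]) & arr[:, alleles[-1]]
    return cache[alleles]

output = np.stack([prefix_all(h) for h in idxs], axis=1)

# sanity check against the direct per-haplotype loop
expected = np.stack([arr[:, list(h)].all(axis=1) for h in idxs], axis=1)
assert np.array_equal(output, expected)
```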
