cattle example, LD calculator cleaned up #51

Open: wants to merge 12 commits into main

Conversation

@ningyuxin1999 (Collaborator)

This is an example of the cattle configuration. Each population size depends on the previous one instead of being drawn independently at random.

@nspope (Contributor) commented Mar 3, 2025

Great, thanks Yuxin! Could you please rebase this onto main and fix the merge conflicts?

If you haven't done this before, then from the command line

git checkout main
git pull upstream main  # this is assuming that this repo is called "upstream" in `git remote -v`
git checkout processor_cleanup
git rebase main

and then you'll have to edit the files that git lists, to resolve the

<<<<<<< HEAD
old code
=======
new code
>>>>>>> commit

inserts added by git. Once that is done,

git add path/to/modified/file
git rebase --continue

to move to the next commit (or finish the rebase).

Let me know if you have questions!

@andrewkern (Member)

hmm this still has merge conflicts @ningyuxin1999

@ningyuxin1999 (Collaborator, Author)

> hmm this still has merge conflicts @ningyuxin1999

I was only updating some comments yesterday, and now the conflicts should be resolved :)

@nspope (Contributor) left a comment

Thanks @ningyuxin1999 -- it looks like there are a few things to fix that I've flagged here.

Also, it'd be good if you could run the training pipeline through on this example just to make sure it doesn't error out (the number of sims can be set to something low, like 100).

@andrewkern can you take a quick look as well?

max_time: 130000
time_rate: 0.06
samples:
  pop: 25
@nspope (Contributor) commented Mar 6, 2025

the population name should be "pop0" to match the msprime demography, I think?

edit: now I see, you're doing samples={"pop0": self.samples["pop"]} in the msprime invocation below. But it'd be easier/nicer to just have the correct name here, so that you can do samples=self.samples
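
For illustration, a minimal sketch of how the renamed config key would line up with the msprime call (the population size and sequence parameters here are placeholders, not the repo's values):

import msprime

demography = msprime.Demography()
demography.add_population(name="pop0", initial_size=10_000)

samples = {"pop0": 25}  # what self.samples would hold once the config key is renamed to "pop0"
ts = msprime.sim_ancestry(samples=samples, demography=demography,
                          sequence_length=1e6, recombination_rate=1e-8, random_seed=1)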

Collaborator

Done

packed_sequence: False

simulator:
  class_name: "VariablePopulationSize"
Contributor

The class name should be "Cattle21Gen", I think

Member

yep, agreed

Collaborator

I changed this, but in the long run I think I would like to use the same class for the cattle example and the VariablePopulationSize example. In my opinion the dependent population sizes make much more sense as a prior and could be used in any case.

geno = ts.genotype_matrix().T
num_sample = geno.shape[0]
if (geno == 2).any():
    num_sample *= 2
Contributor

you can just use num_sample = ts.num_samples here
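
A minimal, self-contained sketch of the suggested simplification (the simulation call is only there to produce a tree sequence to work with):

import msprime

ts = msprime.sim_ancestry(samples=10, sequence_length=1e5,
                          recombination_rate=1e-8, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

num_sample = ts.num_samples   # haploid sample count (here 20), no need to infer it from the genotypes
geno = ts.genotype_matrix().T  # rows are samples, columns are sites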

row_sum != 0,
row_sum != num_sample,
row_sum > num_sample * 0.05,
num_sample - row_sum > num_sample * 0.05
Contributor

just to make this clearer to read (and settable), put "maf": 0.05 under the "FIXED PARAMETER" heading in the default_config dict above, and change these lines to:

row_sum > num_sample * self.maf

etc
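
A sketch of what that could look like (the toy genotype matrix and the exact dict layout are illustrative, not the repo's actual defaults):

import numpy as np

default_config = {
    # FIXED PARAMETERS
    "maf": 0.05,  # minor-allele-frequency cutoff, exposed instead of hard-coding 0.05
}
maf = default_config["maf"]   # would be self.maf inside the processor

geno = np.array([[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]).T  # samples x sites, toy data
num_sample = geno.shape[0]
row_sum = np.sum(geno, axis=0)
keep = np.logical_and.reduce([
    row_sum > num_sample * maf,               # drop sites where the derived allele is too rare
    num_sample - row_sum > num_sample * maf,  # ... or the ancestral allele is too rare
])
geno = geno[:, keep]

Note that the two frequency conditions already exclude monomorphic sites, which is why the row_sum != 0 / row_sum != num_sample checks flagged below become redundant.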

Collaborator

Done

row_sum = np.sum(geno, axis=0)
keep = np.logical_and.reduce([
    row_sum != 0,
    row_sum != num_sample,
Contributor

I think that row_sum != 0 and row_sum != num_sample aren't needed with the MAF filter directly below, so delete these two lines

Collaborator

Yes, I deleted them.

for i in range(1, self.num_time_windows):
    new_value = 10 ** self.pop_sizes[0] - 1
    while new_value > self.pop_sizes[1] or new_value < self.pop_sizes[0]:
        new_value = log_values[i - 1] * 10 ** np.random.uniform(-1, 1)
Contributor

I'm having trouble getting my head around this -- we're multiplying log10 Ne by U[0.1, 10] -- is this what's intended? e.g. this isn't intended to be natural units Ne multiplied by some scalar?

Contributor

once I understand what the intended process is here, maybe we could reparameterize in terms of iid uniforms (e.g. without rejection sampling), as this is most definitely not a box uniform any more

Member

i'm a bit confused here too because the vector named log_values will initially be populated by the BoxUniform draws

Collaborator

I think log_values is confusing here. What this should actually do (and currently does not) is draw the first population size between 10 and 100,000, uniformly on a log10 scale. So if we draw alpha from a BoxUniform between 1 and 5, the first population size is 10^alpha. In each subsequent time window the population size is then N_{i+1} = N_i * 10^beta, where beta is uniform in [-1, 1].
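
A sketch of that sampling process (window count and variable names are illustrative, not the actual class attributes):

import numpy as np

rng = np.random.default_rng(1)
num_time_windows = 21
log10_min, log10_max = 1.0, 5.0                   # population sizes between 10 and 100,000

log_sizes = [rng.uniform(log10_min, log10_max)]   # alpha: first size, uniform in log10
for _ in range(num_time_windows - 1):
    beta = rng.uniform(-1.0, 1.0)                 # log10 of the per-window fold change
    log_sizes.append(log_sizes[-1] + beta)        # N_{i+1} = N_i * 10**beta
pop_sizes = 10 ** np.array(log_sizes)

Keeping the later windows inside the 10 to 100,000 range is the part handled by the rejection step in the diff above (and by the clamping discussed further down).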

Collaborator

This is to prevent the posterior distribution from focusing too much on the larger population sizes, and thus probably also makes sense for the VariablePopulationSize class. I think performance in the more realistic population size scenarios should also improve with this prior of dependent population sizes, as it prevents completely weird population size changes. Basically, if we drew the population sizes uniformly and independently between 10 and 100,000, too much of the training data would contain large ancestral population sizes.

@nspope (Contributor) commented Mar 6, 2025

This is a bit of a tangent, but worth discussing ... I think we should try to keep the posterior and prior parameterization consistent. For example (simplifying things for exposition), say I have a model parameterized by the random walk Ne[i] = Ne[i - 1] * U[lower, upper]. Then the values of theta that the simulator returns -- the targets for the NN -- should be the U[lower, upper] values, not the Ne values, because we're assuming a uniform prior downstream.

Though this is a little cumbersome (in that the user has to then manually reparameterize posterior samples, to get what they actually want), it's crucial if we want to try to combine posteriors across chunks of sequence
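
To make that concrete, a small sketch (the values are made up): the simulator would return the uniform draws as theta, and posterior samples would be mapped back to Ne afterwards.

import numpy as np

theta = np.array([3.2, -0.4, 0.7, 0.1])   # [alpha, beta_1, beta_2, beta_3]: the targets for the NN
pop_sizes = 10 ** np.cumsum(theta)        # N_0 = 10**alpha, N_i = N_{i-1} * 10**beta_i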

Contributor

Also I think using np.random.uniform here won't respect the random seed, as this has only been set for torch
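
One way around that, assuming the seed is set with torch.manual_seed, is to draw the increment through torch instead of numpy, e.g.

import torch

torch.manual_seed(1)
beta = torch.empty(1).uniform_(-1.0, 1.0).item()   # respects the torch seed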

Collaborator

Ok, we could certainly use the U[lower, upper] values instead of the actual population sizes. But we would have to change a small detail in the generation of the population sizes then. Currently, we redraw U[lower, upper] if Ne[i-1] * U[lower, upper] is larger than the maximal population size allowed (or smaller than the minimal). Instead of redrawing we could simply set the population size to the max or min value allowed in this case.
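
That change is essentially a one-liner in the sketch above: clamp in log10 space instead of redrawing (bounds illustrative).

import numpy as np

log10_min, log10_max = 1.0, 5.0
rng = np.random.default_rng(1)

prev_log_size = 4.7                                # log10 of the previous window's size, illustrative
beta = rng.uniform(-1.0, 1.0)
new_log_size = np.clip(prev_log_size + beta, log10_min, log10_max)  # stick at the border instead of redrawing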

Collaborator

I have now implemented the new version, where the population sizes will be a bit more sticky at the population size range borders.

}

def __init__(self, config: dict):
    super().__init__(config, self.default_config)
    self.parameters = ["recombination_rate"]
Contributor

some stuff got accidentally deleted in this class, probably while fixing merge conflicts -- can you please set it back to how it was?

-------
DataFrame
Table with the distance_bins as index, and the mean value of
Compute LD for a subset of SNPs and return a DataFrame with only the mean r2.
Member

i think it's useful to keep some of the details about the parameters that are needed in here

snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP) for i in range(len(snps) - 1)],
axis=0)
snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP)
for i in range(len(snps) - 1)], axis=0)
Member

to my eyes it's way easier to have the list comprehension on a single line; splitting list comprehensions across lines makes things harder to read

sd["dist_group"] = pd.cut(sd.snp_dist, bins=distance_bins)
sr = [allel.rogers_huff_r(snps) ** 2 for snps in haplotype[snps_pos]]
sd["r2"] = sr
sd["gap_id"] = i
ld = pd.concat([ld, sd])

ld2 = ld.dropna().groupby("dist_group", observed=False).agg(agg_bins)
ld2 = ld.dropna().groupby("dist_group",observed=True).agg(agg_bins)
Member

i believe that setting observed=True here will throw an error, at least it did for me when first reimplementing this routine

@andrewkern (Member)

added some additional comments

@fbaumdicker (Collaborator)

I started to fix the issues in this branch. Currently, I am no longer including the LD statistics and am just using the SFS; I will look at the LD stuff again later. The new prior is done and seems to work fine.
However, when using the cattle config I currently get stuck in the plot_diagnostics job when drawing the posterior samples:
snakemake --configfile workflow/config/cattle_21Gen.yaml --snakefile workflow/training_workflow.smk
On my PC I run out of GPU memory; on the workstation, drawing from the prior does not progress. Any idea what could cause this, @nspope? I want to fix this first before continuing with the LD calculation part.

@nspope (Contributor) commented Mar 17, 2025

Hm ... we're batching the sampling during plot diagnostics, so the per batch memory requirement should be pretty tightly constrained (as much as it would be during training). But maybe I need to be transferring results to the CPU at the end of each batch. Any chance you could drop some print statements into workflow/scripts/plot_diagnostics.py to try to figure out if the OOM is happening within Lightning's prediction loop (the predict_step in the LightningModule in that script), or afterwards (after trainer.predict is called)?
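
If the tensors do turn out to be piling up on the GPU, one option is to move each batch's output off the device inside predict_step. A generic sketch (not the script's actual module, names are illustrative):

import torch
import pytorch_lightning as pl

class PosteriorSampler(pl.LightningModule):
    """Illustrative stand-in for the LightningModule in plot_diagnostics.py."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def predict_step(self, batch, batch_idx):
        out = self.model(batch)
        # detach and move to host memory so the results collected by
        # trainer.predict() don't accumulate on the GPU
        return out.detach().cpu()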

I'm not sure what could be happening with the prior -- do you mean that it's stuck in an endless while loop during the simulation? Regardless, this shouldn't be related to GPU usage (simulation is done on the CPU) but is maybe an issue with parallelization on your workstation? Could you try running snakemake with --jobs 1 to see if the issue persists?

@andrewkern (Member)

i've seen this behavior once before with the posterior sampling being stuck in an infinite loop. it had to do with not properly defining the prior for use with SBI. looking at your code i don't see anything obvious, but i'd go back and compare it to the VariablePopnSize simulator here (which works) to see if there is something off

@fbaumdicker (Collaborator) commented Mar 18, 2025

> Could you try running snakemake with --jobs 1 to see if the issue persists?

I already tried that and it did not change the outcome. I will look into the other points you mentioned.

@fbaumdicker (Collaborator)

Thanks for the hint @andrewkern. Took me a while, but I found the bug. I was passing on the transformed values instead of the prior samples. However, I learned a lot while searching for this and found another unrelated bug.
I am also no longer convinced by the current parametrization of the dependent population sizes using the population size changes instead of the actual population sizes, mainly because the resulting posterior distribution for the actual population sizes is harder to compute. I will think a bit more about this.
Maybe we can define a custom prior that uses dependent sampling but provides the logprob as if it were uniformly sampled?

@nspope (Contributor) commented Mar 18, 2025

> Maybe we can define a custom prior that uses dependent sampling but provides the logprob as if it were uniformly sampled?

Sorry I'm not following -- do you mean, have the stored prior be uniform but the actual prior be something more complex, with the same support?

@fbaumdicker (Collaborator)

Yes, that is what I was thinking of. As long as the support is the same this should in principle work, but I do not know how this would affect the predictions. So maybe this is not the route to go?
Actually, it wouldn't be too hard to compute the correct logprob for the scenario I have in mind. However, it would be a custom prior, and this is currently not supported in the snakemake workflows. If it would not be too hard to include, this would be my preferred path. We have defined custom priors in the past, so it should not be a big issue as long as we can import the prior into all subsequent steps when necessary. For example, the plot_diagnostics script currently assumes a BoxUniform prior.
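
For what it's worth, a hypothetical sketch of such a custom prior over the log10 population sizes, with the correct non-uniform log_prob for the unclamped random walk (all names, bounds, and the window count are illustrative):

import torch

class DependentLogPopSizePrior:
    """theta[0] ~ U[lo, hi], theta[i] = theta[i-1] + U[-step, step], in log10 units."""

    def __init__(self, num_windows=21, lo=1.0, hi=5.0, step=1.0):
        self.num_windows = num_windows
        self.first = torch.distributions.Uniform(lo, hi)
        self.incr = torch.distributions.Uniform(-step, step)

    def sample(self, sample_shape=torch.Size()):
        first = self.first.sample(sample_shape).unsqueeze(-1)
        incr = self.incr.sample(sample_shape + (self.num_windows - 1,))
        return torch.cumsum(torch.cat([first, incr], dim=-1), dim=-1)

    def log_prob(self, theta):
        lp = self.first.log_prob(theta[..., 0])
        return lp + self.incr.log_prob(theta[..., 1:] - theta[..., :-1]).sum(-1)

sbi's process_prior should be able to wrap an object like this, since it exposes sample and log_prob in the torch.distributions style. Note that the clamped ("sticky") variant has point masses at the borders, so its log_prob would need more care.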

@nspope (Contributor) commented Mar 18, 2025

I don't think it's an issue with what's in there now: we're using the prior samples (not the log density) for everything currently implemented, so the correct non-uniform prior will get used implicitly. Where it gets a little hairy, conceptually, is when we try to aggregate posteriors across empirical windows, which is what I'm working on now (because then we effectively duplicate the prior). But maybe we just accept that this sort of aggregation is not sensible with a non-uniform prior.

@andrewkern (Member)

checking in on this PR-- @fbaumdicker what's the current status here?

@fbaumdicker (Collaborator)

We fixed the (hopefully) last bug today. So I am now cleaning up the code and will then rebase.
We now have a simulator example with a working prior that is not a BoxUniform and where the parameters sampled from the prior depend on each other. We are currently running a comparison against your previous version (a BoxUniform prior with independent parameters) on some example datasets. This should give a nice example of how to use more complex priors, but this PR is independent of that analysis and should be ready to merge soon.
