cattle example, LD calculator cleaned up #51

Open: wants to merge 12 commits into main

Conversation

@ningyuxin1999 (Collaborator)

This is an example of the cattle configuration. Each population size depends on the previous one instead of being drawn independently at random.

@nspope (Contributor) commented Mar 3, 2025

Great, thanks Yuxin! Could you please rebase this onto main and fix the merge conflicts?

If you haven't done this before, then from the command line

git checkout main
git pull upstream main  # this is assuming that this repo is called "upstream" in `git remote -v`
git checkout processor_cleanup
git rebase main

and then you'll have to edit the files that git lists, to resolve the

<<<<<<< HEAD
old code
=======
new code
>>>>>>> commit

inserts added by git. Once that is done,

git add path/to/modified/file
git rebase --continue

to move to the next commit (or finish the rebase).

Let me know if you have questions!

@andrewkern (Member)

hmm this still has merge conflicts @ningyuxin1999

@ningyuxin1999 (Collaborator, Author)

> hmm this still has merge conflicts @ningyuxin1999

I was only updating some comments yesterday, and now the conflicts should be resolved :)

@nspope (Contributor) left a comment

Thanks @ningyuxin1999 -- it looks like there are a few things to fix that I've flagged here.

Also, it'd be good if you could run the training pipeline through on this example just to make sure it doesn't error out (the number of sims can be set to something low, like 100).

@andrewkern can you take a quick look as well?

max_time: 130000
time_rate: 0.06
samples:
  pop: 25
@nspope (Contributor) commented Mar 6, 2025

the population name should be "pop0" to match the msprime demography, I think?

edit: now I see, you're doing samples={"pop0": self.samples["pop"]} in the msprime invocation below. But it'd be easier/nicer to just have the correct name here, so that you can do samples=self.samples
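
For illustration, a minimal sketch of how the renamed config key would line up with the msprime call (the population size and sequence parameters here are placeholders, not the repo's values):

import msprime

demography = msprime.Demography()
demography.add_population(name="pop0", initial_size=10_000)

samples = {"pop0": 25}  # what self.samples would hold once the config key is renamed to "pop0"
ts = msprime.sim_ancestry(samples=samples, demography=demography,
                          sequence_length=1e6, recombination_rate=1e-8, random_seed=1)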

Collaborator

Done

packed_sequence: False

simulator:
  class_name: "VariablePopulationSize"
Contributor

The class name should be "Cattle21Gen", I think

Member

yep, agreed

Collaborator

I changed this, but in the long run I think I would like to use the same class for the cattle example and the VariablePopulationSize example. In my opinion the dependent population sizes make much more sense as a prior and could be used in any case.

geno = ts.genotype_matrix().T
num_sample = geno.shape[0]
if (geno == 2).any():
    num_sample *= 2
Contributor

you can just use num_sample = ts.num_samples here
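
A minimal, self-contained sketch of the suggested simplification (the simulation call is only there to produce a tree sequence to work with):

import msprime

ts = msprime.sim_ancestry(samples=10, sequence_length=1e5,
                          recombination_rate=1e-8, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

num_sample = ts.num_samples   # haploid sample count (here 20), no need to infer it from the genotypes
geno = ts.genotype_matrix().T  # rows are samples, columns are sites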

row_sum != 0,
row_sum != num_sample,
row_sum > num_sample * 0.05,
num_sample - row_sum > num_sample * 0.05
Contributor

just to make this clearer to read (and settable), put "maf": 0.05 under the "FIXED PARAMETER" heading in the default_config dict above, and change these lines to:

row_sum > num_sample * self.maf

etc
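
A sketch of what that could look like (the toy genotype matrix and the exact dict layout are illustrative, not the repo's actual defaults):

import numpy as np

default_config = {
    # FIXED PARAMETERS
    "maf": 0.05,  # minor-allele-frequency cutoff, exposed instead of hard-coding 0.05
}
maf = default_config["maf"]   # would be self.maf inside the processor

geno = np.array([[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]).T  # samples x sites, toy data
num_sample = geno.shape[0]
row_sum = np.sum(geno, axis=0)
keep = np.logical_and.reduce([
    row_sum > num_sample * maf,               # drop sites where the derived allele is too rare
    num_sample - row_sum > num_sample * maf,  # ... or the ancestral allele is too rare
])
geno = geno[:, keep]

Note that the two frequency conditions already exclude monomorphic sites, which is why the row_sum != 0 / row_sum != num_sample checks flagged below become redundant.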

Collaborator

Done

row_sum = np.sum(geno, axis=0)
keep = np.logical_and.reduce([
    row_sum != 0,
    row_sum != num_sample,
Contributor

I think that row_sum != 0 and row_sum != num_sample aren't needed with the MAF filter directly below, so delete these two lines

Collaborator

Yes, I deleted them.

for i in range(1, self.num_time_windows):
    new_value = 10 ** self.pop_sizes[0] - 1
    while new_value > self.pop_sizes[1] or new_value < self.pop_sizes[0]:
        new_value = log_values[i - 1] * 10 ** np.random.uniform(-1, 1)
Contributor

I'm having trouble getting my head around this -- we're multiplying log10 Ne by U[0.1, 10] -- is this what's intended? e.g. this isn't intended to be natural units Ne multiplied by some scalar?

Contributor

once I understand what the intended process is here, maybe we could reparameterize in terms of iid uniforms (e.g. without rejection sampling), as this is most definitely not a box uniform any more

Member

i'm a bit confused here too because the vector named log_values will initially be populated by the BoxUniform draws

Collaborator

I think log_values is confusing here. What this should actually do (and currently does not) is draw the first population size between 10 and 100,000, uniformly on a log10 scale. So if we draw alpha from a BoxUniform between 1 and 5, the first population size is 10^alpha. In each subsequent time window the population size is then N_{i+1} = N_i * 10^beta, where beta is uniform in [-1, 1].
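
A sketch of that sampling process (window count and variable names are illustrative, not the actual class attributes):

import numpy as np

rng = np.random.default_rng(1)
num_time_windows = 21
log10_min, log10_max = 1.0, 5.0                   # population sizes between 10 and 100,000

log_sizes = [rng.uniform(log10_min, log10_max)]   # alpha: first size, uniform in log10
for _ in range(num_time_windows - 1):
    beta = rng.uniform(-1.0, 1.0)                 # log10 of the per-window fold change
    log_sizes.append(log_sizes[-1] + beta)        # N_{i+1} = N_i * 10**beta
pop_sizes = 10 ** np.array(log_sizes)

Keeping the later windows inside the 10 to 100,000 range is the part handled by the rejection step in the diff above (and by the clamping discussed further down).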

Collaborator

This is to prevent the posterior distribution from focusing too much on the larger population sizes, and thus probably also makes sense for the VariablePopulationSize class. I think performance in the more realistic population size scenarios should also improve with this prior of dependent population sizes, as it prevents completely weird population size changes. Basically, if we drew the population sizes uniformly and independently between 10 and 100,000, too much of the training data would contain large ancestral population sizes.

@nspope (Contributor) commented Mar 6, 2025

This is a bit of a tangent, but worth discussing ... I think we should try to keep the posterior and prior parameterization consistent. For example (simplifying things for exposition), say I have a model parameterized by the random walk Ne[i] = Ne[i - 1] * U[lower, upper]. Then the values of theta that the simulator returns -- the targets for the NN -- should be the U[lower, upper] values, not the Ne values, because we're assuming a uniform prior downstream.

Though this is a little cumbersome (in that the user has to then manually reparameterize posterior samples, to get what they actually want), it's crucial if we want to try to combine posteriors across chunks of sequence
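
To make that concrete, a small sketch (the values are made up): the simulator would return the uniform draws as theta, and posterior samples would be mapped back to Ne afterwards.

import numpy as np

theta = np.array([3.2, -0.4, 0.7, 0.1])   # [alpha, beta_1, beta_2, beta_3]: the targets for the NN
pop_sizes = 10 ** np.cumsum(theta)        # N_0 = 10**alpha, N_i = N_{i-1} * 10**beta_i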

Contributor

Also I think using np.random.uniform here won't respect the random seed, as this has only been set for torch
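
One way around that, assuming the seed is set with torch.manual_seed, is to draw the increment through torch instead of numpy, e.g.

import torch

torch.manual_seed(1)
beta = torch.empty(1).uniform_(-1.0, 1.0).item()   # respects the torch seed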

Collaborator

Ok, we could certainly use the U[lower, upper] values instead of the actual population sizes. But we would have to change a small detail in the generation of the population sizes then. Currently, we redraw U[lower, upper] if Ne[i-1] * U[lower, upper] is larger than the maximal population size allowed (or smaller than the minimal). Instead of redrawing we could simply set the population size to the max or min value allowed in this case.
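
That change is essentially a one-liner in the sketch above: clamp in log10 space instead of redrawing (bounds illustrative).

import numpy as np

log10_min, log10_max = 1.0, 5.0
rng = np.random.default_rng(1)

prev_log_size = 4.7                                # log10 of the previous window's size, illustrative
beta = rng.uniform(-1.0, 1.0)
new_log_size = np.clip(prev_log_size + beta, log10_min, log10_max)  # stick at the border instead of redrawing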

Collaborator

I have now implemented the new version, where the population sizes will be a bit more sticky at the population size range borders.

}

def __init__(self, config: dict):
    super().__init__(config, self.default_config)
    self.parameters = ["recombination_rate"]
Contributor

some stuff got accidentally deleted in this class, probably while fixing merge conflicts -- can you please set it back to how it was?

-------
DataFrame
Table with the distance_bins as index, and the mean value of
Compute LD for a subset of SNPs and return a DataFrame with only the mean r2.
Member

i think it's useful to keep some of the details about the parameters that are needed in here

snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP) for i in range(len(snps) - 1)],
axis=0)
snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP)
for i in range(len(snps) - 1)], axis=0)
Member

to my eyes it's way easier to have the list comprehension on a single line; splitting list comprehensions across lines makes things harder to read

sd["dist_group"] = pd.cut(sd.snp_dist, bins=distance_bins)
sr = [allel.rogers_huff_r(snps) ** 2 for snps in haplotype[snps_pos]]
sd["r2"] = sr
sd["gap_id"] = i
ld = pd.concat([ld, sd])

ld2 = ld.dropna().groupby("dist_group", observed=False).agg(agg_bins)
ld2 = ld.dropna().groupby("dist_group",observed=True).agg(agg_bins)
Member

i believe that setting observed=True here will throw an error, at least it did for me when first reimplementing this routine

@andrewkern (Member)

added some additional comments

@fbaumdicker (Collaborator)

I started to fix the issues in this branch. Currently, I am no longer including the LD statistics and am just using the SFS; I will look at the LD stuff again later. The new prior is done and seems to work fine.
However, when using the cattle config I currently get stuck in the plot_diagnostics job when drawing the posterior samples:
snakemake --configfile workflow/config/cattle_21Gen.yaml --snakefile workflow/training_workflow.smk
On my PC I run out of GPU memory; on the workstation, drawing from the prior does not progress. Any idea what could cause this, @nspope? I want to fix this first before continuing with the LD calculation part.

@nspope (Contributor) commented Mar 17, 2025

Hm ... we're batching the sampling during plot diagnostics, so the per batch memory requirement should be pretty tightly constrained (as much as it would be during training). But maybe I need to be transferring results to the CPU at the end of each batch. Any chance you could drop some print statements into workflow/scripts/plot_diagnostics.py to try to figure out if the OOM is happening within Lightning's prediction loop (the predict_step in the LightningModule in that script), or afterwards (after trainer.predict is called)?
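
If the tensors do turn out to be piling up on the GPU, one option is to move each batch's output off the device inside predict_step. A generic sketch (not the script's actual module, names are illustrative):

import torch
import pytorch_lightning as pl

class PosteriorSampler(pl.LightningModule):
    """Illustrative stand-in for the LightningModule in plot_diagnostics.py."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def predict_step(self, batch, batch_idx):
        out = self.model(batch)
        # detach and move to host memory so the results collected by
        # trainer.predict() don't accumulate on the GPU
        return out.detach().cpu()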

I'm not sure what could be happening with the prior -- do you mean that it's stuck in an endless while loop during the simulation? Regardless, this shouldn't be related to GPU usage (simulation is done on the CPU) but is maybe an issue with parallelization on your workstation? Could you try running snakemake with --jobs 1 to see if the issue persists?

@andrewkern (Member)

i've seen this behavior once before with the posterior sampling being stuck in an infinite loop. it had to do with not properly defining the prior for use with SBI. looking at your code i don't see anything obvious, but i'd go back and compare it to the VariablePopnSize simulator here (which works) to see if there is something off

@fbaumdicker (Collaborator) commented Mar 18, 2025

> Could you try running snakemake with --jobs 1 to see if the issue persists?

I already tried that and it did not change the outcome. I will look into the other points you mentioned.

@fbaumdicker (Collaborator)

Thanks for the hint @andrewkern. Took me a while, but I found the bug. I was passing on the transformed values instead of the prior samples. However, I learned a lot while searching for this and found another unrelated bug.
I am also no longer convinced by the current parametrization of the dependent population sizes using the population size changes instead of the actual population sizes, mainly because the resulting posterior distribution for the actual population sizes is harder to compute. I will think a bit more about this.
Maybe we can define a custom prior that uses dependent sampling but provides the logprob as if it were uniformly sampled?

@nspope (Contributor) commented Mar 18, 2025

> Maybe we can define a custom prior that uses dependent sampling but provides the logprob as if it were uniformly sampled?

Sorry I'm not following -- do you mean, have the stored prior be uniform but the actual prior be something more complex, with the same support?

@fbaumdicker (Collaborator)

Yes, that is what I was thinking of. As long as the support is the same this should in principle work, but I do not know how this would affect the predictions. So maybe this is not the route to go?
Actually, it wouldn't be too hard to compute the correct logprob for the scenario I have in mind. However, it would be a custom prior, and this is currently not supported in the snakemake workflows. If it would not be too hard to include, this would be my preferred path. We have defined custom priors in the past, so it should not be a big issue as long as we can import the prior into all subsequent steps when necessary. For example, the plot_diagnostics script currently assumes a BoxUniform prior.
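
For what it's worth, a hypothetical sketch of such a custom prior over the log10 population sizes, with the correct non-uniform log_prob for the unclamped random walk (all names, bounds, and the window count are illustrative):

import torch

class DependentLogPopSizePrior:
    """theta[0] ~ U[lo, hi], theta[i] = theta[i-1] + U[-step, step], in log10 units."""

    def __init__(self, num_windows=21, lo=1.0, hi=5.0, step=1.0):
        self.num_windows = num_windows
        self.first = torch.distributions.Uniform(lo, hi)
        self.incr = torch.distributions.Uniform(-step, step)

    def sample(self, sample_shape=torch.Size()):
        first = self.first.sample(sample_shape).unsqueeze(-1)
        incr = self.incr.sample(sample_shape + (self.num_windows - 1,))
        return torch.cumsum(torch.cat([first, incr], dim=-1), dim=-1)

    def log_prob(self, theta):
        lp = self.first.log_prob(theta[..., 0])
        return lp + self.incr.log_prob(theta[..., 1:] - theta[..., :-1]).sum(-1)

sbi's process_prior should be able to wrap an object like this, since it exposes sample and log_prob in the torch.distributions style. Note that the clamped ("sticky") variant has point masses at the borders, so its log_prob would need more care.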

@nspope (Contributor) commented Mar 18, 2025

I don't think it's an issue with what's in there now: we're using the prior samples (not the log density) for everything currently implemented, so the correct non-uniform prior will get used implicitly. Where it gets a little hairy, conceptually, is when we try to aggregate posteriors across empirical windows, which is what I'm working on now (because then we effectively duplicate the prior). But maybe we just accept that this sort of aggregation is not sensible with a non-uniform prior.

@andrewkern (Member)

checking in on this PR-- @fbaumdicker what's the current status here?

@fbaumdicker (Collaborator)

We fixed the (hopefully) last bug today. So I am now cleaning up the code and will then rebase.
We now have a simulator example with a working prior that is not a BoxUniform and where the parameters sampled from the prior depend on each other. We are currently running a comparison against your previous version (a BoxUniform prior with independent parameters) on some example datasets. This should give a nice example of how to use more complex priors, but this PR is independent of that analysis and should be ready to merge soon.
