cattle example, LD calculator cleaned up #51
Conversation
Great, thanks Yuxin! Could you please rebase this onto main and fix the merge conflicts? If you haven't done this before, then from the command line:

```sh
git checkout main
git pull upstream main   # this is assuming that this repo is called "upstream" in `git remote -v`
git checkout processor_cleanup
git rebase main
```

and then you'll have to edit the files that git lists, to resolve the

```
<<<<<<< HEAD
old code
=======
new code
>>>>>>> commit
```

inserts added by git. Once that is done,

```sh
git add path/to/modified/file
git rebase --continue
```

to move to the next commit (or finish the rebase). Let me know if you have questions!
hmm this still has merge conflicts @ningyuxin1999

I was only updating some comments yesterday, and now the conflicts should be resolved :)
Thanks @ningyuxin1999 -- it looks like there are a few things to fix that I've flagged here.

Also, it'd be good if you could try running the training pipeline through on this example, just to make sure it doesn't error out (the number of sims can be set to something low, like 100).

@andrewkern can you take a quick look as well?
`workflow/config/cattle_21Gen.yaml` (outdated):

```yaml
max_time: 130000
time_rate: 0.06
samples:
  pop: 25
```
the population name should be "pop0" to match the msprime demography, I think?

edit: now I see, you're doing `samples={"pop0": self.samples["pop"]}` in the msprime invocation below. But it'd be easier/nicer to just have the correct name here, so that you can do `samples=self.samples`.
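A minimal sketch of the suggested rename (the sample count is from the excerpt above; everything else unchanged):

```yaml
samples:
  pop0: 25   # key now matches the msprime population name
```

and then the simulator can pass `samples=self.samples` directly.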
Done
`workflow/config/cattle_21Gen.yaml` (outdated):

```yaml
packed_sequence: False

simulator:
  class_name: "VariablePopulationSize"
```
The class name should be "Cattle21Gen", I think
yep, agreed
I changed this, but in the long run I think I would like to use the same class for the cattle example and the VariablePopulationSize example. In my opinion the dependent population sizes make much more sense as a prior and could be used in any case.
`workflow/scripts/ts_simulators.py` (outdated):

```python
geno = ts.genotype_matrix().T
num_sample = geno.shape[0]
if (geno == 2).any():
    num_sample *= 2
```
you can just use `num_sample = ts.num_samples` here
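For illustration, a sketch of that simplification (tskit's `TreeSequence.num_samples` counts the sample genomes directly, so the shape/ploidy workaround above becomes unnecessary):

```python
geno = ts.genotype_matrix().T   # rows are samples after the transpose
num_sample = ts.num_samples     # replaces the shape check and ploidy doubling
```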
`workflow/scripts/ts_simulators.py` (outdated):

```python
row_sum != 0,
row_sum != num_sample,
row_sum > num_sample * 0.05,
num_sample - row_sum > num_sample * 0.05
```
just to make this clearer to read (and settable), put `"maf": 0.05` under the "FIXED PARAMETER" heading in the `default_config` dict above, and change these lines to `row_sum > num_sample * self.maf` etc.
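A hypothetical sketch of how that suggestion could look (`default_config`, `row_sum`, and `num_sample` are names from the surrounding code; the function wrapper here is purely illustrative):

```python
import numpy as np

default_config = {
    # FIXED PARAMETERS
    "maf": 0.05,  # minor-allele-frequency cutoff, now settable via config
}

def maf_filter(geno: np.ndarray, num_sample: int, maf: float) -> np.ndarray:
    """Boolean mask keeping sites whose minor-allele count exceeds maf * num_sample."""
    row_sum = np.sum(geno, axis=0)
    return np.logical_and.reduce([
        row_sum > num_sample * maf,               # drop rare alleles
        num_sample - row_sum > num_sample * maf,  # drop near-fixed alleles
    ])
```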
Done
`workflow/scripts/ts_simulators.py` (outdated):

```python
row_sum = np.sum(geno, axis=0)
keep = np.logical_and.reduce([
    row_sum != 0,
    row_sum != num_sample,
```
I think that `row_sum != 0` and `row_sum != num_sample` aren't needed with the MAF filter directly below (the MAF bounds already imply both), so delete these two lines
Yes, I deleted them.
`workflow/scripts/ts_simulators.py` (outdated):

```python
for i in range(1, self.num_time_windows):
    new_value = 10 ** self.pop_sizes[0] - 1
    while new_value > self.pop_sizes[1] or new_value < self.pop_sizes[0]:
        new_value = log_values[i - 1] * 10 ** np.random.uniform(-1, 1)
```
I'm having trouble getting my head around this -- we're multiplying log10 Ne by U[0.1, 10] -- is this what's intended? e.g. this isn't intended to be natural units Ne multiplied by some scalar?
once I understand what the intended process is here, maybe we could reparameterize in terms of iid uniforms (e.g. without rejection sampling), as this is most definitely not a box uniform any more
i'm a bit confused here too because the vector named `log_values` will initially be populated by the BoxUniform draws
I think `log_values` is confusing here. What this should actually do (and currently does not) is draw the first population size between 10 and 100000, but uniformly in log10-scale. So if we draw alpha from a BoxUniform between 1 and 5, the first population size should then be 10^alpha. In each subsequent time window the population size is then N_{i+1} = N_i * 10^beta, where beta is uniform in [-1, 1].
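To make the intended process concrete, here is a minimal, self-contained sketch of that prior (the names and the 10 to 100000 range come from this discussion; this is not the repo's actual code). It uses torch rather than numpy, anticipating the seeding point raised below:

```python
import torch

def sample_pop_sizes(num_time_windows: int) -> torch.Tensor:
    # First size: alpha ~ U[1, 5], N_0 = 10^alpha, i.e. uniform in
    # log10-scale over [10, 100000].
    alpha = torch.empty(1).uniform_(1.0, 5.0)
    sizes = [10.0 ** alpha]
    # Each later window: N_{i+1} = N_i * 10^beta with beta ~ U[-1, 1].
    for _ in range(num_time_windows - 1):
        beta = torch.empty(1).uniform_(-1.0, 1.0)
        sizes.append(sizes[-1] * 10.0 ** beta)
    return torch.cat(sizes)
```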
This is to prevent the posterior distribution from focusing too much on the larger population sizes, so it probably also makes sense for the VariablePopulationSize class. I think the performance for the more realistic population-size scenarios should also improve using this prior with dependent population sizes, as it prevents completely weird population-size changes. Basically, if we drew the population sizes uniformly and independently between 10 and 100000, too much of the training data would contain large ancestral population sizes.
This is a bit of a tangent, but worth discussing ... I think we should try to keep the posterior and prior parameterization consistent. For example (simplifying things for exposition), say I have a model parameterized by the random walk `Ne[i] = Ne[i - 1] * U[lower, upper]`. Then the values of theta that the simulator returns -- the targets for the NN -- should be the `U[lower, upper]` values, not the Ne values, because we're assuming a uniform prior downstream.

Though this is a little cumbersome (in that the user has to then manually reparameterize posterior samples, to get what they actually want), it's crucial if we want to try to combine posteriors across chunks of sequence.
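As a sketch of what that reparameterization implies (hypothetical helper; it assumes theta stacks alpha followed by the per-window log10 steps from the prior above): the network is trained on the uniform draws, and posterior samples are mapped back to Ne afterwards.

```python
import torch

def theta_to_pop_sizes(theta: torch.Tensor) -> torch.Tensor:
    # theta = (alpha, beta_1, ..., beta_{k-1}); the cumulative sum gives
    # log10 N_i, since log10 N_{i+1} = log10 N_i + beta_{i+1}.
    return 10.0 ** torch.cumsum(theta, dim=0)
```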
Also I think using `np.random.uniform` here won't respect the random seed, as this has only been set for torch.
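A quick illustration of the concern (this is just the standard behaviour of the two libraries, not code from this repo):

```python
import numpy as np
import torch

torch.manual_seed(0)
print(torch.rand(1))        # reproducible: controlled by torch.manual_seed
print(np.random.uniform())  # not affected by the torch seed at all
```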
Ok, we could certainly use the U[lower, upper] values instead of the actual population sizes. But we would have to change a small detail in the generation of the population sizes then. Currently, we redraw U[lower, upper] if Ne[i-1] * U[lower, upper] is larger than the maximal population size allowed (or smaller than the minimal). Instead of redrawing we could simply set the population size to the max or min value allowed in this case.
I have now implemented the new version, where the population sizes will be a bit more sticky at the population size range borders.
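A small sketch of the sticky-boundary update described here (the [10, 100000] range is assumed from the earlier discussion; `prev_size` is an illustrative stand-in for the previous window's size):

```python
import torch

lo, hi = 10.0, 1e5                         # allowed population-size range
prev_size = torch.tensor([1000.0])         # example previous window size
beta = torch.empty(1).uniform_(-1.0, 1.0)  # the draw kept as the NN target
# Clamp instead of rejection sampling, so boundary values are "sticky".
new_size = torch.clamp(prev_size * 10.0 ** beta, min=lo, max=hi)
```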
`workflow/scripts/ts_simulators.py` (outdated):

```python
}

def __init__(self, config: dict):
    super().__init__(config, self.default_config)
    self.parameters = ["recombination_rate"]
```
some stuff got accidentally deleted in this class, probably during fixing merge conflicts -- can you please set it back to how it was?
```diff
--------
-DataFrame
-    Table with the distance_bins as index, and the mean value of
+Compute LD for a subset of SNPs and return a DataFrame with only the mean r2.
```
i think it's useful to keep some of the details about the parameters that are needed in here
`workflow/scripts/ts_processors.py` (outdated):

```diff
-snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP) for i in range(len(snps) - 1)],
-                      axis=0)
+snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP)
+                       for i in range(len(snps) - 1)], axis=0)
```
to my eyes, it's way easier to have the list comprehension on a single line; splitting list comprehensions across lines makes things harder to read
`workflow/scripts/ts_processors.py` (outdated):

```diff
 sd["dist_group"] = pd.cut(sd.snp_dist, bins=distance_bins)
 sr = [allel.rogers_huff_r(snps) ** 2 for snps in haplotype[snps_pos]]
 sd["r2"] = sr
 sd["gap_id"] = i
 ld = pd.concat([ld, sd])

-ld2 = ld.dropna().groupby("dist_group", observed=False).agg(agg_bins)
+ld2 = ld.dropna().groupby("dist_group", observed=True).agg(agg_bins)
```
i believe that setting `observed=True` here will throw an error, at least it did for me when first reimplementing this routine
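For context, a small self-contained example of the difference (general pandas behaviour, not this repo's code): with a `pd.cut` categorical, `observed=False` keeps every distance bin so the output has a fixed length, while `observed=True` drops empty bins, which can break downstream code expecting a fixed-length summary.

```python
import pandas as pd

df = pd.DataFrame({"snp_dist": [1.0, 2.0], "r2": [0.1, 0.2]})
df["dist_group"] = pd.cut(df.snp_dist, bins=[0.0, 1.5, 3.0, 10.0])

print(df.groupby("dist_group", observed=False)["r2"].mean())  # 3 rows, NaN for the empty bin
print(df.groupby("dist_group", observed=True)["r2"].mean())   # empty (3, 10] bin dropped
```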
added some additional comments
Force-pushed from d7a1d96 to 9c98d3d.
I started to fix the issues in this branch. Currently, I am no longer including the LD statistics and am just using the SFS. I will look at the LD stuff again later. The new prior is done and seems to work fine.
Hm ... we're batching the sampling during plot diagnostics, so the per-batch memory requirement should be pretty tightly constrained (as much as it would be during training). But maybe I need to be transferring results to the CPU at the end of each batch. Any chance you could drop some print statements in?

I'm not sure what could be happening with the prior -- do you mean that it's stuck in an endless while loop during the simulation? Regardless, this shouldn't be related to GPU usage (simulation is done on the CPU), but is maybe an issue with parallelization on your workstation? Could you try running snakemake with `--jobs 1` to see if the issue persists?
i've seen this behavior once before with the posterior sampling being stuck in an infinite loop. it had to do with not properly defining the prior for use with SBI. looking at your code i don't see anything obvious, but i'd go back and compare it to the

I already tried that and it did not change the outcome. I will look into the other points you mentioned.

Thanks for the hint @andrewkern. Took me a while, but I found the bug. I was passing on the transformed values instead of the prior samples. However, I learned a lot while searching for this and found another unrelated bug.

Sorry I'm not following -- do you mean, have the stored prior be uniform but the actual prior be something more complex, with the same support?

Yes, that is what I was thinking of. As long as the support is the same this should in principle work, but I do not know how this would affect the predictions. So maybe this is not the route to go?
I don't think it's an issue with what's in there now: we're using the prior samples (not the log density) for everything currently implemented, so the correct non-uniform prior will get implicitly used. Where it gets a little hairy, conceptually, is when we try to aggregate posteriors across empirical windows, which is what I'm working on now. (Because then we effectively duplicate the prior.) But maybe we just accept that this sort of aggregation is not sensible to use with a non-uniform prior.
checking in on this PR -- @fbaumdicker what's the current status here?

We fixed the (hopefully) last bug today. So I am now cleaning up the code and will then rebase.
This is the cattle configuration example. Each population size depends on the previous one instead of being generated independently at random.