Code to run logistic regression on v4 exomes and genomes with ancestry PCs #616
| ht = get_test_intervals(ht) | ||
| ht = ht.checkpoint(hl.utils.new_temp_file("test_intervals", "ht")) | ||
| exomes_vds = hl.vds.filter_intervals( | ||
| exomes_vds, ht, split_reference_blocks=True |
Is this needed? Did the Hail team say this is faster than just filtering the variant matrix table and doing a densify like we do in this script? https://github.com/broadinstitute/gnomad_qc/blob/main/gnomad_qc/v4/sample_qc/generate_qc_mt.py
I didn't talk to the Hail team. I tried filter_variants first and found that step took very long to finish. Then I remembered that once the VDS has this store_max_ref_length, filter_intervals is much faster.
OK, sounds good. If it's faster, then go for it.
| logger.info("Densifying exomes...") | ||
| exomes_mt = hl.vds.to_dense_mt(exomes_vds) | ||
| exomes_mt = exomes_mt.annotate_cols(is_genome=False) | ||
| exomes_mt = exomes_mt.select_entries("GT").select_rows().select_cols("is_genome") |
If this is the only entry you need, then you should filter to it before the densify. For this, though, I think you probably also want to filter to only adj genotypes, so you probably need more than just GT.
I agree, adj seems to make more sense. I will get that.
I found that the entries are not the same as the ones we used when computing freq for exomes or genomes. Could you check the new steps for me?
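To make the adj discussion concrete, here is a minimal plain-Python sketch of an adj-style check on a single diploid genotype. The thresholds (GQ >= 20, DP >= 10, allele balance >= 0.2 for heterozygous calls) are assumed here as the commonly cited gnomAD defaults, and `is_adj` is a hypothetical helper rather than the function used in this PR. It mainly shows why GT alone is not enough before the densify: GQ, DP, and AD are also needed.

```python
def is_adj(gq, dp, ad, gt, min_gq=20, min_dp=10, min_ab=0.2):
    """Return True if a diploid genotype passes an adj-style filter.

    gt is a pair of allele indices, e.g. (0, 1); ad holds per-allele depths.
    Thresholds are assumptions (commonly cited gnomAD defaults).
    """
    if gq < min_gq or dp < min_dp:
        return False
    first, second = gt
    if first == second:
        # Homozygous calls skip the allele-balance check.
        return True
    # Heterozygous: every non-reference allele in the call must reach min_ab.
    return all(ad[i] / dp >= min_ab for i in (first, second) if i != 0)
```

For example, a het call with depths `[18, 2]` at DP 20 has an allele balance of 0.1 and would be dropped, while `[10, 10]` would be kept.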
| return ht | ||
| def densify_union_exomes_genomes( |
I would split this up. I would run the exomes and genomes filter and densify in parallel, checkpoint each, and then, once those are done, union them and checkpoint the result.
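A rough sketch of that split, assuming each data type is processed by its own job and the union runs afterwards. The function names and output paths are hypothetical. The Hail calls (`hl.vds.filter_intervals`, `hl.vds.to_dense_mt`, `annotate_cols`, `union_cols`, `checkpoint`) are the ones already used in the diff or standard MatrixTable methods. Hail is imported inside the function so the module loads without a Hail session.

```python
def filter_and_densify(vds, intervals_ht, is_genome, out_path):
    """Filter one VDS to the test intervals, densify it, and checkpoint it.

    Intended to run once per data type (exomes, genomes), possibly in parallel.
    """
    import hail as hl  # lazy import: only needed when the step actually runs

    vds = hl.vds.filter_intervals(vds, intervals_ht, split_reference_blocks=True)
    mt = hl.vds.to_dense_mt(vds)
    mt = mt.annotate_cols(is_genome=is_genome)
    return mt.checkpoint(out_path, overwrite=True)


def union_datasets(exomes_mt, genomes_mt, out_path):
    """Union the two checkpointed dense MTs by column and checkpoint the result."""
    mt = exomes_mt.union_cols(genomes_mt)
    return mt.checkpoint(out_path, overwrite=True)
```

Because each step ends in a checkpoint, a failure in the union does not force either densify to be rerun.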
| :param joint_ht: Joint HT of v4 exomes and genomes. | ||
| :return: Test Table | ||
| """ | ||
| # Filter to chr22 |
Before running the chr22 test, make an actual test that uses only the first few partitions of chr22.
Using only a few partitions was slower when I tested. When I take the set of intervals on chr22 instead, the data is split across more partitions and it densified faster.
| "firth", | ||
| y=mt.is_genome, | ||
| x=mt.GT.n_alt_alleles(), | ||
| covariates=[1] + [mt.pc[i] for i in range(10)], |
I don't remember how many PCs we used for ancestry assignment off the top of my head, but I would use that number
Mike told me it was 10, but I do remember you were exploring up to 18 or 19. I will double-check the code.
Sorry if I wasn't clear. I wasn't certain it was 10, I just thought it might be. It's 20: https://app.zenhub.com/workspaces/gnomad-5f4d127ea61afc001d6be50b/issues/gh/broadinstitute/gnomad_production/496
No worries. I put 10 there temporarily because that was the value in Julia's code. I have changed it in my test.
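With the PC count settled at 20, the regression call from the diff could be parameterized along these lines. This is only a sketch: `run_firth_regression` and `N_PCS` are hypothetical names, and it assumes, as in the diff, that `mt` carries a boolean `is_genome` column field, a `GT` entry field, and an array column field `pc` with at least `n_pcs` elements.

```python
N_PCS = 20  # number of ancestry PCs agreed on in the thread above


def run_firth_regression(mt, n_pcs=N_PCS):
    """Run a per-variant Firth logistic regression of is_genome on alt-allele count.

    The covariates are an intercept term plus the first n_pcs ancestry PCs.
    """
    import hail as hl  # lazy import: only needed when the step actually runs

    return hl.logistic_regression_rows(
        test="firth",
        y=mt.is_genome,
        x=mt.GT.n_alt_alleles(),
        covariates=[1] + [mt.pc[i] for i in range(n_pcs)],
    )
```

Keeping the PC count in one named constant avoids a hard-coded `range(10)` drifting out of sync with the ancestry assignment.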
| return mt | ||
| def main(args): |
Add a try/catch around this so the log is still copied if the run fails with an error.
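One reading of this suggestion is a `try`/`finally` wrapper, so the log copy happens whether or not the pipeline raises. The sketch below uses hypothetical names (`run_and_copy_log`, and the `run` and `copy_log` callables). In a Hail script the copy step would typically be a call to `hl.copy_log` with a destination path, which is an assumption about how this script would wire it up.

```python
import logging

logger = logging.getLogger("logistic_regression")


def run_and_copy_log(run, copy_log):
    """Run the pipeline and always copy the log afterwards, even on failure.

    run and copy_log are zero-argument callables supplied by the caller.
    """
    try:
        run()
    finally:
        # Executes on success and on error; the original exception still propagates.
        logger.info("Copying log...")
        copy_log()
```

The exception is not swallowed, so the job still fails visibly, but the log is available for debugging.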
| hl.utils.new_temp_file(f"temp_{data_type}_vds_filtered", "vds") | ||
| ) | ||
| mt = hl.vds.to_dense_mt(vds) | ||
| mt = annotate_adj(mt) |
After looking at your code, I think I could use filter_to_adj directly.