add ancestry imputation #1040

jklugherz · 2025-02-13T19:45:09Z

adds qc_gen_anc and related fields to sample_qc json


bpblanken · 2025-02-19T17:32:59Z

v03_pipeline/lib/methods/sample_qc.py

+POP_PCA_LOADINGS_PATH = (
+    'gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht'
+)
+ANCESTRY_RF_MODEL_PATH = 'v03_pipeline/var/ancestry_imputation_model.pickle'


one gross bit here, this isn't likely to work on dataproc as is. The local package import doesn't work in the same way because of how jobs are submitted and packaged for pyspark (look at pyfiles.zip in airflow if you're interested!).

This is how we handle liftover:

GRCH38_TO_GRCH37_LIFTOVER_REF_PATH = (
    'gs://hail-common/references/grch38_to_grch37.over.chain.gz'
    if os.environ.get('HAIL_DATAPROC') == '1'
    else 'v03_pipeline/var/liftover/grch38_to_grch37.over.chain.gz'
)

but that's potentially even less possible with pickle. Is it possible for us to just treat the pickle as reference data (and put it in seqr-reference-data) and manage the pickle load by downloading the file to memory with a gcs client, then wrapping it in a BytesIO?

up until now we have not supported sample QC for local installations, so I think it is okay if this only works on dataproc, if that helps make this simpler at all

or maybe https://hail.is/docs/0.2/fs_api.html#hailtop.fs.open will help?

+1 that this can only work on dataproc for now. Filesystem compatibility is a broad issue across the pipeline and worth some deeper thought.

We can definitely put it in seqr-reference-data. The original script uses hl.hadoop_open, which should work and hfs.open probably will too.
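For reference, a minimal sketch of the in-memory load discussed in this thread: read the file's bytes through a file-like opener, then unpickle from a BytesIO wrapper. The helper name `load_pickled_model` and its `open_fn` parameter are hypothetical and not code from this PR. The built-in `open` default only handles local paths; on dataproc one could presumably pass `hailtop.fs.open` instead so that `gs://` paths resolve.

```python
import io
import pickle
from typing import Any, Callable


def load_pickled_model(path: str, open_fn: Callable[..., Any] = open) -> Any:
    """Read the whole file into memory, then unpickle from a BytesIO wrapper.

    `open_fn` is a hypothetical hook: any callable taking (path, mode) and
    returning a binary file-like context manager, e.g. the built-in `open`
    for local files or `hailtop.fs.open` for cloud paths.
    """
    with open_fn(path, 'rb') as f:
        raw = f.read()
    # Only unpickle files from a trusted source; pickle can execute code on load.
    return pickle.load(io.BytesIO(raw))
```

On dataproc the call might look like `load_pickled_model(ANCESTRY_RF_MODEL_PATH, open_fn=hfs.open)` after `import hailtop.fs as hfs`, assuming the path constant points at a `gs://` location.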

matren395

looks just abt like the code i'd shared, with one silly thing missing, but i'm not sure if it's within the scope of this PR or not. yippee!

matren395 · 2025-02-20T16:35:17Z

v03_pipeline/lib/methods/sample_qc.py

+POP_PCA_LOADINGS_PATH = (
+    'gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht'
+)


could be nice to move this into some seqr GCP bucket, but then we're double spending on cost :/

I think we prefer not to store duplicate data in seqr buckets if it already exists publicly in another bucket

matren395 · 2025-02-20T16:35:32Z

v03_pipeline/lib/methods/sample_qc.py

@@ -1,4 +1,7 @@
+import pickle


and we know this is pinned to 1.5.2 yea ?


jklugherz · 2025-02-20T21:06:08Z

requirements.in

@@ -1,5 +1,6 @@
 hail==0.2.133
 luigi==3.5.2
-gnomad==0.6.4


had to upgrade gnomad to a more recent release in order to match the version that's used in the gnomad_qc package: https://github.com/broadinstitute/gnomad_qc/releases/tag/v4.1

matren395 · 2025-02-21T15:42:41Z

requirements.in

@@ -1,5 +1,6 @@
 hail==0.2.133
 luigi==3.5.2
-gnomad==0.6.4
+gnomad==0.8.0


woo i love gnomAD i love more current gnomAD versioning

matren395

adding apply_onnx stuff and some quick notes on the custom pop probs, but looking solid 👍


matren395 · 2025-02-21T15:46:14Z

v03_pipeline/lib/methods/sample_qc.py

+GNOMAD_POP_PROBABILITY_CUTOFFS = {
+    'afr': 0.93,
+    'ami': 0.98,
+    'amr': 0.89,
+    'asj': 0.94,
+    'eas': 0.95,
+    'fin': 0.92,
+    'mid': 0.55,
+    'nfe': 0.75,
+    'sas': 0.92,
+}


would it be preferable to have this in a .json file ? that's how it's currently handled in gnomAD, but this approach may be easier honestly. did your team have a convo abt this ?

I decided to put it in code instead of a json file, it's a small enough mapping that it makes sense to me to be a constant in the same script where it's used
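To make the constant's role concrete, here is a rough sketch of how per-population cutoffs like these could be applied to one sample's classifier probabilities. It is plain Python for illustration only, not the Hail/gnomad code in this PR. The function name `assign_population`, the `'oth'` fallback label, and the choice of `>=` for the comparison are all assumptions.

```python
from typing import Dict

# Values copied from the diff above.
GNOMAD_POP_PROBABILITY_CUTOFFS = {
    'afr': 0.93,
    'ami': 0.98,
    'amr': 0.89,
    'asj': 0.94,
    'eas': 0.95,
    'fin': 0.92,
    'mid': 0.55,
    'nfe': 0.75,
    'sas': 0.92,
}

# Assumption: label used when no population clears its cutoff.
UNASSIGNED_LABEL = 'oth'


def assign_population(
    probabilities: Dict[str, float],
    cutoffs: Dict[str, float] = GNOMAD_POP_PROBABILITY_CUTOFFS,
    unassigned: str = UNASSIGNED_LABEL,
) -> str:
    """Return the most probable population if it meets its own cutoff."""
    best_pop = max(probabilities, key=probabilities.get)
    # Assumption: a probability equal to the cutoff counts as passing.
    if probabilities[best_pop] >= cutoffs[best_pop]:
        return best_pop
    return unassigned
```

With these cutoffs, a sample at 0.80 for `nfe` would be assigned `nfe`, while a sample at 0.90 for `afr` would fall below the 0.93 cutoff and get the fallback label.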


matren395

no objections 🫡

jklugherz added 7 commits February 13, 2025 14:44

add ancestry imputation - qc_pop

eebce84

Merge remote-tracking branch 'origin/sample-qc-filtered-callrate' int…

61e197a

…o sample-qc-qc_pop

reduce duplication

0c5651a

Merge remote-tracking branch 'origin/sample-qc-filtered-callrate' int…

3295f01

…o sample-qc-qc_pop

merge

d8af4e4

merge

20aeb01

Merge remote-tracking branch 'origin/sample-qc-filtered-callrate' int…

a749041

…o sample-qc-qc_pop

jklugherz marked this pull request as ready for review February 14, 2025 22:13

jklugherz requested a review from a team as a code owner February 14, 2025 22:13

jklugherz requested a review from matren395 February 14, 2025 22:17

merge

f066854

bpblanken reviewed Feb 19, 2025


jklugherz added 6 commits February 19, 2025 13:49

merge

1d49030

merge

66ca46c

use gs

e060a5b

merge base branch

a0d2a66

mock mock whos there

4fb1b07

delete model file

f85a343

matren395 reviewed Feb 20, 2025


jklugherz added 3 commits February 20, 2025 12:24

try adding gnomad qc import

4e942b6

upgrade gnomad methods package

927ca85

use onnx model

139dab2

jklugherz changed the title ~~add ancestry imputation - qc_pop~~ add ancestry imputation Feb 20, 2025

jklugherz commented Feb 20, 2025


jklugherz requested a review from matren395 February 20, 2025 21:07

jklugherz added 3 commits February 20, 2025 16:41

test onnx

e290995

rename

31c6d92

rename 2

dfdd59b

matren395 reviewed Feb 21, 2025


jklugherz added 2 commits February 21, 2025 11:09

apply model func

c9f2c69

Merge remote-tracking branch 'origin/sample-qc-filtered-callrate' int…

322a95e

…o sample-qc-qc_pop

matren395 self-requested a review February 26, 2025 20:07

matren395 approved these changes Feb 26, 2025


jklugherz added 2 commits March 17, 2025 14:11

typo

ec44d61

typo 2

0909edd

jklugherz requested a review from bpblanken March 18, 2025 15:21

jklugherz merged commit 2642bc2 into sample-qc-filtered-callrate Mar 20, 2025
1 check passed
