Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ancestry imputation #1040

Open
wants to merge 24 commits into
base: sample-qc-filtered-callrate
Choose a base branch
from

Conversation

jklugherz
Copy link
Contributor

@jklugherz jklugherz commented Feb 13, 2025

adds qc_gen_anc and related fields to sample_qc json

@jklugherz jklugherz marked this pull request as ready for review February 14, 2025 22:13
@jklugherz jklugherz requested a review from a team as a code owner February 14, 2025 22:13
@jklugherz jklugherz requested a review from matren395 February 14, 2025 22:17
POP_PCA_LOADINGS_PATH = (
'gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht'
)
ANCESTRY_RF_MODEL_PATH = 'v03_pipeline/var/ancestry_imputation_model.pickle'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one gross bit here, this isn't likely to work on dataproc as is. The local package import doesn't work in the same way because of how jobs are submitted and packaged for pyspark (look at pyfiles.zip in airflow if you're interested!).

This is how we handle liftover:

GRCH38_TO_GRCH37_LIFTOVER_REF_PATH = (
    'gs://hail-common/references/grch38_to_grch37.over.chain.gz'
    if os.environ.get('HAIL_DATAPROC') == '1'
    else 'v03_pipeline/var/liftover/grch38_to_grch37.over.chain.gz'
)

but that's potentially even less possible with pickle. Is it possible for us to just treat the pickle as reference data (and put it in seqr-reference-data) and manage the pickle load by downloading the file to memory with a gcs client, then wrapping it in a BytesIO?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up until now we have not supported sample QC for local installations, so I think it is okay if this only works on dataproc, if that helps make this simpler at all

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, that this can only work on dataproc for now. The filesystem compatibility issues are a broad issue across the pipeline and worth some deeper thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can definitely put it in seqr-reference-data. The original script uses hl.hadoop_open, which should work and hfs.open probably will too.

Copy link
Contributor

@matren395 matren395 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks just abt like the code i'd shared, with one silly thing missing, but i'm not sure if it's within the scope of this PR or not. yippee!

Comment on lines +17 to +19
POP_PCA_LOADINGS_PATH = (
'gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht'
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be nice to move this into some seqr GCP bucket, but then we're double spending on cost:/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we prefer not to store duplicate data in seqr buckets if it exists publicly in another

@@ -1,4 +1,7 @@
import pickle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we know this is pinned to 1.5.2 yea ?

@jklugherz jklugherz changed the title add ancestry imputation - qc_pop add ancestry imputation Feb 20, 2025
@@ -1,5 +1,6 @@
hail==0.2.133
luigi==3.5.2
gnomad==0.6.4
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had to upgrade gnomad to a more recent release in order match the version that's used in gnomad_qc package: https://github.com/broadinstitute/gnomad_qc/releases/tag/v4.1

@jklugherz jklugherz requested a review from matren395 February 20, 2025 21:07
@@ -1,5 +1,6 @@
hail==0.2.133
luigi==3.5.2
gnomad==0.6.4
gnomad==0.8.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woo i love gnomAD i love more current gnomAD verisoning

Copy link
Contributor

@matren395 matren395 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding apply_onnx stuff and some quick notes on the custom pop probs, but looking solid 👍

Comment on lines +24 to +34
GNOMAD_POP_PROBABILITY_CUTOFFS = {
'afr': 0.93,
'ami': 0.98,
'amr': 0.89,
'asj': 0.94,
'eas': 0.95,
'fin': 0.92,
'mid': 0.55,
'nfe': 0.75,
'sas': 0.92,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be preferable to have this in a .json file ? that's how it's currently handled in gnomAD, but this approach may be easier honestly. did your team have a convo abt this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to put it in code instead of a json file, it's a small enough mapping that it makes sense to me to be a constant in the same script where it's used

@matren395 matren395 self-requested a review February 26, 2025 20:07
Copy link
Contributor

@matren395 matren395 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no objections 🫡

@jklugherz jklugherz requested a review from bpblanken March 18, 2025 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants