-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add ancestry imputation #1040
base: sample-qc-filtered-callrate
Are you sure you want to change the base?
add ancestry imputation #1040
Conversation
…o sample-qc-qc_pop
…o sample-qc-qc_pop
…o sample-qc-qc_pop
POP_PCA_LOADINGS_PATH = ( | ||
'gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht' | ||
) | ||
ANCESTRY_RF_MODEL_PATH = 'v03_pipeline/var/ancestry_imputation_model.pickle' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one gross bit here, this isn't likely to work on dataproc as is. The local package import doesn't work in the same way because of how jobs are submitted and packaged for pyspark (look at pyfiles.zip in airflow if you're interested!).
This is how we handle liftover:
GRCH38_TO_GRCH37_LIFTOVER_REF_PATH = (
'gs://hail-common/references/grch38_to_grch37.over.chain.gz'
if os.environ.get('HAIL_DATAPROC') == '1'
else 'v03_pipeline/var/liftover/grch38_to_grch37.over.chain.gz'
)
but that's potentially even less possible with pickle
. Is it possible for us to just treat the pickle as reference data (and put it in seqr-reference-data
) and manage the pickle load by downloading the file to memory with a gcs client, then wrapping it in a BytesIO
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
up until now we have not supported sample QC for local installations, so I think it is okay if this only works on dataproc, if that helps make this simpler at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe https://hail.is/docs/0.2/fs_api.html#hailtop.fs.open will help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, that this can only work on dataproc for now. The filesystem compatibility issues are a broad issue across the pipeline and worth some deeper thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can definitely put it in seqr-reference-data
. The original script uses hl.hadoop_open
, which should work and hfs.open
probably will too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks just abt like the code i'd shared, with one silly thing missing, but i'm not sure if it's within the scope of this PR or not. yippee!
POP_PCA_LOADINGS_PATH = ( | ||
'gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht' | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be nice to move this into some seqr GCP bucket, but then we're double spending on cost:/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we prefer not to store duplicate data in seqr buckets if it exists publicly in another
@@ -1,4 +1,7 @@ | |||
import pickle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and we know this is pinned to 1.5.2 yea ?
@@ -1,5 +1,6 @@ | |||
hail==0.2.133 | |||
luigi==3.5.2 | |||
gnomad==0.6.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
had to upgrade gnomad to a more recent release in order match the version that's used in gnomad_qc
package: https://github.com/broadinstitute/gnomad_qc/releases/tag/v4.1
@@ -1,5 +1,6 @@ | |||
hail==0.2.133 | |||
luigi==3.5.2 | |||
gnomad==0.6.4 | |||
gnomad==0.8.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woo i love gnomAD i love more current gnomAD verisoning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding apply_onnx stuff and some quick notes on the custom pop probs, but looking solid 👍
GNOMAD_POP_PROBABILITY_CUTOFFS = { | ||
'afr': 0.93, | ||
'ami': 0.98, | ||
'amr': 0.89, | ||
'asj': 0.94, | ||
'eas': 0.95, | ||
'fin': 0.92, | ||
'mid': 0.55, | ||
'nfe': 0.75, | ||
'sas': 0.92, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be preferable to have this in a .json file ? that's how it's currently handled in gnomAD, but this approach may be easier honestly. did your team have a convo abt this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to put it in code instead of a json file, it's a small enough mapping that it makes sense to me to be a constant in the same script where it's used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no objections 🫡
adds
qc_gen_anc
and related fields to sample_qc json