-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
move ploidy checks from callset integrity validation to family/sample validation #1057
base: dev
Are you sure you want to change the base?
Conversation
) | ||
failed_families = {} | ||
for family in families: | ||
discrepant_loadable_samples = set(family.samples.keys()) & discrepant_samples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only real change here is to use the passed families
to identify failures we care about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
) | ||
|
||
# Invalid X chromosome case, but only discrepant family samples are reported | ||
families = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relevant change to the tests is the addition of this case
@@ -180,6 +178,24 @@ def test_write_remapped_and_subsetted_callset_task_failed_sex_check_family( | |||
'samples': ['NA20885_1'], | |||
}, | |||
}, | |||
ploidy_check={ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These failures are new, but I thought it safer to incorporate them into the test case rather than trying to adjust the underlying callset or sex check ht, because those are used in other tests.
@@ -337,6 +353,7 @@ def test_write_remapped_and_subsetted_callset_task_all_families_failed( | |||
], | |||
}, | |||
}, | |||
'ploidy_check': {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to change the code that reads this json to account for the ploidy failures?
@@ -134,12 +136,21 @@ def create_table(self) -> hl.MatrixTable: | |||
sex_check_ht, | |||
remap_lookup, | |||
) | |||
families_failed_imputed_sex_ploidy = get_families_failed_imputed_sex_ploidy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should run this before sex check, and also exclude the failures from the sex check!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change
'ploidy_check': { | ||
'cde_1': { | ||
'samples': ['NA20881_1'], | ||
'reasons': "Found samples with misaligned ploidy with their provided imputed sex (first 10, if applicable) : ['NA20881_1']", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the (first 10, if applicable)
looks weird to me if these are only going to be one sample. Do we need to change the string error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message used to be part of a SeqrValidationError
message, so maybe that's why it was kept short. We can take out the (first 10, if applicable)
and just include all of the failed samples in this message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to format the reasons
as a list like the sex failures. So one string reason for each sample, rather than a list of samples embedded in a single "reason".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the check sample job in seqr, we parse the structured json from the failed_family_samples
and then use that information to report a human readable error but also to update the status for family models in seqr, update airtable etc. We have explicit parsing in this part of the code that takes that error dictionary and makes the error list we report in slack clear
For validation failures, we just take the error_messages
and pass them through to the end user with no additional processing. So I think we need to think about formatting this message very differently than for sex checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block is under failed_family_samples
now, and won't be considered a validation
failure. We decided that this check should no longer be considered a callset-wide validation, but run as a check over the families to load. We have a new key ("ploidy_check") for these though, which maybe needs an update to the check new samples job (reading the code though, it seems fine as is!).
No description provided.