Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move ploidy checks from callset integrity validation to family/sample validation #1057

Open
wants to merge 9 commits into
base: dev
Choose a base branch
from

Conversation

jklugherz
Copy link
Contributor

No description provided.

)
failed_families = {}
for family in families:
discrepant_loadable_samples = set(family.samples.keys()) & discrepant_samples
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only real change here is to use the passed families to identify failures we care about

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

)

# Invalid X chromosome case, but only discrepant family samples are reported
families = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relevant change to the tests is the addition of this case

@@ -180,6 +178,24 @@ def test_write_remapped_and_subsetted_callset_task_failed_sex_check_family(
'samples': ['NA20885_1'],
},
},
ploidy_check={
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These failures are new, but I thought it safer to incorporate them into the test case rather than trying to adjust the underlying callset or sex check ht, because those are used in other tests.

@jklugherz jklugherz changed the title hard fail imputed sex ploidy failures for pedigree samples only move ploidy checks from callset integrity validation to family/sample validation Mar 13, 2025
@jklugherz jklugherz marked this pull request as ready for review March 13, 2025 20:17
@jklugherz jklugherz requested a review from a team as a code owner March 13, 2025 20:17
@@ -337,6 +353,7 @@ def test_write_remapped_and_subsetted_callset_task_all_families_failed(
],
},
},
'ploidy_check': {},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to change the code that reads this json to account for the ploidy failures?

@@ -134,12 +136,21 @@ def create_table(self) -> hl.MatrixTable:
sex_check_ht,
remap_lookup,
)
families_failed_imputed_sex_ploidy = get_families_failed_imputed_sex_ploidy(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should run this before sex check, and also exclude the failures from the sex check!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change

'ploidy_check': {
'cde_1': {
'samples': ['NA20881_1'],
'reasons': "Found samples with misaligned ploidy with their provided imputed sex (first 10, if applicable) : ['NA20881_1']",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the (first 10, if applicable) looks weird to me if these are only going to be one sample. Do we need to change the string error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message used to be part of a SeqrValidationError message, so maybe that's why it was kept short. We can take out the (first 10, if applicable) and just include all of the failed samples in this message.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to format the reasons as a list like the sex failures. So one string reason for each sample, rather than a list of samples embedded in a single "reason".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the check sample job in seqr, we parse the structured json from the failed_family_samples and then use that information to report a human readable error but also to update the status for family models in seqr, update airtable etc. We have explicit parsing in this part of the code that takes that error dictionary and makes the error list we report in slack clear
For validation failures, we just take the error_messages and pass them through to the end user with no additional processing. So I think we need to think about formatting this message very differently than for sex checks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block is under failed_family_samples now, and won't be considered a validation failure. We decided that this check should no longer be considered a callset-wide validation, but run as a check over the families to load. We have a new key ("ploidy_check") for these though, which maybe needs an update to the check new samples job (reading the code though, it seems fine as is!).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants