Skip to content

Fix IdentifyDuplicates to resolve NoneType mismatches#788

Merged
kjaisingh merged 3 commits intomainfrom
kj_fix_identify_duplicates_sorting
Mar 13, 2025
Merged

Fix IdentifyDuplicates to resolve NoneType mismatches#788
kjaisingh merged 3 commits intomainfrom
kj_fix_identify_duplicates_sorting

Conversation

@kjaisingh
Copy link
Copy Markdown
Collaborator

@kjaisingh kjaisingh commented Mar 12, 2025

Description

The IdentifyDuplicates task looks for a match across several INFO fields in order to determine exact or insert matches. However, these INFO fields may not exist for certain records, and hence are represented as None in the tuple that stores them. At a certain point, we sort these tuples.

​As highlighted in Python documentation, when sorting a list of tuples, if there is a match on the search index within each tuple, the subsequent elements in the tuples are compared to define the sort order. This means that Python starts by comparing the first element of each tuple; if they are equal, it moves to the second element, and so on, until it finds elements that differ or reaches the end of the tuples. ​

Hence, if we reach an element in our sort process for which the INFO field exists in one record but not another, we get the following error: TypeError: '<' not supported between instances of 'str' and 'NoneType'. This PR is intended to avoid this by setting defaults for all INFO fields that may not exist in a record.

Testing

  • This Terra job includes a failed run that does not include this change.
  • This Terra job includes a successful run that does include this change.

@kjaisingh kjaisingh self-assigned this Mar 12, 2025
@kjaisingh kjaisingh added the bug Something isn't working label Mar 12, 2025
@kjaisingh kjaisingh marked this pull request as ready for review March 12, 2025 18:47
record.info.get('STRANDS'),
record.info.get('CPX_TYPE'),
record.info.get('CPX_INTERVALS')
record.info.get('CHR2') or "",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Can you give me access to the testing workspaces?

I would use the built-in .get(key, default) syntax for consistency across our scripts

Suggested change
record.info.get('CHR2') or "",
record.info.get('CHR2', ""),

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just shared the workspace with you, apologies for this oversight.

And thanks for the suggestion, just modified it to use this. Given the simplicity of the change, I did a quick sanity test locally, but let me know if I should re-build the docker / re-test with the .get(key, default) syntax as well.

Copy link
Copy Markdown
Collaborator

@epiercehoffman epiercehoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix and for running the extra test!

@kjaisingh kjaisingh merged commit 560975e into main Mar 13, 2025
8 checks passed
@kjaisingh kjaisingh deleted the kj_fix_identify_duplicates_sorting branch March 13, 2025 16:28
MattWellie pushed a commit to populationgenomics/gatk-sv that referenced this pull request Apr 7, 2025
…#788)

* Initial commit

* Removed trailing whitespace

* Used .get(x, y) to standardize with repo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants