-
Notifications
You must be signed in to change notification settings - Fork 3
Added initial script for merging call and region files #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev-v1.0
Are you sure you want to change the base?
Added initial script for merging call and region files #45
Conversation
|
|
||
| logging.basicConfig(level=logging.INFO) | ||
|
|
||
| def find_variant_by_accession(xml_file, accession) -> dict|None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loading the XML into memory and then querying would be much faster.
The XMLs are in MB - so it should not take up much of memory.
scripts/merge_region_call.py
Outdated
| header.info.add("ALLELE_TYPE", 1, "String", "Aggregated type of supporting calls") | ||
| header.info.add("CN", ".", "String", "Comma-separated list of copy numbers of supporting calls") | ||
| if "SVLEN" not in header.info: | ||
| del header.info['SVLEN'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| del header.info['SVLEN'] |
redundant
scripts/merge_region_call.py
Outdated
| if calls: | ||
| new_rec.info["ALLELE_NAME"] = ",".join(call["ALLELE_NAME"] for call in calls) | ||
| new_rec.info["ALLELE_TYPE"] = aggregate_sv_type(call["ALLELE_TYPE"] for call in calls) | ||
| new_rec.alts = tuple([f"<{call['ALLELE_TYPE']}>" for call in calls]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we are calculating final alts from SVTYPE?
scripts/merge_region_call.py
Outdated
| new_rec.info["ALLELE_NAME"] = ",".join(call["ALLELE_NAME"] for call in calls) | ||
| new_rec.info["ALLELE_TYPE"] = aggregate_sv_type(call["ALLELE_TYPE"] for call in calls) | ||
| new_rec.alts = tuple([f"<{call['ALLELE_TYPE']}>" for call in calls]) | ||
| new_rec.info["SVLEN"] = ",".join(call["SVLEN"][0] for call in calls if call["SVLEN"] is not None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- if there is no SVLEN we need to put a
.. As like any allele specific field SVELN needs one-to-one relationship with allele, otherwise we do not know which SVLEN belongs to which allele. - Is there a reason we are taking
call["SVLEN"][0](I assume for call there is always one allele and the list will always contain single value?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nakib103 , call["SVLEN"] is giving a list of single value (as opposed to the value itself) because of the type of SVLEN in VCF header.
NW-70