@mathiasbio (Collaborator) commented Nov 12, 2024

Description

The following PRs were merged into this one so that all new features could be tested together:

  1. Disable normal hardfilter: feat: add option to disable normal hardfilter #1509
  2. Create TNscope MNVs: fix: create tnscope mnvs #1524

This PR replaces the current bcftools concat method for merging the VarDict and TNscope VCFs with a custom Python script.

The bcftools concat method has two main issues:

  1. It doesn't merge INFO fields when a variant is present in both VCFs. The most significant consequence is that we lose the information that the variant was called by two callers. [User Story] Keep both FOUND_IN variantcaller tags in merged variants #1518
  2. It doesn't require variants to match exactly in the ALT column. For instance, if a variant has been called as an MNV in VarDict and as separate SNVs in TNscope, it merges only the first variant. [Bug] Merging of different variants VarDict and TNscope #1519
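
The two issues above can be illustrated with a minimal sketch. This is not the code of the actual merge script: it assumes each variant is a plain dict with hypothetical keys `chrom`, `pos`, `ref`, `alt` and `found_in`, and that two variants count as shared only when all four of CHROM, POS, REF and ALT are identical.

```python
from collections import OrderedDict


def variant_key(variant):
    """Identify a variant by CHROM, POS, REF and ALT (ALT must match exactly)."""
    return (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])


def merge_callsets(first, second):
    """Merge two lists of variant dicts, combining found_in for shared variants."""
    merged = OrderedDict()
    for variant in first + second:
        key = variant_key(variant)
        if key in merged:
            # Same variant seen by both callers: keep both caller tags.
            merged[key]["found_in"] = merged[key]["found_in"] + variant["found_in"]
        else:
            # Copy the record so the input lists are left untouched.
            merged[key] = {**variant, "found_in": list(variant["found_in"])}
    return list(merged.values())


# Hypothetical example records, not taken from a real case.
tnscope = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "T", "found_in": ["tnscope"]},
    {"chrom": "1", "pos": 200, "ref": "C", "alt": "G", "found_in": ["tnscope"]},
]
vardict = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "T", "found_in": ["vardict"]},
    # MNV starting at the same position as a TNscope SNV:
    # REF and ALT differ, so it is kept as a separate record.
    {"chrom": "1", "pos": 200, "ref": "CA", "alt": "GT", "found_in": ["vardict"]},
]

merged = merge_callsets(tnscope, vardict)
```

Here the SNV at position 100 becomes a single record tagged with both callers, while the SNV and the MNV at position 200 stay separate because their REF and ALT columns differ.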

The following lines are also added to the header of the merged VCF:

##merge_snv_variantcallers=merge_snv_variantcallers.py SNV.somatic.setamoeba.tnscope.research.normalised.vcf.gz SNV.somatic.setamoeba.vardict.research.normalised.vcf.gz --output output_merged.vcf
##merge_snv_variantcallers_processing_time=2025-01-20T11:18:24
##INFO_MERGE_SNV_VARIANTCALLERS=Values in merged INFO fields are listed in the order of the input files: first from SNV.somatic.setamoeba.tnscope.research.normalised.vcf.gz, then from SNV.somatic.setamoeba.vardict.research.normalised.vcf.gz
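
As a rough sketch of how header lines like these could be assembled, the function below builds the three lines from the two input paths and the output name. The function name `build_merge_header_lines` and its parameters are hypothetical, and using only the file names (without directories) is an assumption based on the example lines shown here.

```python
from datetime import datetime
from pathlib import Path


def build_merge_header_lines(vcf1, vcf2, output, now=None):
    """Return three provenance header lines for a merged VCF (illustrative only)."""
    now = now or datetime.now()
    # Assumption: only the file names, not the full paths, go into the header.
    name1, name2 = Path(vcf1).name, Path(vcf2).name
    timestamp = now.isoformat(timespec="seconds")
    return [
        f"##merge_snv_variantcallers=merge_snv_variantcallers.py {name1} {name2} --output {output}",
        f"##merge_snv_variantcallers_processing_time={timestamp}",
        "##INFO_MERGE_SNV_VARIANTCALLERS=Values in merged INFO fields are listed "
        f"in the order of the input files: first from {name1}, then from {name2}",
    ]


header_lines = build_merge_header_lines(
    "/some/dir/first.vcf.gz",  # hypothetical paths
    "second.vcf.gz",
    "output_merged.vcf",
    now=datetime(2025, 1, 20, 11, 18, 24),
)
```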

The file path is also removed from FOUND_IN during pre-processing by edit_vcf_info.py.

Based on @khurrammaqbool's suggestion, the AF and DP INFO fields keep a single value, taken from the first VCF, and new list fields hold the AF and DP values from both VCFs. Previously, the AF and DP fields themselves were transformed into lists.
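
A minimal sketch of this INFO handling, under stated assumptions: INFO fields are modelled as plain dicts, and the list field names `AF_LIST` and `DP_LIST` are hypothetical placeholders, since the exact names are not given in this description. Other fields shared by both callers are listed in input-file order, as stated in the header line above.

```python
def merge_info(first_info, second_info, single_value_keys=("AF", "DP")):
    """Merge two INFO dicts for a variant found in both VCFs (illustrative only)."""
    merged = {}
    # Walk the keys of the first VCF, then any keys only present in the second.
    keys = list(first_info) + [key for key in second_info if key not in first_info]
    for key in keys:
        values = [info[key] for info in (first_info, second_info) if key in info]
        if key in single_value_keys:
            # Keep a single value (from the first VCF when it has one) ...
            merged[key] = values[0]
            # ... and store the values from both callers in a separate list field.
            merged[f"{key}_LIST"] = values
        else:
            merged[key] = values if len(values) > 1 else values[0]
    return merged


# Hypothetical INFO values, not taken from a real case.
tnscope_info = {"AF": 0.12, "DP": 250, "FOUND_IN": "tnscope"}
vardict_info = {"AF": 0.10, "DP": 240, "FOUND_IN": "vardict"}

merged_info = merge_info(tnscope_info, vardict_info)
```

Keeping AF and DP as single values means downstream filters that expect one number per field keep working, while the list fields preserve what each caller reported.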

Changed

  • Replaced bcftools concat with a custom Python script for merging VCFs from VarDict and TNscope

Documentation

  • N/A
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • [balsamic_filters.rst]
    • [balsamic_methods.rst]

Tests

Feature Tests

Verify that both TNscope and VarDict show up as callers for merged variants in Scout


Verify that merged variants keep AF and DP as single-value INFO fields, and that separate AF and DP list fields are created with the values from each caller (VarDict and TNscope). See sheet: https://docs.google.com/spreadsheets/d/1kB2vNaEBmol0tX3HUR3UY1LPQCtWZLutixGwnvbkmhY/edit?gid=1578036768#gid=1578036768

  • Successful

Verify that running vcf-validator (https://github.com/EBIvariation/vcf-validator) does not report any new errors that weren't present in the original merged VCF from 16.0.0.

Errors in the clinical.filtered.pass VCF from uphippo, v16.0.0:

According to the VCF specification, the input file is not valid
Error: Error in meta-data section. This occurs 1 time(s), first time in line 265.
Error: Format is not a colon-separated list of alphanumeric strings. This occurs 113 time(s), first time in line 279.
Warning: Reference and alternate alleles do not share the first nucleotide. This occurs 2 time(s), first time in line 442.

Errors in the clinical.filtered.pass VCF from uphippo, this PR:

According to the VCF specification, the input file is not valid
Error: Error in meta-data section. This occurs 1 time(s), first time in line 280.
Error: Format is not a colon-separated list of alphanumeric strings. This occurs 115 time(s), first time in line 293.
Warning: Reference and alternate alleles do not share the first nucleotide. This occurs 2 time(s), first time in line 460.
  • Successful. The "Format is not a colon-separated list of alphanumeric strings" error comes from TNscope and affects the majority of TNscope variants. If this were an issue for uploading to Scout, we would have seen it.
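
The comparison above was done by eye. As a sketch of how the same check could be scripted, the snippet below parses the two validator summaries quoted in this description and reports message types that appear only in the new one. The function names are hypothetical, and the regular expression only assumes the line format seen in the output above.

```python
import re

# Matches lines such as:
# "Error: <message>. This occurs <n> time(s), first time in line <m>."
MESSAGE_PATTERN = re.compile(
    r"^(Error|Warning): (.+?)\. This occurs (\d+) time\(s\), first time in line (\d+)\.$"
)


def parse_report(text):
    """Map (severity, message) to its occurrence count for each summary line."""
    counts = {}
    for line in text.splitlines():
        match = MESSAGE_PATTERN.match(line.strip())
        if match:
            severity, message, count, _first_line = match.groups()
            counts[(severity, message)] = int(count)
    return counts


def new_message_types(baseline_text, candidate_text):
    """Return message types present in the candidate report but not the baseline."""
    return set(parse_report(candidate_text)) - set(parse_report(baseline_text))


baseline = """\
Error: Error in meta-data section. This occurs 1 time(s), first time in line 265.
Error: Format is not a colon-separated list of alphanumeric strings. This occurs 113 time(s), first time in line 279.
Warning: Reference and alternate alleles do not share the first nucleotide. This occurs 2 time(s), first time in line 442.
"""
candidate = """\
Error: Error in meta-data section. This occurs 1 time(s), first time in line 280.
Error: Format is not a colon-separated list of alphanumeric strings. This occurs 115 time(s), first time in line 293.
Warning: Reference and alternate alleles do not share the first nucleotide. This occurs 2 time(s), first time in line 460.
"""

new_types = new_message_types(baseline, candidate)
```

Comparing message types rather than counts or line numbers matches the intent of the test: counts and positions shift when the header and the variant set change, but no new kind of error should appear.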

Pipeline Integrity Tests

  • Report deliver (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

  • Atlas documentation
    • N/A
    • Updated: [Link]
  • Web portal for Clinical Genomics
    • N/A
    • Updated: [Link]

Panel of Normal specific criteria

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because [Justification].
    • Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

  • Stored files in Housekeeper
    • N/A
    • Updated: [Link]
  • CG (CLI and delivered/uploaded files)
    • N/A
    • Updated: [Link]
  • Servers (configuration files on Hasta)
    • N/A
    • Updated: [Link]
  • Scout interface
    • N/A
    • Updated: [Link]

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows.
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.


codecov bot commented Nov 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.45%. Comparing base (7d529e6) to head (fe0d227).
Report is 47 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1499      +/-   ##
===========================================
- Coverage    99.48%   99.45%   -0.03%     
===========================================
  Files           40       40              
  Lines         1932     2020      +88     
===========================================
+ Hits          1922     2009      +87     
- Misses          10       11       +1     
Flag Coverage Δ
unittests 99.45% <ø> (-0.03%) ⬇️


Base automatically changed from release_v16.0.0 to master November 19, 2024 14:45
@mathiasbio mathiasbio changed the base branch from master to develop November 21, 2024 11:20

Base automatically changed from create_tnscope_mnvs to develop February 7, 2025 16:00
@mathiasbio mathiasbio requested review from a team and fevac February 13, 2025 12:20
@fevac (Contributor) left a comment

🚀


@mathiasbio mathiasbio merged commit e0459b8 into develop Feb 14, 2025
9 checks passed
@mathiasbio mathiasbio deleted the merge_snv_variants_script branch February 14, 2025 09:39
@mathiasbio mathiasbio mentioned this pull request Mar 3, 2025
15 tasks
mathiasbio added a commit that referenced this pull request Apr 8, 2025
### Added:

* Added option to disable hard filter of variants in matched normal #1509
* Added check to verify sample sex for all workflows #1516
* SOR filter to WGS TN SNV quality filter #1506
* GT field to IGH-DUX4 variant #1527
* ONC field annotations from Clinvar #1527
* Added memory option #1535
* Added max SOR 3 to TNscope TGA TN workflow #1526
* Added max RPA 12 to TNscope TGA workflow #1526

### Changed:

* Reworked bcftools filters #1509
* Renamed high_normal_tumor_af_frac to in_normal #1509
* check to verify sample sex for all workflows #1516
* Merging SNVs into MNVs in TNscope TGA #1524
* Change raw delivery SNV file for TGA to before any post-processing #1524
* Changed VarDict and TNscope VCF merged method to custom script #1499
* Changed QC thresholds for WGS normal and WES #1477
* Change VarDict memory usage to fix crashes in production #1537
* Updated cluster resources for tnscope WGS TN #1535
* Disable SV calling in TNscope #1541

### Removed:

* Remove WGS-level GC-bias metric from TGA workflow #1521
* Remove parallelization of VarDict per chromosome #1544

### Fixed:

* Merged VarDict and TNscope variants now correctly show both callers in FOUND_IN info field #1499
* Fixed somalier container #1538
