Commit 6d72b4c

feat: move svs from snvs (#1584)
#### Changed
- moved SVs from VarDict into SV VCF for TGA TO
1 parent fa3a16f commit 6d72b4c

12 files changed, with 125 additions and 54 deletions

BALSAMIC/constants/workflow_params.py

Lines changed: 7 additions & 0 deletions
@@ -42,6 +42,13 @@
         "sequencing_type": ["targeted"],
         "workflow_solution": ["BALSAMIC"],
     },
+    "vardictsv": {
+        "mutation": "somatic",
+        "mutation_type": "SV",
+        "analysis_type": ["single"],
+        "sequencing_type": ["targeted"],
+        "workflow_solution": ["BALSAMIC"],
+    },
     "merged": {
         "mutation": "somatic",
         "mutation_type": "SNV",

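To illustrate how an entry such as `vardictsv` might be consumed, the following is a minimal sketch that filters a caller table by its attributes. Only the `vardictsv` entry is copied from the hunk above; the `vardict` entry uses assumed values for contrast, and `select_callers` is a hypothetical helper, not a function from the BALSAMIC code base.

```python
from typing import Dict, List

# Excerpt of a caller table. "vardictsv" is copied from the diff above;
# the "vardict" entry holds ASSUMED values and is only here for contrast.
VCF_DICT: Dict[str, dict] = {
    "vardict": {
        "mutation": "somatic",
        "mutation_type": "SNV",
        "analysis_type": ["paired", "single"],
        "sequencing_type": ["targeted"],
        "workflow_solution": ["BALSAMIC"],
    },
    "vardictsv": {
        "mutation": "somatic",
        "mutation_type": "SV",
        "analysis_type": ["single"],
        "sequencing_type": ["targeted"],
        "workflow_solution": ["BALSAMIC"],
    },
}


def select_callers(
    table: Dict[str, dict],
    mutation_type: str,
    analysis_type: str,
    sequencing_type: str,
) -> List[str]:
    """Return caller names matching all three criteria (hypothetical helper)."""
    return [
        name
        for name, attrs in table.items()
        if attrs["mutation_type"] == mutation_type
        and analysis_type in attrs["analysis_type"]
        and sequencing_type in attrs["sequencing_type"]
    ]


# With this table, the new entry is picked up for tumor-only targeted SV
# calling, while a paired (tumor-normal) request does not select it.
sv_single = select_callers(VCF_DICT, "SV", "single", "targeted")
sv_paired = select_callers(VCF_DICT, "SV", "paired", "targeted")
print(sv_single, sv_paired)
```

Because `analysis_type` is `["single"]`, a lookup of this kind would only return `vardictsv` for tumor-only runs, which is consistent with the commit message.
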
BALSAMIC/models/config.py

Lines changed: 2 additions & 1 deletion
@@ -72,7 +72,7 @@ class VarcallerAttribute(BaseModel):
     """Holds variables for variant caller software
     Attributes:
         mutation: str of mutation class
-        mutation_type: str of mutation type
+        mutation_type: str for mutation type
         analysis_type: list of str for analysis types
         workflow_solution: list of str for workflows
         sequencing_type: list of str for workflows
@@ -101,6 +101,7 @@ class VCFModel(BaseModel):
     merged: VarcallerAttribute
     manta: VarcallerAttribute
     vardict: VarcallerAttribute
+    vardictsv: VarcallerAttribute
     dellysv: VarcallerAttribute
     cnvkit: VarcallerAttribute
     ascat: VarcallerAttribute

BALSAMIC/snakemake_rules/variant_calling/snv_t_varcall_tga.rule

Lines changed: 53 additions & 1 deletion
@@ -113,7 +113,7 @@ rule post_process_vardict:
     input:
         vcf = vcf_dir + "vardict/vardict.sorted.vcf"
     output:
-        vcf_vardict = vcf_dir + "vardict/SNV.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz",
+        vcf_vardict = vcf_dir + "vardict/all.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz",
         namemap=vcf_dir + "SNV.somatic." + config["analysis"]["case_id"] + ".vardict.sample_name_map"
     params:
         tmpdir=tempfile.mkdtemp(prefix=tmp_dir),
@@ -149,3 +149,55 @@ echo -e \"tTUMOR\" > {output.namemap}.tumor ;
 
 rm -rf {params.tmpdir} ;
         """
+
+rule gatk_update_vcf_sequence_dictionary:
+    input:
+        ref = config["reference"]["reference_genome"],
+        vcf = vcf_dir + "vardict/all.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz",
+    output:
+        vcf_vardict = vcf_dir + "all.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz",
+    params:
+        tmpdir=tempfile.mkdtemp(prefix=tmp_dir),
+    benchmark:
+        Path(benchmark_dir,"gatk_update_vcf_sequence_dictionary" + config["analysis"]["case_id"] + ".tsv").as_posix()
+    singularity:
+        Path(singularity_image,config["bioinfo_tools"].get("gatk") + ".sif").as_posix()
+    message:
+        "Running GATK UpdateVCFSequenceDictionary on VarDict VCF."
+    shell:
+        """
+export TMPDIR={params.tmpdir};
+
+ref=$(echo {input.ref} | sed 's/.fasta/.dict/g') ;
+
+gatk UpdateVCFSequenceDictionary -V {input.vcf} --source-dictionary $ref --replace --output {output.vcf_vardict} ;
+
+rm -rf {params.tmpdir}
+        """
+
+rule vardict_move_svs:
+    input:
+        vcf_vardict=vcf_dir + "all.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz"
+    output:
+        vcf_vardict=vcf_dir + "SNV.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz",
+        vcf_vardict_sv=vcf_dir + "SV.somatic." + config["analysis"]["case_id"] + ".vardictsv.vcf.gz",
+    params:
+        housekeeper_id={"id": config["analysis"]["case_id"], "tags": "research"},
+        tmpdir=tempfile.mkdtemp(prefix=tmp_dir),
+    benchmark:
+        Path(benchmark_dir,"vardict_move_svs" + config["analysis"]["case_id"] + ".tsv").as_posix()
+    singularity:
+        Path(singularity_image,config["bioinfo_tools"].get("bcftools") + ".sif").as_posix()
+    message:
+        "Moving SVs from VarDict VCF to separate VCF."
+    shell:
+        """
+export TMPDIR={params.tmpdir};
+
+bcftools view -i 'INFO/SVTYPE!=""' -Oz -o {output.vcf_vardict_sv} {input.vcf_vardict} ;
+tabix -p vcf -f {output.vcf_vardict_sv} ;
+bcftools view -e 'INFO/SVTYPE!=""' -Oz -o {output.vcf_vardict} {input.vcf_vardict} ;
+tabix -p vcf -f {output.vcf_vardict} ;
+
+rm -rf {params.tmpdir}
+        """

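The `vardict_move_svs` rule applies the same filter expression twice, once with `-i` (include) and once with `-e` (exclude), so each record should end up in exactly one of the two output files. As a rough illustration of that partition, here is a minimal pure-Python sketch working on uncompressed VCF lines. It is a simplification and not how bcftools evaluates expressions: it assumes that any record carrying an `SVTYPE` key in its INFO column counts as an SV, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def has_svtype(record: str) -> bool:
    """Return True if the INFO column (8th field) contains an SVTYPE key."""
    fields = record.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False
    info_keys = [entry.split("=", 1)[0] for entry in fields[7].split(";")]
    return "SVTYPE" in info_keys


def split_svs(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition VCF lines into (SNV/INDEL lines, SV lines).

    Header lines (starting with '#') are copied to both outputs,
    mirroring the fact that both bcftools outputs keep the header.
    """
    snv_lines: List[str] = []
    sv_lines: List[str] = []
    for line in lines:
        if line.startswith("#"):
            snv_lines.append(line)
            sv_lines.append(line)
        elif has_svtype(line):
            sv_lines.append(line)
        else:
            snv_lines.append(line)
    return snv_lines, sv_lines


# Tiny made-up example: one header line, one SNV record, one deletion record.
header = "##fileformat=VCFv4.2"
snv_record = "\t".join(["1", "100", ".", "A", "T", "50", "PASS", "DP=80;AF=0.1"])
sv_record = "\t".join(["1", "200", ".", "A", "<DEL>", "50", "PASS", "SVTYPE=DEL;END=900"])

snv_out, sv_out = split_svs([header, snv_record, sv_record])
print(len(snv_out), len(sv_out))
```
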
BALSAMIC/snakemake_rules/variant_calling/snv_tn_varcall_tga.rule

Lines changed: 26 additions & 0 deletions
@@ -149,3 +149,29 @@ echo -e \"nNORMAL\" > {output.namemap}.normal ;
 
 rm -rf {params.tmpdir} ;
         """
+
+rule gatk_update_vcf_sequence_dictionary:
+    input:
+        ref = config["reference"]["reference_genome"],
+        vcf = vcf_dir + "vardict/SNV.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz",
+    output:
+        vcf_vardict = vcf_dir + "SNV.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz",
+    params:
+        housekeeper_id = {"id": config["analysis"]["case_id"], "tags": "research"},
+        tmpdir=tempfile.mkdtemp(prefix=tmp_dir),
+    benchmark:
+        Path(benchmark_dir,"gatk_update_vcf_sequence_dictionary" + config["analysis"]["case_id"] + ".tsv").as_posix()
+    singularity:
+        Path(singularity_image,config["bioinfo_tools"].get("gatk") + ".sif").as_posix()
+    message:
+        "Running GATK UpdateVCFSequenceDictionary on VarDict VCF."
+    shell:
+        """
+export TMPDIR={params.tmpdir};
+
+ref=$(echo {input.ref} | sed 's/.fasta/.dict/g') ;
+
+gatk UpdateVCFSequenceDictionary -V {input.vcf} --source-dictionary $ref --replace --output {output.vcf_vardict} ;
+
+rm -rf {params.tmpdir}
+        """

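In both copies of `gatk_update_vcf_sequence_dictionary`, the dictionary path is derived with `sed 's/.fasta/.dict/g'`. In that pattern the dot matches any character and the `g` flag replaces every match, so a path with `fasta` elsewhere in it could be rewritten in more than one place. The sketch below contrasts a regex substitution equivalent to the sed call with a suffix-based alternative. It is an illustration only: `dict_path_for` and `sed_like` are hypothetical names, and the suffix approach assumes the reference file ends in a single `.fasta` extension.

```python
import re
from pathlib import Path


def sed_like(reference_fasta: str) -> str:
    """Mimic sed 's/.fasta/.dict/g': unescaped dot, every match replaced."""
    return re.sub(r".fasta", ".dict", reference_fasta)


def dict_path_for(reference_fasta: str) -> str:
    """Swap only the final extension for '.dict' (assumes a '.fasta' suffix)."""
    return Path(reference_fasta).with_suffix(".dict").as_posix()


# For an ordinary path both approaches agree.
plain = "/ref/genome.fasta"
print(sed_like(plain), dict_path_for(plain))

# For a made-up path with 'fasta' inside a directory name they differ,
# because the regex also rewrites the 'xfasta' part of the directory.
tricky = "/data/xfasta_refs/genome.fasta"
print(sed_like(tricky), dict_path_for(tricky))
```

In shell, anchoring and escaping the pattern (for example `sed 's/\.fasta$/.dict/'`) would be one way to get the same effect as the suffix-based variant.
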
BALSAMIC/snakemake_rules/variant_calling/somatic_sv_postprocess_and_filter_tumor_only.rule

Lines changed: 2 additions & 6 deletions
@@ -31,12 +31,8 @@ rule bcftools_process_SV_CNV:
 
 rule svdb_merge_tumor_only:
     input:
-        vcf = expand(
-                vcf_dir + "SV.somatic." + config["analysis"]["case_id"] + ".{caller}.vcf.gz",
-                caller=somatic_caller_sv) +
-              expand(
-                vcf_dir + "CNV.somatic." + config["analysis"]["case_id"] + ".{caller}.vcf.gz",
-                caller=somatic_caller_cnv)
+        vcf = expand(vcf_dir + "SV.somatic." + config["analysis"]["case_id"] + ".{caller}.vcf.gz", caller=somatic_caller_sv) +
+              expand(vcf_dir + "CNV.somatic." + config["analysis"]["case_id"] + ".{caller}.vcf.gz", caller=somatic_caller_cnv)
     output:
         vcf_svdb = vcf_dir + "SV.somatic." + config["analysis"]["case_id"] + ".svdb.vcf.gz",
         namemap = vcf_dir + "SV.somatic." + config["analysis"]["case_id"] + ".svdb.sample_name_map",
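
The `input` of `svdb_merge_tumor_only` is built by expanding a `{caller}` wildcard over two caller lists. The sketch below shows, under stated assumptions, how a `vardictsv` entry in the SV list would translate into an extra input file whose name matches the `vcf_vardict_sv` output of `vardict_move_svs`. `expand_callers` is a simplified stand-in for Snakemake's `expand()`, and the values of `vcf_dir`, `case_id`, and both caller lists are made up for illustration.

```python
from typing import List


def expand_callers(pattern: str, callers: List[str]) -> List[str]:
    """Simplified stand-in for Snakemake's expand() with one {caller} wildcard."""
    return [pattern.format(caller=caller) for caller in callers]


# All four values below are ASSUMED for the example.
vcf_dir = "vcf/"
case_id = "case1"
somatic_caller_sv = ["manta", "dellysv", "vardictsv"]
somatic_caller_cnv = ["cnvkit"]

svdb_inputs = expand_callers(
    vcf_dir + "SV.somatic." + case_id + ".{caller}.vcf.gz", somatic_caller_sv
) + expand_callers(
    vcf_dir + "CNV.somatic." + case_id + ".{caller}.vcf.gz", somatic_caller_cnv
)

for path in svdb_inputs:
    print(path)
```
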

BALSAMIC/snakemake_rules/variant_calling/vardict_pre_and_postprocessing.rule

Lines changed: 1 addition & 27 deletions
@@ -47,33 +47,7 @@ rule vardict_sort:
 mkdir -p {params.tmpdir};
 export TMPDIR={params.tmpdir};
 
-awk -f {params.sort_vcf} {input.vcf} > {output.vcf_sorted}
+awk -f {params.sort_vcf} {input.vcf} > {output.vcf_sorted} ;
 
 rm -rf {params.tmpdir} ;
         """
-
-rule gatk_update_vcf_sequence_dictionary:
-    input:
-        ref = config["reference"]["reference_genome"],
-        vcf = vcf_dir + "vardict/SNV.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz",
-    output:
-        vcf_vardict = vcf_dir + "SNV.somatic." + config["analysis"]["case_id"] + ".vardict.vcf.gz",
-    params:
-        housekeeper_id = {"id": config["analysis"]["case_id"], "tags": "research"},
-        tmpdir=tempfile.mkdtemp(prefix=tmp_dir),
-    benchmark:
-        Path(benchmark_dir,"gatk_update_vcf_sequence_dictionary" + config["analysis"]["case_id"] + ".tsv").as_posix()
-    singularity:
-        Path(singularity_image,config["bioinfo_tools"].get("gatk") + ".sif").as_posix()
-    message:
-        "Running GATK UpdateVCFSequenceDictionary on VarDict VCF."
-    shell:
-        """
-export TMPDIR={params.tmpdir};
-
-ref=$(echo {input.ref} | sed 's/.fasta/.dict/g') ;
-
-gatk UpdateVCFSequenceDictionary -V {input.vcf} --source-dictionary $ref --replace --output {output.vcf_vardict} ;
-
-rm -rf {params.tmpdir}
-        """

BALSAMIC/utils/rule.py

Lines changed: 3 additions & 10 deletions
@@ -27,19 +27,12 @@ def get_vcf(config, var_caller, sample):
     input: BALSAMIC config file
     output: retrieve list of vcf files
     """
-
     vcf = []
     for v in var_caller:
         for s in sample:
-            vcf.append(
-                config["vcf"][v]["mutation_type"]
-                + "."
-                + config["vcf"][v]["mutation"]
-                + "."
-                + s
-                + "."
-                + v
-            )
+            mutation_type = config["vcf"][v]["mutation_type"]
+            mutation = config["vcf"][v]["mutation"]
+            vcf.append(f"{mutation_type}.{mutation}.{s}.{v}")
     return vcf
 
 
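
The refactored `get_vcf` builds each name as `<mutation_type>.<mutation>.<sample>.<caller>`. The following usage sketch reproduces the new function body from the hunk above and calls it with a small, made-up config; the `case1` sample name and the trimmed config dictionary are assumptions for illustration.

```python
def get_vcf(config, var_caller, sample):
    """Return VCF name stems for every caller/sample combination."""
    vcf = []
    for v in var_caller:
        for s in sample:
            mutation_type = config["vcf"][v]["mutation_type"]
            mutation = config["vcf"][v]["mutation"]
            vcf.append(f"{mutation_type}.{mutation}.{s}.{v}")
    return vcf


# Trimmed, made-up config holding only the keys that get_vcf reads.
example_config = {
    "vcf": {
        "vardict": {"mutation": "somatic", "mutation_type": "SNV"},
        "vardictsv": {"mutation": "somatic", "mutation_type": "SV"},
    }
}

names = get_vcf(example_config, ["vardict", "vardictsv"], ["case1"])
print(names)
```

With this input, the two VarDict-derived callers resolve to different prefixes (`SNV` and `SV`), matching the file names written by `vardict_move_svs`.
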

CHANGELOG.rst

Lines changed: 8 additions & 0 deletions
@@ -37,6 +37,10 @@ Fixed:
 
 Added:
 ^^^^^^
+* sbatch script for snakemake sequential job submission https://github.com/Clinical-Genomics/BALSAMIC/pull/1558
+* logfile for balsamic wrapper https://github.com/Clinical-Genomics/BALSAMIC/pull/1558
+* analysis status text file for easier prodbioinfo handling https://github.com/Clinical-Genomics/BALSAMIC/pull/1558
+* requested memory to each rule https://github.com/Clinical-Genomics/BALSAMIC/pull/1558
 
 
 Changed:
@@ -48,6 +52,10 @@ Changed:
 
 Removed:
 ^^^^^^^^
+* removed immediate submit functionality https://github.com/Clinical-Genomics/BALSAMIC/pull/1558
+* removed unused code for benchmark plotting https://github.com/Clinical-Genomics/BALSAMIC/pull/1558
+* removed functionality to disable variant callers https://github.com/Clinical-Genomics/BALSAMIC/pull/1558
+
 
 Fixed:
 ^^^^^^

docs/balsamic_methods.rst

Lines changed: 2 additions & 2 deletions
@@ -13,11 +13,11 @@ The resulting BAM is quality controlled using samtools v1.15.1 :superscript:`26`
 Duplicate reads are collapsed using the UMI function in Dedup from sentieon-tools :superscript:`15`, and invalid mates are corrected with tools collate and fixmate from samtools v1.15.1 :superscript:`26`.
 Coverage metrics are collected from the final BAM using Sambamba v0.8.2 :superscript:`27`, Mosdepth v0.3.3 :superscript:`28` and CollectHsMetrics from Picard tools v2.27.1 :superscript:`6`.
 Results of the quality controlled steps were summarized by MultiQC v1.22.3 :superscript:`7`.
-Small somatic mutations (SNVs and INDELs) were called for each sample using VarDict 2019.06.04 :superscript:`8` and Sentieon TNscope :superscript:`16` and normalised using bcftools norm before being filtered using the criteria (*MQ >= 30, DP >= 50 (20 for exome-samples), VD >= 5, Minimum AF >= 0.005, Maximum AF < 1, GNOMADAF_popmax <= 0.005, swegen AF < 0.01*) and then merged using a custom python script.
+Small somatic mutations (SNVs and INDELs) and SVs were called using VarDict 2019.06.04 :superscript:`8` and Sentieon TNscope :superscript:`16` and normalised using bcftools norm before being filtered using the criteria (*MQ >= 30, DP >= 50 (20 for exome-samples), VD >= 5, Minimum AF >= 0.005, Maximum AF < 1, GNOMADAF_popmax <= 0.005, swegen AF < 0.01*) and then merged using a custom python script.
 Only those variants that fulfilled the filtering criteria and scored as `PASS` in the VCF file were reported.
 Structural variants (SV) were called using Manta v1.6.0 :superscript:`9` on a post-processed version of the BAM which was base-quality capped to 70, and Delly v1.0.3 :superscript:`10`.
 Copy number variations (CNV) were called using CNVkit v0.9.10 :superscript:`11`.
-The variant calls from CNVkit, Manta and Delly were merged using SVDB v2.8.1 :superscript:`12`.
+The structural variant calls from CNVkit, Manta, Delly and VarDict were merged using SVDB v2.8.1 :superscript:`12`.
 The clinical set of SNV and SV is also annotated and filtered against loqusDB curated frequency of observed variants (frequency < 0.01) from non-cancer cases and only annotated using frequency of observed variants from cancer cases (somatic and germline).
 All variants were annotated using Ensembl VEP v113.4 :superscript:`13`. We used vcfanno v0.3.3 :superscript:`14`
 to annotate somatic variants for their population allele frequency from gnomAD v2.1.1 :superscript:`18`, CADD v1.7 :superscript:`24`, SweGen :superscript:`22` and frequency of observed variants in normal samples. The MSI (MicroSatellite Instability) score was computed using MSIsensor-pro v1.2.0 :superscript:`25`.

docs/balsamic_sv_cnv.rst

Lines changed: 13 additions & 6 deletions
@@ -34,6 +34,11 @@ Depending on the sequencing type, BALSAMIC is currently running the following st
      - tumor-normal, tumor-only
      - somatic, germline
      - SV
+   * - VarDict
+     - TGA, WES
+     - tumor-only
+     - somatic
+     - SV
    * - TIDDIT
      - WGS
      - tumor-normal, tumor-only
@@ -136,13 +141,15 @@ Further information regarding the TIDDIT tumor normal filtration: As translocati
      - WGS
        tumor-only
    * - | 1. manta
-       | 2. dellysv
-       | 3. cnvkit
-       | 4. dellycnv
+       | 2. vardict
+       | 3. dellysv
+       | 4. cnvkit
+       | 5. dellycnv
      - | 1. manta
-       | 2. dellysv
-       | 3. cnvkit
-       | 4. dellycnv
+       | 2. vardict
+       | 3. dellysv
+       | 4. cnvkit
+       | 5. dellycnv
      - | 1. manta
        | 2. dellysv
        | 3. ascat
