Commit b5e954c

Subdivide FilterBatch and add SV count plots to enable IQR cutoff selection (#220)
1 parent 20684c7 commit b5e954c

30 files changed: +1096 -1007 lines changed

README.md

Lines changed: 5 additions & 2 deletions
@@ -329,7 +329,10 @@ Generates variant metrics for filtering.
 ## <a name="generate-batch-metrics">FilterBatch</a>
 *Formerly Module03*
 
-Filters poor quality variants and filters outlier samples.
+Filters poor quality variants and filters outlier samples. This workflow can be run all at once with the WDL at `wdl/FilterBatch.wdl`, or it can be run in three steps to enable tuning of outlier filtration cutoffs. The three subworkflows are:
+1. FilterBatchSites: Per-batch variant filtration
+2. PlotSVCountsPerSample: Visualize SV counts per sample per type to help choose an IQR cutoff for outlier filtering, and preview outlier samples for a given cutoff
+3. FilterBatchSamples: Per-batch outlier sample filtration; provide an appropriate `outlier_cutoff_nIQR` based on the SV count plots and outlier previews from step 2.
 
 #### Prerequisites:
 * [GenerateBatchMetrics](#generate-batch-metrics)
@@ -441,7 +444,7 @@ gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2
 ```
 
 * BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
-* FilterOutlierSamples - remove outlier samples with unusually high or low number of SVs
+* FilterOutlierSamplesPostMinGQ - remove outlier samples with unusually high or low number of SVs
 * FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation
 
 ## <a name="annotate-vcf">AnnotateVcf</a> (in development)

input_templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl

Lines changed: 19 additions & 12 deletions
@@ -55,14 +55,19 @@ The following workflows are included in this workspace, to be executed in this o
 4. `04-GatherBatchEvidence`: Per-batch copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
 5. `05-ClusterBatch`: Per-batch variant clustering
 6. `06-GenerateBatchMetrics`: Per-batch variant filtering, metric generation
-7. `07-FilterBatch`: Per-batch variant filtering; outlier exclusion
-8. (Skip for a single batch) `08-MergeBatchSites`: Site merging of SVs discovered across batches, run on a cohort-level `sample_set_set`
-9. `09-GenotypeBatch`: Per-batch genotyping of all sites in the cohort. Use `09-GenotypeBatch_SingleBatch` if you only have one batch.
-10. `10-RegenotypeCNVs`: Cohort-level genotype refinement of some depth calls. Use `10-RegenotypeCNVs_SingleBatch` if you only have one batch.
-11. `11-MakeCohortVcf`: Cohort-level cross-batch integration; complex variant resolution and re-genotyping; VCF cleanup. Use `11-MakeCohortVcf_SingleBatch` if you only have one batch.
-12. `12-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets. Use `12-AnnotateVcf_SingleBatch` if you only have one batch.
+7. `07a-FilterBatchSites`: Per-batch variant filtering
+8. `07b-PlotSVCountsPerSample`: Plot SV counts per sample per SV type to enable choice of IQR cutoff for outlier filtration in `07c-FilterBatchSamples`
+9. `07c-FilterBatchSamples`: Per-batch outlier sample filtration
+10. (Skip for a single batch) `08-MergeBatchSites`: Site merging of SVs discovered across batches, run on a cohort-level `sample_set_set`
+11. `09-GenotypeBatch`: Per-batch genotyping of all sites in the cohort. Use `09-GenotypeBatch_SingleBatch` if you only have one batch.
+12. `10-RegenotypeCNVs`: Cohort-level genotype refinement of some depth calls. Use `10-RegenotypeCNVs_SingleBatch` if you only have one batch.
+13. `11-MakeCohortVcf`: Cohort-level cross-batch integration; complex variant resolution and re-genotyping; VCF cleanup. Use `11-MakeCohortVcf_SingleBatch` if you only have one batch.
+14. `12-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets. Use `12-AnnotateVcf_SingleBatch` if you only have one batch.
 
-Additional modules, such as those for filtering and visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv).
+Additional downstream modules, such as those for filtering and visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv). See **Downstream steps** towards the bottom of this page for more information.
+
+Extra workflows (Not part of canonical pipeline, but included for your convenience. May require manual configuration):
+* `FilterOutlierSamples`: Filter outlier samples (in terms of SV counts) from a single VCF. Recommended to run `07b-PlotSVCountsPerSample` beforehand (reconfigured with the single VCF you want to filter) to enable IQR cutoff choice.
 
 For detailed instructions on running the pipeline in Terra, see **Step-by-step instructions** below.
 
@@ -178,24 +183,26 @@ Read the full documentation for these modules [here](https://github.com/broadins
 * Use the same `sample_set` definitions you used for `03-TrainGCNV` and `04-GatherBatchEvidence`.
 
 
-#### 07-FilterBatch
+#### 07a-FilterBatchSites, 07b-PlotSVCountsPerSample, 07c-FilterBatchSamples
 
-Read the full FilterBatch documentation [here](https://github.com/broadinstitute/gatk-sv#filter-batch).
+These three workflows make up FilterBatch; they are subdivided in this workspace to enable tuning of outlier filtration cutoffs. Read the full FilterBatch documentation [here](https://github.com/broadinstitute/gatk-sv#filter-batch).
 * Use the same `sample_set` definitions you used for `03-TrainGCNV` through `06-GenerateBatchMetrics`.
-* The default value for `outlier_cutoff_nIQR`, which is used to filter samples that have an abnormal number of SV calls, is 10000. This essentially means that no samples are filtered. You should adjust this value depending on your scientific needs.
+* `07a-FilterBatchSites` does not require user intervention
+* `07b-PlotSVCountsPerSample` produces SV count plots and files, as well as a preview of the outlier samples to be filtered, but it does not perform any filtering of the VCFs. The input `N_IQR_cutoff` is used to visualize filtration thresholds on the SV count plots and preview the samples to be filtered; the default value is set to 6. You can adjust this value depending on your needs, and you can re-run the workflow with new `N_IQR_cutoff` values until the plots and outlier sample lists suit the purposes of your study. Once you have chosen an IQR cutoff, provide it to the `N_IQR_cutoff` input in `07c-FilterBatchSamples` to filter the VCFs using the chosen cutoff.
+* `07c-FilterBatchSamples` performs outlier sample filtration, removing samples with an abnormal number of SV calls of at least one SV type. To tune the filtering threshold to your needs, edit the `N_IQR_cutoff` input value based on the plots and outlier sample preview lists from `07b-PlotSVCountsPerSample`. The default value for `N_IQR_cutoff` in this step is 10000, which essentially means that no samples are filtered.
 
 #### 08-MergeBatchSites
 
 Read the full MergeBatchSites documentation [here](https://github.com/broadinstitute/gatk-sv#merge-batch-sites).
 * If you only have one batch, skip this workflow.
-* For a multi-batch cohort, `08-MergeBatchSites` is a cohort-level workflow, so it is run on a `sample_set_set` containing all of the batches in the cohort. You can create this `sample_set_set` while you are launching the `08-MergeBatchSites` workflow: click "Select Data", choose "Create new sample_set_set [...]", check all the batches to include (all of the ones used in `03-TrainGCNV` through `07-FilterBatch`), and give it a name that follows the **Sample ID requirements**.
+* For a multi-batch cohort, `08-MergeBatchSites` is a cohort-level workflow, so it is run on a `sample_set_set` containing all of the batches in the cohort. You can create this `sample_set_set` while you are launching the `08-MergeBatchSites` workflow: click "Select Data", choose "Create new sample_set_set [...]", check all the batches to include (all of the ones used in `03-TrainGCNV` through `07c-FilterBatchSamples`), and give it a name that follows the **Sample ID requirements**.
 
 <img alt="creating a cohort sample_set_set" title="How to create a cohort sample_set_set" src="https://i.imgur.com/zKEtSbe.png" width="500">
 
 #### 09-GenotypeBatch
 
 Read the full GenotypeBatch documentation [here](https://github.com/broadinstitute/gatk-sv#genotype-batch).
-* Use the same `sample_set` definitions you used for `03-TrainGCNV` through `07-FilterBatch`.
+* Use the same `sample_set` definitions you used for `03-TrainGCNV` through `07c-FilterBatchSamples`.
 * If you only have one batch, use the `09-GenotypeBatch_SingleBatch` version of the workflow.
 
 #### 10-RegenotypeCNVs, 11-MakeCohortVcf, and 12-AnnotateVcf
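
Reviewer note: the dashboard text in this file describes `N_IQR_cutoff` only in words. The minimal Python sketch below illustrates how such a cutoff could translate into a preview list of outlier samples. It assumes the conventional fence definition (a sample is an outlier if its count for any SV type falls below Q1 - N*IQR or above Q3 + N*IQR); the exact quantile method used by the pipeline is not shown in this commit, and the function names and sample counts are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(counts, n_iqr):
    """Return (low, high) fences at n_iqr interquartile ranges beyond Q1 and Q3."""
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - n_iqr * iqr, q3 + n_iqr * iqr


def preview_outliers(counts_by_type, n_iqr):
    """List samples whose SV count is outside the fences for at least one SV type.

    counts_by_type maps SV type -> {sample ID -> SV count} (hypothetical layout).
    """
    outliers = set()
    for counts in counts_by_type.values():
        low, high = iqr_bounds(list(counts.values()), n_iqr)
        outliers.update(s for s, c in counts.items() if c < low or c > high)
    return sorted(outliers)


# Made-up counts for illustration only.
example_counts = {
    "DEL": {"s1": 100, "s2": 102, "s3": 98, "s4": 101, "s5": 99, "s6": 500},
    "DUP": {"s1": 50, "s2": 51, "s3": 49, "s4": 50, "s5": 10, "s6": 52},
}

print(preview_outliers(example_counts, 6))      # ['s5', 's6']
print(preview_outliers(example_counts, 10000))  # []
```

With a cutoff of 10000 the fences are so wide that nothing is flagged, which is consistent with the dashboard's remark that this default essentially filters no samples.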

input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatch.json.tmpl

Lines changed: 0 additions & 19 deletions
This file was deleted.
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+{
+"FilterBatchSamples.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
+"FilterBatchSamples.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",
+"FilterBatchSamples.linux_docker" : "${workspace.linux_docker}",
+
+"FilterBatchSamples.N_IQR_cutoff": "10000",
+
+"FilterBatchSamples.batch": "${this.sample_set_id}",
+"FilterBatchSamples.vcfs" : "${this.sites_filtered_vcfs}",
+"FilterBatchSamples.sv_counts": "${this.sv_counts}"
+}
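
Reviewer note: the `FilterBatchSamples.N_IQR_cutoff` value in this template is the one a user is expected to change after inspecting the SV count plots. In Terra this would normally be edited in the workflow configuration UI; as a rough sketch of the same change applied to a copy of the template, the hypothetical helper below (`set_iqr_cutoff` is not part of the repository) rewrites that single key and leaves the other inputs untouched. The cutoff is written as a string because the template stores it that way.

```python
import json


def set_iqr_cutoff(config_text, cutoff, key="FilterBatchSamples.N_IQR_cutoff"):
    """Return the config JSON text with the cutoff input replaced."""
    config = json.loads(config_text)
    if key not in config:
        raise KeyError(f"{key} not found in workflow configuration")
    config[key] = str(cutoff)  # the template stores the cutoff as a string
    return json.dumps(config, indent=2)


# Abbreviated copy of the template shown above, for illustration.
template = """{
  "FilterBatchSamples.N_IQR_cutoff": "10000",
  "FilterBatchSamples.batch": "${this.sample_set_id}"
}"""

print(set_iqr_cutoff(template, 6))
```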
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+{
+"FilterBatchSites.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
+
+"FilterBatchSites.batch": "${this.sample_set_id}",
+"FilterBatchSites.depth_vcf" : "${this.clustered_depth_vcf}",
+"FilterBatchSites.manta_vcf" : "${this.clustered_manta_vcf}",
+"FilterBatchSites.wham_vcf" : "${this.clustered_wham_vcf}",
+"FilterBatchSites.melt_vcf" : "${this.clustered_melt_vcf}",
+"FilterBatchSites.evidence_metrics": "${this.metrics}",
+"FilterBatchSites.evidence_metrics_common": "${this.metrics_common}"
+}
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+{
+"FilterOutlierSamples.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
+"FilterOutlierSamples.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",
+"FilterOutlierSamples.linux_docker" : "${workspace.linux_docker}",
+
+"FilterOutlierSamples.N_IQR_cutoff": "6",
+
+"FilterOutlierSamples.name": "${this.sample_set_set_id}",
+"FilterOutlierSamples.vcf" : "${this.output_vcf}"
+}

input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.SingleBatch.json.tmpl

Lines changed: 4 additions & 4 deletions
@@ -16,14 +16,14 @@
 
 "GenotypeBatch.batch": "${this.sample_set_id}",
 "GenotypeBatch.rf_cutoffs": "${this.cutoffs}",
-"GenotypeBatch.batch_depth_vcf": "${this.filtered_depth_vcf}",
-"GenotypeBatch.batch_pesr_vcf": "${this.filtered_pesr_vcf}",
+"GenotypeBatch.batch_depth_vcf": "${this.outlier_filtered_depth_vcf}",
+"GenotypeBatch.batch_pesr_vcf": "${this.outlier_filtered_pesr_vcf}",
 "GenotypeBatch.ped_file": "${workspace.cohort_ped_file}",
 "GenotypeBatch.bin_exclude": "${workspace.bin_exclude}",
 "GenotypeBatch.discfile": "${this.merged_PE}",
 "GenotypeBatch.coveragefile": "${this.merged_bincov}",
 "GenotypeBatch.splitfile": "${this.merged_SR}",
 "GenotypeBatch.medianfile": "${this.median_cov}",
-"GenotypeBatch.cohort_depth_vcf": "${this.filtered_depth_vcf}",
-"GenotypeBatch.cohort_pesr_vcf": "${this.filtered_pesr_vcf}"
+"GenotypeBatch.cohort_depth_vcf": "${this.outlier_filtered_depth_vcf}",
+"GenotypeBatch.cohort_pesr_vcf": "${this.outlier_filtered_pesr_vcf}"
 }

input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.json.tmpl

Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@
 
 "GenotypeBatch.batch": "${this.sample_set_id}",
 "GenotypeBatch.rf_cutoffs": "${this.cutoffs}",
-"GenotypeBatch.batch_depth_vcf": "${this.filtered_depth_vcf}",
-"GenotypeBatch.batch_pesr_vcf": "${this.filtered_pesr_vcf}",
+"GenotypeBatch.batch_depth_vcf": "${this.outlier_filtered_depth_vcf}",
+"GenotypeBatch.batch_pesr_vcf": "${this.outlier_filtered_pesr_vcf}",
 "GenotypeBatch.ped_file": "${workspace.cohort_ped_file}",
 "GenotypeBatch.bin_exclude": "${workspace.bin_exclude}",
 "GenotypeBatch.discfile": "${this.merged_PE}",
Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 {
 "MergeBatchSites.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
 "MergeBatchSites.cohort": "${this.sample_set_set_id}",
-"MergeBatchSites.pesr_vcfs": "${this.sample_sets.filtered_pesr_vcf}",
-"MergeBatchSites.depth_vcfs": "${this.sample_sets.filtered_depth_vcf}"
+"MergeBatchSites.pesr_vcfs": "${this.sample_sets.outlier_filtered_pesr_vcf}",
+"MergeBatchSites.depth_vcfs": "${this.sample_sets.outlier_filtered_depth_vcf}"
 }
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+{
+"PlotSVCountsPerSample.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
+
+"PlotSVCountsPerSample.N_IQR_cutoff": "6",
+
+"PlotSVCountsPerSample.prefix": "${this.sample_set_id}",
+"PlotSVCountsPerSample.vcfs" : "${this.sites_filtered_vcfs}",
+"PlotSVCountsPerSample.vcf_identifiers" : "${this.algorithms_filtersites}"
+}
