Skip to content

Commit d0e48cf

Browse files
committed
Add support for variant scoring by MIVMIR, GICAM models
Signed-off-by: Tor Björgen <tor.bjorgen@scilifelab.se>
1 parent 84da5c0 commit d0e48cf

19 files changed

Lines changed: 568 additions & 15 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -30,6 +30,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
3030
- Env variable NXF_SINGULARITY_NEW_PID_NAMESPACE = false to accommodate hisat2 running with latest Nextflow and Singularity [#775](https://github.com/nf-core/raredisease/pull/775)
3131
- Parameter `exclude_alt` to filter alignments to alt/unplaced contigs after alignment using samtools view, retaining only primary chromosomes (GRCh37: 1-22,X,Y,MT / GRCh38: chr1-chr22,chrX,chrY,chrM). Note that enabling this will restrict variant calling to these chromosomes [#803](https://github.com/nf-core/raredisease/pull/803)]
3232
- Parameters `save_all_mapped_as_cram` and `save_noalt_mapped_as_cram` to replace `save_mapped_as_cram`, allowing independent control over publishing unfiltered and alt-filtered alignment files as CRAM [#807](https://github.com/nf-core/raredisease/pull/807)
33+
- Add SNV scoring by MIVMIR, GICAM models [#816](https://github.com/nf-core/raredisease/pull/812)
3334

3435
### `Changed`
3536

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -175,7 +175,7 @@ For more details about the output files and reports, please refer to the
175175

176176
nf-core/raredisease was written in a collaboration between the Clinical Genomics nodes in Sweden, with major contributions from [Ramprasad Neethiraj](https://github.com/ramprasadn), [Anders Jemt](https://github.com/jemten), [Lucia Pena Perez](https://github.com/Lucpen), and [Mei Wu](https://github.com/projectoriented) at Clinical Genomics Stockholm.
177177

178-
Additional contributors were [Sima Rahimi](https://github.com/sima-r), [Gwenna Breton](https://github.com/Gwennid) and [Emma Västerviga](https://github.com/EmmaCAndersson) (Clinical Genomics Gothenburg); [Halfdan Rydbeck](https://github.com/hrydbeck) and [Lauri Mesilaakso](https://github.com/ljmesi) (Clinical Genomics Linköping); [Subazini Thankaswamy Kosalai](https://github.com/sysbiocoder) (Clinical Genomics Örebro); [Annick Renevey](https://github.com/rannick), [Peter Pruisscher](https://github.com/peterpru) and [Eva Caceres](https://github.com/fevac) (Clinical Genomics Stockholm); [Ryan Kennedy](https://github.com/ryanjameskennedy) (Clinical Genomics Lund); [Anders Sune Pedersen](https://github.com/asp8200) (Danish National Genome Center) and [Lucas Taniguti](https://github.com/lmtani).
178+
Additional contributors were [Sima Rahimi](https://github.com/sima-r), [Gwenna Breton](https://github.com/Gwennid) and [Emma Västerviga](https://github.com/EmmaCAndersson) (Clinical Genomics Gothenburg); [Halfdan Rydbeck](https://github.com/hrydbeck) and [Lauri Mesilaakso](https://github.com/ljmesi) (Clinical Genomics Linköping); [Subazini Thankaswamy Kosalai](https://github.com/sysbiocoder) (Clinical Genomics Örebro); [Annick Renevey](https://github.com/rannick), [Peter Pruisscher](https://github.com/peterpru), [Eva Caceres](https://github.com/fevac) and [Tor Björgen](https://github.com/torbjorgen) (Clinical Genomics Stockholm); [Ryan Kennedy](https://github.com/ryanjameskennedy) (Clinical Genomics Lund); [Anders Sune Pedersen](https://github.com/asp8200) (Danish National Genome Center) and [Lucas Taniguti](https://github.com/lmtani).
179179

180180
We thank the nf-core community for their extensive assistance in the development of this pipeline.
181181

conf/modules/rank_variants.config

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -83,6 +83,31 @@ process {
8383
ext.prefix = { "${meta.id}_snv_ranked_${meta.set}" }
8484
}
8585

86+
withName: '.*RANK_VARIANTS_SNV:TABIX_BGZIPTABIX_GICAM' {
87+
ext.prefix = { "${meta.id}_snv_ranked_gicam_${meta.set}" }
88+
}
89+
90+
withName: '.*RANK_VARIANTS_SNV:TABIX_BGZIP_GENMOD_GICAM' {
91+
ext.prefix = { "${meta.id}_snv_ranked_${meta.set}" }
92+
publishDir = [
93+
path: { "${params.outdir}/rank_and_filter" },
94+
mode: params.publish_dir_mode,
95+
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
96+
]
97+
}
98+
99+
withName: '.*RANK_VARIANTS_SNV:BCFTOOLS_MERGE_GENMOD_GICAM' {
100+
ext.args = { "--columns MivmirScore,MivmirExplanation,GicamScore" }
101+
ext.prefix = { "${meta.id}_snv_ranked_${meta.set}" }
102+
}
103+
104+
withName: '.*RANK_VARIANTS_SNV:TABIX_TABIX' {
105+
publishDir = [
106+
path: { "${params.outdir}/rank_and_filter" },
107+
mode: params.publish_dir_mode,
108+
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
109+
]
110+
}
86111
}
87112

88113
//

docs/output.md

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -69,6 +69,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
6969
- [Filtering and ranking](#filtering-and-ranking)
7070
- [Filter_vep](#filter_vep)
7171
- [GENMOD](#genmod)
72+
- [MIVMIR, GICAM](#mivmir-gicam)
7273
- [Mobile element analysis](#mobile-element-analysis)
7374
- [Calling mobile elements](#calling-mobile-elements)
7475
- [Annotating mobile elements](#annotating-mobile-elements)
@@ -527,6 +528,14 @@ We recommend using vcfanno to annotate SNVs with precomputed CADD scores (files
527528

528529
</details>
529530

531+
#### MIVMIR, GICAM
532+
[MIVMIR](../modules/local/mivmir/meta.yml) and [GICAM](../modules/local/gicam/meta.yml) are two machine learning models used to
533+
infer a pathogenicity score for SNVs. In essence, MIVMIR infer SNV pathogenicity and GICAM improves precision for
534+
duo, trio, ... analysis. MIVMIR, GICAM can be enabled by setting the `--rank_with_mivmir_gicam` feature flag and
535+
adds annotations `INFO/MivmirScore`, `INFO/MivmirExplanation`, `INFO/GicamScore`.
536+
Only `<case_id>_snv_ranked_<research|clinical>.vcf.gz` contains the above annotations.
537+
Refer to the module documentation `.yml` for more information on required inputs and output formats.
538+
530539
### Mobile element analysis
531540

532541
#### Calling mobile elements

docs/usage.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -287,7 +287,7 @@ The mandatory and optional parameters for each category are tabulated below.
287287
| vcfanno_resources<sup>2</sup> | vcfanno_lua |
288288
| vcfanno_toml<sup>3</sup> | vep_filters/vep_filters_scout_fmt<sup>10</sup> |
289289
| vep_cache_version | cadd_resources<sup>11</sup> |
290-
| vep_cache<sup>4</sup> | |
290+
| vep_cache<sup>4</sup> | rank_with_mivmir_gicam<sup>12</sup> |
291291
| gnomad_af<sup>5</sup> | |
292292
| score_config_snv<sup>6</sup> | |
293293
| variant_consequences_snv<sup>7</sup> | |
@@ -307,6 +307,7 @@ no header and the following columns: `CHROM POS REF_ALLELE,ALT_ALLELE AF`. Sampl
307307
<sup>9</sup>Used by GENMOD while modeling the variants. Contains a list of loci that show [reduced penetrance](https://medlineplus.gov/genetics/understanding/inheritance/penetranceexpressivity/) in people. Sample file [here](https://github.com/nf-core/test-datasets/blob/raredisease/reference/reduced_penetrance.tsv).<br />
308308
<sup>10</sup> This file contains a list of candidate genes (with [HGNC](https://www.genenames.org/) IDs) that is used to split the variants into candidate variants and research variants. Research variants contain all the variants, while candidate variants are a subset of research variants and are associated with candidate genes. Sample file [here](https://github.com/nf-core/test-datasets/blob/raredisease/reference/hgnc.txt). Not required if `--skip_subworkflows generate_clinical_set` is set. To skip this splitting entirely, add `generate_clinical_set` to `--skip_subworkflows`.<br />
309309
<sup>11</sup>Path to a folder containing cadd annotations. Equivalent of the data/annotations/ folder described [here](https://github.com/kircherlab/CADD-scripts/#manual-installation), and it is used to calculate CADD scores for small indels. <br />
310+
<sup>12</sup> Enable variant scoring using MIVMIR, GICAM machine learning models.
310311

311312
:::note
312313
We use CADD only to annotate small indels. To annotate SNVs with precomputed CADD scores, pass the file containing CADD scores as a resource to vcfanno instead. Files containing the precomputed CADD scores for SNVs can be downloaded from [here](https://cadd.gs.washington.edu/download) (download files listed under the description: "All possible SNVs of GRCh3<7/8>/hg3<7/8>")

main.nf

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -105,6 +105,7 @@ workflow NFCORE_RAREDISEASE {
105105
val_readcount_intervals
106106
val_reduced_penetrance
107107
val_rtg_truthvcfs
108+
val_rank_with_mivmir_gicam
108109
val_run_mt_for_wes
109110
val_run_rtgvcfeval
110111
val_sambamba_regions
@@ -508,6 +509,7 @@ workflow NFCORE_RAREDISEASE {
508509
val_mt_subsample_rd,
509510
val_mt_subsample_seed,
510511
val_platform,
512+
val_rank_with_mivmir_gicam,
511513
val_run_mt_for_wes,
512514
val_run_rtgvcfeval,
513515
val_sample_id_map,
@@ -619,6 +621,7 @@ workflow {
619621
params.readcount_intervals,
620622
params.reduced_penetrance,
621623
params.rtg_truthvcfs,
624+
params.rank_with_mivmir_gicam,
622625
params.run_mt_for_wes,
623626
params.run_rtgvcfeval,
624627
params.sambamba_regions,

modules/local/gicam/main.nf

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
process GICAM_INFER {
2+
// https://github.com/Clinical-Genomics/rdds/tree/master/src/rdds/gicam
3+
4+
tag "${meta.id}"
5+
label 'process_high'
6+
7+
container "docker.io/clinicalgenomics/rdds_mivmir:v1.12.0-rc6"
8+
9+
beforeScript "mkdir ${task.workDir}/rdds-tmp"
10+
afterScript "rm -r ${task.workDir}/rdds-tmp"
11+
containerOptions {[
12+
workflow.containerEngine.equals("singularity") ? "--bind ${task.workDir}/rdds-tmp:/rdds/tmp" : "",
13+
workflow.containerEngine.equals("docker") ? "--tmpfs /rdds/tmp": "",
14+
""
15+
].minus("").join(" ")}
16+
17+
input:
18+
tuple val(meta), path(input_vcf)
19+
20+
output:
21+
tuple val(meta), path('*-predictions.vcf'), emit: vcf
22+
path "versions.yml", emit: versions
23+
24+
when:
25+
task.ext.when == null || task.ext.when
26+
27+
script:
28+
def VERSION = 'v1.12.0-rc6'
29+
"""
30+
. /opt/pyenv/bin/activate
31+
export PYTHONPATH=/rdds/src
32+
python3 -m rdds.gicam infer-vcf --cpu_cores ${task.cpus} ${input_vcf}
33+
34+
cat <<-END_VERSIONS > versions.yml
35+
"${task.process}":
36+
gicam: ${VERSION}
37+
python: \$(python --version | sed 's/Python //g')
38+
END_VERSIONS
39+
"""
40+
}

modules/local/gicam/meta.yml

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
name: gicam
2+
description: Machine learning tool to improve precision for duo, trio, ... analysis
3+
keywords:
4+
- score
5+
- ranking
6+
- gicam
7+
tools:
8+
- mivmir:
9+
description: Model for SNV ranking in conjunction with MIVMIR.
10+
This model improves precision for duo, trio, ... analysis situations by reducing MIVMIR scores for variants that's
11+
not following the appropriate GENMOD genetic inheritance model.
12+
Applied to MIVMIR scores as a post-processing step.
13+
14+
VCF key inputs to the model are
15+
- MivmirScore, (0, 1)
16+
- RankScoreNormalized, (0, 1) as produced by GENMOD using rank_model_genmod_gicam.ini custom scoring config
17+
18+
The tool adds one key to the VCF
19+
- GicamScore (0, 1) where 1.0 inferred pathogenic.
20+
homepage: https://github.com/clinicalgenomics/rdds
21+
documentation: https://github.com/Clinical-Genomics/rdds/tree/master/src/rdds/gicam
22+
doi: ""
23+
licence: ["MIT"]
24+
identifier: ""
25+
input:
26+
- - meta:
27+
type: map
28+
description: |
29+
Groovy Map containing sample information
30+
e.g. [ id:'test', single_end:false ]
31+
- input_vcf:
32+
type: file
33+
description: vcf file
34+
pattern: "*.{vcf}"
35+
ontologies: []
36+
output:
37+
vcf:
38+
- - meta:
39+
type: map
40+
description: |
41+
Groovy Map containing sample information
42+
e.g. [ id:'test', single_end:false ]
43+
- "*-predictions.vcf":
44+
type: file
45+
description: Scored output VCF file
46+
pattern: "*.{vcf}"
47+
ontologies: [ ]
48+
authors:
49+
- "@torbjorgen"
50+
maintainers:
51+
- "@torbjorgen"
Lines changed: 121 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
[Version]
2+
version = 1
3+
name = rank_model_for_gicam
4+
5+
[Categories]
6+
7+
[[inheritance_models]]
8+
category_aggregation = min
9+
10+
[[variant_call_quality_filter]]
11+
category_aggregation = sum
12+
13+
[model_score]
14+
category = variant_call_quality_filter
15+
data_type = integer
16+
description = Inheritance model score
17+
field = INFO
18+
info_key = ModelScore
19+
record_rule = min
20+
separators = ',',':',
21+
22+
[[not_reported]]
23+
score = 0
24+
25+
[[low_qual]]
26+
score = -5
27+
lower = 0
28+
upper = 10
29+
30+
[[medium_qual]]
31+
score = -2
32+
lower = 10
33+
upper = 20
34+
35+
[[high_qual]]
36+
score = 0
37+
lower = 20
38+
upper = 300
39+
40+
[genetic_models]
41+
category = inheritance_models
42+
data_type = string
43+
description = Inheritance models followed for the variant
44+
field = INFO
45+
info_key = GeneticModels
46+
record_rule = max
47+
separators = ',', ':', '|',
48+
49+
[[ad]]
50+
priority = 1
51+
score = 1
52+
string = 'AD'
53+
54+
[[ad_dn]]
55+
score = 1
56+
priority = 1
57+
string = 'AD_dn'
58+
59+
[[ar]]
60+
score = 1
61+
priority = 1
62+
string = 'AR_hom'
63+
64+
[[ar_dn]]
65+
score = 1
66+
priority = 1
67+
string = 'AR_hom_dn'
68+
69+
[[ar_comp]]
70+
score = 1
71+
priority = 1
72+
string = 'AR_comp'
73+
74+
[[ar_comp_dn]]
75+
score = 1
76+
priority = 1
77+
string = 'AR_comp_dn'
78+
79+
[[xr]]
80+
score = 1
81+
priority = 1
82+
string = 'XR'
83+
84+
[[xr_dn]]
85+
score = 1
86+
priority = 1
87+
string = 'XR_dn'
88+
89+
[[xd]]
90+
score = 1
91+
priority = 1
92+
string = 'XD'
93+
94+
[[xd_dn]]
95+
score = 1
96+
priority = 1
97+
string = 'XD_dn'
98+
99+
[[not_reported]]
100+
score = -12
101+
102+
[filter]
103+
category = variant_call_quality_filter
104+
data_type = string
105+
description = The filters for the variant
106+
field = FILTER
107+
record_rule = min
108+
separators = ';',
109+
110+
[[not_reported]]
111+
score = 0
112+
113+
[[pass]]
114+
score = 3
115+
priority = 1
116+
string = 'PASS'
117+
118+
[[dot]]
119+
score = 3
120+
priority = 2
121+
string = '.'
Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
nextflow_process {
2+
3+
name "Test Process GICAM_INFER"
4+
script "modules/local/gicam/main.nf"
5+
process "GICAM_INFER"
6+
7+
test("Test GICAM inference on annotated VCF") {
8+
9+
when {
10+
params {
11+
}
12+
process {
13+
"""
14+
input[0] = [
15+
[ id:'test', single_end:false ], // meta map
16+
// TODO@torbjorgen: Update data path to main repo
17+
// VCF as annotated by rank_variants subworkflow (including mivmir and genmod custom inheritance config), prior to gicam inference call
18+
file('https://raw.githubusercontent.com/torbjorgen/nfcore-test-datasets/mivmir-gicam-test-data/testdata/justhusky_snv_gicam.vcf', checkIfExists: true)
19+
]
20+
"""
21+
}
22+
}
23+
24+
then {
25+
assert process.success
26+
assert snapshot(process.out).match()
27+
}
28+
29+
}
30+
31+
}

0 commit comments

Comments
 (0)