Skip to content

Commit cdc427e

Browse files
committed
Add support for variant scoring by MIVMIR, GICAM models
Signed-off-by: Tor Björgen <tor.bjorgen@scilifelab.se>
1 parent a41090e commit cdc427e

22 files changed

Lines changed: 980 additions & 28 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
### `Added`
99

10+
- Add SNV scoring by MIVMIR, GICAM models [#812](https://github.com/nf-core/raredisease/pull/812)
11+
1012
### `Changed`
1113

1214
### `Fixed`

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -175,7 +175,7 @@ For more details about the output files and reports, please refer to the
175175

176176
nf-core/raredisease was written in a collaboration between the Clinical Genomics nodes in Sweden, with major contributions from [Ramprasad Neethiraj](https://github.com/ramprasadn), [Anders Jemt](https://github.com/jemten), [Lucia Pena Perez](https://github.com/Lucpen), and [Mei Wu](https://github.com/projectoriented) at Clinical Genomics Stockholm.
177177

178-
Additional contributors were [Sima Rahimi](https://github.com/sima-r), [Gwenna Breton](https://github.com/Gwennid) and [Emma Västerviga](https://github.com/EmmaCAndersson) (Clinical Genomics Gothenburg); [Halfdan Rydbeck](https://github.com/hrydbeck) and [Lauri Mesilaakso](https://github.com/ljmesi) (Clinical Genomics Linköping); [Subazini Thankaswamy Kosalai](https://github.com/sysbiocoder) (Clinical Genomics Örebro); [Annick Renevey](https://github.com/rannick), [Peter Pruisscher](https://github.com/peterpru) and [Eva Caceres](https://github.com/fevac) (Clinical Genomics Stockholm); [Ryan Kennedy](https://github.com/ryanjameskennedy) (Clinical Genomics Lund); [Anders Sune Pedersen](https://github.com/asp8200) (Danish National Genome Center) and [Lucas Taniguti](https://github.com/lmtani).
178+
Additional contributors were [Sima Rahimi](https://github.com/sima-r), [Gwenna Breton](https://github.com/Gwennid) and [Emma Västerviga](https://github.com/EmmaCAndersson) (Clinical Genomics Gothenburg); [Halfdan Rydbeck](https://github.com/hrydbeck) and [Lauri Mesilaakso](https://github.com/ljmesi) (Clinical Genomics Linköping); [Subazini Thankaswamy Kosalai](https://github.com/sysbiocoder) (Clinical Genomics Örebro); [Annick Renevey](https://github.com/rannick), [Peter Pruisscher](https://github.com/peterpru), [Eva Caceres](https://github.com/fevac) and [Tor Björgen](https://github.com/torbjorgen) (Clinical Genomics Stockholm); [Ryan Kennedy](https://github.com/ryanjameskennedy) (Clinical Genomics Lund); [Anders Sune Pedersen](https://github.com/asp8200) (Danish National Genome Center) and [Lucas Taniguti](https://github.com/lmtani).
179179

180180
We thank the nf-core community for their extensive assistance in the development of this pipeline.
181181

assets/rank_model_genmod_gicam.ini

Lines changed: 121 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
[Version]
2+
version = 1
3+
name = rank_model_for_gicam
4+
5+
[Categories]
6+
7+
[[inheritance_models]]
8+
category_aggregation = min
9+
10+
[[variant_call_quality_filter]]
11+
category_aggregation = sum
12+
13+
[model_score]
14+
category = variant_call_quality_filter
15+
data_type = integer
16+
description = Inheritance model score
17+
field = INFO
18+
info_key = ModelScore
19+
record_rule = min
20+
separators = ',',':',
21+
22+
[[not_reported]]
23+
score = 0
24+
25+
[[low_qual]]
26+
score = -5
27+
lower = 0
28+
upper = 10
29+
30+
[[medium_qual]]
31+
score = -2
32+
lower = 10
33+
upper = 20
34+
35+
[[high_qual]]
36+
score = 0
37+
lower = 20
38+
upper = 300
39+
40+
[genetic_models]
41+
category = inheritance_models
42+
data_type = string
43+
description = Inheritance models followed for the variant
44+
field = INFO
45+
info_key = GeneticModels
46+
record_rule = max
47+
separators = ',', ':', '|',
48+
49+
[[ad]]
50+
priority = 1
51+
score = 1
52+
string = 'AD'
53+
54+
[[ad_dn]]
55+
score = 1
56+
priority = 1
57+
string = 'AD_dn'
58+
59+
[[ar]]
60+
score = 1
61+
priority = 1
62+
string = 'AR_hom'
63+
64+
[[ar_dn]]
65+
score = 1
66+
priority = 1
67+
string = 'AR_hom_dn'
68+
69+
[[ar_comp]]
70+
score = 1
71+
priority = 1
72+
string = 'AR_comp'
73+
74+
[[ar_comp_dn]]
75+
score = 1
76+
priority = 1
77+
string = 'AR_comp_dn'
78+
79+
[[xr]]
80+
score = 1
81+
priority = 1
82+
string = 'XR'
83+
84+
[[xr_dn]]
85+
score = 1
86+
priority = 1
87+
string = 'XR_dn'
88+
89+
[[xd]]
90+
score = 1
91+
priority = 1
92+
string = 'XD'
93+
94+
[[xd_dn]]
95+
score = 1
96+
priority = 1
97+
string = 'XD_dn'
98+
99+
[[not_reported]]
100+
score = -12
101+
102+
[filter]
103+
category = variant_call_quality_filter
104+
data_type = string
105+
description = The filters for the variant
106+
field = FILTER
107+
record_rule = min
108+
separators = ';',
109+
110+
[[not_reported]]
111+
score = 0
112+
113+
[[pass]]
114+
score = 3
115+
priority = 1
116+
string = 'PASS'
117+
118+
[[dot]]
119+
score = 3
120+
priority = 2
121+
string = '.'

conf/modules/rank_variants.config

Lines changed: 20 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -80,9 +80,28 @@ process {
8080

8181
withName: '.*RANK_VARIANTS_SNV:TABIX_BGZIPTABIX' {
8282
ext.args2 = '-p vcf'
83-
ext.prefix = { "${meta.id}_snv_ranked_${meta.set}" }
83+
// Avoid BCFTOOLS i/o nameclash by appending genmod_ when ranking with mivmir, gicam
84+
// VCF name is overridden to fit naming convention BCFTOOLS_APPEND_MIVMIR_GICAM
85+
ext.prefix = { "${meta.id}_snv_ranked_${params.rank_with_mivmir_gicam ? 'genmod_': ''}${meta.set}" }
86+
}
87+
88+
withName: '.*RANK_VARIANTS_SNV:MIVMIR_INFER' {
89+
// VCF key name change tweak for GRCh38
90+
ext.args = '--override_vcf_input_keys GNOMADAF_popmax:GNOMADAF_grpmax'
91+
}
92+
93+
withName: '.*RANK_VARIANTS_SNV:TABIX_BGZIPTABIX_GICAM' {
94+
ext.prefix = { "${meta.id}_snv_ranked_gicam_${meta.set}" }
8495
}
8596

97+
withName: '.*RANK_VARIANTS_SNV:BCFTOOLS_APPEND_MIVMIR_GICAM' {
98+
ext.args = { [
99+
'--columns MivmirScore,MivmirExplanation,GicamScore',
100+
'--write-index=tbi',
101+
'--output-type z'
102+
].join(' ') }
103+
ext.prefix = { "${meta.id}_snv_ranked_${meta.set}" }
104+
}
86105
}
87106

88107
//

docs/output.md

Lines changed: 96 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -69,6 +69,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
6969
- [Filtering and ranking](#filtering-and-ranking)
7070
- [Filter_vep](#filter_vep)
7171
- [GENMOD](#genmod)
72+
- [MIVMIR, GICAM](#mivmir-gicam)
7273
- [Mobile element analysis](#mobile-element-analysis)
7374
- [Calling mobile elements](#calling-mobile-elements)
7475
- [Annotating mobile elements](#annotating-mobile-elements)
@@ -527,6 +528,101 @@ We recommend using vcfanno to annotate SNVs with precomputed CADD scores (files
527528

528529
</details>
529530

531+
#### MIVMIR, GICAM
532+
533+
[MIVMIR](../modules/local/mivmir/meta.yml) and [GICAM](../modules/local/gicam/meta.yml) are two machine learning models used to
534+
infer a pathogenicity score for SNVs, INDELs. In essence, MIVMIR infer SNV pathogenicity and GICAM improves precision
535+
for duo, trio, ... analysis.
536+
537+
The models are enabled as a single entity, to achieve the best performance.
538+
GICAM has a dependency on MIVMIR, acting as an input feature to GICAM.
539+
540+
MIVMIR, GICAM can be enabled by setting the `--rank_with_mivmir_gicam` feature flag.
541+
542+
Only `<outdir>/rank_and_filter/<case_id>_snv_ranked_<research|clinical>.vcf.gz` contains the output annotations.
543+
544+
MIVMIR and GICAM [source code and performance metrics available here](https://github.com/Clinical-Genomics/rdds/tree/master/src/rdds/).
545+
546+
##### MIVMIR
547+
548+
MIVMIR is a deep neural network regression type model trained on ClinVar pathogenic
549+
variants and an Ashkenazim trio negative background. The purpose of the model
550+
is to infer pathogenicity/ rank scores from a set of input annotations.
551+
The model is scoring every variant in isolation.
552+
553+
Model is optimized to generalize well on Scilifelab Clinical Genomics Stockholm cohort,
554+
and does have some custom input features (`SWEGENAF`, `Frq`) as by our
555+
local pipeline configuration.
556+
557+
> It is possible to run model inference on variants without SWEGENAF, Frq
558+
> annotations. These annotations are dropped out during training to simulate
559+
> lack of previous clinical evidence in order to regularize the model.
560+
> However, such configuration is highly experimental, unsupported
561+
> and at your own discretion.
562+
563+
###### Requirements
564+
565+
MIVMIR relies on the following VCF annotations/keys as variant feature inputs:
566+
567+
- `CSQ/PolyPhen`
568+
- `CSQ/SIFT`
569+
- `CSQ/CLINVAR_CLNREVSTAT`
570+
- `CSQ/CLINVAR_CLNSIG`
571+
- `CSQ/MaxEntScan_alt`
572+
- `CSQ/MaxEntScan_diff`
573+
- `CSQ/MES-SWA_acceptor_alt`
574+
- `CSQ/MES-SWA_donor_alt`
575+
- `CSQ/MES-SWA_donor_diff`
576+
- `CSQ/SpliceAI_pred_DS_AL`
577+
- `CSQ/SpliceAI_pred_DS_DG`
578+
- `CSQ/SpliceAI_pred_DS_DL`
579+
- `CSQ/REVEL_score`
580+
- `CSQ/LoFtool`
581+
- `CSQ/GERP++_RS`
582+
- `CSQ/phastCons100way_vertebrate`
583+
- `CSQ/phyloP100way_vertebrate`
584+
- `CSQ/SpliceAI_pred_DS_AG`
585+
- `most_severe_consequence`
586+
- `CADD`
587+
- `SWEGENAF` ([SweGen Variant Frequency Dataset](https://swefreq.nbis.se/dataset/SweGen))
588+
- `GNOMADAF_popmax`
589+
- `Frq` (variant frequency from a local database)
590+
591+
> If any annotation is missing, it will be **_*silently*_** interpreted as `''` or `0.0` depending on data type.
592+
593+
> For GRCh38, GnomAD has renamed the `GNOMADAF_popmax` to `GNOMADAF_grpmax`.
594+
> Adjust this via the `--override_vcf_input_keys [OLD_KEY:NEW_KEY],[...]` flag.
595+
596+
The tool adds two keys to the VCF
597+
598+
- `MivmirScore`, a score in range (0, 1) where 1.0 inferred pathogenic.
599+
- `MivmirExplanation`, a key value array type str-float explaining the additive contribution of each annotation
600+
to the final variant score. The explanations are sorted in decreasing order, the largest input feature
601+
contributor is listed first. This key is only available for high scoring variants for compute performance
602+
reasons.
603+
604+
##### GICAM
605+
606+
Model for SNV ranking in conjunction with MIVMIR.
607+
608+
This model improves precision for duo, trio, ... analysis situations by reducing MIVMIR scores for variants that's
609+
not following the appropriate GENMOD genetic inheritance model. Applied to MIVMIR scores as a post-processing step.
610+
611+
###### Requirements
612+
613+
GICAM relies on the following VCF annotations/keys as variant feature inputs:
614+
615+
- `MivmirScore`, (0, 1)
616+
- `RankScoreNormalized`, (0, 1) as produced by GENMOD using rank_model_genmod_gicam.ini custom scoring config
617+
618+
The tool adds one key to the VCF
619+
620+
- `GicamScore` (0, 1) where 1.0 inferred pathogenic.
621+
622+
> GICAM is optimized for the GENMOD scoring config present in the gicam module directory, that
623+
> generates RankScoreNormalized.
624+
> Modifications to the GENMOD scoring config will break inference (unless GICAM is first retrained on the new config).
625+
530626
### Mobile element analysis
531627

532628
#### Calling mobile elements

docs/usage.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -290,7 +290,7 @@ The mandatory and optional parameters for each category are tabulated below.
290290
| vcfanno_toml<sup>3</sup> | vep_filters/vep_filters_scout_fmt<sup>10</sup> |
291291
| vep_cache_version | cadd_resources<sup>11</sup> |
292292
| vep_cache<sup>4</sup> | run_vcfanno_db_sanity_check<sup>12</sup> |
293-
| gnomad_af<sup>5</sup> | |
293+
| gnomad_af<sup>5</sup> | rank_with_mivmir_gicam<sup>13</sup> |
294294
| score_config_snv<sup>6</sup> | |
295295
| variant_consequences_snv<sup>7</sup> | |
296296
| vep_plugin_files<sup>8</sup> | |
@@ -310,6 +310,7 @@ no header and the following columns: `CHROM POS REF_ALLELE,ALT_ALLELE AF`. Sampl
310310
<sup>10</sup> This file contains a list of candidate genes (with [HGNC](https://www.genenames.org/) IDs) that is used to split the variants into candidate variants and research variants. Research variants contain all the variants, while candidate variants are a subset of research variants and are associated with candidate genes. Sample file [here](https://github.com/nf-core/test-datasets/blob/raredisease/reference/hgnc.txt). Not required if `--skip_subworkflows generate_clinical_set` is set.<br />
311311
<sup>11</sup>Path to a folder containing cadd annotations. Equivalent of the data/annotations/ folder described [here](https://github.com/kircherlab/CADD-scripts/#manual-installation), and it is used to calculate CADD scores for small indels. <br />
312312
<sup>12</sup>When set to `true`, each vcfanno database file listed in `vcfanno_resources` is checked for records (non-header lines). Any database with zero records is removed from the vcfanno TOML config before annotation runs to prevent vcfanno from crashing on default resource files. Default: `false`.<br />
313+
<sup>13</sup> Enable variant SNV-INDEL scoring using MIVMIR, GICAM machine learning models.
313314

314315
:::note
315316
We use CADD only to annotate small indels. To annotate SNVs with precomputed CADD scores, pass the file containing CADD scores as a resource to vcfanno instead. Files containing the precomputed CADD scores for SNVs can be downloaded from [here](https://cadd.gs.washington.edu/download) (download files listed under the description: "All possible SNVs of GRCh3<7/8>/hg3<7/8>")

main.nf

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -110,6 +110,7 @@ workflow NFCORE_RAREDISEASE {
110110
val_readcount_intervals
111111
val_reduced_penetrance
112112
val_rtg_truthvcfs
113+
val_rank_with_mivmir_gicam
113114
val_run_mt_for_wes
114115
val_run_rtgvcfeval
115116
val_run_vcfanno_db_sanity_check
@@ -220,6 +221,8 @@ workflow NFCORE_RAREDISEASE {
220221
ch_score_config_mt = channelFromPath(val_score_config_mt, true)
221222
ch_score_config_snv = channelFromPath(val_score_config_snv, true)
222223
ch_score_config_sv = channelFromPath(val_score_config_sv, true)
224+
// ch_genmod_gicam_score_config is integral to GICAM inference; it cannot be changed without retraining gicam
225+
ch_score_config_genmod_gicam = channel.fromPath("$projectDir/assets/rank_model_genmod_gicam.ini", checkIfExists: true).collect()
223226
ch_vcf2cytosure_blacklist = channelFromPath(val_vcf2cytosure_blacklist, true)
224227
ch_vcfanno_lua = channelFromPath(val_vcfanno_lua, true)
225228
ch_vcfanno_toml = channelFromPath(val_vcfanno_toml, true)
@@ -436,6 +439,7 @@ workflow NFCORE_RAREDISEASE {
436439
ch_score_config_mt,
437440
ch_score_config_snv,
438441
ch_score_config_sv,
442+
ch_score_config_genmod_gicam,
439443
ch_sdf,
440444
ch_sentieon_pcr_indel_model,
441445
ch_subdepth,
@@ -518,6 +522,7 @@ workflow NFCORE_RAREDISEASE {
518522
val_multiqc_samples,
519523
val_outdir,
520524
val_platform,
525+
val_rank_with_mivmir_gicam,
521526
val_run_mt_for_wes,
522527
val_run_rtgvcfeval,
523528
val_run_vcfanno_db_sanity_check,
@@ -636,6 +641,7 @@ workflow {
636641
params.readcount_intervals,
637642
params.reduced_penetrance,
638643
params.rtg_truthvcfs,
644+
params.rank_with_mivmir_gicam,
639645
params.run_mt_for_wes,
640646
params.run_rtgvcfeval,
641647
params.run_vcfanno_db_sanity_check,

modules/local/gicam/main.nf

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
process GICAM_INFER {
2+
// https://github.com/Clinical-Genomics/rdds/tree/master/src/rdds/gicam
3+
4+
tag "${meta.id}"
5+
label 'process_high'
6+
7+
container "docker.io/clinicalgenomics/rdds_mivmir:v1.13.0"
8+
9+
beforeScript "mkdir ${task.workDir}/rdds-tmp"
10+
afterScript "rm -r ${task.workDir}/rdds-tmp"
11+
containerOptions {[
12+
workflow.containerEngine.equals("singularity") ? "--bind ${task.workDir}/rdds-tmp:/rdds/tmp" : "",
13+
workflow.containerEngine.equals("docker") ? "--tmpfs /rdds/tmp": "",
14+
""
15+
].minus("").join(" ")}
16+
17+
input:
18+
tuple val(meta), path(input_vcf)
19+
20+
output:
21+
tuple val(meta), path('*-predictions.vcf'), emit: vcf
22+
tuple val("${task.process}"), val('gicam'), val('v1.13.0'), topic: versions, emit: versions_gicam
23+
24+
when:
25+
task.ext.when == null || task.ext.when
26+
27+
script:
28+
"""
29+
. /opt/pyenv/bin/activate
30+
export PYTHONPATH=/rdds/src
31+
python3 -m rdds.gicam infer-vcf --cpu_cores ${task.cpus} ${input_vcf}
32+
"""
33+
}

0 commit comments

Comments
 (0)