Skip to content

Commit e9d3d76

Browse files
committed
Introduce a new option --no-indelQ-tweaks for more sensitive indel calling
The option switches off an existing heuristics which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has low score. This is a problem for long reads, so newly `--no-indelQ-tweaks` is added to 'ont' and 'pacbio-ccs' presets. Tentative fix to #1459
1 parent fe33c9f commit e9d3d76

File tree

5 files changed

+52
-35
lines changed

5 files changed

+52
-35
lines changed

NEWS

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -55,6 +55,9 @@ Changes affecting specific commands:
5555
--rf, --incl-flags STR|INT Required flags: skip reads with mask bits unset
5656
--ff, --excl-flags STR|INT Filter flags: skip reads with mask bits set
5757

58+
- New option --no-indelQ-tweaks to increase sensitivity for indels, especially
59+
in long reads.
60+
5861
* bcftools query
5962

6063
- Make the `--samples` and `--samples-file` options work also in the `--list-samples`
@@ -78,7 +81,7 @@ Changes affecting specific commands:
7881
Changes affecting the whole of bcftools, or multiple commands:
7982

8083
* New `--regions-overlap` and `--targets-overlap` options which address
81-
a long-standing design problem with subsetting VCF files by region.
84+
a long-standing design problem with subsetting VCF files by region.
8285
BCFtools recognize two sets of options, one for streaming (`-t/-T`) and
8386
one for index-gumping (`-r/-R`). They behave differently, the first
8487
includes only records with POS coordinate within the regions, the other
@@ -106,11 +109,11 @@ Changes affecting specific commands:
106109
by using `-c INFO/END`.
107110

108111
- add a new '.' modifier to control wheter missing values should be carried
109-
over from a tab-delimited file or not. For example:
112+
over from a tab-delimited file or not. For example:
110113

111114
-c TAG .. adds TAG if the source value is not missing. If TAG
112115
exists in the target file, it will be overwritten
113-
116+
114117
-c .TAG .. adds TAG even if the source value is missing. This
115118
can overwrite non-missing values with a missing value
116119
and can create empty VCF fields (`TAG=.`)
@@ -239,7 +242,7 @@ Changes affecting specific commands:
239242
* bcftools +fill-tags:
240243

241244
- Generalization and better support for custom functions that allow
242-
adding new INFO tags based on arbitrary `-i, --include` type of
245+
adding new INFO tags based on arbitrary `-i, --include` type of
243246
expressions. For example, to calculate a missing INFO/DP annotation
244247
from FORMAT/AD, it is possible to use:
245248

@@ -303,7 +306,7 @@ Changes affecting specific commands:
303306

304307
- Atomization of AD and QS tags now correctly updates occurrences of duplicate
305308
alleles within different haplotypes
306-
309+
307310
- Fix a bug in atomization of Number=A,R tags
308311

309312
* bcftools reheader:
@@ -315,7 +318,7 @@ Changes affecting specific commands:
315318
- A wider range of genotypes can be set by the plugin by allowing
316319
specifying custom genotypes. For example, to force a heterozygous
317320
genotype it is now possible to use expressions like:
318-
321+
319322
c:'m|M'
320323
c:0/1
321324
c:0
@@ -327,7 +330,7 @@ Changes affecting specific commands:
327330
- Better handling of ambiguous keys such as INFO/AF and CSQ/AD. The
328331
`-p, --annot-prefix` option is now applied before doing anything else
329332
which allows its use with `-f, --format` and `-c, --columns` options.
330-
333+
331334
- Some consequence field names may not constitute a valid tag name, such
332335
as "pos(1-based)". Newly field names are trimmed to exclude brackets.
333336

@@ -457,7 +460,7 @@ Changes affecting specific commands:
457460

458461
* bcftools csq:
459462

460-
- Fix a bug wich caused incorrect FORMAT/BCSQ formatting at sites with too
463+
- Fix a bug wich caused incorrect FORMAT/BCSQ formatting at sites with too
461464
many per-sample consequences
462465

463466
- Fix a bug which incorrectly handled the --ncsq parameter and could clash

bam2bcf.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -101,6 +101,7 @@ typedef struct __bcf_callaux_t {
101101
errmod_t *e;
102102
void *rghash;
103103
float indel_bias_inverted; // adjusts indel score threshold, 1/--indel-bias, so lower => call more.
104+
int no_indelQ_tweaks;
104105
} bcf_callaux_t;
105106

106107
// per-sample values

bam2bcf_indel.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -629,7 +629,7 @@ static int bcf_cgp_compute_indelQ(int n, int *n_plp, bam_pileup1_t **plp,
629629
// events.
630630
//
631631
// reduce indelQ
632-
if ( bca->indel_bias_inverted >= 1 )
632+
if ( !bca->no_indelQ_tweaks )
633633
indelQ = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ + .499);
634634

635635
// Doesn't really help accuracy, but permits -h to take

doc/bcftools.txt

Lines changed: 29 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -213,7 +213,7 @@ specific commands to see if they apply.
213213
*--regions-overlap* '0'|'1'|'2'::
214214
This option controls how overlapping records are determined:
215215
set to *0* if the VCF record has to have POS inside a region
216-
(this corresponds to the default behavior of *-t/-T*);
216+
(this corresponds to the default behavior of *-t/-T*);
217217
set to *1* if also overlapping records with POS outside a region
218218
should be included (this is the default behavior of *-r/-R*); or set
219219
to *2* to include only true overlapping variation (compare
@@ -278,7 +278,7 @@ The program ignores the first column and the last indicates sex (1=male, 2=femal
278278

279279
*-T, --targets-file* \[^]'FILE'::
280280
Same *-t, --targets*, but reads regions from a file. Note that *-T*
281-
cannot be used in combination with *-t*.
281+
cannot be used in combination with *-t*.
282282
+
283283
With the *call -C* 'alleles' command, third column of the targets file must
284284
be comma-separated list of alleles, starting with the reference allele.
@@ -478,7 +478,7 @@ Add or remove annotations.
478478
*--single-overlaps*::
479479
use this option to keep memory requirements low with very large annotation
480480
files. Note, however, that this comes at a cost, only single overlapping intervals
481-
are considered in this mode. This was the default mode until the commit
481+
are considered in this mode. This was the default mode until the commit
482482
af6f0c9 (Feb 24 2019).
483483

484484
*--threads* 'INT'::
@@ -633,7 +633,7 @@ demand. The original calling model can be invoked with the *-c* option.
633633
text file with sample names in the first column and group names in the second column. If '-' is
634634
given instead, no HWE assumption is made at all and single-sample calling is performed. (Note that
635635
in low coverage data this inflates the rate of false positives.) The *-G* option requires the presence of
636-
per-sample FORMAT/QS or FORMAT/AD tag generated with *bcftools mpileup -a QS* (or *-a AD*).
636+
per-sample FORMAT/QS or FORMAT/AD tag generated with *bcftools mpileup -a QS* (or *-a AD*).
637637

638638
*-g, --gvcf* 'INT'::
639639
output also gVCF blocks of homozygous REF calls. The parameter 'INT' is the
@@ -892,7 +892,7 @@ depth information, such as INFO/AD or FORMAT/AD. For that, consider using the
892892

893893
*-H, --haplotype* '1'|'2'|'R'|'A'|'I'|'LR'|'LA'|'SR'|'SA'|'1pIu'|'2pIu'::
894894
choose which allele from the FORMAT/GT field to use (the codes are case-insensitive):
895-
895+
896896
'1';;
897897
the first allele, regardless of phasing
898898

@@ -1018,8 +1018,8 @@ depth information, such as INFO/AD or FORMAT/AD. For that, consider using the
10181018
==== GEN/SAMPLE conversion:
10191019
*-G, --gensample2vcf* 'prefix' or 'gen-file','sample-file'::
10201020
convert IMPUTE2 output to VCF. One of the ID columns ("SNP ID" or "rsID" in
1021-
https://www.cog-genomics.org/plink/2.0/formats#gen) must be of the form
1022-
"CHROM:POS_REF_ALT" to detect possible strand swaps.
1021+
https://www.cog-genomics.org/plink/2.0/formats#gen) must be of the form
1022+
"CHROM:POS_REF_ALT" to detect possible strand swaps.
10231023
{nbsp} +
10241024
When the *--vcf-ids* option is given, the other column (autodetected) is used
10251025
to fill the ID column of the VCF.
@@ -1279,7 +1279,7 @@ output VCF and are ignored for the prediction analysis.
12791279
#
12801280
# Attributes required for
12811281
# gene lines:
1282-
# - ID=gene:<gene_id>
1282+
# - ID=gene:<gene_id>
12831283
# - biotype=<biotype>
12841284
# - Name=<gene_name> [optional]
12851285
#
@@ -1553,7 +1553,7 @@ Without the *-g* option, multi-sample cross-check of samples in 'query.vcf.gz' i
15531553
that average score is used to determine the top matches, not absolute values.
15541554

15551555
*--no-HWE-prob*::
1556-
Disable calculation of HWE probability to reduce memory requirements with
1556+
Disable calculation of HWE probability to reduce memory requirements with
15571557
comparisons between very large number of sample pairs.
15581558

15591559
*-p, --pairs* 'LIST'::
@@ -1622,11 +1622,11 @@ Without the *-g* option, multi-sample cross-check of samples in 'query.vcf.gz' i
16221622
// present, a constant value '99' is used for the unseen genotypes. With
16231623
// *-G*, the value '1' can be used instead; the discordance value then
16241624
// gives exactly the number of differing genotypes.
1625-
//
1625+
//
16261626
// ERR, error rate;;
16271627
// Pairwise error rate calculated as number of differences divided
16281628
// by the total number of comparisons.
1629-
//
1629+
//
16301630
// CLUSTER, TH, DOT;;
16311631
// In presence of multiple samples, related samples and outliers can be
16321632
// identified by clustering samples by error rate. A simple hierarchical
@@ -1861,7 +1861,7 @@ For "vertical" merge take a look at *<<concat,bcftools concat>>* or *<<norm,bcft
18611861
alternate alleles relevant (local) for the current sample. The number 'INT' gives the
18621862
maximum number of alternate alleles that can be included in the PL tag. The default value
18631863
is 0 which disables the feature and outputs values for all alternate alleles.
1864-
1864+
18651865
*-m, --merge* 'snps'|'indels'|'both'|'all'|'none'|'id'::
18661866
The option controls what types of multiallelic records can be created:
18671867
----
@@ -2150,8 +2150,8 @@ INFO/DPR .. Deprecated in favor of INFO/AD; Number of high-quality bases for
21502150

21512151
1.12 -Q13 -h100 -m1
21522152
illumina [ default values ]
2153-
ont -B -Q5 --max-BQ 30 --indel-bias 1.01 -I
2154-
pacbio-ccs -D -Q5 --max-BQ 50 --indel-bias 1.01 -F0.1 -o25 -e1 -M99999
2153+
ont -B -Q5 --max-BQ 30 --no-indelQ-tweaks -I
2154+
pacbio-ccs -D -Q5 --max-BQ 50 --no-indelQ-tweaks -F0.1 -o25 -e1 -M99999
21552155

21562156
*--ar, --ambig-reads* 'drop'|'incAD'|'incAD0'::
21572157
What to do with ambiguous indel reads that do not span an entire
@@ -2195,6 +2195,13 @@ INFO/DPR .. Deprecated in favor of INFO/AD; Number of high-quality bases for
21952195
Note that although the window size approximately corresponds to the maximum
21962196
indel size considered, it is not an exact threshold [110]
21972197

2198+
*--no-indelQ-tweaks*::
2199+
Increase sensitivity of indel calling, especially from long reads.
2200+
The indel calling algorithm was designed for short reads and uses heuristics
2201+
to estimate the maximum tolerable deviation of the query sequence
2202+
from the reference. However, for long reads this sometimes leads to incorrect
2203+
rejection of valid indels.
2204+
21982205
*-I, --skip-indels*::
21992206
Do not perform INDEL calling
22002207

@@ -2256,7 +2263,7 @@ the *<<fasta_ref,--fasta-ref>>* option is supplied.
22562263
See also *--atom-overlaps* and *--old-rec-tag*.
22572264

22582265
*--atom-overlaps* '.'|'*'::
2259-
Alleles missing because of an overlapping variant can be set either
2266+
Alleles missing because of an overlapping variant can be set either
22602267
to missing (.) or to the star alele (*), as recommended by
22612268
the VCF specification. IMPORTANT: Note that asterisk is expaneded
22622269
by shell and must be put in quotes or escaped by a backslash:
@@ -2286,7 +2293,7 @@ the *<<fasta_ref,--fasta-ref>>* option is supplied.
22862293
can swap alleles and will update genotypes (GT) and AC counts,
22872294
but will not attempt to fix PL or other fields. Also note, and this
22882295
cannot be stressed enough, that 's' will NOT fix strand issues in
2289-
your VCF, do NOT use it for that purpose!!! (Instead see
2296+
your VCF, do NOT use it for that purpose!!! (Instead see
22902297
<http://samtools.github.io/bcftools/howtos/plugin.af-dist.html> and
22912298
<http://samtools.github.io/bcftools/howtos/plugin.fixref.html>.)
22922299

@@ -2330,7 +2337,7 @@ the *<<fasta_ref,--fasta-ref>>* option is supplied.
23302337

23312338
*--old-rec-tag* 'STR'::
23322339
Add INFO/STR annotation with the original record. The format of the
2333-
annotation is CHROM|POS|REF|ALT|USED_ALT_IDX.
2340+
annotation is CHROM|POS|REF|ALT|USED_ALT_IDX.
23342341

23352342
*-o, --output* 'FILE'::
23362343
see *<<common_options,Common Options>>*
@@ -2949,11 +2956,11 @@ Transition probabilities:
29492956

29502957
*-M, --rec-rate* 'FLOAT'::
29512958
constant recombination rate per bp. In combination with *--genetic-map*,
2952-
the *--rec-rate* parameter is interpreted differently, as 'FLOAT'-fold increase of
2959+
the *--rec-rate* parameter is interpreted differently, as 'FLOAT'-fold increase of
29532960
transition probabilities, which allows the model to become more sensitive
29542961
yet still account for recombination hotspots. Note that also the range
29552962
of the values is therefore different in both cases: normally the
2956-
parameter will be in the range (1e-3,1e-9) but with *--genetic-map*
2963+
parameter will be in the range (1e-3,1e-9) but with *--genetic-map*
29572964
it will be in the range (10,1000).
29582965

29592966
*-o, --output* 'FILE'::
@@ -3192,7 +3199,7 @@ Convert between VCF and BCF. Former *bcftools subset*.
31923199
Note that filter options below dealing with counting the number of alleles
31933200
will, for speed, first check for the values of AC and AN in the INFO column to
31943201
avoid parsing all the genotype (FORMAT/GT) fields in the VCF. This means
3195-
that a filter like '--min-af 0.1' will be calculated from INFO/AC and INFO/AN
3202+
that a filter like '--min-af 0.1' will be calculated from INFO/AC and INFO/AN
31963203
when available or FORMAT/GT otherwise. However, it will not attempt to use any other existing
31973204
field, like INFO/AF for example. For that, use '--exclude AF<0.1' instead.
31983205

@@ -3411,7 +3418,7 @@ to require that all alleles are of the given type. Compare
34113418
* array subscripts (0-based), "*" for any element, "-" to indicate a range. Note that
34123419
for querying FORMAT vectors, the colon ":" can be used to select a sample and an
34133420
element of the vector, as shown in the examples below
3414-
3421+
34153422
INFO/AF[0] > 0.3 .. first AF value bigger than 0.3
34163423
FORMAT/AD[0:0] > 30 .. first AD value of the first sample bigger than 30
34173424
FORMAT/AD[0:1] .. first sample, second AD value
@@ -3524,7 +3531,7 @@ used on the result. For example, when querying "TAG=1,2,3,4", it will be evaluat
35243531

35253532
TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
35263533

3527-
COUNT(GT="hom")=0 .. no homozygous genotypes at the site
3534+
COUNT(GT="hom")=0 .. no homozygous genotypes at the site
35283535

35293536
AVG(GQ)>50 .. average (arithmetic mean) of genotype qualities bigger than 50
35303537

mpileup.c

Lines changed: 10 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@ typedef struct {
7373
int openQ, extQ, tandemQ, min_support, indel_win_size; // for indels
7474
double min_frac; // for indels
7575
double indel_bias;
76+
int no_indelQ_tweaks;
7677
char *reg_fname, *pl_list, *fai_fname, *output_fname;
7778
int reg_is_file, record_cmd_line, n_threads, clevel;
7879
faidx_t *fai;
@@ -843,6 +844,7 @@ static int mpileup(mplp_conf_t *conf)
843844
conf->bcr = (bcf_callret1_t*) calloc(nsmpl, sizeof(bcf_callret1_t));
844845
conf->bca->openQ = conf->openQ, conf->bca->extQ = conf->extQ, conf->bca->tandemQ = conf->tandemQ;
845846
conf->bca->indel_bias_inverted = 1 / conf->indel_bias;
847+
conf->bca->no_indelQ_tweaks = conf->no_indelQ_tweaks;
846848
conf->bca->min_frac = conf->min_frac;
847849
conf->bca->min_support = conf->min_support;
848850
conf->bca->per_sample_flt = conf->flag & MPLP_PER_SAMPLE;
@@ -1175,13 +1177,15 @@ static void print_usage(FILE *fp, const mplp_conf_t *mplp)
11751177
" --indel-bias FLOAT Raise to favour recall over precision [%.2f]\n", mplp->indel_bias);
11761178
fprintf(fp,
11771179
" --indel-size INT Approximate maximum indel size considered [%d]\n", mplp->indel_win_size);
1180+
fprintf(fp,
1181+
" --no-indelQ-tweaks Increase sensitivity by disabling indel quality adjustments\n");
11781182
fprintf(fp,"\n");
11791183
fprintf(fp,
11801184
"Configuration profiles activated with -X, --config:\n"
11811185
" 1.12: -Q13 -h100 -m1 -F0.002\n"
11821186
" illumina: [ default values ]\n"
1183-
" ont: -B -Q5 --max-BQ 30 --indel-bias 1.01 -I [also try eg |bcftools call -P0.01]\n"
1184-
" pacbio-ccs: -D -Q5 --max-BQ 50 --indel-bias 1.01 -F0.1 -o25 -e1 --delta-BQ 10 -M99999\n"
1187+
" ont: -B -Q5 --max-BQ 30 --no-indelQ-tweaks -I [also try eg |bcftools call -P0.01]\n"
1188+
" pacbio-ccs: -D -Q5 --max-BQ 50 --no-indelQ-tweaks -F0.1 -o25 -e1 --delta-BQ 10 -M99999\n"
11851189
"\n"
11861190
"Notes: Assuming diploid individuals.\n"
11871191
"\n"
@@ -1283,6 +1287,7 @@ int main_mpileup(int argc, char *argv[])
12831287
{"gap-frac", required_argument, NULL, 'F'},
12841288
{"indel-bias", required_argument, NULL, 10},
12851289
{"indel-size", required_argument, NULL, 15},
1290+
{"no-indelQ-tweaks", no_argument, NULL, 20},
12861291
{"tandem-qual", required_argument, NULL, 'h'},
12871292
{"skip-indels", no_argument, NULL, 'I'},
12881293
{"max-idepth", required_argument, NULL, 'L'},
@@ -1415,6 +1420,7 @@ int main_mpileup(int argc, char *argv[])
14151420
}
14161421
}
14171422
break;
1423+
case 20: mplp.no_indelQ_tweaks = 1; break;
14181424
case 'A': use_orphan = 1; break;
14191425
case 'F': mplp.min_frac = atof(optarg); break;
14201426
case 'm': mplp.min_support = atoi(optarg); break;
@@ -1439,15 +1445,15 @@ int main_mpileup(int argc, char *argv[])
14391445
mplp.extQ = 1;
14401446
mplp.flag |= MPLP_REALN_PARTIAL;
14411447
mplp.max_read_len = 99999;
1442-
mplp.indel_bias = 1.01;
1448+
mplp.no_indelQ_tweaks = 1;
14431449
} else if (strcasecmp(optarg, "ont") == 0) {
14441450
fprintf(stderr, "For ONT it may be beneficial to also run bcftools call with "
14451451
"a higher -P, eg -P0.01 or -P 0.1\n");
14461452
mplp.min_baseQ = 5;
14471453
mplp.max_baseQ = 30;
14481454
mplp.flag &= ~MPLP_REALN;
14491455
mplp.flag |= MPLP_NO_INDEL;
1450-
mplp.indel_bias = 1.01;
1456+
mplp.no_indelQ_tweaks = 1;
14511457
} else if (strcasecmp(optarg, "1.12") == 0) {
14521458
// 1.12 and earlier
14531459
mplp.min_frac = 0.002;

0 commit comments

Comments
 (0)