revision of 1679 plus switch to edlib over BAQ #2045


Merged
8 commits merged on Feb 9, 2024

Conversation

jkbonfield
Contributor

@jkbonfield jkbonfield commented Nov 23, 2023

Please give me confirmation that this is interesting and will be considered for inclusion.

Despite demonstrating a very significant leap forward in calling accuracy, #1679 never received any comments or consideration. Instead it triggered the creation of the --indels-2.0 option, with many of the ideas being rewritten for that. Unfortunately --indels-2.0 just doesn't perform that well, as it was never completed and misses many of the vital heuristics that remove false positive calls.

In personal communications, one reason for 1679 being passed over appears to be that it changes the behaviour of existing bcftools and doesn't offer a way back to the old calling algorithm for compatibility. (IMO if you want it to never get better, don't upgrade!)

I'm not a believer in locking new functionality behind new options, as that means we never progress (no one is going to go through all the nextflow pipelines and amend them), but I'll concede the lack of compatibility may be an issue, so this PR adds versioned profiles (e.g. -X illumina-1.18) that behave as the previous release, while updating the default profiles to be more accurate.

The main changes so far are:

  • Use of a consensus instead of patching the reference for constructing the candidate indel alleles to align against. This is the most significant accuracy gain from #1679 (Improvements to indel calling).

  • Drop the use of BAQ as an alignment algorithm and replace it with edlib. BAQ was never really designed as an aligner, but as a quality recalibration tool. Using it to align sequences is both very heavyweight and error prone, as it won't always give a good alignment score on sequences containing STRs or other low-complexity regions. Edlib is vastly faster, and we compensate for the lack of quality assessment with a combination of heuristics such as STR scanning (this was previously added, but it's been beefed up a bit more) and some crude quality score evaluations such as minimum quality within an STR.

  • More profiles and revised profiles. ONT-super and Ultima have been added.

  • New parameters: --del-bias to control biases in insertion vs deletion error models in instruments (many single-cell instruments are more likely to miss bases than insert bases); --poly-mqual to enable/disable the use of minimum quality values within homopolymers (harms clocked instruments such as Illumina, but benefits many others).
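The minimum-quality-within-homopolymer idea behind --poly-mqual can be sketched roughly as follows. This is a hypothetical standalone function, not the PR's actual implementation: the intuition is that an indel call inside a homopolymer run is only as trustworthy as the worst base in that run.

```c
#include <assert.h>

/* Hypothetical sketch: for the base at `pos` in read `seq` (with
 * per-base qualities in `qual`), return the minimum quality over the
 * homopolymer run containing pos. */
static int min_qual_in_homopolymer(const char *seq, const unsigned char *qual,
                                   int len, int pos) {
    char base = seq[pos];
    int start = pos, end = pos;
    while (start > 0 && seq[start-1] == base) start--;   /* extend left  */
    while (end < len-1 && seq[end+1] == base) end++;     /* extend right */
    int min_q = qual[start];
    for (int i = start+1; i <= end; i++)
        if (qual[i] < min_q) min_q = qual[i];
    return min_q;
}
```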

SNP calling is unchanged.

As requested, I moved the code out of bam2bcf_indels.c and into its own file. I'm not happy with the --edlib parameter, as I feel edlib vs BAQ isn't the really important thing here. It's a big speed up, but the real work is the better algorithm. That's (for now) in bam2bcf_edlib.c, following the same strategy @pd3 took with duplicating functions into bam2bcf_iaux.c. I'm open to better names on the command line. I'm tempted by --indels-3.0, but I fear that may cause friction. I'm not intending to say it's deprecating your indels-2.0 code (it's for the users to decide which algorithm they prefer), and indeed it's not a derivation of it, as it precedes it, and the indels-2.0 code was your earlier rewrite of my own 1679 PR so really that was "indels-1.5" anyway. I don't really like the naming scheme for either of them, but am open to suggestions.

The recall vs precision is generally significantly ahead of both develop and the experimental --indels-2.0 mode (which also crashes on a lot of the long-read data sets I tried, such as ~jkb/lustre/GIAB/ont/PAO83395.pass.cram on our local systems). Speed-wise, it's pretty much unchanged with Illumina, but on long-read technologies with higher indel errors the increased number of candidate indels to assess means the new profiles and edlib use can be 2-3x faster.

This PR however isn't always quite matching the old 1679 one. It's close, but marginally behind. That's probably due to using heuristics instead of an HMM to incorporate quality scores into the alignment process. It's close enough though, and it is much faster on long-read data.

I'm still running experiments and assessing results, but this is a proof of principle.


Illumina GIAB HG002 chr1 results. These all had BAQ semi-enabled (default mode) for SNPs.
Total INDEL on chr1 = 45604
2% FN = 912
5% FN = 2280

Illumina	Q0: FP,GT,FN	2% FN: Q/FP,GT,FN	5% FN: Q/FP,GT,FN	time
devel	294,1078,762	23/252,1068,910		100/175,966,2268	42m38s
#1679	341,188,488	75/236,170,910		135/189,151,2254	77m4s
--id2.0	517,864,700	62/411,833,911		154/334,651,2290	63m48s
edlib	307,223,649	53/234,210,913		113/181,177,2286	60m25s
ed10	384,225,592	79/252,205,908		134/213,182,2263	62m44s

The comma-separated values above are False Positives, GT assignment errors, and False Negatives. The val in val/FP,GT,FN is the QUAL threshold applied as a filter, dropping all variants with lower quality. I've been manually adjusting that until I get to a similar FN rate so I can do a more reasonable side by side assessment of "given the same recall, how does the accuracy compare". I'll produce some plots later on once I've completed my trials.
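The "adjust the QUAL filter until a target FN rate is reached" methodology can be sketched as below. This is an illustrative stand-alone function (hypothetical types, not part of bcftools): given each call's QUAL and whether it matches a true variant, it counts the FPs surviving a filter and the FNs the filter leaves.

```c
#include <assert.h>

/* Hypothetical sketch of the evaluation behind the tables above:
 * calls[i].is_true is 1 when the call matches a true variant.
 * n_truth is the total number of true variants in the region. */
typedef struct { double qual; int is_true; } call_t;

static void count_at_threshold(const call_t *calls, int n, int n_truth,
                               double min_qual, int *fp, int *fn) {
    int tp = 0;
    *fp = 0;
    for (int i = 0; i < n; i++) {
        if (calls[i].qual < min_qual) continue;  /* dropped by the filter */
        if (calls[i].is_true) tp++; else (*fp)++;
    }
    *fn = n_truth - tp;  /* true variants not called, or filtered away */
}
```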

"ed10" is --edlib --indel-bias 10 to push sensitivity up higher, especially at QUAL 0 (no filtering). It's still not as sensitive as 1679 though. 1679 was better here for genotype assignment accuracy, but FP vs FN was broadly similar to the new mode.


PacBio HiFi

PacBio HiFi	Q0: FP,GT,FN	~1500 FN: Q/FP,GT,FN	5% FN: Q/FP,GT,FN	time
dev	25310,6014,1332		6/20532,6064,1491	23/ 9815,5986,2288	111m50s
#1679	28891,5622, 778		26/10704,5608,1504	38/ 6466,5567,2329	47m
edlib	 9392,5406,1417		4/ 8624,5397,1496	13/ 5462,5342,2255	38m55s
gatk	118211,5559, 997	90/29501,5541,1503	130/16044,5512,2294	~944min
--id2.0	crash

The new edlib mode is ahead of the 1679 PR, which itself was significantly ahead of develop (and gatk4, it turns out, but I didn't run all the myriad post-processing tools, just QUAL filtering). The new code is MUCH faster too. --indels-2.0 died:

bam2bcf_iaux.c:468: iaux_align_read: Assertion `qry_off1<=qry_off2' failed.        

ONT "super" base call mode. (PAO83395.pass.cram from ONT)

	FP,GT,FN	time
dev	2306,2712,12507	251m22s
#1679	2403,2322,11918	733m42s
bias	2677,2132,12650	111m50s
--id2.0	crash

I still have more analysis to do here, along with the Ultima Genomics calling. Both the ONT and UG data sets, however, are marginal on small indel detection due to the error models of the instruments. (If anyone knows of a published HG002 data set from ONT using "sup-dup" mode data then please feel free to point me at it. Similarly for any other data types out there.)

@jkbonfield
Contributor Author

Also, what do people want to see for evidence? I'm thinking of similar graphs to the ones I produced before, but I may adjust the counting.

My view is that for diploid organisms we're making 2 calls at every site - one for each allele. For simplicity I want to stick with precision vs recall, so we can count REF/REF, REF/ALT and ALT/ALT as 2 calls each. If we call ALT/ALT and it's really ALT/ALT then that's two true positives. If we call it REF/REF then it's 2 false negatives, and if we call it REF/ALT then it's 1 true positive and 1 false negative. Similarly a true REF/REF called as ALT/ALT is 2 false positives, and as REF/ALT is 1 false positive.

I think that should be enough to give due importance to genotype assignment (which is one key improvement, especially on Illumina where it's a 6-fold reduction), while still keeping the graph simple. Comments?
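As a concrete restatement, the per-allele counting scheme above reduces to comparing ALT-allele counts per site. A minimal sketch (hypothetical names, not part of the evaluation scripts):

```c
#include <assert.h>

/* A diploid site is two calls, one per allele.  truth_alt and
 * called_alt are the number of ALT alleles (0, 1 or 2) in the truth
 * set and in the call respectively. */
typedef struct { int tp, fp, fn; } tally_t;

static tally_t tally_site(int truth_alt, int called_alt) {
    tally_t t;
    t.tp = truth_alt < called_alt ? truth_alt : called_alt;
    t.fn = truth_alt  - t.tp;   /* real ALT alleles we missed        */
    t.fp = called_alt - t.tp;   /* ALT alleles we wrongly introduced */
    return t;
}
```

So a true ALT/ALT called as REF/ALT gives tally_site(2,1) = 1 TP + 1 FN, and a true REF/REF called as ALT/ALT gives tally_site(0,2) = 2 FP, matching the worked cases above.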

@pd3
Member

pd3 commented Dec 11, 2023

@jkbonfield As requested offline, here is some information on the problem with the original draft we were discussing a while ago. I don't know if it is still an issue, but the observation back then was:

Each read was aligned to two consensus sequences generated from indels visible in the sample. In the case of one hom-ALT sample with an insertion and one hom-REF sample, the code triggered four realignments for the second sample, all to the same identical reference consensus, and the alternate consensus was not considered for the hom-REF sample at all. This led to a low quality estimate.

The program should avoid recomputing the same read for the same consensus twice, just to save time. More importantly, if there aren't enough indel types within the sample, the most popular consensus from other samples should be used.
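A minimal sketch of the caching suggested here (all names hypothetical, not the PR's code): remember the alignment score per (read id, consensus id) pair so the same read is never realigned against an identical consensus twice.

```c
#include <assert.h>

#define CACHE_SIZE 256
typedef struct { int read_id, cons_id, score, valid; } cache_ent_t;

static int n_align_calls = 0;            /* counts real alignments done */
static int toy_align(int read_id, int cons_id) {
    n_align_calls++;                     /* stand-in for the expensive aligner */
    return read_id + cons_id;
}

static int cached_align(cache_ent_t *cache, int read_id, int cons_id,
                        int (*align)(int, int)) {
    cache_ent_t *e = &cache[(read_id * 31 + cons_id) % CACHE_SIZE];
    if (e->valid && e->read_id == read_id && e->cons_id == cons_id)
        return e->score;                 /* cache hit: skip the realignment */
    *e = (cache_ent_t){ read_id, cons_id, align(read_id, cons_id), 1 };
    return e->score;
}
```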

@jkbonfield
Contributor Author

jkbonfield commented Dec 11, 2023 via email

@jkbonfield
Contributor Author

jkbonfield commented Dec 12, 2023

I'm adding some analysis results of various GIAB HG002 samples. These show false positive rate (FPr) vs false negative rate (FNr), and also genotype assignment error rate (GTr) vs FNr.

FPr is defined as the fraction of called variants that are false.
GTr is the fraction of true variants that were called with the incorrect genotype.
FNr is the fraction of true variants that were not called.

"Truth" is a little bit flexible, as there is an inherent ambiguity in representing the same consequence via different calls, e.g. 1 indel + 1 SNP vs 2 indels, etc.

Note: I'm reporting indels only here as this PR doesn't change the SNP calling.
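The three rate definitions above, written as code (a trivial illustrative sketch; variable names are mine):

```c
#include <assert.h>

/* n_called = number of called variants, n_true = number of true
 * variants, fp = false calls, gt_err = true variants called with the
 * wrong genotype, fn = true variants not called at all. */
static double fp_rate(int fp, int n_called)  { return (double)fp / n_called; }
static double gt_rate(int gt_err, int n_true){ return (double)gt_err / n_true; }
static double fn_rate(int fn, int n_true)    { return (double)fn / n_true; }
```

Note that FPr is normalised by the calls made, while GTr and FNr are normalised by the truth set.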

Low-depth Illumina

image
image

High-depth Illumina
image
image

It's clear that the genotype assignment accuracy is the huge win here, especially when it comes to deeper data. This calculation was systematically flawed in the old code due to the way it scores multi-allelic sites, but aligning to a consensus instead of a patched reference is clearly a significant win for genotype analysis too.

At shallow depth, there isn't a marked gain over --indels-2.0. It's a bit less sensitive at the no-filter level, and a bit better on FP/FN ratio at the higher QUAL thresholds. For 60x data I also showed the old #1679 results. We're similar on genotype call accuracy, but FPs have been further reduced since then. This PR is substantially ahead of both devel and indels-2.0 on the deeper data. Sadly indels-2.0 is a bad option there, as it's even worse than develop.

Timing-wise, develop, this PR ("new") and indels-2.0 all took ~66 min for chr1 60x and ~23 min for 15x. Not much difference.

I'll do other technologies in their own posts.

(And yes, I'm being lazy! Much less work to just cut and paste a full window than to save the image to disk, find it in firefox, and then reupload it again. "control-V" is less work :-)).

@jkbonfield
Contributor Author

jkbonfield commented Dec 12, 2023

Next up is BGISeq, with the new -X BGI profile and equivalent options explicitly stated in develop branch. I also included the published GATK calls uploaded to the GIAB site.

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_BGIseq_2x150bp_100x/
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_BGIseq_2x150bp_100x_07192023/

Deep 100x data:

image
image

On the full depth data, this PR is more accurate than the GATK results uploaded with the published dataset. Unfortunately the previous updates to --indels-2.0 are the worst performing due to a high false positive rate.

Shallow 10x data:

image
image

It's less clear with shallow 10x BGI data.
Develop, indels-2.0, and this PR all have positions in the FN vs FP tradeoff where they're best, but arguably develop is best where it matters most (high recall) although frankly none are doing well. This PR however does still have significantly better genotype accuracy than the others.

Speed-wise, the full data set was broadly the same speed for all bcftools runs, at approx 2 hours.

@jkbonfield
Contributor Author

jkbonfield commented Dec 12, 2023

Next up, a modern 53x PacBio HiFi alignment from 20kb and 15kb libraries.

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/

It comes with DeepVariant analysis too
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/PacBio_CCS_15kb_20kb_chemistry2_10312019/GRCh38/

53x data:
image
image

What's staggering here is clearly how far ahead DeepVariant is. It's obviously got far better understanding of the errors involved in this data. I did note there are clusters of errors that we make around supplementary reads, and that filtering those out (or alternatively filtering out anything with long soft-clips) removes almost all of those errors, but overall it's not a big impact as the number of affected regions is small. We still have a very significant number of false positives to remove though.

It may be worth investigating some of the tricks used in samtools consensus here too. I found these to be significantly beneficial for HiFi:

  • Recalibration of quality values, so the QUAL score actually ties in with the error likelihood. Note it needs recalibrating for SNP, INS and DEL separately too, or rather recalibrate once and have additional error type priors.
  • The notion of a localised mapping quality. MAPQ is almost useless on long-read data. What's more helpful is the number of differences within a local window, as it can help spot flights of fantasy and potential diversion caused by duplications.
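The localised mapping quality idea from the second bullet can be sketched as a sliding-window count of differences against the reference (hypothetical helper, not the samtools consensus implementation):

```c
#include <assert.h>

/* diff[i] is 1 where read position i differs from the reference
 * (mismatch, or flanking an inserted/deleted base), else 0.  Instead
 * of a read-wide MAPQ, count differences within +/- win of pos: a
 * high local count flags likely misalignment or duplication. */
static int local_diffs(const unsigned char *diff, int len, int pos, int win) {
    int start = pos - win < 0 ? 0 : pos - win;
    int end   = pos + win >= len ? len - 1 : pos + win;
    int n = 0;
    for (int i = start; i <= end; i++)
        n += diff[i];
    return n;
}
```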

Unfortunately I don't have --indels-2.0 results for PacBio as it hits an assertion failure very rapidly.

Speed wise, develop took 140 min and this PR 60 min, so is 2.3x faster.

@jkbonfield
Contributor Author

jkbonfield commented Dec 12, 2023

An older 2019 GIAB HG002 PacBio HiFi data set. This appeared to have different characteristics, and got worse with higher --indel-bias rather than better as the more modern one did. Both are assessed at the same value of 1.2 (default for CCS profile), as a middling setting.

image
image

More for comedy, I also included processing of samtools consensus -f pileup output. This doesn't officially support VCF as it's not a variant caller, but it's possible to turn the consensus into variants with a bit of scripting. It's not very sensitive as it has no realignment step whatsoever, focusing primarily on reporting the consensus of the data as-given. However, high FN aside, it's at a reasonable place in the FN/FP tradeoff and it's actually the most accurate for genotype assignments! This possibly comes from the improved quality recalibration and local mapping quality assessments mentioned above, so there is room to borrow more ideas.

We're comfortably ahead of GATK on this data set, although that was my own run of GATK and I can't be sure I used it correctly as it's a complex beast. This PR is also still an improvement on the already-improved #1679.

Speed wise, devel took 120 min, this PR 41 min (so almost 3x faster), and sadly --indels-2.0 crashed again. I don't have recordings of how long GATK took, but the date put in the header and the file time stamp implies about 250 min, which seems reasonable as I recall it being slower than current bcftools.

I think this places us in a good spot for calling performance. We're likely never to match DeepVariant, but we'd be a fraction of the CPU cost. (I haven't run it myself, but being AI-driven I can't believe it's any faster than GATK.)

@jkbonfield
Contributor Author

jkbonfield commented Dec 12, 2023

Ultima Genomics.

This was a control sample spiked in with other data, and is ~20x HG002. I have deeper data available for others (e.g. HG004), but not yet analysed. Ultima Genomics' main error model is indels, particularly in homopolymers IIRC, so it performs quite poorly. However this PR is still ahead of develop.

For the develop run I copied the parameters from the new -X ultima profile (bar the new params added in this PR itself).

image
image

Not much more to say on this. Probably a work in progress still, so we can learn more on handling the error models for this data.

Speed was develop 34min, this PR 31min. So broadly comparable.

@jkbonfield
Contributor Author

jkbonfield commented Dec 12, 2023

ONT "sup" base caller, HG002 sample again

image
image

Disappointingly we're behind the old PR now, although again all of them are doing a really bad job. Even with the super mode basecaller the error rate is just too high for us to reliably call indels. Likely a better algorithm that's more in tune with the error profile could tease out a better result. I'll play more with the parameters to see if it's simply a matter of tuning things a bit. I'd like it to at least match the old PR here.

As with other long-read technologies this PR is significantly ahead on speed. 104min for this vs 317min for develop. So 3x faster again. As with other long-read technologies, --indels-2.0 crashes again:

bcftools: bam2bcf_iaux.c:468: iaux_align_read: Assertion `qry_off1<=qry_off2' failed.

@jkbonfield
Contributor Author

jkbonfield commented Dec 13, 2023

@jkbonfield As requested offline, here is some information on the problem with the original draft we were discussing a while ago. I don't know if it is still an issue, but the observation back then was:

Each read was aligned to two consensus sequences generated from indels visible in the sample. In the case of one hom-ALT sample with an insertion and one hom-REF sample, the code triggered four realignments for the second sample, all to the same identical reference consensus, and the alternate consensus was not considered for the hom-REF sample at all. This led to a low quality estimate.

The program should avoid recomputing the same read for the same consensus twice, just to save time. More importantly, if there aren't enough indel types within the sample, the most popular consensus from other samples should be used.

Can you please elaborate? I haven't looked at what it's doing under the hood yet, but it does get the right answer. All the versions I've tried have. My test:

@ seq4d[samtools.../bcftools]; cat _dat.pl
#!/usr/bin/perl

# GRCh38
#>chr1:100000-100019
#CACTAAGCACACAGAGAATAA

print "\@SQ\tSN:chr1\tLN:248956422\n";
print "\@RG\tID:s1\tSM:s1\n";
print "\@RG\tID:s2\tSM:s2\n";

my $DP=6/2;

# Sample 1, HOM INS +TT
for (my $i=0;$i<$DP;$i++) {
    print "s1a.$i\t0\tchr1\t100000\t60\t10M2I10M\t*\t0\t0\tCACTAAGCACACACAGAGAATA\tIIIIIIIIIIIIIIIIIIIIII\tRG:Z:s1\n";
    print "s1b.$i\t0\tchr1\t100000\t60\t10M2I10M\t*\t0\t0\tCACTAAGCACACACAGAGAATA\tIIIIIIIIIIIIIIIIIIIIII\tRG:Z:s1\n";
}

# Sample 2, HET INS +TT
for (my $i=0;$i<$DP;$i++) {
    print "s2a.$i\t0\tchr1\t100000\t60\t10M2I10M\t*\t0\t0\tCACTAAGCACACACAGAGAATA\tIIIIIIIIIIIIIIIIIIIIII\tRG:Z:s2\n";
    print "s2b.$i\t0\tchr1\t100000\t60\t20M\t*\t0\t0\tCACTAAGCACACAGAGAATA\tIIIIIIIIIIIIIIIIIIII\tRG:Z:s2\n";
}

@ seq4d[samtools.../bcftools]; ./_dat.pl |./bcftools mpileup --indels-cns -a AD --fasta-ref $HREF38 - 2>/dev/null | bcftools call -vm - 2>/dev/null | egrep -v '^#'
chr1	100009	.	CACA	CACACA	16.4975	.	INDEL;IMF=0.75;IDV=9;DP=12;VDB=5.52906e-06;SGB=-1.03866;RPBZ=-3.31662;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=0;MQ0F=0;AC=3;AN=4;DP4=3,0,9,0;MQ=60	GT:PL:AD	1/1:39,18,0:0,6	0/1:12,0,127:3,3

So it's produced two samples, one with +TT on both alleles and one with +TT on one allele only. Tview shows me this:

 100001               
CACTAAGCAC**ACAGAGAATA
..........  ..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........**..........
..........AC..........
..........**..........
..........AC..........
..........**..........

The key thing here is the FORMAT fields:

GT:PL:AD 1/1:39,18,0:0,6 0/1:12,0,127:3,3

GT of 1/1 and 0/1, and AD of 0,6 and 3,3. That's spot on.

What am I missing about your scenario?

Also, is this still how people run multi-sample variant calling? I thought the world had moved on to gVCF. Do people do multi-sample analysis for small things like trios or tumor/normal analysis, and gVCF only for population studies? (I'm out of my depth here as I've only looked at single-sample calling, partly due to the lack of decent high quality truth sets.)

@jkbonfield
Contributor Author

jkbonfield commented Dec 13, 2023

Ah, there is still a problem if sample 1 is HOM +CA and sample 2 is HET +TT. We get CA called as both HOM and HET, and TT doesn't turn up anywhere, unless I split into two single-sample files, where it works fine.

 100001               
CACTAAGCAC**ACAGAGAATA
..........  ..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........TT..........
..........**..........
..........TT..........
..........**..........
..........TT..........
..........**..........
..........TT..........
..........**..........
..........TT..........
..........**..........

This is true for develop, indels-2.0, indels-cns (this PR), and #1679. I think it's a fundamental flaw of the shared code that does indel type selection.

Develop calls the second sample as 0/0 with AD 5,0. (Correct AD given the alleles, but wrong call).
indels-2.0 calls it 0/0 with AD 10,0. (Incorrect AD and call).
indels-cns calls it 0/1 with AD 5,5. (Correct call if it was a single-sample file, but TT is missing from ALT; should be 1/2 and 5,5)

If I had +CA and +TTT then it works, as we have a 2bp type and a 3bp type, but +CA and +TT leads to one type only, and so everything gets called as a 2bp insertion. Really it needs to check that the insertion consensus for each sample matches what it actually produced when aligning, i.e. CA,TT.

A useful spot, but again this isn't specific to this PR. It's just a weakness of the existing algorithm.
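The fix hinted at above amounts to keying candidate insertion "types" by the inserted sequence rather than just its length, so +CA and +TT stay distinct. A minimal sketch (hypothetical names, not the shared bcftools code):

```c
#include <assert.h>
#include <string.h>

/* Table of distinct insertion types keyed by sequence.  Returns the
 * index of seq in the table, adding it if not seen before, so two
 * different 2bp insertions get different type ids. */
#define MAX_TYPES 64
typedef struct { const char *ins[MAX_TYPES]; int n; } ins_types_t;

static int ins_type_id(ins_types_t *t, const char *seq) {
    for (int i = 0; i < t->n; i++)
        if (strcmp(t->ins[i], seq) == 0) return i;   /* existing type */
    t->ins[t->n] = seq;                              /* new type      */
    return t->n++;
}
```

With a length-only key, "CA" and "TT" would collapse into one type; keyed by sequence they remain separate candidates for alignment and genotyping.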

@jkbonfield
Contributor Author

jkbonfield commented Dec 13, 2023

Ah sorry, I misread it. HOM ALT and HOM REF... That does give an issue with this new code which I need to investigate, as the QUAL field is very low for some unknown reason. However the calls are still correct. The original PR in #1679 works fine.

Edit: blasted BAQ again! Disabled it and sure enough it works again. I've disabled it in most of the profiles, but I wasn't using one in my command-line tests and we can't change the default behaviour :(

So I still don't understand the failure case you observed. Data to demonstrate it would be useful please.

@jkbonfield
Contributor Author

Revised ONT parameters. Switched from -h80 to -h110 to remove false negatives (but at the expense of FP), and then --indel-bias 0.8 to try to remove those FPs (at the expense of FN). The combination appears to generally be a better FP/FN tradeoff, and is more sensitive to boot, so at QUAL 0 it has significantly better recall.

My graph here is going all the way up to 50% FN, which is very high, so the lower part is the more important bit for general usage.

These changes do unfortunately take a hit on genotype accuracy, but it's probably more important to identify the indels first; genotype accuracy, while important, is the next thing to assess. Overall GT accuracy is still quite high for all methods.

image
image

@jkbonfield
Contributor Author

Attempts to fix some failing tests led me to improve the way AD is calculated, although it still doesn't fix some of the failure cases. Overall it's a win though. I no longer reject alignments by setting the base type to 5 (unknown), instead preferring to adjust the quality of the alignment. This gives higher, more realistic, AD figures now. I think this addresses some of the concerns you had over the old PR? However it does so at a cost of slightly higher misdiagnosis of the real genotype.

Current WIP figures. "10h" was my internal name for what was listed as "new" above, and "12a" is my latest revision code name.


Illumina:

image

image

Better sensitivity at QUAL 0 and better FN vs FP tradeoff at most filter levels, but slightly less accurate on genotype assignment.

image
image

Similar deal for deep sequencing. GT is poorer, but both are miles ahead of dev or indels-2.0.
FP vs FN wise, it varies on the filtering level. Low QUAL this previous PR was a bit better (albeit less sensitive), while higher QUAL this latest commit is more discriminating. Broadly comparable.


BGI

image
image

Considerably more sensitive than the earlier version of this PR, although "dev" still beats it at lower QUAL filtering. As with Illumina, we're no longer so great on GT assignment accuracy, with another 1-2% additional false assignment.

image
image

At deeper data, the new commits really help on BGI, both for FN vs FP and for GT error. Both are lightyears ahead of the old dev branch.

See earlier comments for comparison vs the GATK results that were published along side the GIAB BGI data set, and --indels-2.0 results.

More data sets incoming...

@jkbonfield
Contributor Author

PacBio HiFi (new HG002, full data at 53x).

image
image

Big win for both FP/FN and GT/FN plots.

image
image

Still a big win for shallow data, although sensitivity is a little down on before. It's still better than develop though.

@jkbonfield
Contributor Author

Ultima Genomics (as above, it's not that deep as it was just a control sample).

image
image

New commit is slightly detrimental, but still ahead of develop.

@jkbonfield
Contributor Author

jkbonfield commented Jan 16, 2024

ONT "super"; approx 50x coverage

image
image

There is a cross-over where both the earlier version of this PR and dev give a better FN/FP tradeoff, but this is at nearly 35% lost calls. The new commits have significantly improved recall rates which probably offer a better tradeoff.

However as many cases before, we see this comes at a cost of less accurate genotype assignments.

For sub-sampled 15x coverage:

image
image

Similar to 50x, but closer in FN/FP ratio. The new commits show higher sensitivity still, and both are considerably ahead of develop.

@jkbonfield
Contributor Author

Following requests outside of PR review (but please do comment here rather than use out-of-band communication, for the sake of transparency and to keep any other interested parties informed), I did some analysis of the various heuristics which had accrued over time, and culled some, as while they originally helped they're no longer so relevant following changes elsewhere. Hence the code is a little simpler now.

Of the ones left, I have some examples, both pro and con, of the sort of indel calls they alter, along with summaries of the total numbers of changes. See the TEST- comments in the source code for the various code blocks ifdefed in or out. (TEST 1, 3 and 6 are still in place.)

Tests below were on a region of HG002 chr1. My main tuning and training set is done on chr1:10M-20M.


TEST 1

30x Illumina

    With           :    Without
    FP    GT    FN :    FP    GT    FN
    22    31    51 :    23    47    61
    18    31   102 :    16    47   100
    12    30   153 :    14    46   150
    12    29   201 :    10    46   201
    11    29   251 :    10    43   254
    11    28   304 :     8    43   304
    11    28   353 :     8    43   359
     9    28   407 :     7    43   400

It significantly benefits genotype accuracy and improves sensitivity.

At low QUAL filtering it's a good FN/FP tradeoff. At higher
filtering, the test adds to FP rate, indicating we're causing high
quality FPs due to a small amount of ref bias.

False negatives:
Example of missing (without H) high qual variant where one allele is
lost (0/1 or 1/1 vs 1/2):

HG002 chr1:12606136 chr1:14048395 chr1:14465398 chr1:18282635

Example where no variant is called (without H)
HG002 chr1:13602026 (Q40)
HG002 chr1:11987311 13456803 (low qual)

15x Illumina

    With           :    Without
    FP    GT    FN :    FP    GT    FN
    43    67   148 :
    39    67   152 :    36   103   157
    16    63   203 :    16    99   204
    14    59   252 :    12    93   259
    12    57   307 :    10    91   302
    10    56   350 :     8    89   358
     8    51   402 :     7    86   404

Again GT error is reduced with this heuristic and a slight improvement
to sensitivity. Again we see increased FP for fixed FN especially
once filtering to higher QUAL levels.

53x PacBio HiFi

With            :    Without
 FP    GT    FN :    FP    GT    FN
216    74    48 :
175    74    52 :   225    80    50
 48    73   102 :    51    79   106
 39    70   160 :    43    76   162
 34    70   202 :    37    76   203
 31    65   256 :    32    71   258
 31    62   305 :    32    68   306
 30    58   360 :    29    64   362
 29    50   416 :    28    57   418

Not a huge difference, but a few more false positives called for low
QUAL filtering and, again, a bit less sensitive at Q0.

False positives CAUSED by the heuristic:

    chr1    13602043        AT      A       81.0313 0/1
    chr1    15522545        C       CA      3.05998 0/1
    chr1    16368451        G       GT      42.2321 1/1

False positives CURED by the heuristic:

    chr1    10171892        A       ATAAG   8.51064 0/1
    chr1    10595815        C       CA      36.774  1/1
    chr1    11310767        A       AC      49.9678 1/1
    chr1    11937922        A       AT      52.4146 1/1
    chr1    14928448        C       CA      4.68159 0/1
    chr1    15203429        A       AT      11.5798 0/1
    chr1    15704701        G       GC      58.2334 1/1
    chr1    15738295        A       AG      61.2629 1/1
    chr1    17557345        A       AC      4.14406 0/1
    chr1    18999113        C       CA      7.74464 0/1
    chr1    18999124        T       TG      8.27893 0/1
    chr1    19664903        G       GC      52.2149 1/1

Looking at high quality ones:
CURED chr1:15738295 1/1 Baffling this is labelled as homozygous
It's a poly-G with 6 reads having +1G and 46 not. The +1G is false.
(There are phased variants before and after this position and totally
unphased with the poly-G changes)

CAUSED chr1:13602043 0/1 (called AT A by us and dev, ATGC A by indels-2.0)
Tricky large insertion at 13602026 which can be confused as MNPs or
multiple indels instead. Change is real and phased both up and down
stream. (NB: chr1:13599373 is a neighbouring challenging phased poly-T
difference. Real, but tricky to gauge exact run-lengths)

We call the main insertion, but this is false. May need to track
regions and call multiple indels in one go?

15x PacBio

With            :    Without
 FP    GT    FN :    FP    GT    FN
554   140   110 :   557   141   112
306   136   159 :   306   139   160
212   135   200 :   209   138   202
128   132   250 :   124   136   252
 99   127   308 :    96   130   310
 84   126   351 :    82   129   353
 75   121   400 :    72   125   401

Marginal difference on low coverage PB data too.

Conclusion: not huge, but adds a little to sensitivity and it's a
considerable reduction in genotype assignment errors, especially on
low depth data.

Consider tuning how much reference spike-in we really need to
optimise, or add a limit. Eg "if (rfract > 0.2) rfract = 0.2" will
gain the bulk of the benefit without so many edge cases?

above: double rfract = (r - t2)*.75 / (r+1);
try50: double rfract = (r - t2)*.50 / (r+1);

Change going FROM 75 TO 50 (chr1 10-20M).

  • bgi 10x: slight increase in GT error. Similar otherwise
  • bgi 30x: slight reduction in FP, a couple fewer FN
  • illumina 15x: very slight increase in GT
  • illumina 30x: almost identical

Not enough evidence to change yet.
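As a standalone illustration (not the actual bcftools code), the spike-in fraction with the proposed cap might look as follows, assuming the ".75" in the quoted expression is a multiplier on (r - t2):

```c
/* Hypothetical sketch of the reference spike-in fraction discussed above.
 * Assumes the ".75" in the quoted expression is a multiplier; r and t2
 * follow the names used in the snippet, and 0.2 is the proposed cap. */
double ref_fraction(int r, int t2) {
    double rfract = (r - t2) * .75 / (r + 1);
    if (rfract < 0)   rfract = 0;    /* guard against t2 > r */
    if (rfract > 0.2) rfract = 0.2;  /* proposed cap */
    return rfract;
}
```

With the cap, large r relative to t2 no longer dominates the fraction, which is the edge-case reduction suggested above.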

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jan 26, 2024

TEST 3

TEST 3: Compute indelQ in two ways (vs ref and vs next-best) and blend.

30x Illumina

 With           :    Without
 FP    GT    FN :    FP    GT    FN
 24    31    49 :
 24    31    50 :    22    31    51
 19    31   101 :    18    31   102
 13    30   153 :    12    30   153
 13    30   209 :    12    29   201
 12    30   254 :    11    29   251
 12    30   309 :    11    28   304
 12    30   355 :    11    28   353
 10    30   403 :     9    28   407

Helps sensitivity marginally, but with a marginal increase in FP too.

10x Illumina

 With           :    Without
 FP    GT    FN :    FP    GT    FN
 43    67   148 :    44    68   147
 39    67   152 :    40    68   151
 16    63   203 :    18    62   200
 14    59   252 :    15    60   253
 12    57   307 :    13    59   300
 10    56   350 :    11    57   352
  8    51   402 :     9    51   401
  7    47   455 :     8    48   452
  6    45   504 :     7    46   502

Small reduction to FP on shallow data, maybe marginal GT benefit too.

10x BGI

 With           :    Without
 FP    GT    FN :    FP    GT    FN
247   131   303 :   248   133   303
114   127   350 :   117   130   350
 81   121   404 :    83   123   405
 60   120   450 :    63   122   451
 36   116   500 :    40   118   505

Slight beneficial reduction in FP and GT.

53x PacBio

 With           :    Without
 FP    GT    FN :    FP    GT    FN
216    74    48 :   214    75    49
175    74    52 :   199    75    50
 48    73   102 :    48    74   106
 39    70   160 :    39    72   152
 34    70   202 :    34    71   207
 31    65   256 :    31    67   258
 31    62   305 :    31    63   309

Significant reduction in low-qual FP, but almost nil impact elsewhere
bar the most minuscule GT change.

15x PacBio

 With           :    Without
 FP    GT    FN :    FP    GT    FN
554   140   110 :   545   142   114
306   136   159 :   321   139   154
212   135   200 :   220   137   200
128   132   250 :   137   135   250
 99   127   308 :   102   130   305
 84   126   351 :    84   128   353
 75   121   400 :    74   123   410

A slight reduction in FP rates, and slighter still to GT.
Also a slight improvement in Q0 sensitivity.

Examples of low-coverage changes:

False positives CAUSED by the heuristic:

    chr1    10018582        CT      C       4.37617 0/1
    chr1    11799633        GT      G       5.56825 0/1
    chr1    12201471        TA      T       4.75713 0/1
    chr1    12580996        TA      T       12.4837 0/1
    chr1    14615625        CA      C       4.86644 0/1
    chr1    17445094        CA      C       4.40545 0/1
    chr1    17479481        CA      C       17.9461 0/1
    chr1    19200268        TA      T       6.58196 0/1
    chr1    19944115        AT      A       11.603  0/1

False positives CURED by the heuristic:

    None

It appears the heuristic is detrimental, and indeed it makes
mistakes. We see this in the QUAL 0 FP count being 554 vs 545. However
they are all low quality.

chr1:17479481 is highest qual here. It's a poly-A with counts
13A=2, 12A=9, 11A=5. Indels-2.0 and dev both call it too (qual 37 and
5 respectively).

Checking the COMMON FPs, with the last two columns being QUAL (with, without):

1       chr1    14146180        G       GT      0/1     6       5
2       chr1    15556125        CT      C       0/1     22      19
2       chr1    18270779        C       CA      0/1     22      20
3       chr1    15805897        CT      C       0/1     7       4
4       chr1    11464319        ATT     A       0/1     113     108
4       chr1    19710270        ATT     A       0/1     84      79
6       chr1    10967155        GAAAA   G       0/1     148     142
6       chr1    13666839        CT      C       0/1     11      4
11      chr1    10500241        TA      T       0/1     21      10
11      chr1    11196849        TA      T       0/1     15      4
11      chr1    11331008        TA      T       0/1     15      4
14      chr1    11115059        TA      T       0/1     33      19
21      chr1    16084815        ACC     A       0/1     148     126
22      chr1    12483996        CA      C       0/1     37      15
23      chr1    15847116        TA      T       0/1     30      7
26      chr1    12068876        CT      C       0/1     66      40

Sorted by the delta of QUAL-with minus QUAL-without.
There are none where the delta is negative, so these 16 of 545 have a
higher QUAL with the heuristic than without, i.e. the FPs are stronger.

Again, this looks detrimental, particularly for those with a QUAL delta
of 20 or more. chr1:12068876 is a long poly-T: 14T=2, 13T=8, 12T=8.

Looking at true variants, I see 1974 common calls. 105 have higher
QUAL with the heuristic (max delta +40) and 8 with QUAL lower (maximum
delta is -4).

QUAL delta on COMMON TP, >= 20.

20      chr1    17748405        CT      C       0/1     72      51
21      chr1    10566493        CT      C       0/1     94      72
21      chr1    16084815        AC      A       1/0     148     126
21      chr1    19138672        TA      T       0/1     84      62
23      chr1    11713198        CA      C       0/1     55      31
23      chr1    18424845        CA      C       0/1     53      30
24      chr1    14595143        CA      C       1/1     158     134
30      chr1    16096322        CA      C       1/1     115     84
30      chr1    16333436        AC      A       1/1     138     108
30      chr1    18611271        GAGGAAGGAAGGA   G       1/0     96      65
40      chr1    15850400        GT      G       1/1     135     95

So the heuristic benefit is primarily one of discrimination which is
hard to demonstrate with single examples. It's just a numbers game.

This isn't surprising given the heuristic isn't really changing which
calls are made (that is mainly affected by seqQ), but is instead
changing their qualities.

Conclusion: a small benefit overall.

@jkbonfield

TEST 6

Depth based seqQ cap in bam2bcf.c

15x PacBio

 With           |    Without
 FP    GT    FN |    FP    GT    FN
554   140   110 |   791   148   103
306   136   159 |   483   144   150
212   135   200 |   330   142   203
128   132   250 |   231   140   252
 99   127   308 |   178   135   306
 84   126   351 |   154   129   359
 75   121   400 |   129   124   406

53x PacBio

 With           |    Without
 FP    GT    FN |    FP    GT    FN
216    74    48 |   686   138    37
175    74    52 |   243   137    50
 48    73   102 |   146   128   100
 39    70   160 |   120   120   153
 34    70   202 |   110   110   201
 31    65   256 |   102    99   255
 31    62   305 |    98    92   302

Huge reductions in FP rates, and for deeper data also for GT
assignment errors. This is instrumental in improving calling
accuracy, at least of diploid data.

30x Illumina

 With           |    Without
 FP    GT    FN |    FP    GT    FN
                |    52    37    42
 22    31    51 |    38    37    50
 18    31   102 |    27    35   103
 12    30   153 |    22    33   150
 12    29   201 |    21    33   201
 11    29   251 |    20    31   251
 11    28   304 |    17    26   302
 11    28   353 |    17    22   350
  9    28   407 |    13    20   405

Significant FP reduction

100x BGI

 With           |    Without
 FP    GT    FN |    FP    GT    FN
 17    11   121 |   110    70   120
  9     9   150 |    38    64   151
  8     8   200 |    30    56   239
  6     8   252 |    14    40   269
  6     7   306 |  
  6     6   355 |    11    37   378
  5     5   400 |    11    33   404

Similar sensitivity, but huge FP and GT error reduction.

Without this rule it's also highly overconfident in QUAL. It's not
calibrated, but Q170 implies a 1 in 10^17 chance of an error, yet we have
38 errors out of 1921 calls => Q17.

BGI 100x analysis:

False positives CAUSED by the heuristic, sorted by qual:

none

False positives CURED by the heuristic, sorted by qual:

11263494        220.12  TAC     T       0/1 54,12   no variant
10090793        220.041 TTG     T       0/1 86,17   no variant
11192748        195.274 TAA     T       0/1 2,19    truth TA   T
11379137        195.08  CAA     C       0/1 1,16    truth CA   C
11782918        194.834 CAA     C       0/1 11,24   truth CA   C
11265500        194.728 ATTT    A       0/1 4,17    truth ATT  A
11140983        194.719 G       GAA     0/1 6,17    truth G    GA
14193661        194.626 GTT     G       0/1 1,16    truth GT   G
19464020        194.577 CAA     C       0/1 3,12    truth CA   C
11618673        194.436 CAA     C       0/1 8,11    truth CA   C
11398727        194.416 A       ATT     0/1 14,11   truth A    AT
10523628        194.257 TAA     T       0/1 3,16    truth TA   T
16183133        194.201 CAA     C       0/1 10,15   truth CA   C
17486147        193.968 C       CAA     0/1 3,8     truth C    CA
18864334        193.947 CTT     C       0/1 10,12   truth CT   C
11603885        193.819 C       CAAA    0/1 6,17    truth C    CAA
15284478        193.489 C       CAAA    0/1 5,13    truth C    CAA
19820846        193.333 TAA     T       0/1 6,15    truth TA   T
14933849        193.113 CTT     C       0/1 4,9     truth CT   C
10514156        191.236 C       CTG     0/1 0,11    truth C    CTGTG
10918938        191.088 ATTTT   A       0/1 0,17    truth ATTT A
10454856        190.293 CTTTT   C       0/1 0,12    truth CTTT C
16187322        189.738 TAAAA   T       0/1 0,13    truth TAAA T
10967155        189.57  GAA     G       0/1 0,7     truth GAAA G
15833706        189.247 TA      T       0/1 0,12    truth TAA  T
12192276        186.096 CTG     C       0/1 70,11   no variant
10984598        117.13  C       CT      0/1 67,15   no variant
15867712        107.302 CT      C       0/1 82,19   no variant
19699486        103.187 CAT     C       0/1 69,10   no variant
...
(93 entries)

chr1:11263494 (no var) is 12 -AC out of 91, so could be justified as a
variant, but with such deep data we expect a more reasonable allele
rate so the likelihood of it being true is decreased (which is
achieved by capping seqQ per read). Similarly chr1:10090793.
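Purely as an illustration of the idea (the actual rule in bam2bcf.c is not reproduced here; the 0.25 threshold and linear shrink are invented for this sketch), a depth-based cap can shrink per-read seqQ when an allele's observed fraction falls well below what a true heterozygote would show at this depth:

```c
/* Illustrative only: cap per-read seqQ when the allele fraction is far
 * below the ~50% expected of a true het, so that a handful of reads in
 * very deep data cannot accumulate unbounded support.  The threshold
 * and shrink factor here are invented, not bcftools' values. */
int cap_seqq(int seqq, int allele_depth, int total_depth) {
    double frac = total_depth ? (double)allele_depth / total_depth : 0;
    if (frac < 0.25) {
        int cap = (int)(seqq * frac / 0.25);  /* shrink toward zero */
        if (seqq > cap) seqq = cap;
    }
    return seqq;
}
```

For chr1:11263494's 12 reads out of 91, this style of cap reduces each read's contribution, matching the intuition that with deep data a true variant should show a more plausible allele rate.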

The others are called as GT 1/2 when in reality it's 0/1 or 1/1.
(We list them as 0/1 above due to separating the alleles for the
purposes of allele counting.)

Eg chr1:11398727 is identified as 14 REF, 36 +1T, 11 +2T.
We call 0/1 with the heuristic or 1/2 without. Truth is 1/1, and the few
REF matches are erroneous.

Comparison of those with indels-cns (with heuristic), indels-2.0 and
develop shows the new heuristic works well.

          Truth                 Indels-cns      indels-2.0      dev
10090793                                        FP 1/2          FP 0/1
10454856  CTTT  C       1/1     Y               1/2             1/2
10514156  C     CTGTG   1/1     Y               1/2             1/2
10523628  TA    T       1/1     Y               1/2             1/2
10918938  ATTT  A       1/1     Y               1/2             1/2
10967155  GAAA  G       1/1     Y               1/2             1/2
11140983  G     GA      1/1     Y               1/2             1/2
11192748  TA    T       1/1     Y               1/2             1/2
11263494                                        FP 0/1          1/2
11265500  ATT   A       1/1     Y               1/2             1/2
11379137  CA    C       1/1     Y               1/2             1/2
11398727  A     AT      1/1     0/1             1/2             1/2
11603885  C     CAA     1/1     Y               1/2             1/2
11618673  CA    C       1/1     Y               1/2             1/2
11782918  CA    C       1/1     Y               1/2             1/2
12192276                                        FP 0/1          1/2
14193661  GT    G       1/1     Y               1/2             1/2
14933849  CT    C       1/1     Y               1/2             Y
15284478  C     CAA     1/1     Y               1/2             1/2
15833706  TAA   T       1/1     Y               1/2             1/2
15867712                                        FP 1/2          1/2
16183133  CA    C       1/1     Y               1/2             1/2
16187322  TAAA  T       1/1     Y               1/2             1/2
17486147  C     CA      1/1     Y               1/2             1/2
18864334  CT    C       1/1     Y               1/2             1/2
19464020  CA    C       1/1     Y               1/2             1/2
19699486                                        FP 0/1
19820846  TA    T       1/1     Y               1/2             Y

False positives SHARED by both with/without heuristic (with QUAL scores)

10523825    162.015 TAAA    TA,T     1/2 0,26      truth TAA T 1/1
17775360    108.287 TGGA    T        0/1 44,63     diff representation
12536917    103.504 C       CTT,CTTT 0/1 6,12      truth C CTT 1/1
17967242    94.4016 G       GATTC    0/1 37,37     diff representation
...

chr1:10523825 vs ref, I see -4A=9 -3A=19 -2A=66. It seems unlikely we'd
not call 1/2 (it's also called by indels-2.0 and devel); it's just a hard
judgement in the data. We can get a correct call with "-o 10 -e 1", but
that leads to false negatives elsewhere (tested as 540 at Q0).

False negatives CAUSED by the heuristic, sorted by qual:

16060510    50      C       CT      0|1 23,29

False negatives CURED by the heuristic, sorted by qual:

none

However each platform needs a different seqQ cap, as the biases vary.
It depends on the amount of systematic bias, and hence how much we can
trust depth. There may be some over-fitting to specific runs here, as
I've only tested one run per platform (sometimes that's all I have
available for a known truth set), but I tuned per platform and hope the
bulk of the variation is platform specific rather than run specific. We
may need to retune in future.

It would probably also harm calling of cancer data, where we're not
expecting Hardy-Weinberg statistics, but that needs its own dedicated
caller anyway.

@jkbonfield

jkbonfield commented Jan 26, 2024

All of the data files here are public GIAB files. The HG002 ones are downloaded locally in Sanger at ~jkb/lustre/GIAB/chr1/

Files:
HG002.GRCh38.15x.RG.bam
HG002.GRCh38.30x.RG.bam
HG002.GRCh38.60x.RG.bam
HG002.SequelII.merged_15kb_20kb.pbmm2.GRCh38.haplotag.10x.bam
pb_new.15x.bam (subsample of above)
BGI.100x.bam
BGI.30x.bam
BGI.10x.bam

Plus in the ../ont and ../ultima directories too.
There is considerable potential for creating examples to add to your evaluation set, but I'll let you explore which you think are best. I just gave an example summary above based on bcftools isec to find new errors created or old errors fixed.

@jkbonfield

To answer the suggestion that I've been overfitting the data by tuning each set of parameters so much, I took the GIAB HG005 (Chinese Trio) sample instead. This isn't ideal as it's less well studied so the truth set may have more errors, and there aren't so many machine runs uploaded, but it's sufficient to get an idea of how it performs.

First up BGI:

15x


It's not ideal, as we're behind devel on the FP vs FN curve, but the gap
closes once we start filtering out low-QUAL calls, although it never
quite passes it. Genotype accuracy is better though - ballpark down from
420 errors to 360 errors. (Not plotted, as it's almost a straight line:
all genotype assignment errors are from high-quality calls that don't
get filtered.)

130x

We have around 3x fewer false positives for the same recall rate, plus our unfiltered recall is considerably higher.
Indels-2.0 is performing really badly on this data set.

As before, GT accuracy is just vertical lines in the 0-3% FN bracket. At Q0 it's dev=301, indels-2.0=333, indels-cns=13. A colossal improvement in genotype assignment accuracy.

@jkbonfield

Next up, Illumina HiSeq. It's not the most modern data, but I haven't yet found anything newer for HG005. I'll have a look next week maybe.

15x

Indels-cns has a considerable boost to recall rate, while broadly speaking following the FP-vs-FN trend of indels-2.0 on the shallow data once we start to filter by QUAL. As before, we're a bit behind on genotype assignment accuracy for shallow data, with QUAL0 being dev=146 err, indels-2.0=284, and indels-cns=186. So not hugely behind, but a bit.

50x

At 50x we've got enough data to start making reasonable calls for both devel and indels-cns, although once again indels-2.0 is showing itself to be a poor performer with regards to false-positives. Indels-cns is the most sensitive method. With the deeper data however it's now also the most accurate at genotype assignment, with Q0 error counts being dev=120, indels-2.0=204, indels-cns=23. So 4-5x fewer errors there.

300x

The full data set is overly deep. I have --indels-2.0 still running, but it's likely to end up in the same position. Both dev and indels-cns understandably do a good job, but indels-cns is still the clear winner. It's now also only making 4 genotype assignment errors, compared to 57 for develop.

@jkbonfield

jkbonfield commented Jan 26, 2024

Now for PacBio HiFi data. Note on the long read data, I find --indels-2.0 always crashes, so the results are missing.

15x
Both algorithms are really error prone and call lots of false positives. Genotype assignment error is very similar between the two, at about 6% error rate.

Nonetheless, indels-cns has a much better sensitivity for any given FP rate (or vice versa).

50x
(Sorry for typo in setting plot title - it is 50x, not 10x)

This is now looking like chalk vs cheese! The FN rate of devel is really poor, and misses 3x as many indels as --indels-cns. The false positive rate is more respectable, but again indels-cns is considerably ahead. To get to ~160 false negatives (where dev starts at Q0), indels-cns is making around 10x fewer false positives, mainly because it started the curve so much lower down.

As we saw before once we get deeper data, genotype assignment is now considerably improved over develop, dropping from approx 640 errors down to 240.

@jkbonfield

jkbonfield commented Jan 26, 2024

Conclusion: I hope this addresses your concerns about over-fitting. I haven't demonstrated it on ONT (I'm not sure if modern "super" basecaller data is available for HG002), but it looked promising with HG002.

This algorithm is considerably better at assigning genotype, and the recall vs precision graph is strongly in favour of the new algorithm. Sadly it's strongly demonstrating that indels-2.0 is suffering, so I would also recommend (in a separate PR) marking it as deprecated once this PR is merged.

Files for the HG005 analysis are locally in ~jkb/lustre/GIAB/HG005. That includes BAM files, VCFs, and plots. You should be able to do bcftools isec on any of the dev vs "edlib" (this PR) VCFs to generate example new-FP, new-FN, old-FP and old-FN corpuses for your test system from there.

@jkbonfield

ONT data too. This is terrible at indel calling with bcftools, as it's old data from an old base-caller, so error rates are quite poor, especially for indels. Modern ONT base-callers are far superior, but I don't have any such data for HG005.

It's pretty ropey with a huge FP rate (vs about 5% in the "Super" mode for HG002 above). However it's clearly more sensitive, and at all filtering levels it offers a better FN vs FP tradeoff than develop. Likewise the genotype error rate is significantly reduced at all qual levels (for a given FNr).

@jkbonfield

Some newer data - HG005 Ultima Genomics. Unlike the HG002 data this was quite deep.

59x Ultima

This data came to me with DeepVariant and GATK HaplotypeCaller joint results. Surprisingly bcftools is doing really well here compared to others on recall, although our genotype assignment accuracy is lagging. This PR extends the lead further.

15x Ultima

As seen on other instruments, we're a bit behind develop on accuracy of genotype assignments for shallow data, but it's more than made up by the improved recall overall.

@jkbonfield jkbonfield marked this pull request as ready for review January 29, 2024 15:41
@jkbonfield

jkbonfield commented Jan 29, 2024

I've moved this out of draft as I'm now happy with it to be reviewed.

Hopefully this demonstrates the gains to be had. I've validated it now makes zero difference to the default usage of bcftools.

Note however it does change the non-versioned profiles, eg -X illumina now maps to -X illumina-1.20, which uses the newer variant caller. This could be changed, but the parameters in the profile have been tuned with the new caller in mind, so using -X illumina-1.20 --no-indels-cns won't give the same results, and it may not be better than plain -X illumina-1.18 either.

In my mind, we should always make sure the non-version-numbered profiles use the best options available, as that's what most people are going to use. It's only people who want continuity between releases, upgrading purely for major bug-fixes but not accepting anything that changes output, who should consider using a release-specific parameter set. (To this end, the -1.20 profiles also contain a bug fix to the IDV and IMF INFO fields which could go into 1.18, but I'm taking the request to not change old profiles literally here.)

@pd3
Copy link
Member

pd3 commented Jan 30, 2024

In my mind, we should always make sure the non-version numbered profiles use the best options available as that's what most people are going to do.

A major revamp like this needs to be tested by more people and on more data sets, before it becomes the new default.

To this end, the -1.20 also contain a bug fix to IDV and IMF INFO fields which could go into 1.18, but I'm taking the request to not change old profiles literally here

Bug fixes are a different thing. Not sure why we wouldn't want to include it.

@jkbonfield

In my mind, we should always make sure the non-version numbered profiles use the best options available as that's what most people are going to do.

A major revamp like this needs to be tested by more people and on more data sets, before it becomes the new default.

Understood, but what's the best naming? We previously discussed changes, and you made it clear that, irrespective of testing, this could never become the default as people don't expect changing results.

However I took the approach that bcftools without command line arguments = default, while bcftools with specified profiles could differ. For consistency, I felt that people who are using a profile and have a hard requirement that the output never changes should use a profile named after a bcftools release (eg illumina-1.18), while people who do not require hard never-changing results could use a non-release-tagged profile that upgrades automatically to what we believe represents the best caller. That feels like a good pragmatic solution and is what I've done here.

Do you agree?

If so I guess the final question is what do we do about new profiles that are undergoing testing? Should I just rename them to 1.21, so there is a clear statement that it's experimental for this release and (if no problems found) will become the new default next time around? That feels like good signalling.

To this end, the -1.20 also contain a bug fix to IDV and IMF INFO fields which could go into 1.18, but I'm taking the request to not change old profiles literally here

Bug fixes are a different thing. Not sure why we wouldn't want to include it.

Well, because of the above statement basically. Also when is a bug a bug?

The fix I'm thinking of is that IDV and IMF are computed from the number of reads containing the indel as originally called, not the number remaining after we validated them. We've always had this, as bcftools has always realigned reads (previously to a patched ref, now to a consensus), but it used the wrong data when filling out INFO. We could argue over bug or feature, and I can see both sides, which is why I chose not to fix it.

However it comes down to the same issue. If we take the route that people should never be getting changed results, which was your firm belief before and why you rejected the previous PR, then we also shouldn't be making such changes without producing a new version-numbered profile.

@pd3

pd3 commented Jan 31, 2024

My apologies, I completely misunderstood your words. I agree with what we agreed before:

  • profiles like -X illumina are what we think is best at the moment
  • profiles like -X illumina-1.18 remain stable forever, possibly with glaring bugs fixed
  • major changes are added as a new option at first. After a few releases they can become the new default.

It was not my brightest moment, sorry again.

@jkbonfield

jkbonfield commented Jan 31, 2024

No problem. It's a new direction and not something we've done before. I'm not sure it's perfect, but it feels like a reasonable strategy.

The question still remains though as to when the non-versioned profile switches. If this is present in bcftools 1.20, but we don't want "bcftools mpileup -X illumina" to select it yet, then we need a different profile name than "illumina-1.20". I'm open to suggestions, but maybe illumina-1.21 or illumina-experimental? The problem with the latter is we'd want to delete it after it becomes non-experimental.

New options is a tricky one too. The profile configuration and the option are inherently entangled. We could obviously do -X illumina-1.18 --indels-cns, but it wouldn't be as good as -X illumina-1.20 as the former is tuned around BAQ scores while the latter is tuned around edlib + consensus scores. (Although -X illumina-1.18 --indels-cns should still be an improvement).

Edit: actually that's a really bad example as there were no changes made to Illumina! However PacBio for example gained a lot of new parameters, such as --poly-mqual and --del-bias, as well as other parameters (-h) being preferable with the different reference sequence construction and alignment scoring algorithm.

I don't have --indels-cns benchmarked using the old profile, but I do have the old variant caller (bam2bcf_indel.c) with new and old parameters, minus the --indels-cns bit:

pacbio 50x                                                                      
 | dev-1.18          |  dev 1.20          |  indels-cns                         
 |    FP    GT    FN |    FP    GT    FN  |    FP    GT    FN                   
-+-------------------+--------------------+------------------                   
 |                   |                    |  1721   240    51                   
 |                   |                    |   576   239   106                   
 |  3569   647   159 |                    |   381   238   157                   
 |  1804   641   200 |                    |   315   238   201                   
 |  1175   635   257 |  3186   619   276  |   254   238   270                   
 |   898   625   300 |  2125   617   300  |   231   238   312                   
 |   751   617   350 |  1177   613   352  |   210   238   353                   
 |   661   607   412 |   858   606   401  |   195   238   413                   
 |   615   592   464 |   710   601   456  |   178   238   470                   
 |   566   585   519 |   601   584   515  |   170   238   509                   

So we see that using -X pacbio-ccs-1.20 --no-indels-cns instead of -X pacbio-ccs-1.18 makes the indel caller less sensitive, with a higher FP rate for any given FN rate. In short, considerably worse. However combined with --indels-cns it becomes considerably better. If you're interested I could try -X pacbio-ccs-1.18 --indels-cns combined to show how that works too, to get the 4th combination.

Edit: I ran HG002 chr20 with -X pacbio-ccs-1.18 --indels-cns now and see this:

    FP    GT    FN
  2535   285    42 
  1210   285    50 
   553   284   102 
   431   283   153 
   345   283   202 
   308   283   253 
   280   283   302 
   248   283   368 
   232   283   424 
   218   283   456 

So at ~50FN it has fewer FP, but if we apply QUAL filtering to get approximately comparable FP and FN (maximising discrimination and F1 score) we see around 20% more errors. That's still a huge improvement over the ~double error rate of develop, so just adding --indels-cns to the old profile is still viable in this single test, just not optimal.

@jkbonfield jkbonfield changed the title DRAFT: revision of 1679 plus switch to edlib over BAQ revision of 1679 plus switch to edlib over BAQ Feb 1, 2024
This is a combination of PR samtools#1679 from 2022 coupled with more recent
changes to replace BAQ with edlib for evaluating the alignment of
reads against candidate alleles.  Due to the complexity of rebasing
many commits with many conflicts, these have been squashed together
for simplicity.  Please consult the original commit log in the PR for
more detailed history.

PR 1679 is a major restructuring of the indel caller that produces
proper consensus sequences.  TODO: paste collected commit messages
here.

The newer edlib changes are designed to dramatically increase the speed
of indel calling on long-read data by use of a modern algorithm.  Note
edlib's alignments are very crude, with just +1 for all differences,
including substitutions and each indel base.  However if we've
computed our candidate allele consensus sequences correctly, then
aligning against the real allele should give a minimal score regardless
of scoring regime, so the extra cost of an HMM or even affine gap
weights does not help much.

The lack of quality values is compensated for by detection of minimum
quality within STRs.
(TODO: rename edlib to something better)
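For illustration, the unit-cost scoring described above is plain Levenshtein edit distance. A minimal O(n*m) single-row DP sketch follows; edlib itself uses a far faster bit-vector algorithm, so this only demonstrates the scoring regime, not the implementation:

```c
/* Unit-cost edit distance of the kind the text ascribes to edlib:
 * +1 for every substitution and every inserted or deleted base.
 * Simple single-row dynamic programming; illustrative only. */
#include <string.h>

int edit_distance(const char *a, const char *b) {
    int n = strlen(a), m = strlen(b);
    int d[m + 1];                           /* current DP row (C99 VLA) */
    for (int j = 0; j <= m; j++) d[j] = j;  /* "" vs b[0..j-1] */
    for (int i = 1; i <= n; i++) {
        int diag = d[0];                    /* holds d[i-1][j-1] */
        d[0] = i;
        for (int j = 1; j <= m; j++) {
            int sub = diag + (a[i-1] != b[j-1]); /* match/substitution */
            int del = d[j] + 1;             /* delete a[i-1] */
            int ins = d[j-1] + 1;           /* insert b[j-1] */
            diag = d[j];
            d[j] = sub < del ? (sub < ins ? sub : ins)
                             : (del < ins ? del : ins);
        }
    }
    return d[m];
}
```

Note how a one-base indel in a homopolymer costs exactly 1 regardless of where in the run it is placed, which is why the surrounding heuristics matter.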
This enables the homopolymer scan-left/right for minimum quality
for adjusting seqQ and indelQ.  This is good for machines with mainly
indel errors, and detrimental for clocked instruments such as
Illumina.
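A minimal sketch of the scan-left/right idea, with illustrative names (not bcftools' actual function): starting from a position, walk outwards while the base repeats and track the minimum base quality seen in the run.

```c
/* Illustrative sketch: find the minimum quality within the homopolymer
 * run containing position pos.  Function and parameter names are
 * invented for this example. */
int min_qual_in_homopolymer(const char *seq, const unsigned char *qual,
                            int len, int pos) {
    char base = seq[pos];
    int minq = qual[pos];
    for (int i = pos - 1; i >= 0 && seq[i] == base; i--)   /* scan left */
        if (qual[i] < minq) minq = qual[i];
    for (int i = pos + 1; i < len && seq[i] == base; i++)  /* scan right */
        if (qual[i] < minq) minq = qual[i];
    return minq;
}
```

The returned minimum can then feed into seqQ/indelQ adjustment, penalising indel calls sitting in low-quality homopolymer runs.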
When we have eg REF C and ALT CT,CTT then we're saying it's +1 or +2
Ts.  We align all reads against the 3 candidate alleles and score
them, sorting these scores to find the best so we can allocate reads
to the alleles.

The score for those allele assignments was REF minus best-indel for
REF assignments, and best-indel minus REF for indel assignments.  This
change turns the latter into best-indel vs next-best-allele,
regardless of whether that next best is REF or not.

Consider the case of +1 score 30, +2 score 30, +0 (REF) score 60.
Previously we'd have recorded the relative quality of 60-30, but now
we record 30-30.  The consequence of this is reads that align equally
well against +1 and +2 get zero confidence in their correct allele
assignment.

This considerably reduces the chance of recording GT 1/2 for variable
homopolymers caused purely by the vagaries of sorting a bunch of
equal numbers.  (The equal scores often arrive due to reads that don't
span the homopolymer or STR.)  We may miss some true variants as the
multi-allelic possibilities all cancel each other out, but typically
that means there are very few reads spanning the site and this data
may well be false-positive caused by sequencing artifacts.  So overall
it's a considerable benefit to accuracy.

A consequence of this is also that AD values now more accurately
reflect evidence, rather than incorrectly distributing the
uninformative reads to alleles.

However, AD will now also be lower, and similarly PL.  So a global
boost to indelQ helps recover some of this lost sensitivity.  This
commit also boosts indelQ by the --indel-bias param to make this more
impactful. (Previously it only boosted seqQ)
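The scoring change can be sketched as follows (illustrative, not the bcftools implementation). With edlib-style scores where lower is better, a read's confidence in its assigned allele is the gap to the next-best allele, so the +1/+2 tie at score 30 from the example above yields zero confidence:

```c
/* Illustrative sketch of best-vs-next-best allele assignment.  score[i]
 * is the read's alignment score against allele i (lower = better).
 * Returns the best allele index; *conf_out is the confidence in that
 * assignment, 0 when two alleles tie. */
#include <limits.h>

int assign_allele(const int *score, int nalleles, int *conf_out) {
    int best = 0;
    for (int i = 1; i < nalleles; i++)
        if (score[i] < score[best]) best = i;
    int next = INT_MAX;
    for (int i = 0; i < nalleles; i++)
        if (i != best && score[i] < next) next = score[i];
    *conf_out = next - score[best];
    return best;
}
```

Under the old scheme the +1 allele here would have scored 60-30 against REF; now its score is 30-30 = 0, so reads that cannot distinguish +1 from +2 contribute no evidence to either.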
These use VT's "decompose_blocksub" command, although since writing this
it may now be possible natively with bcftools.  If so, I haven't
evaluated how it differs.

(These scripts originated in Crumble evaluation many years ago, so I
could see the impact of reducing quality values on the SynDip data
set.)
It's faster and more accurate than the old mode.  The speed comes
from replacing BAQ with edlib, but the accuracy comes from the use of
allele consensus generation and better heuristics.

- Cull lots of ifdefed out code, plus reset indel-bias back.

Earlier experiments had increased this score by doing indel_bias/10
when it's used (it's 1/score in effect).  The difference isn't great,
and the FN/FP ratio is marginally better with it in the original
usage.  We can still manually change this score to get better
sensitivity if desired, but it's no longer dialed up so high by
default.
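As a rough illustration of the "1/score" point, assuming the
command-line --indel-bias score is stored as its reciprocal (the
helper name and scaling here are hypothetical, not the exact bcftools
arithmetic): dividing a quality by the stored value boosts it when
the user raises the score, and the indel_bias/10 experiment simply
made that divisor ten times smaller.

```c
#include <assert.h>

/* Hypothetical sketch only: --indel-bias score b is stored as 1/b,
 * so dividing by the stored value scales quality up for b > 1 (more
 * sensitive) and down for b < 1 (more specific). */
int boost_indelQ(int indelQ, double stored_bias /* = 1/score */) {
    int q = (int)(indelQ / stored_bias + 0.5);
    return q > 255 ? 255 : q;  /* clamp to a byte-sized quality */
}
```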

- Remove an indel caller hang.

Our consensus can sometimes fail and give us a short alignment that
doesn't span pos.  We were checking the wrong end coord.
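A minimal sketch of the kind of guard involved (hypothetical type and
function names; the real check lives in the indel-caller loop):

```c
#include <assert.h>

/* Hypothetical sketch: a consensus-derived alignment must span the
 * candidate indel position before we use it.  Previously the end
 * coordinate check was wrong, so a too-short alignment could make
 * the caller hang. */
typedef struct { int beg, end; } aln_span;  /* 0-based, half-open */

int spans_pos(aln_span a, int pos) {
    return a.beg <= pos && pos < a.end;  /* the end coord matters too */
}
```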
The current IDV and IMF come from the initial "types" assignment,
before realigning reads back to the consensus alleles.  If we're
outputting AD then we have more accurate re-aligned counts of each
type, so we report those instead.
The main benefit of the new indel caller is the use of alignment
against a generated diploid consensus instead of simply patching the
reference with candidate indel types.

This greatly reduces false positives and incorrect allele alignment
(leading to wrong genotype calls).  This was the earlier PR samtools#1679, but
has since acquired edlib as an alternative to BAQ for the indel
alignment.  However this is primarily a speed benefit (with some
minor removal of false-negatives due to quality smashing), and
isn't the main thing users should think about when choosing an indel
caller.

Also tidied up the usage statement and added an explicit "-X list" to
print the profile parameters.

- Add extra debugging defines.

GLF_DEBUG reports GLF calculation in bam2bcf.c.

ALIGN_DEBUG uses edlib EDLIB_TASK_PATH to report sequence alignment.
NB to use this you need to link against edlib itself rather than the
cutdown version in this repository.

Also fix the edlib heuristics used in bcf_call_glfgen.  We don't want
to change the call (b=5) as this affects AD.  Instead we change the
quality so poor calls get filtered by QUAL rather than simply being
removed.

- Tweak edlib tuning for SeqQ/qual.
Add quality value assessment into soft-clip recovery.
Use /500 instead of /111 in indelQ assignment, and skew indel-bias
accordingly.  This gives better separation of FP/GT/FN generally.

- Added --seqq-offset parameter so we can use it in tunables per
profile.  This is used as a limit on the seqQ reduction in the
"VAL-5*MIN(20,depth)" formula, used for favouring data over seqQ
scores when depth is sufficient.  Experimentation showed no single
value that worked for all platforms, but the default is in the
middle.
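One plausible reading of that formula, as a hedged sketch
(`adjusted_seqQ` is a hypothetical helper, and the exact clamping may
differ; consult the source for the real usage):

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical sketch: depth discounts seqQ by 5 per read up to
 * depth 20 (max reduction 100), favouring data over seqQ when depth
 * is sufficient, but --seqq-offset caps how much of that reduction
 * is applied for a given profile. */
int adjusted_seqQ(int seqQ, int depth, int seqq_offset) {
    int reduction = MIN(seqq_offset, 5 * MIN(20, depth));
    int q = seqQ - reduction;
    return q < 0 ? 0 : q;
}
```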

- Tidy up to cull ifdefed and commented out code.

- Add test for indels-cns.

It's minimal, but the whole of indel calling has minimal testing.  I
think we have under 10 indels in total in develop (all short-read
and mostly duplications of each other), and no testing of indels-2.0.

This tests 4 indels with indels-cns.

- Added documentation for the new --indels-2.0 options

- Cull more unused parts of edlib.c.

This avoids clang warnings (which become errors with -Werror).
We're only including the bits we need here for mpileup.  If you want
the whole thing, link against the upstream source library instead.
    vcfbuf.c:249:32: error: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Werror,-Wsingle-bit-bitfield-constant-conversion]
                buf->vcf[i].af_set = 1;
                                   ^ ~