revision of 1679 plus switch to edlib over BAQ #2045
Also, what do people want to see for evidence? I'm thinking of graphs similar to the ones I produced in the last plots, but I may adjust the counting. My view is that for diploid organisms we're making 2 calls at every site, one for each allele. For simplicity I want to stick with precision vs recall, so we can count REF/REF, REF/ALT and ALT/ALT as 2 calls each. If we call ALT/ALT and it's really ALT/ALT then that's two true variants. If we call it REF/REF then it's 2 false negatives, and if we call it REF/ALT then it's 1 true positive and 1 false negative. Similarly a true REF/REF called as ALT/ALT is 2 false positives, and as REF/ALT is 1 false positive. I think that should be enough to give due importance to genotype assignment (which is one key improvement, especially on Illumina where it's a 6-fold reduction), while still keeping the graph simple. Comments?
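To make the counting concrete, here is a minimal sketch of that per-allele scheme (hypothetical helper code, not anything from bcftools), with genotypes encoded as the number of ALT alleles (0 = REF/REF, 1 = REF/ALT, 2 = ALT/ALT):

```c
#include <stdio.h>

/* Each diploid site is two calls, so a hom-genotype difference counts twice. */
static void score_site(int truth, int call, int *tp, int *fp, int *fn)
{
    int min = truth < call ? truth : call;
    *tp += min;          /* ALT alleles correctly called        */
    *fp += call - min;   /* ALT alleles called but not in truth */
    *fn += truth - min;  /* ALT alleles in truth but not called */
}

int main(void)
{
    int tp = 0, fp = 0, fn = 0;
    score_site(2, 1, &tp, &fp, &fn); /* true ALT/ALT called REF/ALT: 1 TP, 1 FN */
    score_site(0, 2, &tp, &fp, &fn); /* true REF/REF called ALT/ALT: 2 FP       */
    printf("TP=%d FP=%d FN=%d\n", tp, fp, fn);
    return 0;
}
```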
@jkbonfield As requested offline, here is some information on the problem with the original draft we were discussing a while ago. I don't know if it is still an issue, but the observation back then was: Each read was aligned to two consensus sequences generated from indels visible in the sample. In the case of one hom-ALT sample with an insertion and one hom-REF sample, the code triggered four realignments for the second sample, all to the same identical reference consensus, and the alternate consensus was not considered for the hom-REF sample at all. This led to a low quality estimate. The program should avoid recomputing the same read for the same consensus twice, just to save time. More importantly, if there aren't enough indel types within the sample, the most popular consensus from other samples should be used.
On Mon, Dec 11, 2023 at 05:52:08AM -0800, Petr Danecek wrote:
> @jkbonfield As requested offline, here is some information on the problem with the original draft we were discussing a while ago. I don't know if it is still an issue, but the observation back then was:
> Each read was aligned to two consensus sequences generated from indels visible in the sample. In the case of one hom-ALT sample with an insertion and one hom-REF sample, the code triggered four realignments for the second sample, all to the same identical reference consensus, and the alternate consensus was not considered for the hom-REF sample at all. This led to a low quality estimate.
> The program should avoid recomputing the same read for the same consensus twice, just to save time. More importantly, if there aren't enough indel types within the sample, the most popular consensus from other samples should be used.
Do you have any example data demonstrating this?
James
I'm adding some analysis results of various GIAB HG002 samples. These show false positive rate (FPr) vs false negative rate (FNr), and also genotype assignment error rate (GTr) vs FNr. FPr is defined as the fraction of called variants that are false. "Truth" is a little flexible as there is an inherent error in measuring the same consequence via different calls, eg 1 indel + 1 SNP vs 2 indels, etc. Note: I'm reporting indels only here as this PR doesn't change the SNP calling.

Low-depth Illumina

It's clear that the genotype assignment accuracy is the huge win here, especially when it comes to deeper data. This was a systematically flawed calculation in the old code due to the way it scores multi-allelic sites, but aligning to a consensus instead of a patched reference is clearly a significant win for genotype analysis too. At shallow depth, there isn't a marked gain over ...

Timing wise, develop, this PR ("new") and indels-2.0 all took ~66 min for chr1 60x and ~23 min for 15x. Not much difference.

I'll do other technologies in their own posts. (And yes, I'm being lazy! Much less work to just cut and paste a full window than to save the image to disk, find it in firefox, and then reupload it again. "Control-V" is less work :-)).
Next up is BGISeq, with the new https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_BGIseq_2x150bp_100x/

Deep 100x data: on the full depth data, this PR is more accurate than the GATK results uploaded with the published dataset. Unfortunately the previous updates to ...

Shallow 10x data: it's less clear with shallow 10x BGI data.

Speed wise, the full data set was broadly the same speed for all bcftools runs, at approx 2 hours.
Next up, a modern 53x PacBio HiFi alignment from 20kb and 15kb libraries. It comes with DeepVariant analysis too.

What's staggering here is clearly how far ahead DeepVariant is. It obviously has a far better understanding of the errors involved in this data. I did note there are clusters of errors that we make around supplementary reads, and that filtering those out (or alternatively filtering out anything with long soft-clips) removes almost all of those errors, but overall it's not a big impact as the number of affected regions is small. We still have a very significant number of false positives to remove though. It may be worth investigating some of the tricks used in ...
Unfortunately I don't have ...

Speed wise, develop took 140 min and this PR 60 min, so it is ~2.3x faster.
An older 2019 GIAB HG002 PacBio HiFi data set. This appeared to have different characteristics, and got worse with higher ...

More for comedy, I also included processing of ...

We're comfortably ahead of GATK on this data set, although that was my own run of GATK and I can't be sure I used it correctly as it's a complex beast. This PR is also a further improvement over PR 1679, which was itself already an improvement.

Speed wise, devel took 120 min, this PR 41 min (so almost 3x faster), and sadly ...

I think this places us in a good spot for calling performance. We're likely never to match DeepVariant, but we'd be a fraction of the CPU cost. (I haven't run it myself, but being AI driven I can't believe it's any faster than GATK.)
Ultima Genomics. This was a control sample spiked in with other data, and is ~20x HG002. I have deeper data available for others (eg HG004), but not yet analysed.

Ultima Genomics' main error model is indels, particularly homopolymers IIRC, so it performs quite poorly. However this PR is still ahead of develop. For the develop run I copied the parameters from the new ...

Not much more to say on this. Probably a work in progress still, so we can learn more about handling the error models for this data.

Speed was develop 34 min, this PR 31 min. So broadly comparable.
ONT "sup" base caller, HG002 sample again Disappointingly we're behind the old PR now, although again all of them are doing a really bad job. Even with the super mode basecaller the error rate is just too high for us to realiably call indels. Likely a better algorithm that's more in tune with the error profile could tease out a better result. I'll play more with the parameters to see if it's simply a matter of tuning things a bit. I'd like it to at least match the old PR here. As with other long-read technologies this PR is significantly ahead on speed. 104min for this vs 317min for develop. So 3x faster again. As with other long-read technologies,
Can you please elaborate? I haven't looked at what it's doing under the hood yet, but it does get the right answer. All the versions I've tried have. My test:
So it's produced two samples, one with +TT on both alleles and one with +TT on one allele only. Tview shows me this:
The key thing here is the FORMAT fields:
GT of 1/1 and 0/1, and AD of 0,6 and 3,3. That's spot on. What am I missing about your scenario?

Also, is this still how people run multi-sample variant calling? I thought the world had moved on to gVCF. Do people do multi-sample analysis for small things like trios or tumor/normal analysis, and gVCF only for population studies? (I'm out of my depth here as I've only looked at single-sample calling, partly due to the lack of decent high quality truth sets.)
Ah, there is still a problem if sample 1 is HOM +CA and sample 2 is HET +TT. We get CA called as both HOM and HET and TT doesn't turn up anywhere, unless I split into two single-sample files, where it works fine.
This is true for develop, indels-2.0, indels-cns (this PR), and #1679. I think it's a fundamental flaw of the shared code that does indel type selection. Develop calls the second sample as 0/0 with AD 5,0 (correct AD given the alleles, but the wrong call). If I had +CA and +TTT then it works, as we have type 2 and type 3, but +CA and +TT leads to one type only, and so everything gets called as a 2bp insertion. Really it needs to check that the insertion consensus for each sample matches what was actually produced when aligning, ie CA,TT. A useful spot, but again this isn't specific to this PR. It's just a weakness of the existing algorithm.
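A toy illustration of that weakness, assuming (as described above) that candidate indels are effectively keyed by length alone; this is hypothetical demonstration code, not the actual type-selection logic:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *ins1 = "CA";  /* sample 1: HOM +CA */
    const char *ins2 = "TT";  /* sample 2: HET +TT */

    /* Keyed by length, both are "type +2", so only one 2bp insertion
     * candidate survives and +TT is never offered as an allele. */
    if (strlen(ins1) == strlen(ins2))
        printf("same type by length: everything called as one 2bp ins\n");

    /* Keyed by sequence, both candidates stay alive. */
    if (strcmp(ins1, ins2) != 0)
        printf("distinct by sequence: CA and TT both considered\n");
    return 0;
}
```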
Ah sorry, I misread it. HOM ALT and HOM REF... That does give an issue with this new code which I need to investigate, as the QUAL field is very low for some unknown reason. However the calls are still correct. The original PR in #1679 works fine however.

Edit: blasted BAQ again! Disabled it and sure enough it works again. I've disabled it in most of the profiles, but I wasn't using one in my command-line tests, and we can't change the default behaviour :(

So I still don't understand the failure case you observed. Data to demonstrate it would be useful please.
Revised ONT parameters. Switched from -h80 to -h110 to remove false negatives (at the expense of FPs), and then --indel-bias 0.8 to try to remove those FPs (at the expense of FNs). The combination generally appears to be a better FP/FN tradeoff, and is more sensitive to boot, so at QUAL 0 it has significantly better recall. My graph here goes all the way up to 50% FN, which is very high, so the lower part is the more important bit for general usage.

These changes do unfortunately take a hit on genotype accuracy, but it's probably more important to identify the indels first; genotype accuracy, while important, is the next thing to assess. Overall GT accuracy is still quite high for all methods.
Attempts to fix some failing tests led me to improve the way AD is calculated, although it still doesn't fix some of the failure cases. Overall it's a win though. I no longer reject alignments by setting the base type to 5 (unknown), instead preferring to adjust the quality of the alignment. This gives higher, more realistic AD figures now. I think this addresses some of the concerns you had over the old PR? However it does so at the cost of slightly higher misdiagnosis of the real genotype.

Current WIP figures. "10h" was my internal name for what was listed as "new" above, and "12a" is my latest revision code name.

Illumina: better sensitivity at QUAL 0 and a better FN vs FP tradeoff at most filter levels, but slightly less accurate on genotype assignment. Similar deal for deep sequencing: GT is poorer, but both are miles ahead of dev or indels-2.0.

BGI: considerably more sensitive than the earlier version of this PR, although "dev" still beats it on lower QUAL filtering. As with Illumina, we're no longer so great on GT assignment accuracy, with another 1-2% additional false assignment. At deeper data, the new commits really help on BGI, both for FN vs FP and for GT error. Both are lightyears ahead of the old dev branch. See earlier comments for comparison vs the GATK results that were published alongside the GIAB BGI data set, and the --indels-2.0 results.

More data sets incoming...
ONT "super"; approx 50x coverage There is a cross-over where both the earlier version of this PR and dev give a better FN/FP tradeoff, but this is at nearly 35% lost calls. The new commits have significantly improved recall rates which probably offer a better tradeoff. However as many cases before, we see this comes at a cost of less accurate genotype assignments. For sub-sampled 15x coverage: Similar to 50x, but closer in FN/FP ratio. The new commits show higher sensitivity still, and both are considerably ahead of develop. |
Following requests outside of PR review here (but please do comment here rather than use out-of-band communication, for the sake of transparency and to keep any other interested parties informed), I did some analysis of the various heuristics which had accrued over time, and culled some, as while they originally helped they're no longer so relevant following changes elsewhere. Hence the code is a little simpler now.

Of the ones left, I have some examples, both pro and con, of the sort of indel calls they alter, along with summaries of the total numbers of changes. See the TEST- comments in the source code for the various code blocks ifdefed in or out. (TESTs 1, 3 and 6 are still in place.) Tests below were on a region of HG002 chr1. My main tuning and training set is chr1:10M-20M.

TEST 1

30x Illumina
It significantly benefits genotype accuracy and improves sensitivity. At low QUAL filtering it's a good FN/FP tradeoff. At higher ...

False negatives:
15x Illumina
Again GT error is reduced with this heuristic, and a slight improvement ...

53x PacBio HiFi
Not a huge difference, but a few more false positives called for low ...

False positives CAUSED by the heuristic:
False positives CURED by the heuristic:
Looking at high quality ones:

CAUSED chr1:13602043 0/1 (called AT A by us and dev, ATGC A by indels-2.0). We call the main insertion, but this is false. May need to track ...

15x PacBio
Marginal difference on low coverage PB data too.

Conclusion: not huge, but adds a little to sensitivity and it's a ...

Consider tuning how much reference spike-in we really need. From the code above:

double rfract = (r - t2) * .75 / (r + 1);

Change going from .75 to .50 (chr1 10-20M):
Not enough evidence to change yet.
TEST 3: Compute indelQ in two ways (vs ref and vs next-best) and blend.

30x Illumina
Helps sensitivity marginally, but also a marginal gain in FP.

10x Illumina
Small reduction to FP on shallow data, maybe marginal GT benefit too.

10x BGI
Slight beneficial reduction in FP and GT.

53x PacBio
Significant reduction in low-qual FP, but almost nil impact elsewhere.

15x PacBio
A slight reduction in FP rates, and slighter still to GT.

Examples of low-coverage changes:

False positives CAUSED by the heuristic:
False positives CURED by the heuristic:
It appears that heuristic is detrimental, and indeed it makes ...

chr1:17479481 is the highest qual here. It's a poly-A with counts ...

Checking the COMMON FPs, with the last two columns being QUAL (with, without):
Sorted by delta of QUAL-with and QUAL-without. Again, this looks detrimental, particularly those with a QUAL delta of ...

Looking at true variants, I see 1974 common calls. 105 have a higher QUAL delta on COMMON TP, >= 20.
So the heuristic benefit is primarily one of discrimination, which is ...

This isn't surprising given the heuristic isn't really changing calls ...

Conclusion: a small benefit overall.
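For readers skimming TEST 3, here is a minimal sketch of the two-way indelQ idea, assuming a simple 50:50 blend and a quality cap of 40; all names are hypothetical and the real bcftools code differs:

```c
#include <stdio.h>

/* Phred-like quality from two alignment scores (lower score = better). */
static int score_to_q(int best, int other, int cap)
{
    int q = other - best;      /* how much worse the alternative fits */
    if (q < 0) q = 0;
    return q > cap ? cap : q;
}

/* Blend indelQ computed vs REF with indelQ computed vs next-best allele. */
static int indelq_blend(int score_best, int score_ref, int score_next)
{
    int q_ref  = score_to_q(score_best, score_ref,  40);
    int q_next = score_to_q(score_best, score_next, 40);
    return (q_ref + q_next) / 2;  /* assumed 50:50 blend */
}

int main(void)
{
    /* Read fits best indel 30, REF 60, next-best indel 30. */
    printf("indelQ = %d\n", indelq_blend(30, 60, 30)); /* (30+0)/2 = 15 */
    return 0;
}
```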
TEST 6: Depth-based seqQ cap in bam2bcf.c.

15x PacBio
53x PacBio
Huge reductions in FP rates, and for deeper data also for GT error.

30x Illumina
Significant FP reduction.

100x BGI
Similar sensitivity, but huge FP and GT error reduction. Without this rule it's also highly overconfident in QUAL. It's not ...

BGI 100x analysis:

False positives CAUSED by the heuristic, sorted by qual:
False positives CURED by the heuristic, sorted by qual:
chr1:11263494 (no var) is 12 -AC out of 91, so could be justified as a ...

The others are called as GT 1/2 when in reality it's 0/1 or 1/1. Eg chr1:11398727 is identified as 14 REF, 36 +1T, 11 +2T.

Comparison of those with indels-cns (with heuristic), indels-2.0 and ...
False positives SHARED by both with/without heuristic (with QUAL scores)
chr1:10523825 vs ref, I see -4A=9 -3A=19 -2A=66. Seems unlikely we'd ...
False negatives CAUSED by the heuristic, sorted by qual:
However, each platform needs a differing seqQ cap as the biases vary. It would probably also harm calling of cancer data where we're not ...
All of the data files here are public GIAB files. The HG002 ones are downloaded locally in Sanger at ~jkb/lustre/GIAB/chr1/

Files: ...

Plus in the ../ont and ../ultima directories too.
Next up, Illumina HiSeq. It's not the most modern data, but I haven't yet found any for HG005. I'll have a look next week maybe.

Indels-cns has a considerable boost to recall rate, while broadly speaking following the FP-vs-FN trend of indels-2.0 on the shallow data once we start to filter by QUAL. As before, we're a bit behind on genotype assignment accuracy for shallow data, with QUAL 0 being dev=146 err, indels-2.0=284, and indels-cns=186. So not hugely behind, but a bit.

At 50x we've got enough data to start making reasonable calls for both devel and indels-cns, although once again indels-2.0 is showing itself to be a poor performer with regard to false positives. Indels-cns is the most sensitive method. With the deeper data however it's now also the most accurate at genotype assignment, with Q0 error counts being dev=120, indels-2.0=204, indels-cns=23. So 4-5x fewer errors there.

The full data set is overly deep. I have --indels-2.0 still running, but it's likely to end up in the same position. Both dev and indels-cns understandably do a good job, but indels-cns is still the clear winner. It's now also only making 4 genotype assignment errors, compared to 57 for develop.
Conclusion: I hope this addresses your concerns about over-fitting. I haven't demonstrated it on ONT (I'm not sure if modern "super" basecaller data is available for HG005), but it looked promising with HG002. This algorithm is considerably better at assigning genotypes, and the recall vs precision graph is strongly in favour of the new algorithm. Sadly it's also strongly demonstrating that indels-2.0 is suffering, so I would also recommend (in a separate PR) marking it as deprecated once this PR is merged.

Files for the HG005 analysis are locally in ~jkb/lustre/GIAB/HG005. That includes BAM files, VCFs, and plots. You should be able to do ...
Some newer data - HG005 Ultima Genomics. Unlike the HG002 data this was quite deep. This data came to me with DeepVariant and GATK HaplotypeCaller joint results.

Surprisingly, bcftools is doing really well here compared to the others on recall, although our genotype assignment accuracy is lagging. This PR extends the lead further. As seen on other instruments, we're a bit behind develop on accuracy of genotype assignments for shallow data, but that's more than made up for by the improved recall overall.
I've moved this out of draft as I'm now happy for it to be reviewed. Hopefully this demonstrates the gains to be had. I've validated that it now makes zero difference to the default usage of bcftools. Note however it does change the non-versioned profiles, eg ...

In my mind, we should always make sure the non-version-numbered profiles use the best options available, as that's what most people are going to do. It's only people who want continuity between releases, upgrading purely for major bug-fixes but not accepting anything that changes output, who should consider using a release-specific parameter set. (To this end, the -1.20 profiles also contain a bug fix to the IDV and IMF INFO fields which could go into 1.18, but I'm taking the request not to change old profiles literally here.)
A major revamp like this needs to be tested by more people and on more data sets, before it becomes the new default.
Bug fixes are a different thing. Not sure why we wouldn't want to include it.
Understood, but what's the best naming? We discussed changes previously and you made it clear that, irrespective of testing, this could never become the default as people don't expect changing results. However I took the approach of: bcftools without command line arguments = default; bcftools with specified profiles could be different.

For consistency, I felt that people who are using a profile and have a hard requirement that the output never changes should have a profile named after a bcftools release (eg illumina-1.18), while people who do not require hard never-changing results could use a non-release-tagged profile that upgrades automatically to what we believe represents the best caller. That feels like a good pragmatic solution and it's what I've done here. Do you agree?

If so, I guess the final question is what do we do about new profiles that are undergoing testing? Should I just rename them to 1.21, so there is a clear statement that it's experimental for this release and (if no problems found) will become the new default next time around? That feels like good signalling.
Well, because of the above statement basically. Also, when is a bug a bug? The fix I'm thinking of is that IDV and IMF are computed based on the number of records containing the indel called originally, not the number called after we validated them. We've always had this, as bcftools has always realigned reads (previously to a patched ref, now to a consensus), but it used the wrong data when filling out INFO. We could argue over bug or feature, and I can see both sides, which is why I chose not to fix it.

However it comes down to the same issue. If we take the route that people should never be getting changed results, which was your firm belief before and why you rejected the previous PR, then we also shouldn't be making such changes without producing a new version-numbered profile.
My apologies, I completely misunderstood your words. I agree with what we agreed before:
It was not my brightest moment, sorry again.
No problem. It's a new direction and not something we've done before. I'm not sure it's perfect, but it feels like a reasonable strategy.

The question still remains though as to when the non-versioned profile switches. If this is present in bcftools 1.20 but we don't want "bcftools mpileup -X illumina" then we need a different profile name than "illumina-1.20". I'm open to suggestions, but maybe illumina-1.21 or illumina-experimental? The problem with the latter is we'd want to delete it after it becomes non-experimental.

New options are a tricky one too. The profile configuration and the option are inherently entangled. We could obviously do ...

Edit: actually that's a really bad example as there were no changes made to Illumina! However PacBio for example gained a lot of new parameters, such as ...

I don't have ...
So we see using ...

Edit: I ran HG002 chr20 with ...
So at ~50 FN it's fewer FP, but if we do QUAL filtering to get approximately comparable FP and FN (maximising discrimination and F1 score) we're at around 20% more errors. But that's still a huge improvement over the ~double error of develop, so just adding ...
This is a combination of PR samtools#1679 from 2022 coupled with more recent changes to replace BAQ with edlib for evaluating the alignment of reads against candidate alleles. Due to the complexity of rebasing many commits with many conflicts, these have been squashed together for simplicity. Please consult the original commit log in the PR for more detailed history.

PR 1679 is a major restructuring of the indel caller that produces proper consensus sequences. TODO: paste collected commit messages here.

The newer edlib changes are designed to dramatically increase the speed of indel calling on long read data by use of a modern algorithm. Note edlib's alignments are very crude, with just +1 for all differences, including substitutions and each indel base. However if we've computed our candidate allele consensus sequences correctly then aligning against the real allele should give a minimal score regardless of scoring regime, so the cost of an HMM or even affine weights does not help much. The lack of quality values is compensated for by detection of minimum quality within STRs.
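As a rough illustration of the unit-cost scoring, here is how a read could be scored against candidate allele consensus sequences with edlib's public C API (the API calls are real edlib; the sequences and the surrounding logic are made up for illustration):

```c
#include <stdio.h>
#include <string.h>
#include "edlib.h"

int main(void)
{
    const char *read = "ACGTTTTACGT";
    const char *alleles[] = { "ACGTTTACGT",     /* REF consensus */
                              "ACGTTTTACGT",    /* +1T consensus */
                              "ACGTTTTTACGT" }; /* +2T consensus */
    int best = -1, best_d = 0;

    for (int i = 0; i < 3; i++) {
        /* Global (NW) alignment, distance only; every mismatch and
         * indel base costs +1, matching the "very crude" description. */
        EdlibAlignResult r = edlibAlign(read, strlen(read),
            alleles[i], strlen(alleles[i]),
            edlibNewAlignConfig(-1, EDLIB_MODE_NW, EDLIB_TASK_DISTANCE,
                                NULL, 0));
        printf("allele %d: edit distance %d\n", i, r.editDistance);
        if (best < 0 || r.editDistance < best_d) {
            best = i;
            best_d = r.editDistance;
        }
        edlibFreeAlignResult(r);
    }
    printf("read assigned to allele %d (distance %d)\n", best, best_d);
    return 0;
}
```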
(TODO: rename edlib to something better.) This enables the homopolymer scan-left/right for minimum quality when adjusting seqQ and indelQ. This is good for machines with mainly indel errors, and detrimental for clocked instruments such as Illumina.
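A minimal sketch of the scan-left/right idea (hypothetical helper, not the bcftools implementation): within the homopolymer run containing a position, use the run's minimum base quality, on the basis that an indel error could surface anywhere within the run.

```c
#include <stdio.h>

static int min_qual_in_homopolymer(const char *seq, const int *qual,
                                   int len, int i)
{
    int q = qual[i];
    /* scan left while still in the same base run */
    for (int j = i - 1; j >= 0 && seq[j] == seq[i]; j--)
        if (qual[j] < q) q = qual[j];
    /* scan right likewise */
    for (int j = i + 1; j < len && seq[j] == seq[i]; j++)
        if (qual[j] < q) q = qual[j];
    return q;
}

int main(void)
{
    const char seq[]  = "ACTTTTG";
    const int  qual[] = { 30, 30, 25, 40, 12, 33, 30 };
    /* Position 3 sits inside the TTTT run; the run minimum is 12. */
    printf("min qual = %d\n", min_qual_in_homopolymer(seq, qual, 7, 3));
    return 0;
}
```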
When we have eg REF C and ALT CT,CTT then we're saying it's +1 or +2 Ts. We align all reads against the 3 candidate alleles and score them, sorting these scores to find the best so we can allocate reads to the alleles.

The score for those allele assignments was REF minus best-indel for REF assignments, and best-indel minus REF for indel assignments. This change turns the latter into best-indel vs next-best-allele, regardless of whether that next best is REF or not. Consider the case of +1 score 30, +2 score 30, +0 (REF) score 60. Previously we'd have recorded the relative quality of 60-30, but now we record 30-30. The consequence of this is that reads aligning equally well against +1 and +2 get zero confidence in their allele assignment. This considerably reduces the chance of recording GT 1/2 for variable homopolymers caused purely by the vagaries of sorting a bunch of equal numbers. (The equal scores often arise from reads that don't span the homopolymer or STR.) We may miss some true variants as the multi-allelic possibilities all cancel each other out, but typically that means there are very few reads spanning the site and this data may well be false positives caused by sequencing artifacts. So overall it's a considerable benefit to accuracy.

A consequence of this is also that AD values now more accurately reflect evidence, rather than incorrectly distributing the uninformative reads to alleles. However, AD will now also be lower, and similarly PL, so a global boost to indelQ helps recover some of this lost sensitivity. This commit also boosts indelQ by the --indel-bias param to make this more impactful. (Previously it only boosted seqQ.)
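A sketch of that best-vs-next-best change, using the worked numbers from the paragraph above (hypothetical helper names; scores are per-allele alignment costs for one read, lower is better, with scores[0] being REF):

```c
#include <stdio.h>

static int assign_read(const int *scores, int n, int *qual)
{
    int best = 0, next = -1;
    for (int i = 1; i < n; i++) {
        if (scores[i] < scores[best]) { next = best; best = i; }
        else if (next < 0 || scores[i] < scores[next]) next = i;
    }
    /* Old scheme: indel assignments were scored against REF (scores[0]).
     * New scheme: scored against the next-best allele, so a read fitting
     * +1 and +2 equally well gets zero assignment confidence. */
    *qual = scores[next] - scores[best];
    return best;  /* allele index this read is assigned to */
}

int main(void)
{
    int scores[] = { 60, 30, 30 };  /* REF 60, +1 scores 30, +2 scores 30 */
    int q, a = assign_read(scores, 3, &q);
    printf("allele %d, qual %d\n", a, q);  /* qual 0 now, was 60-30=30 */
    return 0;
}
```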
These use VT's "decompose_blocksub" command, which since writing may now be possible natively within bcftools. If so, I haven't evaluated how it differs. (These scripts originate from the Crumble evaluation many years ago, so I could see what the impact was of reducing quality in the SynDip data set.)
It's faster and more accurate than the old mode. The speed is due to BAQ vs Edlib, but the accuracy comes from the use of allele consensus generation and better heuristics.

- Cull lots of ifdefed-out code, plus reset indel-bias back. Earlier experiments had increased this score by doing indel_bias/10 when it's used (it's 1/score for effect). The difference isn't great, and it's a marginally better FN/FP ratio with it in the original usage. We can still manually change this score to get better sensitivity if desired, but it's not dialed up so high by default.

- Remove an indel caller hang. Our consensus can sometimes fail and give us a short alignment that doesn't span pos. We were checking the wrong end coord.
The current IDV and IMF come from the initial "types" assignment, before realigning reads back to the consensus alleles. If we're outputting AD then we have more accurate re-aligned counts of each type, so we report those instead.
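A minimal sketch of that change (hypothetical variable names): report IDV/IMF from the post-realignment AD counts rather than the initial candidate-type scan.

```c
#include <stdio.h>

int main(void)
{
    /* Realigned per-allele depths for one sample: REF, indel allele.
     * These would come from the AD computation, not the "types" scan. */
    int ad[2] = { 14, 6 };

    int   idv = ad[1];                          /* reads supporting the indel */
    float imf = (float)idv / (ad[0] + ad[1]);   /* fraction of reads          */
    printf("IDV=%d IMF=%.2f\n", idv, imf);
    return 0;
}
```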
The main benefit of the new indel caller is the use of alignment against diploid consensus generation instead of simply patching the reference with candidate indel types. This greatly reduces false positives and incorrect allele alignment (leading to wrong genotype calls). This was the earlier PR samtools#1679, but it has since acquired edlib as an alternative to BAQ for the indel alignment. However this is primarily a speed benefit (with some minor removal of false negatives due to quality smashing), and isn't the main thing users should think about when choosing an indel caller.

Also tidied up the usage statement and added an explicit "-X list" to print the profile parameters.

- Add extra debugging defines. GLF_DEBUG reports GLF calculation in bam2bcf.c. ALIGN_DEBUG uses edlib EDLIB_TASK_PATH to report sequence alignment. NB: to use this you need to link against edlib itself rather than the cut-down version in this repository. Also fix the edlib heuristics used in bcf_call_glfgen. We don't want to change the call (b=5) as this affects AD. Instead we change the quality so poor calls get filtered by QUAL rather than simply being removed.

- Tweak edlib tuning for SeqQ/qual. Add quality value assessment into soft-clip recovery. Use /500 instead of /111 in indelQ assignment, and skew indel-bias accordingly. This gives better separation of FP/GT/FN generally.

- Added --seqq-offset parameter so we can use it in tunables per profile. This is used as a limit on the seqQ reduction in the "VAL-5*MIN(20,depth)" formula, used for favouring data over seqQ scores when depth is sufficient. Experimentation showed no single value that worked for all platforms, but the default is in the middle. (See the sketch after this list.)

- Tidy up to cull ifdefed and commented-out code.

- Add test for indels-cns. It's minimal, but the whole of indel calling has minimal testing. I think we have under 10 indels in total with develop (all short read and mostly duplications of each other), and no testing of indels-2.0. This tests 4 indels with indels-cns.

- Added documentation for the new --indels-2.0 options.

- Cull more unused parts of edlib.c. This avoids clang warnings (which become errors with -Werror). We're only including the bits we need here for mpileup. If you want the whole thing, link against the upstream source library instead.
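A hedged sketch of the "VAL-5*MIN(20,depth)" cap mentioned in the --seqq-offset item, with --seqq-offset supplying VAL; names and the surrounding arithmetic are illustrative, and the real bam2bcf.c code is more involved:

```c
#include <stdio.h>

/* Cap seqQ by depth: the deeper the data, the more we trust the reads
 * themselves over the precomputed seqQ score. */
static int cap_seqq(int seqq, int depth, int seqq_offset /* VAL */)
{
    int d   = depth < 20 ? depth : 20;   /* MIN(20, depth) */
    int cap = seqq_offset - 5 * d;
    if (cap < 0) cap = 0;
    return seqq < cap ? seqq : cap;
}

int main(void)
{
    /* Illustrative only: at 100x the cap bottoms out at offset - 100. */
    printf("capped seqQ = %d\n", cap_seqq(40, 100, 120));  /* -> 20 */
    return 0;
}
```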
vcfbuf.c:249:32: error: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Werror,-Wsingle-bit-bitfield-constant-conversion]
    buf->vcf[i].af_set = 1;
                       ^ ~
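For context on that warning: a 1-bit signed bit-field can only represent 0 and -1, so storing 1 truncates. The usual fix is to make the field unsigned; a minimal illustration (not the actual vcfbuf.c struct):

```c
#include <stdio.h>

struct bad  { int      af_set : 1; };  /* holds 0 or -1; = 1 truncates  */
struct good { unsigned af_set : 1; };  /* holds 0 or 1 as intended      */

int main(void)
{
    struct bad  b = { .af_set = 1 };   /* clang warns: value becomes -1 */
    struct good g = { .af_set = 1 };   /* value stays 1                 */
    printf("signed: %d, unsigned: %u\n", b.af_set, g.af_set);
    return 0;
}
```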
Please give me confirmation that this is interesting and will be considered for inclusion.
Despite demonstrating a very significant leap forward in calling accuracy, #1679 never received any comments or consideration. Instead it triggered the creation of the --indels-2.0 option, with many of the ideas being rewritten for that. Unfortunately --indels-2.0 just doesn't perform that well, as it wasn't completed and misses many of the vital heuristics that remove false positive calls.

In personal communications, one reason for 1679 being passed over appears to be that it changes the behaviour of existing bcftools and doesn't offer a way back to the old calling algorithm for compatibility. (IMO if you want it to never get better, don't upgrade!)
I'm not a believer in locking new functionality behind new options, as that means we never progress (no one is going to go through all the nextflow pipelines and amend them), but I'll concede the lack of compatibility may be an issue, so this PR adds versioned profiles (eg -X illumina-1.18) that behave as the previous release, while updating the default profiles to be more accurate.

The main summary of changes so far:
- Use of a consensus instead of patching the reference for constructing the candidate indel alleles to align against. This is the most significant accuracy gain, from #1679 (Improvements to indel calling).
- Drop the use of BAQ as an alignment algorithm and replace it with edlib. BAQ was never really designed as an aligner, but as a quality recalibration tool. Using it to align sequences is both very heavyweight and also error prone, as it won't always give a good alignment score on things that have STRs or other low complexity. Edlib is vastly faster, and we compensate for the lack of quality assessment with a combination of heuristics such as STR scanning (this was previously added, but it's been beefed up a bit more) and some crude quality score evaluations such as minimum quality within an STR.
- More profiles and revised profiles. ONT-super and Ultima have been added.
- New parameters: --del-bias to control biases in insertion vs deletion error models in instruments (many single-cell instruments are more likely to miss bases than insert bases); --poly-mqual to enable/disable the use of minimum quality values within homopolymers (harms clocked instruments such as Illumina, but benefits many others). See the sketch after this list for the rough idea behind --del-bias.
- SNP calling is unchanged.
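As a heavily hedged sketch of what a deletion-vs-insertion bias could look like, here the deletion error likelihood is simply scaled relative to insertions; the actual --del-bias semantics and implementation in bcftools differ, and all names here are illustrative:

```c
#include <stdio.h>

/* del_bias > 1: instrument drops bases more often than it inserts them,
 * so deletion errors are treated as more likely (and vice versa). */
static double biased_indel_err(double err_prob, int is_del, double del_bias)
{
    return is_del ? err_prob * del_bias : err_prob;
}

int main(void)
{
    printf("ins err %.4f, del err %.4f\n",
           biased_indel_err(0.001, 0, 2.0),
           biased_indel_err(0.001, 1, 2.0));
    return 0;
}
```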
As requested, I moved the code out of bam2bcf_indels.c and into its own file. I'm not happy with the --edlib parameter as I feel edlib vs BAQ isn't the really important thing here. It's a big speed up, but the real work is the better algorithm. That's (for now) in bam2bcf_edlib.c, following the same strategy @pd3 took with duplicating functions into bam2bcf_iaux.c. I'm open to better names on the command line. I'm tempted by --indels-3.0, but I fear that may cause friction. I'm not intending to say it deprecates your indels-2.0 code (it's for the users to decide which algorithm they prefer), and indeed it's not a derivation of it, as it precedes it: the indels-2.0 code was your earlier rewrite of my own 1679 PR, so really that was "indels-1.5" anyway. I don't really like the naming scheme for either of them, but am open to suggestions.

The recall vs precision is generally significantly ahead of both develop and the experimental --indels-2.0 mode (which also crashes on a lot of the long-read data sets I tried, such as ~jkb/lustre/GIAB/ont/PAO83395.pass.cram on our local systems). Speed wise, it's pretty much unchanged with Illumina, but on long read technologies with higher indel error rates the increased number of candidate indels to assess means the new profiles and edlib use can be 2-3x faster.

This PR however isn't always quite matching the old 1679 one. It's close, but marginally behind. That's probably due to using heuristics instead of an HMM for incorporating quality scores into the alignment process. It's close enough though, and it is much faster on long read data.
I'm still running experiments and assessing results, but this is a proof of principle.
Illumina GIAB HG002 chr1 results. These all had BAQ semi-enabled (default mode) for SNPs.
The comma-separated values above are False Positives, GT assignment errors, and False Negatives. The val in val/FP,GT,FN is the QUAL threshold applied as a filter, dropping all variants with lower quality. I've been manually adjusting that until I get to a similar FN rate, so I can do a more reasonable side-by-side assessment of "given the same recall, how does the accuracy compare?". I'll produce some plots later on once I've completed my trials.
"ed10" is
--edlib --indel-bias 10
to push sensitivity up higher, especially at QUAL 0 (no filtering). It's still not as sensitive as 1679 though. 1679 was better here for genotype assignment accuracy, but FP vs FN was broadly similar to the new mode.PacBio HiFi
The new edlib mode is ahead of the 1679 PR, which itself was significantly ahead of develop (and gatk4 it turns out, but I didn't run the myriad of post-processing tools, just QUAL filtering). The new code is MUCH faster too. --indels-2.0 died:
ONT "super" base call mode. (PAO83395.pass.cram from ONT)
I still have more analysis to do here, along with the Ultima Genomics calling. Both the ONT and UG data sets however are marginal on small indel detection due to the error models on the instruments. (If anyone knows of a published HG002 data set from ONT using "sup-dup" mode data then please feel free to point me at it. Similarly for any other data types out there.)