Skip to content

Commit 7a4f801

Browse files
committed
Fix bcftools norm -m +indels
The code was handling the merging in a simplistic way, SNPs vs everything else. As pointed by #2084, it also would not merge indels with `-m +indels`, only with `-m +both` or `-m +any`. Also splitting by type was not functioning properly due to an error where two incompatible bitmasks were used together (e.g. COLLAPSE_SNPS vs VCF_SNP) This is now fixed and improved, the new behavior is as follows: - multiallelic sites with containing SNPs but not indels are split with `-m -snps` but not with `-m -indels`, and analogously for indels. - multiallelic sites containing both SNPs and indels are split when any of the following is given: `-m -snps`, `-m -indels`, `-m both`, `-m any` - merging with `-m +snps` and `-m +indels` should work as expected in case of pure SNP or indel sites. When the input sites contain a mixture of types (e.g. SNP + indel), such sites will not be merged. - merging with `-m +both` will merge together not just SNPs with SNPs and indels with indels, but also "other types" with "other types". Note: this could be improved by providing the user with a way to fine-tune the desired behaviour, for example something like -m +snps+mnps,indels to merge SNPs with MNPs together and indels together. This would not be too difficult to add, but would complicate the user interface. Another improvement would be to make it possible to split multiallelic sites containing both SNPs and indels so that a) two mutliallelic sites are emitted, one with SNPs only and one with indels only b) as above, but one is transformed into multiple biallelic sites and one multiallelic site This could be further improved (and complicated) by considering other variant types. Resolves #2084
1 parent f33fd1d commit 7a4f801

14 files changed

+207
-71
lines changed

NEWS

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,10 @@ Changes affecting specific commands:
1313
- Fix Type=String multiallelic splitting for Number=A,R,G tags with incorrect number
1414
of values.
1515

16+
- Merging into multiallelic sites with `bcftools norm -m +indels` did not work. This is
17+
now fixed and the merging is now more strict about variant types, for example complex
18+
events, such as AC>TGA, are not considered as indels anymore (#2084)
19+
1620
* bcftools +setGT
1721

1822
- Support for custom genotypes based on the allele with higher depth, such

doc/bcftools.txt

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2503,6 +2503,10 @@ the *<<fasta_ref,--fasta-ref>>* option is supplied.
25032503
both SNPs and indels should be merged separately into two records, specify
25042504
'both'; if SNPs and indels should be merged into a single record, specify
25052505
'any'.
2506+
{nbsp} +
2507+
{nbsp} +
2508+
Note that multiallelic sites with both SNPs and indels will be split into
2509+
biallelic sites with both *-m -snps* and *-m -indels*.
25062510

25072511
*--multi-overlaps* '0'|'.'::
25082512
use the reference ('0') or missing ('.') allele for overlapping alleles after

test/norm.merge.4.1.out

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
##fileformat=VCFv4.2
2+
##FILTER=<ID=PASS,Description="All filters passed">
3+
##contig=<ID=1,length=248387328>
4+
##reference=file:ref.fa
5+
#CHROM POS ID REF ALT QUAL FILTER INFO
6+
1 1 . C T,CTT,A,CAA . . .
7+
1 2 . C <DEL>,<DUP> . . .

test/norm.merge.4.2.out

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
##fileformat=VCFv4.2
2+
##FILTER=<ID=PASS,Description="All filters passed">
3+
##contig=<ID=1,length=248387328>
4+
##reference=file:ref.fa
5+
#CHROM POS ID REF ALT QUAL FILTER INFO
6+
1 1 . C T,CTT . . .
7+
1 1 . C A,CAA . . .
8+
1 2 . C <DEL>,<DUP> . . .

test/norm.merge.4.vcf

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
##fileformat=VCFv4.2
2+
##contig=<ID=1,length=248387328>
3+
##reference=file:ref.fa
4+
#CHROM POS ID REF ALT QUAL FILTER INFO
5+
1 1 . C T,CTT . . .
6+
1 1 . C A,CAA . . .
7+
1 2 . C <DEL> . . .
8+
1 2 . C <DUP> . . .

test/norm.merge.out

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -38,8 +38,10 @@
3838
2 101 . ATTTTTTTTTTTTT ATTTTTTTTTTTTTTT 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
3939
2 114 . TC TTCC,TTC 999 FAIL1 INDEL;AN=4;AC=2,2 GT:DP:FGS 1/2:1:A,BB,CCC,EEEE,.,FFFFF 1/2:1:AA,BB,CCC,EEEE,.,FFFFF
4040
2 115 . C T 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
41-
20 3 . GATG CTATG,GACT 999 PASS INDEL;AN=4;AC=2,2 GT 2/1 2/1
42-
20 5 id0001;id0002 TGGG TAC,TG,TGGGG,AC . PASS INDEL;AN=4;AC=2,2,0,0 GT:PL:DP 1/2:1,2,3,4,.,6,7,.,.,10,11,.,.,.,15:1 1/2:1,2,3,4,.,6,7,.,.,10,11,.,.,.,15:1
41+
20 3 . GATG GACT 999 PASS INDEL;AN=4;AC=2 GT 1/0 1/0
42+
20 3 . G CT 999 PASS INDEL;AN=4;AC=2 GT 0/1 0/1
43+
20 5 id0001;id0002 TGGG TG,TGGGG . PASS INDEL;AN=4;AC=2,0 GT:PL:DP 0/1:1,4,6,7,.,10:1 0/1:1,4,6,7,.,10:1
44+
20 5 . TGGG TAC,AC . PASS INDEL;AN=4;AC=2,0 GT:PL:DP 1/0:1,2,3,11,.,15:1 1/0:1,2,3,11,.,15:1
4345
20 59 id0003 AG . 999 PASS AN=4 GT:PL:DP 0/0:0:4 0/0:0:4
4446
20 80 . CACAG CACAT 999 PASS AN=4;AC=2 GT:PL:DP 0/1:255,0,255:13 0/1:255,0,255:13
4547
20 81 . A C 999 PASS AN=4;AC=2 GT:PL:DP 0/1:255,0,255:13 0/1:255,0,255:13

test/norm.merge.strict.out

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -38,8 +38,10 @@
3838
2 101 . ATTTTTTTTTTTTT ATTTTTTTTTTTTTTT 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
3939
2 114 . TC TTCC,TTC 999 PASS INDEL;AN=4;AC=2,2 GT:DP:FGS 1/2:1:A,BB,CCC,EEEE,.,FFFFF 1/2:1:AA,BB,CCC,EEEE,.,FFFFF
4040
2 115 . C T 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
41-
20 3 . GATG CTATG,GACT 999 PASS INDEL;AN=4;AC=2,2 GT 2/1 2/1
42-
20 5 id0001;id0002 TGGG TAC,TG,TGGGG,AC . PASS INDEL;AN=4;AC=2,2,0,0 GT:PL:DP 1/2:1,2,3,4,.,6,7,.,.,10,11,.,.,.,15:1 1/2:1,2,3,4,.,6,7,.,.,10,11,.,.,.,15:1
41+
20 3 . GATG GACT 999 PASS INDEL;AN=4;AC=2 GT 1/0 1/0
42+
20 3 . G CT 999 PASS INDEL;AN=4;AC=2 GT 0/1 0/1
43+
20 5 id0001;id0002 TGGG TG,TGGGG . PASS INDEL;AN=4;AC=2,0 GT:PL:DP 0/1:1,4,6,7,.,10:1 0/1:1,4,6,7,.,10:1
44+
20 5 . TGGG TAC,AC . PASS INDEL;AN=4;AC=2,0 GT:PL:DP 1/0:1,2,3,11,.,15:1 1/0:1,2,3,11,.,15:1
4345
20 59 id0003 AG . 999 PASS AN=4 GT:PL:DP 0/0:0:4 0/0:0:4
4446
20 80 . CACAG CACAT 999 PASS AN=4;AC=2 GT:PL:DP 0/1:255,0,255:13 0/1:255,0,255:13
4547
20 81 . A C 999 PASS AN=4;AC=2 GT:PL:DP 0/1:255,0,255:13 0/1:255,0,255:13

test/norm.split.merge.1.out

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
##fileformat=VCFv4.2
2+
##FILTER=<ID=PASS,Description="All filters passed">
3+
##contig=<ID=chr1,length=248387328>
4+
##contig=<ID=chr2,length=242696752>
5+
##contig=<ID=chr3,length=201105948>
6+
##reference=file:ref.fa
7+
#CHROM POS ID REF ALT QUAL FILTER INFO
8+
chr1 29291 . C T,G,A . . .
9+
chr2 29292 . T C . . .
10+
chr2 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
11+
chr3 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
12+
chr3 29292 . T CGTA . . .

test/norm.split.merge.2.out

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
##fileformat=VCFv4.2
2+
##FILTER=<ID=PASS,Description="All filters passed">
3+
##contig=<ID=chr1,length=248387328>
4+
##contig=<ID=chr2,length=242696752>
5+
##contig=<ID=chr3,length=201105948>
6+
##reference=file:ref.fa
7+
#CHROM POS ID REF ALT QUAL FILTER INFO
8+
chr1 29291 . C T . . .
9+
chr1 29291 . C G . . .
10+
chr1 29291 . C A . . .
11+
chr2 29292 . T C . . .
12+
chr2 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
13+
chr3 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
14+
chr3 29292 . T CGTA . . .

test/norm.split.merge.3.out

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
##fileformat=VCFv4.2
2+
##FILTER=<ID=PASS,Description="All filters passed">
3+
##contig=<ID=chr1,length=248387328>
4+
##contig=<ID=chr2,length=242696752>
5+
##contig=<ID=chr3,length=201105948>
6+
##reference=file:ref.fa
7+
#CHROM POS ID REF ALT QUAL FILTER INFO
8+
chr1 29291 . C T,G,A . . .
9+
chr2 29292 . T C . . .
10+
chr2 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC . . .
11+
chr2 29292 . T TCTCTTTCTCACTGTCTCTCTAGCC . . .
12+
chr2 29292 . T TCCCTCTCCTTTCTCCTCTCTAGC . . .
13+
chr2 29292 . T TCCATCTGTATCCTCTCTAAGC . . .
14+
chr2 29292 . T TCCCTCTCCTTTCTCCTCAGCC . . .
15+
chr2 29292 . T TCCCTCTCCCTTTCTCCTCTCTAGCC . . .
16+
chr2 29292 . T TCCTCTCCTTTCTCCTCTACCGC . . .
17+
chr2 29292 . T TCCCTCTCCTTTCTCTCTCTAGCC . . .
18+
chr2 29292 . T TCCCTCTCCTTTCTCCTCTAGCC . . .
19+
chr2 29292 . T TCCCTCTCCTTTTCCTCCCCAGCC . . .
20+
chr2 29292 . T TCCCTCTCCTTCTCCTCTCTAGCC . . .
21+
chr2 29292 . T TCCCTCTCCCTTCTCCTCTCTCAC . . .
22+
chr3 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC . . .
23+
chr3 29292 . T TCTCTTTCTCACTGTCTCTCTAGCC . . .
24+
chr3 29292 . T TCCCTCTCCTTTCTCCTCTCTAGC . . .
25+
chr3 29292 . T TCCATCTGTATCCTCTCTAAGC . . .
26+
chr3 29292 . T TCCCTCTCCTTTCTCCTCAGCC . . .
27+
chr3 29292 . T TCCCTCTCCCTTTCTCCTCTCTAGCC . . .
28+
chr3 29292 . T TCCTCTCCTTTCTCCTCTACCGC . . .
29+
chr3 29292 . T TCCCTCTCCTTTCTCTCTCTAGCC . . .
30+
chr3 29292 . T TCCCTCTCCTTTCTCCTCTAGCC . . .
31+
chr3 29292 . T TCCCTCTCCTTTTCCTCCCCAGCC . . .
32+
chr3 29292 . T TCCCTCTCCTTCTCCTCTCTAGCC . . .
33+
chr3 29292 . T TCCCTCTCCCTTCTCCTCTCTCAC . . .
34+
chr3 29292 . T CGTA . . .

test/norm.split.merge.4.out

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
##fileformat=VCFv4.2
2+
##FILTER=<ID=PASS,Description="All filters passed">
3+
##contig=<ID=chr1,length=248387328>
4+
##contig=<ID=chr2,length=242696752>
5+
##contig=<ID=chr3,length=201105948>
6+
##reference=file:ref.fa
7+
#CHROM POS ID REF ALT QUAL FILTER INFO
8+
chr1 29291 . C T,G,A . . .
9+
chr2 29292 . T C,TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
10+
chr3 29292 . T CGTA,TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .

test/norm.split.merge.vcf

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
##fileformat=VCFv4.2
2+
##contig=<ID=chr1,length=248387328>
3+
##contig=<ID=chr2,length=242696752>
4+
##contig=<ID=chr3,length=201105948>
5+
##reference=file:ref.fa
6+
#CHROM POS ID REF ALT QUAL FILTER INFO
7+
chr1 29291 . C T,G,A . . .
8+
chr2 29292 . T C,TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
9+
chr3 29292 . T CGTA,TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .

test/test.pl

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -298,6 +298,12 @@
298298
run_test(\&test_vcf_norm,$opts,in=>'norm.right-align',fai=>'norm.right-align',out=>'norm.right-align.2.out',args=>'--old-rec-tag ORI -g {PATH}/norm.right-align.gff');
299299
run_test(\&test_vcf_norm,$opts,in=>'norm.atom-split-norm',fai=>'norm.atom-split-norm',out=>'norm.atom-split-norm.1.out',args=>'--old-rec-tag ORI -a -m -any');
300300
run_test(\&test_vcf_norm,$opts,in=>'norm.string-tags',out=>'norm.string-tags.1.out',args=>'-m -any');
301+
run_test(\&test_vcf_norm,$opts,in=>'norm.split.merge',out=>'norm.split.merge.1.out',args=>['-m -','-m +both']);
302+
run_test(\&test_vcf_norm,$opts,in=>'norm.split.merge',out=>'norm.split.merge.2.out',args=>['-m -','-m +indels']);
303+
run_test(\&test_vcf_norm,$opts,in=>'norm.split.merge',out=>'norm.split.merge.3.out',args=>['-m -','-m +snps']);
304+
run_test(\&test_vcf_norm,$opts,in=>'norm.split.merge',out=>'norm.split.merge.4.out',args=>['-m -','-m +any']);
305+
run_test(\&test_vcf_norm,$opts,in=>'norm.merge.4',out=>'norm.merge.4.1.out',args=>'-m +any');
306+
run_test(\&test_vcf_norm,$opts,in=>'norm.merge.4',out=>'norm.merge.4.2.out',args=>'-m +both');
301307
run_test(\&test_vcf_view,$opts,in=>'merge.gvcf.2.a',out=>'merge.gvcf.2.a.1.out',args=>'-HA');
302308
run_test(\&test_vcf_view,$opts,in=>'merge.gvcf.2.a',out=>'merge.gvcf.2.a.2.out',args=>'-HAA');
303309
run_test(\&test_vcf_view,$opts,in=>'weird-chr-names',out=>'weird-chr-names.1.out',args=>'',reg=>'-r 1');

vcfnorm.c

Lines changed: 83 additions & 67 deletions
Original file line numberDiff line numberDiff line change
@@ -79,8 +79,8 @@ typedef struct
7979
{
8080
char *tseq, *seq;
8181
int mseq;
82-
bcf1_t **lines, **tmp_lines, **alines, **blines, *mrow_out;
83-
int ntmp_lines, mtmp_lines, nalines, malines, nblines, mblines;
82+
bcf1_t **lines, **tmp_lines, **mrows, *mrow_out;
83+
int ntmp_lines, mtmp_lines, nmrows, mmrows, mrows_first;
8484
map_t *maps; // mrow map for each buffered record
8585
char **als;
8686
int mmaps, nals, mals;
@@ -1874,72 +1874,98 @@ static void merge_biallelics_to_multiallelic(args_t *args, bcf1_t *dst, bcf1_t *
18741874
}
18751875

18761876
#define SWAP(type_t, a, b) { type_t t = a; a = b; b = t; }
1877-
static void mrows_schedule(args_t *args, bcf1_t **line)
1877+
static void mrows_push(args_t *args, bcf1_t **line)
18781878
{
18791879
int i,m;
1880-
if ( args->mrows_collapse==COLLAPSE_ANY // merge all record types together
1881-
|| bcf_get_variant_types(*line)&VCF_SNP // SNP, put into alines
1882-
|| bcf_get_variant_types(*line)==VCF_REF ) // ref
1883-
{
1884-
args->nalines++;
1885-
m = args->malines;
1886-
hts_expand(bcf1_t*,args->nalines,args->malines,args->alines);
1887-
for (i=m; i<args->malines; i++) args->alines[i] = bcf_init1();
1888-
SWAP(bcf1_t*, args->alines[args->nalines-1], *line);
1889-
}
1890-
else
1891-
{
1892-
args->nblines++;
1893-
m = args->mblines;
1894-
hts_expand(bcf1_t*,args->nblines,args->mblines,args->blines);
1895-
for (i=m; i<args->mblines; i++) args->blines[i] = bcf_init1();
1896-
SWAP(bcf1_t*, args->blines[args->nblines-1], *line);
1880+
if ( !args->nmrows ) args->mrows_first = 0;
1881+
args->nmrows++;
1882+
m = args->mmrows;
1883+
hts_expand(bcf1_t*,args->nmrows,args->mmrows,args->mrows);
1884+
for (i=m; i<args->mmrows; i++) args->mrows[i] = bcf_init1();
1885+
SWAP(bcf1_t*, args->mrows[args->nmrows-1], *line);
1886+
1887+
if ( args->mrows_collapse==COLLAPSE_ANY ) return;
1888+
1889+
// move the line up the sorted list so that the same variant types end up together
1890+
int cur_type = bcf_get_variant_types(args->mrows[args->nmrows-1]);
1891+
i = args->mrows_first + args->nmrows - 1;
1892+
while (i>0)
1893+
{
1894+
int prev_type = bcf_get_variant_types(args->mrows[i-1]);
1895+
if ( prev_type <= cur_type ) break;
1896+
bcf1_t *tmp = args->mrows[i-1];
1897+
args->mrows[i-1] = args->mrows[i];
1898+
args->mrows[i] = tmp;
1899+
i--;
18971900
}
18981901
}
1899-
static int mrows_ready_to_flush(args_t *args, bcf1_t *line)
1902+
static int mrows_can_flush(args_t *args, bcf1_t *line)
19001903
{
1901-
if ( args->nalines && (args->alines[0]->rid!=line->rid || args->alines[0]->pos!=line->pos) ) return 1;
1902-
if ( args->nblines && (args->blines[0]->rid!=line->rid || args->blines[0]->pos!=line->pos) ) return 1;
1904+
if ( !args->nmrows ) return 0;
1905+
int ibeg = args->mrows_first;
1906+
if ( args->mrows[ibeg]->rid != line->rid ) return 1;
1907+
if ( args->mrows[ibeg]->pos != line->pos ) return 1;
19031908
return 0;
19041909
}
19051910
static bcf1_t *mrows_flush(args_t *args)
19061911
{
1907-
if ( args->nblines && args->nalines==1 && bcf_get_variant_types(args->alines[0])==VCF_REF )
1912+
if ( !args->nmrows ) return NULL;
1913+
1914+
int ibeg = args->mrows_first;
1915+
1916+
//fprintf(stderr,"flush: ibeg=%d n=%d\n",ibeg,args->nmrows);
1917+
//int i;
1918+
//for (i=ibeg; i<ibeg+args->nmrows; i++)
1919+
// fprintf(stderr,"\ti=%d type=%d %s %s\n",i,bcf_get_variant_types(args->mrows[i]),args->mrows[i]->d.allele[0],args->mrows[i]->d.allele[1]);
1920+
1921+
if ( args->nmrows==1 )
19081922
{
1909-
// By default, REF lines are merged with SNPs if SNPs and indels are to be kept separately.
1910-
// However, if there are indels only and a single REF line, merge it with indels.
1911-
args->nblines++;
1912-
int i,m = args->mblines;
1913-
hts_expand(bcf1_t*,args->nblines,args->mblines,args->blines);
1914-
for (i=m; i<args->mblines; i++) args->blines[i] = bcf_init1();
1915-
SWAP(bcf1_t*, args->blines[args->nblines-1], args->alines[0]);
1916-
args->nalines--;
1923+
args->nmrows = 0;
1924+
return args->mrows[ibeg];
19171925
}
1918-
if ( args->nalines )
1926+
1927+
if ( args->mrows_collapse==COLLAPSE_ANY )
19191928
{
1920-
if ( args->nalines==1 )
1921-
{
1922-
args->nalines = 0;
1923-
return args->alines[0];
1924-
}
1929+
// merge everything with anything
19251930
bcf_clear(args->mrow_out);
1926-
merge_biallelics_to_multiallelic(args, args->mrow_out, args->alines, args->nalines);
1927-
args->nalines = 0;
1931+
merge_biallelics_to_multiallelic(args, args->mrow_out, &args->mrows[ibeg], args->nmrows - ibeg);
1932+
args->nmrows = 0;
19281933
return args->mrow_out;
19291934
}
1930-
else if ( args->nblines )
1935+
1936+
int j;
1937+
int types[] = { VCF_SNP, VCF_MNP, VCF_INDEL, VCF_OTHER, -1 }; // merge everything within the same category
1938+
if ( args->mrows_collapse==COLLAPSE_SNPS ) types[1] = -1; // merge SNPs only
1939+
else if ( args->mrows_collapse==COLLAPSE_INDELS ) types[0] = VCF_INDEL, types[1] = -1; // merge indels only
1940+
for (j=0; types[j]!=-1; j++)
19311941
{
1932-
if ( args->nblines==1 )
1942+
int i, type = types[j]; // to keep the compiler happy
1943+
for (i=ibeg; i<ibeg+args->nmrows; i++)
19331944
{
1934-
args->nblines = 0;
1935-
return args->blines[0];
1945+
type = bcf_get_variant_types(args->mrows[i]);
1946+
if ( type!=types[j] && type!=VCF_REF ) break;
1947+
}
1948+
if ( i==ibeg+1 && type!=VCF_REF )
1949+
{
1950+
// just one line of this type, no merging, but multiple lines of different type follow
1951+
args->nmrows--;
1952+
args->mrows_first++;
1953+
return args->mrows[ibeg];
1954+
}
1955+
if ( i>ibeg )
1956+
{
1957+
// more than one line, merging is needed
1958+
int nflush = i - ibeg;
1959+
bcf_clear(args->mrow_out);
1960+
merge_biallelics_to_multiallelic(args, args->mrow_out, &args->mrows[ibeg], nflush);
1961+
args->nmrows -= nflush;
1962+
args->mrows_first += nflush;
1963+
return args->mrow_out;
19361964
}
1937-
bcf_clear(args->mrow_out);
1938-
merge_biallelics_to_multiallelic(args, args->mrow_out, args->blines, args->nblines);
1939-
args->nblines = 0;
1940-
return args->mrow_out;
19411965
}
1942-
return NULL;
1966+
args->nmrows--;
1967+
args->mrows_first++;
1968+
return args->mrows[ibeg];
19431969
}
19441970
static void cmpals_add(cmpals_t *ca, bcf1_t *rec)
19451971
{
@@ -2013,21 +2039,13 @@ static void flush_buffer(args_t *args, htsFile *file, int n)
20132039
k = rbuf_shift(&args->rbuf);
20142040
if ( args->mrows_op==MROWS_MERGE )
20152041
{
2016-
if ( mrows_ready_to_flush(args, args->lines[k]) )
2042+
if ( mrows_can_flush(args, args->lines[k]) )
20172043
{
20182044
while ( (line=mrows_flush(args)) )
20192045
if ( bcf_write1(file, args->out_hdr, line)!=0 ) error("[%s] Error: cannot write to %s\n", __func__,args->output_fname);
20202046
}
2021-
int merge = 1;
2022-
if ( args->mrows_collapse!=COLLAPSE_BOTH && args->mrows_collapse!=COLLAPSE_ANY )
2023-
{
2024-
if ( !(bcf_get_variant_types(args->lines[k]) & args->mrows_collapse) ) merge = 0;
2025-
}
2026-
if ( merge )
2027-
{
2028-
mrows_schedule(args, &args->lines[k]);
2029-
continue;
2030-
}
2047+
mrows_push(args, &args->lines[k]);
2048+
continue;
20312049
}
20322050
else if ( args->rmdup )
20332051
{
@@ -2125,12 +2143,9 @@ static void destroy_data(args_t *args)
21252143
for (i=0; i<args->mtmp_lines; i++)
21262144
if ( args->tmp_lines[i] ) bcf_destroy1(args->tmp_lines[i]);
21272145
free(args->tmp_lines);
2128-
for (i=0; i<args->malines; i++)
2129-
bcf_destroy1(args->alines[i]);
2130-
free(args->alines);
2131-
for (i=0; i<args->mblines; i++)
2132-
bcf_destroy1(args->blines[i]);
2133-
free(args->blines);
2146+
for (i=0; i<args->mmrows; i++)
2147+
bcf_destroy1(args->mrows[i]);
2148+
free(args->mrows);
21342149
for (i=0; i<args->mmaps; i++)
21352150
free(args->maps[i].map);
21362151
for (i=0; i<args->ntmp_als; i++)
@@ -2228,7 +2243,8 @@ static int split_and_normalize(args_t *args)
22282243
// any restrictions on variant types to split?
22292244
if ( args->mrows_collapse!=COLLAPSE_BOTH && args->mrows_collapse!=COLLAPSE_ANY )
22302245
{
2231-
if ( !(bcf_get_variant_types(line) & args->mrows_collapse) )
2246+
int type = args->mrows_collapse==COLLAPSE_SNPS ? VCF_SNP : VCF_INDEL;
2247+
if ( !(bcf_get_variant_types(line) & type) )
22322248
{
22332249
normalize_line(args, line);
22342250
return 0;

0 commit comments

Comments
 (0)