SamRecordClipper has function(s) to trim reads that extend beyond the ends of their mates. I.e. if you have an FR pair where the reads and longer than the insert size, to trim the reads back to the insert size.
A change was made to change the trim point to take into account softclipping and reduce how aggressive we are in clipping.
However, with the change it is possible to create situations where a read is now either over- or under-clipped. This only comes up in rather infrequent cases involving:
- A read pair that sequences all the way through the insert and into the adapters
- A mapping location in the reference where at least 1-2 bases of adapter will align to the reference
- An indel of sufficient size / proximity to one end of the read to disrupt the alignment in the presence of other sequencing errors
For example, say we have 150bp reads from an insert that a) matches to 90bp of the reference sequence and b) has a 10bp insertion 30bp from one side. We might get alignments that look like this:
r1 strand=+ start=1234500 cigar=70M10I23M47S end=1234592 // 3bp of adapter sequence matches reference
r2 strand=- start=1234500 cigar=50S70M30S end=1234569
When go to see if R1 extends past the start of R2, we calculate the "unclipped" end of R2 as end+clipping which comes out to 1234569 + 30 = 1234599. We see that is beyond the mapped end position of R1, into the soft-clipping, therefore a) don't remove any aligned bases and b) if in hard-clipping mode, remove only part of the soft-clipped bases at the end of R1.
Alternatively, if a deletion is present instead of an insertion, then over-clipping occurs.
All of this occurs because we are using distances on the reference and distances on the query sequence interchangeably in the calculation of where to clip, which breaks down in the presences of indels. I believe what we should be doing instead is as follows:
- Quick/efficient check to see if clipping is definitely not needed because a) the read being examined has no soft-clipping at it's end and it's end position is not past the start position of it's mate
- Calculate the last position at which the read and it's mate share a reference position (in the example above this would be at
1234569 when clipping R1)
- Calculate how many query bases each read has past the last shared reference position (in the example above this would be 80 for R1 and 30 for R2)
- Remove bases such that there are the same number of query bases after the last shared reference position (in the example above, remove 50 bases from the end of R1, bringing it's cigar to
70M10I20M50H)
SamRecordClipper has function(s) to trim reads that extend beyond the ends of their mates. I.e. if you have an FR pair where the reads and longer than the insert size, to trim the reads back to the insert size.
A change was made to change the trim point to take into account softclipping and reduce how aggressive we are in clipping.
However, with the change it is possible to create situations where a read is now either over- or under-clipped. This only comes up in rather infrequent cases involving:
For example, say we have 150bp reads from an insert that a) matches to 90bp of the reference sequence and b) has a 10bp insertion 30bp from one side. We might get alignments that look like this:
When go to see if R1 extends past the start of R2, we calculate the "unclipped" end of R2 as
end+clippingwhich comes out to1234569 + 30= 1234599. We see that is beyond the mapped end position of R1, into the soft-clipping, therefore a) don't remove any aligned bases and b) if in hard-clipping mode, remove only part of the soft-clipped bases at the end of R1.Alternatively, if a deletion is present instead of an insertion, then over-clipping occurs.
All of this occurs because we are using distances on the reference and distances on the query sequence interchangeably in the calculation of where to clip, which breaks down in the presences of indels. I believe what we should be doing instead is as follows:
1234569when clipping R1)70M10I20M50H)