Skip to content

Update SamRecordClipper's functions to "clip past end of mate" to use distances in query bases, not reference bases #1090

@tfenne

Description

@tfenne

SamRecordClipper has function(s) to trim reads that extend beyond the ends of their mates. I.e. if you have an FR pair where the reads and longer than the insert size, to trim the reads back to the insert size.

A change was made to change the trim point to take into account softclipping and reduce how aggressive we are in clipping.

However, with the change it is possible to create situations where a read is now either over- or under-clipped. This only comes up in rather infrequent cases involving:

  • A read pair that sequences all the way through the insert and into the adapters
  • A mapping location in the reference where at least 1-2 bases of adapter will align to the reference
  • An indel of sufficient size / proximity to one end of the read to disrupt the alignment in the presence of other sequencing errors

For example, say we have 150bp reads from an insert that a) matches to 90bp of the reference sequence and b) has a 10bp insertion 30bp from one side. We might get alignments that look like this:

r1 strand=+  start=1234500  cigar=70M10I23M47S  end=1234592 // 3bp of adapter sequence matches reference
r2 strand=-  start=1234500  cigar=50S70M30S     end=1234569

When go to see if R1 extends past the start of R2, we calculate the "unclipped" end of R2 as end+clipping which comes out to 1234569 + 30 = 1234599. We see that is beyond the mapped end position of R1, into the soft-clipping, therefore a) don't remove any aligned bases and b) if in hard-clipping mode, remove only part of the soft-clipped bases at the end of R1.

Alternatively, if a deletion is present instead of an insertion, then over-clipping occurs.

All of this occurs because we are using distances on the reference and distances on the query sequence interchangeably in the calculation of where to clip, which breaks down in the presences of indels. I believe what we should be doing instead is as follows:

  1. Quick/efficient check to see if clipping is definitely not needed because a) the read being examined has no soft-clipping at it's end and it's end position is not past the start position of it's mate
  2. Calculate the last position at which the read and it's mate share a reference position (in the example above this would be at 1234569 when clipping R1)
  3. Calculate how many query bases each read has past the last shared reference position (in the example above this would be 80 for R1 and 30 for R2)
  4. Remove bases such that there are the same number of query bases after the last shared reference position (in the example above, remove 50 bases from the end of R1, bringing it's cigar to 70M10I20M50H)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions