Skip to content

TLEN meaning when leftmost and rightmost are the same for a pair, or the same for a single record #842

@jkbonfield

Description

@jkbonfield

This came about from samtools/htslib#1948 in what I consider to be an implementation bug in my CRAM decoder.
See also #522 and #48, but I'm ignoring for now the issue of primary vs secondary and simply looking at primary alignments

Although this is arguably an implementation bug, while looking at it I also see it's a specification weakness too.

The CRAM specification defines how to compute TLEN when we have a read-pair stored as a pair (ie not "detached"). Here the reads are in the same slice and we can derive TLEN on the fly without storing it verbatim, but to do so we have to tightly define the meaning of TLEN. CRAM 3.1 section 10.4 defines this as:

27: Find leftmost and rightmost mapped coordinate in records this_record and next_record.
28: For leftmost of this_record and next_record: template_size ← rightmost − lef tmost + 1
29: For rightmost of this_record and next_record: template_size ← −(rightmost − lef tmost + 1)

All well and good, but what if leftmost and rightmost are the same? Ie a containment:

------------>
    <-----

That's obviously under-specified. A better definition may be to define how we assign the positive (rightmost-leftmost+1) value and use an else clause for the other alignments. We lack the statement that SAM has about the two ends having differing sign:

If segments cover the same coordinates then the choice of which is leftmost and rightmost is arbitrary, but the two ends must still have differing sign.

This obviously makes the explicit assumption of "two ends". I guess we could define "ends" to be primary READ1 and READ2 and any middle reads (READ1+READ2) or supplementaries are excluded from that statement, but it's open to interpretation and I can clearly see cases where this is challenging when, for example, the supplementary or "middle" read happens to be outside the extents of the primary READ1/READ2 ends and so modifies the definition of leftmost and rightmost.

What about where we share the same start but differing ends, or the same end but differing starts? That's clearly not covered by "if segments cover the same coordinates", but it does confuse the definition of leftmost and rightmost. Similarly containments where the leftmost and rightmost are the same alignment and the "other end" is shorter and somewhere between those two points. Such cases don't happen for properly mapped data, but it can always be triggered by base calling errors and/or alignment failures. The specification needs to be clear.

If we have 3 reads, with a READ1+READ2, then that read could potentially be the leftmost or rightmost outright (not tied) due to misalignments. So maybe leftmost and rightmost there refer only to the READ1 and READ2 coordinates? Frankly triplets (or more) are just a disaster, but they could occur with supplementary alignments, so we have READ1, READ1(supp), READ2 for expample. (Ignored for now as the language is confusing enough.)

eg

----->  1+2
   -------> 1
        <-------- 2

In the above case, read 1 and 2 form a nice pair, but 1+2 is the leftmost and so we can (and do) get read1 and read2 having the same TLEN sign. Technically this is spec compliant, but I doubt it was intended to be. Sigh.

Focusing purely on pairs I can see a number of disparities between samtools, picard and noodles CRAM implementations and also samtools fixmates. I have a PR brewing for this, but I've been producing test data too. However it's also a SAM issue.

I have a work in progress in testing this, and some of my correct vs incorrect assignments are themselves wrong due to the nuance of "arbitrary" in the SAM spec above. I'll continue to fix this. For now however my on-going test files and spreadsheet of results are attached.

https://docs.google.com/spreadsheets/d/11pAe0I9794ItNKrPEw3N0F27LVPafr9Cq_F-RNBNkzs/edit?usp=sharing

TLEN.test.tar.gz

My thoughts on how to fix these issues are still in flux, but:

  • For SAM: it's pretty good, but we need to be explicit about the containment case where leftmost and rightmost are the same read and we have an extra read which is neither leftmost nor rightmost. Even if it's something such as "If segments cover the same coordinates or the leftmost and rightmost coordinates are from the same segment, then..."

  • For CRAM: we need to explain leftmost and rightmost in a far better fashion, again compensating for the situation of segments with the same start and/or end coordinates, and containments. As it's written in pseudocode, defining the positive one and using an else clause for others feels the most natural solution.

  • We may also wish to add a recommendation about ties rather than "arbitrary". Eg if we sort a file we shouldn't expect the signs to flip, which is why htslib looks at BAM flags to resolve them. However it's implementation defined and you can be spec compliant without doing that (eg noodles) so maybe that's not necessary.

PS. Once I've explored this more I'll look into creating more test files for hts-specs, but it may need multiple "valid" decodes to cover the arbitrary definitions.

I'll probably also work on a samtools fixmates --tlen1 / --tlen2 option at some point so it can be spec compliant with the current definitions, and also fix it so it's not so broken on more than 2 reads, maybe. However that's an implementation issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions