PR #703 clarified this a lot, but in doing so added the paragraph
A base substitution is defined as a change from one nucleotide base (reference base) to another (read base), including N as an unknown or missing base. There are 5 supported reference bases (ACGTN), with 4 possible substitutions for each base. Any other base type, such as an ambiguity code, must be written verbatim using the BA data series.
Obviously we cannot encode a non-ACGTN sample base as it's absent from the lookup table, but the explicit claim here is also that a reference with an ambiguity code must not use the BS data series to encode the sample base.
However this is not how Cramtools.jar worked, and nor is it how io_lib (scramble) or htslib (samtools) works. They interpret all non-ACGT reference bases as if they were N and use the BS data series to then store a delta against it. This is actually preferable, as using BA also comes with the requirement of a quality value in QS (for no obvious reason frankly other than the early origins of doing lossy-compression and only storing quality on specific bases deemed worthy). At some point htsjdk switched to honour what the specification currently states, as does Noodles.
See zaeleus/noodles#355 for discussion.
A demonstration of this is:
$ cat ref.fa
>ref
AGCTANAYRGAC
$ cat seq.sam
@SQ SN:ref LN:12
seq 0 ref 1 0 12M * 0 0 AGCTAAATGGAC *
$ samtools view -O cram -T ref.fa seq.sam|cram_dump -v - | sed -n '/Rec 1/,/Block/p'
Rec 1/1 at 0,7
BF = 0 => SAM 0x0 (ret 0, out_sz 1)
CF = 3 (ret 0, out_sz 1)
RL = 12 (ret 0, out_sz 1)
AP = 0 (ret 0, out_sz 1)
RG = -1 (ret 0, out_sz 1)
RN = seq (ret 0, out_sz 3)
Detached
MF = 0 (ret 0, out_sz 1)
NS = -1 (ret 0, out_sz 1)
NP = 0 (ret 0, out_sz 1)
TS = 0 (ret 0, out_sz 1)
TL = 0 (ret 0, out_sz 1)
FN = 3 (ret 0, out_sz 1)
0: FC = X (ret 0, out_sz 1)
0: FP = 6+0 = 6 (ret 0, out_sz 1)
0: BS = 0 (ret 0)
1: FC = X (ret 0, out_sz 1)
1: FP = 2+6 = 8 (ret 0, out_sz 1)
1: BS = 3 (ret 0)
2: FC = X (ret 0, out_sz 1)
2: FP = 1+8 = 9 (ret 0, out_sz 1)
2: BS = 2 (ret 0)
MQ = 0 (ret 0, out_sz 1)
QS = (ret 0, out_sz 12)
Block 1/5
Given my PR to "clarify" (sorry!) was only 2 years ago, I think we can conclude it was an over-reach and cramtools, io-lib and htslib are working to an earlier specification version that permitted this behaviour, albeit silently. The spec clearly needs expanding to be explicit and mention that non ACGTN reference bases are interpreted as "N" for purposes of using the BS substitution codes. Obviously non ACGTN sample bases still need to be stored verbatim, and obviously there is nothing wrong in doing so for non ACGTN reference bases either, but given the amount of data in the wild treating e.g. ref "Y" as "N" then we need to mention this explicitly and to remove the incorrect statement.
While we're at it, it ought to explicitly mention again the case here and explain reference is turned to uppercase (as well as sample sequence, but that's kind of implicit anyway through BAM compatibility which only supported one case too).
PR #703 clarified this a lot, but in doing so added the paragraph
Obviously we cannot encode a non-ACGTN sample base as it's absent from the lookup table, but the explicit claim here is also that a reference with an ambiguity code must not use the BS data series to encode the sample base.
However this is not how Cramtools.jar worked, and nor is it how io_lib (scramble) or htslib (samtools) works. They interpret all non-ACGT reference bases as if they were N and use the BS data series to then store a delta against it. This is actually preferable, as using BA also comes with the requirement of a quality value in QS (for no obvious reason frankly other than the early origins of doing lossy-compression and only storing quality on specific bases deemed worthy). At some point htsjdk switched to honour what the specification currently states, as does Noodles.
See zaeleus/noodles#355 for discussion.
A demonstration of this is:
Given my PR to "clarify" (sorry!) was only 2 years ago, I think we can conclude it was an over-reach and cramtools, io-lib and htslib are working to an earlier specification version that permitted this behaviour, albeit silently. The spec clearly needs expanding to be explicit and mention that non ACGTN reference bases are interpreted as "N" for purposes of using the BS substitution codes. Obviously non ACGTN sample bases still need to be stored verbatim, and obviously there is nothing wrong in doing so for non ACGTN reference bases either, but given the amount of data in the wild treating e.g. ref "Y" as "N" then we need to mention this explicitly and to remove the incorrect statement.
While we're at it, it ought to explicitly mention again the case here and explain reference is turned to uppercase (as well as sample sequence, but that's kind of implicit anyway through BAM compatibility which only supported one case too).