Minor CRAM notation suggestion: Use IEC-style MiB notation for powers of two, rather than MB#841
Minor CRAM notation suggestion: Use IEC-style MiB notation for powers of two, rather than MB#841jmarshall wants to merge 2 commits intosamtools:masterfrom
Conversation
|
Thank you for the update, but on reviewing this text I see it's been hanging around since CRAMv2 days (before I started my implementation). The figures are very misleading, especially given no statement about instrument type. Eg:
On a 1 million read novaseq file this averaged out at 0.93 bits per base including quality values, aux tags, etc. Sequence was about 0.18 bits per base, so that's perhaps where this value came from. I note it's unchanged since v2.1, and maybe earlier. Original CRAM wasn't storing quality values amongst other things, and maybe no tags, so perhaps it was more realistic then? Uncompressing this (writing to BAM level 0) comes out at 17.6 bpb (aound 19x larger) and a bit less for uncompressed CRAM (11bpb). Yet this file also has just 0.167 MiB compressed container size. We're talking in the quoted text about compressed sizes, so a 1MiB compressed container would be ~20 MiB if held in memory as a an array of decoded BAM objects. Still fitting in the L2 size, but do we really want to recommend a default block size two orders of magnitude larger than BAM (64KiB)? It feels heavy handed. Indeed my own implementations default to capping at 10,000 alignments or 500 kbp, whichever comes sooner. For short read data that's around 3MiB uncompressed, or closer to 10MiB for long read technologies. My conclusion is the entire section is somewhat irrelevant, and a single recommendation is inappropriate too. I know the cancer pipeline here were using smaller CRAM container sizes than the defaults because they prioritised random access. Other places maybe using larger containers and slower compression methods as they view CRAM primarily as an archive-only format. Hence the profiles (e.g. Maybe: "The choice of containing size is entirely implementation defined, as is which compression methods and compression levels to use. I think the rest of it is unnecessary (including being overly specific on units). |
This was both out of date and simply wrong in a lot of the mathematics.
|
Thank you for the MB vs MiB changes. I added an extra commit to completely rewrite the worked examples of container sizes, as they were wrong in the maths and plain misleading anyway as technology has changed with quantised quality values in NovaSeq and unquantised many-more quality values in ONT. It's best to just stick to the basic principles and leave the specifics unsaid. I realise now that this means the MB vs MiB change is completely invisible as it's all removed again! However I don't see that as necessarily a reason for removing the commit as it does accurately reflect the history of edits. Back to @jmarshall I think to review the wording. Thanks |
CRAM parts split out from PR #839 — see the conversation there:
MB(megabytes, though could useMiB) andkb/Mb(kilobases and megabases, correctly decimal, and the correct abbreviation for “bases” in a bioinformatics context).