Minor CRAM notation suggestion: Use IEC-style MiB notation for powers of two, rather than MB by jmarshall · Pull Request #841 · samtools/hts-specs

jmarshall · 2025-08-12T11:45:39Z

CRAM parts split out from PR #839 — see the conversation there:

CRAMv2.1.tex and CRAMv3.tex correctly distinguish MB (megabytes, though could use MiB) and kb/Mb (kilobases and megabases, correctly decimal, and the correct abbreviation for “bases” in a bioinformatics context).

github-actions · 2025-08-12T11:47:36Z

Changed PDFs as of 4a80f86: CRAMv2.1 (diff), CRAMv3.

jkbonfield · 2025-08-12T14:28:51Z

Thank you for the update, but on reviewing this text I see it's been hanging around since CRAMv2 days (before I started my implementation). The figures are very misleading, especially given no statement about instrument type.

Eg:

\textbf{Mapped short reads with bases, pairing and mapping information}

We have 250,000 mapped short reads (100bp) with bases, pairing and mapping information.
We estimate the compression to be 0.2 bits/base. Space estimate is $250,000 \times 100
\times 0.2 \bits \approx 0.6 \MiB$. Data could be stored in a single
container.

On a 1 million read NovaSeq file this averaged out at 0.93 bits per base (bpb) including quality values, aux tags, etc. Sequence was about 0.18 bits per base, so that's perhaps where this value came from. I note it's unchanged since v2.1, and maybe earlier. Original CRAM wasn't storing quality values amongst other things, and maybe no tags, so perhaps it was more realistic then? Uncompressing this (writing to BAM level 0) comes out at 17.6 bpb (around 19x larger) and a bit less for uncompressed CRAM (11 bpb). Yet this file also has just 0.167 MiB compressed container size.

We're talking in the quoted text about compressed sizes, so a 1 MiB compressed container would be ~20 MiB if held in memory as an array of decoded BAM objects. Still fitting in the L2 size, but do we really want to recommend a default block size two orders of magnitude larger than BAM (64 KiB)? It feels heavy-handed.
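As a rough check of the arithmetic in the last two paragraphs, here is a minimal Python sketch. It uses only the figures quoted above (250,000 reads of 100 bp at 0.2 bits/base, and 0.93, 17.6 and 11 bits per base). The function and variable names are illustrative and not taken from any CRAM implementation.

```python
# Sanity check of the figures quoted in this comment.
# All names here are hypothetical; only the numbers come from the discussion.
BITS_PER_BYTE = 8
KIB = 1024
MIB = 1024 * 1024


def size_mib(n_reads: int, read_len: int, bits_per_base: float) -> float:
    """Estimated size in MiB of n_reads reads of read_len bases each."""
    total_bits = n_reads * read_len * bits_per_base
    return total_bits / BITS_PER_BYTE / MIB


# The spec's worked example: 250,000 x 100 bp at 0.2 bits/base.
spec_estimate = size_mib(250_000, 100, 0.2)

# Expansion factors relative to the measured 0.93 bits/base.
expansion_bam = 17.6 / 0.93   # level-0 BAM vs. compressed CRAM
expansion_cram = 11 / 0.93    # uncompressed CRAM vs. compressed CRAM

# A 1 MiB compressed container decoded to BAM-like records, in MiB,
# and how that compares with a 64 KiB BGZF block.
decoded_mib = 1.0 * expansion_bam
ratio_to_bgzf = decoded_mib * MIB / (64 * KIB)

print(f"spec estimate:      {spec_estimate:.2f} MiB")   # 0.60 MiB
print(f"BAM expansion:      {expansion_bam:.1f}x")      # 18.9x
print(f"CRAM expansion:     {expansion_cram:.1f}x")     # 11.8x
print(f"decoded container:  {decoded_mib:.1f} MiB")     # 18.9 MiB
print(f"vs 64 KiB block:    {ratio_to_bgzf:.0f}x")      # 303x
```

So the spec's 0.6 MiB figure is internally consistent with its 0.2 bits/base assumption; it is the assumption, not the multiplication, that the measured 0.93 bits/base calls into question.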

Indeed my own implementations default to capping at 10,000 alignments or 500 kbp, whichever comes sooner. For short read data that's around 3 MiB uncompressed, or closer to 10 MiB for long read technologies.
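The "whichever comes sooner" rule can be sketched as below. This is an illustration only, not the code of any real CRAM writer: the function name and parameters are hypothetical, the default limits are the two figures stated in the comment, and checking the limits after adding each read (rather than before) is an assumption.

```python
from typing import Iterable, Iterator, List


def split_into_containers(
    read_lengths: Iterable[int],
    max_reads: int = 10_000,     # alignment cap quoted in the comment
    max_bases: int = 500_000,    # base cap quoted in the comment (500 kbp)
) -> Iterator[List[int]]:
    """Group reads into batches, closing a batch once either cap is reached."""
    batch: List[int] = []
    bases = 0
    for length in read_lengths:
        batch.append(length)
        bases += length
        # Flush as soon as either the read count or the base count hits its cap.
        if len(batch) >= max_reads or bases >= max_bases:
            yield batch
            batch, bases = [], 0
    if batch:
        # Emit the final, partially filled batch.
        yield batch


# Short reads hit the base cap first; long reads hit it after very few records.
short_batches = list(split_into_containers([100] * 20_000))
long_batches = list(split_into_containers([10_000] * 100))
print(len(short_batches), len(short_batches[0]))
print(len(long_batches), len(long_batches[0]))
```

Having two caps keeps container sizes bounded for both short-read and long-read data without needing to know the instrument type in advance.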

My conclusion is the entire section is somewhat irrelevant, and a single recommendation is inappropriate too. I know the cancer pipeline here was using smaller CRAM container sizes than the defaults because they prioritised random access. Other places may be using larger containers and slower compression methods as they view CRAM primarily as an archive-only format. Hence the profiles (e.g. samtools view -O cram,fast or samtools view -O cram,small). These are totally implementation defined and we don't really have any stipulations to make. A recommendation would be soft, and perhaps for the middle default ground.

Maybe:

"The choice of container size is entirely implementation defined, as is which compression methods and compression levels to use.
We recommend exposing a series of compression profiles or command-line options to provide user control, defaulting to faster methods and no more than a few megabytes of uncompressed data per container."

I think the rest of it is unnecessary (including being overly specific on units).


github-actions · 2026-03-17T15:56:48Z

Changed PDFs as of fed5e10: CRAMv2.1 (diff), CRAMv3.


github-actions · 2026-03-17T17:04:22Z

Changed PDFs as of c09e001: CRAMv2.1 (diff), CRAMv3.

jkbonfield · 2026-03-17T17:05:58Z

Thank you for the MB vs MiB changes.

I added an extra commit to completely rewrite the worked examples of container sizes. The maths was wrong, and the examples were misleading anyway now that technology has changed: NovaSeq uses quantised quality values, while ONT uses unquantised ones with many more distinct values. It's best to just stick to the basic principles and leave the specifics unsaid.

I realise now that this means the MB vs MiB change is completely invisible as it's all removed again! However I don't see that as necessarily a reason for removing the commit as it does accurately reflect the history of edits.

Back to @jmarshall I think to review the wording. Thanks

jmarshall added the cram label Aug 12, 2025

jmarshall mentioned this pull request Aug 12, 2025

Fix BGZF block size to be 64 kilobyte #839

Merged

jkbonfield added this to GA4GH File Formats Aug 12, 2025

jkbonfield moved this to New items in GA4GH File Formats Aug 12, 2025

jkbonfield moved this from New items to To do (backlog) in GA4GH File Formats Aug 12, 2025

Use IEC-style MiB notation for powers of two, rather than MB [minor] (s…

fed5e10

…amtools#841)

jkbonfield force-pushed the cram-iec branch from 4a80f86 to fed5e10 Compare March 17, 2026 15:54

Rewrote the "Choosing the container size" appendix

c09e001

This was both out of date and simply wrong in a lot of the mathematics.

jmarshall wants to merge 2 commits into samtools:master from jmarshall:cram-iec
