51 changes: 13 additions & 38 deletions CRAMv2.1.tex
@@ -1657,51 +1657,26 @@ \subsubsection*{BYTE\_ARRAY\_STOP }

\subsection{\textbf{Choosing the container size}}

-CRAM format does not constrain the size of the containers. However, the following
-should be considered when deciding the container size:
+The CRAM format does not constrain the size of the containers; this is an implementation choice.
+However, the following should be considered when deciding the container size:

-$\bullet$ Data can be compressed better by using larger containers
-
-$\bullet$ Random access performance is better for smaller containers
-
-$\bullet$ Streaming is more convenient for small containers
-
-$\bullet$ Applications typically buffer containers into memory
-
-We recommend 1MB containers. They are small enough to provide good random access
-and streaming performance while being large enough to provide good compression.
-1MB containers are also small enough to fit into the L2 cache of most modern CPUs.
-
-Some simplified examples are provided below to fit data into 1MB containers.
-
-\textbf{Unmapped short reads with bases, read names, recalibrated and original
-quality scores}
-
-We have 10,000 unmapped short reads (100bp) with read names, recalibrated and original
-quality scores. We estimate 0.4 bits/base (read names) + 0.4 bits/base (bases)
-+ 3 bits/base (recalibrated quality scores) + 3 bits/base (original quality scores)
-=\textasciitilde{} 7 bits/base. Space estimate is (10,000 * 100 * 7) / 8 / 1024
-/ 1024 =\textasciitilde{} 0.9 MB. Data could be stored in a single container.
-
-\textbf{Unmapped long reads with bases, read names and quality scores}
-
-We have 10,000 unmapped long reads (10kb) with read names and quality scores. We
-estimate: 0.4 bits/base (bases) + 3 bits/base (original quality scores) =\textasciitilde{}
-3.5 bits/base. Space estimate is (10,000 * 10,000 * 3.5) / 8 / 1024 / 1024 =\textasciitilde{}
-42 MB. Data could be stored in 42 x 1MB containers.
-
-\textbf{Mapped short reads with bases, pairing and mapping information}
-
-We have 250,000 mapped short reads (100bp) with bases, pairing and mapping information.
-We estimate the compression to be 0.2 bits/base. Space estimate is (250,000 * 100
-* 0.2) / 8 / 1024 / 1024 =\textasciitilde{} 0.6 MB. Data could be stored in a single
-container.
-
-\textbf{Embedded reference sequences}
-
-We have a reference sequence (10Mb). We estimate the compression to be 2 bits/base.
-Space estimate is (10000000 * 2 / 8 / 1024 / 1024) =\textasciitilde{} 2.4MB. Data
-could be written into three containers: 1MB + 1MB + 0.4MB.
+\begin{itemize}
+\item Splitting data into defined genomic regions can be problematic when the depth is highly variable, so data size is a more useful metric.
+\item Larger containers compress better.
+\item Smaller containers offer better random access performance.
+\item Streaming is more convenient for small containers.
+\item Applications typically buffer containers into memory. This becomes more important if an application is multi-threaded.
+\end{itemize}
+
+The optimal container size will depend on the use case (e.g.\ active work or archival; streaming or random access).
+We recommend that application writers consider exposing a series of compression profiles or command-line options to provide user control, defaulting to faster methods and no more than a few megabytes of uncompressed data per container.
+
+Note that different instrument types can have vastly different compression ratios per byte stream; for example, quality values quantised into 4 values with high correlation with the previous qualities, versus 60 or more discrete values with minimal predictability.
+Tools such as \emph{samtools cram-size} can report the space taken up by each data type to aid parameter evaluation.
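The space estimates in the worked examples this change removes all follow the same bits-per-base arithmetic. A minimal sketch of that calculation (the function names are illustrative, and the bits/base rates are the example figures from the text, not guaranteed compression rates):

```python
import math

def estimate_mib(reads: int, read_length: int, bits_per_base: float) -> float:
    """Estimated compressed size in MiB: reads * length * bits/base, in bytes."""
    return reads * read_length * bits_per_base / 8 / 1024 / 1024

def containers_needed(reads: int, read_length: int, bits_per_base: float,
                      container_mib: float = 1.0) -> int:
    """Number of fixed-size containers the batch would span."""
    return max(1, math.ceil(estimate_mib(reads, read_length, bits_per_base) / container_mib))

# Unmapped short reads: names + bases + two sets of qualities, ~7 bits/base
print(round(estimate_mib(10_000, 100, 7), 2))       # 0.83 (the text rounds to ~0.9)
# Unmapped long reads: bases + qualities, ~3.5 bits/base
print(containers_needed(10_000, 10_000, 3.5))       # 42
```

The same sketch reproduces the other examples: 250,000 mapped short reads at 0.2 bits/base fit a single container, and a 10 Mb reference at 2 bits/base needs about 2.4 MiB.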

\newpage

53 changes: 14 additions & 39 deletions CRAMv3.tex
@@ -26,7 +26,7 @@
\renewcommand{\footrulewidth}{0pt}

\newcommand\bits{\,\mbox{bits}}
-\newcommand\MB{\,\mbox{MB}}
+\newcommand\MiB{\,\mbox{MiB}}

\setlength{\parindent}{0cm}
\setlength{\parskip}{0.18cm}
@@ -2560,51 +2560,26 @@ \subsection{\textbf{name tokeniser}}
\section{\textbf{Appendix}}
\subsection{\textbf{Choosing the container size}}

-CRAM format does not constrain the size of the containers. However, the following
-should be considered when deciding the container size:
+The CRAM format does not constrain the size of the containers; this is an implementation choice.
+However, the following should be considered when deciding the container size:

-$\bullet$ Data can be compressed better by using larger containers
-
-$\bullet$ Random access performance is better for smaller containers
-
-$\bullet$ Streaming is more convenient for small containers
-
-$\bullet$ Applications typically buffer containers into memory
-
-We recommend 1 megabyte containers. They are small enough to provide good random access
-and streaming performance while being large enough to provide good compression.
-1\MB\ containers are also small enough to fit into the L2 cache of most modern CPUs.
-
-Some simplified examples are provided below to fit data into 1\MB\ containers.
-
-\textbf{Unmapped short reads with bases, read names, recalibrated and original
-quality scores}
-
-We have 10,000 unmapped short reads (100bp) with read names, recalibrated and original
-quality scores. We estimate 0.4 bits/base (read names) + 0.4 bits/base (bases)
-+ 3 bits/base (recalibrated quality scores) + 3 bits/base (original quality scores)
-$\approx$ 7 bits/base. Space estimate is $10\,000 \times 100 \times 7 \bits
-\approx 0.9 \MB$. Data could be stored in a single container.
-
-\textbf{Unmapped long reads with bases, read names and quality scores}
-
-We have 10,000 unmapped long reads (10kb) with read names and quality scores. We
-estimate: 0.4 bits/base (bases) + 3 bits/base (original quality scores) $\approx$
-3.5 bits/base. Space estimate is $10\,000 \times 10\,000 \times 3.5 \bits
-\approx 42 \MB$. Data could be stored in $42 \times 1\MB$ containers.
-
-\textbf{Mapped short reads with bases, pairing and mapping information}
-
-We have 250,000 mapped short reads (100bp) with bases, pairing and mapping information.
-We estimate the compression to be 0.2 bits/base. Space estimate is $250\,000 \times 100
-\times 0.2 \bits \approx 0.6 \MB$. Data could be stored in a single
-container.
-
-\textbf{Embedded reference sequences}
-
-We have a reference sequence (10Mb). We estimate the compression to be 2 bits/base.
-Space estimate is $10\,000\,000 \times 2 \bits \approx 2.4 \MB$. Data
-could be written into three containers: $1\MB + 1\MB + 0.4\MB$.
+\begin{itemize}
+\item Splitting data into defined genomic regions can be problematic when the depth is highly variable, so data size is a more useful metric.
+\item Larger containers compress better.
+\item Smaller containers offer better random access performance.
+\item Streaming is more convenient for small containers.
+\item Applications typically buffer containers into memory. This becomes more important if an application is multi-threaded.
+\end{itemize}
+
+The optimal container size will depend on the use case (e.g.\ active work or archival; streaming or random access).
+We recommend that application writers consider exposing a series of compression profiles or command-line options to provide user control, defaulting to faster methods and no more than a few megabytes of uncompressed data per container.
+
+Note that different instrument types can have vastly different compression ratios per byte stream; for example, quality values quantised into 4 values with high correlation with the previous qualities, versus 60 or more discrete values with minimal predictability.
+Tools such as \emph{samtools cram-size} can report the space taken up by each data type to aid parameter evaluation.
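The recommendation of capping uncompressed data per container amounts to a simple accumulate-and-flush loop when writing. A hypothetical sketch (the function name, record representation, and 4 MiB cap are illustrative choices, not part of the specification):

```python
def chunk_into_containers(record_sizes, max_uncompressed=4 * 1024 * 1024):
    """Group records into containers, starting a new container once the
    buffered uncompressed size would exceed the configured cap."""
    containers, current, current_bytes = [], [], 0
    for size in record_sizes:
        if current and current_bytes + size > max_uncompressed:
            containers.append(current)   # flush the full container
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:                          # flush the final partial container
        containers.append(current)
    return containers

# Ten 1 MiB records with a 4 MiB cap -> three containers of 4, 4 and 2 records
batches = chunk_into_containers([1024 * 1024] * 10)
print([len(b) for b in batches])  # [4, 4, 2]
```

A writer exposing `max_uncompressed` as a command-line option gives users the streaming/random-access/compression trade-off described above without changing the file format.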

\newpage
