51 changes: 13 additions & 38 deletions CRAMv2.1.tex
@@ -1657,51 +1657,26 @@ \subsubsection*{BYTE\_ARRAY\_STOP }

\subsection{\textbf{Choosing the container size}}

-CRAM format does not constrain the size of the containers. However, the following
-should be considered when deciding the container size:
+The CRAM format does not constrain the size of the containers; this is an implementation choice.
+However, the following should be considered when deciding the container size:

-$\bullet$ Data can be compressed better by using larger containers
-
-$\bullet$ Random access performance is better for smaller containers
-
-$\bullet$ Streaming is more convenient for small containers
-
-$\bullet$ Applications typically buffer containers into memory
-
-We recommend 1MB containers. They are small enough to provide good random access
-and streaming performance while being large enough to provide good compression.
-1MB containers are also small enough to fit into the L2 cache of most modern CPUs.
-
-Some simplified examples are provided below to fit data into 1MB containers.
-
-\textbf{Unmapped short reads with bases, read names, recalibrated and original
-quality scores}
-
-We have 10,000 unmapped short reads (100bp) with read names, recalibrated and original
-quality scores. We estimate 0.4 bits/base (read names) + 0.4 bits/base (bases)
-+ 3 bits/base (recalibrated quality scores) + 3 bits/base (original quality scores)
-=\textasciitilde{} 7 bits/base. Space estimate is (10,000 * 100 * 7) / 8 / 1024
-/ 1024 =\textasciitilde{} 0.9 MB. Data could be stored in a single container.
-
-\textbf{Unmapped long reads with bases, read names and quality scores}
-
-We have 10,000 unmapped long reads (10kb) with read names and quality scores. We
-estimate: 0.4 bits/base (bases) + 3 bits/base (original quality scores) =\textasciitilde{}
-3.5 bits/base. Space estimate is (10,000 * 10,000 * 3.5) / 8 / 1024 / 1024 =\textasciitilde{}
-42 MB. Data could be stored in 42 x 1MB containers.
-
-\textbf{Mapped short reads with bases, pairing and mapping information}
-
-We have 250,000 mapped short reads (100bp) with bases, pairing and mapping information.
-We estimate the compression to be 0.2 bits/base. Space estimate is (250,000 * 100
-* 0.2) / 8 / 1024 / 1024 =\textasciitilde{} 0.6 MB. Data could be stored in a single
-container.
-
-\textbf{Embedded reference sequences}
-
-We have a reference sequence (10Mb). We estimate the compression to be 2 bits/base.
-Space estimate is (10000000 * 2 / 8 / 1024 / 1024) =\textasciitilde{} 2.4MB. Data
-could be written into three containers: 1MB + 1MB + 0.4MB.
+\begin{itemize}
+\item Splitting data into defined genomic regions can be problematic when the depth is highly variable, so data size is a more useful metric.
+\item Larger containers compress better.
+\item Smaller containers offer better random access performance.
+\item Streaming is more convenient for small containers.
+\item Applications typically buffer containers into memory. This becomes more important if an application is multi-threaded.
+\end{itemize}
+
+The optimal container size will depend on the use case (e.g.\ active work or archival; streaming or random access).
+We recommend that application writers consider exposing a series of compression profiles or command-line options to provide user control, defaulting to faster methods and no more than a few megabytes of uncompressed data per container.
+
+Note that different instrument types can have vastly different compression ratios per byte stream; for example, quality values quantised into 4 values with high correlation with the previous qualities, versus 60 or more discrete values with minimal predictability.
+Tools such as \emph{samtools cram-size} can report the space taken up by each data type to aid parameter evaluation.
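The space estimates in the worked examples this change removes all follow the same bits-per-base arithmetic. A minimal sketch of that calculation (the function names are illustrative, and the bits/base rates are the example figures from the text, not guaranteed compression rates):

```python
import math

def estimate_mib(reads: int, read_length: int, bits_per_base: float) -> float:
    """Estimated compressed size in MiB: reads * length * bits/base, in bytes."""
    return reads * read_length * bits_per_base / 8 / 1024 / 1024

def containers_needed(reads: int, read_length: int, bits_per_base: float,
                      container_mib: float = 1.0) -> int:
    """Number of fixed-size containers the batch would span."""
    return max(1, math.ceil(estimate_mib(reads, read_length, bits_per_base) / container_mib))

# Unmapped short reads: names + bases + two sets of qualities, ~7 bits/base
print(round(estimate_mib(10_000, 100, 7), 2))       # 0.83 (the text rounds to ~0.9)
# Unmapped long reads: bases + qualities, ~3.5 bits/base
print(containers_needed(10_000, 10_000, 3.5))       # 42
```

The same sketch reproduces the other examples: 250,000 mapped short reads at 0.2 bits/base fit a single container, and a 10 Mb reference at 2 bits/base needs about 2.4 MiB.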

\newpage

53 changes: 14 additions & 39 deletions CRAMv3.tex
@@ -26,7 +26,7 @@
\renewcommand{\footrulewidth}{0pt}

\newcommand\bits{\,\mbox{bits}}
-\newcommand\MB{\,\mbox{MB}}
+\newcommand\MiB{\,\mbox{MiB}}

\setlength{\parindent}{0cm}
\setlength{\parskip}{0.18cm}
@@ -2560,51 +2560,26 @@ \subsection{\textbf{name tokeniser}}
\section{\textbf{Appendix}}
\subsection{\textbf{Choosing the container size}}

-CRAM format does not constrain the size of the containers. However, the following
-should be considered when deciding the container size:
+The CRAM format does not constrain the size of the containers; this is an implementation choice.
+However, the following should be considered when deciding the container size:

-$\bullet$ Data can be compressed better by using larger containers
-
-$\bullet$ Random access performance is better for smaller containers
-
-$\bullet$ Streaming is more convenient for small containers
-
-$\bullet$ Applications typically buffer containers into memory
-
-We recommend 1 megabyte containers. They are small enough to provide good random access
-and streaming performance while being large enough to provide good compression.
-1\MB\ containers are also small enough to fit into the L2 cache of most modern CPUs.
-
-Some simplified examples are provided below to fit data into 1\MB\ containers.
-
-\textbf{Unmapped short reads with bases, read names, recalibrated and original
-quality scores}
-
-We have 10,000 unmapped short reads (100bp) with read names, recalibrated and original
-quality scores. We estimate 0.4 bits/base (read names) + 0.4 bits/base (bases)
-+ 3 bits/base (recalibrated quality scores) + 3 bits/base (original quality scores)
-$\approx$ 7 bits/base. Space estimate is $10\,000 \times 100 \times 7 \bits
-\approx 0.9 \MB$. Data could be stored in a single container.
-
-\textbf{Unmapped long reads with bases, read names and quality scores}
-
-We have 10,000 unmapped long reads (10kb) with read names and quality scores. We
-estimate: 0.4 bits/base (bases) + 3 bits/base (original quality scores) $\approx$
-3.5 bits/base. Space estimate is $10\,000 \times 10\,000 \times 3.5 \bits
-\approx 42 \MB$. Data could be stored in $42 \times 1\MB$ containers.
-
-\textbf{Mapped short reads with bases, pairing and mapping information}
-
-We have 250,000 mapped short reads (100bp) with bases, pairing and mapping information.
-We estimate the compression to be 0.2 bits/base. Space estimate is $250\,000 \times 100
-\times 0.2 \bits \approx 0.6 \MB$. Data could be stored in a single
-container.
-
-\textbf{Embedded reference sequences}
-
-We have a reference sequence (10Mb). We estimate the compression to be 2 bits/base.
-Space estimate is $10\,000\,000 \times 2 \bits \approx 2.4 \MB$. Data
-could be written into three containers: $1\MB + 1\MB + 0.4\MB$.
+\begin{itemize}
+\item Splitting data into defined genomic regions can be problematic when the depth is highly variable, so data size is a more useful metric.
+\item Larger containers compress better.
+\item Smaller containers offer better random access performance.
+\item Streaming is more convenient for small containers.
+\item Applications typically buffer containers into memory. This becomes more important if an application is multi-threaded.
+\end{itemize}
+
+The optimal container size will depend on the use case (e.g.\ active work or archival; streaming or random access).
+We recommend that application writers consider exposing a series of compression profiles or command-line options to provide user control, defaulting to faster methods and no more than a few megabytes of uncompressed data per container.
+
+Note that different instrument types can have vastly different compression ratios per byte stream; for example, quality values quantised into 4 values with high correlation with the previous qualities, versus 60 or more discrete values with minimal predictability.
+Tools such as \emph{samtools cram-size} can report the space taken up by each data type to aid parameter evaluation.
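The recommendation of capping uncompressed data per container amounts to a simple accumulate-and-flush loop when writing. A hypothetical sketch (the function name, record representation, and 4 MiB cap are illustrative choices, not part of the specification):

```python
def chunk_into_containers(record_sizes, max_uncompressed=4 * 1024 * 1024):
    """Group records into containers, starting a new container once the
    buffered uncompressed size would exceed the configured cap."""
    containers, current, current_bytes = [], [], 0
    for size in record_sizes:
        if current and current_bytes + size > max_uncompressed:
            containers.append(current)   # flush the full container
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:                          # flush the final partial container
        containers.append(current)
    return containers

# Ten 1 MiB records with a 4 MiB cap -> three containers of 4, 4 and 2 records
batches = chunk_into_containers([1024 * 1024] * 10)
print([len(b) for b in batches])  # [4, 4, 2]
```

A writer exposing `max_uncompressed` as a command-line option gives users the streaming/random-access/compression trade-off described above without changing the file format.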

\newpage
