
Enable Seurat to Handle Ultra-Large Matrices (>2^31 values) #9798

Open
@BenjaminDEMAILLE

Description


## Motivation

As the scale of single-cell and spatial transcriptomics datasets continues to grow, particularly in the context of large national or international consortia (e.g., Human Cell Atlas, BICCN, HuBMAP, Allen Brain Atlas), we are increasingly working with datasets that exceed the current technical limitations of R's internal matrix representations.

Seurat currently relies on in-memory matrix formats such as matrix or dgCMatrix (from the Matrix package), which use 32-bit signed integers to represent indices and pointers. This imposes a hard limit of 2^31 - 1 values (~2.1 billion) per matrix: total entries for a dense matrix, non-zero entries for a dgCMatrix. Once a matrix exceeds this limit, it cannot be represented or manipulated in these formats.
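For reference, R reports this limit directly:

```r
.Machine$integer.max
#> [1] 2147483647  # i.e., 2^31 - 1
```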

## Example: Integration of 10 large scRNA-seq samples

Suppose we are analyzing 10 human scRNA-seq samples, each with:

  • ~300,000 cells
  • ~30,000 genes

This results in a combined gene expression matrix of size:

300,000 cells/sample × 10 samples = 3,000,000 cells
3,000,000 cells × 30,000 genes = 90,000,000,000 values

Even if the matrix is sparse and only 1% of the values are non-zero, that still gives 900 million entries — within a factor of about 2.4 of the 2^31 - 1 limit. With more samples, higher cell counts, or multi-modal assays (e.g., CITE-seq or spatial transcriptomics), we can easily cross this threshold.
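A quick back-of-the-envelope check in R, using the numbers above (the 1% density is an assumption for illustration):

```r
n_cells <- 3e6     # 10 samples x ~300,000 cells each
n_genes <- 3e4
density <- 0.01    # assumed fraction of non-zero values

nnz <- n_cells * n_genes * density   # 9e+08 non-zero entries
nnz > .Machine$integer.max           # FALSE, but ~2.4x more data crosses it
```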

When trying to create a dgCMatrix or pass this matrix into Seurat with CreateSeuratObject(), users may receive errors such as:

Error in validObject(.Object) : invalid class “dgCMatrix” object: 'i' slot is too large (> 2^31 - 1 elements)

## Real-World Consequences

  • Users are forced to downsample their data or split it into batches, losing valuable biological information.
  • Large institutions with powerful compute infrastructures are unable to take advantage of their resources within the Seurat framework due to this internal bottleneck.
  • Other tools (e.g., Scanpy in Python) handle this more flexibly, leading users to consider migrating away from Seurat despite its rich functionality.

## Why This Feature Matters

Projects like the Allen Brain Atlas and BICCN (BRAIN Initiative Cell Census Network) regularly release ultra-large single-cell and spatial transcriptomics datasets, often exceeding millions of cells and covering diverse brain regions and modalities. These datasets are invaluable for neuroscience research but are difficult to analyze at full scale with Seurat under its current constraints.

Adding support for matrix formats that bypass this limitation would allow Seurat to:

  • Scale to meet the demands of modern single-cell and spatial omics projects.
  • Interoperate with large consortium data releases without preprocessing compromises.
  • Retain its leadership role in the single-cell R ecosystem as data sizes increase.

This could be achieved by integrating support for backends such as the following (a usage sketch appears after the list):

  • spam – which provides sparse matrix classes with 64-bit integer indexing support (via the spam64 class)
  • DelayedArray (block-wise processing, supports sparse and dense backends)
  • HDF5Array (on-disk storage, HDF5 format)
  • bigmemory or ff (external memory matrix solutions)
  • Experimental sparse array formats with 64-bit indexing
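
As one illustration, an HDF5-backed DelayedArray already supports block-wise computation on matrices that never fit in memory. A minimal sketch (the file path "counts.h5" and dataset name "matrix" are hypothetical):

```r
library(HDF5Array)

# Reference an on-disk counts matrix without loading it into RAM;
# the file and dataset name here are placeholders.
counts <- HDF5Array("counts.h5", "matrix")

dim(counts)            # dimensions may imply far more than 2^31 entries
cs <- colSums(counts)  # evaluated block-wise by DelayedArray
```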

Providing this capability would represent a major step toward making Seurat robust for high-throughput, large-scale datasets — a necessity as the field advances.

## Feature Description

This feature would enable Seurat to handle matrices with more than 2^31 values, which is currently not possible due to internal limitations of R’s default matrix types (matrix, dgCMatrix). These types rely on 32-bit signed integers for indexing, which restricts the maximum number of elements in any matrix to approximately 2.1 billion.

We are requesting support for alternative matrix backends such as:

  • DelayedArray
  • HDF5Array
  • bigmemory
  • any custom sparse matrix format that allows 64-bit indexing and/or on-disk storage

Ideally, Seurat would be able to:

  • Accept these formats as input for CreateSeuratObject() and downstream functions.
  • Retain compatibility with core Seurat functions (e.g., NormalizeData, FindVariableFeatures, ScaleData) when using these backends.
  • Gracefully fall back to on-disk or chunked operations when memory limits are approached.

This feature is essential for users working with very large single-cell or spatial datasets — especially in atlas-scale projects or multi-sample integration — and would future-proof Seurat for continued growth in dataset size. A purely illustrative sketch of the requested workflow follows.
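Purely as an illustration of the requested behaviour (this is not current Seurat functionality; the file path and dataset name are hypothetical):

```r
library(Seurat)
library(HDF5Array)

# Hypothetical on-disk counts matrix with more than 2^31 values
counts <- HDF5Array("big_counts.h5", "matrix")

# Requested behaviour, not something Seurat supports today:
obj <- CreateSeuratObject(counts = counts)
obj <- NormalizeData(obj)          # would process blocks on disk
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)              # would spill to disk near memory limits
```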

Related issues: #8164 #7760 #7739
