
Enable Seurat to Handle Ultra-Large Matrices (>2^31 values) #9798

Open
@BenjaminDEMAILLE

Description


## Motivation

As the scale of single-cell and spatial transcriptomics datasets continues to grow, particularly in the context of large national or international consortia (e.g., Human Cell Atlas, BICCN, HuBMAP, Allen Brain Atlas), we are increasingly working with datasets that exceed the current technical limitations of R's internal matrix representations.

Seurat currently relies on in-memory matrix formats such as matrix or dgCMatrix (from the Matrix package), which use 32-bit signed integers to represent indices and pointers. This imposes a hard limit of 2^31 - 1 values (~2.1 billion) per matrix: total entries for a dense matrix, non-zero entries for a dgCMatrix. Once a matrix exceeds this limit, it cannot be represented or manipulated in these formats.
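For reference, R reports this limit directly:

```r
.Machine$integer.max
#> [1] 2147483647  # i.e., 2^31 - 1
```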

## Example: Integration of 10 large scRNA-seq samples

Suppose we are analyzing 10 human scRNA-seq samples, each with:

  • ~300,000 cells
  • ~30,000 genes

This results in a combined gene expression matrix of size:

300,000 cells/sample × 10 samples = 3,000,000 cells
3,000,000 cells × 30,000 genes = 90,000,000,000 values

Even if the matrix is sparse and only 1% of the values are non-zero, that still gives 900 million entries — within a factor of about 2.4 of the 2^31 - 1 limit. With more samples, higher cell counts, or multi-modal assays (e.g., CITE-seq or spatial transcriptomics), we can easily cross this threshold.
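A quick back-of-the-envelope check in R, using the numbers above (the 1% density is an assumption for illustration):

```r
n_cells <- 3e6     # 10 samples x ~300,000 cells each
n_genes <- 3e4
density <- 0.01    # assumed fraction of non-zero values

nnz <- n_cells * n_genes * density   # 9e+08 non-zero entries
nnz > .Machine$integer.max           # FALSE, but ~2.4x more data crosses it
```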

When trying to create a dgCMatrix or pass this matrix into Seurat with CreateSeuratObject(), users may receive errors such as:

Error in validObject(.Object) : invalid class “dgCMatrix” object: 'i' slot is too large (> 2^31 - 1 elements)

## Real-World Consequences

  • Users are forced to downsample their data or split it into batches, losing valuable biological information.
  • Large institutions with powerful compute infrastructures are unable to take advantage of their resources within the Seurat framework due to this internal bottleneck.
  • Other tools (e.g., Scanpy in Python) handle this more flexibly, leading users to consider migrating away from Seurat despite its rich functionality.

## Why This Feature Matters

Projects like the Allen Brain Atlas and BICCN (BRAIN Initiative Cell Census Network) regularly release ultra-large single-cell and spatial transcriptomics datasets, often exceeding millions of cells and covering diverse brain regions and modalities. These datasets are invaluable for neuroscience research but are difficult to analyze at full scale with Seurat under its current constraints.

Adding support for matrix formats that bypass this limitation would allow Seurat to:

  • Scale to meet the demands of modern single-cell and spatial omics projects.
  • Interoperate with large consortium data releases without preprocessing compromises.
  • Retain its leadership role in the single-cell R ecosystem as data sizes increase.

This could be achieved by integrating support for backends such as the following (a usage sketch appears after the list):

  • spam – which provides sparse matrix classes with 64-bit integer indexing support (via the spam64 class)
  • DelayedArray (block-wise processing, supports sparse and dense backends)
  • HDF5Array (on-disk storage, HDF5 format)
  • bigmemory or ff (external memory matrix solutions)
  • Experimental sparse array formats with 64-bit indexing
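
As one illustration, an HDF5-backed DelayedArray already supports block-wise computation on matrices that never fit in memory. A minimal sketch (the file path "counts.h5" and dataset name "matrix" are hypothetical):

```r
library(HDF5Array)

# Reference an on-disk counts matrix without loading it into RAM;
# the file and dataset name here are placeholders.
counts <- HDF5Array("counts.h5", "matrix")

dim(counts)            # dimensions may imply far more than 2^31 entries
cs <- colSums(counts)  # evaluated block-wise by DelayedArray
```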

Providing this capability would represent a major step toward making Seurat robust for high-throughput, large-scale datasets — a necessity as the field advances.

## Feature Description

This feature would enable Seurat to handle matrices with more than 2^31 values, which is currently not possible due to internal limitations of R’s default matrix types (matrix, dgCMatrix). These types rely on 32-bit signed integers for indexing, which restricts the maximum number of elements in any matrix to approximately 2.1 billion.

We are requesting support for alternative matrix backends such as:

  • DelayedArray
  • HDF5Array
  • bigmemory
  • any custom sparse matrix format that allows 64-bit indexing and/or on-disk storage

Ideally, Seurat would be able to:

  • Accept these formats as input for CreateSeuratObject() and downstream functions.
  • Retain compatibility with core Seurat functions (e.g., NormalizeData, FindVariableFeatures, ScaleData) when using these backends.
  • Gracefully fall back to on-disk or chunked operations when memory limits are approached.

This feature is essential for users working with very large single-cell or spatial datasets — especially in atlas-scale projects or multi-sample integration — and would future-proof Seurat for continued growth in dataset size. A purely illustrative sketch of the requested workflow follows.
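Purely as an illustration of the requested behaviour (this is not current Seurat functionality; the file path and dataset name are hypothetical):

```r
library(Seurat)
library(HDF5Array)

# Hypothetical on-disk counts matrix with more than 2^31 values
counts <- HDF5Array("big_counts.h5", "matrix")

# Requested behaviour, not something Seurat supports today:
obj <- CreateSeuratObject(counts = counts)
obj <- NormalizeData(obj)          # would process blocks on disk
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)              # would spill to disk near memory limits
```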

Related issues: #8164 #7760 #7739
