feat: Efficient category count and partial loading for lazy AnnData #2283

@katosh

Summary

This issue requests an API for lazy AnnData columns that returns the category count without loading any values and supports loading only a subset of the categories.

Motivation

For lazy AnnData (from read_lazy()), accessing category information via col.dtype.categories works but always loads ALL category values. For datasets with many categories (e.g., cell IDs, barcodes), this is inefficient when you only need the count or a preview.

Current behavior:

from anndata.experimental import read_lazy

lazy_adata = read_lazy("dataset.zarr")  # 1M cells, 100k categories
col = lazy_adata.obs["cell_id"]

categories = col.dtype.categories  # ~25ms - loads all 100k values
n_categories = len(col.dtype.categories)  # Must load all just to count

Desired behavior:

# Efficient count-only (no data loading):
n_categories = get_category_count(col)  # ~0.1µs, metadata only

# Partial loading for previews:
first_100 = get_categories(col, n=100)  # Load only first 100

Benchmark Results

Tested with 1M cells and 100k categories (median of 3 runs):

| Method | Zarr | H5AD | Data loaded |
|---|---|---|---|
| col.values (full column) | 80 ms | 87 ms | All codes (1M) + categories |
| col.dtype.categories | 25 ms | 34 ms | Category group (values + mask) |
| read_elem(values) | 9 ms | 16 ms | Category values only |
| read_elem_partial(n=100) | 9.1 ms | 0.1 ms | First 100 categories |
| values.shape[0] (metadata) | 0.12 µs | 0.08 µs | Count only (no loading) |

Key findings:

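  • Reading the count from metadata (values.shape[0]) takes about 0.1 µs, compared with 9 to 34 ms for any approach that loads the category values
  • col.dtype.categories reads the whole category group (values + mask) and is roughly 2x to 3x slower than reading the values directly (25 ms vs 9 ms on Zarr, 34 ms vs 16 ms on H5AD)
  • Partial loading via read_elem_partial is very fast for H5AD (0.1 ms) but shows no benefit for Zarr (9.1 ms, the same as a full read; possibly not optimized yet?)

Benchmark code
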
import anndata as ad
from anndata.experimental import read_lazy
from anndata._io.specs.registry import read_elem, read_elem_partial
import numpy as np
import pandas as pd
import tempfile
import time
import os

# Create AnnData with 1M cells and 100k categories
n_obs = 1_000_000
n_cats = 100_000
adata = ad.AnnData(
    X=np.random.rand(n_obs, 10).astype(np.float32),
    obs=pd.DataFrame({
        'cell_id': pd.Categorical([f'Cell_{i % n_cats}' for i in range(n_obs)])
    })
)

with tempfile.TemporaryDirectory() as tmpdir:
    # For Zarr:
    path = os.path.join(tmpdir, 'test.zarr')
    adata.write_zarr(path)
    # For H5AD: change to 'test.h5ad' and adata.write_h5ad(path)

    lazy_adata = read_lazy(path)
    col = lazy_adata.obs['cell_id']

    # Navigate to storage
    cat_arr = col.variable._data.array
    cats_storage = cat_arr._categories
    values = cats_storage["values"] if hasattr(cats_storage, "keys") else cats_storage

    # Benchmark full column load
    start = time.perf_counter()
    _ = col.values
    print(f"col.values: {(time.perf_counter() - start)*1000:.1f}ms")

    # Note: col.dtype.categories uses @cached_property, so we need fresh read_lazy()
    lazy_adata = read_lazy(path)
    col = lazy_adata.obs['cell_id']
    start = time.perf_counter()
    _ = col.dtype.categories
    print(f"col.dtype.categories: {(time.perf_counter() - start)*1000:.1f}ms")

    start = time.perf_counter()
    _ = read_elem(values)
    print(f"read_elem(values): {(time.perf_counter() - start)*1000:.1f}ms")

    start = time.perf_counter()
    _ = read_elem_partial(values, indices=slice(0, 100))
    print(f"read_elem_partial(n=100): {(time.perf_counter() - start)*1000:.1f}ms")

    start = time.perf_counter()
    _ = values.shape[0]
    print(f"values.shape[0]: {(time.perf_counter() - start)*1e6:.2f}µs")

Test environment

  • Hardware: Apple M3 Max (arm64)
  • Python: 3.12.9
  • anndata: 0.13.0.dev
  • zarr: 3.1.5
  • h5py: 3.13.0
  • numpy: 2.4.0
  • pandas: 3.0.0rc1
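
The table reports the median of 3 runs, while the benchmark code above times each operation once. A minimal sketch of a repeat-and-median helper follows; the name `median_time_ms` and its `setup` parameter are hypothetical and not part of anndata. It rebuilds state before every repeat so that cached results, such as the `@cached_property` behind `col.dtype.categories`, do not carry over between measurements.

```python
import statistics
import time
from typing import Callable


def median_time_ms(
    fn: Callable[[object], object],
    setup: Callable[[], object],
    repeats: int = 3,
) -> float:
    """Return the median wall-clock time of fn(setup()) in milliseconds.

    setup() runs outside the timed region before every repeat, so cached
    results from a previous run cannot leak into the next measurement.
    """
    samples = []
    for _ in range(repeats):
        state = setup()  # untimed: rebuild fresh state for this repeat
        start = time.perf_counter()
        fn(state)  # timed: the operation under test
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)
```

With the benchmark's `path`, a call might look like `median_time_ms(lambda a: a.obs["cell_id"].dtype.categories, setup=lambda: read_lazy(path))`.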

Current Workaround (in PR #2236)

The HTML representation module implements a workaround for efficient category display by navigating to the underlying storage:

from anndata.experimental.backed._lazy_arrays import CategoricalArray
from anndata._io.specs.registry import read_elem_partial

# Navigate xarray internals to find CategoricalArray
cat_arr = col.variable._data.array  # Fragile internal path

# Access storage directly (bypasses cached_property)
cats = cat_arr._categories  # zarr.Group or h5py.Group
values = cats["values"] if hasattr(cats, "keys") else cats

# Get count from metadata (no loading)
n_categories = values.shape[0]

# Partial loading (n is the preview size)
n = 100
first_n = read_elem_partial(values, indices=slice(0, n))

Why this workaround is fragile:

  • Navigates xarray internals (col.variable._data.array) which may change
  • Relies on private _categories attribute
  • Assumes storage structure (cats["values"] pattern)

Proposed Solution

Option A: Methods on CategoricalArray

class CategoricalArray:
    @property
    def n_categories(self) -> int:
        """Number of categories (metadata only, no loading)."""
        values = self._categories["values"] if hasattr(self._categories, "keys") else self._categories
        return values.shape[0]

    def get_categories(self, n: int | None = None) -> np.ndarray:
        """Get category values, optionally limited to first N."""
        values = self._categories["values"] if hasattr(self._categories, "keys") else self._categories
        if n is not None:
            return read_elem_partial(values, indices=slice(0, n))
        return read_elem(values)

Option B: Utility function

def get_lazy_column_categories(
    col: DataArray,
    n: int | None = None,
) -> tuple[np.ndarray, int]:
    """Efficiently get categories from a lazy column."""

Option C: Integration with Dataset2D

class Dataset2D:
    def get_column_categories(self, column: str, n: int | None = None) -> np.ndarray | None:
        """Get categories efficiently, with optional partial loading."""

    def get_column_category_count(self, column: str) -> int | None:
        """Get category count from metadata (no data loading)."""

Use Cases

  1. HTML representation - Display the category count ("100,000 categories") and preview the first N values without loading all of them (see PR #2236, "feat: Add HTML representation", lazy loading in visual demo 8b)
  2. Data inspection - Quick count and preview of categories in large datasets
  3. Validation - Check category count matches color arrays in uns
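
Use cases 1 and 3 only need a count and a short preview as inputs. As a rough illustration, the two helpers below consume those values; both function names are hypothetical, and the inputs are assumed to come from whichever API option is adopted.

```python
def format_category_preview(preview: list[str], total: int) -> str:
    """Build a label such as '100,000 categories: a, b, ...'."""
    label = f"{total:,} {'category' if total == 1 else 'categories'}"
    if not preview:
        return label
    shown = ", ".join(str(c) for c in preview)
    # Mark the list as truncated only when categories were left out.
    suffix = ", ..." if total > len(preview) else ""
    return f"{label}: {shown}{suffix}"


def colors_match_categories(n_categories: int, colors) -> bool:
    """True if there is exactly one color per category."""
    return len(colors) == n_categories
```

Neither helper touches storage, so the cost of both use cases is determined entirely by how the count and the preview are obtained.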
