feat: Efficient category count and partial loading for lazy AnnData

## Summary

Add an API for lazy AnnData columns that returns the category count without loading the category values, and that can load only the first N categories.

## Motivation

For lazy AnnData (from `read_lazy()`), accessing category information via `col.dtype.categories` works but always loads ALL category values. For datasets with many categories (e.g., cell IDs, barcodes), this is inefficient when you only need the count or a preview.

**Current behavior:**
```python
from anndata.experimental import read_lazy

lazy_adata = read_lazy("dataset.zarr")  # 1M cells, 100k categories
col = lazy_adata.obs["cell_id"]

categories = col.dtype.categories  # ~25ms - loads all 100k values
n_categories = len(col.dtype.categories)  # Must load all just to count
```

**Desired behavior:**
```python
# Efficient count-only (no data loading):
n_categories = get_category_count(col)  # ~0.1µs, metadata only

# Partial loading for previews:
first_100 = get_categories(col, n=100)  # Load only first 100
```

## Benchmark Results

Tested with **1M cells and 100k categories** (median of 3 runs):

| Method | Zarr | H5AD | Data loaded |
|--------|------|------|-------------|
| `col.values` (full column) | 80 ms | 87 ms | All codes (1M) + categories |
| `col.dtype.categories` | 25 ms | 34 ms | Category group (values + mask) |
| `read_elem(values)` | 9 ms | 16 ms | Category values only |
| `read_elem_partial(n=100)` | 9.1 ms | 0.1 ms | First 100 categories |
| `values.shape[0]` (metadata) | 0.12 µs | 0.08 µs | Count only (no loading) |

**Key findings:**
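- Reading the count from metadata (`values.shape[0]`) takes well under a microsecond, versus 9-16 ms to load the category values: roughly five orders of magnitude faster
- `col.dtype.categories` reads the whole category group (values + mask), which is about 2-3x slower than reading the values directly (25 ms vs 9 ms for Zarr, 34 ms vs 16 ms for H5AD)
- Partial loading via `read_elem_partial` is very fast for H5AD (0.1 ms) but shows no benefit for Zarr (9.1 ms vs 9 ms for the full read; possibly not optimized yet?)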
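
The ratios behind these findings can be recomputed from the benchmark table. A small sketch, with the median timings copied by hand from the table; the dictionary layout and the `ratios` helper are only for illustration:

```python
# Median timings copied from the benchmark table, converted to seconds.
timings = {
    "zarr": {"dtype_categories": 25e-3, "read_values": 9e-3, "shape": 0.12e-6},
    "h5ad": {"dtype_categories": 34e-3, "read_values": 16e-3, "shape": 0.08e-6},
}


def ratios(t):
    """Return (group read vs values-only read, values-only read vs shape lookup)."""
    return (
        t["dtype_categories"] / t["read_values"],
        t["read_values"] / t["shape"],
    )


for backend, t in timings.items():
    group_overhead, count_speedup = ratios(t)
    print(
        f"{backend}: col.dtype.categories is {group_overhead:.1f}x slower than "
        f"read_elem(values); shape lookup is ~{count_speedup:,.0f}x faster"
    )
```

These are single-machine medians of three runs, so the ratios indicate scale rather than exact speedups.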

<details>
<summary>Benchmark code</summary>

```python
import anndata as ad
from anndata.experimental import read_lazy
from anndata._io.specs.registry import read_elem, read_elem_partial
import numpy as np
import pandas as pd
import tempfile
import time
import os

# Create AnnData with 1M cells and 100k categories
n_obs = 1_000_000
n_cats = 100_000
adata = ad.AnnData(
    X=np.random.rand(n_obs, 10).astype(np.float32),
    obs=pd.DataFrame({
        'cell_id': pd.Categorical([f'Cell_{i % n_cats}' for i in range(n_obs)])
    })
)

with tempfile.TemporaryDirectory() as tmpdir:
    # For Zarr:
    path = os.path.join(tmpdir, 'test.zarr')
    adata.write_zarr(path)
    # For H5AD: change to 'test.h5ad' and adata.write_h5ad(path)

    lazy_adata = read_lazy(path)
    col = lazy_adata.obs['cell_id']

    # Navigate to storage
    cat_arr = col.variable._data.array
    cats_storage = cat_arr._categories
    values = cats_storage["values"] if hasattr(cats_storage, "keys") else cats_storage

    # Benchmark full column load
    start = time.perf_counter()
    _ = col.values
    print(f"col.values: {(time.perf_counter() - start)*1000:.1f}ms")

    # Note: col.dtype.categories uses @cached_property, so we need fresh read_lazy()
    lazy_adata = read_lazy(path)
    col = lazy_adata.obs['cell_id']
    start = time.perf_counter()
    _ = col.dtype.categories
    print(f"col.dtype.categories: {(time.perf_counter() - start)*1000:.1f}ms")

    start = time.perf_counter()
    _ = read_elem(values)
    print(f"read_elem(values): {(time.perf_counter() - start)*1000:.1f}ms")

    start = time.perf_counter()
    _ = read_elem_partial(values, indices=slice(0, 100))
    print(f"read_elem_partial(n=100): {(time.perf_counter() - start)*1000:.1f}ms")

    start = time.perf_counter()
    _ = values.shape[0]
    print(f"values.shape[0]: {(time.perf_counter() - start)*1e6:.2f}µs")
```

</details>

<details>
<summary>Test environment</summary>

- **Hardware:** Apple M3 Max (arm64)
- **Python:** 3.12.9
- **anndata:** 0.13.0.dev
- **zarr:** 3.1.5
- **h5py:** 3.13.0
- **numpy:** 2.4.0
- **pandas:** 3.0.0rc1

</details>

## Current Workaround (in PR #2236)

The HTML representation module implements a workaround for efficient category display by navigating to the underlying storage:

```python
from anndata.experimental.backed._lazy_arrays import CategoricalArray
from anndata._io.specs.registry import read_elem_partial

# Navigate xarray internals to find CategoricalArray
cat_arr = col.variable._data.array  # Fragile internal path

# Access storage directly (bypasses cached_property)
cats = cat_arr._categories  # zarr.Group or h5py.Group
values = cats["values"] if hasattr(cats, "keys") else cats

# Get count from metadata (no loading)
n_categories = values.shape[0]

# Partial loading of the first n categories
n = 100
first_n = read_elem_partial(values, indices=slice(0, n))
```

**Why this workaround is fragile:**
- Navigates xarray internals (`col.variable._data.array`) which may change
- Relies on private `_categories` attribute
- Assumes storage structure (`cats["values"]` pattern)

## Proposed Solution

### Option A: Methods on CategoricalArray

```python
import numpy as np

from anndata._io.specs.registry import read_elem, read_elem_partial


class CategoricalArray:
    @property
    def n_categories(self) -> int:
        """Number of categories (metadata only, no loading)."""
        values = self._categories["values"] if hasattr(self._categories, "keys") else self._categories
        return values.shape[0]

    def get_categories(self, n: int | None = None) -> np.ndarray:
        """Get category values, optionally limited to first N."""
        values = self._categories["values"] if hasattr(self._categories, "keys") else self._categories
        if n is not None:
            return read_elem_partial(values, indices=slice(0, n))
        return read_elem(values)
```

### Option B: Utility function

```python
def get_lazy_column_categories(
    col: DataArray,
    n: int | None = None,
) -> tuple[np.ndarray, int]:
    """Efficiently get categories from a lazy column."""
```
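
A minimal sketch of how Option B could behave, written against a duck-typed stand-in rather than real anndata storage. `SliceableArray`, `FakeValues`, and `preview_categories` are hypothetical names used only for illustration. The one assumption carried over from the workaround is that the backing array exposes `.shape` and supports slicing, as zarr and h5py arrays do.

```python
from __future__ import annotations

from typing import Any, Protocol


class SliceableArray(Protocol):
    """Minimal interface assumed of the backing store (hypothetical)."""

    shape: tuple[int, ...]

    def __getitem__(self, key: slice) -> Any: ...


class FakeValues:
    """In-memory stand-in for a zarr/h5py array that counts elements read."""

    def __init__(self, items: list[str]) -> None:
        self._items = items
        self.shape = (len(items),)
        self.elements_read = 0

    def __getitem__(self, key: slice) -> list[str]:
        chunk = self._items[key]
        self.elements_read += len(chunk)
        return chunk


def preview_categories(
    values: SliceableArray, n: int | None = None
) -> tuple[list, int]:
    """Return (first n categories, total count); the count uses shape only."""
    total = values.shape[0]
    stop = total if n is None else max(0, min(n, total))
    return list(values[slice(0, stop)]), total


values = FakeValues([f"Cell_{i}" for i in range(100_000)])
preview, total = preview_categories(values, n=3)
print(preview, total, values.elements_read)
# prints: ['Cell_0', 'Cell_1', 'Cell_2'] 100000 3
```

Returning the total alongside the preview lets a caller render something like "3 of 100,000 shown" from a single call, which is the shape of the `tuple[np.ndarray, int]` return type proposed above.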

### Option C: Integration with Dataset2D

```python
class Dataset2D:
    def get_column_categories(self, column: str, n: int | None = None) -> np.ndarray | None:
        """Get categories efficiently, with optional partial loading."""

    def get_column_category_count(self, column: str) -> int | None:
        """Get category count from metadata (no data loading)."""
```

## Use Cases

1. **HTML representation** - Display the category count ("100,000 categories") and preview the first N values without loading all of them (PR #2236; see the lazy-loading section of [visual demo 8b](https://htmlpreview.github.io/?https://gist.githubusercontent.com/katosh/4a2399d1472c733b041ef8dfd5b489b9/raw/repr_html_visual_test.html#8b-lazy-anndata-experimental))
2. **Data inspection** - Quick count and preview of categories in large datasets
3. **Validation** - Check category count matches color arrays in `uns`
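
The validation use case could look roughly like the sketch below. It assumes the scanpy-style convention of storing a palette under `uns["<column>_colors"]`; `check_color_count` is a hypothetical helper, and `n_categories` stands for whatever the metadata-only count returns.

```python
from __future__ import annotations


def check_color_count(uns: dict, column: str, n_categories: int) -> str | None:
    """Return a message if the stored palette length differs from the category count.

    Returns None when no palette is stored or when the lengths agree.
    """
    colors = uns.get(f"{column}_colors")  # scanpy-style key; an assumption here
    if colors is None:
        return None
    if len(colors) != n_categories:
        return (
            f"{column}: {len(colors)} colors stored for "
            f"{n_categories:,} categories"
        )
    return None


# n_categories would come from the metadata-only count, not from loaded values.
uns = {"cell_type_colors": ["#1f77b4", "#ff7f0e", "#2ca02c"]}
print(check_color_count(uns, "cell_type", 3))
print(check_color_count(uns, "cell_type", 100_000))
```

Because only the count is needed, this check never has to touch the category values themselves.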

## References

- [PR #2236](https://github.com/scverse/anndata/pull/2236): HTML representation with lazy AnnData support
- [`CategoricalArray`](https://github.com/scverse/anndata/blob/a892252b/src/anndata/experimental/backed/_lazy_arrays.py#L80-L132)
- [Current workaround `_get_lazy_categories`](https://github.com/scverse/anndata/blob/aef9083f/src/anndata/_repr/utils.py#L261-L323)

