# feat: Chunking for categorical `categories` string arrays only

## Motivation

[PR #2288](https://github.com/scverse/anndata/pull/2288) introduced `LazyCategoricalDtype` with `head_categories()` / `tail_categories()` methods for efficient partial category inspection without loading the entire array. However, zarr's default auto-chunking stores categories as a single chunk, so even a 10-element read has to fetch and decode the whole array. The Zarr timings below are for 100,000 categories on local SSD:

| Method | H5AD | Zarr (auto) | Zarr (10k-element chunks) |
|--------|------|-------------|-------------------|
| `head_categories(10)` | 0.19ms | 14.2ms | **1.8ms** |
| Full load | 30ms | 16.0ms | **14.9ms** |

With 10k-element chunks, partial zarr reads are about **8x** faster (14.2ms to 1.8ms), and full reads are slightly faster as well (16.0ms to 14.9ms).

## Context

After investigating [#2295](https://github.com/scverse/anndata/issues/2295) (default chunking for 1D arrays in obs/var), I ran benchmarks on local SSD, NFS, and S3 storage. The results showed that **general 1D array chunking has a full-read penalty on S3** (1.4x slower), making it a net negative for typical access patterns.

However, **categorical `categories` string arrays** are an important exception. With chunking they show a **win-win** at large category counts: at 100k categories and above, partial reads and full reads are both faster than auto on local SSD and on S3.

This proposes a more targeted change: apply default chunking only to categorical `categories` arrays, enabling the full potential of `head_categories()` / `tail_categories()` from [PR #2288](https://github.com/scverse/anndata/pull/2288).

## Why string arrays behave differently

General obs/var arrays (numeric codes, float columns) have S3 chunking penalties because:
- Request overhead dominates for small numeric data
- Numeric decompression is fast, so parallelism doesn't help much

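Categorical `categories` (string arrays) benefit from chunking because:
- Variable-length string decompression is slower than numeric decompression and benefits from parallelism
- Categories are typically read sequentially from the start or end (`head_categories()` / `tail_categories()`), so a partial read touches only one chunk
- Smaller chunks decompress faster than one large string blob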
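The chunk arithmetic behind these points can be sketched as follows. This is an illustrative helper, not anndata or zarr code: `chunks_touched` is a hypothetical name, and "auto" is modeled as one chunk holding all categories, as described in the Motivation section.

```python
def chunks_touched(start: int, stop: int, chunk_size: int) -> int:
    """Count the chunks overlapping the half-open element range [start, stop)."""
    if stop <= start:
        return 0
    first = start // chunk_size       # chunk index holding the first element
    last = (stop - 1) // chunk_size   # chunk index holding the last element
    return last - first + 1


n_categories = 100_000
# The last entry models "auto" as a single chunk (assumption, see lead-in).
for chunk_size in (100, 1_000, 10_000, n_categories):
    head = chunks_touched(0, 10, chunk_size)
    full = chunks_touched(0, n_categories, chunk_size)
    print(
        f"chunk size {chunk_size:>7,}: "
        f"head(10) reads {head} chunk(s), full read needs {full:,} chunk(s)"
    )
```

A `head(10)` read always touches one chunk, but that chunk holds 100 to 10,000 strings instead of all 100,000, while the number of chunk requests for a full read grows as the chunk size shrinks.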

## Benchmark: String Arrays Only

**System**: Darwin 24.6.0 (arm64), Apple M3 Max, 133 Mbps down / 101 Mbps up, S3 us-west-2

### Local SSD Results

| Categories | Read | Chunk size 100 | Chunk size 1k | Chunk size 10k | auto |
|------------|-----------|------------|-----------|------------|------|
| 10 | head(10) | 0.6ms | 0.5ms | 0.4ms | 0.4ms |
| 10 | full | 0.4ms | 0.4ms | 0.4ms | 0.4ms |
| | | | | | |
| 100 | head(10) | 0.4ms | 0.5ms | 0.5ms | 0.4ms |
| 100 | full | 0.4ms | 0.4ms | 0.4ms | 0.4ms |
| | | | | | |
| 1,000 | head(10) | 0.5ms | 0.6ms | 1.0ms | 0.5ms |
| 1,000 | full | 1.8ms | 0.5ms | 1.0ms | 0.5ms |
| | | | | | |
| 10,000 | head(10) | 0.5ms | 0.6ms | 1.7ms | 1.8ms |
| 10,000 | full | 16.9ms | 3.1ms | 1.8ms | 1.9ms |
| | | | | | |
| 50,000 | head(10) | 0.5ms | 0.7ms | 1.8ms | 7.3ms |
| 50,000 | full | 90.2ms | 15.7ms | 7.5ms | 8.3ms |
| | | | | | |
| **100,000** | **head(10)** | 0.5ms | 0.6ms | **1.8ms** | 14.2ms |
| **100,000** | **full** | 179.9ms | 30.6ms | **14.9ms** | 16.0ms |
| | | | | | |
| 500,000 | head(10) | 0.6ms | 0.6ms | 1.8ms | 36.7ms |
| 500,000 | full | 881.8ms | 146.8ms | 72.2ms | 76.7ms |

### S3 Results

| Categories | Read | Chunk size 100 | Chunk size 1k | Chunk size 10k | auto |
|------------|-----------|------------|-----------|------------|------|
| 10 | head(10) | 57ms | 55ms | 53ms | 58ms |
| 10 | full | 56ms | 53ms | 50ms | 53ms |
| | | | | | |
| 100 | head(10) | 54ms | 52ms | 56ms | 61ms |
| 100 | full | 58ms | 52ms | 54ms | 56ms |
| | | | | | |
| 1,000 | head(10) | 59ms | 61ms | 58ms | 61ms |
| 1,000 | full | 74ms | 58ms | 56ms | 65ms |
| | | | | | |
| 10,000 | head(10) | 54ms | 62ms | 119ms | 96ms |
| 10,000 | full | 644ms | 130ms | 66ms | 66ms |
| | | | | | |
| 50,000 | head(10) | 64ms | 60ms | 106ms | 169ms |
| 50,000 | full | 3,237ms | 387ms | 251ms | 231ms |
| | | | | | |
| **100,000** | **head(10)** | 62ms | 65ms | **126ms** | 422ms |
| **100,000** | **full** | 6,391ms | 772ms | **266ms** | 378ms |
| | | | | | |
| 500,000 | head(10) | 61ms | 66ms | 124ms | 1,568ms |
| 500,000 | full | 33,023ms | 3,491ms | 875ms | 1,714ms |

## Analysis: 10k chunks are the sweet spot

### Comparison at 100k categories

| Chunk Size | Local head(10) | Local full | S3 head(10) | S3 full |
|------------|----------------|------------|-------------|---------|
| auto | 14.2ms | 16.0ms | 422ms | 378ms |
| **10,000** | **1.8ms (8x)** | **14.9ms (7% faster)** | **126ms (3.4x)** | **266ms (30% faster)** |
| 1,000 | 0.6ms (24x) | 30.6ms (1.9x slower) | 65ms (6.5x) | 772ms (2x slower) |
| 100 | 0.5ms (28x) | 179.9ms (11x slower) | 62ms (6.8x) | 6,391ms (17x slower) |
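For reference, the factors in parentheses can be recomputed from the raw timings. The sketch below copies the 100k-category numbers from the table above; the helper names `speedup` and `time_reduction_pct` are illustrative only.

```python
# Timings in milliseconds at 100k categories, copied from the comparison table.
timings = {
    "auto":   {"local_head": 14.2, "local_full": 16.0,  "s3_head": 422, "s3_full": 378},
    "10,000": {"local_head": 1.8,  "local_full": 14.9,  "s3_head": 126, "s3_full": 266},
    "1,000":  {"local_head": 0.6,  "local_full": 30.6,  "s3_head": 65,  "s3_full": 772},
    "100":    {"local_head": 0.5,  "local_full": 179.9, "s3_head": 62,  "s3_full": 6391},
}


def speedup(baseline_ms: float, value_ms: float) -> float:
    """How many times faster value_ms is than baseline_ms (below 1 means slower)."""
    return baseline_ms / value_ms


def time_reduction_pct(baseline_ms: float, value_ms: float) -> float:
    """Percentage of the baseline time saved (negative means slower)."""
    return (baseline_ms - value_ms) / baseline_ms * 100


auto = timings["auto"]
for chunk_size, row in timings.items():
    if chunk_size == "auto":
        continue
    factors = ", ".join(
        f"{key}: {speedup(auto[key], row[key]):.2f}x" for key in row
    )
    print(f"chunk size {chunk_size}: {factors}")

print(f"10k S3 full read: {time_reduction_pct(auto['s3_full'], 266):.1f}% less time")
```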

### Key findings

1. **10k chunks**: Best balance (at 100k categories, relative to auto)
   - Partial reads: 3-8x faster
   - Full reads: 7-30% **faster** (not slower!)

2. **1k chunks**: Too aggressive
   - Partial reads: 2-3x faster than 10k chunks (0.6ms vs 1.8ms local, 65ms vs 126ms on S3)
   - Full reads: 1.9x slower than auto locally and 2x slower on S3 (too many requests)

3. **100 chunks**: Catastrophic for full reads
   - 17x slower on S3, since 100k categories are split into 1,000 separately requested chunks

## Proposed change

Apply 10k chunking specifically to categorical `categories` arrays in `write_categorical`:

https://github.com/scverse/anndata/blob/c6f6f54ca10775cb684a928b8fe34ba0b6843834/src/anndata/_io/specs/methods.py#L1107-L1112

Proposed change:

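```python
categories = v.categories.to_numpy()
# Default: pass the caller's dataset_kwargs through unchanged.
cat_kwargs = dataset_kwargs
# Chunk only when there is more than one chunk's worth of categories
# and the caller has not chosen a chunk shape explicitly.
if len(categories) > 10_000 and "chunks" not in dataset_kwargs:
    cat_kwargs = dict(dataset_kwargs, chunks=(10_000,))
_writer.write_elem(g, "categories", categories, dataset_kwargs=cat_kwargs)
```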
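To make the intended behavior easy to check in isolation, the condition in the snippet can be factored into a small pure function. This is only a sketch: `categories_dataset_kwargs` and `CATEGORIES_CHUNK_SIZE` are hypothetical names that do not exist in anndata, and the real change would live inside `write_categorical`.

```python
from collections.abc import Mapping
from types import MappingProxyType
from typing import Any

CATEGORIES_CHUNK_SIZE = 10_000  # hypothetical constant, value taken from the proposal


def categories_dataset_kwargs(
    n_categories: int, dataset_kwargs: Mapping[str, Any]
) -> Mapping[str, Any]:
    """Return the kwargs to use when writing a `categories` array.

    Adds a default chunk shape only if the array spans more than one chunk
    and the caller did not pass `chunks` explicitly.
    """
    if n_categories > CATEGORIES_CHUNK_SIZE and "chunks" not in dataset_kwargs:
        # dict(mapping, **extra) copies, so the caller's mapping is not mutated.
        return dict(dataset_kwargs, chunks=(CATEGORIES_CHUNK_SIZE,))
    return dataset_kwargs


user_kwargs = MappingProxyType({"compressor": None})
print(categories_dataset_kwargs(500, user_kwargs))       # small array: unchanged
print(categories_dataset_kwargs(100_000, user_kwargs))   # large array: chunks added
print(categories_dataset_kwargs(100_000, {"chunks": (1_000,)}))  # user choice kept
```

Small category arrays and explicit user settings pass through untouched, so the default only affects writes that would otherwise fall back to auto-chunking.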

## Impact

This targeted change:
1. **Improves** `head_categories()` / `tail_categories()` from [PR #2288](https://github.com/scverse/anndata/pull/2288): 3-8x faster at 100k categories
2. **Improves** full category loads: 7-30% faster at 100k categories
3. **Does not affect** numeric obs/var arrays (which would have S3 penalties)
4. **Backward compatible**: zarr reads any chunk layout transparently

## Conclusion

While the original proposal in #2295 (chunking all 1D obs/var arrays) has unfavorable S3 tradeoffs, **categorical `categories` string arrays are an exception** where chunking provides a win-win. I propose implementing chunking for this specific case only.

**Benchmark code**: [benchmark_string_chunks.py](https://gist.github.com/katosh/f2d34cac307f4ae72c268c551954168c)

## Related

- [#2295](https://github.com/scverse/anndata/issues/2295): Original chunking proposal (closing due to S3 penalties)
- [PR #2288](https://github.com/scverse/anndata/pull/2288): `LazyCategoricalDtype` with `head_categories()`
- [zarr-python#270](https://github.com/zarr-developers/zarr-python/issues/270): Chunk size configuration discussion

