feat: Chunking for categorical categories string arrays only
Motivation
PR #2288 introduced LazyCategoricalDtype with head_categories() / tail_categories() methods for efficient partial category inspection without loading the entire array. However, zarr's default auto-chunking stores categories as single chunks, severely limiting the benefit:
| Method | H5AD | Zarr (auto) | Zarr (10k chunks) |
|---|---|---|---|
| head_categories(10) | 0.19ms | 14.2ms | 1.8ms |
| Full load | 30ms | 16.0ms | 14.9ms |
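In this comparison (100,000 categories on local SSD), 10k chunks cut head_categories(10) on zarr from 14.2ms to 1.8ms (about 8x faster), while full loads stay just as fast (14.9ms vs 16.0ms).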
Context
After investigating #2295 (default chunking for 1D arrays in obs/var), I ran benchmarks on local SSD, NFS, and S3 storage. The results showed that general 1D array chunking has a full-read penalty on S3 (1.4x slower), making it a net negative for typical access patterns.
However, there is an important exception: categorical categories string arrays show a win-win with chunking, with faster partial reads and faster full reads across all storage backends.
This issue proposes a more targeted change: apply default chunking only to categorical categories arrays, enabling the full potential of head_categories() / tail_categories() from PR #2288.
Why string arrays behave differently
General obs/var arrays (numeric codes, float columns) have S3 chunking penalties because:
- Request overhead dominates for small numeric data
- Numeric decompression is fast, so parallelism doesn't help much
Categorical categories (string arrays) benefit from chunking because:
- Variable-length string decompression is slower and benefits from parallelism
- Categories are accessed sequentially (head or tail), so a partial read only needs the first or last chunk
- Smaller chunks decompress faster than one large string blob
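The effect of chunking on partial reads can be illustrated without zarr. The sketch below is a toy model only: it assumes JSON-encoded strings compressed with zlib from the standard library, which is not zarr's actual string codec or storage layout, and `encode_chunks` and `head` are hypothetical helper names. It shows that reading the first few categories only requires decompressing the chunk that holds them.

```python
import json
import zlib


def encode_chunks(categories, chunk_size):
    """Compress a list of strings as independent chunks of `chunk_size` items."""
    return [
        zlib.compress(json.dumps(categories[i : i + chunk_size]).encode("utf-8"))
        for i in range(0, len(categories), chunk_size)
    ]


def head(chunks, chunk_size, n):
    """Return the first `n` items and the number of chunks decompressed."""
    needed = -(-n // chunk_size)  # ceiling division
    out = []
    for blob in chunks[:needed]:
        out.extend(json.loads(zlib.decompress(blob)))
    return out[:n], needed


categories = [f"cell_type_{i}" for i in range(100_000)]

# One blob holding everything (stand-in for a single-chunk layout).
single = encode_chunks(categories, len(categories))
# Ten independent blobs of 10,000 strings each.
chunked = encode_chunks(categories, 10_000)

head_single, touched_single = head(single, len(categories), 10)
head_chunked, touched_chunked = head(chunked, 10_000, 10)
```

Both layouts touch exactly one chunk for `head(10)`, but in the single-chunk layout that chunk contains all 100,000 strings, so the whole array is decompressed to return ten values.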
Benchmark: String Arrays Only
System: Darwin 24.6.0 (arm64), Apple M3 Max, 133 Mbps down / 101 Mbps up, S3 us-west-2
Local SSD Results
| Categories | Read Size | 100 chunks | 1k chunks | 10k chunks | auto |
|---|---|---|---|---|---|
| 10 | head(10) | 0.6ms | 0.5ms | 0.4ms | 0.4ms |
| 10 | full | 0.4ms | 0.4ms | 0.4ms | 0.4ms |
| 100 | head(10) | 0.4ms | 0.5ms | 0.5ms | 0.4ms |
| 100 | full | 0.4ms | 0.4ms | 0.4ms | 0.4ms |
| 1,000 | head(10) | 0.5ms | 0.6ms | 1.0ms | 0.5ms |
| 1,000 | full | 1.8ms | 0.5ms | 1.0ms | 0.5ms |
| 10,000 | head(10) | 0.5ms | 0.6ms | 1.7ms | 1.8ms |
| 10,000 | full | 16.9ms | 3.1ms | 1.8ms | 1.9ms |
| 50,000 | head(10) | 0.5ms | 0.7ms | 1.8ms | 7.3ms |
| 50,000 | full | 90.2ms | 15.7ms | 7.5ms | 8.3ms |
| 100,000 | head(10) | 0.5ms | 0.6ms | 1.8ms | 14.2ms |
| 100,000 | full | 179.9ms | 30.6ms | 14.9ms | 16.0ms |
| 500,000 | head(10) | 0.6ms | 0.6ms | 1.8ms | 36.7ms |
| 500,000 | full | 881.8ms | 146.8ms | 72.2ms | 76.7ms |
S3 Results
| Categories | Read Size | 100 chunks | 1k chunks | 10k chunks | auto |
|---|---|---|---|---|---|
| 10 | head(10) | 57ms | 55ms | 53ms | 58ms |
| 10 | full | 56ms | 53ms | 50ms | 53ms |
| 100 | head(10) | 54ms | 52ms | 56ms | 61ms |
| 100 | full | 58ms | 52ms | 54ms | 56ms |
| 1,000 | head(10) | 59ms | 61ms | 58ms | 61ms |
| 1,000 | full | 74ms | 58ms | 56ms | 65ms |
| 10,000 | head(10) | 54ms | 62ms | 119ms | 96ms |
| 10,000 | full | 644ms | 130ms | 66ms | 66ms |
| 50,000 | head(10) | 64ms | 60ms | 106ms | 169ms |
| 50,000 | full | 3,237ms | 387ms | 251ms | 231ms |
| 100,000 | head(10) | 62ms | 65ms | 126ms | 422ms |
| 100,000 | full | 6,391ms | 772ms | 266ms | 378ms |
| 500,000 | head(10) | 61ms | 66ms | 124ms | 1,568ms |
| 500,000 | full | 33,023ms | 3,491ms | 875ms | 1,714ms |
Analysis: 10k chunks are the sweet spot
Comparison at 100k categories
| Chunk Size | Local head(10) | Local full | S3 head(10) | S3 full |
|---|---|---|---|---|
| auto | 14.2ms | 16.0ms | 422ms | 378ms |
| 10,000 | 1.8ms (8x) | 14.9ms (7% faster) | 126ms (3.4x) | 266ms (30% faster) |
| 1,000 | 0.6ms (24x) | 30.6ms (1.9x slower) | 65ms (6.5x) | 772ms (2x slower) |
| 100 | 0.5ms (28x) | 179.9ms (11x slower) | 62ms (6.8x) | 6,391ms (17x slower) |
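The ratios in parentheses can be recomputed from the raw timings. The snippet below is a small sketch of that arithmetic; the timings are copied from the comparison table above, while the dictionary layout and the `speedup` helper are illustrative names, not part of the benchmark script.

```python
# Timings in milliseconds at 100k categories, copied from the comparison table.
TIMINGS_MS = {
    "auto": {"local_head": 14.2, "local_full": 16.0, "s3_head": 422.0, "s3_full": 378.0},
    "10,000": {"local_head": 1.8, "local_full": 14.9, "s3_head": 126.0, "s3_full": 266.0},
    "1,000": {"local_head": 0.6, "local_full": 30.6, "s3_head": 65.0, "s3_full": 772.0},
    "100": {"local_head": 0.5, "local_full": 179.9, "s3_head": 62.0, "s3_full": 6391.0},
}


def speedup(chunk_size: str, column: str) -> float:
    """Return auto time divided by chunked time; values above 1 mean chunking is faster."""
    return TIMINGS_MS["auto"][column] / TIMINGS_MS[chunk_size][column]


for chunk_size in ("10,000", "1,000", "100"):
    ratios = {col: round(speedup(chunk_size, col), 2) for col in TIMINGS_MS["auto"]}
    print(chunk_size, ratios)
```

A ratio below 1 in a full-read column marks a regression relative to auto-chunking; only the 10,000 row stays above 1 in all four columns.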
Key findings
- 10k chunks: Best balance
  - Partial reads: 3-8x faster
  - Full reads: 7-30% faster (not slower!)
- 1k chunks: Too aggressive
  - Partial reads: Marginally better than 10k
  - Full reads: 2x slower on S3 (too many requests)
- 100 chunks: Catastrophic for full reads
  - 17x slower on S3 due to thousands of requests
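The request-count argument behind these findings can be made concrete with a little arithmetic. This is a minimal sketch assuming one storage request per chunk (no sharding, no request coalescing, concurrency not modelled); `chunks_touched` is a hypothetical helper, not an anndata or zarr function.

```python
import math


def chunks_touched(n_categories: int, chunk_size: int, n_read: int) -> int:
    """Number of chunks fetched to read the first `n_read` of `n_categories` items."""
    n_read = min(n_read, n_categories)
    return math.ceil(n_read / chunk_size)


n = 100_000
for chunk_size in (100, 1_000, 10_000):
    head_chunks = chunks_touched(n, chunk_size, 10)
    full_chunks = chunks_touched(n, chunk_size, n)
    print(f"chunk_size={chunk_size}: head(10) -> {head_chunks}, full -> {full_chunks}")
```

Under this assumption a head(10) read always touches one chunk, while the number of chunks for a full read grows as the chunk size shrinks, which is consistent with the full-read slowdowns measured on S3.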
Proposed change
Apply 10k chunking specifically to categorical categories arrays in write_categorical. The current code (anndata/src/anndata/_io/specs/methods.py, lines 1107 to 1112 in c6f6f54) is:

    _writer.write_elem(
        g,
        "categories",
        v.categories.to_numpy(),
        dataset_kwargs=dataset_kwargs,
    )
Proposed change:

    categories = v.categories.to_numpy()
    cat_kwargs = dataset_kwargs
    # Only chunk large category arrays, and never override a caller-supplied chunk size.
    if len(categories) > 10_000 and "chunks" not in dataset_kwargs:
        cat_kwargs = dict(dataset_kwargs, chunks=(10_000,))
    _writer.write_elem(g, "categories", categories, dataset_kwargs=cat_kwargs)

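To make the intended behaviour easy to check in isolation, the keyword-argument logic of the proposed change can be pulled out into a pure function. This is a sketch only: `categories_dataset_kwargs` and `CATEGORY_CHUNK_SIZE` are hypothetical names and do not exist in anndata.

```python
from collections.abc import Mapping
from typing import Any

# Hypothetical constant; 10_000 is the chunk size proposed in this issue.
CATEGORY_CHUNK_SIZE = 10_000


def categories_dataset_kwargs(
    n_categories: int, dataset_kwargs: Mapping[str, Any]
) -> Mapping[str, Any]:
    """Return kwargs for writing the categories array.

    Adds a default chunk size only for large arrays and only when the caller
    has not set `chunks` explicitly. The input mapping is never mutated.
    """
    if n_categories > CATEGORY_CHUNK_SIZE and "chunks" not in dataset_kwargs:
        return {**dataset_kwargs, "chunks": (CATEGORY_CHUNK_SIZE,)}
    return dataset_kwargs


user_kwargs = {"compressor": None}
small = categories_dataset_kwargs(500, user_kwargs)
large = categories_dataset_kwargs(100_000, user_kwargs)
explicit = categories_dataset_kwargs(100_000, {"chunks": (1_000,)})
```

Small arrays and caller-specified chunk sizes pass through unchanged, so existing write calls keep their current behaviour unless the categories array exceeds 10,000 entries.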
Impact
This targeted change:
- Improves head_categories() / tail_categories() from PR #2288: 3-8x faster
- Improves full category loads: 7-30% faster
- Does not affect numeric obs/var arrays (which would have S3 penalties)
- Backward compatible: zarr reads any chunk layout transparently
Conclusion
While the original proposal in #2295 (chunking all 1D obs/var arrays) has unfavorable S3 tradeoffs, categorical categories string arrays are an exception where chunking provides a win-win. I propose implementing chunking for this specific case only.
Benchmark code: benchmark_string_chunks.py
Related
- #2295: Original chunking proposal (closing due to S3 penalties)
- PR #2288: LazyCategoricalDtype with head_categories()
- zarr-python#270: Chunk size configuration discussion