feat: Chunking for categorical categories string arrays only #2296

@katosh

Motivation

PR #2288 introduced LazyCategoricalDtype with head_categories() / tail_categories() methods for efficient partial category inspection without loading the entire array. However, zarr's default auto-chunking stores categories as single chunks, severely limiting the benefit:

| Method | H5AD | Zarr (auto) | Zarr (10k chunks) |
|---|---|---|---|
| head_categories(10) | 0.19ms | 14.2ms | 1.8ms |
| Full load | 30ms | 16.0ms | 14.9ms |

With proper chunking, partial reads improve by 8x while full reads remain fast.

Context

After investigating #2295 (default chunking for 1D arrays in obs/var), I ran benchmarks on local SSD, NFS, and S3 storage. The results showed that general 1D array chunking has a full-read penalty on S3 (1.4x slower), making it a net negative for typical access patterns.

However, there is an important exception: categorical categories string arrays show a win-win with chunking—faster partial reads AND faster full reads across all storage backends.

This proposes a more targeted change: apply default chunking only to categorical categories arrays, enabling the full potential of head_categories() / tail_categories() from PR #2288.

Why string arrays behave differently

General obs/var arrays (numeric codes, float columns) have S3 chunking penalties because:

  • Request overhead dominates for small numeric data
  • Numeric decompression is fast, so parallelism doesn't help much

Categorical categories (string arrays) benefit from chunking because:

  • Variable-length string decompression is slower and benefits from parallelism
  • Categories are accessed sequentially (head or tail), so a partial read only needs the first or last chunk
  • Smaller chunks decompress faster than one large string blob
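
A minimal sketch of the partial-read effect, using only the Python standard library. It does not use zarr: `zlib` and JSON stand in for the real codec and string encoding, and `write_chunks` / `read_head` are hypothetical names chosen for illustration. Each chunk is compressed independently, so a head read only has to decompress the first one.

```python
import json
import zlib


def write_chunks(categories, chunk_size):
    """Compress `categories` as independent chunks of `chunk_size` strings."""
    return [
        zlib.compress(json.dumps(categories[i : i + chunk_size]).encode("utf-8"))
        for i in range(0, len(categories), chunk_size)
    ]


def read_head(chunks, n):
    """Return the first `n` strings and the number of chunks decompressed."""
    out = []
    touched = 0
    for blob in chunks:
        if len(out) >= n:
            break
        out.extend(json.loads(zlib.decompress(blob).decode("utf-8")))
        touched += 1
    return out[:n], touched


categories = [f"cell_type_{i}" for i in range(100_000)]

single = write_chunks(categories, len(categories))  # one chunk holding everything
chunked = write_chunks(categories, 10_000)  # ten chunks of 10k strings each

head_single, touched_single = read_head(single, 10)
head_chunked, touched_chunked = read_head(chunked, 10)

# Both layouts touch one chunk, but the chunked layout's first chunk is far smaller.
print(len(single[0]), len(chunked[0]))
```

Both reads return the same ten strings; the difference is how many compressed bytes have to be fetched and decoded to get them.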

Benchmark: String Arrays Only

System: Darwin 24.6.0 (arm64), Apple M3 Max, 133 Mbps down / 101 Mbps up, S3 us-west-2

Local SSD Results

| Categories | Read | Chunk size 100 | Chunk size 1k | Chunk size 10k | auto |
|---|---|---|---|---|---|
| 10 | head(10) | 0.6ms | 0.5ms | 0.4ms | 0.4ms |
| 10 | full | 0.4ms | 0.4ms | 0.4ms | 0.4ms |
| 100 | head(10) | 0.4ms | 0.5ms | 0.5ms | 0.4ms |
| 100 | full | 0.4ms | 0.4ms | 0.4ms | 0.4ms |
| 1,000 | head(10) | 0.5ms | 0.6ms | 1.0ms | 0.5ms |
| 1,000 | full | 1.8ms | 0.5ms | 1.0ms | 0.5ms |
| 10,000 | head(10) | 0.5ms | 0.6ms | 1.7ms | 1.8ms |
| 10,000 | full | 16.9ms | 3.1ms | 1.8ms | 1.9ms |
| 50,000 | head(10) | 0.5ms | 0.7ms | 1.8ms | 7.3ms |
| 50,000 | full | 90.2ms | 15.7ms | 7.5ms | 8.3ms |
| 100,000 | head(10) | 0.5ms | 0.6ms | 1.8ms | 14.2ms |
| 100,000 | full | 179.9ms | 30.6ms | 14.9ms | 16.0ms |
| 500,000 | head(10) | 0.6ms | 0.6ms | 1.8ms | 36.7ms |
| 500,000 | full | 881.8ms | 146.8ms | 72.2ms | 76.7ms |

S3 Results

| Categories | Read | Chunk size 100 | Chunk size 1k | Chunk size 10k | auto |
|---|---|---|---|---|---|
| 10 | head(10) | 57ms | 55ms | 53ms | 58ms |
| 10 | full | 56ms | 53ms | 50ms | 53ms |
| 100 | head(10) | 54ms | 52ms | 56ms | 61ms |
| 100 | full | 58ms | 52ms | 54ms | 56ms |
| 1,000 | head(10) | 59ms | 61ms | 58ms | 61ms |
| 1,000 | full | 74ms | 58ms | 56ms | 65ms |
| 10,000 | head(10) | 54ms | 62ms | 119ms | 96ms |
| 10,000 | full | 644ms | 130ms | 66ms | 66ms |
| 50,000 | head(10) | 64ms | 60ms | 106ms | 169ms |
| 50,000 | full | 3,237ms | 387ms | 251ms | 231ms |
| 100,000 | head(10) | 62ms | 65ms | 126ms | 422ms |
| 100,000 | full | 6,391ms | 772ms | 266ms | 378ms |
| 500,000 | head(10) | 61ms | 66ms | 124ms | 1,568ms |
| 500,000 | full | 33,023ms | 3,491ms | 875ms | 1,714ms |

Analysis: 10k chunks are the sweet spot

Comparison at 100k categories

| Chunk Size | Local head(10) | Local full | S3 head(10) | S3 full |
|---|---|---|---|---|
| auto | 14.2ms | 16.0ms | 422ms | 378ms |
| 10,000 | 1.8ms (8x) | 14.9ms (7% faster) | 126ms (3.4x) | 266ms (30% faster) |
| 1,000 | 0.6ms (24x) | 30.6ms (1.9x slower) | 65ms (6.5x) | 772ms (2x slower) |
| 100 | 0.5ms (28x) | 179.9ms (11x slower) | 62ms (6.8x) | 6,391ms (17x slower) |
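
The ratios in this comparison follow directly from the raw timings. A small sketch that recomputes them for the 100k-category rows; the timings are copied from the benchmark tables above, and the names `AUTO`, `BY_CHUNK_SIZE`, and `ratio_vs_auto` are illustrative only.

```python
# Timings in ms at 100,000 categories, copied from the benchmark tables above.
AUTO = {"local_head": 14.2, "local_full": 16.0, "s3_head": 422.0, "s3_full": 378.0}
BY_CHUNK_SIZE = {
    10_000: {"local_head": 1.8, "local_full": 14.9, "s3_head": 126.0, "s3_full": 266.0},
    1_000: {"local_head": 0.6, "local_full": 30.6, "s3_head": 65.0, "s3_full": 772.0},
    100: {"local_head": 0.5, "local_full": 179.9, "s3_head": 62.0, "s3_full": 6391.0},
}


def ratio_vs_auto(chunk_size, key):
    """Return auto_time / chunked_time; a value above 1 means chunking is faster."""
    return AUTO[key] / BY_CHUNK_SIZE[chunk_size][key]


for chunk_size in BY_CHUNK_SIZE:
    row = {key: round(ratio_vs_auto(chunk_size, key), 2) for key in AUTO}
    print(chunk_size, row)
```

Only the 10,000 row has every ratio above 1, which is the sense in which it is a win for both partial and full reads.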

Key findings

  1. 10k chunks: Best balance

    • Partial reads: 3-8x faster at 100k categories (3.4x on S3, 8x on local SSD)
    • Full reads: 7-30% faster at 100k categories (not slower!)
  2. 1k chunks: Too aggressive

    • Partial reads: Marginally better than 10k
    • Full reads: 2x slower on S3 (too many requests)
  3. 100 chunks: Catastrophic for full reads

    • 17x slower on S3 due to thousands of requests
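
One way to reason about these findings is to count how many chunks each read has to touch. The sketch below assumes one store request per chunk, which is an assumption about the storage layer and not something measured in the benchmarks; the function names are hypothetical.

```python
import math


def chunks_for_full_read(n_categories, chunk_size):
    """Number of chunks needed to read the whole array."""
    return math.ceil(n_categories / chunk_size)


def chunks_for_head(n_categories, n, chunk_size):
    """Number of chunks overlapping the first `n` elements."""
    return math.ceil(min(n, n_categories) / chunk_size)


def chunks_for_tail(n_categories, n, chunk_size):
    """Number of chunks overlapping the last `n` elements."""
    if n <= 0 or n_categories <= 0:
        return 0
    first = max(n_categories - n, 0) // chunk_size  # chunk holding the first wanted element
    last = (n_categories - 1) // chunk_size  # chunk holding the final element
    return last - first + 1


for chunk_size in (100, 1_000, 10_000):
    print(
        chunk_size,
        chunks_for_full_read(100_000, chunk_size),
        chunks_for_head(100_000, 10, chunk_size),
        chunks_for_tail(100_000, 10, chunk_size),
    )
```

At 100k categories a full read needs 1,000 chunks at chunk size 100 but only 10 at chunk size 10,000, while head and tail reads of 10 elements need a single chunk in every case shown.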

Proposed change

Apply 10k chunking specifically to the categorical categories array in write_categorical.

Current code:

_writer.write_elem(
    g,
    "categories",
    v.categories.to_numpy(),
    dataset_kwargs=dataset_kwargs,
)

Proposed change:

categories = v.categories.to_numpy()
cat_kwargs = dataset_kwargs
# Chunk only large category arrays, and never override a user-supplied "chunks".
if len(categories) > 10_000 and "chunks" not in dataset_kwargs:
    cat_kwargs = dict(dataset_kwargs, chunks=(10_000,))
_writer.write_elem(g, "categories", categories, dataset_kwargs=cat_kwargs)
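
The branching logic can be exercised on its own, without anndata or zarr. A sketch, assuming `dataset_kwargs` is any mapping (possibly read-only); `categories_dataset_kwargs` and `CATEGORY_CHUNK_SIZE` are hypothetical names that do not exist in anndata, and the `"compressor"` key is just a placeholder entry.

```python
from collections.abc import Mapping
from types import MappingProxyType

# Hypothetical constant; the value is the chunk size the benchmarks favour.
CATEGORY_CHUNK_SIZE = 10_000


def categories_dataset_kwargs(n_categories: int, dataset_kwargs: Mapping) -> Mapping:
    """Return the kwargs to use when writing the categories array."""
    # Arrays that fit in one chunk, or that have explicit chunks, are left alone.
    if n_categories > CATEGORY_CHUNK_SIZE and "chunks" not in dataset_kwargs:
        return {**dataset_kwargs, "chunks": (CATEGORY_CHUNK_SIZE,)}
    return dataset_kwargs


small = categories_dataset_kwargs(500, {})
large = categories_dataset_kwargs(100_000, MappingProxyType({"compressor": None}))
explicit = categories_dataset_kwargs(100_000, {"chunks": (1_000,)})

print(small, large, explicit)
```

Building a new dict instead of mutating `dataset_kwargs` keeps the caller's kwargs unchanged for the other elements written with them.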

Impact

This targeted change:

  1. Improves head_categories() / tail_categories() from PR #2288: 3-8x faster at 100k categories
  2. Improves full category loads: 7-30% faster at 100k categories
  3. Does not affect numeric obs/var arrays (which would have S3 penalties)
  4. Backward compatible: zarr reads any chunk layout transparently

Conclusion

While the original proposal in #2295 (chunking all 1D obs/var arrays) has unfavorable S3 tradeoffs, categorical categories string arrays are an exception where chunking provides a win-win. I propose implementing chunking for this specific case only.

Benchmark code: benchmark_string_chunks.py

Related

  • #2295: Original chunking proposal (closing due to S3 penalties)
  • PR #2288: LazyCategoricalDtype with head_categories()
  • zarr-python#270: Chunk size configuration discussion
