Request an API to efficiently get category count (without loading values) and partial category loading for lazy AnnData columns.
Motivation
For lazy AnnData (from read_lazy()), accessing category information via col.dtype.categories works but always loads ALL category values. For datasets with many categories (e.g., cell IDs, barcodes), this is inefficient when you only need the count or a preview.
Current behavior:
from anndata.experimental import read_lazy

lazy_adata = read_lazy("dataset.zarr")  # 1M cells, 100k categories
col = lazy_adata.obs["cell_id"]
categories = col.dtype.categories  # ~25ms - loads all 100k values
n_categories = len(col.dtype.categories)  # Must load all just to count
Desired behavior:
# Efficient count-only (no data loading):
n_categories = get_category_count(col)  # ~0.1µs, metadata only
# Partial loading for previews:
first_100 = get_categories(col, n=100)  # Load only first 100
Benchmark Results
Tested with 1M cells and 100k categories (median of 3 runs):
| Method | Zarr | H5AD | Data loaded |
| --- | --- | --- | --- |
| col.values (full column) | 80 ms | 87 ms | All codes (1M) + categories |
| col.dtype.categories | 25 ms | 34 ms | Category group (values + mask) |
| read_elem(values) | 9 ms | 16 ms | Category values only |
| read_elem_partial(n=100) | 9.1 ms | 0.1 ms | First 100 categories |
| values.shape[0] (metadata) | 0.12 µs | 0.08 µs | Count only (no loading) |
Key findings:
Reading the count from metadata (values.shape[0]) takes about 0.1 µs, compared with 25-34 ms for col.dtype.categories and 9-16 ms for read_elem(values)
col.dtype.categories reads the whole category group (values + mask), ~2.5x slower than reading values directly
Partial loading via read_elem_partial is very fast for H5AD but shows no benefit for Zarr (possibly not optimized yet?)
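To put the findings in proportion, the ratios can be computed directly from the reported medians. This is only arithmetic on the numbers in the table; the dictionary layout and the `ratio` helper are illustrative names, not part of anndata.

```python
# Median timings from the benchmark table, converted to seconds.
timings = {
    "zarr": {"dtype_categories": 25e-3, "read_values": 9e-3, "shape_metadata": 0.12e-6},
    "h5ad": {"dtype_categories": 34e-3, "read_values": 16e-3, "shape_metadata": 0.08e-6},
}


def ratio(backend: str, slow: str, fast: str) -> float:
    """How many times slower `slow` is than `fast` for one backend."""
    return timings[backend][slow] / timings[backend][fast]


for backend in timings:
    vs_values = ratio(backend, "dtype_categories", "read_values")
    vs_shape = ratio(backend, "dtype_categories", "shape_metadata")
    print(f"{backend}: categories/values = {vs_values:.1f}x, "
          f"categories/shape = {vs_shape:,.0f}x")
```

For both backends col.dtype.categories comes out between 2x and 3x slower than reading the values directly, and more than 200,000x slower than reading the shape from metadata.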
Benchmark code
import anndata as ad
from anndata.experimental import read_lazy
from anndata._io.specs.registry import read_elem, read_elem_partial
import numpy as np
import pandas as pd
import tempfile
import time
import os

# Create AnnData with 1M cells and 100k categories
n_obs = 1_000_000
n_cats = 100_000
adata = ad.AnnData(
    X=np.random.rand(n_obs, 10).astype(np.float32),
    obs=pd.DataFrame({
        'cell_id': pd.Categorical([f'Cell_{i % n_cats}' for i in range(n_obs)])
    })
)

with tempfile.TemporaryDirectory() as tmpdir:
    # For Zarr:
    path = os.path.join(tmpdir, 'test.zarr')
    adata.write_zarr(path)
    # For H5AD: change to 'test.h5ad' and adata.write_h5ad(path)

    lazy_adata = read_lazy(path)
    col = lazy_adata.obs['cell_id']

    # Navigate to storage
    cat_arr = col.variable._data.array
    cats_storage = cat_arr._categories
    values = cats_storage["values"] if hasattr(cats_storage, "keys") else cats_storage

    # Benchmark full column load
    start = time.perf_counter()
    _ = col.values
    print(f"col.values: {(time.perf_counter() - start) * 1000:.1f}ms")

    # Note: col.dtype.categories uses @cached_property, so we need fresh read_lazy()
    lazy_adata = read_lazy(path)
    col = lazy_adata.obs['cell_id']
    start = time.perf_counter()
    _ = col.dtype.categories
    print(f"col.dtype.categories: {(time.perf_counter() - start) * 1000:.1f}ms")

    # Category values only
    start = time.perf_counter()
    _ = read_elem(values)
    print(f"read_elem(values): {(time.perf_counter() - start) * 1000:.1f}ms")

    # First 100 categories only
    start = time.perf_counter()
    _ = read_elem_partial(values, indices=slice(0, 100))
    print(f"read_elem_partial(n=100): {(time.perf_counter() - start) * 1000:.1f}ms")

    # Count from metadata, no data loaded
    start = time.perf_counter()
    _ = values.shape[0]
    print(f"values.shape[0]: {(time.perf_counter() - start) * 1e6:.2f}µs")
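The results are reported as a median of 3 runs, while the script above times each call once. One possible way to get the median is a small helper like the sketch below. `median_time` is a hypothetical name, and the workload is a stand-in so the snippet runs without anndata.

```python
import statistics
import time
from typing import Callable


def median_time(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the median wall-clock time of `fn` in seconds over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; in the benchmark this could be `lambda: read_elem(values)`.
elapsed = median_time(lambda: sum(range(100_000)))
print(f"median: {elapsed * 1000:.3f}ms")
```

Because col.dtype.categories is cached after the first access, timing it this way would require the callable to call read_lazy() again on every repeat, with that overhead either excluded or measured separately.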
Proposed Solution
Option A: Methods on CategoricalArray
class CategoricalArray:
    @property
    def n_categories(self) -> int:
        """Number of categories (metadata only, no loading)."""
        values = self._categories["values"] if hasattr(self._categories, "keys") else self._categories
        return values.shape[0]

    def get_categories(self, n: int | None = None) -> np.ndarray:
        """Get category values, optionally limited to first N."""
        values = self._categories["values"] if hasattr(self._categories, "keys") else self._categories
        if n is not None:
            return read_elem_partial(values, indices=slice(0, n))
        return read_elem(values)
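The intended behavior of these two methods can be illustrated without anndata or any storage backend. In the sketch below, `FakeStoredArray` and `LazyCategoricalSketch` are hypothetical stand-ins rather than real anndata classes. The fake store counts how many elements are read, so the difference between the metadata path and the loading path is visible.

```python
from __future__ import annotations


class FakeStoredArray:
    """Hypothetical stand-in for an on-disk array: shape is metadata, slicing reads data."""

    def __init__(self, data: list[str]) -> None:
        self._data = data
        self.shape = (len(data),)
        self.reads = 0  # number of elements read from "disk" so far

    def __getitem__(self, key: slice) -> list[str]:
        out = self._data[key]
        self.reads += len(out)
        return out


class LazyCategoricalSketch:
    """Sketch of the two proposed methods; not the real CategoricalArray."""

    def __init__(self, categories) -> None:
        self._categories = categories

    def _values(self):
        # Same layout check as in the proposal: a group with a "values" entry, or a bare array.
        cats = self._categories
        return cats["values"] if hasattr(cats, "keys") else cats

    @property
    def n_categories(self) -> int:
        return self._values().shape[0]  # metadata only

    def get_categories(self, n: int | None = None) -> list[str]:
        values = self._values()
        return values[:n] if n is not None else values[:]


store = FakeStoredArray([f"Cell_{i}" for i in range(100_000)])
arr = LazyCategoricalSketch({"values": store})

count = arr.n_categories
reads_after_count = store.reads
preview = arr.get_categories(n=100)
reads_after_preview = store.reads
print(count, reads_after_count, len(preview), reads_after_preview)
```

Reading the count touches no data, and the preview reads only the requested 100 elements.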
Option B: Utility function
def get_lazy_column_categories(
    col: DataArray,
    n: int | None = None,
) -> tuple[np.ndarray, int]:
    """Efficiently get categories from a lazy column."""
Option C: Integration with Dataset2D
class Dataset2D:
    def get_column_categories(self, column: str, n: int | None = None) -> np.ndarray | None:
        """Get categories efficiently, with optional partial loading."""

    def get_column_category_count(self, column: str) -> int | None:
        """Get category count from metadata (no data loading)."""
Use Cases
HTML representation - Display the category count ("100,000 categories") and preview the first N values without loading all of them (PR #2236, "feat: Add HTML representation"; see lazy loading in visual demo 8b)
Data inspection - Quick count and preview of categories in large datasets
Validation - Check category count matches color arrays in uns
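As a rough illustration of how the first and third use cases could consume a count and a short preview, here is a minimal sketch. Both function names are hypothetical, and the summary format is an assumption rather than what PR #2236 renders.

```python
def format_category_summary(n_categories: int, preview: list[str], max_shown: int = 3) -> str:
    """Build a one-line summary from a count and a short preview of category values."""
    shown = preview[:max_shown]
    more = n_categories - len(shown)
    suffix = f", ... (+{more:,} more)" if more > 0 else ""
    return f"{n_categories:,} categories: {', '.join(shown)}{suffix}"


def colors_match_categories(n_categories: int, colors: list[str]) -> bool:
    """Validation check: one color per category, using only the count."""
    return len(colors) == n_categories


summary = format_category_summary(100_000, ["Cell_0", "Cell_1", "Cell_10"])
print(summary)
print(colors_match_categories(3, ["#ff0000", "#00ff00", "#0000ff"]))
```

Neither function needs the full category array, which is why both use cases benefit from a metadata-only count plus partial loading.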
References
PR #2236: HTML representation with lazy AnnData support
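Current Workaround (in PR #2236)
The HTML representation module implements a workaround for efficient category display by navigating to the underlying storage, along the same private access path used in the benchmark code: col.variable._data.array, then its _categories attribute, then the "values" entry.
Why this workaround is fragile:
It reaches through private attributes (col.variable._data.array), which may change
It depends on the private _categories attribute
It assumes a specific storage layout (the cats["values"] pattern)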
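Until a public API exists, code that walks this private path can at least fail softly. The sketch below is one possible defensive variant, not the code in PR #2236. `find_category_values` is a hypothetical name, and the objects are `SimpleNamespace` stand-ins so the snippet runs without anndata.

```python
from types import SimpleNamespace


def find_category_values(col):
    """Walk col.variable._data.array._categories; return None if any link is missing."""
    obj = col
    for name in ("variable", "_data", "array", "_categories"):
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    # A group-like object exposes the categories under "values"; otherwise it is the array itself.
    return obj["values"] if hasattr(obj, "keys") else obj


# Stand-in for a lazy column that has the expected private layout.
good_col = SimpleNamespace(
    variable=SimpleNamespace(
        _data=SimpleNamespace(array=SimpleNamespace(_categories={"values": ["a", "b", "c"]}))
    )
)
# Stand-in for a column whose internals have changed.
changed_col = SimpleNamespace(variable=SimpleNamespace())

found = find_category_values(good_col)
missing = find_category_values(changed_col)
print(found, missing)
```

Returning None lets a caller fall back to col.dtype.categories, but the fallback only hides the breakage, which is the argument for a supported API.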