perf: fast path for unchunked minor axis dask sparse reindexing when concating#2395

Merged: ilan-gold merged 16 commits into main from ig/minor_axis_unchunked on Apr 21, 2026
@ilan-gold ilan-gold commented Apr 20, 2026

Someone tried shuffling the whole cellxgene census with annbatch, which surfaced this bug: our outer join, while correct, was effectively a densification operation for CSR matrices. Why? The previous implementation essentially did the following for every block:

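import scipy.sparse as sp

# Small random CSR matrix (default density is 0.01, so almost all entries are implicit zeros)
mat = sp.random(10, 20, format="csr")
# Assigning 0 to an entry writes an explicit entry instead of leaving it implicit
mat[0, 15] = 0
# The zero now occupies a slot in the stored values
assert 0 in mat.data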

which means the zeros are explicitly stored. After fixing this, the run took under 7 hours, which is roughly what we'd expect given the benchmark in our paper. The speedup and memory gains are confirmed by the benchmark (see below for results).
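To illustrate the idea, here is a minimal sketch of reindexing the minor (column) axis of a CSR matrix without writing any fill zeros. It is not the code in `src/anndata/_core/merge.py`; `reindex_columns_csr` and its parameters are hypothetical names chosen for this example, and it assumes a fill value of 0 and that every old column label is present in the new column index (as in an outer join).

```python
import numpy as np
import scipy.sparse as sp


def reindex_columns_csr(mat, old_cols, new_cols):
    """Hypothetical helper (not anndata API): widen `mat` from `old_cols` to `new_cols`.

    Assumes every label in `old_cols` also appears in `new_cols`.
    Only the stored column indices are remapped; no zeros are written.
    """
    # Position of every label in the wide (joined) column index.
    position = {label: i for i, label in enumerate(new_cols)}
    # For each old column, the column it lands on in the wide matrix.
    col_map = np.array([position[label] for label in old_cols], dtype=np.intp)
    out = sp.csr_matrix(
        # Copy data/indptr so the in-place sort below cannot touch `mat`.
        (mat.data.copy(), col_map[mat.indices], mat.indptr.copy()),
        shape=(mat.shape[0], len(new_cols)),
    )
    # Remapping can leave column indices unsorted within a row.
    out.sort_indices()
    return out


mat = sp.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))
wide = reindex_columns_csr(mat, ["a", "b", "c"], ["c", "x", "a", "b"])
# The number of stored values is unchanged: the new column "x" costs nothing.
assert wide.nnz == mat.nnz
```

The stored values and row pointers are reused as-is, so memory scales with the number of non-zero entries rather than with the width of the joined column index.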

  • Closes #
  • Tests added
  • Release note not necessary because:

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.47%. Comparing base (3aefe2e) to head (7bef9e9).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2395      +/-   ##
==========================================
- Coverage   87.43%   85.47%   -1.96%     
==========================================
  Files          49       49              
  Lines        7720     7725       +5     
==========================================
- Hits         6750     6603     -147     
- Misses        970     1122     +152     
| Files with missing lines | Coverage Δ |
|---|---|
| src/anndata/_core/merge.py | 84.78% <100.00%> (-8.81%) ⬇️ |

... and 7 files with indirect coverage changes

scverse-benchmark Bot commented Apr 20, 2026

Benchmark changes

| Change | Before [30f20c8] | After [ece963b] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 22.1±0.06ms | 20.0±1ms | 0.91 | dataset2d.Dataset2D.time_getitem_bool_mask('zarr', (-1,), 'cat') |
| - | 22.0±0.2ms | 19.9±1ms | 0.9 | dataset2d.Dataset2D.time_getitem_bool_mask('zarr', None, 'cat') |
| + | 17.2±0.6ms | 21.3±0.5ms | 1.24 | sparse_dataset.SparseCSRContiguousSlice.time_getitem('array', False) |
| + | 512±50μs | 576±10μs | 1.13 | sparse_dataset.SparseCSRContiguousSlice.time_getitem_adata('alternating', False) |
| - | 358±0.8μs | 314±2μs | 0.88 | sparse_dataset.SparseCSRContiguousSlice.time_getitem_adata('array', False) |
| - | 7.09G | 1.08G | 0.15 | sparse_dataset.SparseCSRDaskConcat.peakmem_concat_with_mem('outer') |
| - | 536±20ms | 402±5ms | 0.75 | sparse_dataset.SparseCSRDaskConcat.time_concat('outer') |
| - | 6.01±0.02s | 1.46±0.02s | 0.24 | sparse_dataset.SparseCSRDaskConcat.time_concat_with_mem('outer') |

Comparison: https://github.com/scverse/anndata/compare/30f20c8588259827926df75b458d541e6b0bb434..ece963b4de3184890ad5f5747e3017cfbb6855e9
Last changed: Mon, 20 Apr 2026 13:52:10 +0000

More details: https://github.com/scverse/anndata/pull/2395/checks?check_run_id=72116855662

@ilan-gold ilan-gold changed the title from "perf: fast path for unchunked minor axis + 0 fill value dask sparse reindexing when concating" to "perf: fast path for unchunked minor axis CSR dask sparse reindexing when concating" Apr 20, 2026
@ilan-gold ilan-gold marked this pull request as ready for review April 20, 2026 14:00
@ilan-gold ilan-gold added this to the 0.12.11 milestone Apr 20, 2026
@ilan-gold ilan-gold requested a review from flying-sheep April 20, 2026 14:04
@flying-sheep flying-sheep left a comment

Looks great! One nitpick, otherwise let’s go!

Comment thread src/anndata/_core/merge.py Outdated
@ilan-gold ilan-gold enabled auto-merge (squash) April 20, 2026 17:01
@ilan-gold ilan-gold changed the title from "perf: fast path for unchunked minor axis CSR dask sparse reindexing when concating" to "perf: fast path for unchunked minor axis dask sparse reindexing when concating" Apr 20, 2026
@ilan-gold ilan-gold merged commit aae79ef into main Apr 21, 2026
24 checks passed
@ilan-gold ilan-gold deleted the ig/minor_axis_unchunked branch April 21, 2026 10:31
flying-sheep pushed a commit that referenced this pull request Apr 21, 2026
…or axis `dask` sparse reindexing when `concat`ing) (#2397)

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>