
Conversation


@ilan-gold ilan-gold marked this pull request as draft September 16, 2025 14:59

codecov bot commented Sep 16, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 85.61%. Comparing base (9da064f) to head (81004c5).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/anndata/_io/specs/lazy_methods.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2121      +/-   ##
==========================================
+ Coverage   85.42%   85.61%   +0.18%     
==========================================
  Files          46       46              
  Lines        7081     7083       +2     
==========================================
+ Hits         6049     6064      +15     
+ Misses       1032     1019      -13     
Files with missing lines Coverage Δ
src/anndata/_core/aligned_df.py 96.66% <100.00%> (+0.17%) ⬆️
src/anndata/_core/merge.py 85.13% <100.00%> (+0.02%) ⬆️
src/anndata/experimental/backed/_lazy_arrays.py 91.59% <100.00%> (ø)
src/anndata/_io/specs/lazy_methods.py 95.45% <75.00%> (-0.70%) ⬇️

... and 4 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.12.2 milestone Sep 16, 2025

scverse-benchmark bot commented Sep 16, 2025

Benchmark changes

Change Before [e88a6c2] After [7a35269] Ratio Benchmark (Parameter)
- 3.49±0.01s 629±4ms 0.18 dataset2d.Dataset2D.time_concat(<function Dataset2D.> (0), (-1,))
- 3.50±0.02s 631±0.9ms 0.18 dataset2d.Dataset2D.time_concat(<function Dataset2D.> (0), None)
- 4.12±0.2s 1.52±0.01s 0.37 dataset2d.Dataset2D.time_concat(<function Dataset2D.> (1), (-1,))
- 5.10±0.02s 1.55±0.01s 0.3 dataset2d.Dataset2D.time_concat(<function Dataset2D.> (1), None)
+ 9.98±0.1ms 12.6±0.2ms 1.26 dataset2d.Dataset2D.time_getitem_slice(<function Dataset2D.> (0), (-1,))

Comparison: https://github.com/scverse/anndata/compare/e88a6c2397ccc199eb8265e075e914fc53e8abb1..7a35269b955a90f3d4407c1be3877b3a201c3d16
Last changed: Thu, 2 Oct 2025 14:24:41 +0000

More details: https://github.com/scverse/anndata/pull/2121/checks?check_run_id=51800151862

@ilan-gold ilan-gold marked this pull request as ready for review October 1, 2025 15:53
@ilan-gold ilan-gold requested a review from flying-sheep October 2, 2025 14:00
@ilan-gold ilan-gold modified the milestones: 0.12.2, 0.12.3 Oct 15, 2025
Member

@flying-sheep flying-sheep left a comment


Can you explain the idea behind the uuid? Why is this the right thing to do for this use case and not other uses of map_blocks? Is there anything in the dask docs that recommends this pattern for certain use cases?

@ilan-gold
Contributor Author

Can you explain the idea behind the uuid?

I just want to be 100% sure the name I create is unique. From the docs:

The key name to use for the output array. Note that this fully specifies the output key name, and must be unique. If not provided, will be determined by a hash of the arguments.

Why is this the right thing to do for this use case

I want to avoid tokenization of the input function, which appears to be expensive in certain cases. I had originally added `name` everywhere we call `map_blocks`, but that seems to have been overkill: adding it elsewhere does not appear to have affected performance.
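For illustration, here is a minimal sketch (not anndata's actual code) of the pattern under discussion: supplying an explicit, unique `name` to `dask.array.map_blocks` so dask does not have to tokenize (hash) the function and its arguments to derive the output key. The function and array here are placeholders.

```python
import uuid

import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))

def add_one(block):
    return block + 1

# Without `name`, dask derives the output key by hashing the function and
# its arguments, which can be expensive for functions closing over large objects.
y = x.map_blocks(add_one)

# With an explicit `name`, no hashing is needed; per the dask docs, `name`
# fully specifies the output key and must be unique, hence the uuid.
z = x.map_blocks(add_one, name=f"add-one-{uuid.uuid4()}")

assert (z.compute() == 2).all()
```

Note that reusing the same `name` for two different computations would make dask treat them as the same task, so uniqueness is essential.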

Member

@flying-sheep flying-sheep left a comment


I see! Weird that adding it everywhere didn’t affect performance if adding it here does. Maybe we can form an intuition on where it’s expensive so we don’t have to guess/measure as much?
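One possible way to build that intuition (a sketch, not anything from this PR): `dask.base.tokenize` is the hashing step that an explicit `name` bypasses, and it can be probed directly. A function that closes over a large array forces dask to hash that array's contents, while a self-contained function tokenizes cheaply; both names here are illustrative.

```python
import numpy as np
from dask.base import tokenize

big = np.zeros((1000, 1000))  # a large closed-over object

def cheap_fn(block):
    return block + 1

def expensive_fn(block):
    return block + big  # closure over `big` must be hashed by tokenize

# tokenize() returns a deterministic hash string for each object;
# timing these calls would show where the cost comes from.
t_cheap = tokenize(cheap_fn)
t_expensive = tokenize(expensive_fn)
assert t_cheap != t_expensive
```
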

@ilan-gold
Contributor Author

Requested a re-review based on the last non-merge commit, @flying-sheep, because I added an optimization for something I noticed in #2156, although I don't know why it fails for Selman (it worked for me but was slow, hence the additional commit after your approval).

@ilan-gold ilan-gold force-pushed the ig/accelerate_map_blocks branch from 017b829 to c29d7f1 Compare October 19, 2025 10:21
@ilan-gold ilan-gold force-pushed the ig/accelerate_map_blocks branch from c29d7f1 to 3bc4ee2 Compare October 19, 2025 10:27


Development

Successfully merging this pull request may close these issues.

ad.concat is slow on lazy data on account of tokenize
