
Conversation

@sjfleming (Contributor) commented Jul 30, 2025

This is a first stab at fixing #2064 by adding _safe_fancy_index_h5py (and three related helper functions) to anndata/_core/index.py. _safe_fancy_index_h5py is only called when repeated indices are requested, which is the only case currently causing a bug; in all other cases the existing code, d[tuple(ordered)][tuple(rev_order)], runs as before.
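
For context, h5py's fancy indexing requires index lists to be in increasing order without duplicates, which is why d[tuple(ordered)][tuple(rev_order)] breaks on repeated indices. A minimal one-axis sketch of the workaround (the real helper's signature and name may differ from this illustration):

import numpy as np

def _safe_fancy_index_h5py_sketch(dataset, idx):
    # np.unique returns sorted, duplicate-free indices plus an inverse
    # mapping such that unique[inverse] reconstructs idx.
    unique, inverse = np.unique(idx, return_inverse=True)
    # One h5py read with a valid (sorted, unique) selection ...
    result = dataset[unique]
    # ... then duplicate/reorder rows in memory to match the request.
    return result[inverse]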

codecov bot commented Jul 30, 2025

Codecov Report

❌ Patch coverage is 35.84906% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.43%. Comparing base (a1d6f17) to head (64af43e).

Files with missing lines | Patch % | Lines
src/anndata/_core/index.py | 27.65% | 34 Missing ⚠️

❌ Your project check has failed because the head coverage (66.43%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (a1d6f17) and HEAD (64af43e).

HEAD has 2 fewer uploads than BASE:

Flag | BASE (a1d6f17) | HEAD (64af43e)
     | 5              | 3
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2066       +/-   ##
===========================================
- Coverage   85.57%   66.43%   -19.14%     
===========================================
  Files          46       46               
  Lines        7092     7118       +26     
===========================================
- Hits         6069     4729     -1340     
- Misses       1023     2389     +1366     
Files with missing lines | Coverage Δ
src/anndata/_core/merge.py | 63.48% <100.00%> (-20.93%) ⬇️
src/anndata/_core/sparse_dataset.py | 83.28% <100.00%> (-9.39%) ⬇️
src/anndata/experimental/backed/_lazy_arrays.py | 83.19% <100.00%> (-8.41%) ⬇️
src/anndata/_core/index.py | 60.09% <27.65%> (-32.56%) ⬇️

... and 24 files with indirect coverage changes

@flying-sheep (Member) left a comment

Hi, this looks good, thank you!

I have a lot of little comments. Please tell me if you'd prefer that I do all this myself; it's fine with me!

@sjfleming (Contributor, author) commented

I took a shot at it. Let me know what you think, @flying-sheep! Thanks

@flying-sheep (Member) left a comment

This looks great!

I narrowed the types down a bit; please check if all my changes make sense:

Apart from one test function, _subset only ever gets called with a tuple of length 1 or 2 containing normalized indices (1D boolean arrays, 1D integer arrays, and slices).
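
For reference, a hedged sketch of what those narrowed index types could look like (the alias names are illustrative, not necessarily what ended up in anndata/_core/index.py):

import numpy as np

NormalizedIndex = slice | np.ndarray  # a 1D boolean array, 1D integer array, or slice
SubsetIndices = tuple[NormalizedIndex] | tuple[NormalizedIndex, NormalizedIndex]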

@sjfleming (Contributor, author) commented

Thanks @flying-sheep! Yes, I think your type changes make sense. Please check whether my comments and changes address your concerns, which identified a problem with my previous type hints for _apply_index_to_result.

@flying-sheep (Member) commented Sep 8, 2025

I see, so actually it should just be

result = cast("np.ndarray", dataset[processed_indices[0]])
result = result[:, *processed_indices[1:]]

right?

And if the first index is :, this will load the entire dataset into memory.

I wonder what the best approach is then. Maybe finding out which 1D slice operation reduces the data the most, then applying that first?
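
A hedged sketch of that idea for the 2D case (the helper name and selection criterion are illustrative; this assumes the indices are already normalized to sorted, duplicate-free integer arrays or slices, as earlier steps ensure):

import numpy as np

def _smallest_axis_first_sketch(dataset, processed_indices):
    def kept_fraction(idx, axis_len):
        # Fraction of the axis kept by this index; reading along the most
        # selective axis first minimizes how much is pulled from disk.
        if isinstance(idx, slice):
            return len(range(*idx.indices(axis_len))) / axis_len
        return len(idx) / axis_len

    fractions = [
        kept_fraction(idx, n) for idx, n in zip(processed_indices, dataset.shape)
    ]
    axis = int(np.argmin(fractions))
    # h5py allows one fancy index combined with slices, so read along the
    # most selective axis first ...
    first = tuple(
        idx if i == axis else slice(None) for i, idx in enumerate(processed_indices)
    )
    result = np.asarray(dataset[first])
    # ... then apply the remaining indices to the smaller in-memory array.
    rest = tuple(
        slice(None) if i == axis else idx for i, idx in enumerate(processed_indices)
    )
    return result[rest]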

@flying-sheep flying-sheep added this to the 0.12.2 milestone Sep 8, 2025
@flying-sheep flying-sheep changed the title fancy indexing fixes backed h5py error fix: fancy indexing fixes backed h5py error Sep 8, 2025
@sjfleming (Contributor, author) commented

Hi @flying-sheep, thanks for the simplification; much better. I have again made it slightly more complicated by first slicing with the indexer that makes the dataset smallest (as you suggested)... let me know what you think.

@flying-sheep (Member) commented

Looks good to me! Maybe @ilan-gold should have a look in case I missed something.

@ilan-gold (Contributor) left a comment

It seems like we're doing some unnecessary unique calls, no? _subset_dataset calls _index_order_and_inverse and checks its outputs for duplicates; if duplicates are present, _safe_fancy_index_h5py then checks for them again, calling unique twice. Do I have this right? Is there any way to simplify this?

if axis_idx.dtype == bool:
    axis_idx = np.flatnonzero(axis_idx)
order = np.argsort(axis_idx)
return axis_idx[order], np.argsort(order)
A contributor commented:

Suggested change
return axis_idx[order], np.argsort(order)
return axis_idx[order], np.arange(len(order))

Isn't order already sorted, so argsort would just produce an arange?

@sjfleming (author) replied:

order is not already sorted here; it's the index order that sorts axis_idx. I did try the suggested change above, but the tests no longer passed.
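
For concreteness, a small example (not from the PR): order is the permutation that sorts axis_idx, and np.argsort(order) is its inverse, which only equals an arange when axis_idx was already sorted:

import numpy as np

axis_idx = np.array([3, 0, 2])      # unsorted request
order = np.argsort(axis_idx)        # array([1, 2, 0]) -- not an arange
axis_idx[order]                     # array([0, 2, 3]) -- sorted, safe for h5py
np.argsort(order)                   # array([2, 0, 1]) -- the inverse permutation
axis_idx[order][np.argsort(order)]  # array([3, 0, 2]) -- original order restored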

Comment on lines 278 to 284
return (
    # Has duplicates - use unique + inverse mapping approach
    np.unique(idx, return_inverse=True)
    if len(np.unique(idx)) != len(idx)
    # No duplicates - just sort and track reverse mapping
    else _index_order_and_inverse(idx)
)
A contributor commented:

Suggested change
return (
    # Has duplicates - use unique + inverse mapping approach
    np.unique(idx, return_inverse=True)
    if len(np.unique(idx)) != len(idx)
    # No duplicates - just sort and track reverse mapping
    else _index_order_and_inverse(idx)
)
unique, inverse = np.unique(idx, return_inverse=True)
return (
    # Has duplicates - use unique + inverse mapping approach
    (unique, inverse)
    if len(unique) != len(idx)
    # No duplicates - just sort and track reverse mapping
    else _index_order_and_inverse(idx)
)

@sjfleming (author) replied:

Yes, you are correct here that there is no reason to call np.unique() twice. I implemented this suggestion.

@sjfleming (Contributor, author) commented

@ilan-gold I think I see what you mean about checking for duplicates twice. My thinking was that _safe_fancy_index_h5py should check for duplicates on its own, in case it's ever called by another function for another purpose in the future. I also wanted to modify the existing functionality as little as possible, so _subset_dataset only calls any of the new code when it needs to; that's why I check for duplicates there as well. It could be handled differently, though.

@ilan-gold (Contributor) commented

@sjfleming Would you be up for adding a benchmark as well? It lives in our benchmarks folder. I'm just a little concerned about all the np.unique and np.argsort calls that now run by default (I count 6). At the moment you could probably only add one for non-duplicated indexing, but even that would be helpful given the new default behavior.

Aside from that, my only comment would be that you could potentially add a flag to _safe_fancy_index_h5py to indicate that its input has already been checked for duplicates.
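
A hedged sketch of that flag idea (the parameter name and internal structure are illustrative, not the PR's actual code):

import numpy as np

def _safe_fancy_index_h5py(dataset, idx, *, known_duplicated=False):
    # known_duplicated=True means the caller (e.g. _subset_dataset) already
    # found duplicates, so skip the re-check and use unique + inverse directly.
    if not known_duplicated and len(np.unique(idx)) == len(idx):
        # No duplicates: a sorted, unique fancy index is valid in h5py.
        order = np.argsort(idx)
        return dataset[idx[order]][np.argsort(order)]
    unique, inverse = np.unique(idx, return_inverse=True)
    return dataset[unique][inverse]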

@ilan-gold ilan-gold modified the milestones: 0.12.2, 0.12.3, 0.12.4 Oct 15, 2025
Successfully merging this pull request may close these issues:

duplicated indices when slicing dense backed view lead to .to_memory() TypeError