
Conversation

@sjfleming (Contributor) commented Jul 30, 2025

This is a first stab at fixing #2064 by adding _safe_fancy_index_h5py (and three related helper functions) to anndata/_core/index.py. _safe_fancy_index_h5py is only called when repeated indices are requested, which is the only case currently causing a bug; in all other cases the existing code, d[tuple(ordered)][tuple(rev_order)], runs as before.
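
For context, h5py's fancy indexing requires index lists to be in increasing order without duplicates, which is why d[tuple(ordered)][tuple(rev_order)] breaks on repeated indices. A minimal one-axis sketch of the workaround (the real helper's signature and name may differ from this illustration):

import numpy as np

def _safe_fancy_index_h5py_sketch(dataset, idx):
    # np.unique returns sorted, duplicate-free indices plus an inverse
    # mapping such that unique[inverse] reconstructs idx.
    unique, inverse = np.unique(idx, return_inverse=True)
    # One h5py read with a valid (sorted, unique) selection ...
    result = dataset[unique]
    # ... then duplicate/reorder rows in memory to match the request.
    return result[inverse]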

codecov bot commented Jul 30, 2025

Codecov Report

❌ Patch coverage is 35.84906% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.43%. Comparing base (a1d6f17) to head (64af43e).

Files with missing lines | Patch % | Lines
src/anndata/_core/index.py | 27.65% | 34 Missing ⚠️

❌ Your project check has failed because the head coverage (66.43%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (a1d6f17) and HEAD (64af43e).

HEAD has 2 fewer uploads than BASE:

Flag | BASE (a1d6f17) | HEAD (64af43e)
     | 5              | 3
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2066       +/-   ##
===========================================
- Coverage   85.57%   66.43%   -19.14%     
===========================================
  Files          46       46               
  Lines        7092     7118       +26     
===========================================
- Hits         6069     4729     -1340     
- Misses       1023     2389     +1366     
Files with missing lines | Coverage Δ
src/anndata/_core/merge.py | 63.48% <100.00%> (-20.93%) ⬇️
src/anndata/_core/sparse_dataset.py | 83.28% <100.00%> (-9.39%) ⬇️
src/anndata/experimental/backed/_lazy_arrays.py | 83.19% <100.00%> (-8.41%) ⬇️
src/anndata/_core/index.py | 60.09% <27.65%> (-32.56%) ⬇️

... and 24 files with indirect coverage changes

@flying-sheep (Member) left a comment

Hi, this looks good, thank you!

I have a lot of little comments. Please tell me if you'd prefer that I do all this myself; it's fine with me!

@sjfleming (Contributor, author) commented

I took a shot at it. Let me know what you think, @flying-sheep! Thanks

@flying-sheep (Member) left a comment

This looks great!

I narrowed the types down a bit; please check if all my changes make sense:

Apart from one test function, _subset only ever gets called with a tuple of length 1 or 2 containing normalized indices (1D boolean arrays, 1D integer arrays, and slices).
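
For reference, a hedged sketch of what those narrowed index types could look like (the alias names are illustrative, not necessarily what ended up in anndata/_core/index.py):

import numpy as np

NormalizedIndex = slice | np.ndarray  # a 1D boolean array, 1D integer array, or slice
SubsetIndices = tuple[NormalizedIndex] | tuple[NormalizedIndex, NormalizedIndex]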

@sjfleming (Contributor, author) commented

Thanks @flying-sheep! Yes, I think your type changes make sense. Please check whether my comments and changes address your concerns, which identified a problem with my previous type hints for _apply_index_to_result.

@flying-sheep (Member) commented Sep 8, 2025

I see, so actually it should just be

result = cast("np.ndarray", dataset[processed_indices[0]])
result = result[:, *processed_indices[1:]]

right?

And if the first index is :, this will load the entire dataset into memory.

I wonder what the best approach is then. Maybe finding out which 1D slice operation reduces the data the most, then applying that first?
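
A hedged sketch of that idea for the 2D case (the helper name and selection criterion are illustrative; this assumes the indices are already normalized to sorted, duplicate-free integer arrays or slices, as earlier steps ensure):

import numpy as np

def _smallest_axis_first_sketch(dataset, processed_indices):
    def kept_fraction(idx, axis_len):
        # Fraction of the axis kept by this index; reading along the most
        # selective axis first minimizes how much is pulled from disk.
        if isinstance(idx, slice):
            return len(range(*idx.indices(axis_len))) / axis_len
        return len(idx) / axis_len

    fractions = [
        kept_fraction(idx, n) for idx, n in zip(processed_indices, dataset.shape)
    ]
    axis = int(np.argmin(fractions))
    # h5py allows one fancy index combined with slices, so read along the
    # most selective axis first ...
    first = tuple(
        idx if i == axis else slice(None) for i, idx in enumerate(processed_indices)
    )
    result = np.asarray(dataset[first])
    # ... then apply the remaining indices to the smaller in-memory array.
    rest = tuple(
        slice(None) if i == axis else idx for i, idx in enumerate(processed_indices)
    )
    return result[rest]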

@flying-sheep flying-sheep added this to the 0.12.2 milestone Sep 8, 2025
@flying-sheep flying-sheep changed the title fancy indexing fixes backed h5py error fix: fancy indexing fixes backed h5py error Sep 8, 2025
@sjfleming (Contributor, author) commented

Hi @flying-sheep, thanks for the simplification; much better. I have again made it slightly more complicated by first slicing with the indexer that makes the dataset smallest (as you suggested)... let me know what you think.

@flying-sheep (Member) commented

Looks good to me! Maybe @ilan-gold should have a look in case I missed something.

@ilan-gold (Contributor) left a comment

It seems like we're doing some unnecessary unique calls, no? _subset_dataset calls _index_order_and_inverse and checks its outputs for duplicates; if duplicates are present, _safe_fancy_index_h5py then checks for them again, calling unique twice. Do I have this right? Is there any way to simplify this?

if axis_idx.dtype == bool:
    axis_idx = np.flatnonzero(axis_idx)
order = np.argsort(axis_idx)
return axis_idx[order], np.argsort(order)
A contributor commented:

Suggested change
return axis_idx[order], np.argsort(order)
return axis_idx[order], np.arange(len(order))

Isn't order already sorted, so argsort would just produce an arange?

@sjfleming (author) replied:

order is not already sorted here; it's the index order that sorts axis_idx. I did try the suggested change above, but the tests no longer passed.
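
For concreteness, a small example (not from the PR): order is the permutation that sorts axis_idx, and np.argsort(order) is its inverse, which only equals an arange when axis_idx was already sorted:

import numpy as np

axis_idx = np.array([3, 0, 2])      # unsorted request
order = np.argsort(axis_idx)        # array([1, 2, 0]) -- not an arange
axis_idx[order]                     # array([0, 2, 3]) -- sorted, safe for h5py
np.argsort(order)                   # array([2, 0, 1]) -- the inverse permutation
axis_idx[order][np.argsort(order)]  # array([3, 0, 2]) -- original order restored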

Comment on lines 278 to 284
return (
    # Has duplicates - use unique + inverse mapping approach
    np.unique(idx, return_inverse=True)
    if len(np.unique(idx)) != len(idx)
    # No duplicates - just sort and track reverse mapping
    else _index_order_and_inverse(idx)
)
A contributor commented:

Suggested change
return (
    # Has duplicates - use unique + inverse mapping approach
    np.unique(idx, return_inverse=True)
    if len(np.unique(idx)) != len(idx)
    # No duplicates - just sort and track reverse mapping
    else _index_order_and_inverse(idx)
)
unique, inverse = np.unique(idx, return_inverse=True)
return (
    # Has duplicates - use unique + inverse mapping approach
    (unique, inverse)
    if len(unique) != len(idx)
    # No duplicates - just sort and track reverse mapping
    else _index_order_and_inverse(idx)
)

@sjfleming (author) replied:

Yes, you are correct here that there is no reason to call np.unique() twice. I implemented this suggestion.

@sjfleming (Contributor, author) commented

@ilan-gold I think I see what you mean about checking for duplicates twice. My thinking was that _safe_fancy_index_h5py should check for duplicates on its own, in case it's ever called by another function for another purpose in the future. I also wanted to modify the existing functionality as little as possible, so _subset_dataset only calls any of the new code when it needs to; that's why I check for duplicates there as well. It could be handled differently, though.

@ilan-gold (Contributor) commented

@sjfleming Would you be up for adding a benchmark as well? It lives in our benchmarks folder. I'm just a little concerned about all the np.unique and np.argsort calls that now run by default (I count 6). At the moment you could probably only add one for non-duplicated indexing, but even that would be helpful given the new default behavior.

Aside from that, my only comment would be that you could potentially add a flag to _safe_fancy_index_h5py to indicate that its input has already been checked for duplicates.
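
A hedged sketch of that flag idea (the parameter name and internal structure are illustrative, not the PR's actual code):

import numpy as np

def _safe_fancy_index_h5py(dataset, idx, *, known_duplicated=False):
    # known_duplicated=True means the caller (e.g. _subset_dataset) already
    # found duplicates, so skip the re-check and use unique + inverse directly.
    if not known_duplicated and len(np.unique(idx)) == len(idx):
        # No duplicates: a sorted, unique fancy index is valid in h5py.
        order = np.argsort(idx)
        return dataset[idx[order]][np.argsort(order)]
    unique, inverse = np.unique(idx, return_inverse=True)
    return dataset[unique][inverse]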

@ilan-gold ilan-gold modified the milestones: 0.12.2, 0.12.3, 0.12.4 Oct 15, 2025
Successfully merging this pull request may close these issues:

duplicated indices when slicing dense backed view lead to .to_memory() TypeError