fix: fancy indexing fixes backed h5py error #2066
Conversation
Codecov Report

❌ Patch coverage is …
❌ Your project check has failed because the head coverage (66.43%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2066      +/-   ##
===========================================
- Coverage   85.57%   66.43%   -19.14%
===========================================
  Files          46       46
  Lines        7092     7118       +26
===========================================
- Hits         6069     4729     -1340
- Misses       1023     2389     +1366
```
Hi, this looks good, thank you!
I have a lot of little comments. Please tell me if you'd prefer that I make all these changes myself; that's fine with me!
I took a shot at it. Let me know what you think, @flying-sheep! Thanks.
This looks great!
I narrowed the types down a bit; please check whether all my changes make sense:
Apart from one test function, _subset only ever gets called with a tuple of length 1 or 2 containing normalized indices (1D boolean arrays, 1D integer arrays, and slices).
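For reference, that narrowed contract could be written roughly like this (a sketch with hypothetical alias names, not the PR's actual annotations):

```python
import numpy as np
from numpy.typing import NDArray

# One normalized per-axis indexer: a 1D boolean mask,
# a 1D integer array, or a slice.
NormalizedIndex = NDArray[np.bool_] | NDArray[np.integer] | slice

# _subset receives a tuple of length 1 or 2 of these.
SubsetIndex = (
    tuple[NormalizedIndex]
    | tuple[NormalizedIndex, NormalizedIndex]
)
```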
Thanks @flying-sheep! Yes, I think your type changes make sense. Please check and see whether my comments and changes address your concerns (which identified a problem with my previous type hints for …).
I see, so actually it should just be

```python
result = cast("np.ndarray", dataset[processed_indices[0]])
result = result[:, *processed_indices[1:]]
```

right? And if the first index is …, I wonder what the best approach is then. Maybe finding out which 1D slice operation reduces the data the most, then applying that first?
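Written out as a self-contained sketch (assuming processed_indices holds h5py-safe per-axis indexers; the starred subscript needs Python 3.11+):

```python
from typing import cast

import h5py
import numpy as np

def two_step_index(dataset: h5py.Dataset, processed_indices: tuple) -> np.ndarray:
    # Apply the first (row) indexer against the backed dataset; this
    # pulls only the selected rows into memory as a numpy array.
    result = cast("np.ndarray", dataset[processed_indices[0]])
    # Apply any remaining indexers to the in-memory result.
    return result[:, *processed_indices[1:]]
```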
Hi @flying-sheep, thanks for the simplification, much better. I have again made it slightly more complicated by first slicing with the indexer that will make the dataset the smallest (as you suggested)... let me know what you think.
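A minimal sketch of that "smallest first" heuristic (hypothetical names; it assumes each per-axis indexer is already h5py-safe, i.e. sorted and duplicate-free):

```python
import numpy as np

def _result_size(idx, axis_len):
    # How many entries survive this indexer along its axis.
    if isinstance(idx, slice):
        return len(range(*idx.indices(axis_len)))
    idx = np.asarray(idx)
    return int(idx.sum()) if idx.dtype == bool else len(idx)

def smallest_first_index(dataset, indices):
    # Apply the indexer that shrinks the data the most against the
    # backed dataset first, then the rest against the in-memory result.
    sizes = [_result_size(idx, n) for idx, n in zip(indices, dataset.shape)]
    first = int(np.argmin(sizes))
    on_disk = tuple(
        idx if ax == first else slice(None) for ax, idx in enumerate(indices)
    )
    in_memory = tuple(
        slice(None) if ax == first else idx for ax, idx in enumerate(indices)
    )
    return dataset[on_disk][in_memory]
```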
Looks good to me! Maybe @ilan-gold should have a look in case I missed something.
It seems like we're doing some unnecessary unique calls, no? _subset_dataset calls _index_order_and_inverse and checks its outputs for duplicates; if they are there, _safe_fancy_index_h5py then checks for duplicates again, calling unique twice. Do I have this right? Is there any way to simplify this?
```python
if axis_idx.dtype == bool:
    axis_idx = np.flatnonzero(axis_idx)
order = np.argsort(axis_idx)
return axis_idx[order], np.argsort(order)
```
Suggested change:

```diff
-return axis_idx[order], np.argsort(order)
+return axis_idx[order], np.arange(len(order))
```
Isn't order already sorted, so argsort would just do an arange?
order is not already sorted here; it's just the index order that sorts axis_idx. I did try the suggested change above, but the result was that tests no longer passed.
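A tiny numeric example of that round trip (illustrative only):

```python
import numpy as np

axis_idx = np.array([3, 0, 2])   # unsorted, duplicate-free request
order = np.argsort(axis_idx)     # [1, 2, 0]: sorts axis_idx
sorted_idx = axis_idx[order]     # [0, 2, 3]: safe for h5py
inverse = np.argsort(order)      # [2, 0, 1]: not arange(3)

# Reading with sorted_idx and then applying `inverse` restores
# the originally requested order:
assert (sorted_idx[inverse] == axis_idx).all()
```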
```python
return (
    # Has duplicates - use unique + inverse mapping approach
    np.unique(idx, return_inverse=True)
    if len(np.unique(idx)) != len(idx)
    # No duplicates - just sort and track reverse mapping
    else _index_order_and_inverse(idx)
)
```
Suggested change:

```diff
-return (
-    # Has duplicates - use unique + inverse mapping approach
-    np.unique(idx, return_inverse=True)
-    if len(np.unique(idx)) != len(idx)
-    # No duplicates - just sort and track reverse mapping
-    else _index_order_and_inverse(idx)
-)
+unique, inverse = np.unique(idx, return_inverse=True)
+return (
+    # Has duplicates - use unique + inverse mapping approach
+    (unique, inverse)
+    if len(unique) != len(idx)
+    # No duplicates - just sort and track reverse mapping
+    else _index_order_and_inverse(idx)
+)
```
Yes, you are correct that there is no reason to call np.unique() twice. I implemented this suggestion.
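For reference, a single np.unique call with return_inverse=True yields both pieces at once (illustrative):

```python
import numpy as np

idx = np.array([2, 0, 2, 1])                        # has a duplicate
unique, inverse = np.unique(idx, return_inverse=True)
# unique  -> [0, 1, 2]     sorted, duplicate-free: valid for h5py
# inverse -> [2, 0, 2, 1]  position of each idx entry in `unique`
assert (unique[inverse] == idx).all()
```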
@ilan-gold I think I see what you mean about checking for duplicates twice. I guess my thinking was that the …
@sjfleming Would you be up for adding a benchmark as well? It's in our benchmarks folder; I'm just a little concerned about all the … Aside from that, my only comment would be that you could potentially add a flag to …
This is a first stab at fixing #2064 by adding _safe_fancy_index_h5py (and 3 related helper functions) to anndata/_core/index.py. The function _safe_fancy_index_h5py only gets called in the case where there are repeated indices being requested (this is the only case that is currently causing a bug, so in all other cases the existing code -- d[tuple(ordered)][tuple(rev_order)] -- is what runs).
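For orientation, a condensed one-axis sketch of the pattern described above (illustrative names, not the actual PR code):

```python
import numpy as np

def safe_fancy_index_1d(dataset, idx):
    # h5py only accepts sorted, duplicate-free fancy indices, so read
    # with the deduplicated sorted indices and re-expand in memory.
    idx = np.asarray(idx)
    unique, inverse = np.unique(idx, return_inverse=True)
    rows = dataset[unique]   # safe backed read: sorted, no duplicates
    return rows[inverse]     # restore requested order and duplicates
```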