-
Notifications
You must be signed in to change notification settings - Fork 175
Add adapt_vars_like to align .var between AnnData objects (issue #1697) #1986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1986 +/- ##
==========================================
+ Coverage 83.34% 85.12% +1.77%
==========================================
Files 47 46 -1
Lines 6856 7003 +147
==========================================
+ Hits 5714 5961 +247
+ Misses 1142 1042 -100
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add tests as well! Great start, on the right track no doubt, and the implementation itself looks good for what it is, but let's take it a step further!
Main comment (besides tests) would be to use the ReIndexer
class to apply the logic to all parts of the AnnData
object. The function you end up writing will likely look somewhat similar to anndata.concatenate
in that you'll need to apply ReIndexer
to var
, X
, layers
, varm
and varp
(the later three recursively to their respective sub-elements).
src/anndata/utils.py
Outdated
# source = AnnData object that defines the desired genes | ||
# target = the data you want to reshape to match source | ||
# fill_vlaue = what value to use for missing genes (default set to 0.0) | ||
# returns a new AnnData object with the same genes as source | ||
""" | ||
Make target have the same .var (genes) as source., missing genes are filled with fill_value. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try copying the format here of other docstrings and add this to the public API i.e., in docs/api.md
. The CI job (or locally you) can then check the doc rendering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the docstring format, but I’m still a bit unsure about how api.md
is structured. I did add a section for the function, but I’d really appreciate it if you could take a quick look and let me know if anything needs adjusting.
src/anndata/utils.py
Outdated
# initializing a new dense np array of shape (number of target cells, number of genes in source) | ||
# filled with fill_value | ||
# this will become the new .X matrix. | ||
# It makes sure all genes in source are represented, and placeholders are ready for copying shared ones | ||
new_x = np.full((target.n_obs, new_var.shape[0]), fill_value, dtype=target.X.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use the Reindexer
class we have in src/anndata/_core/merge.py
to handle the reindexing logic. You'll just need to pass in the old/new indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will allow us to handle different array types and dataframes. Check out in that class how many different arrays there are, it's non-trivial definitely!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did end up switching, but maybe it is a bit too general now. Could you possibly review that as well?
for more information, see https://pre-commit.ci
…-k510/anndata into gene_panel_selection_1697
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome stuff :) The test cases are super clear, thanks for them.
src/anndata/utils.py
Outdated
if not source.var_names.isin(target.var_names).all(): | ||
# manual fix | ||
# computing the list of genes that are in source and target | ||
shared = source.var_names.intersection(target.var_names) | ||
# getting positions of the shared genes in source and target | ||
source_idx = new_var.index.get_indexer(shared) | ||
target_idx = target.var_names.get_indexer(shared) | ||
# creating a new matrix of shape (number of cells, number of genes in source) | ||
# filled with the fill_value | ||
new_x = np.full((target.n_obs, new_var.shape[0]), fill_value) | ||
# for the genes that are in both source and target, copy over the values | ||
new_x[:, source_idx] = target.X[:, target_idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does ReIndexer
not also handle this? Might be worth adding this case to the ReIndexer
in that case. In any case, we need to do this "operation" over all parts of the AnnData
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double-checked, and it does. I rewrote this function, so if you wouldn't mind checking it, I'd appreciate it!
src/anndata/utils.py
Outdated
# creates a new AnnData object with the new .X and .var | ||
# .X is the filled new_x array | ||
# .obs is a copy of the target.obs | ||
# .var is copied from source.var, making sure alignment of gene annotations | ||
new_adata = AnnData(X=new_x, obs=target.obs.copy(), var=new_var) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll want to do the whole AnnData
object. So you'll need to use the ReIndexer
(which operates on all types of matrices, dataframes etc) on all the parts of the object, I think, something like
reindexer = Reindexer(new_var.index, target.var.index)
AnnData(X=reindexer(target.X, fill_value=fill_value), obs=reindexer(target.obs, fill_value=fill_value), obsm={k: reindexer(v, fill_value=fill_value) for k, v in obsm.items()}...)
and so forth. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check out the gen_adata
function for a way to create "full" anndata objects that have all the different types and such to really thoroughly test that we are handling everything.
Otherwise, this is looking great. Getting close to being done
src/anndata/utils.py
Outdated
# otherwise I just create a dummy matrix of the right shape filled with a constant value | ||
new_X = np.full((target.n_obs, len(new_var)), fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to create a new X
, I think. Things should work in its absence, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, Reindexer(target.X)
won’t crash if target.X
is None
, it’ll just return None
. But the point of the if/else is to make sure new_X
is always a proper matrix with the right shape, filled with fill_value
, so things downstream don’t break. If we drop the if, we’d end up passing None into places that probably assume a real array. That’s how I interpreted it at least but let me know if that assumption doesn’t hold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think my point here was that if target
has no X
we should probably not create an empty one. We can just set X=None
when declaring the new AnnData
instead of https://github.com/scverse/anndata/pull/1986/files#diff-22197e419767db6d7078531198e8c055d27d35510281a164e8343ec48fa9a938R523 setting it to this np.full
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I addressed that issue!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some more tests like when target.X is None
and "full" object tests from gen_adata
, but other than that, this PR is looking pretty good
src/anndata/utils.py
Outdated
# otherwise I just create a dummy matrix of the right shape filled with a constant value | ||
new_X = np.full((target.n_obs, len(new_var)), fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think my point here was that if target
has no X
we should probably not create an empty one. We can just set X=None
when declaring the new AnnData
instead of https://github.com/scverse/anndata/pull/1986/files#diff-22197e419767db6d7078531198e8c055d27d35510281a164e8343ec48fa9a938R523 setting it to this np.full
Fixes #1697 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a raw test (code coverage is complaining) and merge those two tests, but other than that, this looks great.
), | ||
], | ||
) | ||
def test_adapt_vars_with_fill_value(source, target, fill_value, expected_X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this test and test_adapt_vars
and have a fill value of None
for the test_adapt_vars
ones in the param
This adds a helper function
adapt_vars_like
that makes sure the.var
(i.e., gene metadata) of a target AnnData matches that of a source object. It copies over .obs from the target and reindexes the data matrix .X so that any missing genes are filled with a default value (by default I set it to 0.0). Useful when working with datasets that have different gene sets and you need to bring them to a shared space. For example, before concatenation. Helps avoid issues when downstream functions expect identical .var across objects.