Skip to content

Conversation

amalia-k510
Copy link
Contributor

This adds a helper function adapt_vars_like that makes sure the .var (i.e., gene metadata) of a target AnnData matches that of a source object. It copies over .obs from the target and reindexes the data matrix .X so that any missing genes are filled with a default value (by default I set it to 0.0). Useful when working with datasets that have different gene sets and you need to bring them to a shared space. For example, before concatenation. Helps avoid issues when downstream functions expect identical .var across objects.

@amalia-k510 amalia-k510 marked this pull request as ready for review May 14, 2025 11:08
@amalia-k510 amalia-k510 changed the title dd adapt_vars_like to align .var between AnnData objects (issue #1697) Add adapt_vars_like to align .var between AnnData objects (issue #1697) May 14, 2025
Copy link

codecov bot commented May 14, 2025

Codecov Report

Attention: Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.

Project coverage is 85.12%. Comparing base (b860cdb) to head (ac4c067).
Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
src/anndata/utils.py 88.23% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1986      +/-   ##
==========================================
+ Coverage   83.34%   85.12%   +1.77%     
==========================================
  Files          47       46       -1     
  Lines        6856     7003     +147     
==========================================
+ Hits         5714     5961     +247     
+ Misses       1142     1042     -100     
Files with missing lines Coverage Δ
src/anndata/__init__.py 100.00% <ø> (ø)
src/anndata/typing.py 100.00% <ø> (ø)
src/anndata/utils.py 86.36% <88.23%> (-0.75%) ⬇️

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add tests as well! Great start, on the right track no doubt, and the implementation itself looks good for what it is, but let's take it a step further!

Main comment (besides tests) would be to use the ReIndexer class to apply the logic to all parts of the AnnData object. The function you end up writing will likely look somewhat similar to anndata.concatenate in that you'll need to apply ReIndexer to var, X, layers, varm and varp (the later three recursively to their respective sub-elements).

Comment on lines 458 to 464
# source = AnnData object that defines the desired genes
# target = the data you want to reshape to match source
# fill_vlaue = what value to use for missing genes (default set to 0.0)
# returns a new AnnData object with the same genes as source
"""
Make target have the same .var (genes) as source., missing genes are filled with fill_value.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try copying the format here of other docstrings and add this to the public API i.e., in docs/api.md. The CI job (or locally you) can then check the doc rendering

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the docstring format, but I’m still a bit unsure about how api.md is structured. I did add a section for the function, but I’d really appreciate it if you could take a quick look and let me know if anything needs adjusting.

Comment on lines 475 to 479
# initializing a new dense np array of shape (number of target cells, number of genes in source)
# filled with fill_value
# this will become the new .X matrix.
# It makes sure all genes in source are represented, and placeholders are ready for copying shared ones
new_x = np.full((target.n_obs, new_var.shape[0]), fill_value, dtype=target.X.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use the Reindexer class we have in src/anndata/_core/merge.py to handle the reindexing logic. You'll just need to pass in the old/new indices

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will allow us to handle different array types and dataframes. Check out in that class how many different arrays there are, it's non-trivial definitely!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did end up switching, but maybe it is a bit too general now. Could you possibly review that as well?

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff :) The test cases are super clear, thanks for them.

Comment on lines 499 to 510
if not source.var_names.isin(target.var_names).all():
# manual fix
# computing the list of genes that are in source and target
shared = source.var_names.intersection(target.var_names)
# getting positions of the shared genes in source and target
source_idx = new_var.index.get_indexer(shared)
target_idx = target.var_names.get_indexer(shared)
# creating a new matrix of shape (number of cells, number of genes in source)
# filled with the fill_value
new_x = np.full((target.n_obs, new_var.shape[0]), fill_value)
# for the genes that are in both source and target, copy over the values
new_x[:, source_idx] = target.X[:, target_idx]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does ReIndexer not also handle this? Might be worth adding this case to the ReIndexer in that case. In any case, we need to do this "operation" over all parts of the AnnData object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double-checked, and it does. I rewrote this function, so if you wouldn't mind checking it, I'd appreciate it!

Comment on lines 515 to 519
# creates a new AnnData object with the new .X and .var
# .X is the filled new_x array
# .obs is a copy of the target.obs
# .var is copied from source.var, making sure alignment of gene annotations
new_adata = AnnData(X=new_x, obs=target.obs.copy(), var=new_var)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to do the whole AnnData object. So you'll need to use the ReIndexer (which operates on all types of matrices, dataframes etc) on all the parts of the object, I think, something like

reindexer = Reindexer(new_var.index, target.var.index)

AnnData(X=reindexer(target.X, fill_value=fill_value), obs=reindexer(target.obs, fill_value=fill_value), obsm={k: reindexer(v, fill_value=fill_value) for k, v in obsm.items()}...)

and so forth. Does that make sense?

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check out the gen_adata function for a way to create "full" anndata objects that have all the different types and such to really thoroughly test that we are handling everything.

Otherwise, this is looking great. Getting close to being done

Comment on lines 498 to 499
# otherwise I just create a dummy matrix of the right shape filled with a constant value
new_X = np.full((target.n_obs, len(new_var)), fill_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to create a new X, I think. Things should work in its absence, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Reindexer(target.X) won’t crash if target.X is None, it’ll just return None. But the point of the if/else is to make sure new_X is always a proper matrix with the right shape, filled with fill_value, so things downstream don’t break. If we drop the if, we’d end up passing None into places that probably assume a real array. That’s how I interpreted it at least but let me know if that assumption doesn’t hold.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think my point here was that if targethas no X we should probably not create an empty one. We can just set X=None when declaring the new AnnData instead of https://github.com/scverse/anndata/pull/1986/files#diff-22197e419767db6d7078531198e8c055d27d35510281a164e8343ec48fa9a938R523 setting it to this np.full

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I addressed that issue!

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some more tests like when target.X is None and "full" object tests from gen_adata, but other than that, this PR is looking pretty good

Comment on lines 498 to 499
# otherwise I just create a dummy matrix of the right shape filled with a constant value
new_X = np.full((target.n_obs, len(new_var)), fill_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think my point here was that if targethas no X we should probably not create an empty one. We can just set X=None when declaring the new AnnData instead of https://github.com/scverse/anndata/pull/1986/files#diff-22197e419767db6d7078531198e8c055d27d35510281a164e8343ec48fa9a938R523 setting it to this np.full

@ilan-gold ilan-gold added this to the 0.13.0 milestone Jun 3, 2025
@ilan-gold
Copy link
Contributor

Fixes #1697

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a raw test (code coverage is complaining) and merge those two tests, but other than that, this looks great.

),
],
)
def test_adapt_vars_with_fill_value(source, target, fill_value, expected_X):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this test and test_adapt_vars and have a fill value of None for the test_adapt_vars ones in the param

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants