Add adapt_vars_like to align .var between AnnData objects (issue #1697) #1986

amalia-k510 · 2025-05-14T11:06:58Z

This adds a helper function, adapt_vars_like, that makes the .var (i.e., gene metadata) of a target AnnData match that of a source object. It keeps .obs from the target and reindexes the data matrix .X so that any genes missing from the target are filled with a default value (0.0 by default). This is useful when datasets have different gene sets and need to be brought into a shared space, for example before concatenation, and it helps avoid issues when downstream functions expect identical .var across objects.
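
As a rough illustration of the behaviour described above (not the PR's implementation), the column alignment can be sketched with plain NumPy. The helper name `align_columns` and the toy gene names are assumptions made for this example only.

```python
import numpy as np


def align_columns(X, target_genes, source_genes, fill_value=0.0):
    """Return a new matrix whose columns follow source_genes.

    Hypothetical helper for illustration: genes that are absent from
    target_genes become columns filled with fill_value.
    """
    position = {gene: j for j, gene in enumerate(target_genes)}
    dtype = np.result_type(X.dtype, np.asarray(fill_value).dtype)
    out = np.full((X.shape[0], len(source_genes)), fill_value, dtype=dtype)
    for j_new, gene in enumerate(source_genes):
        j_old = position.get(gene)
        if j_old is not None:
            # Shared gene: copy the column from the target matrix.
            out[:, j_new] = X[:, j_old]
    return out


# Two cells measured on genes ["b", "a"], aligned to the panel ["a", "b", "c"].
X = np.array([[1.0, 2.0], [3.0, 4.0]])
aligned = align_columns(X, ["b", "a"], ["a", "b", "c"])
```

Here gene "c" is missing from the target, so its column holds only the fill value, while "a" and "b" are reordered to match the source.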

codecov · 2025-05-14T14:08:38Z

Codecov Report

Attention: Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.

Project coverage is 85.12%. Comparing base (b860cdb) to head (ac4c067).
Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/anndata/utils.py | 88.23% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1986      +/-   ##
==========================================
+ Coverage   83.34%   85.12%   +1.77%     
==========================================
  Files          47       46       -1     
  Lines        6856     7003     +147     
==========================================
+ Hits         5714     5961     +247     
+ Misses       1142     1042     -100

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/anndata/__init__.py | `100.00% <ø> (ø)` | |
| src/anndata/typing.py | `100.00% <ø> (ø)` | |
| src/anndata/utils.py | `86.36% <88.23%> (-0.75%)` | ⬇️ |

... and 23 files with indirect coverage changes

ilan-gold

Let's add tests as well! Great start, on the right track no doubt, and the implementation itself looks good for what it is, but let's take it a step further!

Main comment (besides tests) would be to use the Reindexer class to apply the logic to all parts of the AnnData object. The function you end up writing will likely look somewhat similar to anndata.concatenate in that you'll need to apply Reindexer to var, X, layers, varm and varp (the latter three recursively to their respective sub-elements).

ilan-gold · 2025-05-15T09:45:49Z

src/anndata/utils.py

+    # source = AnnData object that defines the desired genes
+    # target = the data you want to reshape to match source
+    # fill_value = what value to use for missing genes (default set to 0.0)
+    # returns a new AnnData object with the same genes as source
+    """
+    Make target have the same .var (genes) as source; missing genes are filled with fill_value.
+    """


Try copying the format of the other docstrings here, and add this to the public API, i.e., in docs/api.md. The CI job (or you, locally) can then check the doc rendering.

I updated the docstring format, but I’m still a bit unsure about how api.md is structured. I did add a section for the function, but I’d really appreciate it if you could take a quick look and let me know if anything needs adjusting.

ilan-gold · 2025-05-15T09:48:49Z

src/anndata/utils.py

+    # initializing a new dense np array of shape (number of target cells, number of genes in source)
+    # filled with fill_value
+    # this will become the new .X matrix.
+    # It makes sure all genes in source are represented, and placeholders are ready for copying shared ones
+    new_x = np.full((target.n_obs, new_var.shape[0]), fill_value, dtype=target.X.dtype)


I think you can use the Reindexer class we have in src/anndata/_core/merge.py to handle the reindexing logic. You'll just need to pass in the old/new indices

This will allow us to handle different array types and dataframes. Check out how many different array types that class handles; it's definitely non-trivial!

I did end up switching, but maybe it is a bit too general now. Could you possibly review that as well?

ilan-gold

Awesome stuff :) The test cases are super clear, thanks for them.

ilan-gold · 2025-05-19T09:54:21Z

src/anndata/utils.py

+    if not source.var_names.isin(target.var_names).all():
+        # manual fix
+        # computing the list of genes that are in source and target
+        shared = source.var_names.intersection(target.var_names)
+        # getting positions of the shared genes in source and target
+        source_idx = new_var.index.get_indexer(shared)
+        target_idx = target.var_names.get_indexer(shared)
+        # creating a new matrix of shape (number of cells, number of genes in source)
+        # filled with the fill_value
+        new_x = np.full((target.n_obs, new_var.shape[0]), fill_value)
+        # for the genes that are in both source and target, copy over the values
+        new_x[:, source_idx] = target.X[:, target_idx]


Does Reindexer not also handle this? If not, it might be worth adding this case to Reindexer. In any case, we need to do this "operation" over all parts of the AnnData object.

I double-checked, and it does. I rewrote this function, so if you wouldn't mind checking it, I'd appreciate it!

ilan-gold · 2025-05-19T09:55:11Z

src/anndata/utils.py

+    # creates a new AnnData object with the new .X and .var
+    # .X is the filled new_x array
+    # .obs is a copy of the target.obs
+    # .var is copied from source.var, making sure alignment of gene annotations
+    new_adata = AnnData(X=new_x, obs=target.obs.copy(), var=new_var)


We'll want to do the whole AnnData object. So you'll need to use the Reindexer (which operates on all types of matrices, dataframes, etc.) on all the parts of the object, I think, something like

    reindexer = Reindexer(new_var.index, target.var.index)
    AnnData(
        X=reindexer(target.X, fill_value=fill_value),
        obs=reindexer(target.obs, fill_value=fill_value),
        obsm={k: reindexer(v, fill_value=fill_value) for k, v in obsm.items()},
        ...
    )

and so forth. Does that make sense?
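
To make the "all parts of the object" idea concrete, here is a minimal sketch that applies one gene indexer to X, layers, varm and varp. It uses dense NumPy arrays only and does not use anndata's Reindexer class; `make_indexer` and `reindex_axis` are hypothetical helpers for this example, with -1 marking a missing gene in the same way as `pandas.Index.get_indexer`.

```python
import numpy as np


def make_indexer(old_names, new_names):
    """Position of each new name in old_names, or -1 if it is missing."""
    pos = {name: i for i, name in enumerate(old_names)}
    return np.array([pos.get(name, -1) for name in new_names])


def reindex_axis(arr, indexer, axis, fill_value=0.0):
    """Reorder arr along axis; entries where indexer == -1 get fill_value."""
    indexer = np.asarray(indexer)
    missing = indexer == -1
    dtype = np.result_type(arr.dtype, np.asarray(fill_value).dtype)
    # Take a placeholder (position 0) for missing entries, then overwrite them.
    taken = np.take(arr, np.where(missing, 0, indexer), axis=axis).astype(dtype)
    where = [slice(None)] * taken.ndim
    where[axis] = missing
    taken[tuple(where)] = fill_value
    return taken


old, new = ["g1", "g2"], ["g2", "g3", "g1"]
idx = make_indexer(old, new)

X = np.array([[1.0, 2.0]])                           # cells x genes
layers = {"counts": np.array([[10.0, 20.0]])}        # cells x genes
varm = {"pcs": np.array([[0.1], [0.2]])}             # genes x components
varp = {"corr": np.array([[1.0, 0.5], [0.5, 1.0]])}  # genes x genes

new_X = reindex_axis(X, idx, axis=1)
new_layers = {k: reindex_axis(v, idx, axis=1) for k, v in layers.items()}
new_varm = {k: reindex_axis(v, idx, axis=0) for k, v in varm.items()}
new_varp = {
    k: reindex_axis(reindex_axis(v, idx, axis=0), idx, axis=1)
    for k, v in varp.items()
}
```

The point of the sketch is the axis bookkeeping: X and layers are reindexed along their gene columns, varm along its rows, and varp along both axes. Sparse matrices, dataframes and other array types are deliberately left out here.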

ilan-gold

Check out the gen_adata function for a way to create "full" anndata objects that have all the different types and such to really thoroughly test that we are handling everything.

Otherwise, this is looking great. Getting close to being done

ilan-gold · 2025-05-28T14:05:40Z

src/anndata/utils.py

+        # otherwise I just create a dummy matrix of the right shape filled with a constant value
+        new_X = np.full((target.n_obs, len(new_var)), fill_value)


No need to create a new X, I think. Things should work in its absence, no?

Yeah, Reindexer(target.X) won’t crash if target.X is None, it’ll just return None. But the point of the if/else is to make sure new_X is always a proper matrix with the right shape, filled with fill_value, so things downstream don’t break. If we drop the if, we’d end up passing None into places that probably assume a real array. That’s how I interpreted it at least but let me know if that assumption doesn’t hold.

OK, I think my point here was that if target has no X, we should probably not create an empty one. We can just set X=None when declaring the new AnnData instead of setting it to this np.full: https://github.com/scverse/anndata/pull/1986/files#diff-22197e419767db6d7078531198e8c055d27d35510281a164e8343ec48fa9a938R523

Got it. I addressed that issue!
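
A small sketch of the behaviour agreed on in this thread, assuming dense NumPy input: when there is no X, None is passed through instead of building a filled matrix. The name `reindex_optional` is hypothetical and this is not the code that was merged.

```python
import numpy as np


def reindex_optional(X, indexer, fill_value=0.0):
    """Reindex the columns of X, or return None when there is no X.

    indexer[j] is the old column for new column j, or -1 if missing.
    """
    if X is None:
        # No data matrix on the target: do not invent one.
        return None
    indexer = np.asarray(indexer)
    dtype = np.result_type(X.dtype, np.asarray(fill_value).dtype)
    out = np.full((X.shape[0], indexer.size), fill_value, dtype=dtype)
    present = indexer != -1
    out[:, present] = X[:, indexer[present]]
    return out


no_x = reindex_optional(None, [1, -1, 0])
with_x = reindex_optional(np.array([[1.0, 2.0]]), [1, -1, 0])
```

The result can then be handed to the AnnData constructor as X either way, since X=None is the case the reviewer asks for when the target has no matrix.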

ilan-gold

Let's add some more tests like when target.X is None and "full" object tests from gen_adata, but other than that, this PR is looking pretty good

ilan-gold · 2025-06-03T09:07:46Z

Fixes #1697

ilan-gold

Let's add a raw test (code coverage is complaining) and merge those two tests, but other than that, this looks great.

ilan-gold · 2025-06-03T09:05:49Z

tests/test_utils.py

+        ),
+    ],
+)
+def test_adapt_vars_with_fill_value(source, target, fill_value, expected_X):


Let's merge this test with test_adapt_vars, using a fill value of None in the params for the cases that come from test_adapt_vars.
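
One way the merged test could look, sketched with `pytest.mark.parametrize`. The function under test is replaced by a hypothetical stand-in, `adapt_columns`. Treating `fill_value=None` as "use the default of 0.0" is an assumption made for this example, not something the thread confirms about the real function.

```python
import numpy as np
import pytest


def adapt_columns(X, old_names, new_names, fill_value=None):
    """Hypothetical stand-in for the function under test."""
    fill = 0.0 if fill_value is None else fill_value  # assumed default handling
    pos = {name: i for i, name in enumerate(old_names)}
    out = np.full((X.shape[0], len(new_names)), fill, dtype=float)
    for j, name in enumerate(new_names):
        if name in pos:
            out[:, j] = X[:, pos[name]]
    return out


@pytest.mark.parametrize(
    ("fill_value", "expected"),
    [
        pytest.param(None, [[2.0, 0.0]], id="default-fill"),
        pytest.param(-1.0, [[2.0, -1.0]], id="custom-fill"),
    ],
)
def test_adapt_columns(fill_value, expected):
    # One test body covers both the default and the explicit fill value.
    X = np.array([[1.0, 2.0]])
    result = adapt_columns(X, ["a", "b"], ["b", "c"], fill_value=fill_value)
    np.testing.assert_array_equal(result, np.array(expected))
```

Cases that previously lived in the plain test become rows with `None` as the fill value, so both behaviours share one test body.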

amalia-k510 added 2 commits May 14, 2025 12:40

gene panel selection feature

06bb519

comments fix

24da345

amalia-k510 marked this pull request as ready for review May 14, 2025 11:08

import error fix

74b3ebd

amalia-k510 changed the title ~~dd adapt_vars_like to align .var between AnnData objects (issue #1697)~~ Add adapt_vars_like to align .var between AnnData objects (issue #1697) May 14, 2025

import error fix and init script update to make new fxn accessible

aa2295f

ilan-gold reviewed May 15, 2025

amalia-k510 and others added 6 commits May 18, 2025 14:53

doc string fix and api.md added

16264bf

[pre-commit.ci] auto fixes from pre-commit.com hooks

46b728f

for more information, see https://pre-commit.ci

typo fix

c09cf63

Merge branch 'gene_panel_selection_1697' of https://github.com/amalia…

d1910d4

…-k510/anndata into gene_panel_selection_1697

Switch to reindex

0dd7878

tests and manual fix for the missing genes case

5d6279e

ilan-gold reviewed May 19, 2025

amalia-k510 added 4 commits May 26, 2025 14:10

test fix and comments

a14af7f

reindexer fix

a30c365

Update __init__.py

fa48833

import error and spelling error

11fdb50

ilan-gold reviewed May 28, 2025

amalia-k510 added 2 commits June 1, 2025 14:26

AxisStorable implementaiton

715fd0f

adding new_varp and new_obsp for consistency

63a2933

ilan-gold reviewed Jun 2, 2025

amalia-k510 added 2 commits June 2, 2025 16:35

target change to None and test for it

8c30646

testing all aspects

ac4c067

ilan-gold added this to the 0.13.0 milestone Jun 3, 2025

ilan-gold added the skip-gpu-ci label Jun 3, 2025

ilan-gold approved these changes Jun 3, 2025

Merge branch 'main' into gene_panel_selection_1697

3a65f62
