Skip to content

Fix/no mutate input pearson residuals#661

Closed
Arshammik wants to merge 2 commits into
scverse:mainfrom
Arshammik:fix/no-mutate-input-pearson-residuals
Closed

Fix/no mutate input pearson residuals#661
Arshammik wants to merge 2 commits into
scverse:mainfrom
Arshammik:fix/no-mutate-input-pearson-residuals

Conversation

@Arshammik
Copy link
Copy Markdown
Contributor

Summary

_check_gpu_X in src/rapids_singlecell/preprocessing/_utils.py is named and used as a validator, but when called with require_cf=True on a sparse matrix that was not in canonical format it had two defects:

  1. It silently mutated the caller's matrix in place. It called X.sort_indices() and X.sum_duplicates() on the object returned by _get_obs_rep _ i.e. on adata.X itself. The operation is value-preserving, but it reorders the indices, merges duplicate entries and changes nnz, so any code holding a reference to the matrix, hashing its buffers, or round-tripping it to disk sees it change underneath them.

  2. It returned inconsistently. That branch fell through with no return, so _check_gpu_X(..., require_cf=True) returned None for a non-canonical matrix and True for an already-canonical one _ different return values for identical calls.

Two callers were affected: pp.normalize_pearson_residuals and pp.highly_variable_genes(flavor='pearson_residuals').

Fix

The two concerns are split:

  • _check_gpu_X is now a pure validator. It drops the require_cf parameter, never modifies X, and returns True on every valid path (raising TypeError otherwise).
  • _ensure_canonical_format(X) (new) performs the canonicalization. When X is sparse and not already canonical it canonicalizes a copy and returns it, leaving the caller's matrix untouched. Dense and already-canonical inputs are returned unchanged with no copy.

The two require_cf=True call sites now call _check_gpu_X(X) followed by X = _ensure_canonical_format(X), so the GPU kernels still receive a canonical matrix while adata.X is no longer mutated.

Verification

Running the real _check_gpu_X and _ensure_canonical_format on a non-canonical CSR matrix (NVIDIA H100 80GB, CUDA 12.9):

  • _check_gpu_X(X) _ returns True, leaves X byte-for-byte unchanged.
  • _ensure_canonical_format(X) _ returns a canonical copy (same dense values); the original X keeps its original indices, data, nnz and has_canonical_format.
  • _ensure_canonical_format on an already-canonical matrix returns the same object _ no needless copy.

Changes

  • preprocessing/_utils.py _ _check_gpu_X becomes a pure validator; new _ensure_canonical_format.
  • preprocessing/_normalize.py _ normalize_pearson_residuals uses _check_gpu_X(X) + _ensure_canonical_format(X).
  • preprocessing/_hvg/_pearson_residuals.py _ same for the pearson_residuals HVG flavor.
  • tests/test_normalization.py _ adds test_normalize_pearson_residuals_preserves_input_matrix, which feeds a non-canonical sparse adata.X and asserts the matrix is unchanged after the call.
  • docs/release-notes/0.15.1.md _ bug-fix entry.

Arshammik added 2 commits May 15, 2026 15:54
`_check_gpu_X` in preprocessing/_utils.py is named and used as a
validator, but when called with `require_cf=True` on a sparse matrix
that was not in canonical format it silently mutated the caller's
matrix in place -- it called `X.sort_indices()` and `X.sum_duplicates()`
on the object returned by `_get_obs_rep`, i.e. on `adata.X` itself.
Two callers were affected: `pp.normalize_pearson_residuals` and
`pp.highly_variable_genes(flavor='pearson_residuals')`. The operation
is value-preserving, but it reorders the indices, merges duplicate
entries and changes `nnz`, so any code holding a reference to the
matrix, hashing its buffers, or round-tripping it to disk sees it
change underneath them.

The same branch also fell through without a `return`, so
`_check_gpu_X(..., require_cf=True)` returned `None` for a
non-canonical matrix and `True` for an already-canonical one -- an
inconsistent return value for identical calls.

Split the two concerns:

  * `_check_gpu_X` is now a pure validator. It drops the `require_cf`
    parameter, never modifies `X`, and returns `True` on every valid
    path (raising `TypeError` otherwise).
  * A new `_ensure_canonical_format(X)` performs the canonicalization.
    When `X` is sparse and not already canonical it canonicalizes a
    *copy* and returns it, leaving the caller's matrix untouched.
    Dense and already-canonical inputs are returned unchanged with no
    copy.

The two `require_cf=True` call sites now call `_check_gpu_X(X)`
followed by `X = _ensure_canonical_format(X)`, so the GPU kernels
still receive a canonical matrix while `adata.X` is no longer mutated.

Verified on an NVIDIA H100 80GB (CUDA 12.9): running the real
`_check_gpu_X` and `_ensure_canonical_format` on a non-canonical CSR
matrix confirms `_check_gpu_X` leaves the input unchanged and returns
`True`, and `_ensure_canonical_format` returns a canonical copy while
the original matrix's indices, data and nnz are preserved.
Adds `test_normalize_pearson_residuals_preserves_input_matrix` to
tests/test_normalization.py: it feeds a non-canonical sparse `adata.X`
(unsorted indices + a duplicate entry) to `pp.normalize_pearson_residuals`
with `inplace=False` and asserts the input matrix is byte-identical
afterwards -- its indices, data, nnz and `has_canonical_format` flag are
all unchanged. The test fails on the pre-fix code, which canonicalized
the matrix in place, and passes now that canonicalization happens on a
copy.

Also adds the corresponding `docs/release-notes/0.15.1.md` entry.
@Intron7
Copy link
Copy Markdown
Member

Intron7 commented May 16, 2026

Thanks for the work on this. After thinking it through and verifying, I'm going to close without merging — the premise doesn't hold up:

  1. CuPy's sparse ops auto-canonicalize the matrix in place. Confirmed locally: X @ Y, X @ X, X.sum(axis=1), and any other cuSPARSE-backed op on a non-canonical CSR flips has_canonical_format to True, merges duplicates (nnz drops), and replaces the indices/data buffers. So any caller who runs literally any sparse op on adata.X after normalize_pearson_residuals would see the same "mutation" regardless of this PR. The copy gives false protection.
  2. The mutation is mathematically invariant. sort_indices() + sum_duplicates() are value-preserving — same linear map, same toarray(), same @ results. Only the buffer layout changes.
  3. The copy has real cost. Duplicating indices + data on every call, for matrices anndata virtually always hands back canonical already, isn't worth defending against a semantically-invisible change that's going to happen anyway.

@Intron7 Intron7 closed this May 16, 2026
@Arshammik
Copy link
Copy Markdown
Contributor Author

Thanks for the review, I agree that the zero copy on this PR can cause false protection because of the matrix mutation. You're right that CuPy's cuSPARSE ops call sum_duplicates() in place on their operands, so the first sparse op on adata.X re-canonicalizes it anyway, and sort_indices/sum_duplicates are value-preserving regardless. Defending against it in normalize_pearson_residuals specifically doesn't really buy anything. I'll close this branch.

Before I do though, I wanted to float one smaller idea for the part that isn't about the copy. The thing that originally bugged me wasn't the canonicalization itself, it was that a function called _check_gpu_X silently mutates the caller's matrix, and on top of that it returns None for a non-canonical input but True for a canonical one because the require_cf branch falls through without a return.

What if _check_gpu_X just stayed a pure validator and raised on non-canonical input instead of fixing it? Something like

"Input matrix is not in canonical format, the Pearson-residual kernels need sorted indices with no duplicates, call X.sort_indices(); X.sum_duplicates() first."

Without any copy or mutation, and it returns True consistently on every valid path. It costs nothing on the common path since that's just a has_canonical_format check, and it doesn't pretend to protect anything, it just stops doing the canonicalization itself and makes the requirement explicit.

The downside is that it's a behavior change. Input that silently works today would start erroring. But you mentioned non-canonical input is rare in practice, so it shouldn't hit many people, and when it does they get a clear message instead of a silent buffer rewrite.

Totally fine if you'd rather leave the current behavior as is. Just wanted to check whether a raising validator like that would be worth a small standalone PR, or if you'd consider it not worth the churn.

Many thanks.

@Intron7
Copy link
Copy Markdown
Member

Intron7 commented May 16, 2026

Return value: I already fixed that in #664 — _check_gpu_X returns True on both branches now. To be honest it wasn't actually breaking anything (no caller looked at the return), but you're right that it was sloppy, so good for consistency either way.

Raise-on-non-canonical: this doesn't really fly given how CuPy actually treats the canonical-format state. It's a property the substrate flips back and forth as the user does ordinary things, in both directions:

  • cuSPARSE-backed ops silently canonicalize their operand: X @ Y, Y @ X, X.sum(axis=...), X.T @ Y, most things that touch cuSPARSE
  • Other ordinary ops silently produce non-canonical output: X[:, mask] on a CSR, X[rows, :] on a CSC, various fancy-indexing paths

So a user who does Xsub = adata.X[:, hvg_mask] ends up with a non-canonical matrix without doing anything explicit. They then pass Xsub into normalize_pearson_residuals and under the proposal hit a raised error telling them to call sort_indices() — for a reason they have no way to anticipate, because canonical format isn't user-controlled state in CuPy.

The "make the requirement explicit" framing assumes there's a precondition the user can reason about. There isn't — CuPy decides when it canonicalizes and when it doesn't, and the user can't keep track of it across a pipeline. The function doing the canonicalization itself matches how the rest of the CuPy sparse ecosystem behaves and is ~free on the common path. I'll leave it as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants