Fix/no mutate input pearson residuals #661
`_check_gpu_X` in preprocessing/_utils.py is named and used as a
validator, but when called with `require_cf=True` on a sparse matrix
that was not in canonical format it silently mutated the caller's
matrix in place -- it called `X.sort_indices()` and `X.sum_duplicates()`
on the object returned by `_get_obs_rep`, i.e. on `adata.X` itself.
Two callers were affected: `pp.normalize_pearson_residuals` and
`pp.highly_variable_genes(flavor='pearson_residuals')`. The operation
is value-preserving, but it reorders the indices, merges duplicate
entries and changes `nnz`, so any code holding a reference to the
matrix, hashing its buffers, or round-tripping it to disk sees it
change underneath them.
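
The in-place effect described above can be reproduced on the CPU. This is a minimal sketch that assumes SciPy's `csr_matrix` as a stand-in for the CuPy sparse matrix the PR deals with; the matrix contents and variable names are made up for illustration.

```python
import numpy as np
import scipy.sparse as sp

# Non-canonical CSR: row 0 has unsorted column indices and column 2 appears twice.
data = np.array([1.0, 2.0, 3.0, 4.0])
indices = np.array([2, 0, 2, 1], dtype=np.int32)
indptr = np.array([0, 3, 4], dtype=np.int32)
X = sp.csr_matrix((data, indices, indptr), shape=(2, 3))

alias = X  # a second reference, standing in for adata.X held by the caller
dense_before = X.toarray()
nnz_before = X.nnz
indices_before = X.indices.copy()

# The two calls named in the description; both operate in place.
X.sort_indices()
X.sum_duplicates()

# The dense values are the same, but the buffers seen through `alias` are not.
values_preserved = np.array_equal(alias.toarray(), dense_before)
nnz_after = alias.nnz
indices_changed = not np.array_equal(alias.indices, indices_before)
```

Here `nnz` drops from 4 to 3 because the duplicate entry in row 0 is merged, even though the dense values are identical.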
The same branch also fell through without a `return`, so
`_check_gpu_X(..., require_cf=True)` returned `None` for a
non-canonical matrix and `True` for an already-canonical one -- an
inconsistent return value for identical calls.
Split the two concerns:
* `_check_gpu_X` is now a pure validator. It drops the `require_cf`
parameter, never modifies `X`, and returns `True` on every valid
path (raising `TypeError` otherwise).
* A new `_ensure_canonical_format(X)` performs the canonicalization.
When `X` is sparse and not already canonical it canonicalizes a
*copy* and returns it, leaving the caller's matrix untouched.
Dense and already-canonical inputs are returned unchanged with no
copy.
The two `require_cf=True` call sites now call `_check_gpu_X(X)`
followed by `X = _ensure_canonical_format(X)`, so the GPU kernels
still receive a canonical matrix while `adata.X` is no longer mutated.
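
The split described above might look roughly like the following. This is a sketch only, not the code from the PR: it uses NumPy and SciPy types as CPU stand-ins for the CuPy ones, and the names `check_X_sketch` and `ensure_canonical_format_sketch` are hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def check_X_sketch(X):
    """Pure validator: never modifies X; returns True or raises TypeError."""
    if isinstance(X, (np.ndarray, sp.csr_matrix, sp.csc_matrix)):
        return True
    raise TypeError(f"unsupported matrix type: {type(X).__name__}")


def ensure_canonical_format_sketch(X):
    """Return X itself if dense or already canonical, else a canonical copy."""
    if not sp.issparse(X) or X.has_canonical_format:
        return X
    X = X.copy()        # copy first so the caller's buffers are never touched
    X.sort_indices()
    X.sum_duplicates()
    return X


# Call-site pattern: validate, then rebind the local name to the canonical matrix.
original = sp.csr_matrix(
    (
        np.array([1.0, 2.0, 3.0, 4.0]),
        np.array([2, 0, 2, 1], dtype=np.int32),
        np.array([0, 3, 4], dtype=np.int32),
    ),
    shape=(2, 3),
)
X = original
check_X_sketch(X)
X = ensure_canonical_format_sketch(X)
```

After the last line the local `X` is a new, canonical matrix, while `original` still holds its four stored entries in their original order.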
Verified on an NVIDIA H100 80GB (CUDA 12.9): running the real
`_check_gpu_X` and `_ensure_canonical_format` on a non-canonical CSR
matrix confirms `_check_gpu_X` leaves the input unchanged and returns
`True`, and `_ensure_canonical_format` returns a canonical copy while
the original matrix's indices, data and nnz are preserved.
Adds `test_normalize_pearson_residuals_preserves_input_matrix` to tests/test_normalization.py: it feeds a non-canonical sparse `adata.X` (unsorted indices + a duplicate entry) to `pp.normalize_pearson_residuals` with `inplace=False` and asserts the input matrix is byte-identical afterwards -- its indices, data, nnz and `has_canonical_format` flag are all unchanged. The test fails on the pre-fix code, which canonicalized the matrix in place, and passes now that canonicalization happens on a copy. Also adds the corresponding `docs/release-notes/0.15.1.md` entry.
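
The shape of such a test can be sketched without a GPU. The example below is not the test added in the PR: it assumes SciPy as a stand-in for CuPy, and `copying_op` and `mutating_op` are hypothetical stand-ins for the post-fix and pre-fix behavior of the function under test.

```python
import hashlib

import numpy as np
import scipy.sparse as sp


def fingerprint(X):
    """Hash the three CSR buffers and record nnz and the canonical-format flag."""
    h = hashlib.sha256()
    for buf in (X.data, X.indices, X.indptr):
        h.update(np.ascontiguousarray(buf).tobytes())
    return h.hexdigest(), X.nnz, X.has_canonical_format


def make_noncanonical():
    """Build a CSR matrix with unsorted indices and one duplicate entry."""
    data = np.array([1.0, 2.0, 3.0, 4.0])
    indices = np.array([2, 0, 2, 1], dtype=np.int32)
    indptr = np.array([0, 3, 4], dtype=np.int32)
    return sp.csr_matrix((data, indices, indptr), shape=(2, 3))


def copying_op(X):
    """Hypothetical post-fix behavior: canonicalize a copy."""
    Y = X.copy()
    Y.sum_duplicates()
    return Y


def mutating_op(X):
    """Hypothetical pre-fix behavior: canonicalize the input in place."""
    X.sum_duplicates()
    return X


X_ok = make_noncanonical()
before_ok = fingerprint(X_ok)
copying_op(X_ok)
unchanged = fingerprint(X_ok) == before_ok

X_bad = make_noncanonical()
before_bad = fingerprint(X_bad)
mutating_op(X_bad)
changed = fingerprint(X_bad) != before_bad
```

The same fingerprint comparison passes for the copying variant and fails for the mutating one, which is the property the test relies on.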
Thanks for the work on this. After thinking it through and verifying, I'm going to close without merging — the premise doesn't hold up:
Thanks for the review, I agree that the zero copy on this PR can cause false protection because of the matrix mutation. You're right that CuPy's cuSPARSE ops call

Before I do though, I wanted to float one smaller idea for the part that isn't about the copy. The thing that originally bugged me wasn't the canonicalization itself, it was that a function called What if "Input matrix is not in canonical format, the Pearson-residual kernels need sorted indices with no duplicates, call Without any copy or mutation, and it returns

The downside is that it's a behavior change. Input that silently works today would start erroring. But you mentioned non-canonical input is rare in practice, so it shouldn't hit many people, and when it does they get a clear message instead of a silent buffer rewrite.

Totally fine if you'd rather leave the current behavior as is. Just wanted to check whether a raising validator like that would be worth a small standalone PR, or if you'd consider it not worth the churn.

Many thanks.
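
For concreteness, a raising validator of the kind floated here could be sketched as follows. This is an illustration under assumptions, not code from either PR: it uses SciPy as a CPU stand-in for CuPy, the function name is hypothetical, the error message stops where the quoted one is cut off, and returning `True` on success is an assumption.

```python
import numpy as np
import scipy.sparse as sp


def check_canonical_or_raise_sketch(X):
    """Hypothetical raising validator: no copy, no mutation of X's buffers."""
    if sp.issparse(X) and not X.has_canonical_format:
        raise ValueError(
            "Input matrix is not in canonical format, the Pearson-residual "
            "kernels need sorted indices with no duplicates."
        )
    return True  # assumed return value on the valid path


# Non-canonical input: unsorted indices plus a duplicate in row 0.
X_noncanonical = sp.csr_matrix(
    (
        np.array([1.0, 2.0, 3.0, 4.0]),
        np.array([2, 0, 2, 1], dtype=np.int32),
        np.array([0, 3, 4], dtype=np.int32),
    ),
    shape=(2, 3),
)

try:
    check_canonical_or_raise_sketch(X_noncanonical)
    raised = False
except ValueError:
    raised = True

# A canonical copy passes the same check.
X_canonical = X_noncanonical.copy()
X_canonical.sum_duplicates()
ok = check_canonical_or_raise_sketch(X_canonical)
```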
Return value: I already fixed that in #664. `_check_gpu_X` returns `True` on both branches now. To be honest it wasn't actually breaking anything (no caller looked at the return), but you're right that it was sloppy, so good for consistency either way.

Raise-on-non-canonical: this doesn't really fly given how CuPy actually treats the canonical-format state. It's a property the substrate flips back and forth as the user does ordinary things, in both directions:

So a user who does

The "make the requirement explicit" framing assumes there's a precondition the user can reason about. There isn't: CuPy decides when it canonicalizes and when it doesn't, and the user can't keep track of it across a pipeline. The function doing the canonicalization itself matches how the rest of the CuPy sparse ecosystem behaves and is ~free on the common path. I'll leave it as is.
Summary
`_check_gpu_X` in `src/rapids_singlecell/preprocessing/_utils.py` is named and used as a validator, but when called with `require_cf=True` on a sparse matrix that was not in canonical format it had two defects:
* It silently mutated the caller's matrix in place. It called `X.sort_indices()` and `X.sum_duplicates()` on the object returned by `_get_obs_rep` -- i.e. on `adata.X` itself. The operation is value-preserving, but it reorders the indices, merges duplicate entries and changes `nnz`, so any code holding a reference to the matrix, hashing its buffers, or round-tripping it to disk sees it change underneath them.
* It returned inconsistently. That branch fell through with no `return`, so `_check_gpu_X(..., require_cf=True)` returned `None` for a non-canonical matrix and `True` for an already-canonical one -- different return values for identical calls.

Two callers were affected: `pp.normalize_pearson_residuals` and `pp.highly_variable_genes(flavor='pearson_residuals')`.

Fix
The two concerns are split:
* `_check_gpu_X` is now a pure validator. It drops the `require_cf` parameter, never modifies `X`, and returns `True` on every valid path (raising `TypeError` otherwise).
* `_ensure_canonical_format(X)` (new) performs the canonicalization. When `X` is sparse and not already canonical it canonicalizes a copy and returns it, leaving the caller's matrix untouched. Dense and already-canonical inputs are returned unchanged with no copy.

The two `require_cf=True` call sites now call `_check_gpu_X(X)` followed by `X = _ensure_canonical_format(X)`, so the GPU kernels still receive a canonical matrix while `adata.X` is no longer mutated.

Verification
Running the real `_check_gpu_X` and `_ensure_canonical_format` on a non-canonical CSR matrix (NVIDIA H100 80GB, CUDA 12.9):
* `_check_gpu_X(X)` -- returns `True`, leaves `X` byte-for-byte unchanged.
* `_ensure_canonical_format(X)` -- returns a canonical copy (same dense values); the original `X` keeps its original `indices`, `data`, `nnz` and `has_canonical_format`.
* `_ensure_canonical_format` on an already-canonical matrix returns the same object -- no needless copy.

Changes
* `preprocessing/_utils.py` -- `_check_gpu_X` becomes a pure validator; new `_ensure_canonical_format`.
* `preprocessing/_normalize.py` -- `normalize_pearson_residuals` uses `_check_gpu_X(X)` + `_ensure_canonical_format(X)`.
* `preprocessing/_hvg/_pearson_residuals.py` -- same for the `pearson_residuals` HVG flavor.
* `tests/test_normalization.py` -- adds `test_normalize_pearson_residuals_preserves_input_matrix`, which feeds a non-canonical sparse `adata.X` and asserts the matrix is unchanged after the call.
* `docs/release-notes/0.15.1.md` -- bug-fix entry.