Description
Hi rsc team — thanks a lot for the great work on rapids-singlecell!
I have three related questions:
- GPU-accelerated I/O for AnnData / other formats
Is there a plan or timeline to support GPU-accelerated reading of h5ad (and possibly loom/mtx) so that the data lands on the device without an extra host round-trip?
Even partial acceleration (e.g., reading into GPU-native sparse/dense arrays, or a zero-copy path for .X) would already help a lot for large datasets.
Any recommended interim best practices to minimize host↔device copies when starting from h5ad?
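In case it helps frame the question, this is the interim pattern I am using now: one host read, one explicit host-to-device copy. The cupyx call is commented out because it needs a GPU, and it reflects my understanding of that API rather than anything documented by rsc; a random SciPy matrix stands in for `adata.X`:

```python
import numpy as np
import scipy.sparse as sp

# Interim pattern: read h5ad on the host once (e.g. via anndata.read_h5ad),
# keep .X as float32 CSR, then do a single host->device copy up front.
# Here a random CSR matrix stands in for adata.X:
X_host = sp.random(1000, 500, density=0.05, format="csr", dtype=np.float32)

# On a GPU machine the single copy would be (my understanding of cupyx):
# import cupyx.scipy.sparse as cpsp
# X_dev = cpsp.csr_matrix(X_host)  # one H2D transfer; later ops stay on device

# Normalizing dtype/format on the host first avoids extra conversions later:
assert X_host.dtype == np.float32
assert sp.isspmatrix_csr(X_host)
```

Is this roughly the recommended shape, or is there a cheaper path?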
- Wilcoxon differential expression (PR #487: "Wilcoxon rank-sum addition to rank_genes_groups")
I noticed that Wilcoxon DE has been added here: #487
What’s the expected release or availability timeline?
Will the API mirror Scanpy’s tl.rank_genes_groups(..., method="wilcoxon") semantics (ties, groups vs. reference, dense/sparse support)?
Any constraints to be aware of (e.g., memory behavior on large CSR inputs, batching, multi-GPU)?
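To make the semantics question concrete, this is the per-gene behavior I would expect the GPU path to match: a two-sided rank-sum test of one group against a reference, here as a plain SciPy CPU reference (the group sizes and values are purely illustrative):

```python
import numpy as np
from scipy.stats import ranksums

# CPU reference for a single gene: two-sided Wilcoxon rank-sum test of the
# expression values in the target group vs. the reference group, which is
# what I understand Scanpy's rank_genes_groups(..., method="wilcoxon") does.
rng = np.random.default_rng(0)
group = rng.normal(loc=1.0, scale=1.0, size=30)      # gene in target cells
reference = rng.normal(loc=0.0, scale=1.0, size=40)  # same gene elsewhere

stat, pval = ranksums(group, reference)
# The questions above are about whether tie handling, group-vs-rest
# semantics, and sparse inputs follow this same convention on the GPU.
assert 0.0 <= pval <= 1.0
```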
- When to use rsc.get.anndata_to_GPU / rsc.get.anndata_to_CPU
I’m a bit unsure about when these calls are required vs. when rsc will handle device placement automatically.
Concretely:
Do rsc functions auto-move .X (and relevant matrices) to GPU if they detect host arrays, or should we always call rsc.get.anndata_to_GPU(adata) first?
What exactly gets moved by these helpers: only .X, or also layers, obsm (e.g., embeddings), and other arrays if present?
Expected GPU dtypes/structures: should .X live on the device as a dense CuPy array or as cupyx.scipy.sparse.csr_matrix? Any guidance on supported sparsity patterns?
Mixed workflows: if I run a CPU step (e.g., a Scanpy CPU-only function) and then a GPU step, do I need to call rsc.get.anndata_to_GPU(adata) again?
When is get.anndata_to_CPU(adata) recommended (e.g., for plotting, exporting, certain Scanpy ops)? Are there safeguards to avoid redundant copies?
A tiny MWE to clarify expectations would be super helpful.
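For concreteness, something along these lines is what I have in mind. The rsc/Scanpy calls are commented out because they need a GPU, and they reflect my current (possibly wrong) understanding of the API; only the toy host matrix actually runs here:

```python
import numpy as np
import scipy.sparse as sp

# Toy host data standing in for an AnnData .X:
X = sp.random(200, 100, density=0.05, format="csr", dtype=np.float32)

# Intended workflow (my assumptions, untested, needs a GPU):
# import anndata as ad, scanpy as sc, rapids_singlecell as rsc
# adata = ad.AnnData(X)
# rsc.get.anndata_to_GPU(adata)     # (a) required here, or automatic?
# rsc.pp.normalize_total(adata)     # (b) GPU step
# rsc.pp.log1p(adata)
# rsc.get.anndata_to_CPU(adata)     # (c) needed before a CPU-only step?
# sc.pl.highest_expr_genes(adata)   # (d) e.g. plotting on the host
# rsc.get.anndata_to_GPU(adata)     # (e) needed again after (d)?

assert sp.isspmatrix_csr(X) and X.dtype == np.float32
```

Confirming or correcting steps (a), (c), and (e) would answer most of my questions.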