remydubois
diff --git a/.github/workflows/python-package.yaml b/.github/workflows/python-package.yaml
Lines changed: 4 additions & 0 deletions
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 10 additions & 0 deletions
diff --git a/changelog.md b/changelog.md
Lines changed: 15 additions & 0 deletions
diff --git a/docs/out_of_core.md b/docs/out_of_core.md
Lines changed: 18 additions & 3 deletions
diff --git a/docs/results.md b/docs/results.md
Lines changed: 5 additions & 2 deletions
diff --git a/illico/__init__.py b/illico/__init__.py
Lines changed: 36 additions & 0 deletions
@@ -37,3 +37,7 @@ jobs:
     - name: Test with tox
       run: |
         python -m tox -e unit-tests
+    # Removed until docformatter is compatible with py314
+    # - name: docformatter
+    #   run: |
+    #     poetry run docformatter --check --recursive illico tests
@@ -35,3 +35,13 @@ repos:
     #   entry: bash -lc 'rustup component add clippy >/dev/null 2>&1 || true; cargo clippy --all -- -D warnings'
     #   language: system
     #   files: '\.rs$'
+- repo: https://github.com/PyCQA/docformatter
+  rev: v1.7.7
+  hooks:
+    - id: docformatter
+      name: docformatter
+      entry: docformatter
+      language: python
+      types: [python]
+      additional_dependencies: [tomli]
+      args: [--in-place, --config, ./pyproject.toml]
@@ -1,6 +1,21 @@
 Changelog
 =========
 
+Version 0.5.0rc1
+------------
+This version improves compatibility with `scanpy`:
+- Added an `n_genes` argument that returns only the top N genes per group when `return_as_scanpy=True`. This makes it possible to match `scanpy`'s sorting method (partial sort), which improves reproducibility of `scanpy` results.
+- Fixed gene ordering in the scanpy formatter by removing a redundant sort of perturbation names, since `encode_and_count_groups` already returns sorted unique perturbation names. This ensures that gene names are sorted the same way everywhere.
+- Added explicit testing of gene ordering. In the PBMC dataset, many genes end up with identical z-scores but different logfoldchanges, which previous tests did not catch.
+- Fold change is now computed as `(numerator + 1.e-9) / (denominator + 1.e-9)` to avoid division by zero and to be more consistent with scanpy's implementation. This has no effect on the ranking of genes, but yields finite fold-change values for all genes.
+
+It also includes some performance improvements:
+- Improved the CSR chunking mechanism for the OVO test, resulting in faster execution and a much smaller memory footprint. As a direct consequence, `batch_size` can now be much larger.
+    - On TAHOE's `plate3` (in RAM) with `batch_size=1024`, this reduced the memory footprint from 35GB to 1.5GB, and the runtime from 1:17 to 0:50 with 8 CPUs.
+    - The reduced footprint makes it possible to scale `n_threads` more aggressively. With 32 threads, TAHOE's `plate3` runs in 21 seconds while using only 2.5GB of RAM.
+
+Also, it adds support for the OVO test on lazy CSR (h5-based) datasets, through a specific parallelization scheme where groups are processed one by one.
+
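
The partial sort mentioned for `n_genes` can be illustrated with a minimal NumPy sketch. This is not taken from `illico` or `scanpy`; the function name `top_n_indices` is hypothetical, and the sketch only shows the general technique of partitioning first and then sorting the selected subset.

```python
import numpy as np


def top_n_indices(scores: np.ndarray, n_genes: int) -> np.ndarray:
    """Return the indices of the n_genes largest scores, highest first.

    Hypothetical helper for illustration; not part of illico or scanpy.
    """
    n_genes = min(n_genes, scores.size)
    if n_genes <= 0:
        return np.empty(0, dtype=np.intp)
    # Partial sort: the last n_genes slots hold the largest values, unordered.
    partition = np.argpartition(scores, -n_genes)[-n_genes:]
    # Fully sort only the selected subset, in descending order.
    order = np.argsort(scores[partition])[::-1]
    return partition[order]


scores = np.array([0.3, 2.5, -1.0, 4.2, 1.7])
top = top_n_indices(scores, 3)
```

When several genes share the same score, the order in which they come out depends on the partitioning step, which is one plausible reason why using the same sorting method matters for reproducing another tool's ordering.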
 Version 0.4.0
 ------------
 - Added option to return scanpy-friendly output with `return_as_scanpy` arg. `asymptotic_wilcoxon` returns either:
 
@@ -4,11 +4,26 @@ Although not initially designed to run out-of-core rank-sum tests, `illico` supp
 
 - h5-dense (np.ndarray) disk-backed datasets are natively supported
 - h5-CSC (sparse along the columns) disk-backed datasets are natively supported
-- :warning: **h5-CSR (sparse along the rows) disk-backed datasets are not supported**
+- h5-CSR (sparse along the rows) disk-backed datasets are natively supported **only for the OVO (perturbed vs controls) test**. If you want to run OVR (each group vs the rest) tests, you are better off loading the dataset entirely in memory: the OVR test requires each column to be entirely in RAM at once, and the CSR format cannot load columns from disk without silently loading the entire `.indices` array in RAM.
+
+If your data is backed through Dask or another backend, please open an issue, as supporting it should require little rework.
+
+Summary:
+|               Test               | Format | Storage | Supported? | Remark |
+|----------------------------------|--------|--------|--------|------|
+| [OVO\|OVR]  | [Dense\|CSC\|CSR]  |  In RAM  | ✅   | - |
+| OVO (reference="non-targeting")  | Dense  |  Lazy (H5)  | ✅   | - |
+| OVO (reference="non-targeting")  | CSR  |  Lazy (H5)  | ✅   | Specific parallelization scheme |
+| OVO (reference="non-targeting")  | CSC  |  Lazy (H5)  | ✅   | - |
+| OVR (reference=None)  | Dense  |  Lazy (H5)  | ✅   | - |
+| OVR (reference=None)  | CSR  |  Lazy (H5)  |  ❌   | Deliberately not supported, better off loading in RAM  |
+| OVR (reference=None)  | CSC  |  Lazy (H5)  | ✅   | - |
+
 
-If your data is backed through Dask or another backend, please open an issue as dense and CSC use cases should require very little rework to be supported.
 
 Notes:
 
-1. Supporting the CSR use case is highly non trivial, and running `adata[:, idxs]` on a backed CSR matrix will load (temporarily) the entirety of the indices in RAM, resulting in a memory footprint almost equivalent to loading everything at once, on top of being extremely slow.
+1. Supporting the CSR use case is highly non-trivial: running `adata[:, idxs]` on a backed CSR matrix will (temporarily) load the entirety of the indices in RAM, resulting in a memory footprint almost equivalent to loading everything at once, on top of being extremely slow. That is why the OVR test on lazy CSR is not supported.
 2. Users struggling with out-of-core single cell RNASeq analyses should visit `rapids-singlecell`, which explicitly targets this use case.
+3. The "Specific parallelization scheme" mentioned for the OVO lazy CSR use case relies on the fact that, due to the nature of the OVO test, it can be run group by group, loading only one group at a time in RAM. This is not the case for OVR, where all groups must be loaded at once.
+4. Note also that `illico` is expected to scale less well on lazy datasets, as the data loading part (such as reading h5 datasets) is GIL-blocking most of the time.
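
The group-by-group idea behind the OVO lazy CSR case can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the kernels this PR adds: the helper names (`load_rows_dense`, `u_statistic`, `ovo_group_by_group`) are hypothetical, the in-memory `indptr`/`indices`/`data` arrays stand in for disk-backed h5 datasets, and a brute-force U statistic replaces the actual asymptotic test.

```python
import numpy as np


def load_rows_dense(indptr, indices, data, rows, n_cols):
    """Densify selected CSR rows, touching only those rows' slices."""
    out = np.zeros((len(rows), n_cols), dtype=np.float64)
    for i, r in enumerate(rows):
        start, stop = indptr[r], indptr[r + 1]
        out[i, indices[start:stop]] = data[start:stop]
    return out


def u_statistic(group, ref):
    """Brute-force Mann-Whitney U of group vs ref, ties counted as 1/2."""
    greater = (group[:, None] > ref[None, :]).sum()
    ties = (group[:, None] == ref[None, :]).sum()
    return greater + 0.5 * ties


def ovo_group_by_group(indptr, indices, data, n_cols, labels, reference):
    """Compare each group to the reference, holding one group in RAM at a time."""
    labels = np.asarray(labels)
    ref_rows = np.flatnonzero(labels == reference)
    ref = load_rows_dense(indptr, indices, data, ref_rows, n_cols)
    results = {}
    for name in np.unique(labels):
        if name == reference:
            continue
        rows = np.flatnonzero(labels == name)
        grp = load_rows_dense(indptr, indices, data, rows, n_cols)  # one group only
        results[str(name)] = np.array(
            [u_statistic(grp[:, j], ref[:, j]) for j in range(n_cols)]
        )
    return results


# 4 cells x 2 genes: rows are [0, 1], [2, 0] (controls) and [3, 0], [0, 5] (group A).
indptr = np.array([0, 1, 2, 3, 4])
indices = np.array([1, 0, 0, 1])
data = np.array([1.0, 2.0, 3.0, 5.0])
labels = ["ctrl", "ctrl", "A", "A"]
results = ovo_group_by_group(indptr, indices, data, 2, labels, "ctrl")
```

Because CSR stores each row contiguously, reading the rows of one group only requires the matching slices of `indices` and `data`. The same trick does not work for OVR, where every column must be ranked across all cells at once.
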
@@ -28,12 +28,15 @@ $$\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{cont
 2. If `exp_post_agg` is `False`, expression values are **exponentiated then averaged**.
 $$\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$$ -->
 The fold-change computed by `illico` depends on the value of `is_log1p` and `exp_post_agg` as follows:
+
+⚠️ Note that `1.e-9` is added to both the numerator and the denominator to avoid division by zero and to be more consistent with `scanpy`'s implementation. This has no effect on the ranking of genes, but yields finite fold-change values for all genes.
+
 | `is_log1p` | `exp_post_agg` | Fold-change equation | Remark |
 |---|---|---|---|
 | `False` | `True` | $\text{fold-change} = \frac{E[X_{\text{perturbed}}]}{E[X_{\text{control}}]}$ | |
 | `False` | `False` | $\text{fold-change} = \frac{E[X_{\text{perturbed}}]}{E[X_{\text{control}}]}$ | |
-| `True` | `True` | $\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{control}}} - 1]}$ | 🎯 Scanpy's default |
-| `True` | `False` | $\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$ | |
+| `True` | `False` | $\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{control}}} - 1]}$ | |
+| `True` | `True` | $\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$ | 🎯 Scanpy's default |
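
The four rows of the updated table, together with the `1.e-9` offset, can be written as a minimal NumPy sketch. The function name `fold_change` and its signature are assumptions made for illustration and do not reflect `illico`'s internal code; inputs are assumed to be cells-by-genes arrays.

```python
import numpy as np

EPS = 1e-9  # offset added to numerator and denominator, as stated in the note above


def fold_change(perturbed, control, is_log1p, exp_post_agg):
    """Per-gene fold change following the table; hypothetical helper."""
    perturbed = np.asarray(perturbed, dtype=np.float64)
    control = np.asarray(control, dtype=np.float64)
    if not is_log1p:
        # Raw values: plain ratio of means, whatever exp_post_agg is.
        num, den = perturbed.mean(axis=0), control.mean(axis=0)
    elif exp_post_agg:
        # Average in log1p space, then exponentiate: e^{E[X]} - 1.
        num = np.expm1(perturbed.mean(axis=0))
        den = np.expm1(control.mean(axis=0))
    else:
        # Exponentiate each value, then average: E[e^X - 1].
        num = np.expm1(perturbed).mean(axis=0)
        den = np.expm1(control).mean(axis=0)
    return (num + EPS) / (den + EPS)


# Gene 1 is zero in both groups: without the offset this would be 0 / 0.
perturbed = [[2.0, 0.0], [4.0, 0.0]]
control = [[1.0, 0.0], [1.0, 0.0]]
fc = fold_change(perturbed, control, is_log1p=False, exp_post_agg=True)
```

In this sketch a gene that is zero in both groups gets a fold change of 1 instead of `NaN`, while genes with non-negligible means are essentially unaffected by the offset.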
 
 ⚠️ Please note that by default, `scanpy.rank_genes_groups` assumes that your data is log1p-transformed, and exponentiates after aggregation. Consequently, if you are coming from `scanpy` and want to drop-in replace `scanpy.tl.rank_genes_groups`, you should set:
 ```python
 
@@ -1,3 +1,39 @@
 from illico.asymptotic_wilcoxon import asymptotic_wilcoxon
 
+# Import kernel modules to trigger decorator registration
+# These imports must come after the registry definitions above
+from illico.ovo import (  # noqa: E402, F401
+    csc_ovo_mwu_kernel_over_contiguous_col_chunk,
+    csr_ovo_mwu_kernel_over_contiguous_col_chunk,
+    dense_ovo_mwu_kernel_over_contiguous_col_chunk,
+)
+from illico.ovr import (  # noqa: E402, F401
+    csc_ovr_mwu_kernel_over_contiguous_col_chunk,
+    csr_ovr_mwu_kernel_over_contiguous_col_chunk,
+    dense_ovr_mwu_kernel_over_contiguous_col_chunk,
+)
+
+# Now register the Rust kernels
+from illico.rust_backend import (  # noqa: E402, F401
+    csc_ovo_mwu_kernel_over_contiguous_col_chunk_rust,
+    csc_ovr_mwu_kernel_over_contiguous_col_chunk_rust,
+    csr_ovo_mwu_kernel_over_contiguous_col_chunk_rust,
+    csr_ovr_mwu_kernel_over_contiguous_col_chunk_rust,
+    dense_ovo_over_contiguous_col_chunk_rust,
+    dense_ovr_over_contiguous_col_chunk_rust,
+)
+from illico.utils.registry import (
+    KernelDataFormat,
+    Test,
+    rs_dispatcher_registry,
+)
+
+rs_dispatcher_registry.register(Test.OVO, KernelDataFormat.DENSE)(dense_ovo_over_contiguous_col_chunk_rust)
+rs_dispatcher_registry.register(Test.OVR, KernelDataFormat.DENSE)(dense_ovr_over_contiguous_col_chunk_rust)
+rs_dispatcher_registry.register(Test.OVO, KernelDataFormat.CSC)(csc_ovo_mwu_kernel_over_contiguous_col_chunk_rust)
+rs_dispatcher_registry.register(Test.OVO, KernelDataFormat.CSR)(csr_ovo_mwu_kernel_over_contiguous_col_chunk_rust)
+rs_dispatcher_registry.register(Test.OVR, KernelDataFormat.CSC)(csc_ovr_mwu_kernel_over_contiguous_col_chunk_rust)
+rs_dispatcher_registry.register(Test.OVR, KernelDataFormat.CSR)(csr_ovr_mwu_kernel_over_contiguous_col_chunk_rust)
+
+
 __all__ = ["asymptotic_wilcoxon"]