Skip to content

Commit e7fc038

Browse files
authored
v0.5.0rc1 (#14)
- Improved scanpy compat with better genes ordering, fold change dtype and computation, and possibility to return top n genes only - Improved OVO CSR chunking mechanism leading to reduced memory footprint and enhanced execution speed - Added support for OVO test on lazy CSR data through a specific parallelization scheme
1 parent 670ca85 commit e7fc038

35 files changed

Lines changed: 1086 additions & 332 deletions

.github/workflows/python-package.yaml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,3 +37,7 @@ jobs:
3737
- name: Test with tox
3838
run: |
3939
python -m tox -e unit-tests
40+
# Remove until docformatter is compat with py314
41+
# - name: docformatter
42+
# run: |
43+
# poetry run docformatter --check --recursive illico tests

.pre-commit-config.yaml

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,3 +35,13 @@ repos:
3535
# entry: bash -lc 'rustup component add clippy >/dev/null 2>&1 || true; cargo clippy --all -- -D warnings'
3636
# language: system
3737
# files: '\.rs$'
38+
- repo: https://github.com/PyCQA/docformatter
39+
rev: v1.7.7
40+
hooks:
41+
- id: docformatter
42+
name: docformatter
43+
entry: docformatter
44+
language: python
45+
types: [python]
46+
additional_dependencies: [tomli]
47+
args: [--in-place, --config, ./pyproject.toml]

changelog.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,21 @@
11
Changelog
22
=========
33

4+
Version 0.5.0rc1
5+
------------
6+
This version improves compatibility with `scanpy`:
7+
- Added `n_genes` arguments allowing to return only the top N genes per group when `return_as_scanpy=True`. This allowed to match `scanpy`'s sorting method (partial sort) resulting in better reproducibility of scanpy results.
8+
- Fixed genes ordering in the scanpy formatter, by removing redundant sorting of perturbation names as `encode_and_count_groups` already returns sorted unique perturbation names. This ensures that gene names are sorted the same way everywhere.
9+
- Added explicit testing of genes ordering. In the PBMC dataset, lots of genes end up with identical z-scores but different logfoldchanges. This was not caught by previous tests.
10+
- Fold change is now computed with `(numerator + 1.e-9) / (denominator + 1.e-9)` to avoid division by zero, and to be more consistent with scanpy's implementation. This has no effect on the ranking of genes, but allows to get finite fold change values for all genes.
11+
12+
It also includes some performance improvements:
13+
- Improved CSR chunking mechanism for the OVO test, resulting in faster execution and much smaller memory footprint. A direct implication is that `batch_size` can grow much larger now.
14+
- On TAHOE's `plate3` (in RAM) with `batch_size=1024`, this reduced memory footprint from 35GB to 1.5GB, and runtime from 1:17 to 0:50 with 8 CPUs.
15+
- The reduced footprint allows to scale more aggressively `n_threads`. With 32 threads, TAHOE's `plate3` runs in 21 seconds, while eating only 2.5GB of RAM.
16+
17+
Also, it adds support for OVO test on lazy CSR (h5-based) datasets, through a specific parallelization scenario where groups are processed one by one.
18+
419
Version 0.4.0
520
------------
621
- Added option to return scanpy-friendly output with `return_as_scanpy` arg. `asymptotic_wilcoxon` returns either:

docs/out_of_core.md

Lines changed: 18 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -4,11 +4,26 @@ Although not initially designed to run out-of-core rank-sum tests, `illico` supp
44

55
- h5-dense (np.ndarray) disk-backed dataset are natively supported
66
- h5-CSC (sparse along the columns) disk-backed datasets are natively supported
7-
- :warning: **h5-CSR (sparse along the rows) disk-backed datasets are not supported**
7+
- h5-CSR (sparse along the rows) disk-backed datasets are natively supported **only for OVO (perturbed vs controls) test**. If you want to perform OVR (each group vs the rest) tests, you are better off loading it entirely in memory, as OVR test requires each column to be entirely in RAM at once, and CSR format does not allow to load columns from disk without loading the entire `.indices` in RAM (without telling you).
8+
9+
If your data is backed through Dask or another backend, please open an issue as it should require little rework for it to be supported.
10+
11+
Summary:
12+
| Test | Format | Storage | Supported ? | Remark |
13+
|----------------------------------|--------|--------|--------|------|
14+
| [OVO\|OVR] | [Dense\|CSC\|CSR] | In RAM || - |
15+
| OVO (reference="non-targeting") | Dense | Lazy (H5) || - |
16+
| OVO (reference="non-targeting") | CSR | Lazy (H5) || Specific parallelization scheme |
17+
| OVO (reference="non-targeting") | CSC | Lazy (H5) || - |
18+
| OVR (reference=None) | Dense | Lazy (H5) || - |
19+
| OVR (reference=None) | CSR | Lazy (H5) || Voluntarily not supported, better off loading in RAM |
20+
| OVR (reference=None) | CSC | Lazy (H5) || - |
21+
822

9-
If your data is backed through Dask or another backend, please open an issue as dense and CSC use cases should require very little rework to be supported.
1023

1124
Notes:
1225

13-
1. Supporting the CSR use case is highly non trivial, and running `adata[:, idxs]` on a backed CSR matrix will load (temporarily) the entirety of the indices in RAM, resulting in a memory footprint almost equivalent to loading everything at once, on top of being extremely slow.
26+
1. Supporting the CSR use case is highly non trivial, and running `adata[:, idxs]` on a backed CSR matrix will load (temporarily) the entirety of the indices in RAM, resulting in a memory footprint almost equivalent to loading everything at once, on top of being extremely slow. That's why OVR test on lazy CSR is not supported.
1427
2. Users struggling with out-of-core single cell RNASeq analyses should visit `rapids-singlecell`, which explicitely targets this use-case.
28+
3. The "Specific parallelization scheme mentioned for the OVO lazy CSR use case simply relies on the fact that due to the nature of the OVR test, we can run it group by group, and thus only load one group at a time in RAM, which is not the case for OVR where we need to load all groups at once.
29+
4. Note also that illico is expected to scale less well on lazy datasets, as most of the time the data loading part (such as the one of h5 datasets) is GIL-blocking.

docs/results.md

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -28,12 +28,15 @@ $$\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{cont
2828
2. If `exp_post_agg` is `False`, expression values are **exponentiated then averaged**.
2929
$$\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$$ -->
3030
The fold-change computed by `illico` depends on the value of `is_log1p` and `exp_post_agg` as follows:
31+
32+
⚠️ Note that a `1.e-9` will be added to both numerator and denominator to avoid division by zero, and to be more consistent with `scanpy`'s implementation. This has no effect on the ranking of genes, but allows to get finite fold change values for all genes.
33+
3134
| `is_log1p` | `exp_post_agg` | Fold-change equation | Remark |
3235
|---|---|---|---|
3336
| `False` | `True` | $\text{fold-change} = \frac{E[X_{\text{perturbed}}]}{E[X_{\text{control}}]}$ | |
3437
| `False` | `False` | $\text{fold-change} = \frac{E[X_{\text{perturbed}}]}{E[X_{\text{control}}]}$ | |
35-
| `True` | `True` | $\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{control}}} - 1]}$ | 🎯 Scanpy's default |
36-
| `True` | `False` | $\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$ | |
38+
| `True` | `False` | $\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{control}}} - 1]}$ | |
39+
| `True` | `True` | $\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$ | 🎯 Scanpy's default |
3740

3841
⚠️ Please note that by default, `scanpy.rank_genes_groups` assumes that your data is log1p-transformed, and exponentiates after aggregation. Consequently, if you are coming from `scanpy` and want to drop-in replace `scanpy.tl.rank_genes_groups`, you should set:
3942
```python

illico/__init__.py

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,39 @@
11
from illico.asymptotic_wilcoxon import asymptotic_wilcoxon
22

3+
# Import kernel modules to trigger decorator registration
4+
# These imports must come after the registry definitions above
5+
from illico.ovo import ( # noqa: E402, F401
6+
csc_ovo_mwu_kernel_over_contiguous_col_chunk,
7+
csr_ovo_mwu_kernel_over_contiguous_col_chunk,
8+
dense_ovo_mwu_kernel_over_contiguous_col_chunk,
9+
)
10+
from illico.ovr import ( # noqa: E402, F401
11+
csc_ovr_mwu_kernel_over_contiguous_col_chunk,
12+
csr_ovr_mwu_kernel_over_contiguous_col_chunk,
13+
dense_ovr_mwu_kernel_over_contiguous_col_chunk,
14+
)
15+
16+
# Now register the Rust kernels
17+
from illico.rust_backend import ( # noqa: E402, F401
18+
csc_ovo_mwu_kernel_over_contiguous_col_chunk_rust,
19+
csc_ovr_mwu_kernel_over_contiguous_col_chunk_rust,
20+
csr_ovo_mwu_kernel_over_contiguous_col_chunk_rust,
21+
csr_ovr_mwu_kernel_over_contiguous_col_chunk_rust,
22+
dense_ovo_over_contiguous_col_chunk_rust,
23+
dense_ovr_over_contiguous_col_chunk_rust,
24+
)
25+
from illico.utils.registry import (
26+
KernelDataFormat,
27+
Test,
28+
rs_dispatcher_registry,
29+
)
30+
31+
rs_dispatcher_registry.register(Test.OVO, KernelDataFormat.DENSE)(dense_ovo_over_contiguous_col_chunk_rust)
32+
rs_dispatcher_registry.register(Test.OVR, KernelDataFormat.DENSE)(dense_ovr_over_contiguous_col_chunk_rust)
33+
rs_dispatcher_registry.register(Test.OVO, KernelDataFormat.CSC)(csc_ovo_mwu_kernel_over_contiguous_col_chunk_rust)
34+
rs_dispatcher_registry.register(Test.OVO, KernelDataFormat.CSR)(csr_ovo_mwu_kernel_over_contiguous_col_chunk_rust)
35+
rs_dispatcher_registry.register(Test.OVR, KernelDataFormat.CSC)(csc_ovr_mwu_kernel_over_contiguous_col_chunk_rust)
36+
rs_dispatcher_registry.register(Test.OVR, KernelDataFormat.CSR)(csr_ovr_mwu_kernel_over_contiguous_col_chunk_rust)
37+
38+
339
__all__ = ["asymptotic_wilcoxon"]

0 commit comments

Comments
 (0)