Skip to content

Commit cbc7af5

Browse files
committed
Added support for sparse arrays, improved OVO CSR chunking, fixed genes ordering, foldchange safeguard, and possiblity to return top n genes
1 parent 670ca85 commit cbc7af5

20 files changed

Lines changed: 551 additions & 216 deletions

changelog.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,16 @@
11
Changelog
22
=========
33

4+
Version 0.5.0
5+
------------
6+
- Added `n_genes` arguments allowing to return only the top N genes per group when `return_as_scanpy=True`. This allowed to match `scanpy`'s sorting method (partial sort) resulting in better reproducibility of scanpy results.
7+
- Fixed genes ordering in the scanpy formatter, by removing redundant sorting of perturbation names as `encode_and_count_groups` already returns sorted unique perturbation names. This ensures that gene names are sorted the same way everywhere.
8+
- Added explicit testing of genes ordering. In the PBMC dataset, lots of genes end up with identical z-scores but different logfoldchanges. This was not caught by previous tests.
9+
- Improved CSR chunking mechanism for the OVO test, resulting in faster execution and much smaller memory footprint. A direct implication is that `batch_size` can grow much larger now.
10+
- On TAHOE's `plate3` (in RAM) with `batch_size=1024`, this reduced memory footprint from 35GB to 1.5GB, and runtime from 1:17 to 0:50 with 8 CPUs.
11+
- The reduced footprint allows to scale more aggressively `n_threads`. With 32 threads, TAHOE's `plate3` runs in 21 seconds, while eating only 2.5GB of RAM.
12+
- Fold change is now computed with `(numerator + 1.e-9) / (denominator + 1.e-9)` to avoid division by zero, and to be more consistent with scanpy's implementation. This has no effect on the ranking of genes, but allows to get finite fold change values for all genes.
13+
414
Version 0.4.0
515
------------
616
- Added option to return scanpy-friendly output with `return_as_scanpy` arg. `asymptotic_wilcoxon` returns either:

docs/results.md

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -28,12 +28,15 @@ $$\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{cont
2828
2. If `exp_post_agg` is `False`, expression values are **exponentiated then averaged**.
2929
$$\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$$ -->
3030
The fold-change computed by `illico` depends on the value of `is_log1p` and `exp_post_agg` as follows:
31+
32+
⚠️ Note that a `1.e-9` will be added to both numerator and denominator to avoid division by zero, and to be more consistent with `scanpy`'s implementation. This has no effect on the ranking of genes, but allows to get finite fold change values for all genes.
33+
3134
| `is_log1p` | `exp_post_agg` | Fold-change equation | Remark |
3235
|---|---|---|---|
3336
| `False` | `True` | $\text{fold-change} = \frac{E[X_{\text{perturbed}}]}{E[X_{\text{control}}]}$ | |
3437
| `False` | `False` | $\text{fold-change} = \frac{E[X_{\text{perturbed}}]}{E[X_{\text{control}}]}$ | |
35-
| `True` | `True` | $\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{control}}} - 1]}$ | 🎯 Scanpy's default |
36-
| `True` | `False` | $\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$ | |
38+
| `True` | `False` | $\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{control}}} - 1]}$ | |
39+
| `True` | `True` | $\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$ | 🎯 Scanpy's default |
3740

3841
⚠️ Please note that by default, `scanpy.rank_genes_groups` assumes that your data is log1p-transformed, and exponentiates after aggregation. Consequently, if you are coming from `scanpy` and want to drop-in replace `scanpy.tl.rank_genes_groups`, you should set:
3942
```python

illico/asymptotic_wilcoxon.py

Lines changed: 16 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -98,6 +98,7 @@ def asymptotic_wilcoxon(
9898
precompile: bool = True,
9999
use_rust: bool = True,
100100
return_as_scanpy: bool = False,
101+
n_genes: int | None = None,
101102
corr_method: Literal["benjamini-hochberg", "bonferroni"] = "benjamini-hochberg",
102103
) -> pd.DataFrame | dict:
103104
"""Perform asymptotic Mann-Whitney tests for differential gene expression.
@@ -144,6 +145,10 @@ def asymptotic_wilcoxon(
144145
Whether to return results in a format compatible with Scanpy's `rank_genes_groups` function.
145146
If yes, the output is a dictionary that can be attached to the `adata` object like this:
146147
`adata.uns['rank_genes_groups'] = asymptotic_wilcoxon(..., return_as_scanpy=True)`
148+
n_genes : int or None, default=None
149+
Number of top genes to return per group, sorted by z-score. If `None`, returns all genes. This is relevant only if `return_as_scanpy=True`,
150+
as Scanpy's `rank_genes_groups` function expects the results to be sorted by significance. If `return_as_scanpy=False`, the results are
151+
not sorted and `n_genes` is ignored.
147152
corr_method: str, default="benjamini-hochberg"
148153
Method to use for multiple testing correction. One of 'benjamini-hochberg' or 'bonferroni'.
149154
@@ -247,22 +252,22 @@ def asymptotic_wilcoxon(
247252
logger.info(
248253
f"Found {group_container.counts.size} unique groups (min size: {group_container.counts.min()} cells; max size: {group_container.counts.max()} cells), with reference group: {reference}"
249254
)
250-
_, n_genes = X.shape
255+
_, n_genes_total = X.shape
251256

252257
# Allocate the results dataframes
253258
cols = pd.Series(adata.var_names, name="feature", dtype=str)
254259
rows = pd.Series(unique_raw_groups, name="pert", dtype=str)
255260
results = np.empty((len(rows), len(cols), 4), dtype=np.float64)
256261

257262
# Compute the batch bounds for each thread
258-
iterator, batch_size = compute_batch_bounds(n_genes, batch_size, n_threads)
259-
logger.trace(f"Processing {n_genes} genes through {len(iterator)} batches with {n_threads} threads.")
263+
iterator, batch_size = compute_batch_bounds(n_genes_total, batch_size, n_threads)
264+
logger.trace(f"Processing {n_genes_total} genes through {len(iterator)} batches with {n_threads} threads.")
260265

261266
# Compute estimated mem footprint
262267
_ = log_memory_usage(data_handler, group_container, batch_size, n_threads)
263268

264269
# Go through all the possible combinations
265-
n_tests = n_genes * group_container.counts.size
270+
n_tests = n_genes_total * group_container.counts.size
266271
logger.trace(f"Performing a total of {n_tests:,d} tests.")
267272
with Parallel(n_threads, prefer="threads", return_as="generator_unordered") as pool:
268273
with tqdm(total=n_tests, smoothing=0.0, unit="it", unit_scale=True, unit_divisor=1000) as pbar:
@@ -279,12 +284,16 @@ def asymptotic_wilcoxon(
279284
exp_post_agg,
280285
use_rust,
281286
results,
282-
)
287+
) # fmt: off
283288
for lb, ub in iterator
284289
):
285290
pbar.update(group_container.counts.size * (ub - lb))
286291

287292
if not return_as_scanpy:
293+
if n_genes is not None:
294+
logger.warning(
295+
"Argument `n_genes` is ignored when `return_as_scanpy=False`, as the results are not sorted. Returning all genes."
296+
)
288297
# Return a pd.DataFrame to index results
289298
results = pd.DataFrame(
290299
data=results.reshape(-1, 4),
@@ -295,10 +304,12 @@ def asymptotic_wilcoxon(
295304
# Return a dict formatted for Scanpy's rank_genes_groups results
296305
results = format_illico_results_for_scanpy(
297306
adata=adata,
307+
unique_groups=unique_raw_groups,
298308
reference=reference,
299309
group_keys=group_keys,
300310
layer=layer,
301311
values=results,
312+
n_genes=n_genes,
302313
corr_method=corr_method,
303314
)
304315

0 commit comments

Comments
 (0)