remydubois
diff --git a/changelog.md b/changelog.md
Lines changed: 10 additions & 0 deletions
diff --git a/docs/results.md b/docs/results.md
Lines changed: 5 additions & 2 deletions
diff --git a/illico/asymptotic_wilcoxon.py b/illico/asymptotic_wilcoxon.py
Lines changed: 16 additions & 5 deletions
@@ -1,6 +1,16 @@
 Changelog
 =========
 
+Version 0.5.0
+------------
+- Added an `n_genes` argument that returns only the top N genes per group when `return_as_scanpy=True`. This matches `scanpy`'s sorting method (partial sort), which improves reproducibility of scanpy results.
+- Fixed gene ordering in the scanpy formatter by removing the redundant sorting of perturbation names, since `encode_and_count_groups` already returns sorted unique perturbation names. This ensures that gene names are sorted the same way everywhere.
+- Added explicit tests of gene ordering. In the PBMC dataset, many genes end up with identical z-scores but different logfoldchanges, which previous tests did not catch.
+- Improved the CSR chunking mechanism for the OVO test, resulting in faster execution and a much smaller memory footprint. As a direct consequence, `batch_size` can now be much larger.
+    - On TAHOE's `plate3` (in RAM) with `batch_size=1024`, this reduced the memory footprint from 35GB to 1.5GB, and the runtime from 1:17 to 0:50 with 8 CPUs.
+    - The reduced footprint allows scaling `n_threads` more aggressively. With 32 threads, TAHOE's `plate3` runs in 21 seconds while using only 2.5GB of RAM.
+- Fold change is now computed as `(numerator + 1.e-9) / (denominator + 1.e-9)` to avoid division by zero and to be more consistent with scanpy's implementation. This has no effect on the ranking of genes, but yields finite fold-change values for all genes.
+
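
The partial sort mentioned for `n_genes` in the 0.5.0 entries can be sketched as follows. This is a minimal illustration with NumPy, not `illico`'s or `scanpy`'s actual code: the function name `top_n_indices` and its signature are hypothetical. It selects the N highest scores with `np.argpartition`, then fully sorts only that subset.

```python
from typing import Optional

import numpy as np


def top_n_indices(scores: np.ndarray, n_top: Optional[int] = None) -> np.ndarray:
    """Return indices of the `n_top` highest scores, ordered from highest to lowest.

    Hypothetical helper for illustration only; not part of illico or scanpy.
    """
    if n_top is not None and n_top <= 0:
        raise ValueError("n_top must be a positive integer or None")
    if n_top is None or n_top >= scores.size:
        # No truncation requested: fall back to a full descending sort.
        return np.argsort(scores)[::-1]
    # Partial sort: the last `n_top` positions hold the largest values,
    # in no guaranteed order.
    partition = np.argpartition(scores, -n_top)[-n_top:]
    # Fully sort only the selected subset, highest score first.
    return partition[np.argsort(scores[partition])[::-1]]


z_scores = np.array([0.5, 3.0, -1.0, 2.0, 7.0])
top3 = top_n_indices(z_scores, n_top=3)
```

Because only the selected subset is fully sorted, the order among tied scores may differ from what a full `np.argsort` would give, so two implementations should agree on ties only if they use the same selection strategy.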
 Version 0.4.0
 ------------
 - Added option to return scanpy-friendly output with `return_as_scanpy` arg. `asymptotic_wilcoxon` returns either:
 
@@ -28,12 +28,15 @@ $$\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{cont
 2. If `exp_post_agg` is `False`, expression values are **exponentiated then averaged**.
 $$\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$$ -->
 The fold-change computed by `illico` depends on the value of `is_log1p` and `exp_post_agg` as follows:
+
+⚠️ Note that `1.e-9` is added to both the numerator and the denominator to avoid division by zero and to be more consistent with `scanpy`'s implementation. This has no effect on the ranking of genes, but yields finite fold-change values for all genes.
+
 | `is_log1p` | `exp_post_agg` | Fold-change equation | Remark |
 |---|---|---|---|
 | `False` | `True` | $\text{fold-change} = \frac{E[X_{\text{perturbed}}]}{E[X_{\text{control}}]}$ | |
 | `False` | `False` | $\text{fold-change} = \frac{E[X_{\text{perturbed}}]}{E[X_{\text{control}}]}$ | |
-| `True` | `True` | $\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{control}}} - 1]}$ | 🎯 Scanpy's default |
-| `True` | `False` | $\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$ | |
+| `True` | `False` | $\text{fold-change} = \frac{E[e^{X_{\text{perturbed}}} - 1]}{E[e^{X_{\text{control}}} - 1]}$ | |
+| `True` | `True` | $\text{fold-change} = \frac{e^{E[X_{\text{perturbed}}]} - 1}{e^{E[X_{\text{control}}]} - 1}$ | 🎯 Scanpy's default |
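
The cases in the updated table, together with the `1.e-9` offset, can be sketched as below. This is a hedged illustration for a single gene, not `illico`'s implementation: the function `fold_change`, its arguments, and the `EPS` constant name are assumptions made for the example.

```python
import numpy as np

EPS = 1e-9  # offset added to numerator and denominator, as quoted in the note


def fold_change(x_perturbed, x_control, is_log1p: bool, exp_post_agg: bool, eps: float = EPS) -> float:
    """Fold change of one gene between perturbed and control cells.

    Hypothetical helper for illustration only; not illico's actual code.
    """
    x_perturbed = np.asarray(x_perturbed, dtype=np.float64)
    x_control = np.asarray(x_control, dtype=np.float64)
    if not is_log1p:
        # Raw values: plain ratio of means, whatever `exp_post_agg` is.
        num, den = x_perturbed.mean(), x_control.mean()
    elif exp_post_agg:
        # Average in log1p space, then exponentiate (the row marked as Scanpy's default).
        num, den = np.expm1(x_perturbed.mean()), np.expm1(x_control.mean())
    else:
        # Exponentiate each value, then average.
        num, den = np.expm1(x_perturbed).mean(), np.expm1(x_control).mean()
    # The offset keeps the ratio finite when the control mean is zero.
    return float((num + eps) / (den + eps))


# A gene that is not expressed in control cells still gets a finite value.
fc = fold_change([1.0, 3.0], [0.0, 0.0], is_log1p=False, exp_post_agg=False)
```

`np.expm1` computes `exp(x) - 1`, which corresponds to the `e^{X} - 1` terms in the table.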
 
 ⚠️ Please note that by default, `scanpy.rank_genes_groups` assumes that your data is log1p-transformed, and exponentiates after aggregation. Consequently, if you are coming from `scanpy` and want to drop-in replace `scanpy.tl.rank_genes_groups`, you should set:
 ```python
 
@@ -98,6 +98,7 @@ def asymptotic_wilcoxon(
     precompile: bool = True,
     use_rust: bool = True,
     return_as_scanpy: bool = False,
+    n_genes: int | None = None,
     corr_method: Literal["benjamini-hochberg", "bonferroni"] = "benjamini-hochberg",
 ) -> pd.DataFrame | dict:
     """Perform asymptotic Mann-Whitney tests for differential gene expression.
@@ -144,6 +145,10 @@ def asymptotic_wilcoxon(
         Whether to return results in a format compatible with Scanpy's `rank_genes_groups` function.
         If yes, the output is a dictionary that can be attached to the `adata` object like this:
         `adata.uns['rank_genes_groups'] = asymptotic_wilcoxon(..., return_as_scanpy=True)`
+    n_genes : int or None, default=None
+        Number of top genes to return per group, sorted by z-score. If `None`, returns all genes. This is relevant only if `return_as_scanpy=True`,
+        as Scanpy's `rank_genes_groups` function expects the results to be sorted by significance. If `return_as_scanpy=False`, the results are
+        not sorted and `n_genes` is ignored.
     corr_method: str, default="benjamini-hochberg"
         Method to use for multiple testing correction. One of 'benjamini-hochberg' or 'bonferroni'.
 
@@ -247,22 +252,22 @@ def asymptotic_wilcoxon(
     logger.info(
         f"Found {group_container.counts.size} unique groups (min size: {group_container.counts.min()} cells; max size: {group_container.counts.max()} cells), with reference group: {reference}"
     )
-    _, n_genes = X.shape
+    _, n_genes_total = X.shape
 
     # Allocate the results dataframes
     cols = pd.Series(adata.var_names, name="feature", dtype=str)
     rows = pd.Series(unique_raw_groups, name="pert", dtype=str)
     results = np.empty((len(rows), len(cols), 4), dtype=np.float64)
 
     # Compute the batch bounds for each thread
-    iterator, batch_size = compute_batch_bounds(n_genes, batch_size, n_threads)
-    logger.trace(f"Processing {n_genes} genes through {len(iterator)} batches with {n_threads} threads.")
+    iterator, batch_size = compute_batch_bounds(n_genes_total, batch_size, n_threads)
+    logger.trace(f"Processing {n_genes_total} genes through {len(iterator)} batches with {n_threads} threads.")
 
     # Compute estimated mem footprint
     _ = log_memory_usage(data_handler, group_container, batch_size, n_threads)
 
     # Go through all the possible combinations
-    n_tests = n_genes * group_container.counts.size
+    n_tests = n_genes_total * group_container.counts.size
     logger.trace(f"Performing a total of {n_tests:,d} tests.")
     with Parallel(n_threads, prefer="threads", return_as="generator_unordered") as pool:
         with tqdm(total=n_tests, smoothing=0.0, unit="it", unit_scale=True, unit_divisor=1000) as pbar:
@@ -279,12 +284,16 @@ def asymptotic_wilcoxon(
                     exp_post_agg,
                     use_rust,
                     results,
-                )
+                )  # fmt: off
                 for lb, ub in iterator
             ):
                 pbar.update(group_container.counts.size * (ub - lb))
 
     if not return_as_scanpy:
+        if n_genes is not None:
+            logger.warning(
+                "Argument `n_genes` is ignored when `return_as_scanpy=False`, as the results are not sorted. Returning all genes."
+            )
         # Return a pd.DataFrame to index results
         results = pd.DataFrame(
             data=results.reshape(-1, 4),
@@ -295,10 +304,12 @@ def asymptotic_wilcoxon(
         # Return a dict formatted for Scanpy's rank_genes_groups results
         results = format_illico_results_for_scanpy(
             adata=adata,
+            unique_groups=unique_raw_groups,
             reference=reference,
             group_keys=group_keys,
             layer=layer,
             values=results,
+            n_genes=n_genes,
             corr_method=corr_method,
         )