Commit 37681d9

Merge branch 'main' into improve-annotation-defaults
2 parents c2cab3d + 5609d44 commit 37681d9

112 files changed

Lines changed: 3958 additions & 317 deletions

.github/workflows/integration-test.yml

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ jobs:

       - uses: viash-io/viash-actions/setup@v6

-      - uses: nf-core/setup-nextflow@v2.1.4
+      - uses: nf-core/setup-nextflow@v3.0.0

       # use cache
       - name: Cache resources data

.github/workflows/release-build.yml

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ jobs:

       - uses: viash-io/viash-actions/setup@v6

-      - uses: nf-core/setup-nextflow@v2.1.4
+      - uses: nf-core/setup-nextflow@v3.0.0

       # use cache
       - name: Cache resources data

.github/workflows/viash-test.yml

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ jobs:
         uses: actions/setup-python@v6
       - uses: r-lib/actions/setup-r@v2
         with:
+          r-version: 4.5.3
           use-public-rspm: true
       - run: python -m pip install pre-commit
         shell: bash

CHANGELOG.md

Lines changed: 28 additions & 0 deletions
@@ -8,10 +8,26 @@

 * `workflows/rna/rna_multisample`, `workflows/multiomics/process_batches`, `feature_annotation/highly_variable_features_scanpy`: add an option to exclude features before running highly variable gene calculation based on a user-defined list of feature names (PR #1121).

+* `annotate/consensus_vote`: new component computing a (weighted) majority vote across cell type labels from multiple annotation methods (PR #1151).
+*
+* `filter/filter_with_quantile`: added a component to filter numerical .obs or .var columns based on quantile thresholds, with optional subsetting (PR #1146).
+
+* `dimred/pca`: added possibility to do chunked processing using arguments `chunks` and `chunk_size`. Also added a `seed` argument in order to better control the variability between executions (PR #1157).
+
+* `workflows/multiomics/process_singlesample`: New workflow for processing RNA, protein and GDO modalities of individual samples (PR #1147).
+
+* `transform/clear_slots`: New component that can be used to remove all items from slots of a MuData object (PR #1171).
+
+* `workflows/multiomics/process_singlesample`, `workflows/multiomics/process_samples`, `workflows/multiomics/process_batches`: add `--intersect_obs` option to remove observations that are not present in all processed modalities, so each modality shares the same set of cells (PR #1173, PR #1175).
+
+* `labels_transfer/cellmapper`: New component that transfers labels from a reference to a query with a shared embedding using CellMapper (PR #1169, PR #1177).
+
 ## MAJOR CHANGES

 * `qc/calculate_qc_metrics`: major improvements to memory consumption and runtimes (PR #1140).

+* `annotate/popv`: bump version to 0.6.1 (PR #1167).
+
 ## MINOR CHANGES

 * `dataflow/split_modalities`: improve memory consumption by only reading one modality at the same time (PR #1152).
@@ -28,6 +44,18 @@

 * `workflows/annotation/scanvi_scarches`: set `--input_obs_batch_label` and `--reference_obs_batch_label` defaults to `sample_id` and `--reference_var_hvg` default to `filter_with_hvg` to align with upstream workflow defaults (PR #1155).

+* `cluster/leiden`: added `flavor`, `n_iterations` and `seed` arguments (PR #1132).
+
+* `cluster/leiden`: avoid creating unnecessary copies of the output data (PR #1132).
+
+* `workflows/multiomics/process_samples`: refactored to use a shared `process_singlesample_base` subworkflow, which is also used by the new `process_singlesample` workflow to avoid code duplication (PR #1147).
+
+* Bump anndata to `0.12.11` (PR #1174).
+
+* Add missing `example` fields to several component and workflow configurations (PR #1067).
+
+* Testing: bump `viashpy` to 0.10.0 (PR #1178).
+
 ## BUG FIXES

 * `dataflow/split_h5mu`: pin scipy version to 1.16.3 to avoid regression that corrupts large sparse matrix indexing (PR #1153).
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+name: consensus_vote
+namespace: annotate
+scope: "public"
+description: |
+  Combines cell type predictions from multiple annotation methods into a single consensus prediction using a weighted majority vote.
+  For each cell, each method votes for its predicted cell type, optionally weighted by the probability score and/or a per-method weight.
+  The consensus prediction is the cell type with the highest total weighted vote.
+  Note that this method does not leverage a pre-existing ontology or perform any reconciliation of cell type labels across methods, so the same cell type may be represented by different labels in different methods and will be treated as distinct cell types in the vote.
+authors:
+  - __merge__: /src/authors/dorien_roosen.yaml
+    roles: [ author ]
+
+argument_groups:
+  - name: Inputs
+    description: Input dataset arguments.
+    arguments:
+      - name: "--input"
+        type: file
+        description: Input h5mu file containing cell type predictions in .obs.
+        direction: input
+        required: true
+        example: input.h5mu
+      - name: "--modality"
+        description: Which modality to process.
+        type: string
+        default: "rna"
+        required: false
+      - name: "--input_obs_predictions"
+        type: string
+        description: |
+          One or more .obs column names containing cell type predictions (labels) from
+          different annotation methods.
+        required: true
+        multiple: true
+        example: ["scanvi_pred", "celltypist_pred"]
+      - name: "--input_obs_probabilities"
+        type: string
+        description: |
+          One or more .obs column names containing prediction probability scores,
+          one per method in --input_obs_predictions. When provided, each method's
+          vote is scaled by the probability score for that cell (in addition to
+          any per-method --weights). Must be the same length as --input_obs_predictions.
+        required: false
+        multiple: true
+        example: ["scanvi_prob", "celltypist_prob"]
+      - name: "--tie_label"
+        type: string
+        description: |
+          Label to assign when two or more cell types receive equal votes.
+          If not provided, tied cells are assigned None (missing value).
+        required: false
+        example: "Unknown"
+      - name: "--weights"
+        type: double
+        description: |
+          Per-method weights for the consensus vote. Must be the same length as
+          --input_obs_predictions when provided. Weights are normalized to sum to 1
+          before use. If not provided, all methods are weighted equally.
+        required: false
+        multiple: true
+        example: [1.0, 2.0]
+
+  - name: Outputs
+    description: Output arguments.
+    arguments:
+      - name: "--output"
+        alternatives: [-o]
+        type: file
+        description: Output h5mu file.
+        direction: output
+        example: output.h5mu
+      - name: "--output_obs_predictions"
+        type: string
+        default: consensus_pred
+        required: false
+        description: |
+          In which `.obs` slot to store the consensus predicted cell type.
+      - name: "--output_obs_score"
+        type: string
+        default: consensus_score
+        required: false
+        description: |
+          In which `.obs` slot to store the consensus score, defined as the fraction
+          of total weight assigned to the winning cell type.
+    __merge__: [., /src/base/h5_compression_argument.yaml]
+
+resources:
+  - type: python_script
+    path: script.py
+  - path: /src/utils/setup_logger.py
+  - path: /src/utils/compress_h5mu.py
+
+test_resources:
+  - type: python_script
+    path: test.py
+
+engines:
+  - type: docker
+    image: python:3.13-slim
+    setup:
+      - type: apt
+        packages:
+          - procps
+      - type: python
+        __merge__: [ /src/base/requirements/anndata_mudata.yaml, .]
+    __merge__: [ /src/base/requirements/python_test_setup.yaml, .]
+runners:
+  - type: executable
+  - type: nextflow
+    directives:
+      label: [lowcpu, lowmem, lowdisk]
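
The voting rule stated in the component description above can be sketched for a single cell as follows. This is only an illustration under stated assumptions, not the component's script: the function `consensus_for_cell` and its parameter names are hypothetical, and a missing probability is assumed to count as 1.

```python
from collections import defaultdict


def consensus_for_cell(labels, weights=None, probabilities=None, tie_label=None):
    """Weighted majority vote for one cell (hypothetical helper, sketch only)."""
    n_methods = len(labels)

    # Equal weights by default; normalize so the weights sum to 1.
    weights = list(weights) if weights else [1.0] * n_methods
    total = sum(weights)
    weights = [w / total for w in weights]

    # Without probability scores, every vote counts at full strength.
    probabilities = list(probabilities) if probabilities else [1.0] * n_methods

    # Accumulate the weighted vote per label.
    votes = defaultdict(float)
    for label, weight, prob in zip(labels, weights, probabilities):
        votes[label] += weight * prob

    best = max(votes.values())
    winners = [label for label, vote in votes.items() if vote == best]

    # Score: fraction of the total weighted vote held by the winning label.
    vote_sum = sum(votes.values())
    score = best / vote_sum if vote_sum > 0 else 0.0

    # Two or more labels sharing the top vote count as a tie.
    label = winners[0] if len(winners) == 1 else tie_label
    return label, score


label, score = consensus_for_cell(["T cell", "T cell", "B cell"])
print(label, round(score, 3))
```

Because labels are compared as plain strings, `"T cell"` and `"T-cell"` would be counted as two different candidates, which matches the caveat in the description about labels not being reconciled across methods.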
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+import sys
+import mudata as mu
+import numpy as np
+import pandas as pd
+
+## VIASH START
+par = {
+    "input": "test_with_probabilities.h5mu",
+    "modality": "rna",
+    "input_obs_predictions": ["scanvi_pred", "celltypist_pred", "singler_pred"],
+    "input_obs_probabilities": ["scanvi_prob", "celltypist_prob", "singler_prob"],
+    "weights": None,
+    "tie_label": None,
+    "output": "consensus_test_output.h5mu",
+    "output_obs_predictions": "consensus_pred",
+    "output_obs_score": "consensus_score",
+    "output_compression": "gzip",
+}
+meta = {"resources_dir": "src/utils"}
+## VIASH END
+
+sys.path.append(meta["resources_dir"])
+from setup_logger import setup_logger
+from compress_h5mu import write_h5ad_to_h5mu_with_compression
+
+logger = setup_logger()
+
+
+def main():
+    prediction_cols = par["input_obs_predictions"]
+    prob_cols = par["input_obs_probabilities"]
+    weights = par["weights"]
+
+    if weights and len(weights) != len(prediction_cols):
+        raise ValueError(
+            f"--weights must have the same length as --input_obs_predictions. "
+            f"Got {len(weights)} weights for {len(prediction_cols)} prediction columns."
+        )
+    if prob_cols and len(prob_cols) != len(prediction_cols):
+        raise ValueError(
+            f"--input_obs_probabilities must have the same length as --input_obs_predictions. "
+            f"Got {len(prob_cols)} probability columns for {len(prediction_cols)} prediction columns."
+        )
+
+    logger.info("Reading input data.")
+    adata = mu.read_h5ad(par["input"], mod=par["modality"])
+
+    cols_to_check = [prediction_cols]
+    if prob_cols:
+        cols_to_check.append(prob_cols)
+    for cols in cols_to_check:
+        for col in cols:
+            if col not in adata.obs.columns:
+                raise ValueError(f"Column '{col}' not found in .obs.")
+
+    # Each method is weighted equally by default, unless user-specified weights are provided
+    n_methods = len(prediction_cols)
+    logger.info("Initializing weights to matrix of ones")
+    weights_arr = np.ones(n_methods, dtype=np.float32)
+    if weights:
+        logger.info("Applying user-provided weights.")
+        weights_arr = np.array(weights, dtype=np.float32)
+    logger.info("Normalizing weights")
+    weights_arr = weights_arr / weights_arr.sum()
+
+    # Apply the weights to the probabilities in the data
+    weights = pd.DataFrame(
+        [weights_arr] * adata.n_obs, index=adata.obs.index, columns=prediction_cols
+    )
+    if prob_cols:
+        logger.info("Scaling the weights with the probabilities from each method")
+        weights = weights * adata.obs[prob_cols].astype(np.float32).to_numpy()
+    assert pd.notna(weights).all(axis=None)
+
+    logger.info("Computing weighted majority vote.")
+    pred_df = adata.obs[prediction_cols].astype(str)
+
+    # For each cell and each method (index), get the label and the weight
+    incidences_weights = pd.DataFrame(
+        {"label": pred_df.stack(), "weights": weights.stack()}
+    )
+    # Move the label to the index; there may be duplicate indices now
+    incidences_weights = incidences_weights.set_index("label", append=True).rename_axis(
+        ["cell_id", "method", "label"]
+    )
+    # Sum the weights per label; the labels with the largest summed weight are selected below
+    summed_weights = incidences_weights.groupby(level=["cell_id", "label"]).sum()
+    # Find the largest summed weight per cell
+    max_weight_per_group = summed_weights.groupby(level="cell_id").transform("max")
+    # Use that value to look up the corresponding cell IDs and labels
+    max_weights_mask = summed_weights["weights"] == max_weight_per_group["weights"]
+    entries_for_max_weights = summed_weights[max_weights_mask].reset_index(
+        level="label"
+    )
+    # Find the cases where there is a tie
+    is_duplicated = max_weights_mask.groupby(level="cell_id").sum() > 1
+    # For the ties, overwrite the label. If a cell is in the frame more than once, it is because of a tie.
+    entries_for_max_weights.loc[is_duplicated, ["label"]] = par["tie_label"]
+    # Now it's safe to just take the first index in case of duplicates, since the label and the score are the same.
+    entries_for_max_weights = entries_for_max_weights[
+        ~entries_for_max_weights.index.duplicated()
+    ]
+    # Normalize the weights
+    normalized_scores = (
+        entries_for_max_weights["weights"]
+        / incidences_weights["weights"].groupby(level="cell_id").sum()
+    )
+    # Handle division by 0
+    normalized_scores = normalized_scores.replace([np.inf, -np.inf], 0.0).fillna(0.0)
+    logger.info("Moving the output to the anndata.")
+    adata.obs[par["output_obs_predictions"]] = entries_for_max_weights["label"].astype(
+        "category"
+    )
+    adata.obs[par["output_obs_score"]] = normalized_scores
+
+    logger.info("Writing output data...")
+    write_h5ad_to_h5mu_with_compression(
+        par["output"], par["input"], par["modality"], adata, par["output_compression"]
+    )
+
+
+if __name__ == "__main__":
+    main()
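
The group-and-compare pattern used in the script above (sum weights per cell and label, compare against the per-cell maximum, flag cells with more than one maximum as ties) can be tried on a toy table. This is a simplified sketch, not the script itself: the column names, the `"Unknown"` tie label, and the use of `melt` instead of `stack` are assumptions made for the example, and all methods are weighted equally.

```python
import pandas as pd

# Toy predictions: three hypothetical methods, two cells.
obs = pd.DataFrame(
    {
        "method_a": ["T cell", "B cell"],
        "method_b": ["T cell", "NK cell"],
        "method_c": ["B cell", "T cell"],
    },
    index=["cell_1", "cell_2"],
)

# Long format: one row per (cell, method) with an equal, normalized weight.
long = (
    obs.rename_axis("cell_id")
    .reset_index()
    .melt(id_vars="cell_id", var_name="method", value_name="label")
)
long["weight"] = 1.0 / obs.shape[1]

# Total weight per (cell, label), and the largest total within each cell.
summed = long.groupby(["cell_id", "label"])["weight"].sum()
best = summed.groupby(level="cell_id").transform("max")
is_max = summed == best

# A cell with more than one label at the maximum is a tie.
n_winners = is_max.groupby(level="cell_id").sum()

# Keep one row per cell, then overwrite the label for tied cells.
winners = summed[is_max].reset_index(level="label")
winners = winners[~winners.index.duplicated()]
tied = (n_winners.reindex(winners.index) > 1).to_numpy()
winners.loc[tied, "label"] = "Unknown"

# Score: winning weight as a fraction of the cell's total weight.
score = winners["weight"] / summed.groupby(level="cell_id").sum()
result = winners.assign(score=score)[["label", "score"]]
print(result)
```

Here `cell_1` has a clear winner, while the three methods disagree completely on `cell_2`, so it receives the tie label and a score equal to one method's share of the vote.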
