Add a parallel_integration workflow #15

Merged
jakubmajercik merged 15 commits into main from add-multi-integration-workflow on Jun 4, 2026

Conversation

@jakubmajercik jakubmajercik commented Apr 23, 2026

Contributor

Summary

  • New single_cell/multi_integration composed workflow that runs harmony, scvi, scanorama, and bbknn integration methods from openpipeline in parallel on a preprocessed h5mu input, then merges each method's annotations (.obs cluster labels, .obsm embeddings + UMAP, .obsp neighbor graphs, .uns neighbor params) back into a single output h5mu.
  • Copies openpipeline's dataflow/move_anndata_slots component from openpipelines-bio/openpipeline#1163 (not yet merged/released) into this repo as a local dependency. Once the PR merges and a new openpipeline tag ships, src/dataflow/, src/utils/, and src/base/ can be removed and the local dep swapped for an openpipeline repository reference in the workflow config.

Parallel-then-merge pattern

Each integration .run() call reads state.input (the original preprocessed file), not the previous step's output — so Nextflow's DAG scheduler sees no dependencies between them and executes the four integrations concurrently. Four sequential move_anndata_slots calls then accumulate each method's slots into one merged h5mu.
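
As a rough illustration of the pattern described above (not the workflow's Nextflow code), the fan-out and sequential merge can be sketched in plain Python. `run_integration` and `merge_slots` are hypothetical stand-ins for an integration step and a move_anndata_slots call; the dicts stand in for h5mu slots, and the `X_<method>_integrated` key follows the naming pattern from the test plan.

```python
from concurrent.futures import ThreadPoolExecutor

METHODS = ["harmony", "scvi", "scanorama", "bbknn"]


def run_integration(method, preprocessed):
    """Hypothetical stand-in for one integration step.

    It only reads the original preprocessed input, never another method's output.
    """
    return {"obsm": {f"X_{method}_integrated": f"embedding from {preprocessed}"}}


def merge_slots(merged, result):
    """Hypothetical stand-in for one move_anndata_slots call."""
    for slot, entries in result.items():
        merged.setdefault(slot, {}).update(entries)
    return merged


def parallel_then_merge(preprocessed, methods=METHODS):
    # Fan out: every method gets the same input, so no call depends on
    # another and they can run concurrently.
    with ThreadPoolExecutor(max_workers=len(methods)) as pool:
        results = list(pool.map(lambda m: run_integration(m, preprocessed), methods))

    # Fan in: accumulate each method's slots one at a time.
    merged = {}
    for result in results:
        merged = merge_slots(merged, result)
    return merged
```

Calling `parallel_then_merge("input.h5mu")` returns one dict holding an `obsm` entry per method.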

Test plan

  • viash ns build --query "dataflow/move_anndata_slots|single_cell/multi_integration" — 3/3 configs built clean
  • viash test src/dataflow/move_anndata_slots/config.vsh.yaml — 18/18 unit tests pass
  • bash src/single_cell/multi_integration/integration_test.sh — full Nextflow integration test (requires S3 test data + Docker; run locally or via CI)
  • Inspect the output .h5mu to confirm all four *_integration_leiden_* obs columns, X_*_integrated / X_*_umap obsm keys, and method-specific neighbor graphs are present
  • Sanity check: single-method invocation (e.g. --integration_methods harmony) emits only the harmony slots

Runs harmony, scvi, scanorama, and bbknn integration methods from
openpipeline in parallel on a preprocessed h5mu input, then merges each
method's annotations (.obs cluster labels, .obsm embeddings + UMAP, .obsp
neighbor graphs, .uns neighbor params) into a single output h5mu.

Copies openpipeline's dataflow/move_anndata_slots component from PR #1163
(not yet merged/released) as a local dependency. When the PR merges and
a new openpipeline tag ships, src/dataflow/, src/utils/, and src/base/
can be removed and the local dep swapped for an openpipeline repository
reference in multi_integration's config.
@jakubmajercik jakubmajercik marked this pull request as ready for review April 23, 2026 13:33
@jakubmajercik jakubmajercik requested a review from dorien-er April 23, 2026 13:33
Comment thread src/dataflow/move_anndata_slots/config.vsh.yaml Outdated
Comment thread src/dataflow/move_anndata_slots/script.py
Comment thread src/single_cell/multi_integration/config.vsh.yaml Outdated
Comment thread src/single_cell/parallel_integration/main.nf Outdated
jakubmajercik added a commit that referenced this pull request Jun 2, 2026
…a_slots overwrite behavior

Addresses PR #15 review:
- Rename the single_cell/multi_integration workflow to parallel_integration
  to disambiguate from "multi" in cellranger_multi and reflect that single-
  method runs are also supported. Updates the config name, test.nf include
  path and references, and the integration_test.sh main-script path.
- Fix the move_anndata_slots description: by default an existing target key
  raises an error; --allow_overwrite opts into overwriting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
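
For reference, the overwrite rule described in this commit can be sketched on plain dicts. This is an illustration only, not the component's script.py: `move_key` is a hypothetical helper, and whether the real component also deletes the key from the source is not stated here, so the sketch simply copies.

```python
def move_key(source, target, key, allow_overwrite=False):
    """Copy source[key] into target, refusing to overwrite unless opted in."""
    if key not in source:
        raise KeyError(f"'{key}' not found in source")
    if key in target and not allow_overwrite:
        # Default behavior: an existing target key is an error.
        raise ValueError(f"'{key}' already exists in target; pass allow_overwrite=True")
    target[key] = source[key]
    return target
```

With the default, a second move of the same key raises `ValueError`; passing `allow_overwrite=True` replaces the existing value instead.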
@jakubmajercik jakubmajercik force-pushed the add-multi-integration-workflow branch from f85b539 to 80a1105 Compare June 2, 2026 09:25
Addresses PR #15 review: chaining the integration .run() calls with `|` made
Nextflow treat each method as dependent on the previous one's output channel,
so per sample the four methods ran sequentially even though each only reads the
original preprocessed input.

Fork `integration_ch` into one branch per method (harmony, scvi, scanorama,
bbknn) so they have no mutual dependency and run concurrently, then re-sync the
branches with mix + groupTuple(by: 0, size: 4). Each method's `*_output` is
picked explicitly from the grouped states (never a blind state merge) so an
unset output cannot clobber another branch's path; a method skipped via runIf
still passes its event through, keeping the group size at 4. The sequential
move_slots merge of per-method annotations is unchanged.
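
The re-sync step above can be pictured with a small Python sketch (an analogy to mix + groupTuple, not the main.nf code). `sync_branches` is a hypothetical helper; events are `(sample_id, state)` tuples, and a skipped method is assumed to pass through a state whose `*_output` is unset.

```python
from collections import defaultdict

METHODS = ("harmony", "scvi", "scanorama", "bbknn")


def sync_branches(events, methods=METHODS):
    """Group per-branch events by sample id and pick each method's output explicitly."""
    grouped = defaultdict(list)
    for sample_id, state in events:
        grouped[sample_id].append(state)

    synced = {}
    for sample_id, states in grouped.items():
        # Rough counterpart of size: 4. The sketch raises on an incomplete group.
        if len(states) != len(methods):
            raise ValueError(f"{sample_id}: expected {len(methods)} events, got {len(states)}")
        picked = {}
        for method in methods:
            key = f"{method}_output"
            # Take the first set value; unset values from other branches are ignored,
            # so they cannot clobber a real path the way a blind dict merge could.
            values = [s[key] for s in states if s.get(key) is not None]
            picked[key] = values[0] if values else None
        synced[sample_id] = picked
    return synced
```

A blind `dict.update` over the four states would let a later branch's unset `harmony_output` replace the real path; picking per key avoids that.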
@jakubmajercik jakubmajercik changed the title add multi_integration workflow Add parallel_integration workflow Jun 2, 2026
@jakubmajercik jakubmajercik changed the title Add parallel_integration workflow Add a parallel_integration workflow Jun 2, 2026
@jakubmajercik jakubmajercik requested a review from dorien-er June 2, 2026 14:34
Comment thread src/base/requirements/anndata_mudata.yaml Outdated
Comment thread src/single_cell/parallel_integration/config.vsh.yaml Outdated
Comment thread src/single_cell/parallel_integration/config.vsh.yaml Outdated
Comment thread src/single_cell/parallel_integration/test.nf
anndata 0.12.7 -> 0.12.16, scipy !=1.17.* -> ~=1.17.1, mudata 0.3.2 -> 0.3.8, viashpy 0.8.0 -> 0.10.0
Renames --early_stopping*, --max_epochs, --reduce_lr_on_plateau, --lr_factor,
--lr_patience to scvi_-prefixed names; updates main.nf fromState and test.nf.
Replace per-method layer/batch/covariate args with shared --layer_log_normalized_counts,
--layer_raw_counts, --obs_batch, --obs_covariates, --var_input. Add a validation map
that fails early when a selected method's required layer is missing. Mirrors the
argument pattern in single_cell/process_integrate_annotate.
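
A minimal Python sketch of such a validation map, for illustration only. The method-to-layer assignment below is an assumption made for the example, not taken from the workflow, and `validate_layers` is a hypothetical helper; only the argument names come from this commit.

```python
# Assumed mapping from method to the shared layer argument it needs.
REQUIRED_LAYER_ARG = {
    "harmony": "layer_log_normalized_counts",
    "scanorama": "layer_log_normalized_counts",
    "bbknn": "layer_log_normalized_counts",
    "scvi": "layer_raw_counts",
}


def validate_layers(selected_methods, args):
    """Fail before any integration runs if a selected method lacks its layer argument."""
    problems = []
    for method in selected_methods:
        arg_name = REQUIRED_LAYER_ARG[method]
        if not args.get(arg_name):
            problems.append(f"{method} requires --{arg_name}")
    if problems:
        raise ValueError("; ".join(problems))
```

Checking every selected method up front and reporting all problems in one error avoids discovering a missing layer only after the earlier methods have already run.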
New test_workflows/assert_integration_output component reads the output h5mu and
checks the expected .obs/.obsm/.obsp/.uns slots per method are present. Wired into
parallel_integration/test.nf with expected slots derived from each case's methods.
Add a terminal toSortedList/map assertion so the per-event slot checks can't be
silently skipped when no events are emitted. Mirrors the openpipeline
totalvi_leiden test pattern.
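
The two test commits above combine a per-event slot check with a terminal check on the collected events. A rough Python sketch of the idea follows; it is not the assert_integration_output component. The helper names are hypothetical, and the `X_<method>_integrated` / `X_<method>_umap` keys are assumed instances of the patterns listed in the test plan.

```python
def expected_slots(methods):
    """Derive the expected obsm keys from the methods selected for a test case."""
    return {
        "obsm": {f"X_{m}_{suffix}" for m in methods for suffix in ("integrated", "umap")}
    }


def check_event(present, methods):
    """Per-event check: every expected key must be present in the output."""
    for slot, keys in expected_slots(methods).items():
        missing = keys - set(present.get(slot, ()))
        assert not missing, f"missing {slot} keys: {sorted(missing)}"


def check_all(events, methods):
    """Terminal check: fail if nothing was emitted, then check each event."""
    # Without this, an empty list would pass because the loop body never runs.
    assert len(events) > 0, "no output events were emitted"
    for present in events:
        check_event(present, methods)
    return len(events)
```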
@jakubmajercik jakubmajercik requested a review from dorien-er June 3, 2026 14:15
Comment thread src/single_cell/parallel_integration/config.vsh.yaml
Comment thread src/single_cell/parallel_integration/config.vsh.yaml
- Add --obs_categorical_covariates and --obs_numerical_covariates (none by
  default), mapped to scvi_leiden's obs_categorical_covariate / obs_continuous_covariate.
- Expose --output_scvi_model; carry scvi_model through the branch sync and emit it
  when scVI is selected.
@jakubmajercik jakubmajercik requested a review from dorien-er June 3, 2026 15:41

@dorien-er dorien-er left a comment

Contributor

LGTM! Feel free to merge after merging main into the branch and adding two changelog entries for the workflow + component

@jakubmajercik jakubmajercik merged commit fc7a414 into main Jun 4, 2026
3 checks passed
@jakubmajercik jakubmajercik deleted the add-multi-integration-workflow branch June 4, 2026 08:10

2 participants