Skip to content

Phase 1: Disjoint grouping core implementation#341

Open
s-canchi wants to merge 16 commits intodisjoint-groupingfrom
001-disjoint-grouping
Open

Phase 1: Disjoint grouping core implementation#341
s-canchi wants to merge 16 commits intodisjoint-groupingfrom
001-disjoint-grouping

Conversation

@s-canchi
Copy link
Collaborator

Summary

  • Three independently callable subcommands: disjoint-group, per-group partition, assemble-groups
  • YAML manifest as handoff between steps
  • Works for paired (multi-locus) and unpaired (single-locus) data
  • Works for simulated and real data
  • Disjoint grouping does not degrade clustering accuracy (validated HA and vsearch against standard partition)
  • CDR3 length disjointness validated on simulated and real data
  • Test framework integrated into test/test.py with reference results for all four configs
  • No regressions (partis-test.py --quick and --paired --no-simu pass)

Test plan

  • partis-test.py --quick passes
  • partis-test.py --paired --no-simu passes
  • Disjoint grouping tests pass for paired simu, paired data, unpaired simu, unpaired data
  • Clustering accuracy comparison shows no degradation

Related: #337

s-canchi and others added 8 commits March 11, 2026 17:04
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ouping

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-dir arg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…idation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… (T019)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@s-canchi s-canchi requested a review from psathyrella March 12, 2026 13:28
@s-canchi
Copy link
Collaborator Author

Design choices in Phase 1

  1. Independent subcommands as internal machinery: three separate steps (disjoint-group, partition, assemble-groups) rather than a single command. A disjoint-partition wrapper for standard partis single-command UX is planned for the next phase.

  2. YAML manifest as handoff contract: new artifact type for partis, describing groups, file paths, and metadata between steps.

  3. Per-group FASTA files as intermediate artifacts: writes to disk between steps rather than keeping data in memory, enabling independent parallel execution. I/O metrics to be captured during scale validation to quantify the tradeoff.

  4. Custom merge logic instead of merge_paired_yamls(): merge_yamls() asserts matching partition step counts, which independent runs do not produce. Custom logic extracts only the best partition from each group and reconciles germlines.

@psathyrella
Copy link
Owner

testing

  • Should run on --slow tests, as well as paired paper validation data, to check performance
  • disjoint tests seem not to work with --dry? I use --dry quite a bit to get the commands that would run to help with testing other things
  • it appears that the disjoint-group results from testing aren't actually compared? I don't see them in the comparison output
  • disjoint group tests should be in same order as other tests, as a variation of partition (probably after subset-partition)
  • what's the purpose of having --disjoint-dir separate from --workdir?
  • it might be better to test disjoint-group without --parameter-dir since in practice I think it would always infer its own parameters
  • it seems to rerun group creation even if the groups are already there, which seems worth avoiding.
  • it prints step 1: and step 2: but not 3: (just the component partition commands)
  • I don't think I understand why we need the new function run_disjoint_group_tests() -- it doesn't fit with the existing tests, which all consist of single commands that run within a common framework. It also seems like most of the code in this new function would need to be duplicated elsewhere if you were running disjoint partitioning on non-test data.
  • i don't think we want a check for families spread across disjoint groups -- this will just fail if the shm indels fall within cdr3, it's just an assumption. We want to run the normal testing performance comparison on results

disjointgrouper.py

  • do we really need a new function to group seqs by cdr3? I think this is done in several other places already (e.g. paircluster.py 1766; see also utils.collapse_naive_seqs_with_hashes and utils.split_clusters_by_cdr3). Even if we have to modify an existing function, it would be nice to not start from scratch
  • it would be nice to use utils.write_fasta e.g. in write_group_fastas
  • why is there a hardcoded yaml version in write_manifest?
  • get_sw_cache_path: I would really prefer if we could sort out which location we expect, rather than just looking in both, since (for instance) the latter is prone to using the wrong one if they both exist. It would probably also be simpler to follow subset-partition in handling everything as paired samples since it's so trivial to rearrange unpaired input to behave as paired input.
  • get_loci: if we really want this function it should probably be in utils.py since that's where the functions with similar names live. But I think really it would be better to not handle paired and unpaired as separate logic paths
  • find_partis_cmd: this also seems like, if we really do need it, it should be a general thing that lots of commands need, so in utils.py. But the docs just say that partis should now be in the path, so is there a reason we can't just continue that convention?
  • I feel the same way about get_infname_for_locus: it even has an exception that it isn't sure what to use. I'm quite skeptical that we need any of this guessing infrastructure
  • get_parameter_dir_for_locus: samesies
  • auto_cache_parameters: it seems like this passes n-procs and is-simu, but what about all the other args that could be set that need to be passed to cache-parameters? See partis.run_step, which is doing the same thing. Hopefully we can use that function rather than starting a new one?
  • it seems that it's set up to handle all disjoint groups from all loci together, but I'm not sure this is best. It seems it might be a lot simpler to handle one locus at a time without needing to know anything about loci.
  • it needs to run locus splitting/pair info extraction so we don't lose that functionality that exists elsewhere.
  • assemble_merged_output: it seems an awful lot simpler to me to add an option to merge_yamls rather than duplicate the entire function to avoid an assertion.
  • this doesn't seem to do anything to handle pairing info, e.g. merging paired partition, pair info cleaning. So unless I'm misunderstanding (I could be!) this is essentially only handling the single chain partition step of paired clustering -- is this intentional, we'll just add it later?

Big picture, sorry in advance if this is supposed to be just a stepping stone, but I think disjoint grouping just needs to be run as an option, eventually default, for partitioning -- we can't require the user to run multiple commands with complicated path dependencies (again, sorry if I'm misunderstanding). Like for instance

partis subset-partition --infname xxx.fa --paired-loci --disjoint-group

from the user's perspective should run the same as without disjoint grouping, it just runs a more complicated apportionment algorithm with extra parameter caching step. Now, exactly how much of the existing subset partition function should it use? That's for sure still to be decided, they are doing somewhat different, we of cousre don't have to reuse everything. But at minimum, conceptually, this should be behaving similarly.

Again, thanks so much for pushing on this!

@s-canchi
Copy link
Collaborator Author

Thank you for the detailed review. I agree with the code-level feedback across the board (reuse existing CDR3 grouping functions, use utils.write_fasta, integrate via run_step(), modify merge_yamls() instead of duplicating, restructure tests into the standard framework, etc). I will follow up with implementation.

On the bigger architectural question, I agree that subset-partition --disjoint-group is the right interface rather than separate subcommands.

What I plan to change in subset-partition

I will modify subset_partition() in bin/partis to support a --disjoint-group flag. When set, three things change relative to the current random-split behavior:

  1. Apportionment: instead of loading all sequences into memory and splitting randomly via subset_paired_queries(), the disjoint path runs cache-parameters --only-smith-waterman as a subprocess (which exits and releases memory), then reads the SW annotations and groups UIDs by cdr3_length. --parameter-dir is the escape hatch for cases where SW annotation was already done externally. For the grouping itself, I will reuse existing CDR3 grouping functions. Per-group FASTAs and a manifest are written to disk.

  2. Per-group partition dispatch: each CDR3 group runs standard partition as a subprocess, same as current subset runs. The manifest enables resumability (skip groups whose output already exists).

  3. Merge step: since CDR3 groups are biologically disjoint (no clonal family spans two CDR3 lengths), the merge is concatenation + germline reconciliation via merge_yamls() with the partition count assertion relaxed. The re-partition step that random-split subset-partition requires (because random subsets can split families) is not needed.

Without the --disjoint-group flag, existing subset-partition behavior is completely untouched.

Paired path: everything routes through the existing paired path (unpaired as degenerate single locus), using run_all_loci() / run_step() for parameter caching and locus handling.

HPC / resource optimization

One motivation for the original separate-command design was that each step has different resource profiles. This applies to SW annotation too, since for multi-million sequence datasets chunking SW annotation across nodes is much faster than running it as a single serial process. Would you be open to a mode where the disjoint path writes the manifest/groups and exits, so an external orchestrator can submit per-group jobs with right-sized resources? This could be a flag like --manifest-only, or I could keep the individual subcommands available as building blocks. The single-command path would remain the default.

Pairing info

Phase 1 scope was single-chain partition correctness. The paired pairing/cleaning steps should work unchanged on the merged single-chain outputs from disjoint grouping, but I have not confirmed that yet. I will verify as part of the integration.

@psathyrella
Copy link
Owner

This sounds great, thanks!

What I plan to change in subset-partition

I will modify subset_partition() in bin/partis to support a --disjoint-group flag. When set, three things change relative to the current random-split behavior:

1. **Apportionment**: instead of loading all sequences into memory and splitting randomly via `subset_paired_queries()`, the disjoint path runs `cache-parameters --only-smith-waterman` as a subprocess (which exits and releases memory), then reads the SW annotations and groups UIDs by cdr3_length. `--parameter-dir` is the escape hatch for cases where SW annotation was already done externally. For the grouping itself, I will reuse existing CDR3 grouping functions. Per-group FASTAs and a manifest are written to disk.

great.

2. **Per-group partition dispatch**: each CDR3 group runs standard partition as a subprocess, same as current subset runs. The manifest enables resumability (skip groups whose output already exists).

This sounds good, although the current subset partition already skips individual subset partition runs if output exists, although using the manifest does sound like a nice addition.

3. **Merge step**: since CDR3 groups are biologically disjoint (no clonal family spans two CDR3 lengths), the merge is concatenation + germline reconciliation via `merge_yamls()` with the partition count assertion relaxed. The re-partition step that random-split subset-partition requires (because random subsets can split families) is not needed.

yes, exactly, just concatenation without looking for new merges.

Without the --disjoint-group flag, existing subset-partition behavior is completely untouched.

Paired path: everything routes through the existing paired path (unpaired as degenerate single locus), using run_all_loci() / run_step() for parameter caching and locus handling.

HPC / resource optimization

One motivation for the original separate-command design was that each step has different resource profiles. This applies to SW annotation too, since for multi-million sequence datasets chunking SW annotation across nodes is much faster than running it as a single serial process. Would you be open to a mode where the disjoint path writes the manifest/groups and exits, so an external orchestrator can submit per-group jobs with right-sized resources? This could be a flag like --manifest-only, or I could keep the individual subcommands available as building blocks. The single-command path would remain the default.

Yes, that sounds great. I think we want to enable the case of super large samples that require finagling different resource allocations at different steps, but the most common use case is still running on one machine with one command, so that should stay the default.

s-canchi and others added 8 commits March 13, 2026 10:17
…_partition_only to merge_yamls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hestration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…run()/compare framework

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rtition

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… dirs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-partition

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@s-canchi
Copy link
Collaborator Author

s-canchi commented Mar 14, 2026

Post-review update: all items addressed + validation results

All review feedback has been addressed and validation is complete across simulated tests and the paired paper datasets.

Architecture (big picture)

  • Integrated disjoint grouping as subset-partition --disjoint-group instead of separate subcommands. From the user perspective it runs the same as standard subset-partition, just with a different apportionment algorithm.
  • Everything runs through subset_partition() in bin/partis. The disjoint path replaces the random-split step with: SW annotation + CDR3 length grouping + per-group partition + assembly merge.
  • Locus splitting and pair info extraction handled by existing run_all_loci() framework.

disjointgrouper.py cleanup

  • Removed 6 duplicate helper functions (~80 lines): find_partis_cmd(), get_sw_cache_path(), get_loci(), get_infname_for_locus(), get_parameter_dir_for_locus(), auto_cache_parameters().
  • Replaced custom CDR3 grouping with existing utils.group_seqs_by_value().
  • Using utils.write_fasta() instead of manual FASTA writing.
  • Removed hardcoded YAML version in write_manifest().
  • Modified merge_yamls() with best_partition_only parameter instead of duplicating merge logic in assemble_merged_output().

Test restructuring

  • Removed run_disjoint_group_tests(). Disjoint test is now a standard self.tests entry processed by run()/compare_stuff().
  • --dry works automatically through the standard framework.
  • Results are compared via CCF metrics (purity/completeness) and pair-clean performance, same as all other partition tests.
  • Removed the disjointness assertion (CDR3 families not splitting across groups). As noted in review, this is an assumption, not something to assert. Validation is through the standard CCF metrics.
  • Test placed after subset-partition in the test ordering.
  • --disjoint-dir removed; uses standard --paired-outdir/--outdir paths.
  • Test runs without --parameter-dir (SW annotation auto-triggers).
  • Skip-if-exists logic added for grouping step, per-group partition, and assembly.
  • Step 3 label now prints in test output.

Pair cleaning integration

  • --input-metafnames passed to per-group partition commands so paired-uids propagate through to per-group annotations.
  • Assembled output linked to single-chain/partition-{locus}.yaml where combine_inf_chains() expects it.
  • combine_inf_chains() called after disjoint subset partition returns in run_all_loci().
  • Pair-cleaned output and true-pair-clean-performance.csv produced identically to standard partition.

Validation results

Simulated data: normal paired tests (--paired --no-simu, ~56 seqs per locus)

Method Purity Completeness Pair Clean Correct Unpaired Mispaired
partition 0.980 0.588 0.726 0.133 0.142
subset 0.980 0.677 0.726 0.133 0.142
disjoint 0.980 0.568 0.726 0.133 0.142

Simulated data: slow paired tests (--slow --paired, ~1700 seqs)

Method Purity Completeness Pair Clean Correct Unpaired Mispaired
partition 0.890 0.788 0.795 0.156 0.049
subset 0.878 0.776 0.795 0.145 0.055
disjoint 0.903 0.792 0.800 0.159 0.042

Disjoint partition matches or slightly outperforms standard partition on all metrics at both scales.

Paired paper validation datasets (10x data, Zenodo 6998443)

Ran subset-partition --disjoint-group --paired-loci on all 4 datasets from the paired paper using existing parameters. Compared pair-cleaned output against existing standard partition results.

Single-chain (pre pair-clean) sequence counts are identical between standard and disjoint for all datasets and loci. Zero sequence loss from CDR3 length grouping.

Dataset Locus Std seqs DJ seqs Std purity Std completeness DJ purity DJ completeness
hs-2-pbmc igh 1007 909 1.0000 1.0000 1.0000 1.0000
hs-2-pbmc igk 620 605 0.9537 0.9653 0.9653 0.9537
hs-2-pbmc igl 426 417 0.9496 0.9616 0.9616 0.9496
mm-balbc igh 923 849 0.9988 1.0000 1.0000 0.9988
mm-balbc igk 935 900 0.9122 0.9589 0.9589 0.9122
mm-balbc igl 61 59 0.9492 0.9492 0.9492 0.9492
hs-1-postvax igh 8498 7678 0.9995 0.9991 0.9991 0.9995
hs-1-postvax igk 4800 4764 0.9163 0.9399 0.9399 0.9163
hs-1-postvax igl 4096 4056 0.9258 0.9383 0.9383 0.9258
hs-1-prevax igh 9187 8253 0.9992 0.9978 0.9978 0.9992
hs-1-prevax igk 5329 5284 0.9093 0.9375 0.9375 0.9093
hs-1-prevax igl 4360 4308 0.9248 0.9136 0.9248 0.9136

"Std purity" = purity of standard partition with disjoint as reference. "DJ purity" = purity of disjoint partition with standard as reference. The metrics are symmetric: purity(A vs B) = completeness(B vs A).

Summary:

  • IGH (heavy chain): near-perfect agreement across all datasets (purity/completeness > 0.997). CDR3 length grouping has essentially no effect on heavy chain clustering.
  • Light chains (IGK/IGL): 91-97% agreement. Differences arise from pair cleaning operating on slightly different cluster boundaries. Neither method is systematically better.
  • Sequence count differences in the final pair-cleaned output (~10% fewer IGH in disjoint) are entirely from pair cleaning, not from CDR3 grouping. Single-chain counts are identical.

CDR3 length disjointness

Validated across all test scales: 585 simulated families (200 unpaired + 385 paired), 100 real paired sequences, 1,371 real unpaired sequences, and all 4 paper validation datasets (~40K total sequences). Zero families split across CDR3 length groups in all cases.

If these changes look good, the next phase I had planned out was to implement naive hamming fraction sub-grouping within CDR3 length bins which would become necessary if CDR3 length bins are too large at scale. Testing the CDR3 and CDR3 + hfrac grouping on the 4m subsets from previous benchmarking steps and comparing results would let us compare the method and resource usage on real data. One more motivation for testing the hfrac based sub grouping was to check if enabling HA as default for disjoint-grouping would be feasible. What are your thoughts @psathyrella ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants