Phase 1: Disjoint grouping core implementation by s-canchi · Pull Request #341 · psathyrella/partis

s-canchi · 2026-03-11T21:22:29Z

Summary

Three independently callable subcommands: disjoint-group, per-group partition, assemble-groups
YAML manifest as handoff between steps
Works for paired (multi-locus) and unpaired (single-locus) data
Works for simulated and real data
Disjoint grouping does not degrade clustering accuracy (validated HA and vsearch against standard partition)
CDR3 length disjointness validated on simulated and real data
Test framework integrated into test/test.py with reference results for all four configs
No regressions (partis-test.py --quick and --paired --no-simu pass)

Test plan

partis-test.py --quick passes
partis-test.py --paired --no-simu passes
Disjoint grouping tests pass for paired simu, paired data, unpaired simu, unpaired data
Clustering accuracy comparison shows no degradation

Related: #337

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ouping Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…-dir arg Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…idation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… (T019) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

s-canchi · 2026-03-12T13:37:16Z

Design choices in Phase 1

Independent subcommands as internal machinery: three separate steps (disjoint-group, partition, assemble-groups) rather than a single command. A disjoint-partition wrapper for standard partis single-command UX is planned for the next phase.
YAML manifest as handoff contract: new artifact type for partis, describing groups, file paths, and metadata between steps.
Per-group FASTA files as intermediate artifacts: writes to disk between steps rather than keeping data in memory, enabling independent parallel execution. I/O metrics to be captured during scale validation to quantify the tradeoff.
Custom merge logic instead of merge_paired_yamls(): merge_yamls() asserts matching partition step counts, which independent runs do not produce. Custom logic extracts only the best partition from each group and reconciles germlines.

psathyrella · 2026-03-12T17:47:10Z

testing

Should run on --slow tests, as well as paired paper validation data, to check performance
disjoint tests seem not to work with --dry? I use --dry quite a bit to get the commands that would run to help with testing other things
it appears that the disjoint-group results from testing aren't actually compared? I don't see them in the comparison output
disjoint group tests should be in same order as other tests, as a variation of partition (probably after subset-partition)
what's the purpose of having --disjoint-dir separate from --workdir?
it might be better to test disjoint-group without --parameter-dir since in practice I think it would always infer its own parameters
it seems to rerun group creation even if the groups are already there, which seems worth avoiding.
it prints step 1: and step 2: but not 3: (just the component partition commands)
I don't think I understand why we need the new function run_disjoint_group_tests() -- it doesn't fit with the existing tests, which all consist of single commands that run within a common framework. It also seems like most of the code in this new function would need to be duplicated elsewhere if you were running disjoint partitioning on non-test data.
i don't think we want a check for families spread across disjoint groups -- this will just fail if the shm indels fall within cdr3, it's just an assumption. We want to run the normal testing performance comparison on results

disjointgrouper.py

do we really need a new function to group seqs by cdr3? I think this is done in several other places already (e.g. paircluster.py 1766; see also utils.collapse_naive_seqs_with_hashes and utils.split_clusters_by_cdr3). Even if we have to modify an existing function, it would be nice to not start from scratch
it would be nice to use utils.write_fasta e.g. in write_group_fastas
why is there a hardcoded yaml version in write_manifest?
get_sw_cache_path: I would really prefer if we could sort out which location we expect, rather than just looking in both, since (for instance) the latter is prone to using the wrong one if they both exist. It would probably also be simpler to follow subset-partition in handling everything as paired samples since it's so trivial to rearrange unpaired input to behave as paired input.
get_loci: if we really want this function it should probably be in utils.py since that's where the functions with similar names live. But I think really it would be better to not handle paired and unpaired as separate logic paths
find_partis_cmd: this also seems like, if we really do need it, it should be a general thing that lots of commands need, so in utils.py. But the docs just say that partis should now be in the path, so is there a reason we can't just continue that convention?
I feel the same way about get_infname_for_locus: it even has an exception that it isn't sure what to use. I'm quite skeptical that we need any of this guessing infrastructure
get_parameter_dir_for_locus: samesies
auto_cache_parameters: it seems like this passes n-procs and is-simu, but what about all the other args that could be set that need to be passed to cache-parameters? See partis.run_step, which is doing the same thing. Hopefully we can use that function rather than starting a new one?
it seems that it's set up to handle all disjoint groups from all loci together, but I'm not sure this is best. It seems it might be a lot simpler to handle one locus at a time without needing to know anything about loci.
it needs to run locus splitting/pair info extraction so we don't lose that functionality that exists elsewhere.
assemble_merged_output: it seems an awful lot simpler to me to add an option to merge_yamls rather than duplicate the entire function to avoid an assertion.
this doesn't seem to do anything to handle pairing info, e.g. merging paired partition, pair info cleaning. So unless I'm misunderstanding (I could be!) this is essentially only handling the single chain partition step of paired clustering -- is this intentional, we'll just add it later?

Big picture, sorry in advance if this is supposed to be just a stepping stone, but I think disjoint grouping just needs to be run as an option, eventually default, for partitioning -- we can't require the user to run multiple commands with complicated path dependencies (again, sorry if I'm misunderstanding). Like for instance

partis subset-partition --infname xxx.fa --paired-loci --disjoint-group

from the user's perspective should run the same as without disjoint grouping, it just runs a more complicated apportionment algorithm with extra parameter caching step. Now, exactly how much of the existing subset partition function should it use? That's for sure still to be decided, they are doing somewhat different, we of cousre don't have to reuse everything. But at minimum, conceptually, this should be behaving similarly.

Again, thanks so much for pushing on this!

s-canchi · 2026-03-12T20:04:42Z

Thank you for the detailed review. I agree with the code-level feedback across the board (reuse existing CDR3 grouping functions, use utils.write_fasta, integrate via run_step(), modify merge_yamls() instead of duplicating, restructure tests into the standard framework, etc). I will follow up with implementation.

On the bigger architectural question, I agree that subset-partition --disjoint-group is the right interface rather than separate subcommands.

What I plan to change in subset-partition

I will modify subset_partition() in bin/partis to support a --disjoint-group flag. When set, three things change relative to the current random-split behavior:

Apportionment: instead of loading all sequences into memory and splitting randomly via subset_paired_queries(), the disjoint path runs cache-parameters --only-smith-waterman as a subprocess (which exits and releases memory), then reads the SW annotations and groups UIDs by cdr3_length. --parameter-dir is the escape hatch for cases where SW annotation was already done externally. For the grouping itself, I will reuse existing CDR3 grouping functions. Per-group FASTAs and a manifest are written to disk.
Per-group partition dispatch: each CDR3 group runs standard partition as a subprocess, same as current subset runs. The manifest enables resumability (skip groups whose output already exists).
Merge step: since CDR3 groups are biologically disjoint (no clonal family spans two CDR3 lengths), the merge is concatenation + germline reconciliation via merge_yamls() with the partition count assertion relaxed. The re-partition step that random-split subset-partition requires (because random subsets can split families) is not needed.

Without the --disjoint-group flag, existing subset-partition behavior is completely untouched.

Paired path: everything routes through the existing paired path (unpaired as degenerate single locus), using run_all_loci() / run_step() for parameter caching and locus handling.

HPC / resource optimization

One motivation for the original separate-command design was that each step has different resource profiles. This applies to SW annotation too, since for multi-million sequence datasets chunking SW annotation across nodes is much faster than running it as a single serial process. Would you be open to a mode where the disjoint path writes the manifest/groups and exits, so an external orchestrator can submit per-group jobs with right-sized resources? This could be a flag like --manifest-only, or I could keep the individual subcommands available as building blocks. The single-command path would remain the default.

Pairing info

Phase 1 scope was single-chain partition correctness. The paired pairing/cleaning steps should work unchanged on the merged single-chain outputs from disjoint grouping, but I have not confirmed that yet. I will verify as part of the integration.

psathyrella · 2026-03-12T21:40:12Z

This sounds great, thanks!

What I plan to change in subset-partition

I will modify subset_partition() in bin/partis to support a --disjoint-group flag. When set, three things change relative to the current random-split behavior:

1. **Apportionment**: instead of loading all sequences into memory and splitting randomly via `subset_paired_queries()`, the disjoint path runs `cache-parameters --only-smith-waterman` as a subprocess (which exits and releases memory), then reads the SW annotations and groups UIDs by cdr3_length. `--parameter-dir` is the escape hatch for cases where SW annotation was already done externally. For the grouping itself, I will reuse existing CDR3 grouping functions. Per-group FASTAs and a manifest are written to disk.

great.

2. **Per-group partition dispatch**: each CDR3 group runs standard partition as a subprocess, same as current subset runs. The manifest enables resumability (skip groups whose output already exists).

This sounds good, although the current subset partition already skips individual subset partition runs if output exists, although using the manifest does sound like a nice addition.

3. **Merge step**: since CDR3 groups are biologically disjoint (no clonal family spans two CDR3 lengths), the merge is concatenation + germline reconciliation via `merge_yamls()` with the partition count assertion relaxed. The re-partition step that random-split subset-partition requires (because random subsets can split families) is not needed.

yes, exactly, just concatenation without looking for new merges.

Without the --disjoint-group flag, existing subset-partition behavior is completely untouched.

Paired path: everything routes through the existing paired path (unpaired as degenerate single locus), using run_all_loci() / run_step() for parameter caching and locus handling.

HPC / resource optimization

One motivation for the original separate-command design was that each step has different resource profiles. This applies to SW annotation too, since for multi-million sequence datasets chunking SW annotation across nodes is much faster than running it as a single serial process. Would you be open to a mode where the disjoint path writes the manifest/groups and exits, so an external orchestrator can submit per-group jobs with right-sized resources? This could be a flag like --manifest-only, or I could keep the individual subcommands available as building blocks. The single-command path would remain the default.

Yes, that sounds great. I think we want to enable the case of super large samples that require finagling different resource allocations at different steps, but the most common use case is still running on one machine with one command, so that should stay the default.

…_partition_only to merge_yamls Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…hestration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…run()/compare framework Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…rtition Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… dirs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…-partition Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

s-canchi · 2026-03-14T00:51:00Z

Post-review update: all items addressed + validation results

All review feedback has been addressed and validation is complete across simulated tests and the paired paper datasets.

Architecture (big picture)

Integrated disjoint grouping as subset-partition --disjoint-group instead of separate subcommands. From the user perspective it runs the same as standard subset-partition, just with a different apportionment algorithm.
Everything runs through subset_partition() in bin/partis. The disjoint path replaces the random-split step with: SW annotation + CDR3 length grouping + per-group partition + assembly merge.
Locus splitting and pair info extraction handled by existing run_all_loci() framework.

disjointgrouper.py cleanup

Removed 6 duplicate helper functions (~80 lines): find_partis_cmd(), get_sw_cache_path(), get_loci(), get_infname_for_locus(), get_parameter_dir_for_locus(), auto_cache_parameters().
Replaced custom CDR3 grouping with existing utils.group_seqs_by_value().
Using utils.write_fasta() instead of manual FASTA writing.
Removed hardcoded YAML version in write_manifest().
Modified merge_yamls() with best_partition_only parameter instead of duplicating merge logic in assemble_merged_output().

Test restructuring

Removed run_disjoint_group_tests(). Disjoint test is now a standard self.tests entry processed by run()/compare_stuff().
--dry works automatically through the standard framework.
Results are compared via CCF metrics (purity/completeness) and pair-clean performance, same as all other partition tests.
Removed the disjointness assertion (CDR3 families not splitting across groups). As noted in review, this is an assumption, not something to assert. Validation is through the standard CCF metrics.
Test placed after subset-partition in the test ordering.
--disjoint-dir removed; uses standard --paired-outdir/--outdir paths.
Test runs without --parameter-dir (SW annotation auto-triggers).
Skip-if-exists logic added for grouping step, per-group partition, and assembly.
Step 3 label now prints in test output.

Pair cleaning integration

--input-metafnames passed to per-group partition commands so paired-uids propagate through to per-group annotations.
Assembled output linked to single-chain/partition-{locus}.yaml where combine_inf_chains() expects it.
combine_inf_chains() called after disjoint subset partition returns in run_all_loci().
Pair-cleaned output and true-pair-clean-performance.csv produced identically to standard partition.

Validation results

Simulated data: normal paired tests (`--paired --no-simu`, ~56 seqs per locus)

Method	Purity	Completeness	Pair Clean Correct	Unpaired	Mispaired
partition	0.980	0.588	0.726	0.133	0.142
subset	0.980	0.677	0.726	0.133	0.142
disjoint	0.980	0.568	0.726	0.133	0.142

Simulated data: slow paired tests (`--slow --paired`, ~1700 seqs)

Method	Purity	Completeness	Pair Clean Correct	Unpaired	Mispaired
partition	0.890	0.788	0.795	0.156	0.049
subset	0.878	0.776	0.795	0.145	0.055
disjoint	0.903	0.792	0.800	0.159	0.042

Disjoint partition matches or slightly outperforms standard partition on all metrics at both scales.

Paired paper validation datasets (10x data, Zenodo 6998443)

Ran subset-partition --disjoint-group --paired-loci on all 4 datasets from the paired paper using existing parameters. Compared pair-cleaned output against existing standard partition results.

Single-chain (pre pair-clean) sequence counts are identical between standard and disjoint for all datasets and loci. Zero sequence loss from CDR3 length grouping.

Dataset	Locus	Std seqs	DJ seqs	Std purity	Std completeness	DJ purity	DJ completeness
hs-2-pbmc	igh	1007	909	1.0000	1.0000	1.0000	1.0000
hs-2-pbmc	igk	620	605	0.9537	0.9653	0.9653	0.9537
hs-2-pbmc	igl	426	417	0.9496	0.9616	0.9616	0.9496
mm-balbc	igh	923	849	0.9988	1.0000	1.0000	0.9988
mm-balbc	igk	935	900	0.9122	0.9589	0.9589	0.9122
mm-balbc	igl	61	59	0.9492	0.9492	0.9492	0.9492
hs-1-postvax	igh	8498	7678	0.9995	0.9991	0.9991	0.9995
hs-1-postvax	igk	4800	4764	0.9163	0.9399	0.9399	0.9163
hs-1-postvax	igl	4096	4056	0.9258	0.9383	0.9383	0.9258
hs-1-prevax	igh	9187	8253	0.9992	0.9978	0.9978	0.9992
hs-1-prevax	igk	5329	5284	0.9093	0.9375	0.9375	0.9093
hs-1-prevax	igl	4360	4308	0.9248	0.9136	0.9248	0.9136

"Std purity" = purity of standard partition with disjoint as reference. "DJ purity" = purity of disjoint partition with standard as reference. The metrics are symmetric: purity(A vs B) = completeness(B vs A).

Summary:

IGH (heavy chain): near-perfect agreement across all datasets (purity/completeness > 0.997). CDR3 length grouping has essentially no effect on heavy chain clustering.
Light chains (IGK/IGL): 91-97% agreement. Differences arise from pair cleaning operating on slightly different cluster boundaries. Neither method is systematically better.
Sequence count differences in the final pair-cleaned output (~10% fewer IGH in disjoint) are entirely from pair cleaning, not from CDR3 grouping. Single-chain counts are identical.

CDR3 length disjointness

Validated across all test scales: 585 simulated families (200 unpaired + 385 paired), 100 real paired sequences, 1,371 real unpaired sequences, and all 4 paper validation datasets (~40K total sequences). Zero families split across CDR3 length groups in all cases.

If these changes look good, the next phase I had planned out was to implement naive hamming fraction sub-grouping within CDR3 length bins which would become necessary if CDR3 length bins are too large at scale. Testing the CDR3 and CDR3 + hfrac grouping on the 4m subsets from previous benchmarking steps and comparing results would let us compare the method and resource usage on real data. One more motivation for testing the hfrac based sub grouping was to check if enabling HA as default for disjoint-grouping would be feasible. What are your thoughts @psathyrella ?

s-canchi and others added 8 commits March 11, 2026 17:04

add disjoint-group and assemble-groups subcommand scaffolding

cd79803

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add core grouping, manifest, and validation functions for disjoint gr…

385b05d

…ouping Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

implement disjoint-group and assemble-groups handlers with --disjoint…

6f6ced0

…-dir arg Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add assembly merge and validation, flip to --no-merge-output for scale

33e8792

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add auto-trigger SW annotation for disjoint-group (T016)

1f72f4f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix assembly merge and add disjoint-group tests with disjointness val…

0097445

…idation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add disjoint grouping reference results for paired and unpaired tests…

f546198

… (T019) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix unpaired data test to use self.datafname and regenerate ref results

b04c3ce

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

s-canchi requested a review from psathyrella March 12, 2026 13:28

s-canchi and others added 8 commits March 13, 2026 10:17

address PR #341 review: reuse existing functions, fix tests, add best…

7f9ee41

…_partition_only to merge_yamls Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add --disjoint-group flag to subset-partition with skip-if-exists orc…

6939264

…hestration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

restructure disjoint tests into self.tests OrderedDict with standard …

be6bc37

…run()/compare framework Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add post-merge CCF calculation and fix test framework for disjoint-pa…

299a300

…rtition Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

update ref results: add disjoint-partition, remove old disjoint-group…

60a1951

… dirs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

regenerate slow paired ref results with subset-partition and disjoint…

0b629ca

…-partition Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

integrate pair cleaning into disjoint-partition path (T10)

5bc888a

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

regenerate disjoint-partition ref results with pair cleaning

45752cb

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 1: Disjoint grouping core implementation#341

Phase 1: Disjoint grouping core implementation#341
s-canchi wants to merge 16 commits intodisjoint-groupingfrom
001-disjoint-grouping

s-canchi commented Mar 11, 2026

Uh oh!

s-canchi commented Mar 12, 2026

Uh oh!

psathyrella commented Mar 12, 2026

Uh oh!

s-canchi commented Mar 12, 2026

Uh oh!

psathyrella commented Mar 12, 2026

Uh oh!

s-canchi commented Mar 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

s-canchi commented Mar 11, 2026

Summary

Test plan

Uh oh!

s-canchi commented Mar 12, 2026

Design choices in Phase 1

Uh oh!

psathyrella commented Mar 12, 2026

testing

disjointgrouper.py

Uh oh!

s-canchi commented Mar 12, 2026

Uh oh!

psathyrella commented Mar 12, 2026

Uh oh!

s-canchi commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Post-review update: all items addressed + validation results

Validation results

Simulated data: normal paired tests (--paired --no-simu, ~56 seqs per locus)

Simulated data: slow paired tests (--slow --paired, ~1700 seqs)

Paired paper validation datasets (10x data, Zenodo 6998443)

CDR3 length disjointness

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

s-canchi commented Mar 14, 2026 •

edited

Loading

Simulated data: normal paired tests (`--paired --no-simu`, ~56 seqs per locus)

Simulated data: slow paired tests (`--slow --paired`, ~1700 seqs)