Phase 1: Disjoint grouping core implementation #341
s-canchi wants to merge 16 commits into disjoint-grouping from …
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ouping Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-dir arg Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…idation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… (T019) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Design choices in Phase 1
testing
disjointgrouper.py
Big picture, and sorry in advance if this is supposed to be just a stepping stone, but I think disjoint grouping just needs to run as an option (eventually the default) for partitioning. We can't require the user to run multiple commands with complicated path dependencies (again, sorry if I'm misunderstanding). For instance, from the user's perspective it should run the same as without disjoint grouping; it just runs a more complicated apportionment algorithm with an extra parameter caching step. Now, exactly how much of the existing subset partition function should it use? That's for sure still to be decided: they are doing somewhat different things, and we of course don't have to reuse everything. But at minimum, conceptually, this should behave similarly. Again, thanks so much for pushing on this!
Thank you for the detailed review. I agree with the code-level feedback across the board (reuse existing CDR3 grouping functions, use utils.write_fasta, integrate via run_step(), modify merge_yamls() instead of duplicating, restructure tests into the standard framework, etc). I will follow up with implementation.
On the bigger architectural question, I agree that
What I plan to change in subset-partition
I will modify
Without the
Paired path: everything routes through the existing paired path (unpaired as degenerate single locus), using
HPC / resource optimization
One motivation for the original separate-command design was that each step has different resource profiles. This applies to SW annotation too, since for multi-million sequence datasets chunking SW annotation across nodes is much faster than running it as a single serial process. Would you be open to a mode where the disjoint path writes the manifest/groups and exits, so an external orchestrator can submit per-group jobs with right-sized resources? This could be a flag like
Pairing info
Phase 1 scope was single-chain partition correctness. The paired pairing/cleaning steps should work unchanged on the merged single-chain outputs from disjoint grouping, but I have not confirmed that yet. I will verify as part of the integration.
This sounds great, thanks!
great.
This sounds good. The current subset partition already skips individual subset partition runs if output exists, but using the manifest does sound like a nice addition.
yes, exactly, just concatenation without looking for new merges.
Yes, that sounds great. I think we want to enable the case of super large samples that require finagling different resource allocations at different steps, but the most common use case is still running on one machine with one command, so that should stay the default.
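The concatenation-only merge agreed on above ("just concatenation without looking for new merges") can be sketched roughly as follows. This is a minimal illustration, assuming each disjoint group yields a partition stored as a list of clusters, each cluster a list of sequence IDs. The function name `concatenate_partitions` and this data layout are hypothetical and are not taken from partis or its `merge_yamls()`.

```python
def concatenate_partitions(group_partitions):
    """Join per-group partitions into one partition without merging any clusters.

    group_partitions: iterable of partitions, one per disjoint group, where a
    partition is a list of clusters and a cluster is a list of sequence IDs
    (hypothetical layout, for illustration only).
    """
    merged = []
    seen = set()
    for partition in group_partitions:
        for cluster in partition:
            # Disjoint groups must never share a sequence ID.
            overlap = seen.intersection(cluster)
            if overlap:
                raise ValueError(f"groups are not disjoint: {sorted(overlap)}")
            seen.update(cluster)
            merged.append(list(cluster))
    return merged
```

Because the groups are disjoint by construction, no cross-group comparison is needed; the overlap check is only a guard against a grouping bug.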
…_partition_only to merge_yamls Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hestration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…run()/compare framework Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rtition Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… dirs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-partition Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Post-review update: all items addressed + validation results
All review feedback has been addressed and validation is complete across simulated tests and the paired paper datasets.
Architecture (big picture)
disjointgrouper.py cleanup
Test restructuring
Pair cleaning integration
Validation results
Simulated data: normal paired tests (…)
| Method | Purity | Completeness | Pair Clean Correct | Unpaired | Mispaired |
|---|---|---|---|---|---|
| partition | 0.980 | 0.588 | 0.726 | 0.133 | 0.142 |
| subset | 0.980 | 0.677 | 0.726 | 0.133 | 0.142 |
| disjoint | 0.980 | 0.568 | 0.726 | 0.133 | 0.142 |
Simulated data: slow paired tests (--slow --paired, ~1700 seqs)
| Method | Purity | Completeness | Pair Clean Correct | Unpaired | Mispaired |
|---|---|---|---|---|---|
| partition | 0.890 | 0.788 | 0.795 | 0.156 | 0.049 |
| subset | 0.878 | 0.776 | 0.795 | 0.145 | 0.055 |
| disjoint | 0.903 | 0.792 | 0.800 | 0.159 | 0.042 |
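Disjoint partition stays within about 0.02 of standard partition on every metric at both scales. On the normal tests it matches standard partition except for slightly lower completeness (0.568 vs 0.588); on the slow tests it has slightly higher purity, completeness, and pair clean correct fraction, and a lower mispaired fraction.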
Paired paper validation datasets (10x data, Zenodo 6998443)
Ran subset-partition --disjoint-group --paired-loci on all 4 datasets from the paired paper using existing parameters. Compared pair-cleaned output against existing standard partition results.
Single-chain (pre pair-clean) sequence counts are identical between standard and disjoint for all datasets and loci. Zero sequence loss from CDR3 length grouping.
| Dataset | Locus | Std seqs | DJ seqs | Std purity | Std completeness | DJ purity | DJ completeness |
|---|---|---|---|---|---|---|---|
| hs-2-pbmc | igh | 1007 | 909 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| hs-2-pbmc | igk | 620 | 605 | 0.9537 | 0.9653 | 0.9653 | 0.9537 |
| hs-2-pbmc | igl | 426 | 417 | 0.9496 | 0.9616 | 0.9616 | 0.9496 |
| mm-balbc | igh | 923 | 849 | 0.9988 | 1.0000 | 1.0000 | 0.9988 |
| mm-balbc | igk | 935 | 900 | 0.9122 | 0.9589 | 0.9589 | 0.9122 |
| mm-balbc | igl | 61 | 59 | 0.9492 | 0.9492 | 0.9492 | 0.9492 |
| hs-1-postvax | igh | 8498 | 7678 | 0.9995 | 0.9991 | 0.9991 | 0.9995 |
| hs-1-postvax | igk | 4800 | 4764 | 0.9163 | 0.9399 | 0.9399 | 0.9163 |
| hs-1-postvax | igl | 4096 | 4056 | 0.9258 | 0.9383 | 0.9383 | 0.9258 |
| hs-1-prevax | igh | 9187 | 8253 | 0.9992 | 0.9978 | 0.9978 | 0.9992 |
| hs-1-prevax | igk | 5329 | 5284 | 0.9093 | 0.9375 | 0.9375 | 0.9093 |
| hs-1-prevax | igl | 4360 | 4308 | 0.9248 | 0.9136 | 0.9248 | 0.9136 |
"Std purity" = purity of standard partition with disjoint as reference. "DJ purity" = purity of disjoint partition with standard as reference. The metrics are symmetric: purity(A vs B) = completeness(B vs A).
Summary:
- IGH (heavy chain): near-perfect agreement across all datasets (purity/completeness > 0.997). CDR3 length grouping has essentially no effect on heavy chain clustering.
- Light chains (IGK/IGL): 91-97% agreement. Differences arise from pair cleaning operating on slightly different cluster boundaries. Neither method is systematically better.
- Sequence count differences in the final pair-cleaned output (~10% fewer IGH in disjoint) are entirely from pair cleaning, not from CDR3 grouping. Single-chain counts are identical.
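The symmetry noted under the table, purity(A vs B) = completeness(B vs A), can be illustrated with a small sketch. It assumes per-sequence definitions averaged over the sequences shared by both partitions: purity is the fraction of a sequence's inferred cluster that also lies in its reference cluster, and completeness is the fraction of its reference cluster that lies in its inferred cluster. These definitions and the function names are assumptions for illustration, not the partis implementation.

```python
def _cluster_lookup(partition):
    """Map each sequence ID to the (frozen) cluster that contains it."""
    return {uid: frozenset(cluster) for cluster in partition for uid in cluster}


def purity_and_completeness(inferred, reference):
    """Return (purity, completeness) of `inferred` measured against `reference`.

    Both arguments are partitions: lists of clusters, each a list of IDs.
    Only sequences present in both partitions are scored.
    """
    inf = _cluster_lookup(inferred)
    ref = _cluster_lookup(reference)
    shared = sorted(set(inf) & set(ref))
    if not shared:
        raise ValueError("partitions share no sequences")
    # Purity: overlap relative to the inferred cluster size.
    purity = sum(len(inf[u] & ref[u]) / len(inf[u]) for u in shared) / len(shared)
    # Completeness: overlap relative to the reference cluster size.
    completeness = sum(len(inf[u] & ref[u]) / len(ref[u]) for u in shared) / len(shared)
    return purity, completeness


part_a = [["a", "b", "c"], ["d"]]
part_b = [["a", "b"], ["c", "d"]]
purity_ab, completeness_ab = purity_and_completeness(part_a, part_b)
purity_ba, completeness_ba = purity_and_completeness(part_b, part_a)
```

Swapping the roles of the two partitions swaps the two denominators, so `purity_ab` equals `completeness_ba` and `completeness_ab` equals `purity_ba`.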
CDR3 length disjointness
Validated across all test scales: 585 simulated families (200 unpaired + 385 paired), 100 real paired sequences, 1,371 real unpaired sequences, and all 4 paper validation datasets (~40K total sequences). Zero families split across CDR3 length groups in all cases.
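The disjointness property being checked is that no family has members in more than one CDR3 length group. A minimal sketch of that check, assuming a mapping from sequence ID to CDR3 length and a list of families given as lists of sequence IDs; the function names and data layout are hypothetical and do not come from disjointgrouper.py.

```python
from collections import defaultdict


def group_by_cdr3_length(cdr3_lengths):
    """Bin sequence IDs by CDR3 length.

    cdr3_lengths: dict mapping sequence ID -> CDR3 length (hypothetical input).
    Returns a dict mapping CDR3 length -> sorted list of sequence IDs.
    """
    groups = defaultdict(list)
    for uid, length in cdr3_lengths.items():
        groups[length].append(uid)
    return {length: sorted(uids) for length, uids in groups.items()}


def families_split_across_groups(families, cdr3_lengths):
    """Return the families whose members fall into more than one length group."""
    split = []
    for family in families:
        lengths = {cdr3_lengths[uid] for uid in family}
        if len(lengths) > 1:
            split.append(family)
    return split


lengths = {"s1": 45, "s2": 45, "s3": 48}
groups = group_by_cdr3_length(lengths)
split = families_split_across_groups([["s1", "s2"], ["s3"]], lengths)
```

An empty list from `families_split_across_groups` corresponds to the "zero families split" result reported above.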
If these changes look good, the next phase I had planned is to implement naive hamming fraction (hfrac) sub-grouping within CDR3 length bins, which would become necessary if CDR3 length bins are too large at scale. Testing the CDR3 and CDR3 + hfrac grouping on the 4m subsets from previous benchmarking steps would let us compare the two methods and their resource usage on real data. One more motivation for testing the hfrac-based sub-grouping is to check whether enabling HA as default for disjoint-grouping would be feasible. What are your thoughts, @psathyrella?
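One way the proposed hfrac sub-grouping could work is sketched below, under stated assumptions: all sequences in a CDR3 length bin are compared as equal-length strings, and sequences are linked whenever their hamming fraction is at or below a threshold (single linkage via union-find), so that sequences in different sub-groups are always farther apart than the threshold. The function names, the all-pairs comparison, and the threshold value are hypothetical, not the planned implementation.

```python
def hamming_fraction(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    if not seq_a:
        return 0.0
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def subgroup_by_hfrac(seqs, threshold):
    """Split one length bin into single-linkage sub-groups.

    seqs: dict mapping sequence ID -> sequence, all the same length.
    Returns a sorted list of sub-groups, each a sorted list of IDs.
    """
    parent = {uid: uid for uid in seqs}

    def find(uid):
        # Union-find root lookup with path halving.
        while parent[uid] != uid:
            parent[uid] = parent[parent[uid]]
            uid = parent[uid]
        return uid

    uids = sorted(seqs)
    for i, u in enumerate(uids):
        for v in uids[i + 1:]:
            if hamming_fraction(seqs[u], seqs[v]) <= threshold:
                parent[find(u)] = find(v)

    groups = {}
    for uid in uids:
        groups.setdefault(find(uid), []).append(uid)
    return sorted(groups.values())


example = {"a": "AAAA", "b": "AAAT", "c": "TTTT", "d": "TTTA"}
subgroups = subgroup_by_hfrac(example, threshold=0.25)
```

The all-pairs loop is quadratic in bin size, so at the scales discussed here a real implementation would likely need a cheaper neighbor search; the sketch only shows the grouping criterion.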
Summary
- disjoint-group, per-group partition, assemble-groups
- test/test.py with reference results for all four configs
- (partis-test.py --quick and --paired --no-simu pass)
Test plan
- partis-test.py --quick passes
- partis-test.py --paired --no-simu passes
Related: #337