Add asymmetric batch alignment, fix tempdir and segfault for large genomes#16

Open
unavailable-2374 wants to merge 7 commits into pangenome:main from unavailable-2374:main

@unavailable-2374
Summary

This PR adds asymmetric batch alignment for large genome pairs and fixes critical issues with --tempdir and segmentation faults when aligning large genomes.

Changes

New Features

  1. Asymmetric batch mode: When two input files are given together with --batch-bytes, sweepga aligns file1 → file2 only, without self-alignment within each file
  2. Sequence-level splitting: When an individual genome exceeds the per-batch limit, its chromosomes are split across batches
  3. Disk optimization: In unidirectional mode, query files are used directly instead of being copied (saves ~50% disk space)
  4. --batch-bidirectional flag: Optional flag for bidirectional alignment (A→B and B→A)

Bug Fixes

  1. Fix --tempdir: Index files now properly placed in tempdir instead of next to input files
  2. Fix --batch-bytes: Account for dual index disk usage; route two-FASTA case through batch mode
  3. Fix segmentation fault: Use forked fastga-rs with fix for large genomes (3+ Gbp)

Example usage

# Unidirectional: queries → targets only (default)
sweepga queries.fa targets.fa --batch-bytes 20G --tempdir /tmp

# Bidirectional: A→B and B→A
sweepga queries.fa targets.fa --batch-bytes 20G --batch-bidirectional

🤖 Generated with [Claude Code](https://claude.com/claude-code)

unavailable-2374 and others added 7 commits January 14, 2026 22:55
Previously, index files (GDB, GIX, ktab) were always created next to the
input FASTA file, ignoring the --tempdir option. This fix adds new methods
that copy/symlink input files to tempdir before alignment, ensuring all
intermediate files are created there.

Changes:
- Add prepare_working_copy() and cleanup_working_copy() helpers
- Add align_to_temp_paf_in_tempdir() and align_to_temp_1aln_in_tempdir()
- Update run_single_batch_alignment() to use new tempdir-aware methods
- Update all direct alignment calls in main.rs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
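
The commit above says the input is copied or symlinked into the tempdir so that index files land next to the working copy. A minimal sketch of that idea follows, assuming a Unix symlink with a copy fallback. The `WorkingCopy` struct, the `_sketch` function names, and their signatures are hypothetical; the PR's actual `prepare_working_copy()` and `cleanup_working_copy()` are not shown on this page.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical handle for a file staged inside the tempdir.
pub struct WorkingCopy {
    pub path: PathBuf,
    /// True when this sketch created the link/copy and may delete it later.
    pub owned: bool,
}

/// Stage `input` inside `tempdir` (symlink if possible, copy otherwise).
/// Assumption: tools that write indexes "next to the input" will then
/// write them into `tempdir`.
pub fn prepare_working_copy_sketch(input: &Path, tempdir: &Path) -> io::Result<WorkingCopy> {
    let name = input.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "input path has no file name")
    })?;
    fs::create_dir_all(tempdir)?;
    // A symlink target must be absolute, or it resolves relative to tempdir.
    let source = fs::canonicalize(input)?;
    let dest = tempdir.join(name);

    // If dest already resolves to the input (the input itself, or a link
    // left by an earlier run), reuse it and never delete it on cleanup.
    if let Ok(existing) = fs::canonicalize(&dest) {
        if existing == source {
            return Ok(WorkingCopy { path: dest, owned: false });
        }
    }
    // Remove any stale entry so the copy fallback cannot write through a link.
    if fs::symlink_metadata(&dest).is_ok() {
        fs::remove_file(&dest)?;
    }

    #[cfg(unix)]
    let linked = std::os::unix::fs::symlink(&source, &dest).is_ok();
    #[cfg(not(unix))]
    let linked = false;

    if !linked {
        // Fall back to a full copy when symlinking is unavailable or fails.
        fs::copy(&source, &dest)?;
    }
    Ok(WorkingCopy { path: dest, owned: true })
}

/// Remove the staged file, leaving the original input untouched.
pub fn cleanup_working_copy_sketch(copy: &WorkingCopy) -> io::Result<()> {
    if copy.owned {
        fs::remove_file(&copy.path)?;
    }
    Ok(())
}
```

Tracking `owned` is a design choice of this sketch: it keeps cleanup from deleting the user's input when that input already lives in the tempdir.
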
During cross-batch alignment, FastGA creates indexes for both query
and target batches simultaneously. This means peak disk usage is
roughly 2x the per-batch index size.

Changes:
- Divide max_index_bytes by 2 when partitioning into batches
- Each batch now targets half the user's limit, ensuring peak usage
  (query + target indexes) stays within the specified limit
- Added clearer log messages showing both total and per-batch limits
- Fixed AGC batch estimation to track basepairs instead of cumulative
  estimated sizes (was incorrectly adding 100MB overhead per sample)

Example: --batch-bytes 20G now creates batches with ~10G estimated
index each, so cross-batch alignment peaks at ~20G total.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
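
The halving rule in the commit above can be illustrated with a few lines of arithmetic. This is a sketch only: `parse_size_sketch` and `per_batch_limit` are hypothetical names, and treating "G" as 2^30 bytes is an assumption, since the page does not say how sweepga parses --batch-bytes.

```rust
/// Parse sizes such as "20G" into bytes.
/// Assumption: binary multiples (1G = 2^30 bytes); the real parser may differ.
pub fn parse_size_sketch(s: &str) -> Option<u64> {
    let s = s.trim();
    let (digits, mult) = match s.chars().last()? {
        'k' | 'K' => (&s[..s.len() - 1], 1u64 << 10),
        'm' | 'M' => (&s[..s.len() - 1], 1u64 << 20),
        'g' | 'G' => (&s[..s.len() - 1], 1u64 << 30),
        't' | 'T' => (&s[..s.len() - 1], 1u64 << 40),
        _ => (s, 1),
    };
    digits.trim().parse::<u64>().ok()?.checked_mul(mult)
}

/// Each batch targets half of the user's limit, because a cross-batch
/// alignment holds a query index and a target index at the same time.
pub fn per_batch_limit(max_index_bytes: u64) -> u64 {
    max_index_bytes / 2
}
```

With `--batch-bytes 20G`, `per_batch_limit` yields 10G per batch, so two indexes held at once stay within the 20G total.
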
Changes:
1. Two single-genome FASTAs now use batch mode when --batch-bytes is set
   - Previously, this case skipped batch mode entirely
   - Now properly respects the --batch-bytes limit

2. Added clear warnings when genomes exceed the limit:
   - Shows peak disk usage estimate (sum of two largest batches)
   - Shows minimum --batch-bytes needed for the genomes
   - Explains that batching only helps with many smaller genomes

3. Fixed AGC batch estimation to track basepairs correctly
   - Was adding 100MB overhead per sample instead of per batch

Example output with oversized genomes:
  [batch] WARNING: Peak disk usage (~43.2 GB) will exceed --batch-bytes limit (20.0 GB)
  [batch] NOTE: Minimum --batch-bytes needed for these genomes: 47.6 GB
  [batch] TIP: Large genomes cannot be split - batching only helps with many smaller genomes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
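
The warning above describes peak disk usage as the sum of the two largest batches. A minimal sketch of that estimate follows; the function names are hypothetical and the inputs are assumed to be per-batch index size estimates in bytes. The formula behind the "minimum --batch-bytes needed" line is not stated in the commit, so it is not reproduced here.

```rust
/// Peak disk estimate: the two largest batch indexes coexisting on disk.
pub fn peak_disk_usage(batch_index_bytes: &[u64]) -> u64 {
    let mut largest = 0u64;
    let mut second = 0u64;
    for &b in batch_index_bytes {
        if b > largest {
            second = largest;
            largest = b;
        } else if b > second {
            second = b;
        }
    }
    largest.saturating_add(second)
}

/// True when the estimated peak would exceed the user's --batch-bytes limit.
pub fn exceeds_limit(batch_index_bytes: &[u64], limit: u64) -> bool {
    peak_disk_usage(batch_index_bytes) > limit
}
```
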
When genome-level batching produces oversized batches (individual genomes
exceed the per-batch limit), automatically switch to sequence-level
splitting. This distributes chromosomes/contigs across batches to stay
within the disk limit.

New features:
- SequenceInfo and SequenceBatch structs for sequence-level tracking
- parse_sequences() to extract individual sequence info from FASTAs
- partition_sequences_into_batches() with first-fit decreasing bin packing
- write_sequence_batch_fasta() to write batch FASTA files
- run_sequence_batch_alignment() for sequence-level batch alignment

The algorithm:
1. Try genome-level batching first
2. If any batch exceeds limit, switch to sequence-level splitting
3. Parse individual sequences, sort by size (largest first)
4. Use first-fit decreasing to pack sequences into batches
5. Run all-pairs alignment between batches

Example with 2.5Gbp + 1.1Gbp genomes and 20GB limit:
- Before: Would create ~43GB indexes (genome-level)
- After: Creates 5 batches of ~10GB each, peak ~20GB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
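
Steps 3 and 4 of the algorithm above (sort largest first, then first-fit decreasing) can be sketched as follows. `SeqSketch` is a hypothetical stand-in for the PR's `SequenceInfo`, whose fields are not shown on this page, and the handling of a sequence larger than the capacity (it gets a batch of its own) is an assumption.

```rust
/// Hypothetical stand-in for the PR's SequenceInfo.
#[derive(Debug, Clone, PartialEq)]
pub struct SeqSketch {
    pub name: String,
    /// Estimated index size contributed by this sequence, in bytes.
    pub est_bytes: u64,
}

/// First-fit decreasing: sort by size (largest first), then place each
/// sequence into the first batch that still has room, opening a new
/// batch when none fits.
pub fn first_fit_decreasing(mut seqs: Vec<SeqSketch>, capacity: u64) -> Vec<Vec<SeqSketch>> {
    seqs.sort_by(|a, b| b.est_bytes.cmp(&a.est_bytes));
    let mut batches: Vec<Vec<SeqSketch>> = Vec::new();
    let mut loads: Vec<u64> = Vec::new();
    for seq in seqs {
        let slot = loads
            .iter()
            .position(|&load| load.saturating_add(seq.est_bytes) <= capacity);
        match slot {
            Some(i) => {
                loads[i] += seq.est_bytes;
                batches[i].push(seq);
            }
            None => {
                // Also reached for a sequence larger than `capacity`:
                // it cannot be split further, so it gets its own batch.
                loads.push(seq.est_bytes);
                batches.push(vec![seq]);
            }
        }
    }
    batches
}
```

First-fit decreasing is a heuristic, so the number of batches it produces is not guaranteed to be the minimum possible.
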
Update dependency to use unavailable-2374/fastga-rs which includes:
- Fix segfault when aligning large genomes (~3GB)
- Fix out-of-bounds error in ALNtoPAF with -pafm option

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

This adds support for efficiently aligning two large genomes (e.g., 3+ Gbp each)
with controlled disk usage via --batch-bytes.

Key features:
- Asymmetric batch mode: when two input files are given with --batch-bytes,
  automatically uses asymmetric alignment (file1 → file2) without self-alignment
- Sequence-level splitting: when genomes exceed batch limit, splits individual
  chromosomes across batches instead of treating whole genome as one unit
- Disk optimization: for unidirectional alignment, only target sequences are
  written to temp directory; query files are used directly (saves ~50% disk)
- --batch-bidirectional flag for optional A→B and B→A alignment

Bug fixes:
- Fix rayon thread pool initialization order to prevent "already initialized" error

Implementation details:
- run_asymmetric_batch_alignment(): genome-level batching for asymmetric mode
- run_asymmetric_sequence_batch_alignment(): sequence-level fallback when
  genomes are too large for genome-level batching
- Automatic detection: if any batch exceeds per-batch limit after genome-level
  partitioning, switches to sequence-level splitting
- Proper cleanup of GDB files and batch directories

Example usage:
  sweepga queries.fa targets.fa --batch-bytes 20G  # queries→targets only
  sweepga queries.fa targets.fa --batch-bytes 20G --batch-bidirectional  # both directions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
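
The difference between asymmetric and bidirectional mode described above comes down to which batch pairs get aligned. The sketch below enumerates (query, target) jobs under that reading. `Side`, `BatchRef`, and `plan_alignments` are hypothetical names, and the job order is an assumption; the PR's `run_asymmetric_batch_alignment()` may schedule work differently.

```rust
/// Which input file a batch came from (file1 = A, file2 = B).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Side {
    A,
    B,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BatchRef {
    pub side: Side,
    pub index: usize,
}

/// Enumerate (query, target) batch pairs.
/// Asymmetric mode pairs every A batch with every B batch and never
/// aligns a file against itself; bidirectional mode adds the reverse.
pub fn plan_alignments(n_a: usize, n_b: usize, bidirectional: bool) -> Vec<(BatchRef, BatchRef)> {
    let per_pair = if bidirectional { 2 } else { 1 };
    let mut jobs = Vec::with_capacity(n_a * n_b * per_pair);
    for i in 0..n_a {
        for j in 0..n_b {
            let a = BatchRef { side: Side::A, index: i };
            let b = BatchRef { side: Side::B, index: j };
            jobs.push((a, b)); // A→B
            if bidirectional {
                jobs.push((b, a)); // B→A
            }
        }
    }
    jobs
}
```

By contrast, symmetric all-pairs batching would also schedule A×A and B×B jobs, which this mode skips.
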
AndreaGuarracino force-pushed the main branch 2 times, most recently from 70e27dd to 5905715 on March 6, 2026 00:43