Add asymmetric batch alignment, fix tempdir and segfault for large genomes by unavailable-2374 · Pull Request #16 · pangenome/sweepga

unavailable-2374 · 2026-01-19T17:16:31Z

Summary

This PR adds asymmetric batch alignment for large genome pairs and fixes critical issues with --tempdir and segmentation faults when aligning large genomes.

Changes

New Features

Asymmetric batch mode: When two input files are given together with --batch-bytes, sweepga uses asymmetric alignment (file1 → file2) without self-alignment within each file
Sequence-level splitting: When individual genomes exceed batch limit, splits chromosomes across batches
Disk optimization: For unidirectional mode, query files are used directly (saves ~50% disk space)
--batch-bidirectional flag: Optional flag for bidirectional alignment (A→B and B→A)

Bug Fixes

Fix --tempdir: Index files now properly placed in tempdir instead of next to input files
Fix --batch-bytes: Account for dual index disk usage; route two-FASTA case through batch mode
Fix segmentation fault: Use forked fastga-rs with fix for large genomes (3+ Gbp)

Example usage

# Unidirectional: queries → targets only (default)
sweepga queries.fa targets.fa --batch-bytes 20G --tempdir /tmp

# Bidirectional: A→B and B→A
sweepga queries.fa targets.fa --batch-bytes 20G --batch-bidirectional

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Previously, index files (GDB, GIX, ktab) were always created next to the input FASTA file, ignoring the --tempdir option. This fix adds new methods that copy/symlink input files to tempdir before alignment, ensuring all intermediate files are created there.

Changes:
- Add prepare_working_copy() and cleanup_working_copy() helpers
- Add align_to_temp_paf_in_tempdir() and align_to_temp_1aln_in_tempdir()
- Update run_single_batch_alignment() to use new tempdir-aware methods
- Update all direct alignment calls in main.rs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
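The working-copy idea described in this commit message can be illustrated with a minimal sketch. The helper names prepare_working_copy() and cleanup_working_copy() come from the commit message, but the signatures, the symlink-then-copy fallback, and the refusal to overwrite an existing file are assumptions made for illustration, not the PR's actual code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Assumption: try a symlink first (cheap), fall back to a full copy.
#[cfg(unix)]
fn link_or_copy(src: &Path, dst: &Path) -> io::Result<()> {
    std::os::unix::fs::symlink(src, dst).or_else(|_| fs::copy(src, dst).map(|_| ()))
}

#[cfg(not(unix))]
fn link_or_copy(src: &Path, dst: &Path) -> io::Result<()> {
    fs::copy(src, dst).map(|_| ())
}

/// Hypothetical sketch: expose `input` inside `tempdir`, so that a tool which
/// writes its index files next to its input writes them into `tempdir`.
fn prepare_working_copy(input: &Path, tempdir: &Path) -> io::Result<PathBuf> {
    let file_name = input
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "input has no file name"))?;
    fs::create_dir_all(tempdir)?;
    let working = tempdir.join(file_name);
    // Refuse to reuse an existing path: it might be the original input itself,
    // which cleanup_working_copy() must never delete.
    if fs::symlink_metadata(&working).is_ok() {
        return Err(io::Error::new(
            io::ErrorKind::AlreadyExists,
            "working copy path already exists",
        ));
    }
    // Symlink targets must be absolute to stay valid from inside tempdir.
    let absolute = fs::canonicalize(input)?;
    link_or_copy(&absolute, &working)?;
    Ok(working)
}

/// Remove the working copy. For a symlink this removes only the link,
/// never the file it points to.
fn cleanup_working_copy(working: &Path) -> io::Result<()> {
    match fs::remove_file(working) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}
```

In this sketch the aligner would be pointed at the returned path instead of the original input, and cleanup_working_copy() would run once the batch finishes. Removing the index files themselves is a separate step not shown here.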

During cross-batch alignment, FastGA creates indexes for both query and target batches simultaneously. This means peak disk usage is roughly 2x the per-batch index size.

Changes:
- Divide max_index_bytes by 2 when partitioning into batches
- Each batch now targets half the user's limit, ensuring peak usage (query + target indexes) stays within the specified limit
- Added clearer log messages showing both total and per-batch limits
- Fixed AGC batch estimation to track basepairs instead of cumulative estimated sizes (was incorrectly adding 100MB overhead per sample)

Example: --batch-bytes 20G now creates batches with ~10G estimated index each, so cross-batch alignment peaks at ~20G total.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
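The halving rule from this commit message is simple arithmetic, sketched below. The name max_index_bytes appears in the commit message. The two function names are hypothetical, and the sketch assumes the limit is already expressed in bytes.

```rust
/// Hypothetical helper: budget for a single batch's index, given that a
/// query-batch index and a target-batch index exist on disk at the same time.
fn per_batch_limit(max_index_bytes: u64) -> u64 {
    max_index_bytes / 2
}

/// Estimated peak disk usage while one query batch is aligned to one target batch.
fn cross_batch_peak(query_index_bytes: u64, target_index_bytes: u64) -> u64 {
    query_index_bytes.saturating_add(target_index_bytes)
}
```

With a user limit of 20,000,000,000 bytes, each batch targets 10,000,000,000 bytes, so two batches at their budget together stay within the limit.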

Changes:
1. Two single-genome FASTAs now use batch mode when --batch-bytes is set
   - Previously, this case skipped batch mode entirely
   - Now properly respects the --batch-bytes limit
2. Added clear warnings when genomes exceed the limit:
   - Shows peak disk usage estimate (sum of two largest batches)
   - Shows minimum --batch-bytes needed for the genomes
   - Explains that batching only helps with many smaller genomes
3. Fixed AGC batch estimation to track basepairs correctly
   - Was adding 100MB overhead per sample instead of per batch

Example output with oversized genomes:
  [batch] WARNING: Peak disk usage (~43.2 GB) will exceed --batch-bytes limit (20.0 GB)
  [batch] NOTE: Minimum --batch-bytes needed for these genomes: 47.6 GB
  [batch] TIP: Large genomes cannot be split - batching only helps with many smaller genomes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
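The peak estimate mentioned in this commit message (sum of the two largest batches) can be sketched as follows. The function names are hypothetical. The message text is copied from the example output, and displaying 1 GB as 10^9 bytes is an assumption. How the PR derives the "minimum --batch-bytes needed" figure is not stated, so the sketch does not attempt it.

```rust
/// Estimated peak disk usage: the two largest batch indexes may coexist on disk.
fn estimated_peak_bytes(batch_index_bytes: &[u64]) -> u64 {
    let mut sizes = batch_index_bytes.to_vec();
    sizes.sort_unstable_by(|a, b| b.cmp(a)); // largest first
    sizes.iter().take(2).sum()
}

/// Returns a warning line when the estimated peak exceeds the user's limit.
/// Assumption: sizes are displayed in GB with 1 GB = 1e9 bytes.
fn peak_warning(batch_index_bytes: &[u64], limit_bytes: u64) -> Option<String> {
    let peak = estimated_peak_bytes(batch_index_bytes);
    if peak > limit_bytes {
        Some(format!(
            "[batch] WARNING: Peak disk usage (~{:.1} GB) will exceed --batch-bytes limit ({:.1} GB)",
            peak as f64 / 1e9,
            limit_bytes as f64 / 1e9
        ))
    } else {
        None
    }
}
```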

When genome-level batching produces oversized batches (individual genomes exceed the per-batch limit), automatically switch to sequence-level splitting. This distributes chromosomes/contigs across batches to stay within the disk limit.

New features:
- SequenceInfo and SequenceBatch structs for sequence-level tracking
- parse_sequences() to extract individual sequence info from FASTAs
- partition_sequences_into_batches() with first-fit decreasing bin packing
- write_sequence_batch_fasta() to write batch FASTA files
- run_sequence_batch_alignment() for sequence-level batch alignment

The algorithm:
1. Try genome-level batching first
2. If any batch exceeds limit, switch to sequence-level splitting
3. Parse individual sequences, sort by size (largest first)
4. Use first-fit decreasing to pack sequences into batches
5. Run all-pairs alignment between batches

Example with 2.5Gbp + 1.1Gbp genomes and 20GB limit:
- Before: Would create ~43GB indexes (genome-level)
- After: Creates 5 batches of ~10GB each, peak ~20GB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
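Steps 3 and 4 of the algorithm describe first-fit decreasing bin packing, which can be sketched as below. The names SequenceInfo, SequenceBatch, and partition_sequences_into_batches() come from the commit message. Their fields, the signature, and the use of a single capacity value in the same unit as the sequence lengths are assumptions for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
struct SequenceInfo {
    name: String,
    length: u64, // assumed unit: same as `capacity` below (e.g. base pairs)
}

#[derive(Debug, Default)]
struct SequenceBatch {
    sequences: Vec<SequenceInfo>,
    total_length: u64,
}

/// First-fit decreasing: sort sequences largest first, then place each one
/// into the first existing batch with enough room, opening a new batch if none fits.
fn partition_sequences_into_batches(
    mut seqs: Vec<SequenceInfo>,
    capacity: u64,
) -> Vec<SequenceBatch> {
    // Largest first; ties broken by name so the result is deterministic.
    seqs.sort_by(|a, b| b.length.cmp(&a.length).then_with(|| a.name.cmp(&b.name)));

    let mut batches: Vec<SequenceBatch> = Vec::new();
    for seq in seqs {
        let slot = batches
            .iter()
            .position(|b| b.total_length + seq.length <= capacity);
        match slot {
            Some(i) => {
                batches[i].total_length += seq.length;
                batches[i].sequences.push(seq);
            }
            // A sequence longer than `capacity` also lands here, alone in its own batch.
            None => batches.push(SequenceBatch {
                total_length: seq.length,
                sequences: vec![seq],
            }),
        }
    }
    batches
}
```

A single sequence larger than the capacity cannot be split by this scheme, so in the sketch it simply occupies an oversized batch of its own.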

Update dependency to use unavailable-2374/fastga-rs which includes:
- Fix segfault when aligning large genomes (~3GB)
- Fix out-of-bounds error in ALNtoPAF with -pafm option

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

This adds support for efficiently aligning two large genomes (e.g., 3+ Gbp each) with controlled disk usage via --batch-bytes.

Key features:
- Asymmetric batch mode: when 2 input files + --batch-bytes, automatically uses asymmetric alignment (file1 → file2) without self-alignment
- Sequence-level splitting: when genomes exceed batch limit, splits individual chromosomes across batches instead of treating whole genome as one unit
- Disk optimization: for unidirectional alignment, only target sequences are written to temp directory; query files are used directly (saves ~50% disk)
- --batch-bidirectional flag for optional A→B and B→A alignment

Bug fixes:
- Fix rayon thread pool initialization order to prevent "already initialized" error

Implementation details:
- run_asymmetric_batch_alignment(): genome-level batching for asymmetric mode
- run_asymmetric_sequence_batch_alignment(): sequence-level fallback when genomes are too large for genome-level batching
- Automatic detection: if any batch exceeds per-batch limit after genome-level partitioning, switches to sequence-level splitting
- Proper cleanup of GDB files and batch directories

Example usage:
  sweepga queries.fa targets.fa --batch-bytes 20G                          # queries→targets only
  sweepga queries.fa targets.fa --batch-bytes 20G --batch-bidirectional    # both directions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
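The difference between the default asymmetric mode and --batch-bidirectional comes down to which batch pairs get aligned. The sketch below enumerates those pairs. AlignmentJob and plan_asymmetric_jobs() are hypothetical names, and batches are represented by plain labels rather than the PR's actual batch types.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct AlignmentJob {
    query: String,
    target: String,
}

/// Enumerate alignment jobs between batches of file A and batches of file B.
/// Batches from the same file are never paired, so there is no self-alignment.
fn plan_asymmetric_jobs(
    a_batches: &[String],
    b_batches: &[String],
    bidirectional: bool,
) -> Vec<AlignmentJob> {
    let mut jobs = Vec::new();
    // Default direction: every A batch as query against every B batch as target.
    for a in a_batches {
        for b in b_batches {
            jobs.push(AlignmentJob { query: a.clone(), target: b.clone() });
        }
    }
    // Optional reverse direction: B as query, A as target.
    if bidirectional {
        for b in b_batches {
            for a in a_batches {
                jobs.push(AlignmentJob { query: b.clone(), target: a.clone() });
            }
        }
    }
    jobs
}
```

For m batches of the first file and n batches of the second, this yields m × n jobs in the default mode and 2 × m × n jobs in bidirectional mode.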

unavailable-2374 and others added 7 commits January 14, 2026 22:55

Use forked fastga-rs with segfault fix for large genomes

d0470d7


update on --batch-bytes and --tempdir

c2ec1a4

AndreaGuarracino force-pushed the main branch 2 times, most recently from 70e27dd to 5905715 on March 6, 2026 00:43

unavailable-2374 wants to merge 7 commits into pangenome:main from unavailable-2374:main
