Add asymmetric batch alignment, fix tempdir and segfault for large genomes#16
Open
unavailable-2374 wants to merge 7 commits into pangenome:main from
Conversation
Previously, index files (GDB, GIX, ktab) were always created next to the input FASTA file, ignoring the --tempdir option. This fix adds new methods that copy/symlink input files to tempdir before alignment, ensuring all intermediate files are created there.

Changes:
- Add prepare_working_copy() and cleanup_working_copy() helpers
- Add align_to_temp_paf_in_tempdir() and align_to_temp_1aln_in_tempdir()
- Update run_single_batch_alignment() to use new tempdir-aware methods
- Update all direct alignment calls in main.rs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
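
For readers following the tempdir fix, here is a minimal sketch of the copy/symlink idea. It is not the code in this PR: the function names carry a `_sketch` suffix to mark them as hypothetical, and the "symlink first, copy as fallback" policy is an assumption based only on the description above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: expose `input` inside `tempdir`, so a tool that writes
/// its index files next to its input ends up writing them into `tempdir`.
pub fn prepare_working_copy_sketch(input: &Path, tempdir: &Path) -> io::Result<PathBuf> {
    let name = input.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "input path has no file name")
    })?;
    let source = fs::canonicalize(input)?;
    fs::create_dir_all(tempdir)?;
    let tempdir = fs::canonicalize(tempdir)?;

    // The input already lives in tempdir: nothing to do.
    if source.parent() == Some(tempdir.as_path()) {
        return Ok(source);
    }

    let dest = tempdir.join(name);
    // Remove a stale entry first, so a later copy can never write through an
    // old symlink into the original input file.
    if fs::symlink_metadata(&dest).is_ok() {
        fs::remove_file(&dest)?;
    }

    // Prefer a symlink (cheap for multi-gigabyte FASTA files) on Unix.
    #[cfg(unix)]
    {
        if std::os::unix::fs::symlink(&source, &dest).is_ok() {
            return Ok(dest);
        }
    }
    // Fallback: a full copy.
    fs::copy(&source, &dest)?;
    Ok(dest)
}

/// Hypothetical helper: remove the working copy. Removing a symlink deletes
/// only the link, never the original input.
pub fn cleanup_working_copy_sketch(working_copy: &Path) -> io::Result<()> {
    match fs::remove_file(working_copy) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}
```

A caller would pass the returned path to the aligner instead of the original path, then call the cleanup helper once alignment finishes. Cleanup of the index files themselves is not shown.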
During cross-batch alignment, FastGA creates indexes for both query and target batches simultaneously. This means peak disk usage is roughly 2x the per-batch index size.

Changes:
- Divide max_index_bytes by 2 when partitioning into batches
- Each batch now targets half the user's limit, ensuring peak usage (query + target indexes) stays within the specified limit
- Added clearer log messages showing both total and per-batch limits
- Fixed AGC batch estimation to track basepairs instead of cumulative estimated sizes (was incorrectly adding 100MB overhead per sample)

Example: --batch-bytes 20G now creates batches with ~10G estimated index each, so cross-batch alignment peaks at ~20G total.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
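
The halving rule can be illustrated with a small sketch. The names `per_batch_limit` and `partition_genomes` are hypothetical, and the in-order greedy grouping is an assumption for illustration; the commit message only states that the limit is divided by 2 before partitioning.

```rust
/// Per-batch budget: half of the user's --batch-bytes limit, because a
/// cross-batch alignment holds a query index and a target index at once.
pub fn per_batch_limit(max_index_bytes: u64) -> u64 {
    max_index_bytes / 2
}

/// Hypothetical in-order greedy grouping. `estimated_index_bytes[i]` is the
/// estimated index size of genome `i`; the result lists genome indices per batch.
/// A genome larger than the per-batch limit still gets a batch of its own.
pub fn partition_genomes(estimated_index_bytes: &[u64], max_index_bytes: u64) -> Vec<Vec<usize>> {
    let limit = per_batch_limit(max_index_bytes);
    let mut batches: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut current_bytes: u64 = 0;

    for (i, &size) in estimated_index_bytes.iter().enumerate() {
        // Close the current batch if adding this genome would exceed the budget.
        if !current.is_empty() && current_bytes.saturating_add(size) > limit {
            batches.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(i);
        current_bytes = current_bytes.saturating_add(size);
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

With a limit of 20 units, each batch targets 10, so any query batch plus any target batch stays near 20 unless a single genome is itself oversized.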
Changes:
1. Two single-genome FASTAs now use batch mode when --batch-bytes is set
   - Previously, this case skipped batch mode entirely
   - Now properly respects the --batch-bytes limit
2. Added clear warnings when genomes exceed the limit:
   - Shows peak disk usage estimate (sum of two largest batches)
   - Shows minimum --batch-bytes needed for the genomes
   - Explains that batching only helps with many smaller genomes
3. Fixed AGC batch estimation to track basepairs correctly
   - Was adding 100MB overhead per sample instead of per batch

Example output with oversized genomes:
    [batch] WARNING: Peak disk usage (~43.2 GB) will exceed --batch-bytes limit (20.0 GB)
    [batch] NOTE: Minimum --batch-bytes needed for these genomes: 47.6 GB
    [batch] TIP: Large genomes cannot be split - batching only helps with many smaller genomes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
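
A rough sketch of the peak estimate described above (sum of the two largest batches) follows. The function names are hypothetical, and 1 GB = 10^9 bytes is an assumption made only to format the message. How the "minimum --batch-bytes needed" figure is derived is not stated in the commit, so it is not sketched here.

```rust
/// Peak disk estimate: the two largest batch indexes may coexist on disk.
/// With one batch the peak is that batch alone; with none it is zero.
pub fn peak_disk_usage(batch_index_bytes: &[u64]) -> u64 {
    let mut sizes = batch_index_bytes.to_vec();
    sizes.sort_unstable_by(|a, b| b.cmp(a)); // largest first
    sizes.iter().take(2).fold(0u64, |acc, &s| acc.saturating_add(s))
}

/// Hypothetical warning builder, mirroring the WARNING line in the example output.
/// Assumes decimal gigabytes (1 GB = 1e9 bytes).
pub fn peak_warning(batch_index_bytes: &[u64], batch_bytes_limit: u64) -> Option<String> {
    let peak = peak_disk_usage(batch_index_bytes);
    if peak <= batch_bytes_limit {
        return None;
    }
    Some(format!(
        "[batch] WARNING: Peak disk usage (~{:.1} GB) will exceed --batch-bytes limit ({:.1} GB)",
        peak as f64 / 1e9,
        batch_bytes_limit as f64 / 1e9
    ))
}
```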
When genome-level batching produces oversized batches (individual genomes exceed the per-batch limit), automatically switch to sequence-level splitting. This distributes chromosomes/contigs across batches to stay within the disk limit.

New features:
- SequenceInfo and SequenceBatch structs for sequence-level tracking
- parse_sequences() to extract individual sequence info from FASTAs
- partition_sequences_into_batches() with first-fit decreasing bin packing
- write_sequence_batch_fasta() to write batch FASTA files
- run_sequence_batch_alignment() for sequence-level batch alignment

The algorithm:
1. Try genome-level batching first
2. If any batch exceeds limit, switch to sequence-level splitting
3. Parse individual sequences, sort by size (largest first)
4. Use first-fit decreasing to pack sequences into batches
5. Run all-pairs alignment between batches

Example with 2.5Gbp + 1.1Gbp genomes and 20GB limit:
- Before: Would create ~43GB indexes (genome-level)
- After: Creates 5 batches of ~10GB each, peak ~20GB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
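
Steps 3 and 4 of the algorithm above are textbook first-fit decreasing bin packing. The sketch below is a generic version, not the PR's partition_sequences_into_batches(): `SeqSketch` is a hypothetical stand-in for SequenceInfo, and measuring capacity in sequence length (rather than estimated index bytes) is an assumption.

```rust
/// Hypothetical stand-in for the PR's SequenceInfo.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SeqSketch {
    pub name: String,
    pub length: u64,
}

/// First-fit decreasing: sort sequences largest first, then place each one
/// into the first batch that still has room, opening a new batch if none does.
/// A sequence longer than `capacity` ends up alone in its own batch.
pub fn first_fit_decreasing(mut seqs: Vec<SeqSketch>, capacity: u64) -> Vec<Vec<SeqSketch>> {
    // Largest first; ties broken by name so the result is deterministic.
    seqs.sort_by(|a, b| b.length.cmp(&a.length).then_with(|| a.name.cmp(&b.name)));

    // Each bin tracks (bytes used, members).
    let mut bins: Vec<(u64, Vec<SeqSketch>)> = Vec::new();
    for seq in seqs {
        let len = seq.length;
        match bins
            .iter_mut()
            .find(|bin| bin.0.saturating_add(len) <= capacity)
        {
            Some(bin) => {
                bin.0 += len;
                bin.1.push(seq);
            }
            None => bins.push((len, vec![seq])),
        }
    }
    bins.into_iter().map(|(_, members)| members).collect()
}
```

For lengths 7, 5, 4, 3, 1 and capacity 10, this packs the sequences into two batches: {7, 3} and {5, 4, 1}.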
Update dependency to use unavailable-2374/fastga-rs which includes:
- Fix segfault when aligning large genomes (~3GB)
- Fix out-of-bounds error in ALNtoPAF with -pafm option

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This adds support for efficiently aligning two large genomes (e.g., 3+ Gbp each) with controlled disk usage via --batch-bytes.

Key features:
- Asymmetric batch mode: when 2 input files + --batch-bytes, automatically uses asymmetric alignment (file1 → file2) without self-alignment
- Sequence-level splitting: when genomes exceed batch limit, splits individual chromosomes across batches instead of treating whole genome as one unit
- Disk optimization: for unidirectional alignment, only target sequences are written to temp directory; query files are used directly (saves ~50% disk)
- --batch-bidirectional flag for optional A→B and B→A alignment

Bug fixes:
- Fix rayon thread pool initialization order to prevent "already initialized" error

Implementation details:
- run_asymmetric_batch_alignment(): genome-level batching for asymmetric mode
- run_asymmetric_sequence_batch_alignment(): sequence-level fallback when genomes are too large for genome-level batching
- Automatic detection: if any batch exceeds per-batch limit after genome-level partitioning, switches to sequence-level splitting
- Proper cleanup of GDB files and batch directories

Example usage:
    sweepga queries.fa targets.fa --batch-bytes 20G                          # queries→targets only
    sweepga queries.fa targets.fa --batch-bytes 20G --batch-bidirectional   # both directions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
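
To make the asymmetric mode concrete, the sketch below enumerates which batch pairs would be aligned. All types and the function `plan_asymmetric_jobs` are hypothetical and only illustrate the behaviour described above: file1 batches against file2 batches, never a file against itself, with the reverse direction added when bidirectional mode is requested.

```rust
/// Which of the two input files a batch came from (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InputFile {
    A,
    B,
}

/// A batch identified by its source file and its index within that file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BatchRef {
    pub file: InputFile,
    pub index: usize,
}

/// One alignment to run: `query` aligned against `target`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AlignJob {
    pub query: BatchRef,
    pub target: BatchRef,
}

/// Enumerate A→B jobs for every batch pair, plus B→A when `bidirectional`.
/// No job ever pairs two batches from the same file, so there is no self-alignment.
pub fn plan_asymmetric_jobs(a_batches: usize, b_batches: usize, bidirectional: bool) -> Vec<AlignJob> {
    let mut jobs = Vec::new();
    for a in 0..a_batches {
        for b in 0..b_batches {
            let batch_a = BatchRef { file: InputFile::A, index: a };
            let batch_b = BatchRef { file: InputFile::B, index: b };
            jobs.push(AlignJob { query: batch_a, target: batch_b });
            if bidirectional {
                jobs.push(AlignJob { query: batch_b, target: batch_a });
            }
        }
    }
    jobs
}
```

With 2 query batches and 3 target batches this yields 6 jobs in the default mode and 12 with the bidirectional option.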
70e27dd to
5905715
Compare
Summary
This PR adds asymmetric batch alignment for large genome pairs and fixes critical issues with --tempdir and segmentation faults when aligning large genomes.
Changes
New Features
- Asymmetric batch mode: with 2 input files and --batch-bytes, uses asymmetric alignment (file1 → file2) without self-alignment within each file
- --batch-bidirectional flag: Optional flag for bidirectional alignment (A→B and B→A)
Bug Fixes
Example usage