Extract specific byte ranges from BAM/CRAM files and convert to interleaved FASTQ format. Designed for parallel processing across compute nodes without requiring pre-indexing.
- No pre-indexing required - accepts approximate byte offsets
- Auto-aligns to block boundaries - finds the next valid BGZF block at or after the start offset
- Byte-range based - process arbitrary byte ranges for easy parallelization
- No overlap - using contiguous byte ranges guarantees no duplicate reads
- Interleaved FASTQ output - same format as `samtools fastq`
- Parallel-ready - designed for distributed processing
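As a sketch of the interleaved layout (using a hypothetical `Read` type for illustration, not the tool's actual internals): the two mates of a pair are written back to back, R1 then R2, with the `/1` and `/2` name suffixes that `samtools fastq` emits by default.

```rust
// Hypothetical record type for illustration; the real tool converts BAM records.
struct Read {
    name: String,
    seq: String,
    qual: String,
}

// Format one FASTQ record; `mate` is 1 or 2, matching the /1 and /2 suffixes.
fn fastq_record(r: &Read, mate: u8) -> String {
    format!("@{}/{}\n{}\n+\n{}\n", r.name, mate, r.seq, r.qual)
}

// Interleave read pairs: R1 then R2, back to back, for each pair.
fn interleave(pairs: &[(Read, Read)]) -> String {
    let mut out = String::new();
    for (r1, r2) in pairs {
        out.push_str(&fastq_record(r1, 1));
        out.push_str(&fastq_record(r2, 2));
    }
    out
}
```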
```
cargo build --release
```

The binary is at `target/release/bamslice`.
```
bamslice \
  --input input.bam \
  --start-offset 0 \
  --end-offset 10000000 \
  --output output.fastq
```

- `--input, -i`: Input BAM
- `--start-offset, -s`: Starting byte offset (will find the next BGZF block at or after this offset)
- `--end-offset, -e`: Ending byte offset (will stop when reaching a block at or after this offset)
- `--output, -o`: Output FASTQ file (default: stdout)
Extract the first half of a file:

```
FILE_SIZE=$(stat -f%z input.bam)   # macOS
# FILE_SIZE=$(stat -c%s input.bam) # Linux
HALF=$((FILE_SIZE / 2))
bamslice -i input.bam -s 0 -e $HALF -o first_half.fastq
```

Extract the second half (no overlap!):

```
bamslice -i input.bam -s $HALF -e $FILE_SIZE -o second_half.fastq
```

Output to stdout:

```
bamslice -i input.bam -s 0 -e 1000000 | head -n 4
```

The tool uses byte ranges, making it trivial to parallelize without coordination.
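The chunking arithmetic is simple. A sketch (a hypothetical helper, not part of the tool) that produces at most N contiguous, non-overlapping `[start, end)` ranges covering a file:

```rust
// Split `size` bytes into at most `n` contiguous, non-overlapping [start, end)
// ranges. Feeding these to --start-offset/--end-offset covers the whole file
// without duplicate reads, since bamslice aligns each range to block boundaries.
fn chunk_ranges(size: u64, n: u64) -> Vec<(u64, u64)> {
    let chunk = (size + n - 1) / n; // ceiling division so the ranges cover `size`
    (0..n)
        .map(|i| (i * chunk, ((i + 1) * chunk).min(size)))
        .filter(|&(s, e)| s < e) // drop empty trailing ranges for tiny files
        .collect()
}
```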
See example.nf for a pipeline that pipes bamslice output through fastp for QC/filtering.
```
nextflow run example.nf --bam input.bam --chunk_size 104857600
```

- BGZF Structure: BAM files use BGZF (Blocked GZIP) - a series of independently compressed blocks
- Block Discovery: Given a start offset, scans forward to find the next valid BGZF block (magic: `0x1f 0x8b 0x08`)
- Range Processing: Processes all reads from blocks starting before `end_offset`
- No Overlap: Each block is processed by exactly one job when using contiguous byte ranges
- FASTQ Output: Converts BAM records to interleaved FASTQ format
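The block-discovery step can be sketched as a forward scan for the magic bytes (a simplified illustration, not the tool's actual code; full BGZF validation would also check the rest of the header, e.g. FLG = 0x04 and the "BC" extra subfield defined in the SAM specification):

```rust
// Scan forward from `start` for the next occurrence of the gzip/BGZF magic
// bytes 0x1f 0x8b 0x08, returning its offset. Simplified: a robust scanner
// would also verify the remaining BGZF header fields before treating the
// match as a block boundary.
fn next_block_offset(data: &[u8], start: usize) -> Option<usize> {
    const MAGIC: [u8; 3] = [0x1f, 0x8b, 0x08];
    (start..data.len().saturating_sub(MAGIC.len() - 1))
        .find(|&i| data[i..i + MAGIC.len()] == MAGIC)
}
```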
- No indexing overhead: no need to scan the entire file first
- Trivial parallelization: Just choose your start/end offsets (see example nextflow)
- No coordination: Each process works independently
- Guaranteed coverage: Contiguous ranges ensure no reads are skipped
- No duplication: Block alignment ensures no reads are processed twice
Run the test suite to verify correctness:

```
cargo test
```

Lint the codebase:

```
make lint
```

Run a coverage analysis:

```
make coverage && open target/coverage/html/index.html
```

Build a flamegraph for performance profiling:

```
make flamegraph && open flamegraph.svg
```

Run the performance benchmark:

```
make bench && open target/criterion/report/index.html
```

Release a new version:
```
# update Cargo.toml with the new version
git commit -m 'update package version to vX.Y.Z'
git tag -m 'tag for release' vX.Y.Z
git push --follow-tags
cargo publish
```

AGPLv3 - See LICENSE file for details.