Changelog

Latest Changes (Mismatch Calculation Update)

Breaking Change: Mismatch Tolerance Calculation

Changed: Mismatch tolerance now calculated based on full sequence length instead of per-copy.

Before:

Each copy could have up to 10% mismatches independently
Example: 4bp motif × 5 copies → max 1 mismatch per copy (5 positions total can mismatch)
Formula: max_mismatches_per_copy = max(1, ⌈0.1 × motif_length⌉)

After:

Total mismatches across all copies ≤ 10% of full array length
Example: 4bp motif × 5 copies = 20bp total → max 2 total mismatches across entire array
Formula: max_mismatches = max(1, ⌈0.1 × (motif_length × n_copies)⌉)

Impact:

More stringent for arrays with many copies (better specificity)
More lenient for arrays with few copies (better sensitivity)
Results will differ from previous versions when detecting imperfect repeats

Banner text updated:

Mismatch tolerance:  Enabled (10% of full sequence)

Migration: If you need the old per-copy behavior, you can disable imperfect matching entirely with --no-mismatches flag (exact matches only).

Previous Changes (Performance & Usability Update)

New Defaults (User-Friendly Configuration)

Before:

Output: tandem_repeats.bed (BED format)
Format: BED (8 columns)
Tiers: User must specify --tier1 and/or --tier2
Parallelism: Sequential processing (no parallelism)

After:

Output: repeat.tab (more intuitive filename)
Format: STRfinder CSV (11 columns, tab-delimited)
Tiers: Tier 1 + Tier 2 enabled by default
Parallelism: 4 CPU cores by default

Command Line Changes

1. Simplified Parallelism

Removed: --parallel flag
Changed: --jobs N now controls parallelism directly
- --jobs 4 (default): Use 4 CPU cores
- --jobs 8: Use 8 CPU cores
- --jobs 0: Disable parallelism (sequential)

2. Tier Selection

Default: Both Tier 1 and Tier 2 enabled
--tier1: Enable Tier 1 ONLY (short repeats, faster)
Removed: --tier2 flag (Tier 2 is default unless --tier1 specified)
--tier3: Still available for very long repeats

3. Output Defaults

Default output file: repeat.tab (was tandem_repeats.bed)
Default format: strfinder (was bed)

Usage Examples

Before (Old Syntax)

# Parallel processing with STRfinder format
python bwt.py genome.fa --parallel --tier1 --tier2 \
    --format strfinder -o output.csv

# Sequential with BED format
python bwt.py genome.fa --tier1 --tier2 -o output.bed

After (New Syntax)

# Parallel processing with STRfinder format (DEFAULT!)
python bwt.py genome.fa
# Output: repeat.tab (4 cores, Tier 1+2, STRfinder format)

# Use 8 cores
python bwt.py genome.fa --jobs 8

# Sequential (no parallelism)
python bwt.py genome.fa --jobs 0

# Tier 1 only (faster)
python bwt.py genome.fa --tier1

# BED format instead
python bwt.py genome.fa --format bed -o output.bed

Performance Improvements

1. Multiprocessing (NEW)

Chromosome-level parallelism
Default: 4 CPU cores
Linear scaling with core count
Speedup: 4-8× on multi-core systems

2. Adaptive Tier 2 Scanning

Dynamic position/period stepping
Auto-skip for chromosomes >50 Mbp
Speedup: 2-200× for large chromosomes

3. Progress Indicators

Visual progress bar: [████████░░░░] 60.0% (3/5)
Chromosome length display: (248,956,422 bp)
Adaptive mode notifications

Migration Guide

If you have existing scripts, here's how to update them:

# Old command:
python bwt.py genome.fa --parallel --tier1 --tier2 --format strfinder -o output.csv

# New equivalent (simpler!):
python bwt.py genome.fa -o output.csv

# Or even simpler (uses default filename repeat.tab):
python bwt.py genome.fa

Key changes:

Remove --parallel → use --jobs N (default is 4)
Remove --tier2 → default includes Tier 2
Remove --format strfinder → default is STRfinder
Optional: remove -o output.csv → default is repeat.tab

Performance Benchmarks (Updated)

Human chr22 (~50 Mbp)

Mode	Old Time	New Time	Speedup
Default	180s (sequential)	25s (4 cores)	7.2×
Tier 1 only	8s (sequential)	2s (4 cores)	4×
Tier 1 exact	5s (sequential)	1.5s (4 cores)	3.3×

Whole Human Genome (3 Gbp, 24 chromosomes)

Mode	Old Time	New Time	Speedup
Tier 1+2	~10-15 hours	~2-3 hours	5×
Tier 1 only	~3 hours	~45 min	4×
Tier 1 exact	~2 hours	~30 min	4×

Breaking Changes

⚠️ Attention: If you have scripts using the old syntax, update them:

--parallel no longer exists
- Old: --parallel --jobs 8
- New: --jobs 8
--tier2 no longer exists
- Old: --tier1 --tier2
- New: (default, no flag needed)
- Or: Just remove the flags
Default output changed
- Old: tandem_repeats.bed (BED format)
- New: repeat.tab (STRfinder format)
- Fix: Add --format bed -o tandem_repeats.bed to get old behavior

Compatibility

Backwards compatibility preserved for:

All output formats (bed, vcf, trf_table, trf_dat, strfinder)
All tier options (--tier1, --tier3, --long-reads)
All filtering options (--min-copies, --min-entropy, etc.)
All performance options (--sa-sample, --progress, etc.)

Not backwards compatible:

--parallel flag removed (use --jobs N instead)
--tier2 flag removed (now default)
Default output filename changed
Default format changed

Summary

The tool is now much easier to use with sensible defaults:

One command does it all:

python bwt.py genome.fa

This runs:

✅ Tier 1 + Tier 2 detection
✅ 4 parallel CPU cores
✅ STRfinder CSV output format
✅ Output to repeat.tab
✅ Imperfect repeat detection (SNP tolerance)
✅ Progress bars and length display

Result: Fast, comprehensive tandem repeat detection with minimal configuration!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changelog

Latest Changes (Mismatch Calculation Update)

Breaking Change: Mismatch Tolerance Calculation

Previous Changes (Performance & Usability Update)

New Defaults (User-Friendly Configuration)

Command Line Changes

1. Simplified Parallelism

2. Tier Selection

3. Output Defaults

Usage Examples

Before (Old Syntax)

After (New Syntax)

Performance Improvements

1. Multiprocessing (NEW)

2. Adaptive Tier 2 Scanning

3. Progress Indicators

Migration Guide

Performance Benchmarks (Updated)

Human chr22 (~50 Mbp)

Whole Human Genome (3 Gbp, 24 chromosomes)

Breaking Changes

Compatibility

Summary

FilesExpand file tree

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

Changelog

Latest Changes (Mismatch Calculation Update)

Breaking Change: Mismatch Tolerance Calculation

Previous Changes (Performance & Usability Update)

New Defaults (User-Friendly Configuration)

Command Line Changes

1. Simplified Parallelism

2. Tier Selection

3. Output Defaults

Usage Examples

Before (Old Syntax)

After (New Syntax)

Performance Improvements

1. Multiprocessing (NEW)

2. Adaptive Tier 2 Scanning

3. Progress Indicators

Migration Guide

Performance Benchmarks (Updated)

Human chr22 (~50 Mbp)

Whole Human Genome (3 Gbp, 24 chromosomes)

Breaking Changes

Compatibility

Summary