Changed: Mismatch tolerance is now calculated from the full array length (motif length × number of copies) instead of per copy.
Before:
- Each copy could have up to 10% mismatches independently
- Example: 4bp motif × 5 copies → max 1 mismatch per copy (up to 5 mismatches across the whole array)
- Formula:
max_mismatches_per_copy = max(1, ⌈0.1 × motif_length⌉)
After:
- Total mismatches across all copies ≤ 10% of full array length
- Example: 4bp motif × 5 copies = 20bp total → max 2 total mismatches across entire array
- Formula:
max_mismatches = max(1, ⌈0.1 × (motif_length × n_copies)⌉)
Impact:
- More stringent for arrays with many copies (better specificity)
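- More lenient about where mismatches fall: several mismatches may land in a single copy as long as the array total stays within the limit (better sensitivity for locally diverged copies)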
- Results will differ from previous versions when detecting imperfect repeats
Banner text updated:
Mismatch tolerance: Enabled (10% of full sequence)
Migration:
If the array-wide tolerance does not suit your workflow, you can disable imperfect matching entirely with the --no-mismatches flag (exact matches only). Note that this does not restore the old per-copy tolerance.
Before:
- Output: tandem_repeats.bed (BED format)
- Format: BED (8 columns)
- Tiers: User must specify --tier1 and/or --tier2
- Parallelism: Sequential processing (no parallelism)
After:
- Output: repeat.tab (more intuitive filename)
- Format: STRfinder CSV (11 columns, tab-delimited)
- Tiers: Tier 1 + Tier 2 enabled by default
- Parallelism: 4 CPU cores by default
- Removed: --parallel flag
- Changed: --jobs N now controls parallelism directly
  - --jobs 4 (default): Use 4 CPU cores
  - --jobs 8: Use 8 CPU cores
  - --jobs 0: Disable parallelism (sequential)
- Default: Both Tier 1 and Tier 2 enabled
- --tier1: Enable Tier 1 ONLY (short repeats, faster)
- Removed: --tier2 flag (Tier 2 is default unless --tier1 specified)
- --tier3: Still available for very long repeats
- Default output file: repeat.tab (was tandem_repeats.bed)
- Default format: strfinder (was bed)
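The defaults listed above could be expressed with argparse roughly as follows. This is a hypothetical sketch, not the parser used in bwt.py: the flag names and default values come from this changelog, but the function names, the `dest` names, and the assumption that --tier3 adds to the selected tiers (rather than replacing them) are illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bwt.py")
    parser.add_argument("fasta", help="input FASTA file")
    # --jobs replaces the removed --parallel flag; 0 means sequential.
    parser.add_argument("--jobs", type=int, default=4)
    # Without --tier1, both Tier 1 and Tier 2 run.
    parser.add_argument("--tier1", action="store_true")
    parser.add_argument("--tier3", action="store_true")
    parser.add_argument(
        "--format",
        default="strfinder",
        choices=["bed", "vcf", "trf_table", "trf_dat", "strfinder"],
    )
    parser.add_argument("-o", dest="output", default="repeat.tab")
    return parser


def enabled_tiers(args: argparse.Namespace) -> list:
    # --tier1 means Tier 1 ONLY; otherwise Tier 1 + Tier 2.
    tiers = [1] if args.tier1 else [1, 2]
    # Assumption: --tier3 is additive.
    if args.tier3:
        tiers.append(3)
    return tiers
```

Parsing just `["genome.fa"]` with this sketch yields 4 jobs, the strfinder format, and repeat.tab as the output file, matching the new defaults.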
Before:
# Parallel processing with STRfinder format
python bwt.py genome.fa --parallel --tier1 --tier2 \
    --format strfinder -o output.csv
# Sequential with BED format
python bwt.py genome.fa --tier1 --tier2 -o output.bed
After:
# Parallel processing with STRfinder format (DEFAULT!)
python bwt.py genome.fa
# Output: repeat.tab (4 cores, Tier 1+2, STRfinder format)
# Use 8 cores
python bwt.py genome.fa --jobs 8
# Sequential (no parallelism)
python bwt.py genome.fa --jobs 0
# Tier 1 only (faster)
python bwt.py genome.fa --tier1
# BED format instead
python bwt.py genome.fa --format bed -o output.bed
- Chromosome-level parallelism
- Default: 4 CPU cores
- Linear scaling with core count
- Speedup: 4-8× on multi-core systems
- Dynamic position/period stepping
- Auto-skip for chromosomes >50 Mbp
- Speedup: 2-200× for large chromosomes
- Visual progress bar: [████████░░░░] 60.0% (3/5)
- Chromosome length display: (248,956,422 bp)
- Adaptive mode notifications
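Chromosome-level parallelism, with --jobs 0 falling back to sequential processing, can be sketched with the standard library as below. This is a minimal illustration under stated assumptions, not the code in bwt.py: `find_repeats` is a hypothetical placeholder for the per-chromosome work, and `process_genome` is an invented name.

```python
from concurrent.futures import ProcessPoolExecutor


def find_repeats(chromosome):
    # Hypothetical placeholder: the real tool would scan for tandem repeats.
    name, sequence = chromosome
    return name, len(sequence)


def process_genome(chromosomes, jobs=4):
    """Process each (name, sequence) pair, one chromosome per task."""
    if jobs <= 0:
        # Equivalent of --jobs 0: no worker processes, plain loop.
        return [find_repeats(chrom) for chrom in chromosomes]
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        # map() preserves input order, so results line up with chromosomes.
        return list(pool.map(find_repeats, chromosomes))
```

When using worker processes, call `process_genome` from inside an `if __name__ == "__main__":` block so the script stays safe on platforms that start workers by re-importing the main module. Because work is split per chromosome, a genome with one very large chromosome will gain less from extra cores than one with many similarly sized chromosomes.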
If you have existing scripts, here's how to update them:
# Old command:
python bwt.py genome.fa --parallel --tier1 --tier2 --format strfinder -o output.csv
# New equivalent (simpler!):
python bwt.py genome.fa -o output.csv
# Or even simpler (uses default filename repeat.tab):
python bwt.py genome.fa
Key changes:
- Remove --parallel → use --jobs N (default is 4)
- Remove --tier2 → default includes Tier 2 (also drop --tier1 if it was paired with --tier2, since --tier1 on its own now selects Tier 1 only)
- Remove --format strfinder → default is STRfinder
- Optional: remove -o output.csv → default is repeat.tab
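For scripts that build the command line programmatically, the flag changes above can be applied mechanically. The helper below is a hypothetical sketch (`migrate_args` is not part of bwt.py) that rewrites an old argument list so the new version behaves like the old one. It assumes flags are passed as separate tokens (`--format bed`, not `--format=bed`) and that an old run without --parallel was sequential.

```python
def migrate_args(old_args):
    """Translate an old-style bwt.py argument list to the new flag set."""
    args = list(old_args)
    had_parallel = "--parallel" in args
    had_both_tiers = "--tier1" in args and "--tier2" in args

    # --parallel and --tier2 no longer exist.
    drop = {"--parallel", "--tier2"}
    # "--tier1 --tier2" is now the default; --tier1 alone means Tier 1 only.
    if had_both_tiers:
        drop.add("--tier1")
    new_args = [a for a in args if a not in drop]

    # Old defaults were BED output written to tandem_repeats.bed.
    if "--format" not in new_args:
        new_args += ["--format", "bed"]
    if "-o" not in new_args:
        new_args += ["-o", "tandem_repeats.bed"]
    # Old runs without --parallel were sequential; the new default is 4 cores.
    if not had_parallel and "--jobs" not in new_args:
        new_args += ["--jobs", "0"]
    return new_args
```

One caveat: an old command that used --tier2 without --tier1 has no exact counterpart among the flags described here, so this sketch lets it fall through to the new Tier 1 + Tier 2 default.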
| Mode | Old Time | New Time | Speedup |
|---|---|---|---|
| Default | 180s (sequential) | 25s (4 cores) | 7.2× |
| Tier 1 only | 8s (sequential) | 2s (4 cores) | 4× |
| Tier 1 exact | 5s (sequential) | 1.5s (4 cores) | 3.3× |
| Mode | Old Time | New Time | Speedup |
|---|---|---|---|
| Tier 1+2 | ~10-15 hours | ~2-3 hours | 5× |
| Tier 1 only | ~3 hours | ~45 min | 4× |
| Tier 1 exact | ~2 hours | ~30 min | 4× |
- --parallel no longer exists
  - Old: --parallel --jobs 8
  - New: --jobs 8
- --tier2 no longer exists
  - Old: --tier1 --tier2
  - New: (default, no flag needed)
  - Or: Just remove the flags
- Default output changed
  - Old: tandem_repeats.bed (BED format)
  - New: repeat.tab (STRfinder format)
  - Fix: Add --format bed -o tandem_repeats.bed to get old behavior
Backwards compatibility preserved for:
- All output formats (bed, vcf, trf_table, trf_dat, strfinder)
- All tier options (--tier1, --tier3, --long-reads)
- All filtering options (--min-copies, --min-entropy, etc.)
- All performance options (--sa-sample, --progress, etc.)
Not backwards compatible:
- --parallel flag removed (use --jobs N instead)
- --tier2 flag removed (now default)
- Default output filename changed
- Default format changed
The tool is now much easier to use with sensible defaults:
One command does it all:
python bwt.py genome.fa
This runs:
- ✅ Tier 1 + Tier 2 detection
- ✅ 4 parallel CPU cores
- ✅ STRfinder CSV output format
- ✅ Output to repeat.tab
- ✅ Imperfect repeat detection (SNP tolerance)
- ✅ Progress bars and length display
Result: Fast, comprehensive tandem repeat detection with minimal configuration!