60 changes: 60 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,60 @@
name: Deploy website

on:
  push:
    branches: [main]
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: docs
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install pnpm
        uses: pnpm/action-setup@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
          cache-dependency-path: docs/pnpm-lock.yaml

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Build site
        run: pnpm build

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: docs/dist

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
6 changes: 3 additions & 3 deletions CLAUDE.md
@@ -32,7 +32,7 @@ Always run `cargo clippy`, `cargo fmt --check`, and `cargo test` before consider

## Current Status

**396 tests passing, 0 clippy warnings.** SE: 8613/8926 compare_sam.py (96.5%; note: lower due to seeded-RNG tie-break PR diverging from STAR's mt19937), **99.815% faithfulness (tie-adjusted)** (8611/8627 non-tie reads exact), 299 tie-breaking diffs excluded. 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 rustar-aligner-only SE reads**. PE: **8390 both-mapped** (STAR: 8390), **0 half-mapped**, 0 MAPQ inflations / 0 deflations, **99.883% PE exact faithfulness (tie-adjusted)** (16284/16306, 475 tie-breaking diffs excluded), **0 proper-pair diffs**, **0 NH diffs**. Phase 17.A: `scoreSeedBest` pre-extension. Phase 17.B: per-mate seeding. Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter. Phase 17.D: combined-span penalty fix + dedup ordering. Phase 17.8: `--quantMode GeneCounts`. Phase E fix (2026-04-21): mate_id-aware diagonal dedup. Phase E2 (2026-04-22): STAR-faithful combined-read seeding. Phase E3 (2026-04-22): combined-threshold half-mapped fallback. Phase E4 (2026-04-22): PE-CHECK2 unconditional. Phase E5 (2026-04-23): split_combined_wt n_mismatch propagation. Phase E6 (2026-04-24): tie-adjusted faithfulness metric in assess_faithfulness.py. Phase F1: --runRNGseed + seeded primary tie-break (PR #5). Phase F2: --outSAMattrRGline (PR #6). Phase F3: --quantMode TranscriptomeSAM (PR #7). Phase F4: SJDB insertion into Genome+SA at genomeGenerate (PR #8). Phase G1 (2026-04-29): junction_shifts fix in split_combined_wt (rDNA cross-copy false-splice filter). Phase G2 (2026-04-29): MAX_RECURSION 10k→100k + sa_pos_to_forward overflow fix (ERR12389696.7118031 NH=3→9). Phase 17.2 (2026-04-29): coordinate-sorted BAM output (`--outSAMtype BAM SortedByCoordinate` → `Aligned.sortedByCoord.out.bam`). Phase 17.4 (2026-04-29): `--outReadsUnmapped Fastx` → `Unmapped.out.mate1` / `Unmapped.out.mate2`; writes unmapped + TooManyLoci reads; PE writes both mates for fully-unmapped and half-mapped pairs. 
Phase 17.6 (2026-05-01): `--outStd SAM/BAM_Unsorted/BAM_SortedByCoordinate` — routes primary alignment output to stdout via `Box<dyn AlignmentWriter>` trait dispatch; `SamStdoutWriter`, `BamStdoutWriter`, `SortedBamStdoutWriter` in sam.rs/bam.rs; verified with samtools pipe (967 records). Phase G3 (2026-05-01): SA tie-breaking fix — `compare_suffixes` tie-breaker changed from `pos_b.cmp(&pos_a)` to `packed_a.cmp(&packed_b)` (ascending by packed SA value with strand bit); rustar-aligner SA is now **byte-for-byte identical** to STAR's SA for the yeast genome (10,862 → 0 entry diffs). diff AS: 6→4 cases (4 remaining are rustar-aligner improvements: .844151 VIII 0mm vs STAR VII 6mm, .4972950 spliced vs unspliced mate2). Phase 17.3 (2026-05-01): PE chimeric detection — `detect_inter_mate_chimeric` in `chimeric/detect.rs`; intra-mate multi-cluster chimeric via cluster splitting + mate2 read_pos adjustment; inter-mate chimeric for discordant pairs (diff chr, same strand, or >1Mb); `align_paired_read` returns 4-tuple including `Vec<ChimericAlignment>`; no benchmark regression (8390 both-mapped, 0 half-mapped). Phase 17.11 (2026-05-01): `--chimOutType WithinBAM` — chimeric alignments written as supplementary records (FLAG 0x800) in primary BAM; donor record has full SEQ + SA tag; acceptor has FLAG 0x800 + SA tag + empty SEQ; `build_within_bam_records` in `chimeric/output.rs`; `chim_out_junctions()` / `chim_out_within_bam()` helpers in params.rs; supports mixed `--chimOutType Junctions WithinBAM`. Phase 17.7 (2026-05-01): GTF tag parameters — `--sjdbGTFchrPrefix`, `--sjdbGTFfeatureExon`, `--sjdbGTFtagExonParentTranscript`, `--sjdbGTFtagExonParentGene`; `_configured` variants in `junction/gtf.rs`, `quant/mod.rs`, `quant/transcriptome.rs`, `junction/mod.rs`; all 4 production paths thread params; backward-compat wrappers preserve zero test disruption. 
Phase 17.9 (2026-05-01): `--outBAMcompression` (BGZF level -1–9, default 1; -1/0=NONE, 1-8=flate2 levels, ≥9=BEST) + `--limitBAMsortRAM` (bytes, 0=unlimited; aborts sort if ~400 bytes/record estimate exceeds limit); `bgzf_compression()` + `make_bgzf_writer()` helpers in `io/bam.rs`; threaded through all 4 BAM writers (unsorted file, sorted file, unsorted stdout, sorted stdout). PE chimericDetectionOld (2026-05-01): per-mate `detect_chimeric_old` called on `all_m1_transcripts` / `all_m2_transcripts` pools after `filter_paired_transcripts` in `read_align.rs`. Phase 17.12 (2026-05-01): BySJout disk buffering — `BySJReadMeta` struct + `NamedTempFile` SAM temp file replaces `Vec<AlignmentBatchResults>`; `create_bysj_writer` / `bysj_write_records` / `bysj_read_n_records` helpers in `io/sam.rs`; `tempfile` moved to `[dependencies]`. Phase 17.13 (2026-05-01): 8 integration tests in `tests/alignment_features.rs` — synthetic 20kb genome with planted GT-AG intron; tests cover BAM output, PE alignment, spliced reads, BySJout, GeneCounts, unmapped output, two-pass mode. Phase 12.2 (2026-05-04): SE chimeric Tier 1b soft-clip re-mapping — `detect_from_soft_clips` in `chimeric/detect.rs` re-seeds the primary alignment's soft-clipped bases when `detect_chimeric_old` finds no partner; `adjust_read_positions` helper shifts sub-seq coords into full-read space for right clips; called as Step 3c in `read_align.rs`. Phase 17.10 (2026-05-04): Chimeric Tier 3 — `detect_from_chimeric_residuals` in `chimeric/detect.rs` re-seeds outer uncovered read regions (before donor / after acceptor) of each found chimeric pair; enables 3-way gene-fusion detection; called as Step 3d in `read_align.rs`. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs/](docs/) for per-phase notes.
**396 tests passing, 0 clippy warnings.** SE: 8613/8926 compare_sam.py (96.5%; note: lower due to seeded-RNG tie-break PR diverging from STAR's mt19937), **99.815% faithfulness (tie-adjusted)** (8611/8627 non-tie reads exact), 299 tie-breaking diffs excluded. 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 rustar-aligner-only SE reads**. PE: **8390 both-mapped** (STAR: 8390), **0 half-mapped**, 0 MAPQ inflations / 0 deflations, **99.883% PE exact faithfulness (tie-adjusted)** (16284/16306, 475 tie-breaking diffs excluded), **0 proper-pair diffs**, **0 NH diffs**. Phase 17.A: `scoreSeedBest` pre-extension. Phase 17.B: per-mate seeding. Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter. Phase 17.D: combined-span penalty fix + dedup ordering. Phase 17.8: `--quantMode GeneCounts`. Phase E fix (2026-04-21): mate_id-aware diagonal dedup. Phase E2 (2026-04-22): STAR-faithful combined-read seeding. Phase E3 (2026-04-22): combined-threshold half-mapped fallback. Phase E4 (2026-04-22): PE-CHECK2 unconditional. Phase E5 (2026-04-23): split_combined_wt n_mismatch propagation. Phase E6 (2026-04-24): tie-adjusted faithfulness metric in assess_faithfulness.py. Phase F1: --runRNGseed + seeded primary tie-break (PR #5). Phase F2: --outSAMattrRGline (PR #6). Phase F3: --quantMode TranscriptomeSAM (PR #7). Phase F4: SJDB insertion into Genome+SA at genomeGenerate (PR #8). Phase G1 (2026-04-29): junction_shifts fix in split_combined_wt (rDNA cross-copy false-splice filter). Phase G2 (2026-04-29): MAX_RECURSION 10k→100k + sa_pos_to_forward overflow fix (ERR12389696.7118031 NH=3→9). Phase 17.2 (2026-04-29): coordinate-sorted BAM output (`--outSAMtype BAM SortedByCoordinate` → `Aligned.sortedByCoord.out.bam`). Phase 17.4 (2026-04-29): `--outReadsUnmapped Fastx` → `Unmapped.out.mate1` / `Unmapped.out.mate2`; writes unmapped + TooManyLoci reads; PE writes both mates for fully-unmapped and half-mapped pairs. 
Phase 17.6 (2026-05-01): `--outStd SAM/BAM_Unsorted/BAM_SortedByCoordinate` — routes primary alignment output to stdout via `Box<dyn AlignmentWriter>` trait dispatch; `SamStdoutWriter`, `BamStdoutWriter`, `SortedBamStdoutWriter` in sam.rs/bam.rs; verified with samtools pipe (967 records). Phase G3 (2026-05-01): SA tie-breaking fix — `compare_suffixes` tie-breaker changed from `pos_b.cmp(&pos_a)` to `packed_a.cmp(&packed_b)` (ascending by packed SA value with strand bit); rustar-aligner SA is now **byte-for-byte identical** to STAR's SA for the yeast genome (10,862 → 0 entry diffs). diff AS: 6→4 cases (4 remaining are rustar-aligner improvements: .844151 VIII 0mm vs STAR VII 6mm, .4972950 spliced vs unspliced mate2). Phase 17.3 (2026-05-01): PE chimeric detection — `detect_inter_mate_chimeric` in `chimeric/detect.rs`; intra-mate multi-cluster chimeric via cluster splitting + mate2 read_pos adjustment; inter-mate chimeric for discordant pairs (diff chr, same strand, or >1Mb); `align_paired_read` returns 4-tuple including `Vec<ChimericAlignment>`; no benchmark regression (8390 both-mapped, 0 half-mapped). Phase 17.11 (2026-05-01): `--chimOutType WithinBAM` — chimeric alignments written as supplementary records (FLAG 0x800) in primary BAM; donor record has full SEQ + SA tag; acceptor has FLAG 0x800 + SA tag + empty SEQ; `build_within_bam_records` in `chimeric/output.rs`; `chim_out_junctions()` / `chim_out_within_bam()` helpers in params.rs; supports mixed `--chimOutType Junctions WithinBAM`. Phase 17.7 (2026-05-01): GTF tag parameters — `--sjdbGTFchrPrefix`, `--sjdbGTFfeatureExon`, `--sjdbGTFtagExonParentTranscript`, `--sjdbGTFtagExonParentGene`; `_configured` variants in `junction/gtf.rs`, `quant/mod.rs`, `quant/transcriptome.rs`, `junction/mod.rs`; all 4 production paths thread params; backward-compat wrappers preserve zero test disruption. 
Phase 17.9 (2026-05-01): `--outBAMcompression` (BGZF level -1–9, default 1; -1/0=NONE, 1-8=flate2 levels, ≥9=BEST) + `--limitBAMsortRAM` (bytes, 0=unlimited; aborts sort if ~400 bytes/record estimate exceeds limit); `bgzf_compression()` + `make_bgzf_writer()` helpers in `io/bam.rs`; threaded through all 4 BAM writers (unsorted file, sorted file, unsorted stdout, sorted stdout). PE chimericDetectionOld (2026-05-01): per-mate `detect_chimeric_old` called on `all_m1_transcripts` / `all_m2_transcripts` pools after `filter_paired_transcripts` in `read_align.rs`. Phase 17.12 (2026-05-01): BySJout disk buffering — `BySJReadMeta` struct + `NamedTempFile` SAM temp file replaces `Vec<AlignmentBatchResults>`; `create_bysj_writer` / `bysj_write_records` / `bysj_read_n_records` helpers in `io/sam.rs`; `tempfile` moved to `[dependencies]`. Phase 17.13 (2026-05-01): 8 integration tests in `tests/alignment_features.rs` — synthetic 20kb genome with planted GT-AG intron; tests cover BAM output, PE alignment, spliced reads, BySJout, GeneCounts, unmapped output, two-pass mode. Phase 12.2 (2026-05-04): SE chimeric Tier 1b soft-clip re-mapping — `detect_from_soft_clips` in `chimeric/detect.rs` re-seeds the primary alignment's soft-clipped bases when `detect_chimeric_old` finds no partner; `adjust_read_positions` helper shifts sub-seq coords into full-read space for right clips; called as Step 3c in `read_align.rs`. Phase 17.10 (2026-05-04): Chimeric Tier 3 — `detect_from_chimeric_residuals` in `chimeric/detect.rs` re-seeds outer uncovered read regions (before donor / after acceptor) of each found chimeric pair; enables 3-way gene-fusion detection; called as Step 3d in `read_align.rs`. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs-old/](docs-old/) for per-phase development notes. The published Astro Starlight docs site is in [docs/](docs/).
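
The Phase 17.9 note above describes two small rules: how `--outBAMcompression` maps to a BGZF setting, and how `--limitBAMsortRAM` gates the coordinate sort. A minimal sketch of both follows. It is not the actual code in `io/bam.rs`: the `BgzfLevel` enum, the function names, and the treatment of values below -1 as "no compression" are assumptions made for illustration, and the real helper presumably returns a flate2 compression type instead.

```rust
/// Hypothetical stand-in for the compression setting chosen from the CLI flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BgzfLevel {
    None,
    Level(u32),
    Best,
}

/// Map an `--outBAMcompression` value to a setting, following the note:
/// -1/0 = NONE, 1-8 = that level, >= 9 = BEST.
/// Assumption: values below -1 are also treated as NONE.
fn bgzf_level_from_flag(level: i32) -> BgzfLevel {
    match level {
        l if l <= 0 => BgzfLevel::None,
        1..=8 => BgzfLevel::Level(level as u32),
        _ => BgzfLevel::Best,
    }
}

/// Rough per-record memory estimate quoted in the note (~400 bytes/record).
const EST_BYTES_PER_RECORD: u64 = 400;

/// Return true if the sort may proceed under `--limitBAMsortRAM`.
/// A limit of 0 means "unlimited".
fn sort_fits_in_ram(n_records: u64, limit_bytes: u64) -> bool {
    if limit_bytes == 0 {
        return true;
    }
    n_records.saturating_mul(EST_BYTES_PER_RECORD) <= limit_bytes
}
```

Under these assumptions, `bgzf_level_from_flag(1)` gives the documented default of level 1, and `sort_fits_in_ram(1_000, 100_000)` is false because the estimate of 400,000 bytes exceeds the limit.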

## Source Layout

@@ -159,7 +159,7 @@ Previously listed issues now resolved:
- `ERR12389696.5825571`: now aligns as `XV:80779 121M607028N13M16S` (exact match with STAR). Root cause: rustar-aligner computed a finite intron limit of 589824 when `alignIntronMax=0`, blocking the 607kb intron. STAR uses `>0` guard in `stitchAlignToTranscript.cpp` line 100 — alignIntronMax=0 means no limit. Fix: use `u32::MAX` sentinel in score.rs.
- `ERR12389696.16030539`: MAPQ inflation resolved. Both tools now find two alignments (XV:121224 128M925400N10M12S and XV:598336 128M448288N10M12S), both MAPQ=3. Primary selection differs (tie). MAPQ inflate: 1→0.
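
The `alignIntronMax=0` fix in the first item can be sketched as below. This is only an illustration of the sentinel idea as described there, not the code in score.rs: the function names are hypothetical, and the sketch assumes the limit is stored as a `u32`.

```rust
/// Hypothetical helper: turn the user-facing `alignIntronMax` value into the
/// limit applied while stitching. Per the note above, 0 means "no limit",
/// which is represented here by the `u32::MAX` sentinel.
fn effective_intron_max(align_intron_max: u32) -> u32 {
    if align_intron_max > 0 {
        align_intron_max
    } else {
        u32::MAX
    }
}

/// Hypothetical check: is an intron of `intron_len` bases allowed?
fn intron_allowed(intron_len: u32, align_intron_max: u32) -> bool {
    intron_len <= effective_intron_max(align_intron_max)
}
```

With this guard, the 607028-base intron from `ERR12389696.5825571` passes when `alignIntronMax` is 0, whereas a finite limit of 589824 rejects it.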

See [ROADMAP.md](ROADMAP.md) and [docs/](docs/) for full issue tracking.
See [ROADMAP.md](ROADMAP.md) and [docs-old/](docs-old/) for full issue tracking.

## PE Status (Updated 2026-04-29 — Phase G2)

@@ -176,4 +176,4 @@ See [ROADMAP.md](ROADMAP.md) and [docs/](docs/) for full issue tracking.
- No STARsolo single-cell features — Phase 14 (deferred)
- 4 PE AS diffs (rustar-aligner improvements, not bugs): `.844151` finds VIII:451791 0mm vs STAR's VII:1001391 6mm; `.4972950` finds correct spliced mate2 vs STAR's unspliced. Both cases: STAR's combined-window approach fails to stitch a PE pair at the better location.

See [docs/phase17_features.md](docs/phase17_features.md) for full feature status.
See [docs-old/phase17_features.md](docs-old/phase17_features.md) for full feature status.
15 changes: 14 additions & 1 deletion CONTRIBUTING.md
@@ -20,10 +20,23 @@ Small synthetic and yeast test data lives in `test/`. Integration tests in `test

## Project history

rustar-aligner was written as a faithful port of [STAR](https://github.com/alexdobin/STAR) by Alexander Dobin. Up to the initial release, the goal was behavioral parity with STAR — matching its algorithms, thresholds, and output formats as closely as possible. Notes from that development phase are in `docs/dev/`.
rustar-aligner was written as a faithful port of [STAR](https://github.com/alexdobin/STAR) by Alexander Dobin. Up to the initial release, the goal was behavioral parity with STAR — matching its algorithms, thresholds, and output formats as closely as possible. Notes from that development phase are in `docs-old/` (`docs-old/dev/` and the `phase*.md` files).

Future development is not bound by that constraint. Adding STARsolo, new features, or diverging from STAR behavior is entirely welcome.

## Documentation site

The user-facing docs site is an [Astro Starlight](https://starlight.astro.build/) project under `docs/`:

```bash
cd docs
pnpm install
pnpm dev # local dev server
pnpm build # production build into docs/dist/
```

Content lives under `docs/src/content/docs/` as Markdown / MDX files with YAML frontmatter (`title`, `description`). Sidebar order is configured in `docs/astro.config.mjs`. Site-wide design tokens (colours, fonts, graph-paper background, wave dividers) live in `docs/src/styles/custom.css` and can be tuned in one place.

## License

MIT, matching the original STAR license.
2 changes: 1 addition & 1 deletion README.md
@@ -1,4 +1,4 @@
# ![rustar-aligner](docs/logo/rustar-logo.svg)
# ![rustar-aligner](docs/src/assets/rustar-logo.svg)

A Rust reimplementation of [STAR](https://github.com/alexdobin/STAR) (Spliced Transcripts Alignment to a Reference), the widely-used RNA-seq aligner originally written in C++ by Alexander Dobin.
