
Commit a5e1bb3

debpalash and claude authored
docs(v0.3.0): research + 18 plans for fat-milestone planning (#87)
* docs(phase-5): research opt-in bug reporting

  Phase 5 research: prefilled-URL GitHub Issues pattern, default-deny payload, redaction layer, two-step consent UX, rate/dedup/recursion safeguards, aggregation across Python/Rust/React error producers.

  Builds on Phase 1's links.py + errorDocsMap deeplink infrastructure; uses already-installed @tauri-apps/plugin-opener (^2.5.4). No new packages required.

  Covers REPORT-01..12 with confidence levels, 8 pitfalls, subprocess-engine error capture handoff to Phase 2, security domain mapped to ASVS, and 3-wave delivery plan (redactor + payload, consent UI, aggregation + pre-submit search).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(phases): research for Phases 2, 3, 4, 6 (Engine + Supertonic + Spikes + Release)

* docs(stack): bump supertonic pin 1.2.3 → 1.3.1 (Phase 3 research finding)

* docs(phases): plan Phases 2-6 for v0.3.0 fat-milestone release

  15 new plan files + 2 ADR decision docs across 5 phases. Combined with Phase 1's 3 plans, the v0.3.0 milestone now has 18 PLAN.md files covering all 7 phases (Phase 0 already complete via PR #71).

  PHASE 2 (Engine Isolation, 4 plans):
  - 02-01: SubprocessBackend primitive + echo sidecar POC + graceful is_available wrap (ENGINE-01/05)
  - 02-02: _safe_torchaudio_save helper + migrate 11 WAV write sites + #48 regression (BUG-01)
  - 02-03: IndexTTS sidecar entry + venv-probe bootstrap + IndexTTS2Backend rewire (ENGINE-02/03/04/07, closes #42)
  - 02-04: Engine Compatibility Matrix UI + /engines/{id}/health route (ENGINE-06)

  PHASE 3 (Supertonic-3 + Mirror, 2 plans):
  - 03-01: Supertonic-3 engine on SubprocessBackend + SHA pin + license gate (TTS-01..06)
  - 03-02: bootstrap.rs mirror cascade + UV_DEFAULT_INDEX migration + frozen enforcement + docs (INST-07..11)

  PHASE 4 (Spike-first Adaptive & Specialty, 2 plans + 2 ADRs):
  - 04-01: OmniVoice-GGUF hardware-adaptive engine + quant_map + bundled binaries (SPIKE-01, GGUF-01..06)
  - 04-02: OmniVoice-Singing subclass + dub pipeline singing mode + segment detector (SPIKE-02, SING-01..05)
  - SPIKE-01-gguf.md + SPIKE-02-singing.md ADRs in .planning/decisions/

  PHASE 5 (Opt-in Bug Reporting, 3 plans):
  - 05-01: Redactor + BugReporter + URL builder + rate/dedup/recursion safeguards + FastAPI router (REPORT-01/02/03/05/06/07/08/10/11)
  - 05-02: BugReportDialog two-step consent + PrivacyPanel + ErrorBoundary integration + Rust panic hook (chained) (REPORT-01-Rust/04/09/12)
  - 05-03: Dry-run vs 3 historical issues + cross-platform openUrl smoke + Phase 2 subprocess-errors handoff (REPORT-02 smoke, REPORT-03 expansion, REPORT-09)

  PHASE 6 (Release + Retro, 4 plans):
  - 06-01: rc1 prep: version bump across 4 sources + CHANGELOG + retro stub + PR-73-strategy doc (REL-01/03/06)
  - 06-02: CI guards: workflow-parity actionlint + tag-shaped dry-run (Phase 0 retro options B + C; closes release-engineer gap)
  - 06-03: PR #73 reimplementation (NOT rebase): backend-split installer with mirror-cascade integration + pill-mode regression checkpoint
  - 06-04: Execute the release: pre-tag gates + 4-OS clean-VM + 48h soak + tag + retro + 3 v0.4 deferral tracking issues (REL-01/02/03/04/05/06)

  Scope decisions locked in plans (council session):
  - SoniTranslate refactor DEFERRED to v0.4 (Phase 2 ships SubprocessBackend without migrating Soni)
  - macOS notarization DEFERRED to v0.4 (Phase 6 ships xattr -cr automation per CLAUDE.md Key Decision #7)
  - supertonic pin 1.2.3 → 1.3.1 (already committed in ba63733)
  - SPIKE-01 and SPIKE-02 both GO; 13/13 Phase 4 reqs stay in scope
  - PR #73 reimplemented, not rebased (93 commits behind main)

  All 18 plans validated via gsd-sdk frontmatter.validate + verify.plan-structure.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e4dbf4c commit a5e1bb3

23 files changed

Lines changed: 9875 additions & 4 deletions


Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# SPIKE-01: Adopt `Serveurperso/OmniVoice-GGUF` as hardware-adaptive default cloning engine

**Status:** Proposed (research-supported); awaiting Phase 2 SubprocessBackend merge
**Date:** 2026-05-18
**Decision-makers:** [maintainer]
**Related:** ROADMAP Phase 4; REQUIREMENTS GGUF-01..06; `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md`

## Context

OmniVoice Studio v0.2.7 ships `k2-fsa/OmniVoice` (Apache-2.0, 0.6B Qwen3 backbone, Higgs Audio v2 codec at 24 kHz mono) as its default voice-cloning engine via `backend/services/tts_backend.py:OmniVoiceBackend`. The in-process Python path requires PyTorch (CUDA, MPS, or CPU) and falls back to CPU inference on 4 GB-VRAM GPUs.

`Serveurperso/OmniVoice-GGUF` (HuggingFace, 10,603 downloads/month, verified 2026-05-18) publishes 4 quantizations of the same upstream model: Q4_K_M (~659 MB VRAM), Q8_0 (~945 MB, recommended balance), BF16 (~1.6 GB), and F32 (~3.2 GB). They are consumable through the MIT-licensed `omnivoice.cpp` runtime (`github.com/ServeurpersoCom/omnivoice.cpp`, 38 stars, 59 commits, 6 open issues). The quants use a custom `omnivoice-lm` architecture and do **not** load in vanilla llama.cpp.

The decision is whether to integrate the GGUF engine as a hardware-adaptive default with an overridable fallback to the existing in-process `OmniVoiceBackend`.

## Decision

**GO**: integrate per GGUF-01..06.

The integration shape is `OmniVoiceGGUFBackend(TTSBackend)` wrapping Phase 2's `SubprocessBackend`, which spawns a bundled per-platform `omnivoice-tts` binary built from a pinned `omnivoice.cpp` commit SHA. Quant selection is driven by a `detect_capabilities()` extension of `backend/services/gpu_sandbox.py` that maps `(compute_class) → quant filename` via a shippable `quant_map.json`. On hardware where probe and load both succeed, GGUF becomes the default cloning engine; on any failure the existing in-process `OmniVoiceBackend` is the fallback.
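
The selection-plus-fallback rule described above can be sketched in a few lines. This is a minimal illustration, not the planned implementation: the compute-class names, the quant filenames, the engine ids, and the `probe_and_load` callable are all hypothetical, and the real `quant_map.json` schema is not shown in this ADR.

```python
from typing import Callable, Mapping, Optional

# Hypothetical table contents standing in for quant_map.json.
# Neither the keys nor the filenames come from the ADR.
QUANT_MAP = {
    "gpu_low_vram": "omnivoice-Q4_K_M.gguf",
    "gpu_mid_vram": "omnivoice-Q8_0.gguf",
    "gpu_high_vram": "omnivoice-BF16.gguf",
}

GGUF_ENGINE = "omnivoice-gguf"   # hypothetical engine id
FALLBACK_ENGINE = "omnivoice"    # hypothetical id for the in-process backend


def select_quant(compute_class: str, quant_map: Mapping[str, str]) -> Optional[str]:
    """Return the quant filename mapped to a compute class, or None if unmapped."""
    return quant_map.get(compute_class)


def choose_default_engine(
    compute_class: str,
    quant_map: Mapping[str, str],
    probe_and_load: Callable[[str], None],
) -> str:
    """Pick GGUF only when a quant is mapped and probe + load succeed."""
    quant = select_quant(compute_class, quant_map)
    if quant is None:
        return FALLBACK_ENGINE
    try:
        probe_and_load(quant)  # assumed to raise on any probe or load failure
    except Exception:
        return FALLBACK_ENGINE
    return GGUF_ENGINE
```

Treating every exception as a reason to fall back mirrors the "on any failure" wording; a real implementation would likely also log the cause so the fallback is visible to the user.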

## Consequences

**Positive:**
- 4 GB-VRAM GPUs (currently falling back to CPU on the in-process path) get GPU-backed cloning via Q4_K_M.
- A smaller VRAM footprint leaves headroom for other engines when users run several in one session.
- License chain unchanged (Apache-2.0 model + MIT runtime).
- Same underlying model as what already ships: in the worst case it ties the in-process path on a given hardware class, and that path stays as the fallback.

**Negative / risk:**
- Adds a maintained-by-others C++ runtime to the dependency graph (`omnivoice.cpp`, 38 stars at decision time).
- Adds ~12-16 MB of platform binaries to the installer (must verify against Phase 3 mirror-timing baseline per Pitfall 6).
- macOS code signing scope expands by 4 binaries (track via REL-05; same `xattr -cr` workaround as #54 applies in v0.3.x).
- `omnivoice.cpp` README does not publish a macOS Metal build script, only `buildcpu.sh`, `buildcuda.sh`, `buildvulkan.sh`, and `buildall.sh`. Apple Silicon Metal must be verified in Wave 1.

**Mitigations:**
- Pin `omnivoice.cpp` by commit SHA; rebuild from pinned SHA in CI for all 4 target platforms.
- Pin every quant file by commit SHA in `quant_map.json` (shippable JSON so the table can update without an app release).
- In-process `OmniVoiceBackend` remains as fallback if any GGUF step fails (probe, download, load, generate).
- macOS Apple Silicon Metal build is verified in Wave 1 with explicit acceptance criteria; if blocked, downgrade SPIKE-01 default on macOS to in-process path and document in this ADR's "Status" line.
- SHA-256 checksums on bundled binaries (per GATE-05); verify at first launch and on every quant load.
- Subprocess arg composition uses typed `Path` objects rooted in app directories; quant override UI is a dropdown over `quant_map.json` entries only (no freeform path input; a supply-chain control analogous to INST-09).
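
Two of the mitigations above, checksum verification and allowlisted path composition, can be illustrated with the standard library alone. The sketch below is an assumption-laden example: the function names, the `models_dir` layout, and the shape of the allowlist are hypothetical and do not describe the project's actual code.

```python
import hashlib
import hmac
from pathlib import Path
from typing import Collection


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large binaries are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the pinned value (case-insensitive hex)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


def resolve_quant_path(models_dir: Path, requested: str, allowed: Collection[str]) -> Path:
    """Build a quant path only from allowlisted names, rooted under models_dir.

    `allowed` stands in for the filenames listed in quant_map.json (hypothetical shape).
    """
    if requested not in allowed:
        raise ValueError(f"quant {requested!r} is not in the allowlist")
    root = models_dir.resolve()
    candidate = (root / requested).resolve()
    # Defence in depth: reject anything that escapes the app directory.
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"quant path escapes {root}")
    return candidate
```

The second check is redundant when the allowlist is trusted, but it keeps a tampered `quant_map.json` entry such as `../x.gguf` from reaching the subprocess arguments.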

## Sources

- `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md` (this milestone's research)
- https://huggingface.co/Serveurperso/OmniVoice-GGUF (verified 2026-05-18)
- https://github.com/ServeurpersoCom/omnivoice.cpp (verified 2026-05-18)
- https://huggingface.co/k2-fsa/OmniVoice (upstream, Apache-2.0)
- `backend/services/tts_backend.py` (existing `OmniVoiceBackend` reference)
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# SPIKE-02: Adopt `ModelsLab/omnivoice-singing` as singing variant of the existing engine

**Status:** Proposed (research-supported); awaiting Phase 2 SubprocessBackend merge
**Date:** 2026-05-18
**Decision-makers:** [maintainer]
**Related:** ROADMAP Phase 4; REQUIREMENTS SING-01..05; `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md`

## Context

`ModelsLab/omnivoice-singing` (HuggingFace, 1,053 downloads/month, verified 2026-05-18) is a finetune of `k2-fsa/OmniVoice` with the same Apache-2.0 license, the same Qwen3-0.6B backbone, the same Higgs Audio v2 codec at 24 kHz mono, and the same `omnivoice` PyPI library (0.1.5, 2026-04-28) already shipping in OmniVoice Studio v0.2.7. It is trained on additional singing and emotion-tagged data and is activated by a `[singing]` text control tag at generation time.

OmniVoice's existing `dub_pipeline.py` runs Demucs to split source audio into vocal and instrumental stems and routes the vocal stem through the default TTS engine. Today this produces speech-like output even on sung source material, which is one of the loudest user complaints when dubbing music-adjacent content.

The decision is whether to integrate the singing finetune as a routed alternative for sung segments, with auto-detection and per-segment override.

## Decision

**GO with reduced scope**: integrate per SING-01..05.

The integration shape is `OmniVoiceSingingBackend(OmniVoiceBackend)`, a ≤30-line subclass that overrides `id`, `display_name`, and the `from_pretrained` model ID, and auto-injects the `[singing]` control tag in `generate()` unless the prompt already starts with a `[`-prefixed tag. The dubbing pipeline gains a "singing mode" toggle and a segment-routing path (vocal stem → singing engine for sung segments, vocal stem → default engine for spoken segments, instrumental stem preserved untouched). Segment detection uses a pitch-stability + energy heuristic on the Demucs vocal stem with per-segment user override in the dubbing UI.
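
The tag-injection rule is small enough to show in isolation. The helper below is a sketch of that rule only; the function name is hypothetical, and ignoring leading whitespace before the `[` check is an assumption the ADR does not state.

```python
SINGING_TAG = "[singing]"


def inject_singing_tag(prompt: str) -> str:
    """Prepend the singing control tag unless the prompt already opens with a tag.

    Leading whitespace is skipped before the check (an assumption, not from the ADR).
    """
    if prompt.lstrip().startswith("["):
        return prompt  # caller composed the tags manually, leave the prompt untouched
    return f"{SINGING_TAG} {prompt}"
```

Note the consequence of the rule as written: a prompt that starts with a different tag, such as `[happy]`, is passed through without `[singing]`, so manual tag composition puts the responsibility for including it on the caller.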

SING-02's full per-segment routing depth is **decided after a Wave 2 code-read of `dub_pipeline.py`**: if the existing pipeline supports per-segment routing in ≤50 lines, ship it; if it would require >50 lines of refactor, descope to "singing mode applies to entire dubbing job" for v0.3 and defer per-segment to v0.4.
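
To make the two scope options concrete, the sketch below routes each vocal-stem segment to an engine, with a user override taking precedence over the detector, and also shows the descoped job-wide variant. The `Segment` fields and the engine ids are hypothetical; nothing here reflects the current structure of `dub_pipeline.py`.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_ENGINE = "omnivoice"          # hypothetical engine id
SINGING_ENGINE = "omnivoice-singing"  # hypothetical engine id


@dataclass
class Segment:
    start_s: float
    end_s: float
    detected_sung: bool                   # output of the heuristic detector
    user_override: Optional[bool] = None  # None means "accept the detector"


def route_segment(segment: Segment, singing_mode: bool) -> str:
    """Per-segment routing: the user's override wins over the detector."""
    if not singing_mode:
        return DEFAULT_ENGINE
    if segment.user_override is not None:
        is_sung = segment.user_override
    else:
        is_sung = segment.detected_sung
    return SINGING_ENGINE if is_sung else DEFAULT_ENGINE


def route_job(segments: List[Segment], singing_mode: bool) -> List[str]:
    """Descoped variant: singing mode applies to every segment in the job."""
    engine = SINGING_ENGINE if singing_mode else DEFAULT_ENGINE
    return [engine for _ in segments]
```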

## Consequences

**Positive:**
- Sung segments of dubbed content produce sung output (they currently produce unsuitable speech-like output).
- Zero new Python dependencies; the same `omnivoice` library is already shipping.
- ≤30-line backend subclass; no new engine architecture.
- Hardware footprint identical to existing `OmniVoiceBackend`; runs anywhere the default engine already runs.

**Negative / risk:**
- Heuristic segmentation relies on only two features (pitch stability and energy), so it can misclassify operatic / sustained-vowel speech and vibrato-heavy speech.
- Cross-language singing quality is acknowledged by the model card as "extrapolation with variable quality."
- `omnivoice-singing` returns garbled output if the `[singing]` tag is missing, so automatic injection is load-bearing.
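
The misclassification risk is easier to see with a toy version of a pitch-stability + energy detector. The sketch below is an illustration under stated assumptions, not the planned detector: it takes per-frame pitch (Hz, 0 for unvoiced) and RMS energy that some upstream tracker has already produced, and the window size and all thresholds are invented for the example.

```python
import math
from statistics import pstdev
from typing import Sequence


def looks_sung(
    f0_hz: Sequence[float],
    rms: Sequence[float],
    window: int = 10,                # frames per stability window (assumed)
    max_semitone_std: float = 0.5,   # "stable" if pitch varies less than this (assumed)
    min_stable_ratio: float = 0.5,   # share of stable windows needed (assumed)
    min_mean_rms: float = 0.02,      # energy floor (assumed)
) -> bool:
    """Classify a segment as sung when it is loud enough and its pitch is held steady."""
    if not rms or sum(rms) / len(rms) < min_mean_rms:
        return False
    # Convert voiced frames to semitones relative to A4 so deviation is pitch-relative.
    semitones = [12.0 * math.log2(f / 440.0) for f in f0_hz if f > 0]
    windows = [
        semitones[i:i + window]
        for i in range(0, len(semitones) - window + 1, window)
    ]
    if not windows:
        return False
    stable = sum(1 for w in windows if pstdev(w) <= max_semitone_std)
    return stable / len(windows) >= min_stable_ratio
```

A spoken sustained vowel also holds a steady pitch at normal energy, so a detector of this shape would label it as sung, which is exactly the failure mode listed above and the reason the per-segment override matters.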

**Mitigations:**
- Per-segment override available in the dubbing UI before any segment is committed to a render (user owns the final route; SING-03 already requires this).
- SING-05 acceptance scoped to native-language singing pass; cross-language flagged as best-effort with model-card disclaimer surfaced in the engine card UI.
- `OmniVoiceSingingBackend.generate()` prepends `[singing]` unless the prompt already starts with `[`, allowing power users to compose `[singing] [happy]` etc. manually.
- Model-based singing-vs-speech classifier explicitly deferred to v2 per REQUIREMENTS.md Out of Scope.
- License + model-card link surfaced in the engine card UI; first-use acceptance gates download (SING-04).

## Sources

- `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md` (this milestone's research)
- https://huggingface.co/ModelsLab/omnivoice-singing (verified 2026-05-18)
- https://huggingface.co/k2-fsa/OmniVoice (upstream)
- https://pypi.org/project/omnivoice/ (0.1.5, 2026-04-28)
- `backend/services/tts_backend.py` (existing `OmniVoiceBackend` reference)
- `backend/services/dub_pipeline.py` (existing dubbing pipeline; Wave 2 code-read target)
