50 changes: 50 additions & 0 deletions .planning/decisions/SPIKE-01-gguf.md
@@ -0,0 +1,50 @@
# SPIKE-01: Adopt `Serveurperso/OmniVoice-GGUF` as hardware-adaptive default cloning engine

**Status:** Proposed (research-supported) — awaiting Phase 2 SubprocessBackend merge
**Date:** 2026-05-18
**Decision-makers:** [maintainer]
**Related:** ROADMAP Phase 4; REQUIREMENTS GGUF-01..06; `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md`

## Context

OmniVoice Studio v0.2.7 ships `k2-fsa/OmniVoice` (Apache-2.0, 0.6B Qwen3 backbone, Higgs Audio v2 codec at 24 kHz mono) as its default voice-cloning engine via `backend/services/tts_backend.py:OmniVoiceBackend`. The Python in-process path requires PyTorch on a CUDA, MPS, or CPU device, and on 4 GB-VRAM GPUs it falls back to CPU inference.

`Serveurperso/OmniVoice-GGUF` (HuggingFace, 10,603 downloads/month, verified 2026-05-18) publishes 4 quantizations of the same upstream model — Q4_K_M (~659 MB VRAM), Q8_0 (~945 MB, recommended balance), BF16 (~1.6 GB), F32 (~3.2 GB) — consumable through the MIT-licensed `omnivoice.cpp` runtime (`github.com/ServeurpersoCom/omnivoice.cpp`, 38 stars, 59 commits, 6 open issues). The quants use a custom `omnivoice-lm` architecture and do **not** load in vanilla llama.cpp.

This decision is whether to integrate the GGUF engine as a hardware-adaptive default with overridable fallback to the existing in-process `OmniVoiceBackend`.

## Decision

**GO** — integrate per GGUF-01..06.

The integration shape is `OmniVoiceGGUFBackend(TTSBackend)` wrapping Phase 2's `SubprocessBackend`, which spawns a bundled per-platform `omnivoice-tts` binary built from a pinned `omnivoice.cpp` commit SHA. Quant selection is driven by a `detect_capabilities()` extension of `backend/services/gpu_sandbox.py` mapping `(compute_class) → quant filename` via shippable `quant_map.json`. On hardware where probe + load succeed, GGUF becomes the default cloning engine; on any failure the existing in-process `OmniVoiceBackend` is the fallback.
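
The `(compute_class) → quant filename` lookup and the fallback rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the compute-class names, the quant file names, and the `select_quant` / `choose_engine` helpers are hypothetical and are not taken from `gpu_sandbox.py` or the shipped `quant_map.json`.

```python
import json
from pathlib import Path
from typing import Dict, Optional, Tuple

# Hypothetical quant_map.json payload. The keys (compute classes) and the
# file names are illustrative assumptions, not the shipped table.
EXAMPLE_QUANT_MAP: Dict[str, str] = json.loads("""
{
  "cuda_low_vram": "omnivoice-Q4_K_M.gguf",
  "cuda_high_vram": "omnivoice-Q8_0.gguf"
}
""")


def select_quant(
    compute_class: Optional[str],
    quant_map: Dict[str, str],
    models_dir: Path,
) -> Optional[Path]:
    """Return the quant path for a compute class, or None if nothing maps."""
    if compute_class is None:
        return None
    filename = quant_map.get(compute_class)
    if filename is None:
        return None
    return models_dir / filename


def choose_engine(
    compute_class: Optional[str],
    quant_map: Dict[str, str],
    models_dir: Path,
) -> Tuple[str, Optional[Path]]:
    """Prefer GGUF when a quant is mapped; otherwise use the in-process path."""
    quant = select_quant(compute_class, quant_map, models_dir)
    if quant is None:
        return ("in_process", None)
    return ("gguf", quant)
```

In this sketch a failed probe is modelled as `compute_class=None`, which routes to the in-process engine. Download, load, and generate failures would need the same fallback at their own call sites.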

## Consequences

**Positive:**
- 4 GB-VRAM GPUs (currently falling back to CPU on the in-process path) get GPU-backed cloning via Q4_K_M.
- Smaller VRAM footprint leaves headroom for other engines when users run several in one session.
- License chain unchanged (Apache-2.0 model + MIT runtime).
- Same underlying model as what already ships — worst case it ties the in-process path on a given hardware class and we keep that path as the fallback.

**Negative / risk:**
- Adds a maintained-by-others C++ runtime to the dependency graph (`omnivoice.cpp`, 38 stars at decision time).
- Adds ~12-16 MB of platform binaries to the installer (must verify against Phase 3 mirror-timing baseline per Pitfall 6).
- macOS code signing scope expands by 4 binaries (track via REL-05; same `xattr -cr` workaround as #54 applies in v0.3.x).
- `omnivoice.cpp` README does not publish a macOS Metal build script — only `buildcpu.sh`, `buildcuda.sh`, `buildvulkan.sh`, `buildall.sh`. Apple Silicon Metal must be verified in Wave 1.

**Mitigations:**
- Pin `omnivoice.cpp` by commit SHA; rebuild from pinned SHA in CI for all 4 target platforms.
- Pin every quant file by commit SHA in `quant_map.json` (shippable JSON so the table can update without an app release).
- In-process `OmniVoiceBackend` remains as fallback if any GGUF step fails (probe, download, load, generate).
- macOS Apple Silicon Metal build is verified in Wave 1 with explicit acceptance criteria; if blocked, downgrade SPIKE-01 default on macOS to in-process path and document in this ADR's "Status" line.
- SHA-256 checksums on bundled binaries (per GATE-05); verify at first launch and on every quant load.
- Subprocess arg composition uses typed `Path` objects rooted in app directories; quant override UI is a dropdown over `quant_map.json` entries only (no freeform path input — supply-chain control analogous to INST-09).
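
The checksum and path-rooting mitigations above could look roughly like the sketch below. It assumes the expected digest comes from a trusted manifest and that every artifact must live under a known application directory. The `verify_artifact` helper and its parameters are hypothetical names, not part of the existing codebase.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large quants are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str, allowed_root: Path) -> bool:
    """Accept a file only if it sits under allowed_root and matches the digest."""
    resolved = path.resolve()
    root = allowed_root.resolve()
    # Reject anything that resolves outside the app directory (e.g. via "..").
    if root not in resolved.parents:
        return False
    if not resolved.is_file():
        return False
    return sha256_of(resolved) == expected_sha256.lower()
```

A caller would run this check before handing the path to the subprocess and treat a `False` result as a GGUF failure that triggers the in-process fallback.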

## Sources

- `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md` (this milestone's research)
- https://huggingface.co/Serveurperso/OmniVoice-GGUF (verified 2026-05-18)
- https://github.com/ServeurpersoCom/omnivoice.cpp (verified 2026-05-18)
- https://huggingface.co/k2-fsa/OmniVoice (upstream, Apache-2.0)
- `backend/services/tts_backend.py` (existing `OmniVoiceBackend` reference)
51 changes: 51 additions & 0 deletions .planning/decisions/SPIKE-02-singing.md
@@ -0,0 +1,51 @@
# SPIKE-02: Adopt `ModelsLab/omnivoice-singing` as singing variant of the existing engine

**Status:** Proposed (research-supported) — awaiting Phase 2 SubprocessBackend merge
**Date:** 2026-05-18
**Decision-makers:** [maintainer]
**Related:** ROADMAP Phase 4; REQUIREMENTS SING-01..05; `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md`

## Context

`ModelsLab/omnivoice-singing` (HuggingFace, 1,053 downloads/month, verified 2026-05-18) is a finetune of `k2-fsa/OmniVoice` — same Apache-2.0 license, same Qwen3-0.6B backbone, same Higgs Audio v2 codec at 24 kHz mono, same `omnivoice` PyPI library (0.1.5, 2026-04-28) already shipping in OmniVoice Studio v0.2.7. Trained on additional singing + emotion-tagged data and activated by a `[singing]` text control tag at generation time.

OmniVoice's existing `dub_pipeline.py` runs Demucs to split source audio into vocal and instrumental stems and routes the vocal stem through the default TTS engine. Today this produces speech-like output even on sung source material, which is one of the loudest user complaints when dubbing music-adjacent content.

This decision is whether to integrate the singing finetune as a routed alternative for sung segments, with auto-detection + per-segment override.

## Decision

**GO with reduced scope** — integrate per SING-01..05.

The integration shape is `OmniVoiceSingingBackend(OmniVoiceBackend)` — a ≤30-line subclass overriding `id`, `display_name`, the `from_pretrained` model ID, and auto-injecting the `[singing]` control tag in `generate()` unless the prompt already starts with a `[`-prefixed tag. The dubbing pipeline gains a "singing mode" toggle and a segment-routing path (vocal stem → singing engine for sung segments, vocal stem → default engine for spoken segments, instrumental stem preserved untouched). Segment detection uses a pitch-stability + energy heuristic on the Demucs vocal stem with per-segment user override in the dubbing UI.
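
As a rough illustration of the routing rule above, the sketch below classifies one segment from two precomputed features and lets a user override win. The `Segment` fields, the feature definitions, and both thresholds are assumptions made for this example. The ADR does not specify how pitch stability or energy is measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    start_s: float
    end_s: float
    pitch_std_semitones: float  # assumed feature: lower means steadier pitch
    rms_energy: float           # assumed feature: mean RMS of the vocal stem
    user_route: Optional[str] = None  # "singing", "speech", or None


def route_segment(
    seg: Segment,
    max_pitch_std: float = 0.5,  # hypothetical threshold
    min_energy: float = 0.05,    # hypothetical threshold
) -> str:
    """Return "singing" or "speech" for one segment of the vocal stem."""
    # The per-segment user override always takes precedence over the heuristic.
    if seg.user_route is not None:
        return seg.user_route
    is_sung = (
        seg.pitch_std_semitones <= max_pitch_std
        and seg.rms_energy >= min_energy
    )
    return "singing" if is_sung else "speech"
```

Segments routed to `"singing"` would go to the singing engine and the rest to the default engine. The instrumental stem never passes through this function.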

SING-02's full per-segment routing depth is **decided after a Wave 2 code-read of `dub_pipeline.py`**: if the existing pipeline supports per-segment routing in ≤50 lines, ship it; if it would require >500 lines of refactor, descope to "singing mode applies to entire dubbing job" for v0.3 and defer per-segment to v0.4.

## Consequences

**Positive:**
- Sung segments of dubbed content produce sung output (today they produce unsuitable speech-like output).
- Zero new Python dependencies — same `omnivoice` library already shipping.
- ≤30-line backend subclass; no new engine architecture.
- Hardware footprint identical to existing `OmniVoiceBackend`; runs anywhere the default engine already runs.

**Negative / risk:**
- Heuristic segmentation (pitch-stability + energy) is coarse and can misclassify operatic or sustained-vowel speech and vibrato-heavy speech.
- Cross-language singing quality is acknowledged by the model card as "extrapolation with variable quality."
- `omnivoice-singing` returns garbled output if the `[singing]` tag is missing — automatic injection is load-bearing.

**Mitigations:**
- Per-segment override available in the dubbing UI before any segment is committed to a render (user owns the final route — SING-03 already requires this).
- SING-05 acceptance scoped to native-language singing pass; cross-language flagged as best-effort with model-card disclaimer surfaced in the engine card UI.
- `OmniVoiceSingingBackend.generate()` always prepends `[singing]` unless the prompt already starts with `[`, allowing power users to compose `[singing] [happy]` etc. manually.
Comment on lines +35 to +40

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always enforce [singing] presence, not just “starts with [”.

Line 35 states missing [singing] yields garbled output, but Line 40’s rule allows prompts like [happy] ... to bypass injection and break synthesis. Gate on “contains [singing]” (or prepend it) rather than “starts with [”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.planning/decisions/SPIKE-02-singing.md around lines 35 - 40, The current
rule in OmniVoiceSingingBackend.generate() only skips prepending when the
prompt.startsWith('['), which lets prompts like '[happy] ...' bypass the
required singing tag and cause garbled output; change the logic to check for the
presence of the singing tag (e.g., prompt.includes('[singing]') with
case-insensitive match) and only skip prepending if that tag exists, otherwise
always prepend '[singing]' to the prompt before synthesis so power-user
bracketed modifiers still work.

- Model-based singing-vs-speech classifier explicitly deferred to v2 per REQUIREMENTS.md Out of Scope.
- License + model-card link surfaced in the engine card UI; first-use acceptance gates download (SING-04).
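
The tag rule raised in the review comment on this list (gate on the presence of `[singing]`, not on a leading `[`) can be sketched as below. The `ensure_singing_tag` helper is a hypothetical name, and placing the tag first is an assumption: the source does not say whether tag order matters to the model.

```python
import re

SINGING_TAG = "[singing]"
# Case-insensitive match, as the review comment suggests.
_SINGING_RE = re.compile(r"\[singing\]", re.IGNORECASE)


def ensure_singing_tag(prompt: str) -> str:
    """Prepend [singing] unless the prompt already contains it anywhere."""
    if _SINGING_RE.search(prompt):
        return prompt
    return f"{SINGING_TAG} {prompt.lstrip()}"
```

With this rule a prompt such as `[happy] la la la` becomes `[singing] [happy] la la la`, while a prompt that already carries `[singing]` is passed through unchanged.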

## Sources

- `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md` (this milestone's research)
- https://huggingface.co/ModelsLab/omnivoice-singing (verified 2026-05-18)
- https://huggingface.co/k2-fsa/OmniVoice (upstream)
- https://pypi.org/project/omnivoice/ (0.1.5, 2026-04-28)
- `backend/services/tts_backend.py` (existing `OmniVoiceBackend` reference)
- `backend/services/dub_pipeline.py` (existing dubbing pipeline — Wave 2 code-read target)