docs(v0.3.0): research + 18 plans for fat-milestone planning #87
Merged
4 commits
20e8d1f docs(phase-5): research opt-in bug reporting (debpalash)
b839c80 docs(phases): research for Phases 2, 3, 4, 6 (Engine + Supertonic + S… (debpalash)
ba63733 docs(stack): bump supertonic pin 1.2.3 → 1.3.1 (Phase 3 research find… (debpalash)
ff336e6 docs(phases): plan Phases 2-6 for v0.3.0 fat-milestone release (debpalash)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| # SPIKE-01: Adopt `Serveurperso/OmniVoice-GGUF` as hardware-adaptive default cloning engine | ||
|
|
||
| **Status:** Proposed (research-supported) — awaiting Phase 2 SubprocessBackend merge | ||
| **Date:** 2026-05-18 | ||
| **Decision-makers:** [maintainer] | ||
| **Related:** ROADMAP Phase 4; REQUIREMENTS GGUF-01..06; `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md` | ||
|
|
||
| ## Context | ||
|
|
||
| OmniVoice Studio v0.2.7 ships `k2-fsa/OmniVoice` (Apache-2.0, 0.6B Qwen3 backbone, Higgs Audio v2 codec at 24 kHz mono) as its default voice-cloning engine via `backend/services/tts_backend.py:OmniVoiceBackend`. The Python in-process path requires PyTorch + CUDA / MPS / CPU and on 4 GB-VRAM GPUs falls back to CPU inference. | ||
|
|
||
| `Serveurperso/OmniVoice-GGUF` (HuggingFace, 10,603 downloads/month, verified 2026-05-18) publishes 4 quantizations of the same upstream model — Q4_K_M (~659 MB VRAM), Q8_0 (~945 MB, recommended balance), BF16 (~1.6 GB), F32 (~3.2 GB) — consumable through the MIT-licensed `omnivoice.cpp` runtime (`github.com/ServeurpersoCom/omnivoice.cpp`, 38 stars, 59 commits, 6 open issues). The quants use a custom `omnivoice-lm` architecture and do **not** load in vanilla llama.cpp. | ||
|
|
||
| This decision is whether to integrate the GGUF engine as a hardware-adaptive default with overridable fallback to the existing in-process `OmniVoiceBackend`. | ||
|
|
||
| ## Decision | ||
|
|
||
| **GO** — integrate per GGUF-01..06. | ||
|
|
||
| The integration shape is `OmniVoiceGGUFBackend(TTSBackend)` wrapping Phase 2's `SubprocessBackend`, which spawns a bundled per-platform `omnivoice-tts` binary built from a pinned `omnivoice.cpp` commit SHA. Quant selection is driven by a `detect_capabilities()` extension of `backend/services/gpu_sandbox.py` mapping `(compute_class) → quant filename` via shippable `quant_map.json`. On hardware where probe + load succeed, GGUF becomes the default cloning engine; on any failure the existing in-process `OmniVoiceBackend` is the fallback. | ||
|
|
||
| ## Consequences | ||
|
|
||
| **Positive:** | ||
| - 4 GB-VRAM GPUs (currently falling back to CPU on the in-process path) get GPU-backed cloning via Q4_K_M. | ||
| - A smaller VRAM footprint leaves more headroom for other engines when users run several in one session. | ||
| - License chain unchanged (Apache-2.0 model + MIT runtime). | ||
| - Same underlying model as what already ships — worst case it ties the in-process path on a given hardware class and we keep that path as the fallback. | ||
|
|
||
| **Negative / risk:** | ||
| - Adds a maintained-by-others C++ runtime to the dependency graph (`omnivoice.cpp`, 38 stars at decision time). | ||
| - Adds ~12-16 MB of platform binaries to the installer (must verify against Phase 3 mirror-timing baseline per Pitfall 6). | ||
| - macOS code signing scope expands by 4 binaries (track via REL-05; same `xattr -cr` workaround as #54 applies in v0.3.x). | ||
| - `omnivoice.cpp` README does not publish a macOS Metal build script — only `buildcpu.sh`, `buildcuda.sh`, `buildvulkan.sh`, `buildall.sh`. Apple Silicon Metal must be verified in Wave 1. | ||
|
|
||
| **Mitigations:** | ||
| - Pin `omnivoice.cpp` by commit SHA; rebuild from pinned SHA in CI for all 4 target platforms. | ||
| - Pin every quant file by commit SHA in `quant_map.json` (shippable JSON so the table can update without an app release). | ||
| - In-process `OmniVoiceBackend` remains as fallback if any GGUF step fails (probe, download, load, generate). | ||
| - macOS Apple Silicon Metal build is verified in Wave 1 with explicit acceptance criteria; if blocked, downgrade SPIKE-01 default on macOS to in-process path and document in this ADR's "Status" line. | ||
| - SHA-256 checksums on bundled binaries (per GATE-05); verify at first launch and on every quant load. | ||
| - Subprocess arg composition uses typed `Path` objects rooted in app directories; quant override UI is a dropdown over `quant_map.json` entries only (no freeform path input — supply-chain control analogous to INST-09). | ||
|
|
||
| ## Sources | ||
|
|
||
| - `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md` (this milestone's research) | ||
| - https://huggingface.co/Serveurperso/OmniVoice-GGUF (verified 2026-05-18) | ||
| - https://github.com/ServeurpersoCom/omnivoice.cpp (verified 2026-05-18) | ||
| - https://huggingface.co/k2-fsa/OmniVoice (upstream, Apache-2.0) | ||
| - `backend/services/tts_backend.py` (existing `OmniVoiceBackend` reference) |
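
The quant-selection and fallback rule described in the SPIKE-01 Decision section can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the ADR does not give the `quant_map.json` schema, so the compute-class names, file names, `revision` placeholder, engine ids, and the helpers `select_quant` and `choose_engine` are all hypothetical.

```python
import json
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical quant_map.json payload. Class names, file names and the
# "revision" placeholder are illustrative; the ADR does not define the schema.
QUANT_MAP_JSON = """
{
  "low_vram":  {"file": "omnivoice-Q4_K_M.gguf", "revision": "PINNED_SHA"},
  "mid_vram":  {"file": "omnivoice-Q8_0.gguf",   "revision": "PINNED_SHA"},
  "high_vram": {"file": "omnivoice-BF16.gguf",   "revision": "PINNED_SHA"}
}
"""

IN_PROCESS_ENGINE = "omnivoice"   # hypothetical id for the existing OmniVoiceBackend
GGUF_ENGINE = "omnivoice-gguf"    # hypothetical id for OmniVoiceGGUFBackend


def select_quant(compute_class: Optional[str], quant_map: dict,
                 models_dir: Path) -> Optional[Path]:
    """Return the quant path for a compute class, or None if nothing matches."""
    entry = quant_map.get(compute_class) if compute_class else None
    if entry is None:
        return None
    # Keep only the bare file name so a map entry cannot point outside models_dir.
    return models_dir / Path(entry["file"]).name


def choose_engine(compute_class: Optional[str], quant_map: dict,
                  models_dir: Path) -> Tuple[str, Optional[Path]]:
    """Prefer GGUF when a quant is mapped and present; otherwise fall back."""
    quant = select_quant(compute_class, quant_map, models_dir)
    if quant is None or not quant.is_file():
        return IN_PROCESS_ENGINE, None
    return GGUF_ENGINE, quant


if __name__ == "__main__":
    quant_map = json.loads(QUANT_MAP_JSON)
    print(choose_engine("low_vram", quant_map, Path("models")))
```

In this sketch a missing map entry and a missing file are treated the same way. The ADR also lists load and generate failures as fallback triggers, which a real backend would need to handle at call time rather than at selection time.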
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| # SPIKE-02: Adopt `ModelsLab/omnivoice-singing` as singing variant of the existing engine | ||
|
|
||
| **Status:** Proposed (research-supported) — awaiting Phase 2 SubprocessBackend merge | ||
| **Date:** 2026-05-18 | ||
| **Decision-makers:** [maintainer] | ||
| **Related:** ROADMAP Phase 4; REQUIREMENTS SING-01..05; `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md` | ||
|
|
||
| ## Context | ||
|
|
||
| `ModelsLab/omnivoice-singing` (HuggingFace, 1,053 downloads/month, verified 2026-05-18) is a finetune of `k2-fsa/OmniVoice` — same Apache-2.0 license, same Qwen3-0.6B backbone, same Higgs Audio v2 codec at 24 kHz mono, same `omnivoice` PyPI library (0.1.5, 2026-04-28) already shipping in OmniVoice Studio v0.2.7. Trained on additional singing + emotion-tagged data and activated by a `[singing]` text control tag at generation time. | ||
|
|
||
| OmniVoice's existing `dub_pipeline.py` runs Demucs to split source audio into vocal and instrumental stems and routes the vocal stem through the default TTS engine. Today this produces speech-like output even on sung source material, which is one of the loudest user complaints when dubbing music-adjacent content. | ||
|
|
||
| This decision is whether to integrate the singing finetune as a routed alternative for sung segments, with auto-detection + per-segment override. | ||
|
|
||
| ## Decision | ||
|
|
||
| **GO with reduced scope** — integrate per SING-01..05. | ||
|
|
||
| The integration shape is `OmniVoiceSingingBackend(OmniVoiceBackend)` — a ≤30-line subclass overriding `id`, `display_name`, the `from_pretrained` model ID, and auto-injecting the `[singing]` control tag in `generate()` unless the prompt already starts with a `[`-prefixed tag. The dubbing pipeline gains a "singing mode" toggle and a segment-routing path (vocal stem → singing engine for sung segments, vocal stem → default engine for spoken segments, instrumental stem preserved untouched). Segment detection uses a pitch-stability + energy heuristic on the Demucs vocal stem with per-segment user override in the dubbing UI. | ||
|
|
||
| SING-02's full per-segment routing depth is **decided after a Wave 2 code-read of `dub_pipeline.py`**: if the existing pipeline supports per-segment routing in ≤50 lines, ship it; if it would require >500 lines of refactor, descope to "singing mode applies to entire dubbing job" for v0.3 and defer per-segment to v0.4. | ||
|
|
||
| ## Consequences | ||
|
|
||
| **Positive:** | ||
| - Sung segments of dubbed content produce sung output (they currently produce unsuitable speech-like output). | ||
| - Zero new Python dependencies — same `omnivoice` library already shipping. | ||
| - ≤30-line backend subclass; no new engine architecture. | ||
| - Hardware footprint identical to existing `OmniVoiceBackend`; runs anywhere the default engine already runs. | ||
|
|
||
| **Negative / risk:** | ||
| - Heuristic segmentation (pitch-stability + energy) is one-dimensional and misclassifies operatic / sustained-vowel speech and vibrato-heavy speech. | ||
| - Cross-language singing quality is acknowledged by the model card as "extrapolation with variable quality." | ||
| - `omnivoice-singing` returns garbled output if the `[singing]` tag is missing — automatic injection is load-bearing. | ||
|
|
||
| **Mitigations:** | ||
| - Per-segment override available in the dubbing UI before any segment is committed to a render (user owns the final route — SING-03 already requires this). | ||
| - SING-05 acceptance scoped to native-language singing pass; cross-language flagged as best-effort with model-card disclaimer surfaced in the engine card UI. | ||
| - `OmniVoiceSingingBackend.generate()` always prepends `[singing]` unless the prompt already starts with `[`, allowing power users to compose `[singing] [happy]` etc. manually. | ||
| - Model-based singing-vs-speech classifier explicitly deferred to v2 per REQUIREMENTS.md Out of Scope. | ||
| - License + model-card link surfaced in the engine card UI; first-use acceptance gates download (SING-04). | ||
|
|
||
| ## Sources | ||
|
|
||
| - `.planning/phases/04-adaptive-specialty-engines-spike-first/04-RESEARCH.md` (this milestone's research) | ||
| - https://huggingface.co/ModelsLab/omnivoice-singing (verified 2026-05-18) | ||
| - https://huggingface.co/k2-fsa/OmniVoice (upstream) | ||
| - https://pypi.org/project/omnivoice/ (0.1.5, 2026-04-28) | ||
| - `backend/services/tts_backend.py` (existing `OmniVoiceBackend` reference) | ||
| - `backend/services/dub_pipeline.py` (existing dubbing pipeline — Wave 2 code-read target) | ||
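
The segment-routing rule from the SPIKE-02 Decision and Mitigations sections (heuristic detection, with the user's per-segment override taking precedence) might look like the sketch below. The `Segment` type, the engine ids, and the `route` helper are hypothetical names for illustration. Sending every segment to the default engine when singing mode is off is an assumption, since the ADR does not spell out that case.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_ENGINE = "omnivoice"           # hypothetical id for the default engine
SINGING_ENGINE = "omnivoice-singing"   # hypothetical id for the singing engine


@dataclass
class Segment:
    start: float                          # seconds into the vocal stem
    end: float
    detected_sung: bool                   # result of the pitch/energy heuristic
    user_override: Optional[bool] = None  # None means the user did not override


def route(segment: Segment, singing_mode: bool) -> str:
    """Pick the engine for one vocal-stem segment."""
    if not singing_mode:
        # Assumption: with the toggle off, nothing is routed to the singing engine.
        return DEFAULT_ENGINE
    # The user's explicit choice wins over the heuristic.
    if segment.user_override is not None:
        sung = segment.user_override
    else:
        sung = segment.detected_sung
    return SINGING_ENGINE if sung else DEFAULT_ENGINE


if __name__ == "__main__":
    segments = [
        Segment(0.0, 4.2, detected_sung=True),
        Segment(4.2, 9.0, detected_sung=True, user_override=False),
    ]
    for seg in segments:
        print(seg.start, seg.end, route(seg, singing_mode=True))
```

Keeping the override as a tri-state (`None`, `True`, `False`) lets the UI distinguish "user has not touched this segment" from "user confirmed the heuristic".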
Always enforce `[singing]` presence, not just “starts with `[`”.

Line 35 states missing `[singing]` yields garbled output, but Line 40’s rule allows prompts like `[happy] ...` to bypass injection and break synthesis. Gate on “contains `[singing]`” (or prepend it) rather than “starts with `[`”.
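
The difference between the two gating rules in this comment can be illustrated with a short sketch. Both function names are hypothetical and neither is taken from the PR; `adr_rule` mirrors the "starts with `[`" behaviour described in the ADR, and `ensure_singing_tag` is one possible reading of the suggested "contains `[singing]`" gate.

```python
SINGING_TAG = "[singing]"


def adr_rule(prompt: str) -> str:
    """Rule as written in the ADR: skip injection if the prompt starts with '['."""
    if prompt.lstrip().startswith("["):
        return prompt
    return f"{SINGING_TAG} {prompt.lstrip()}"


def ensure_singing_tag(prompt: str) -> str:
    """Suggested rule: prepend the tag unless it is already present somewhere."""
    if SINGING_TAG in prompt:
        return prompt
    return f"{SINGING_TAG} {prompt.lstrip()}"


if __name__ == "__main__":
    prompt = "[happy] la la la"
    print(adr_rule(prompt))            # tag is not injected under the ADR rule
    print(ensure_singing_tag(prompt))  # tag is prepended under the suggested rule
```

A prompt that already composes `[singing] [happy]` manually passes through both versions unchanged, so the power-user case from the ADR is preserved.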