fix(tts): /generate honors the selected TTS engine (#312)#324
Conversation
The /generate route always ran the OmniVoice model directly, ignoring both the Settings engine selection and any per-request override. It now resolves the active backend (env var > Settings selection > default), supports an explicit `engine` form field (same pattern as /ws/tts and /v1/audio/speech), reuses the per-process engine instance cache, keeps inline [pause Nms] markers working on every engine, and honors applies_own_mastering so studio engines skip the broadcast mastering chain. The OmniVoice default path is byte-identical to the old behavior — existing API consumers see no change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughExtended ChangesEngine-aware /generate endpoint with refactored audio processing
Sequence DiagramsequenceDiagram
participant Client
participant GenerateEndpoint
participant Settings
participant BackendRegistry
participant OmniVoiceModel
participant BackendAdapter
participant AudioDSP
Client->>GenerateEndpoint: POST /generate (optional engine)
GenerateEndpoint->>Settings: Resolve engine ID (or use default)
GenerateEndpoint->>BackendRegistry: Validate engine exists
alt Unknown engine
GenerateEndpoint-->>Client: HTTP 400 (unknown)
else
GenerateEndpoint->>BackendRegistry: Check engine availability
alt Unavailable engine
GenerateEndpoint-->>Client: HTTP 400 (unavailable)
else
alt Engine is OmniVoice (or default)
GenerateEndpoint->>OmniVoiceModel: Load model
OmniVoiceModel->>OmniVoiceModel: Generate audio
else
GenerateEndpoint->>BackendAdapter: Load cached instance
BackendAdapter->>BackendAdapter: Generate audio
end
GenerateEndpoint->>AudioDSP: Apply effect chain (skip mastering if backend applies it)
GenerateEndpoint->>GenerateEndpoint: Watermark + serialize with resolved sample_rate
GenerateEndpoint-->>Client: HTTP 200 (audio WAV)
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| def _run_backend_inference( | ||
| backend, text, language, ref_audio_path, ref_text, instruct, duration, | ||
| num_step, guidance_scale, speed, denoise, postprocess_output, | ||
| used_seed, effect_preset="broadcast", | ||
| ): |
|
| Filename | Overview |
|---|---|
| backend/api/routers/generation.py | Adds engine-aware routing to /generate: resolves engine via form field > env var > Settings > OmniVoice default, extracts _apply_effect_chain and _oom_friendly_reraise helpers, and introduces _run_backend_inference for non-OmniVoice paths. Two open issues (sample rate read before lazy load, bare RuntimeError raise) were flagged in prior review threads. |
| tests/test_generate_engine.py | New test module covering 6 cases: Settings-selected engine runs, per-request override, unknown/unavailable 400s, OmniVoice default path unchanged, applies_own_mastering skips mastering chain. Uses fresh stub classes per call and run-time module resolution to avoid fixture pollution; instance-cache isolation depends on _get_engine_instance keying by class identity rather than engine id string. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[POST /generate] --> B{engine form field?}
B -->|yes| C[engine_id = engine]
B -->|no| D[engine_id = active_backend_id\nenv > prefs > default]
C --> E[get_backend_class engine_id]
D --> E
E -->|ValueError| F[400 Unknown engine]
E --> G{backend_cls is\nOmniVoiceBackend?}
G -->|yes| H[get_model\nnative path]
G -->|no| I{is_available?}
I -->|no| J[400 Engine unavailable\nwith masked reason]
I -->|yes| K[_get_engine_instance\nper-process cache]
H --> L[run_in_executor\n_run_inference\nfull advanced params]
K --> M[run_in_executor\n_run_backend_inference\nadapter protocol]
L --> N[_apply_effect_chain\nskip_mastering=False]
M --> O{applies_own_mastering?}
O -->|yes| P[_apply_effect_chain\nskip_mastering=True]
O -->|no| N
N --> Q[embed_watermark\nsample_rate after generation]
P --> Q
Q --> R[save WAV + DB insert\nStreamingResponse]
Reviews (2): Last reviewed commit: "test(312): resolve modules at run time, ..." | Re-trigger Greptile
| language=language, ref_audio=ref_audio_path, ref_text=ref_text, | ||
| instruct=instruct, num_step=num_step, guidance_scale=guidance_scale, | ||
| speed=speed, denoise=denoise, postprocess_output=postprocess_output, | ||
| ) | ||
| sr = backend.sample_rate | ||
|
|
||
| # Inline [pause Nms] markers (issue #276) work for every engine — the | ||
| # silence stitching is model-free. | ||
| from omnivoice.utils.text import parse_pause_markers | ||
| segments = parse_pause_markers(text) | ||
| has_pause = len(segments) > 1 or (segments and segments[0][1] > 0) | ||
|
|
||
| if has_pause: | ||
| def _gen_span(span_text): | ||
| # Per-span duration is left to the engine; an explicit overall | ||
| # `duration` can't be meaningfully split across spans. | ||
| return backend.generate(span_text, duration=None, **gen_kwargs) | ||
| audio_out = _render_with_pauses(_gen_span, segments, sr) | ||
| else: | ||
| audio_out = backend.generate(text, duration=duration, **gen_kwargs) | ||
|
|
||
| return _apply_effect_chain( | ||
| audio_out, sr, effect_preset, | ||
| skip_mastering=getattr(backend, "applies_own_mastering", False), | ||
| ) |
There was a problem hiding this comment.
Sample rate captured before lazy model load
sr = backend.sample_rate is read before any call to backend.generate(). For backends with lazy-loading sample rates — SherpaOnnxBackend returns a hardcoded placeholder 22050 before its ONNX model is loaded, and CosyVoiceBackend/MLXAudioBackend similarly use placeholder values — both the silence-stitching calculation in _render_with_pauses (pause-marker path) and the DSP effects in _apply_effect_chain receive the wrong sample rate on the first request before weights are loaded. A 500 ms pause with a placeholder rate of 22050 but a true model rate of 24000 would produce ~459 ms of silence; more critically, the broadcast mastering chain (compressor, reverb) would be calibrated against the wrong rate.
The generate_speech caller is already aware of this: its comment "Read after generation: engines with lazy model loading report their real rate only once weights are up" is exactly why sample_rate = _backend.sample_rate is read after the executor returns. The same deferred read needs to happen inside _run_backend_inference — at minimum, move sr = backend.sample_rate to after the audio_out = backend.generate(...) call on the non-pause path so _apply_effect_chain always gets the true rate.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
backend/api/routers/generation.py (2)
262-271: ⚡ Quick winPreserve exception chain for better debugging.
The
HTTPExceptionshould be raised withfrom errto preserve the exception chain, aiding debugging when an unknown engine is provided.Proposed fix
- except ValueError: + except ValueError as err: raise HTTPException( status_code=400, detail=( f"Unknown TTS engine: {engine_id!r}. " "See GET /engines/tts for the list of valid engine ids." ), - ) + ) from err🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/api/routers/generation.py` around lines 262 - 271, The code catches ValueError from get_backend_class(engine_id) but re-raises HTTPException without preserving the original exception chain; update the except block to capture the ValueError (e.g. except ValueError as err:) and re-raise the HTTPException using "raise HTTPException(... ) from err" so the original traceback is preserved; keep the same status_code and detail message and reference get_backend_class, engine_id, HTTPException, and ValueError when making the change.
99-112: ⚡ Quick winAdd
NoReturntype hint to silence static analysis warning.This function always raises but lacks a type annotation. Static analysis tools (CodeQL) flag implicit returns in callers (
_run_inference,_run_backend_inference) because they can't infer that this function never returns.Proposed fix
+from typing import NoReturn + -def _oom_friendly_reraise(e): +def _oom_friendly_reraise(e: Exception) -> NoReturn: """Best-effort cache flush + the user-facing OOM hint shared by both inference paths."""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/api/routers/generation.py` around lines 99 - 112, The _oom_friendly_reraise helper always raises but lacks a return-type annotation, causing static analysis to complain in callers like _run_inference and _run_backend_inference; fix it by importing NoReturn from typing (or typing_extensions if you prefer) and annotate the function signature as def _oom_friendly_reraise(e) -> NoReturn: so analyzers know it never returns, leaving the body unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/api/routers/generation.py`:
- Around line 200-220: The pause-marker stitching uses sr = backend.sample_rate
before the backend may lazily load and change its sample_rate, causing incorrect
silence durations; fix by ensuring the backend is fully loaded before capturing
sr: call the backend's load routine (e.g., backend._ensure_loaded() or the
appropriate ensure/load method) at the start of _run_backend_inference (before
sr = backend.sample_rate and before any generate calls) so _render_with_pauses
and subsequent saving/watermarking use the same, final sample_rate;
alternatively, if no ensure method exists, defer computing sr until after the
first span generation (use the sample_rate reported after backend.generate in
_gen_span) and pass that same sr into _render_with_pauses and into
_apply_effect_chain.
---
Nitpick comments:
In `@backend/api/routers/generation.py`:
- Around line 262-271: The code catches ValueError from
get_backend_class(engine_id) but re-raises HTTPException without preserving the
original exception chain; update the except block to capture the ValueError
(e.g. except ValueError as err:) and re-raise the HTTPException using "raise
HTTPException(... ) from err" so the original traceback is preserved; keep the
same status_code and detail message and reference get_backend_class, engine_id,
HTTPException, and ValueError when making the change.
- Around line 99-112: The _oom_friendly_reraise helper always raises but lacks a
return-type annotation, causing static analysis to complain in callers like
_run_inference and _run_backend_inference; fix it by importing NoReturn from
typing (or typing_extensions if you prefer) and annotate the function signature
as def _oom_friendly_reraise(e) -> NoReturn: so analyzers know it never returns,
leaving the body unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 607e2bee-3317-4616-b520-dcdc10d8ff69
📒 Files selected for processing (2)
backend/api/routers/generation.pytests/test_generate_engine.py
| sr = backend.sample_rate | ||
|
|
||
| # Inline [pause Nms] markers (issue #276) work for every engine — the | ||
| # silence stitching is model-free. | ||
| from omnivoice.utils.text import parse_pause_markers | ||
| segments = parse_pause_markers(text) | ||
| has_pause = len(segments) > 1 or (segments and segments[0][1] > 0) | ||
|
|
||
| if has_pause: | ||
| def _gen_span(span_text): | ||
| # Per-span duration is left to the engine; an explicit overall | ||
| # `duration` can't be meaningfully split across spans. | ||
| return backend.generate(span_text, duration=None, **gen_kwargs) | ||
| audio_out = _render_with_pauses(_gen_span, segments, sr) | ||
| else: | ||
| audio_out = backend.generate(text, duration=duration, **gen_kwargs) | ||
|
|
||
| return _apply_effect_chain( | ||
| audio_out, sr, effect_preset, | ||
| skip_mastering=getattr(backend, "applies_own_mastering", False), | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if any TTSBackend implementations have lazy sample_rate that could change post-loading
echo "=== Searching for sample_rate property implementations in backends ==="
rg -n -A 10 '`@property`' --type=py | rg -A 10 'sample_rate'
echo ""
echo "=== Checking TTSBackend base class definition ==="
ast-grep --pattern $'class TTSBackend:
$$$
sample_rate
$$$
'Repository: debpalash/OmniVoice-Studio
Length of output: 16563
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Locate _run_backend_inference in generation router ==="
rg -n "_run_backend_inference" backend/api/routers/generation.py
echo
echo "=== Show generation.py around _run_backend_inference ==="
python3 - <<'PY'
import itertools,sys,os,re
path="backend/api/routers/generation.py"
with open(path,'r',encoding='utf-8') as f:
lines=f.readlines()
for i,l in enumerate(lines, start=1):
if "_run_backend_inference" in l:
start=max(1,i-80)
end=min(len(lines), i+140)
for j in range(start,end+1):
print(f"{j:5d}:{lines[j-1].rstrip()}")
break
else:
print("Not found")
PY
echo
echo "=== Show _render_with_pauses definition and where sr used ==="
rg -n "_render_with_pauses" backend/api/routers/generation.py
python3 - <<'PY'
path="backend/api/routers/generation.py"
with open(path,'r',encoding='utf-8') as f:
lines=f.readlines()
idx=[i for i,l in enumerate(lines, start=1) if "_render_with_pauses" in l]
if not idx:
print("No occurrences")
sys.exit(0)
for i in idx[:3]:
start=max(1,i-40); end=min(len(lines), i+120)
print("\n--- around line",i,"---")
for j in range(start,end+1):
print(f"{j:5d}:{lines[j-1].rstrip()}")
PY
echo
echo "=== Search where sample_rate is read in generation.py ==="
rg -n "sample_rate" backend/api/routers/generation.pyRepository: debpalash/OmniVoice-Studio
Length of output: 33607
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Locate backend classes with conditional sample_rate ==="
rg -n "class (OmniVoiceBackend|CosyVoiceBackend|Sherpa|SherpaONNX|Vits|TTSBackend)" backend/services/tts_backend.py
echo
echo "=== OmniVoiceBackend: sample_rate + _ensure_loaded + generate ==="
python3 - <<'PY'
import re
path="backend/services/tts_backend.py"
with open(path,'r',encoding='utf-8') as f: lines=f.readlines()
targets=["class OmniVoiceBackend","class CosyVoiceBackend","class SherpaONNX","class Sherpa-ONNX","class Sherpa"]
for t in targets:
for i,l in enumerate(lines):
if t in l:
start=max(1,i-5); end=min(len(lines), i+140)
print(f"\n--- {t} around line {i+1} ---")
for j in range(start,end+1):
print(f"{j:5d}:{lines[j-1].rstrip()}")
break
PY
echo
echo "=== Search for _ensure_loaded calls inside generate implementations ==="
rg -n "def generate\\(|_ensure_loaded\\(" backend/services/tts_backend.py | head -n 200
echo
echo "=== Show base adapter protocol (TTSBackend) for generate/sample_rate coupling ==="
rg -n "class TTSBackend|def generate\\(" backend/services/tts_backend.pyRepository: debpalash/OmniVoice-Studio
Length of output: 21343
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
path="backend/api/routers/generation.py"
with open(path,'r',encoding='utf-8') as f: lines=f.readlines()
# show around where sample_rate assigned after inference (reported around 360)
for center in [330,350,360,365,370,380,390]:
start=max(1,center-20); end=min(len(lines),center+20)
print(f"\n===== around line {center} =====")
for i in range(start,end+1):
print(f"{i:5d}:{lines[i-1].rstrip()}")
PYRepository: debpalash/OmniVoice-Studio
Length of output: 16347
Fix pause-marker silence timing when backend sample_rate is lazy-loaded
backend/api/routers/generation.py _run_backend_inference captures sr = backend.sample_rate before calling backend.generate(), and _render_with_pauses uses that sr to compute silence samples. However, generate_speech re-reads sample_rate = _backend.sample_rate after generation, and lazy backends change their reported rate once weights are loaded (e.g., OmniVoiceBackend.sample_rate is 24000 until _model is loaded; CosyVoiceBackend similarly; SherpaOnnxBackend is 22050 until _tts is loaded).
This can make pause-marker stitching durations wrong on the first request after lazy model load, even though the final WAV saving uses the updated sample_rate.
Consider either:
- Ensuring the backend is loaded (e.g., force
_ensure_loaded()or equivalent) before capturingsrfor_render_with_pauses, or - Capturing/deriving
sronly after the backend has loaded (e.g., after the first span is generated) and using that samesrfor stitching and for saving/watermarking.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/api/routers/generation.py` around lines 200 - 220, The pause-marker
stitching uses sr = backend.sample_rate before the backend may lazily load and
change its sample_rate, causing incorrect silence durations; fix by ensuring the
backend is fully loaded before capturing sr: call the backend's load routine
(e.g., backend._ensure_loaded() or the appropriate ensure/load method) at the
start of _run_backend_inference (before sr = backend.sample_rate and before any
generate calls) so _render_with_pauses and subsequent saving/watermarking use
the same, final sample_rate; alternatively, if no ensure method exists, defer
computing sr until after the first span generation (use the sample_rate reported
after backend.generate in _gen_span) and pass that same sr into
_render_with_pauses and into _apply_effect_chain.
…full-suite isolation
tests/backend/** runs before tests/test_*.py and pollutes sys.modules
(re-imports the services tree), so module-level imports bound at pytest
collection pointed at a stale services.tts_backend — registry patches
landed on a dict the routes no longer read ('Unknown TTS engine' in CI).
Modules are now resolved through sys.modules inside each test. The client
fixture also drops the module-scoped lifespan context manager that bound
event_bus queues to this module's loop (teardown 'Queue bound to a
different event loop') — plain function-scoped TestClient, the
test_api.py pattern.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Thanks — the approach looks right (resolution order matches /ws/tts and /v1/audio/speech, and keeping the OmniVoice default path byte-identical is the right call), but CI is red and the failures are in this PR's own new tests:
CI run: https://github.com/debpalash/OmniVoice-Studio/actions/runs/27321203190 Happy to re-review once the suite is green. |
…ith #324 engine routing PR #324 (engine resolution in /generate) landed on main and refactored the shared OOM error path in backend/api/routers/generation.py into the _oom_friendly_reraise() helper, and added a _run_backend_inference() twin for non-default engines. PR #278's compile-failure detection had been inlined in _run_inference's `except Exception` block, so the two collided at the same call site. Resolution: fold #278's _is_compile_runtime_failure() check into the shared _oom_friendly_reraise() helper rather than the now-removed inline block. This keeps both features: - The OmniVoice native path (_run_inference) still surfaces the torch.compile/Triton-specific message instead of mislabeling it as OOM — #278's intent preserved. - #324's _run_backend_inference (non-default engines) is left unchanged and now also benefits from the same detection (a no-op for non-compile errors, and torch.compile failures aren't OmniVoice-specific anyway). - The actual compile-fallback wrapper (_install_compile_fallback / _is_compile_runtime_failure in model_manager.py, plus the arch gate and OMNIVOICE_FORCE_TORCH_COMPILE override in engine_env.py) merged cleanly and is untouched — it still wraps get_model()'s generate on the native path. Validation (targeted): tests/test_generate_engine.py, tests/test_compile_fallback.py, tests/test_torch_compile_gate.py — 27 passed. `python -c "import api.routers.generation"` — OK. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
POST /generatealways ran the OmniVoice model directly viaget_model(), silently ignoring the engine selected in Settings (and answering issue #312's question: no, it wasn't intended). It now:engineform field >OMNIVOICE_TTS_BACKENDenv > Settings selection > OmniVoice default — the same pattern/ws/tts(engine) and/v1/audio/speech(model) already uset_shift, layer/position/class controls) — existing API consumers see zero change[pause Nms]markers ([Feature] Pause in Transcript #276) working on every engine (silence stitching is model-free)applies_own_masteringso studio engines skip the broadcast mastering chain (closing the TODO left by 5d602c8), while loudness normalization still appliesFixes #312
Tests
tests/test_generate_engine.py(new, 6 tests, engine layer stubbed as in test_api.py): Settings-selected engine generates; per-requestengineoverrides; unknown → 400; unavailable → 400; default path unchanged; own-mastering engines skip the chain. Full file passes locally; CI runs the complete suite.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests