Conversation
This reverts commit 3014403.
Review Summary by Qodo

Complete TTS and ASR module refactoring with incremental streaming and provider consolidation

Walkthrough

- **Refactored TTS and ASR modules** with comprehensive support for Volcengine and Alibaba Cloud providers, replacing deprecated OpenAI and ElevenLabs integrations
- **Implemented incremental TTS streaming** with intelligent text chunking, stream segmentation, and retry logic for real-time audio playback
- **Enhanced transcription workflow** with listening source tracking, browser recognition auto-restart, language normalization, and transcript sanitization
- **Improved microphone input flow** across chat, desktop overlay, and settings interfaces with proper cleanup and state management
- **Added direct TTS request builder** with backend relay support and legacy endpoint fallback for provider compatibility
- **Consolidated speech provider configuration** with visibility filtering, voice catalog loading, and unified backend API for voice listing
- **Fixed voice field reset** when speech model changes to prevent stale selections
- **Added comprehensive test coverage** for TTS streaming, ASR integration, provider configuration, and utility functions
- **Updated provider endpoints** to official Volcengine and Alibaba DashScope URLs with proper model ID formatting
- **Added project documentation** (CLAUDE.md) for development guidelines and architecture overview

Diagram

```mermaid
flowchart LR
    A["Chat/Settings UI"] -->|toggleMic| B["Transcription Store"]
    B -->|source tracking| C["Browser Recognition<br/>+ Auto-restart"]
    C -->|normalized language| D["ASR Providers<br/>Volcengine/Alibaba"]
    E["Assistant Messages"] -->|token stream| F["Speech Output Store"]
    F -->|TTS Chunking| G["TTS Stream<br/>Segmenter"]
    G -->|chunk queue| H["TTS Providers<br/>Volcengine/Alibaba"]
    H -->|retry logic| I["Audio Playback"]
    J["Provider Config"] -->|voice catalog| K["Voice Selection"]
    K -->|model change| L["Reset Voice Field"]
```
File Changes

1. frontend/packages/stage-settings-ui/src/components/AudioSection.vue
Code Review by Qodo
1. Dify/Coze TTS broken
Code Review Summary
This PR (XL: 8730+/992-, 66 files) makes substantial changes to both TTS and ASR modules. The review identified several issues ranging from silent failures and a potential SSRF vector to a logic bug that silently drops emoji characters from TTS output.
PR Size: XL
Issues Found
| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Silent Failures | 1 | 4 | 1 | - |
| Logic Bugs | 1 | 3 | 2 | - |
| Security | 1 | 1 | - | - |
| Type Safety | - | - | 1 | 1 |
| Tests | - | 3 | 2 | - |
| Other | 1 | - | 2 | - |
Key Findings (details in inline comments)
Critical
- [SECURITY-SSRF] `asr.py`/`tts.py`: User-controlled `base_url`/`api_base` is forwarded to server-side HTTP requests. The Aliyun/Volcengine/Alibaba paths accept user-supplied URLs from request-body `overrides` that bypass `ASR_BLOCKED_CONFIG_KEYS`, enabling SSRF against internal networks.
- [LOGIC-BUG] `tts-chunker.ts`: Multi-codepoint graphemes (emojis, flag characters, CJK with combining marks) are silently dropped from TTS output because `if (value.length > 1) { continue; }` skips them without appending to the buffer.
- [CONFIG] `.idea/` committed: 8 JetBrains IDE config files added to version control. Add `.idea/` to `.gitignore` and remove these files.
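A common mitigation for the SSRF finding is to validate any client-supplied base URL against an allowlist of known provider hosts before forwarding the request. The sketch below is illustrative only — the helper name and host set are assumptions, not code from this PR:

```python
from urllib.parse import urlparse

# Assumption: the official provider endpoints this backend is allowed to call.
ALLOWED_TTS_HOSTS = {
    "openspeech.bytedance.com",   # Volcengine
    "dashscope.aliyuncs.com",     # Alibaba DashScope
}

def validate_provider_url(base_url: str) -> str:
    """Reject user-supplied URLs that could reach internal services (SSRF)."""
    parsed = urlparse(base_url)
    if parsed.scheme not in {"https", "wss"}:
        raise ValueError(f"Disallowed scheme: {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_TTS_HOSTS:
        raise ValueError(f"Host not in provider allowlist: {parsed.hostname!r}")
    return base_url

print(validate_provider_url("https://dashscope.aliyuncs.com/api/v1"))
try:
    validate_provider_url("http://169.254.169.254/latest/meta-data")  # cloud metadata endpoint
except ValueError as exc:
    print("blocked:", exc)
```

Note that a hostname allowlist still resolves DNS at request time; pinning resolved IPs and disabling redirects on the HTTP client hardens this further.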
High
4. [ERROR-SILENT] asr.py:210-214: _decode_base64 catches all exceptions and returns b"" with no logging. Caller then raises misleading "Missing audio data" instead of "Invalid base64".
5. [ERROR-SILENT] registry.py:~60: _load_local_tts_voices_cached swallows JSON parse errors and returns []. Because of @lru_cache, the empty result is permanently cached for the process lifetime.
6. [SECURITY-INFO-LEAK] asr.py:174-175: str(exc) sent directly to WebSocket client, potentially leaking internal URLs, file paths, or connection strings.
7. [LOGIC-BUG] speech-output.ts: supported computed now returns true unconditionally for non-browser TTS, even when the provider is unsupported/unconfigured.
8. [LOGIC-BUG] transcription.ts: startListening now sets enabled.value = true as a side effect, silently overriding the user's persisted preference.
9. [PERFORMANCE-ISSUE] tts-stream-segmenter.ts: drain() re-runs chunkTtsInput() on the full accumulated input every call, producing O(n²) cost for long streaming responses.
Medium
10. [ERROR-SILENT] tts.py: Two except Exception: pass blocks in _extract_tts_error with zero logging.
11. [ERROR-SILENT] asr.py: _close_aliyun_realtime_session and _read_aliyun_realtime_events swallow exceptions without logging.
12. [LOGIC-BUG] speech-output.ts: endAssistantStream finally block does not clear assistantStreamChunks or assistantStreamFailedChunkIndex, leaving stale state.
13. [UTF-8 BOM] stage-settings-ui/AudioSection.vue:1: UTF-8 BOM (U+FEFF) introduced at file start — likely an editor artifact.
14. [TEST-MISSING-CRITICAL] audio-direct.test.ts: buildDirectTtsHttpRequest null-return guards (missing required fields) are entirely untested.
15. [TEST-MISSING-CRITICAL] tts-chunker.test.ts: maximumWords limit splitting and soft punctuation splitting are untested.
Review Coverage
- Logic and correctness
- Security (OWASP Top 10)
- Error handling
- Type safety
- Documentation accuracy
- Test coverage
- Code clarity
Automated review by Claude AI
```python
)
else:
    response = await _forward_transcription(
        engine_config, wav_bytes, overrides, filename, content_type
```
[ERROR-SILENT] _decode_base64 catches all exceptions and silently returns b"". The caller at line ~83 then raises a misleading "Missing audio data" error when the user actually provided data that was malformed. No logging occurs.
Suggestion:
```python
def _decode_base64(payload: str) -> bytes:
    try:
        return base64.b64decode(payload, validate=True)
    except Exception as exc:
        logger.warning("Failed to decode base64 audio payload: %s", exc)
        return b""
```

Or better, raise `HTTPException(400, detail="Invalid base64 audio data")` to give actionable feedback.
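For reference, `validate=True` is what makes malformed payloads raise instead of being silently "repaired" by dropping non-alphabet characters. A minimal standalone illustration (the helper name is hypothetical):

```python
import base64
import binascii

def decode_audio_payload(payload: str) -> bytes:
    """Decode base64 audio, raising on malformed input instead of returning b''."""
    try:
        return base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        # Surface the real cause so callers don't report "Missing audio data".
        raise ValueError(f"Invalid base64 audio data: {exc}") from exc

print(decode_audio_payload("aGVsbG8="))  # valid payload -> b'hello'
try:
    decode_audio_payload("aGVsbG8=!")    # stray '!' is rejected, not silently dropped
except ValueError as exc:
    print("rejected:", exc)
```

Without `validate=True`, the stray `!` would be discarded and the payload would decode "successfully", hiding the corruption.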
```python
    sample_rate=sample_rate,
)
await websocket.send_json({"type": "ready"})
elif message_type == "stop":
```
[SECURITY-INFO-LEAK] str(exc) is sent directly to the WebSocket client. This can leak internal URLs, file paths, connection strings, or stack trace details (e.g., from httpx connection errors revealing internal hostnames).
Suggestion:
```python
except Exception as exc:
    logger.exception("Unexpected error in ASR WebSocket stream")
    await websocket.send_json({"type": "error", "error": "Internal server error"})
    await websocket.close(code=1011)
```

```
@@ -0,0 +1,8 @@
# 默认忽略的文件
```

(The Chinese comment in the committed `.idea/` file reads "files ignored by default".)
[CONFIG] 8 JetBrains IDE config files (.idea/) are being added to version control. These are developer-specific and should not be committed. The existing .gitignore already excludes .vscode/ under # IDE but .idea/ was missed.
Suggestion: Add `.idea/` to `.gitignore` and remove these files from the PR:

```shell
echo '.idea/' >> .gitignore
git rm -r --cached .idea/
```

```ts
while (index < input.length) {
  let value = input[index];

  if (value.length > 1) {
```
[LOGIC-BUG] Multi-codepoint graphemes (emojis like 👨👩👧👦, flag emojis 🇺🇸, CJK with combining marks) are silently dropped from TTS output. When Intl.Segmenter produces a grapheme cluster with value.length > 1, this code skips it entirely without appending to buffer, so the character vanishes from the output.
Suggestion: Append the multi-codepoint grapheme to the buffer instead of skipping it:
```ts
if (value.length > 1) {
  buffer += value; // <-- ADD: don't drop the grapheme
  previousValue = value;
  index += 1;
  continue;
}
```

```python
try:
    raw = json.loads(source.read_text(encoding="utf-8"))
except Exception:
```
[ERROR-SILENT] JSON parse errors are silently swallowed with no logging. Because this function is wrapped in @lru_cache(maxsize=8), the empty [] result is permanently cached for the process lifetime. Even if the JSON file is subsequently fixed on disk, a full server restart is needed.
Suggestion:
```python
try:
    raw = json.loads(source.read_text(encoding="utf-8"))
except Exception as exc:
    logger.error("Failed to load TTS voices from %s: %s", path, exc)
    return []  # NOTE: empty result is cached by @lru_cache; consider not caching on error
```

```python
async def run_tts_engine(request: EngineRunRequest) -> Response:
    engine_id = _resolve_tts_engine_id(request.engine)
    runtime_config = _get_tts_engine_config(engine_id)

    text = _extract_tts_input(request.data)
    if not text:
        raise HTTPException(status_code=400, detail="Missing text input")

    engine_type = (config.engine_type or "openai_compat").lower()
    overrides = request.config if isinstance(request.config, dict) else {}
    api_key = _resolve_tts_api_key(runtime_config, overrides)
    if not api_key:
        raise HTTPException(status_code=400, detail="Missing apiKey for TTS provider")

    if engine_id in VOLCENGINE_ENGINE_IDS:
        return await _forward_volcengine_tts(
            runtime_config=runtime_config,
            text=text,
            overrides=overrides,
            api_key=api_key,
        )

    if engine_type in {"dify_tts", "dify"}:
        stream = await _stream_dify_tts(config, text, overrides)
        return StreamingResponse(stream, media_type="audio/mpeg")
    if engine_id in ALIBABA_ENGINE_IDS:
        return await _forward_alibaba_tts(
            engine_id=engine_id,
            runtime_config=runtime_config,
            text=text,
            overrides=overrides,
            api_key=api_key,
        )

    if engine_type in {"coze_tts", "coze"}:
        stream = await _stream_coze_tts(config, text, overrides)
        return StreamingResponse(stream, media_type="audio/mpeg")
    payload = _build_unspeech_payload(
        engine_id=engine_id,
        runtime_config=runtime_config,
        text=text,
        overrides=overrides,
    )

    base_url_override, api_key_override = _resolve_connection_overrides(overrides)
    payload: Dict[str, Any] = {"model": config.model, "input": text}
    payload.update(config.default_params)
    payload.update(sanitize_config(overrides))
    speech_path = runtime_config.paths.get("speech") if runtime_config.paths else None
    url = runtime_config.base_url.rstrip("/") + normalize_path(speech_path or "/audio/speech")

    if "voice" not in payload:
        raise HTTPException(status_code=400, detail="Missing voice for TTS")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    headers.update(runtime_config.headers)
```
1. Dify/Coze TTS broken 🐞 Bug ✓ Correctness
backend/app/api/tts.py no longer routes by engine_type (dify_tts/coze_tts). Requests for engines still defined in backend/config/engines.yaml (dify-tts/coze-tts) will fall into the generic OpenAI-compatible relay path, requiring model/voice and using /audio/speech, which does not match their configured paths and request formats.
Agent Prompt
### Issue description
`run_tts_engine` no longer supports `dify_tts` / `coze_tts` engines even though they remain configured in `backend/config/engines.yaml`. Those engines will now be treated as OpenAI-compatible relays and likely fail.
### Issue Context
The PR refactored TTS routing to special-case Volcengine/Alibaba and treat everything else as OpenAI-compatible. Existing Dify/Coze engines in config still require custom endpoints/parameters.
### Fix Focus Areas
- backend/app/api/tts.py[76-140]
- backend/config/engines.yaml[138-179]
```ts
function resolveTtsEngineId() {
  const metadataEngineId = providerMetadata.value?.engineId;
  if (typeof metadataEngineId === "string" && metadataEngineId.trim()) {
    return metadataEngineId.trim();
  }
  if (speechProviderId.value === "volcengine-speech" || speechProviderId.value === "volcengine") {
    return "volcengine-speech";
  }
  if (
    speechProviderId.value === "alibaba-cloud-model-studio-speech" ||
    speechProviderId.value === "alibaba-cloud-model-studio"
  ) {
    return "alibaba-cloud-model-studio-speech";
  }
  return "";
}
```
2. Speech provider id not migrated 🐞 Bug ⛯ Reliability
Frontend now restricts visible speech providers and changes the default speech provider, but existing users may still have removed/hidden speechProviderId values in localStorage. When provider metadata is missing, resolveTtsEngineId returns an empty string, causing direct TTS requests to be unsupported and breaking speech output after upgrade.
Agent Prompt
### Issue description
Upgraded users with a previously stored speech provider ID that is now hidden/removed can end up with `resolveTtsEngineId()` returning an empty string, which makes direct TTS unsupported and breaks speech output.
### Issue Context
`useLocalStorage` preserves old values across upgrades; the PR adds a strict allowlist of visible speech providers.
### Fix Focus Areas
- frontend/packages/app-core/src/stores/settings.ts[16-22]
- frontend/packages/app-core/src/utils/provider-visibility.ts[1-14]
- frontend/packages/app-core/src/stores/speech-output.ts[89-104]
```ts
  this.emittedCount = 0;
}

drain(finalize: boolean) {
```
[PERFORMANCE-ISSUE] drain() re-runs chunkTtsInput(this.input) on the full accumulated input text every call, then skips the first this.emittedCount chunks. As the stream grows, this becomes O(n) per token arrival, making the total cost O(n²) for a long streaming response.
Suggestion: Track the character offset of the last emitted chunk and only re-chunk from that offset onward:
```ts
drain(finalize: boolean) {
  // Only chunk from the un-emitted portion of the input
  const tail = this.input.slice(this.emittedOffset);
  const newChunks = chunkTtsInput(tail);
  // ... emit new chunks, update this.emittedOffset
}
```

Or cache the chunks array and only re-chunk the text appended since the last `drain()`.
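The offset idea generalizes beyond this codebase: keep only the un-emitted tail and rescan just that, so each drain costs time linear in the new text rather than in the whole stream. A minimal Python sketch of the pattern (class and method names hypothetical; splits on sentence terminators only):

```python
class IncrementalSegmenter:
    """Emit complete sentences as text streams in, scanning only the un-emitted tail."""

    TERMINATORS = ".!?。！？"

    def __init__(self) -> None:
        self.buffer = ""  # holds only text not yet emitted; emitted text is discarded

    def push(self, token: str) -> list[str]:
        """Append a token and return any newly completed sentences."""
        self.buffer += token
        out: list[str] = []
        cut = 0
        # Scan only the buffered tail, so cost is proportional to the pending
        # text, not to the total accumulated stream (avoiding the O(n^2) pattern).
        for i, ch in enumerate(self.buffer):
            if ch in self.TERMINATORS:
                out.append(self.buffer[cut : i + 1])
                cut = i + 1
        self.buffer = self.buffer[cut:]
        return out

seg = IncrementalSegmenter()
print(seg.push("Hello there. How"))  # -> ['Hello there.']
print(seg.push(" are you? Fin"))     # -> [' How are you?']
print(seg.buffer)                    # -> ' Fin'
```

The real `TtsStreamSegmenter` additionally handles special markers and CJK punctuation; the point here is only the tail-tracking structure.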
```ts
}
return Boolean(audioApiBaseUrl.value);
return true;
});
```
[LOGIC-BUG] supported now returns true unconditionally for non-browser TTS, regardless of whether the provider is actually configured, has an API key, or has a valid engine. This misleads the UI into showing TTS as available when it may not work (e.g., if the user selects a speech provider that isn't volcengine or alibaba).
Suggestion: Check that the engine is actually supported:
```ts
const supported = computed(() => {
  if (typeof window === "undefined") return false;
  if (useBrowserTts.value) {
    return "speechSynthesis" in window;
  }
  const engineId = resolveTtsEngineId();
  return supportsDirectTts(engineId);
});
```

```ts
}

if (!enabled.value) {
  enabled.value = true;
```
[LOGIC-BUG] startListening now sets enabled.value = true as a side effect. Since enabled is persisted to localStorage, this silently overrides the user's preference. If the user intentionally disabled transcription and some other code path calls startListening, their preference is permanently overwritten.
Suggestion: Make the auto-enable opt-in rather than implicit:
```ts
async function startListening(options?: StartListeningOptions & { autoEnable?: boolean }) {
  if (!enabled.value) {
    if (options?.autoEnable) {
      enabled.value = true;
    } else {
      return;
    }
  }
  // ...
}
```
Blocker (can't access the PR diff)

Without the diff (or PR head file contents), I can't do an evidence-based review with file+line citations or generate valid inline review comments.

What I can confirm from the local event payload

XL split suggestions (based on PR description only)

To proceed
WhaleWhisper Elite Code Review — PR #33
Size: XL (8862+, 992−, 66 files)
Issues Found
| # | Severity | Category | File | Line(s) | Confidence | Title |
|---|---|---|---|---|---|---|
| 1 | Critical | LOGIC-BUG | backend/app/api/asr.py | 134–144 | 95 | WebSocket break paths leak Aliyun realtime session |
| 2 | Medium | LOGIC-BUG | frontend/.../tts-chunker.ts | 124–127 | 85 | Multi-character graphemes silently dropped from TTS text |
Review Coverage
- Comment Analyzer
- Test Analyzer
- Silent Failure Hunter
- Type Design Auditor
- General Code Reviewer
- Code Simplifier
Stats
- Files reviewed: 66
- Issues found (≥80 confidence): 2
- False positives filtered: 5
```python
while True:
    message = await websocket.receive()
    try:
        message = await websocket.receive()
    except WebSocketDisconnect:
        break
    except RuntimeError as exc:
        if _is_disconnect_receive_runtime_error(exc):
            break
        raise

    if _is_websocket_disconnect_message(message):
```
[CRITICAL] [LOGIC-BUG] WebSocket break paths leak aliyun_realtime_session (Confidence: 95)
These three break statements exit the while True loop normally, which means the except WebSocketDisconnect and except Exception handlers (which contain the session cleanup logic) are never reached. If aliyun_realtime_session has been created, the Aliyun WebSocket connection and its background reader task are leaked.
The outer except blocks at the end of run_asr_engine_stream correctly call _close_aliyun_realtime_session, but break does not trigger them — it falls through to the code after the try/except, which has no cleanup.
Suggested fix — wrap the loop in a try/finally:

```python
aliyun_realtime_session: Optional["AliyunRealtimeSession"] = None
try:
    while True:
        try:
            message = await websocket.receive()
        except WebSocketDisconnect:
            break
        except RuntimeError as exc:
            if _is_disconnect_receive_runtime_error(exc):
                break
            raise
        if _is_websocket_disconnect_message(message):
            break
        # ... rest of message handling ...
except WebSocketDisconnect:
    return
except Exception as exc:
    # ... error handling ...
finally:
    if aliyun_realtime_session is not None:
        await _close_aliyun_realtime_session(aliyun_realtime_session)
```

This ensures the session is always cleaned up regardless of how the loop exits.
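The underlying Python behavior is easy to confirm in isolation: `break` exits a loop inside a `try` block without ever entering the `except` handlers, while a `finally` clause runs no matter how the block is left. A tiny self-contained demonstration (names hypothetical):

```python
events = []

def consume(messages):
    it = iter(messages)
    try:
        while True:
            msg = next(it)
            if msg == "disconnect":
                break            # leaves the loop "normally" — no exception raised
            events.append(msg)
    except StopIteration:
        events.append("except")  # NOT reached when we break before exhausting the iterator
    finally:
        events.append("cleanup") # always reached, break or not

consume(["a", "b", "disconnect", "c"])
print(events)  # -> ['a', 'b', 'cleanup']
```

This is exactly why cleanup placed only in `except` blocks (as in the reviewed code) is skipped on the `break` paths, while the `finally`-based fix is not.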
```ts
if (value.length > 1) {
  previousValue = value;
  index += 1;
  continue;
}
```
[MEDIUM] [LOGIC-BUG] Multi-character graphemes silently dropped from TTS text (Confidence: 85)
When value.length > 1 (true for emoji like 👋, composed characters like é via combining marks, and other multi-codeunit graphemes), the code sets previousValue and increments index but never appends value to buffer. These characters are silently lost from the TTS output.
This is especially impactful for CJK users and emoji-heavy conversations.
Suggested fix — add the value to the buffer before continuing:
```ts
if (value.length > 1) {
  buffer += value; // ← preserve the grapheme
  previousValue = value;
  index += 1;
  continue;
}
```

Alternatively, consider whether multi-char graphemes should also increment `chunkWordsCount` (they likely represent a visible word/symbol).
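The failure mode is easy to reproduce outside the codebase: any rebuild loop that skips multi-code-unit segments loses emoji and composed characters. A Python sketch (the `segments` list is a hard-coded stand-in for `Intl.Segmenter` grapheme output):

```python
# Stand-in for Intl.Segmenter grapheme output: each entry is one grapheme cluster.
segments = ["H", "i", " ", "👨‍👩‍👧‍👦", "!"]  # family emoji is 7 code points (4 people + 3 ZWJs)

def rebuild_buggy(segs):
    buffer = ""
    for value in segs:
        if len(value) > 1:   # bug: multi-codepoint grapheme skipped entirely
            continue
        buffer += value
    return buffer

def rebuild_fixed(segs):
    buffer = ""
    for value in segs:
        buffer += value      # fix: always append, regardless of cluster length
    return buffer

print(rebuild_buggy(segments))  # -> 'Hi !'  (emoji silently dropped)
print(rebuild_fixed(segments))  # -> 'Hi 👨‍👩‍👧‍👦!'
```

Python's `len` counts code points where JavaScript's `.length` counts UTF-16 code units, but the skip-vs-append logic — and the resulting data loss — is the same.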
Summary

Migrates the TTS and ASR modules from the third-party relay service (unspeech) to direct integration with the official Volcengine and Alibaba Cloud DashScope APIs, and adds incremental streaming TTS, realtime WebSocket ASR, and microphone interaction.

Problem

Routing speech traffic through unspeech.hyp3r.link added latency and a single point of failure.
Related issue:

Solution

Backend: implement native API calls for Volcengine and Alibaba Cloud (Volcengine over HTTPS + Base64 JSON; Alibaba Cloud over a three-phase WebSocket protocol), removing the dependency on the unspeech relay. Add Alibaba Cloud DashScope realtime ASR (qwen3-asr-flash-realtime) with WebSocket streaming recognition. The Provider Registry gains a local voice catalog (JSON files) filtered to voices compatible with the selected model.

Frontend: implement incremental streaming TTS (TtsStreamSegmenter splits text into sentence-sized segments as LLM tokens arrive and synthesizes them immediately), with the logic factored into several new utility modules. Add a microphone button with mute/unmute states to the chat UI and desktop overlay. Trim the speech provider list to the 4 providers that are actually integrated.

Changes

Core changes

Backend direct TTS routing (backend/app/api/tts.py)
- _forward_volcengine_tts: calls the Volcengine TTS API directly with Bearer-token auth and Base64 audio decoding
- _forward_alibaba_tts: talks to Alibaba Cloud CosyVoice over WebSocket (run-task → continue-task → finish-task three-phase protocol)

Backend realtime ASR stream (backend/app/api/asr.py)
- Batch transcription via /compatible-mode/v1/chat/completions + input_audio
- AliyunRealtimeSession: realtime WebSocket ASR (wss://dashscope.aliyuncs.com/api-ws/v1/realtime)
- Disconnect-handling helpers (_is_websocket_disconnect_message, _safe_send_ws_error)
- The -realtime suffix is stripped automatically in batch mode

Provider Registry (backend/app/services/providers/registry.py)
- list_voices loads the voice catalog from local JSON files (LRU-cached)
- Filtered by compatible_models to match the currently selected model

Frontend incremental streaming TTS (speech-output.ts + new utility modules)
- TtsStreamSegmenter: segments LLM streaming tokens on punctuation and special markers
- runTtsChunkQueue: synthesizes segments in order, merging the remaining text as a fallback on failure
- tts-chunker.ts: CJK word segmentation (Intl.Segmenter), decimal preservation, ellipsis normalization
- tts-direct-request.ts: builds direct TTS requests, falling back to the legacy format on 405

Frontend ASR enhancements (transcription.ts + new utility modules)
- Browser recognition auto-restart (shouldAutoRestartBrowserRecognition)
- Capture fallback decision (decideCaptureFallback)
- Transcript sanitization (sanitizeTranscript)
- Language normalization zh → zh-CN, defaulting to navigator.language

Frontend UI
- ChatArea.vue/DesktopChatOverlay.vue: new microphone button (with mute/unmute visual states)
- App.vue: wires onTokenLiteral/onTokenSpecial to drive incremental TTS
- AudioSection.vue: language is now a dropdown, redundant toggles removed, test notes added
- useHttpsScheme to support microphone permissions

Supporting changes
- New voice catalogs: alibaba.json (CosyVoice voices) and volcengine.json (Volcengine voices)
- provider-fallback.ts/provider-options.ts: removed 7 speech providers that were never integrated
- provider-visibility.ts: allowlist keeps only the 4 integrated providers
- provider-fields.ts: hides the baseUrl field when a default exists
- websockets promoted from dev dependency to runtime dependency
- engines.yaml/providers.yaml updated to the official API endpoints

Notes
- The .idea/ directory (JetBrains IDE config) is included in the commit; it should be added to .gitignore
- CLAUDE.md is committed alongside as project guidance documentation

Tests

16 new test files cover the core logic:

Backend tests (6):
- test_tts_engine_relay.py: Volcengine payload construction, Alibaba model normalization, error extraction
- test_asr_aliyun_dashscope.py: DashScope URL construction, transcript extraction, model normalization
- test_asr_stream_disconnect.py: WebSocket disconnect detection
- test_provider_voices_tts.py: voice catalog loading and model filtering
- test_provider_catalog_aliyun_fields.py: Alibaba NLS field normalization
- test_provider_catalog_tts_defaults.py: TTS provider default endpoint validation

Frontend tests (10):
- audio-direct.test.ts: direct request building and legacy-format fallback
- tts-chunker.test.ts: text segmentation, CJK, special markers
- tts-stream-segmenter.test.ts: streaming segmentation and drain
- tts-streaming-runner.test.ts: queue execution, error handling
- browser-recognition-restart.test.ts: auto-restart logic
- capture-startup.test.ts: worklet fallback logic
- provider-fields.test.ts/provider-visibility.test.ts: field filtering and visibility
- transcript-filter.test.ts/transcription-language.test.ts: transcript sanitization and language normalization

How it was tested
- Backend: pytest backend/tests/test_tts_*.py backend/tests/test_asr_*.py backend/tests/test_provider_*.py
- Frontend: pnpm --filter @whalewhisper/app-core test

Breaking changes
- _stream_dify_tts, _stream_coze_tts, and related functions have been removed
- openai-audio-speech is replaced by browser-local-audio-speech
- run_tts_engine now returns Response instead of StreamingResponse (the full audio buffer)

Checklist
- .idea/ directory removed from the commit or added to .gitignore

Generated automatically by Claude AI