Fix/tts voice flow #33

Closed
Kiritogu wants to merge 13 commits into datawhalechina:main from Kiritogu:fix/tts-voice-flow


@Kiritogu Kiritogu commented Mar 1, 2026

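Summary

Migrates the TTS and ASR modules from the third-party proxy service (unspeech) to direct integration with the official Volcengine and Alibaba Cloud DashScope APIs, and adds incremental streaming TTS, real-time ASR over WebSocket, and microphone interaction.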

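Problems

  • TTS synthesis depended on the third-party relay unspeech.hyp3r.link, which added latency and a single point of failure
  • ASR only supported batch transcription; there was no real-time streaming speech recognition
  • The frontend had no microphone input button, so voice interaction was not possible directly in the chat UI
  • During LLM streaming output, TTS had to wait for the complete reply, which made responses feel slow
  • The speech provider list contained several placeholder entries that were never actually integrated

Related issue: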

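Solution

Backend: Implements native API calls for Volcengine and Alibaba Cloud separately (Volcengine over HTTPS with Base64-encoded JSON, Alibaba Cloud over a three-stage WebSocket protocol), removing the dependency on the unspeech relay. Adds Alibaba Cloud DashScope real-time ASR (qwen3-asr-flash-realtime) with WebSocket streaming recognition. The Provider Registry gains a local voice catalog (JSON files) and filters compatible voices by model.

Frontend: Implements incremental streaming TTS (TtsStreamSegmenter splits text into sentences as LLM tokens arrive and synthesizes each one immediately) and adds several utility modules to break the logic into smaller pieces. The chat UI and the desktop app gain a microphone button with muted/unmuted states. The speech provider list is trimmed to the 4 providers that are actually integrated.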

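Changes

Core changes

Backend TTS direct routing (backend/app/api/tts.py)

  • Added _forward_volcengine_tts: calls the Volcengine TTS API directly, with Bearer token authentication and Base64 audio decoding
  • Added _forward_alibaba_tts: connects to Alibaba Cloud CosyVoice over WebSocket (three-stage protocol: run-task → continue-task → finish-task)
  • Removed the Dify/Coze TTS handlers and the old streaming proxy logic
  • Added structured error extraction, and decorated Volcengine credential errors with a clearer hint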
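
The three-stage CosyVoice exchange can be sketched as a small frame builder. This is a minimal sketch, not the PR's code: the function name build_cosyvoice_messages is hypothetical, and the header and payload field names follow my reading of the DashScope WebSocket protocol, so they should be checked against the official documentation before use.

```python
import json
import uuid


def build_cosyvoice_messages(text: str, model: str, voice: str) -> list:
    """Return the three JSON frames for one synthesis task, in send order.

    Field names are assumptions based on the DashScope protocol, not the PR.
    """
    task_id = uuid.uuid4().hex  # one id shared by all three frames

    def frame(action: str, payload: dict) -> str:
        header = {"action": action, "task_id": task_id, "streaming": "duplex"}
        return json.dumps({"header": header, "payload": payload}, ensure_ascii=False)

    # Stage 1: open the task and declare the model, voice, and output format.
    run_task = frame("run-task", {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": model,
        "parameters": {"text_type": "PlainText", "voice": voice, "format": "mp3"},
        "input": {},
    })
    # Stage 2: send the text to synthesize (may be repeated for more text).
    continue_task = frame("continue-task", {"input": {"text": text}})
    # Stage 3: tell the server that no more text will follow.
    finish_task = frame("finish-task", {"input": {}})
    return [run_task, continue_task, finish_task]
```

A real client would send the first frame, wait for the server to acknowledge that the task has started, then send the remaining frames while collecting binary audio frames from the socket. That network loop is left out here on purpose.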

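Backend real-time ASR streaming (backend/app/api/asr.py)

  • Added Alibaba Cloud DashScope batch transcription (/compatible-mode/v1/chat/completions + input_audio)
  • Added AliyunRealtimeSession for real-time WebSocket ASR (wss://dashscope.aliyuncs.com/api-ws/v1/realtime)
  • Added WebSocket disconnect detection and safe cleanup (_is_websocket_disconnect_message, _safe_send_ws_error)
  • Automatic model name normalization: real-time streaming forces the -realtime suffix, and batch transcription strips it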
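
The model name normalization rule is simple enough to show directly. The sketch below assumes the rule is purely a suffix check; normalize_asr_model is a hypothetical name, not a function from the PR.

```python
REALTIME_SUFFIX = "-realtime"


def normalize_asr_model(model: str, realtime: bool) -> str:
    """Force the -realtime suffix for streaming, strip it for batch (hypothetical helper)."""
    name = model.strip()
    # Reduce to the base name first so the suffix is never doubled.
    base = name[: -len(REALTIME_SUFFIX)] if name.endswith(REALTIME_SUFFIX) else name
    return base + REALTIME_SUFFIX if realtime else base
```

Reducing to the base name before deciding keeps the function idempotent: calling it twice with the same flag gives the same result as calling it once.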

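Provider Registry (backend/app/services/providers/registry.py)

  • list_voices loads the voice list from the local JSON catalog (LRU-cached)
  • Alibaba Cloud voices are filtered by compatible_models to match the currently selected model
  • Added Alibaba Cloud NLS ASR validation and model listing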
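
Catalog loading and model filtering might look roughly like the sketch below. It assumes the catalog is a JSON array of objects that each carry a compatible_models list; the real layout of alibaba.json is not shown in this PR description, and load_voice_catalog is a hypothetical helper name.

```python
import json
from functools import lru_cache
from pathlib import Path
from typing import Optional


@lru_cache(maxsize=8)
def load_voice_catalog(path: str) -> tuple:
    """Read a JSON voice catalog once; repeated calls for the same path hit the cache."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    # Return a tuple so callers cannot append to or reorder the cached sequence.
    return tuple(entries)


def list_voices(path: str, model: Optional[str] = None) -> list:
    """Return all voices, or only those whose compatible_models contains `model`."""
    voices = load_voice_catalog(path)
    if not model:
        return list(voices)
    return [v for v in voices if model in v.get("compatible_models", [])]
```

Caching on the path string means an edited catalog file is not re-read until the process restarts or load_voice_catalog.cache_clear() is called, which is usually acceptable for static data shipped with the backend.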

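Frontend incremental streaming TTS (speech-output.ts + new utility modules)

  • TtsStreamSegmenter: segments streamed LLM tokens on punctuation and special markers
  • runTtsChunkQueue: synthesizes segments in order; on failure, falls back to merging the remaining text
  • tts-chunker.ts: supports CJK word segmentation (Intl.Segmenter), preserves decimals, and normalizes ellipses
  • tts-direct-request.ts: builds direct TTS requests, falling back to the legacy format on HTTP 405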
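
The idea behind stream segmentation can be illustrated with a small buffer that emits a sentence as soon as its closing punctuation arrives. The PR implements this in TypeScript; the Python sketch below is only an illustration under simplifying assumptions (a fixed punctuation set, no special markers), and StreamSegmenter is a hypothetical class, not a port of TtsStreamSegmenter.

```python
SENTENCE_END = set("。！？!?；;\n")


class StreamSegmenter:
    """Buffers streamed tokens and emits complete sentences (illustrative only)."""

    def __init__(self) -> None:
        self._buffer = ""

    def push(self, token: str) -> list:
        """Append a token and return any sentences completed by it."""
        self._buffer += token
        sentences = []
        start = 0
        for i, ch in enumerate(self._buffer):
            if self._is_boundary(i, ch):
                piece = self._buffer[start : i + 1].strip()
                if piece:
                    sentences.append(piece)
                start = i + 1
        self._buffer = self._buffer[start:]  # keep the unfinished tail
        return sentences

    def _is_boundary(self, i: int, ch: str) -> bool:
        if ch in SENTENCE_END:
            return True
        if ch != ".":
            return False
        # A period ends a sentence only when whitespace follows it. A period at the
        # end of the buffer stays pending, because the next token may continue a
        # decimal ("3." followed by "14").
        return self._buffer[i + 1 : i + 2].isspace()

    def drain(self) -> list:
        """Flush whatever text remains once the stream has ended."""
        tail, self._buffer = self._buffer.strip(), ""
        return [tail] if tail else []
```

Pushing "Pi is 3." and then "14. " yields the single sentence "Pi is 3.14." on the second call, which is the decimal-preserving behaviour the bullet list describes. Each emitted sentence can be handed to synthesis while the model is still generating.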

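Frontend ASR enhancements (transcription.ts + new utility modules)

  • Automatic restart of browser speech recognition (shouldAutoRestartBrowserRecognition)
  • Automatic fallback to MediaRecorder when AudioWorklet fails (decideCaptureFallback)
  • Transcript sanitization: filters out misrecognized Windows paths (sanitizeTranscript)
  • Language code normalization: zh → zh-CN, defaulting to navigator.language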
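
Language code normalization can be sketched as a lookup for short codes plus case cleanup for full tags. The mapping below uses the four languages offered in the settings dropdown; the function name, the handling of underscores, and the choice to send unknown short codes to the default are assumptions. The default argument stands in for the browser locale that the frontend would pass.

```python
from typing import Optional

# Short code -> full tag, for the four languages listed in the settings dropdown.
SHORT_CODES = {"zh": "zh-CN", "en": "en-US", "ja": "ja-JP", "ko": "ko-KR"}


def normalize_language(value: Optional[str], default: str = "en-US") -> str:
    """Expand short codes and normalize casing (hypothetical helper)."""
    token = (value or "").strip().replace("_", "-")
    if not token:
        return default
    parts = token.split("-")
    lang = parts[0].lower()
    if len(parts) == 1:
        # Unknown short codes fall back to the default (an assumption).
        return SHORT_CODES.get(lang, default)
    # Two-part tags: lowercase language, uppercase region.
    return f"{lang}-{parts[1].upper()}"
```

This handles language-region tags only; tags with a script subtag such as zh-Hans-CN would need extra handling.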

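Frontend UI

  • ChatArea.vue / DesktopChatOverlay.vue: added a microphone button (with muted/unmuted visual states)
  • App.vue: wired up onTokenLiteral / onTokenSpecial to drive incremental TTS
  • AudioSection.vue: language is now a dropdown, redundant toggles were removed, and a test hint was added
  • Tauri windows enable useHttpsScheme to support microphone permissions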

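Supporting changes

  • Added voice catalog data files: alibaba.json (CosyVoice voices) and volcengine.json (Volcengine voices)
  • Trimmed provider-fallback.ts / provider-options.ts: removed 7 speech providers that were never integrated
  • provider-visibility.ts: the allowlist keeps only the 4 integrated providers
  • provider-fields.ts: the baseUrl field is hidden automatically when a default baseUrl exists
  • websockets promoted from a dev dependency to a runtime dependency
  • Config files engines.yaml / providers.yaml updated to the official API endpoints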

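Notes

  • The .idea/ directory (JetBrains IDE configuration) is included in the commits; adding it to .gitignore is recommended
  • CLAUDE.md is committed along with the change as a project guide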

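Tests

Adds 16 test files covering the core logic:

Backend tests (6):

  • test_tts_engine_relay.py: Volcengine payload construction, Alibaba Cloud model normalization, error extraction
  • test_asr_aliyun_dashscope.py: DashScope URL construction, transcript text extraction, model normalization
  • test_asr_stream_disconnect.py: WebSocket disconnect detection
  • test_provider_voices_tts.py: voice catalog loading and model filtering
  • test_provider_catalog_aliyun_fields.py: Alibaba Cloud NLS field normalization
  • test_provider_catalog_tts_defaults.py: validation of default TTS provider endpoints

Frontend tests (10):

  • audio-direct.test.ts: direct request construction and legacy format fallback
  • tts-chunker.test.ts: text chunking, CJK, special markers
  • tts-stream-segmenter.test.ts: stream segmentation and drain
  • tts-streaming-runner.test.ts: queue execution, error handling
  • browser-recognition-restart.test.ts: auto-restart logic
  • capture-startup.test.ts: Worklet fallback logic
  • provider-fields.test.ts / provider-visibility.test.ts: field filtering and visibility
  • transcript-filter.test.ts / transcription-language.test.ts: transcript sanitization and language normalization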

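How this was tested

  • Backend unit tests pass (pytest backend/tests/test_tts_*.py backend/tests/test_asr_*.py backend/tests/test_provider_*.py)
  • Frontend unit tests pass (pnpm --filter @whalewhisper/app-core test)
  • Started the backend and frontend locally and verified that Volcengine TTS synthesis works
  • Verified locally that Alibaba Cloud CosyVoice TTS synthesis works
  • Verified locally that Alibaba Cloud DashScope real-time ASR streaming recognition works
  • The microphone button in the chat UI works (Web + Tauri)
  • Incremental TTS playback works while the LLM streams its reply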

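Breaking changes

  • Removed Dify/Coze TTS handling: functions such as _stream_dify_tts and _stream_coze_tts have been deleted
  • Removed 7 frontend speech provider options: OpenAI, ElevenLabs, Microsoft Speech, and others no longer appear in settings
  • Default speech provider changed: from openai-audio-speech to browser-local-audio-speech
  • TTS API return type changed: run_tts_engine now returns Response (the complete audio buffer) instead of StreamingResponse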

Checklist

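  • Code follows the project conventions
  • Self-review completed
  • Local tests pass
  • Documentation updated (if needed)
  • .idea/ directory removed from the commits or added to .gitignore

Auto-generated by Claude AI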

@github-actions github-actions bot added the area/backend, area/frontend, needs-review, and size/XL labels on Mar 1, 2026

@qodo-code-review

Review Summary by Qodo

Complete TTS and ASR module refactoring with incremental streaming and provider consolidation

✨ Enhancement 🐞 Bug fix 🧪 Tests


Walkthroughs

Description
• **Refactored TTS and ASR modules** with comprehensive support for Volcengine and Alibaba Cloud
  providers, replacing deprecated OpenAI and ElevenLabs integrations
• **Implemented incremental TTS streaming** with intelligent text chunking, stream segmentation, and
  retry logic for real-time audio playback
• **Enhanced transcription workflow** with listening source tracking, browser recognition
  auto-restart, language normalization, and transcript sanitization
• **Improved microphone input flow** across chat, desktop overlay, and settings interfaces with
  proper cleanup and state management
• **Added direct TTS request builder** with backend relay support and legacy endpoint fallback for
  provider compatibility
• **Consolidated speech provider configuration** with visibility filtering, voice catalog loading,
  and unified backend API for voice listing
• **Fixed voice field reset** when speech model changes to prevent stale selections
• **Added comprehensive test coverage** for TTS streaming, ASR integration, provider configuration,
  and utility functions
• **Updated provider endpoints** to official Volcengine and Alibaba DashScope URLs with proper model
  ID formatting
• **Added project documentation** (CLAUDE.md) for development guidelines and architecture overview
Diagram
flowchart LR
  A["Chat/Settings UI"] -->|toggleMic| B["Transcription Store"]
  B -->|source tracking| C["Browser Recognition<br/>+ Auto-restart"]
  C -->|normalized language| D["ASR Providers<br/>Volcengine/Alibaba"]
  E["Assistant Messages"] -->|token stream| F["Speech Output Store"]
  F -->|TTS Chunking| G["TTS Stream<br/>Segmenter"]
  G -->|chunk queue| H["TTS Providers<br/>Volcengine/Alibaba"]
  H -->|retry logic| I["Audio Playback"]
  J["Provider Config"] -->|voice catalog| K["Voice Selection"]
  K -->|model change| L["Reset Voice Field"]

File Changes

1. frontend/packages/stage-settings-ui/src/components/AudioSection.vue ✨ Enhancement +96/-102

Refactor audio settings UI with source tracking and language selection

• Refactored microphone testing UI to use toggleMicInput() function with source tracking instead
 of simple boolean toggle
• Added listeningSource property to track whether listening is from settings-test or chat-input
• Replaced manual language input field with SelectMenu dropdown for predefined transcription
 languages (zh-CN, en-US, ja-JP, ko-KR)
• Removed voice selection UI and voiceId computed property from TTS section
• Simplified transcription display with merged interim/last transcript view and test-only hint

frontend/packages/stage-settings-ui/src/components/AudioSection.vue


2. frontend/apps/web/src/components/widgets/ChatArea.vue ✨ Enhancement +109/-34

Add microphone input button to chat area with transcription

• Added microphone button to chat input area with transcription store integration
• Implemented toggleChatMic() function to manage transcription with autoSend: true and
 source: "chat-input"
• Added visual feedback for active microphone state with conditional styling
• Displays transcription errors when microphone is active in chat context

frontend/apps/web/src/components/widgets/ChatArea.vue


3. frontend/apps/desktop-tauri/renderer/src/components/DesktopChatOverlay.vue ✨ Enhancement +85/-18

Add microphone button to desktop chat overlay

• Added microphone button to desktop chat overlay with transcription store integration
• Implemented toggleChatMic() function with autoSend: true and source: "chat-input"
• Added visual indicators for active microphone state and error display
• Uses SVG slash icon to indicate disabled microphone state

frontend/apps/desktop-tauri/renderer/src/components/DesktopChatOverlay.vue


4. frontend/apps/desktop-tauri/renderer/src/SettingsApp.vue ✨ Enhancement +17/-0

Add microphone cleanup on settings window close

• Added stopSettingsTestMic() function to stop listening when source is "settings-test"
• Integrated cleanup on settings window close and visibility change events
• Ensures microphone is stopped when settings dialog is hidden or closed

frontend/apps/desktop-tauri/renderer/src/SettingsApp.vue


5. frontend/apps/web/src/components/settings/SettingsDialog.vue ✨ Enhancement +5/-0

Stop microphone on settings dialog close

• Added cleanup logic to stop listening when settings dialog closes if source is "settings-test"
• Integrates transcription store to manage microphone state during settings interaction

frontend/apps/web/src/components/settings/SettingsDialog.vue


6. frontend/apps/desktop-tauri/renderer/src/ChatApp.vue ✨ Enhancement +8/-0

Add automatic speech output for assistant messages

• Added speech output store integration to handle assistant message playback
• Implemented onAssistantFinal listener to automatically speak final assistant messages
• Added proper cleanup of speech output listener on unmount

frontend/apps/desktop-tauri/renderer/src/ChatApp.vue


7. frontend/apps/web/src/App.vue ✨ Enhancement +7/-1

Implement incremental TTS streaming for assistant messages

• Added onTokenLiteral listener to push assistant text chunks for incremental TTS streaming
• Modified onTokenSpecial to push special tokens to speech output store
• Changed onAssistantFinal to call endAssistantStream() instead of direct speak()
• Added proper cleanup of token literal listener on unmount

frontend/apps/web/src/App.vue


8. frontend/packages/app-settings/src/sections/ModelSection.vue 🐞 Bug fix +9/-3

Reset voice field when speech model changes

• Added logic to reset voice field when speech model is changed
• Triggers provider refresh after model change to update available voices
• Prevents stale voice selections when switching between different TTS models

frontend/packages/app-settings/src/sections/ModelSection.vue


9. frontend/packages/app-settings/src/sections/AudioSection.vue ✨ Enhancement +6/-1

Stop microphone on audio settings unmount

• Added onUnmounted hook to stop listening if source is "settings-test"
• Ensures microphone is properly cleaned up when audio settings component unmounts

frontend/packages/app-settings/src/sections/AudioSection.vue


10. frontend/packages/stage-settings-ui/src/components/ProviderPanel.vue ✨ Enhancement +7/-0

Add placeholder for empty voice options

• Added placeholder text for voice field when no compatible voices are available
• Displays localized message indicating no supported voices for current configuration

frontend/packages/stage-settings-ui/src/components/ProviderPanel.vue


11. frontend/packages/app-core/src/stores/speech-output.ts ✨ Enhancement +404/-87

Implement incremental TTS streaming with retry logic

• Removed voiceId local storage and replaced with provider config-based voice selection
• Implemented incremental TTS streaming with TtsStreamSegmenter for real-time audio playback
• Added pushAssistantLiteral(), pushAssistantSpecial(), and endAssistantStream() methods for
 streaming support
• Implemented retry logic with exponential backoff for TTS requests
• Added requestTtsDirectWithRetry() and requestRemoteTtsBlob() for improved reliability
• Refactored chunk scheduling and playback with fallback to merged remainder on chunk failure

frontend/packages/app-core/src/stores/speech-output.ts


12. frontend/packages/app-core/src/stores/transcription.ts ✨ Enhancement +210/-60

Add listening source tracking and browser recognition restart

• Added listeningSource tracking to distinguish between "settings-test" and "chat-input" sources
• Implemented StartListeningOptions with autoSend and source parameters
• Added browser recognition auto-restart logic with error tracking and restart delay
• Integrated language normalization and transcript sanitization utilities
• Enhanced VAD startup with fallback from worklet to media recorder capture mode
• Added applyTranscript() helper to centralize transcript handling with auto-send logic
• Improved error handling and cleanup for different listening sources

frontend/packages/app-core/src/stores/transcription.ts


13. frontend/packages/app-core/src/data/provider-fallback.ts ⚙️ Configuration changes +12/-232

Consolidate speech providers and update endpoints

• Removed OpenAI audio speech provider entries (openai-audio-speech, openai-compatible-audio-speech)
• Removed ElevenLabs, Microsoft Speech, Index TTS, Comet API, and Player2 speech providers
• Updated Volcengine default base URL from unspeech.hyp3r.link to openspeech.bytedance.com
• Updated Alibaba Cloud Model Studio default base URL and model ID format (removed alibaba/ prefix)
• Updated Aliyun NLS transcription provider label and description to Alibaba Cloud Model Studio

frontend/packages/app-core/src/data/provider-fallback.ts


14. frontend/packages/app-core/src/utils/tts-direct-request.ts ✨ Enhancement +262/-0

Add TTS direct request builder utility

• New utility module for building direct TTS HTTP requests to backend relay endpoints
• Implements buildDirectTtsHttpRequest() for Volcengine and Alibaba Cloud Model Studio engines
• Implements buildLegacyTtsHttpRequest() for fallback to legacy /api/tts/synthesize endpoint
• Handles URL normalization for legacy unspeech.hyp3r.link endpoints
• Supports engine-specific configuration mapping and validation

frontend/packages/app-core/src/utils/tts-direct-request.ts


15. frontend/packages/app-core/src/utils/tts-chunker.ts ✨ Enhancement +243/-0

Add intelligent TTS text chunking utility

• New utility module for intelligent TTS text chunking with punctuation-aware segmentation
• Implements chunkTtsInput() with configurable minimum/maximum word counts and boost factor
• Supports special markers (TTS_FLUSH_INSTRUCTION, TTS_SPECIAL_TOKEN) for streaming control
• Uses Intl.Segmenter API for accurate word and grapheme counting across languages
• Provides toSpeakableTtsChunks() for sanitized chunk extraction

frontend/packages/app-core/src/utils/tts-chunker.ts


16. frontend/packages/app-core/src/data/provider-options.ts ⚙️ Configuration changes +7/-104

Remove unsupported speech providers from options

• Removed OpenAI TTS provider options and voice/model definitions
• Removed ElevenLabs, Microsoft Speech, Index TTS, Comet API, and Player2 speech provider options
• Updated Volcengine and Alibaba Cloud Model Studio default base URLs
• Consolidated speech provider options to supported backends only

frontend/packages/app-core/src/data/provider-options.ts


17. frontend/packages/app-core/src/services/audio-direct.test.ts 🧪 Tests +172/-0

Add tests for TTS direct request builder

• New test file for TTS direct request building functionality
• Tests backend relay request generation for Volcengine and Alibaba engines
• Tests legacy unspeech URL normalization to official provider endpoints
• Tests legacy fallback request generation with proper field mapping

frontend/packages/app-core/src/services/audio-direct.test.ts


18. frontend/packages/app-core/src/services/audio.ts ✨ Enhancement +100/-13

Implement direct TTS request with legacy fallback

• Refactored requestTts() to use new requestTtsDirect() implementation
• Added requestTtsDirect() with backend relay request building and legacy fallback support
• Implemented resolveTtsBlob() to handle both binary and JSON audio responses
• Added decodeBase64Audio() for base64-encoded audio payload extraction
• Added error handling with status codes and detailed error messages

frontend/packages/app-core/src/services/audio.ts


19. frontend/packages/app-core/src/utils/provider-fields.test.ts 🧪 Tests +61/-0

Add tests for provider field filtering

• New test file for provider field filtering logic
• Tests hiding of baseUrl field when provider has default configuration
• Tests preservation of baseUrl field when no default is provided

frontend/packages/app-core/src/utils/provider-fields.test.ts


20. frontend/packages/app-core/src/stores/providers.ts ✨ Enhancement +63/-5

Provider visibility filtering and Alibaba NLS normalization

• Added imports for filterProviderFields and isVisibleSpeechProviderId utilities
• Introduced ALIYUN_NLS_PROVIDER_ID constant and aliyunNlsNormalizedFields configuration for
 Alibaba NLS provider
• Added provider filtering logic to hide unsupported speech providers (volcengine, aliyun-nls)
• Implemented normalizeProviderEntry function to normalize Alibaba NLS provider fields
• Added watchers to auto-refresh providers when chat/speech/transcription provider IDs change
• Refactored getProviderFields to use filterProviderFields utility

frontend/packages/app-core/src/stores/providers.ts


21. frontend/packages/app-core/src/services/providers.ts Refactoring +33/-58

Unified voice listing through backend API endpoint

• Removed direct unspeech library imports and Alibaba Cloud specific voice listing logic
• Refactored listProviderVoices to use unified API endpoint approach via proxy or direct HTTP
• Simplified voice listing to call backend /api/providers/voices endpoint with standardized
 payload
• Removed complex model candidate matching logic (moved to backend)

frontend/packages/app-core/src/services/providers.ts


22. frontend/packages/app-core/src/utils/tts-streaming-runner.ts ✨ Enhancement +72/-0

TTS chunk queue processing utility

• New utility for managing TTS chunk queue processing with error handling
• Supports optional error callbacks and stop-on-error behavior
• Distinguishes between abort errors (fatal) and other errors (recoverable)
• Returns result object with succeeded/failed counts and last error

frontend/packages/app-core/src/utils/tts-streaming-runner.ts
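
The queue behaviour summarized above (sequential processing, recoverable errors counted, aborts propagated) can be sketched as follows. The PR implements it in TypeScript; this Python version is only an illustration, with hypothetical names, and it uses asyncio cancellation as a stand-in for an abort signal.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class QueueResult:
    succeeded: int = 0
    failed: int = 0
    last_error: Optional[BaseException] = None


async def run_chunk_queue(
    chunks: list,
    synthesize: Callable[[str], Awaitable[None]],
    stop_on_error: bool = False,
) -> QueueResult:
    """Synthesize chunks one at a time and report how many succeeded or failed."""
    result = QueueResult()
    for chunk in chunks:
        try:
            await synthesize(chunk)
            result.succeeded += 1
        except asyncio.CancelledError:
            # Cancellation plays the role of an abort here: it is never swallowed.
            raise
        except Exception as exc:
            # Any other failure is recorded, and the queue keeps going by default.
            result.failed += 1
            result.last_error = exc
            if stop_on_error:
                break
    return result
```

Returning counts rather than raising lets the caller decide what to do with a partial failure, for example re-synthesizing the remaining text as one merged request.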


23. frontend/packages/app-core/src/utils/tts-streaming-runner.test.ts 🧪 Tests +79/-0

TTS chunk queue processing tests

• Comprehensive test suite for runTtsChunkQueue function
• Tests error resilience, abort handling, and stop-on-error behavior
• Validates proper error context and result tracking

frontend/packages/app-core/src/utils/tts-streaming-runner.test.ts


24. frontend/packages/app-core/src/utils/tts-chunker.test.ts 🧪 Tests +51/-0

TTS text chunking tests

• Tests for text chunking logic with punctuation handling
• Validates decimal number preservation and ellipsis normalization
• Tests special token and flush instruction handling

frontend/packages/app-core/src/utils/tts-chunker.test.ts


25. frontend/packages/app-core/src/utils/capture-startup.test.ts 🧪 Tests +63/-0

Audio capture startup fallback tests

• Tests for audio capture fallback decision logic
• Validates media recorder fallback when worklet fails
• Tests error normalization for missing fallback options

frontend/packages/app-core/src/utils/capture-startup.test.ts


26. frontend/packages/app-core/src/utils/tts-stream-segmenter.ts ✨ Enhancement +64/-0

TTS stream segmentation utility

• New class for managing TTS stream segmentation with buffering
• Supports appending literals, special markers, and flush markers
• Implements drain logic to emit complete chunks while preserving tail text

frontend/packages/app-core/src/utils/tts-stream-segmenter.ts


27. frontend/packages/app-core/src/utils/provider-visibility.test.ts 🧪 Tests +42/-0

Speech provider visibility tests

• Tests for speech provider visibility filtering
• Validates that only configured providers are visible
• Tests filtering of unsupported provider IDs

frontend/packages/app-core/src/utils/provider-visibility.test.ts


28. frontend/packages/app-core/src/utils/browser-recognition-restart.test.ts 🧪 Tests +56/-0

Browser recognition restart logic tests

• Tests for browser recognition auto-restart decision logic
• Validates conditions for restart (user active, no manual stop, no fatal errors)
• Tests microphone permission denial handling

frontend/packages/app-core/src/utils/browser-recognition-restart.test.ts


29. frontend/packages/app-core/src/utils/tts-stream-segmenter.test.ts 🧪 Tests +39/-0

TTS stream segmentation tests

• Tests for TTS stream segmentation with sentence boundaries
• Validates special marker flushing and final drain behavior

frontend/packages/app-core/src/utils/tts-stream-segmenter.test.ts


30. frontend/packages/app-core/src/utils/transcription-language.test.ts 🧪 Tests +40/-0

Transcription language normalization tests

• Tests for transcription language normalization
• Validates short language code expansion (zh→zh-CN, en→en-US)
• Tests locale token normalization and fallback behavior

frontend/packages/app-core/src/utils/transcription-language.test.ts


31. frontend/packages/app-core/src/utils/capture-startup.ts ✨ Enhancement +35/-0

Audio capture startup fallback utility

• New utility for deciding audio capture fallback strategy
• Normalizes error messages from various error types
• Returns decision object with capture mode and error details

frontend/packages/app-core/src/utils/capture-startup.ts


32. frontend/packages/app-core/src/utils/provider-fields.ts ✨ Enhancement +23/-0

Provider field filtering utility

• New utility to filter provider fields based on default base URL availability
• Hides baseUrl field when provider has default base URL configured

frontend/packages/app-core/src/utils/provider-fields.ts


33. frontend/packages/app-core/src/utils/transcript-filter.test.ts 🧪 Tests +23/-0

Transcript filter tests

• Tests for transcript sanitization logic
• Validates Windows absolute path detection and filtering
• Tests preservation of normal language transcripts

frontend/packages/app-core/src/utils/transcript-filter.test.ts


34. frontend/packages/app-core/src/stores/settings.ts ⚙️ Configuration changes +1/-1

Default speech provider configuration change

• Changed default speech provider from openai-audio-speech to browser-local-audio-speech

frontend/packages/app-core/src/stores/settings.ts


35. frontend/packages/app-core/src/utils/browser-recognition-restart.ts ✨ Enhancement +23/-0

Browser recognition auto-restart decision utility

• New utility to determine if browser recognition should auto-restart
• Checks user request status, manual stop flag, and fatal error codes
• Prevents restart on permission denial and other non-recoverable errors

frontend/packages/app-core/src/utils/browser-recognition-restart.ts


36. frontend/packages/app-core/src/utils/transcription-language.ts ✨ Enhancement +25/-0

Transcription language normalization utility

• New utility for normalizing transcription language codes
• Expands short codes (zh→zh-CN, en→en-US) and normalizes locale tokens
• Provides fallback to en-US for missing/invalid languages

frontend/packages/app-core/src/utils/transcription-language.ts


37. frontend/packages/app-core/src/utils/transcript-filter.ts ✨ Enhancement +17/-0

Transcript sanitization utility

• New utility to sanitize transcripts by filtering Windows absolute paths
• Uses regex pattern matching to detect and remove file paths

frontend/packages/app-core/src/utils/transcript-filter.ts


38. frontend/packages/app-core/src/utils/provider-visibility.ts ✨ Enhancement +14/-0

Speech provider visibility configuration

• New utility defining visible speech provider IDs
• Includes volcengine, alibaba, and browser/app local audio providers
• Provides filtering function for provider ID lists

frontend/packages/app-core/src/utils/provider-visibility.ts


39. backend/app/api/tts.py ✨ Enhancement +675/-191

TTS engine refactoring with Volcengine and Alibaba support

• Completely refactored TTS engine to support Volcengine and Alibaba direct APIs
• Removed streaming response support, now returns complete audio responses
• Added WebSocket support for Alibaba Dashscope TTS
• Implemented comprehensive error extraction and decoration for provider-specific messages
• Added helper functions for payload building, model/voice resolution, and API key handling
• Removed Dify and Coze TTS support

backend/app/api/tts.py


40. backend/app/api/asr.py ✨ Enhancement +613/-24

ASR engine enhancement with Alibaba Bailian support

• Added support for Alibaba Bailian ASR with realtime WebSocket streaming
• Implemented AliyunRealtimeSession dataclass for managing WebSocket connections
• Added comprehensive helper functions for Alibaba Dashscope API integration
• Improved WebSocket disconnect handling with better error recovery
• Added audio format resolution and base64 encoding utilities
• Supports both realtime and non-realtime Alibaba ASR models

backend/app/api/asr.py


41. backend/app/services/providers/registry.py ✨ Enhancement +225/-4

Provider registry enhancement with voice catalog support

• Added support for Alibaba NLS ASR provider validation and model listing
• Implemented local TTS voice loading from JSON files for Volcengine and Alibaba
• Added voice parsing functions for both providers with model compatibility filtering
• Implemented voice description building with language information
• Added model candidate resolution for Alibaba voice filtering

backend/app/services/providers/registry.py


42. backend/tests/test_tts_engine_relay.py 🧪 Tests +188/-0

TTS engine relay tests

• Comprehensive test suite for TTS engine payload building and configuration
• Tests for Volcengine and Alibaba payload construction
• Tests for API key resolution, model normalization, and error decoration
• Tests for WebSocket URL resolution and JSON error extraction

backend/tests/test_tts_engine_relay.py


43. backend/tests/test_asr_aliyun_dashscope.py 🧪 Tests +112/-0

Alibaba Dashscope ASR tests

• Test suite for Alibaba Dashscope ASR integration
• Tests for URL building, model resolution, and WebSocket URL construction
• Tests for event text extraction and response parsing
• Tests for realtime session credential resolution

backend/tests/test_asr_aliyun_dashscope.py


44. backend/tests/test_provider_voices_tts.py 🧪 Tests +110/-0

Provider voice listing tests

• Tests for provider voice listing functionality
• Tests local voice catalog loading for Volcengine and Alibaba
• Tests model-based voice filtering for Alibaba provider
• Tests unsupported provider handling

backend/tests/test_provider_voices_tts.py


45. backend/app/api/providers.py ✨ Enhancement +52/-23

Provider API field normalization for Alibaba NLS

• Added helper function to convert provider fields to dictionary format
• Implemented special handling for Alibaba NLS provider to expose only API key field
• Refactored field resolution to use new _resolve_provider_field_dicts function

backend/app/api/providers.py


46. backend/tests/test_provider_catalog_tts_defaults.py 🧪 Tests +43/-0

TTS provider catalog defaults tests

• Tests for TTS provider default configuration
• Validates Volcengine and Alibaba default endpoints and models
• Tests model field options for Alibaba provider

backend/tests/test_provider_catalog_tts_defaults.py


47. frontend/apps/desktop-tauri/src-tauri/tauri.conf.json ⚙️ Configuration changes +2/-0

Tauri HTTPS scheme configuration

• Added useHttpsScheme: true to main and settings window configurations

frontend/apps/desktop-tauri/src-tauri/tauri.conf.json


48. .idea/inspectionProfiles/profiles_settings.xml Miscellaneous +6/-0

IDE inspection profile configuration

• New IntelliJ IDEA inspection profile settings file

.idea/inspectionProfiles/profiles_settings.xml


49. backend/tests/test_provider_catalog_aliyun_fields.py 🧪 Tests +76/-0

Add Alibaba Bailian provider field normalization tests

• Added new test file for validating Alibaba Bailian provider field normalization
• Tests verify that OpenAI-style fields are properly normalized for Aliyun catalog
• Tests ensure fields are reduced to minimal required shape (only apiKey)
• Includes helper functions for field creation and test execution

backend/tests/test_provider_catalog_aliyun_fields.py


50. backend/tests/test_asr_stream_disconnect.py 🧪 Tests +45/-0

Add ASR stream disconnect detection tests

• Added new test file for ASR WebSocket disconnect handling
• Tests verify detection of WebSocket disconnect message frames
• Tests validate detection of disconnect-related RuntimeError exceptions
• Tests ensure unrelated RuntimeErrors are properly ignored

backend/tests/test_asr_stream_disconnect.py
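
The three cases these tests cover can be sketched with two small predicates. The function names are hypothetical, and matching on the error text is an assumption about how the web framework words its RuntimeError messages; the PR's actual checks may differ.

```python
def is_disconnect_message(message: object) -> bool:
    """True for an ASGI message frame that signals a WebSocket disconnect."""
    return isinstance(message, dict) and message.get("type") == "websocket.disconnect"


def is_disconnect_error(exc: BaseException) -> bool:
    """True for RuntimeErrors raised because the socket is already closed.

    Text matching is an assumption; unrelated RuntimeErrors return False so
    that real bugs are not silently treated as a normal disconnect.
    """
    if not isinstance(exc, RuntimeError):
        return False
    text = str(exc).lower()
    return "disconnect" in text or "close message" in text
```

Keeping the two checks separate makes each one easy to test on its own: one inspects received frames, the other classifies exceptions raised while sending or receiving.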


51. backend/app/services/providers/voices/alibaba.json ⚙️ Configuration changes +294/-0

Add Alibaba Bailian TTS voice catalog data

• Added comprehensive voice catalog for Alibaba Bailian TTS service
• Contains 20 voice profiles with metadata (name, preview URL, model, language, bitrate)
• Includes both Chinese and bilingual voice options with various use case scenarios
• All voices use cosyvoice-v1 model with MP3 format at 22050 bitrate

backend/app/services/providers/voices/alibaba.json


52. .idea/inspectionProfiles/Project_Default.xml ⚙️ Configuration changes +144/-0

Add IDE inspection profile configuration

• Added IDE inspection profile configuration for PyCharm/IntelliJ
• Configured 129 ignored packages for PyPackageRequirementsInspection
• Enabled Eslint and Stylelint inspections with appropriate severity levels
• Suppresses false warnings for common dependencies in the project

.idea/inspectionProfiles/Project_Default.xml


53. CLAUDE.md 📝 Documentation +296/-0

Add comprehensive project documentation for Claude AI

• Added comprehensive project documentation for Claude AI code assistant
• Includes project overview, architecture, development environment setup, and code standards
• Documents project structure, configuration-driven architecture, and async-first design patterns
• Provides quick start guides, testing procedures, common commands, and contribution guidelines

CLAUDE.md


54. backend/config/providers.yaml ⚙️ Configuration changes +11/-18

Update TTS and ASR provider endpoints and configurations

• Updated Volcengine TTS provider base URL from unspeech.hyp3r.link to
 openspeech.bytedance.com/api/v1/tts
• Updated Alibaba Cloud Model Studio TTS base URL and model ID to use official DashScope endpoint
• Changed Alibaba model from alibaba/cosyvoice-v1 to cosyvoice-v1 format
• Updated Aliyun NLS transcription provider configuration with new engine ID and removed unnecessary
 fields
• Changed ElevenLabs voice field from select dropdown to text input type

backend/config/providers.yaml


55. backend/config/engines.yaml ⚙️ Configuration changes +27/-3

Update TTS endpoints and add Alibaba Bailian ASR engine

• Updated Volcengine TTS engine base URL to official openspeech.bytedance.com/api/v1/tts endpoint
• Updated Alibaba Cloud Model Studio TTS to use DashScope endpoint with corrected model ID
• Added new aliyun-nls-asr engine configuration for Alibaba Bailian ASR service
• Configured ASR engine with DashScope API, including VAD, ITN, and word customization parameters

backend/config/engines.yaml


56. backend/pyproject.toml Dependencies +1/-1

Move websockets to core dependencies

• Moved websockets>=12.0 dependency from optional dev dependencies to core dependencies
• Makes WebSocket support a required dependency for all installations

backend/pyproject.toml


57. .idea/whale-whisper.iml ⚙️ Configuration changes +8/-0

Add IDE module configuration

• Added PyCharm/IntelliJ module configuration file
• Defines Python module structure and project root settings

.idea/whale-whisper.iml


58. .idea/modules.xml ⚙️ Configuration changes +8/-0

Add IDE project modules configuration

• Added IDE project module manager configuration
• Registers the whale-whisper module in the IDE project structure

.idea/modules.xml


59. .idea/vcs.xml ⚙️ Configuration changes +7/-0

Add IDE version control configuration

• Added IDE version control system configuration
• Maps Git repositories for main project and airi submodule

.idea/vcs.xml


60. .idea/misc.xml ⚙️ Configuration changes +4/-0

Add IDE project settings

• Added IDE project settings configuration
• Specifies Python SDK name for the project

.idea/misc.xml


61. .idea/easycode.ignore ⚙️ Configuration changes +13/-0

Add EasyCode plugin ignore patterns

• Added ignore patterns for EasyCode IDE plugin
• Excludes common build artifacts, dependencies, and generated files from code generation

.idea/easycode.ignore


62. backend/app/services/providers/voices/volcengine.json Additional files +3176/-0

...

backend/app/services/providers/voices/volcengine.json


63. frontend/apps/desktop-tauri/renderer/src/App.vue Additional files +0/-7

...

frontend/apps/desktop-tauri/renderer/src/App.vue



qodo-code-review bot commented Mar 1, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0)


Action required

1. Dify/Coze TTS broken 🐞 Bug ✓ Correctness
Description
backend/app/api/tts.py no longer routes by engine_type (dify_tts/coze_tts). Requests for engines
still defined in backend/config/engines.yaml (dify-tts/coze-tts) will fall into the generic
OpenAI-compatible relay path, requiring model/voice and using /audio/speech, which does not match
their configured paths and request formats.
Code

backend/app/api/tts.py[R77-121]

+async def run_tts_engine(request: EngineRunRequest) -> Response:
+    engine_id = _resolve_tts_engine_id(request.engine)
+    runtime_config = _get_tts_engine_config(engine_id)
+
+    text = _extract_tts_input(request.data)
     if not text:
         raise HTTPException(status_code=400, detail="Missing text input")

-    engine_type = (config.engine_type or "openai_compat").lower()
     overrides = request.config if isinstance(request.config, dict) else {}
+    api_key = _resolve_tts_api_key(runtime_config, overrides)
+    if not api_key:
+        raise HTTPException(status_code=400, detail="Missing apiKey for TTS provider")
+
+    if engine_id in VOLCENGINE_ENGINE_IDS:
+        return await _forward_volcengine_tts(
+            runtime_config=runtime_config,
+            text=text,
+            overrides=overrides,
+            api_key=api_key,
+        )

-    if engine_type in {"dify_tts", "dify"}:
-        stream = await _stream_dify_tts(config, text, overrides)
-        return StreamingResponse(stream, media_type="audio/mpeg")
+    if engine_id in ALIBABA_ENGINE_IDS:
+        return await _forward_alibaba_tts(
+            engine_id=engine_id,
+            runtime_config=runtime_config,
+            text=text,
+            overrides=overrides,
+            api_key=api_key,
+        )

-    if engine_type in {"coze_tts", "coze"}:
-        stream = await _stream_coze_tts(config, text, overrides)
-        return StreamingResponse(stream, media_type="audio/mpeg")
+    payload = _build_unspeech_payload(
+        engine_id=engine_id,
+        runtime_config=runtime_config,
+        text=text,
+        overrides=overrides,
+    )

-    base_url_override, api_key_override = _resolve_connection_overrides(overrides)
-    payload: Dict[str, Any] = {"model": config.model, "input": text}
-    payload.update(config.default_params)
-    payload.update(sanitize_config(overrides))
+    speech_path = runtime_config.paths.get("speech") if runtime_config.paths else None
+    url = runtime_config.base_url.rstrip("/") + normalize_path(speech_path or "/audio/speech")

-    if "voice" not in payload:
-        raise HTTPException(status_code=400, detail="Missing voice for TTS")
+    headers = {
+        "Content-Type": "application/json",
+        "Authorization": f"Bearer {api_key}",
+    }
+    headers.update(runtime_config.headers)
Evidence
run_tts_engine now only special-cases Volcengine/Alibaba by engine_id and otherwise always builds an
OpenAI-compatible payload and posts to /audio/speech. Meanwhile engines.yaml still configures
dify-tts and coze-tts with engine types and speech paths that require custom forwarding
(/text-to-audio for Dify, /v1/audio/speech plus bot logic for Coze). With no dify/coze branches
remaining, these engines cannot be executed correctly.

backend/app/api/tts.py[76-125]
backend/config/engines.yaml[138-179]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`run_tts_engine` no longer supports `dify_tts` / `coze_tts` engines even though they remain configured in `backend/config/engines.yaml`. Those engines will now be treated as OpenAI-compatible relays and likely fail.
### Issue Context
The PR refactored TTS routing to special-case Volcengine/Alibaba and treat everything else as OpenAI-compatible. Existing Dify/Coze engines in config still require custom endpoints/parameters.
### Fix Focus Areas
- backend/app/api/tts.py[76-140]
- backend/config/engines.yaml[138-179]
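
One way to restore the removed routing is a small dispatch table keyed by engine type. The sketch below is a minimal illustration, not the project's code: the three handler functions are hypothetical synchronous stand-ins for `_stream_dify_tts`, `_stream_coze_tts`, and the generic relay, and only the engine-type strings are taken from the branches removed in the diff above.

```python
from typing import Callable, Dict, Optional


# Hypothetical stand-ins: the real handlers are async and return streaming responses.
def forward_dify(text: str) -> str:
    return f"dify:{text}"


def forward_coze(text: str) -> str:
    return f"coze:{text}"


def forward_openai_compat(text: str) -> str:
    return f"openai_compat:{text}"


# Engine-type strings copied from the removed `if engine_type in {...}` branches.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "dify_tts": forward_dify,
    "dify": forward_dify,
    "coze_tts": forward_coze,
    "coze": forward_coze,
}


def select_tts_handler(engine_type: Optional[str]) -> Callable[[str], str]:
    """Pick a handler by engine type; missing or unknown types use the generic relay."""
    key = (engine_type or "openai_compat").lower()
    return HANDLERS.get(key, forward_openai_compat)
```

With a table like this, the Volcengine and Alibaba checks by `engine_id` could run first, and the generic OpenAI-compatible path would only be reached when no engine-type handler matches.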


2. Speech provider ID not migrated 🐞 Bug ⛯ Reliability
Description
Frontend now restricts visible speech providers and changes the default speech provider, but
existing users may still have removed/hidden speechProviderId values in localStorage. When provider
metadata is missing, resolveTtsEngineId returns an empty string, causing direct TTS requests to be
unsupported and breaking speech output after upgrade.
Code

frontend/packages/app-core/src/stores/speech-output.ts[R89-104]

+  function resolveTtsEngineId() {
+    const metadataEngineId = providerMetadata.value?.engineId;
+    if (typeof metadataEngineId === "string" && metadataEngineId.trim()) {
+      return metadataEngineId.trim();
+    }
+    if (speechProviderId.value === "volcengine-speech" || speechProviderId.value === "volcengine") {
+      return "volcengine-speech";
+    }
+    if (
+      speechProviderId.value === "alibaba-cloud-model-studio-speech" ||
+      speechProviderId.value === "alibaba-cloud-model-studio"
+    ) {
+      return "alibaba-cloud-model-studio-speech";
+    }
+    return "";
+  }
Evidence
speechProviderId is persisted via useLocalStorage and therefore keeps old values across upgrades.
The new provider visibility allowlist limits speech providers to four IDs; removed IDs may no longer
appear in catalog. speech-output’s resolveTtsEngineId returns "" for unknown IDs; tts-direct-request
rejects falsy/unknown engineIds, so remote TTS cannot run and the user may be stuck without an
automatic fallback.

frontend/packages/app-core/src/stores/settings.ts[16-22]
frontend/packages/app-core/src/utils/provider-visibility.ts[1-14]
frontend/packages/app-core/src/stores/speech-output.ts[89-104]
frontend/packages/app-core/src/utils/tts-direct-request.ts[118-145]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Upgraded users with a previously stored speech provider ID that is now hidden/removed can end up with `resolveTtsEngineId()` returning an empty string, which makes direct TTS unsupported and breaks speech output.
### Issue Context
`useLocalStorage` preserves old values across upgrades; the PR adds a strict allowlist of visible speech providers.
### Fix Focus Areas
- frontend/packages/app-core/src/stores/settings.ts[16-22]
- frontend/packages/app-core/src/utils/provider-visibility.ts[1-14]
- frontend/packages/app-core/src/stores/speech-output.ts[89-104]



Remediation recommended

3. Verify Volcengine auth header 🐞 Bug ⛯ Reliability
Description
Volcengine TTS forwarding sets Authorization to "Bearer;{api_key}" (semicolon) while other TTS relay
paths use "Bearer {api_key}" (space). This inconsistency is likely accidental and should be
verified, as it may cause authentication failures depending on upstream expectations.
Code

backend/app/api/tts.py[R583-587]

+    headers = {
+        "Content-Type": "application/json",
+        "Authorization": f"Bearer;{api_key}",
+    }
+    headers.update(runtime_config.headers)
Evidence
Within the same module, the generic relay path uses a conventional Bearer token header with a space
delimiter. Volcengine forwarding uses a different delimiter, suggesting a potential typo or mismatch
with the rest of the relay implementation.

backend/app/api/tts.py[114-121]
backend/app/api/tts.py[583-587]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Volcengine forwarding uses `Authorization: Bearer;{api_key}` which is inconsistent with other Bearer usage in the same file and may lead to auth failures.
### Issue Context
Other relay paths use `Bearer {api_key}`.
### Fix Focus Areas
- backend/app/api/tts.py[574-588]
- backend/app/api/tts.py[114-121]

@github-actions github-actions bot left a comment

Code Review Summary

This PR (XL: 8730+/992-, 66 files) makes substantial changes to both TTS and ASR modules. The review identified several issues ranging from silent failures and a potential SSRF vector to a logic bug that silently drops emoji characters from TTS output.

PR Size: XL

Issues Found

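| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Silent Failures | 1 | 4 | 1 | - |
| Logic Bugs | 1 | 3 | 2 | - |
| Security | 1 | 1 | - | - |
| Type Safety | - | - | 1 | 1 |
| Tests | - | 3 | 2 | - |
| Other | 1 | - | 2 | - |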

Key Findings (details in inline comments)

Critical

  1. [SECURITY-SSRF] asr.py/tts.py: User-controlled base_url/api_base forwarded to server-side HTTP requests. Aliyun/Volcengine/Alibaba paths accept user-supplied URLs from request body overrides that bypass ASR_BLOCKED_CONFIG_KEYS, enabling SSRF against internal networks.
  2. [LOGIC-BUG] tts-chunker.ts: Multi-codepoint graphemes (emojis, flag characters, CJK with combining marks) are silently dropped from TTS output due to if (value.length > 1) { continue; } skipping them without appending to buffer.
  3. .idea/ committed: 8 JetBrains IDE config files added to version control. Add .idea/ to .gitignore and remove these files.
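
For the SSRF finding (item 1), one possible guard is to validate any user-supplied upstream URL against an allowlist before the server connects to it. This is a minimal sketch under assumptions: `is_allowed_upstream` and the two constants are hypothetical names, and the host list contains only the two official endpoints named in this PR (`openspeech.bytedance.com` and `dashscope.aliyuncs.com`); a real deployment may need more.

```python
from urllib.parse import urlsplit

# Assumption: only the official endpoints mentioned in this PR are permitted.
ALLOWED_UPSTREAM_HOSTS = frozenset({
    "openspeech.bytedance.com",
    "dashscope.aliyuncs.com",
})
ALLOWED_SCHEMES = frozenset({"https", "wss"})


def is_allowed_upstream(url: str) -> bool:
    """Return True only for TLS URLs whose host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example an unterminated IPv6 literal) are rejected.
        return False
    host = parts.hostname or ""
    return parts.scheme in ALLOWED_SCHEMES and host in ALLOWED_UPSTREAM_HOSTS
```

Comparing the parsed `hostname` rather than searching the raw string matters here: a URL such as `https://openspeech.bytedance.com@evil.example/` contains the trusted name but resolves to a different host.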

High
  4. [ERROR-SILENT] asr.py:210-214: _decode_base64 catches all exceptions and returns b"" with no logging. Caller then raises misleading "Missing audio data" instead of "Invalid base64".
  5. [ERROR-SILENT] registry.py:~60: _load_local_tts_voices_cached swallows JSON parse errors and returns []. Because of @lru_cache, the empty result is permanently cached for the process lifetime.
  6. [SECURITY-INFO-LEAK] asr.py:174-175: str(exc) sent directly to WebSocket client, potentially leaking internal URLs, file paths, or connection strings.
  7. [LOGIC-BUG] speech-output.ts: supported computed now returns true unconditionally for non-browser TTS, even when the provider is unsupported/unconfigured.
  8. [LOGIC-BUG] transcription.ts: startListening now sets enabled.value = true as a side effect, silently overriding the user's persisted preference.
  9. [PERFORMANCE-ISSUE] tts-stream-segmenter.ts: drain() re-runs chunkTtsInput() on the full accumulated input every call, producing O(n²) cost for long streaming responses.

Medium
  10. [ERROR-SILENT] tts.py: Two except Exception: pass blocks in _extract_tts_error with zero logging.
  11. [ERROR-SILENT] asr.py: _close_aliyun_realtime_session and _read_aliyun_realtime_events swallow exceptions without logging.
  12. [LOGIC-BUG] speech-output.ts: endAssistantStream finally block does not clear assistantStreamChunks or assistantStreamFailedChunkIndex, leaving stale state.
  13. [UTF-8 BOM] stage-settings-ui/AudioSection.vue:1: UTF-8 BOM (U+FEFF) introduced at file start, likely an editor artifact.
  14. [TEST-MISSING-CRITICAL] audio-direct.test.ts: buildDirectTtsHttpRequest null-return guards (missing required fields) are entirely untested.
  15. [TEST-MISSING-CRITICAL] tts-chunker.test.ts: maximumWords limit splitting and soft punctuation splitting are untested.

Review Coverage

  • Logic and correctness
  • Security (OWASP Top 10)
  • Error handling
  • Type safety
  • Documentation accuracy
  • Test coverage
  • Code clarity

Automated review by Claude AI

)
else:
response = await _forward_transcription(
engine_config, wav_bytes, overrides, filename, content_type

[ERROR-SILENT] _decode_base64 catches all exceptions and silently returns b"". The caller at line ~83 then raises a misleading "Missing audio data" error when the user actually provided data that was malformed. No logging occurs.

Suggestion:

def _decode_base64(payload: str) -> bytes:
    try:
        return base64.b64decode(payload, validate=True)
    except Exception as exc:
        logger.warning("Failed to decode base64 audio payload: %s", exc)
        return b""

Or better, raise HTTPException(400, detail="Invalid base64 audio data") to give actionable feedback.
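
The stricter variant could look roughly like the following. It is a self-contained sketch, so `InvalidAudioPayload` is a hypothetical stand-in for `HTTPException(status_code=400, ...)` and `decode_audio_base64` is an illustrative name, not the function in `asr.py`. The point is that an empty payload and a malformed payload produce different messages.

```python
import base64
import binascii


class InvalidAudioPayload(ValueError):
    """Hypothetical stand-in for HTTPException(status_code=400)."""


def decode_audio_base64(payload: str) -> bytes:
    """Decode base64 audio, distinguishing 'missing' from 'malformed' input."""
    if not payload:
        raise InvalidAudioPayload("Missing audio data")
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        return base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise InvalidAudioPayload("Invalid base64 audio data") from exc
```

In the FastAPI route the same two branches would map to two distinct 400 responses, so a client sending corrupted audio is no longer told that it sent none.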

sample_rate=sample_rate,
)
await websocket.send_json({"type": "ready"})
elif message_type == "stop":

[SECURITY-INFO-LEAK] str(exc) is sent directly to the WebSocket client. This can leak internal URLs, file paths, connection strings, or stack trace details (e.g., from httpx connection errors revealing internal hostnames).

Suggestion:

except Exception as exc:
    logger.exception("Unexpected error in ASR WebSocket stream")
    await websocket.send_json({"type": "error", "error": "Internal server error"})
    await websocket.close(code=1011)

@@ -0,0 +1,8 @@
# 默认忽略的文件

[CONFIG] 8 JetBrains IDE config files (.idea/) are being added to version control. These are developer-specific and should not be committed. The existing .gitignore already excludes .vscode/ under # IDE but .idea/ was missed.

Suggestion: Add .idea/ to .gitignore and remove these files from the PR:

echo '.idea/' >> .gitignore
git rm -r --cached .idea/

while (index < input.length) {
  let value = input[index];

  if (value.length > 1) {

[LOGIC-BUG] Graphemes longer than one UTF-16 code unit (emojis like 👨‍👩‍👧‍👦, flag emojis 🇺🇸, characters with combining marks) are silently dropped from TTS output. When Intl.Segmenter produces a grapheme cluster with value.length > 1, this code skips it entirely without appending it to buffer, so the character vanishes from the output. Because length counts UTF-16 code units, the check also catches single-code-point emoji outside the Basic Multilingual Plane.

Suggestion: Append the multi-codepoint grapheme to the buffer instead of skipping it:

if (value.length > 1) {
  buffer += value;       // <-- ADD: don't drop the grapheme
  previousValue = value;
  index += 1;
  continue;
}


    try:
        raw = json.loads(source.read_text(encoding="utf-8"))
    except Exception:

[ERROR-SILENT] JSON parse errors are silently swallowed with no logging. Because this function is wrapped in @lru_cache(maxsize=8), the empty [] result is permanently cached for the process lifetime. Even if the JSON file is subsequently fixed on disk, a full server restart is needed.

Suggestion:

    try:
        raw = json.loads(source.read_text(encoding="utf-8"))
    except Exception as exc:
        logger.error("Failed to load TTS voices from %s: %s", path, exc)
        return []  # NOTE: empty result is cached by @lru_cache; consider not caching on error
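
One way to avoid caching the failure relies on the fact that `functools.lru_cache` does not store a result when the wrapped call raises. The sketch below assumes the voice catalog is a JSON array; `_load_voices_strict` and `load_voices` are hypothetical names, not the functions in `registry.py`.

```python
import json
import logging
from functools import lru_cache
from pathlib import Path
from typing import Any, List, Tuple

logger = logging.getLogger(__name__)


@lru_cache(maxsize=8)
def _load_voices_strict(path: str) -> Tuple[Any, ...]:
    """Cached loader that raises on failure, so failures are never cached."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(raw, list):
        # Assumption: the catalog file is a JSON array of voice entries.
        raise ValueError("expected a JSON array of voices")
    return tuple(raw)


def load_voices(path: str) -> List[Any]:
    """Uncached wrapper: logs the error and returns [] without poisoning the cache."""
    try:
        return list(_load_voices_strict(path))
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
        logger.error("Failed to load TTS voices from %s: %s", path, exc)
        return []
```

After a broken file is repaired on disk, the next `load_voices` call retries the read instead of returning a cached empty list, so no restart is needed for that case.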

this.emittedCount = 0;
}

drain(finalize: boolean) {

[PERFORMANCE-ISSUE] drain() re-runs chunkTtsInput(this.input) on the full accumulated input text every call, then skips the first this.emittedCount chunks. As the stream grows, this becomes O(n) per token arrival, making the total cost O(n²) for a long streaming response.

Suggestion: Track the character offset of the last emitted chunk and only re-chunk from that offset onward:

drain(finalize: boolean) {
  // Only chunk from the un-emitted portion of the input
  const tail = this.input.slice(this.emittedOffset);
  const newChunks = chunkTtsInput(tail);
  // ... emit new chunks, update this.emittedOffset
}

Or cache the chunks array and only re-chunk the text appended since the last drain().
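
The offset-tracking idea can be sketched independently of the real chunker. The example below is written in Python to stay consistent with the backend snippets in this review, and it is deliberately simplified: `IncrementalSegmenter` is a hypothetical class that splits only on a few sentence-ending marks and does not reproduce `chunkTtsInput` behaviour such as decimal preservation or word limits. What it shows is that each `drain()` scans only text that has not been emitted yet.

```python
import re
from typing import List

# Simplified assumption: a chunk ends at one of these sentence-ending marks.
_SENTENCE_END = re.compile(r"[.!?。！？]")


class IncrementalSegmenter:
    """Keeps only un-emitted text, so drain() never rescans emitted chunks."""

    def __init__(self) -> None:
        self._pending = ""

    def push(self, token: str) -> None:
        self._pending += token

    def drain(self, finalize: bool = False) -> List[str]:
        chunks: List[str] = []
        start = 0
        for match in _SENTENCE_END.finditer(self._pending):
            chunk = self._pending[start:match.end()].strip()
            if chunk:
                chunks.append(chunk)
            start = match.end()
        # Drop everything that was emitted; keep the incomplete tail.
        self._pending = self._pending[start:]
        if finalize and self._pending.strip():
            chunks.append(self._pending.strip())
            self._pending = ""
        return chunks
```

Because the pending buffer is trimmed after every call, the work per `drain()` is bounded by the length of the current unfinished sentence rather than by the whole response.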

   }
-  return Boolean(audioApiBaseUrl.value);
+  return true;
 });

[LOGIC-BUG] supported now returns true unconditionally for non-browser TTS, regardless of whether the provider is actually configured, has an API key, or has a valid engine. This misleads the UI into showing TTS as available when it may not work (e.g., if the user selects a speech provider that isn't volcengine or alibaba).

Suggestion: Check that the engine is actually supported:

const supported = computed(() => {
  if (typeof window === "undefined") return false;
  if (useBrowserTts.value) {
    return "speechSynthesis" in window;
  }
  const engineId = resolveTtsEngineId();
  return supportsDirectTts(engineId);
});

}

if (!enabled.value) {
  enabled.value = true;

[LOGIC-BUG] startListening now sets enabled.value = true as a side effect. Since enabled is persisted to localStorage, this silently overrides the user's preference. If the user intentionally disabled transcription and some other code path calls startListening, their preference is permanently overwritten.

Suggestion: Make the auto-enable opt-in rather than implicit:

async function startListening(options?: StartListeningOptions & { autoEnable?: boolean }) {
  if (!enabled.value) {
    if (options?.autoEnable) {
      enabled.value = true;
    } else {
      return;
    }
  }
  // ...
}

@github-actions
github-actions bot commented Mar 1, 2026

Blocker (can’t access the PR diff)

  • gh pr view / gh pr diff fail with: error connecting to api.github.com
  • curl confirms DNS/network is unavailable here: Could not resolve host: api.github.com

Without the diff (or PR head file contents), I can’t do an evidence-based review with file+line citations or generate valid inline review comments.

What I can confirm from the local event payload

  • PR #33 “Fix/tts voice flow” (author: Kiritogu), base main @ b338cef..., head fix/tts-voice-flow @ 8612a4...
  • Size: XL (8862 additions + 992 deletions = 9854 lines changed; 66 files). size/XL is already present in the event labels.

XL split suggestions (based on PR description only)

  • Backend: TTS provider migration (Volcengine + Aliyun CosyVoice) as one PR
  • Backend: ASR additions (DashScope batch + realtime WS session) as one PR
  • Frontend: streaming TTS segmentation/queue modules as one PR
  • Frontend: mic UI + Tauri permission changes as one PR
  • Provider catalogs/config + tests/docs/IDE config as separate PR(s)

To proceed

  • Either enable GitHub network/DNS in this job, or provide the diff as input (preferred: attach a file like pr33.diff generated by gh pr diff 33 --repo datawhalechina/whale-whisper --color=never > pr33.diff).

@github-actions github-actions bot left a comment

WhaleWhisper Elite Code Review — PR #33

Size: XL (8862+, 992−, 66 files)

Issues Found

| # | Severity | Category | File | Line(s) | Confidence | Title |
|---|---|---|---|---|---|---|
| 1 | Critical | LOGIC-BUG | backend/app/api/asr.py | 134–144 | 95 | WebSocket break paths leak Aliyun realtime session |
| 2 | Medium | LOGIC-BUG | frontend/.../tts-chunker.ts | 124–127 | 85 | Multi-character graphemes silently dropped from TTS text |

Review Coverage

  • Comment Analyzer
  • Test Analyzer
  • Silent Failure Hunter
  • Type Design Auditor
  • General Code Reviewer
  • Code Simplifier

Stats

  • Files reviewed: 66
  • Issues found (≥80 confidence): 2
  • False positives filtered: 5

Comment on lines 134 to +144
 while True:
-    message = await websocket.receive()
+    try:
+        message = await websocket.receive()
+    except WebSocketDisconnect:
+        break
+    except RuntimeError as exc:
+        if _is_disconnect_receive_runtime_error(exc):
+            break
+        raise
+
+    if _is_websocket_disconnect_message(message):

[CRITICAL] [LOGIC-BUG] WebSocket break paths leak aliyun_realtime_session (Confidence: 95)

These three break statements exit the while True loop normally, which means the except WebSocketDisconnect and except Exception handlers (which contain the session cleanup logic) are never reached. If aliyun_realtime_session has been created, the Aliyun WebSocket connection and its background reader task are leaked.

The outer except blocks at the end of run_asr_engine_stream correctly call _close_aliyun_realtime_session, but break does not trigger them — it falls through to the code after the try/except, which has no cleanup.

Suggested fix — wrap the loop in a try/finally:

aliyun_realtime_session: Optional["AliyunRealtimeSession"] = None
try:
    while True:
        try:
            message = await websocket.receive()
        except WebSocketDisconnect:
            break
        except RuntimeError as exc:
            if _is_disconnect_receive_runtime_error(exc):
                break
            raise

        if _is_websocket_disconnect_message(message):
            break

        # ... rest of message handling ...

except WebSocketDisconnect:
    return
except Exception as exc:
    # ... error handling ...
finally:
    if aliyun_realtime_session is not None:
        await _close_aliyun_realtime_session(aliyun_realtime_session)

This ensures the session is always cleaned up regardless of how the loop exits.
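
The difference between the two exit paths can be reproduced in isolation. The toy below is an illustration only: `FakeSession` and `stream` are hypothetical stand-ins for `AliyunRealtimeSession` and `run_asr_engine_stream`, and plain strings replace WebSocket messages. It shows that a `break` never enters an `except` block, while `finally` runs on both the `break` path and the exception path.

```python
import asyncio
from typing import List


class FakeSession:
    """Hypothetical stand-in for the upstream realtime session."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def stream(messages: List[str], session: FakeSession) -> str:
    """Return which path ended the loop; the session is closed either way."""
    exit_path = "exhausted"
    try:
        for message in messages:
            if message == "disconnect":
                exit_path = "break"
                break  # leaves the loop without raising, so no except block runs
            if message == "boom":
                raise RuntimeError("upstream failure")
    except RuntimeError:
        exit_path = "error"
    finally:
        # Runs after break, after the except handler, and after normal completion.
        await session.close()
    return exit_path
```

Running `asyncio.run(stream(["audio", "disconnect"], FakeSession()))` takes the `break` path and still closes the session, which is the case the original handler-only cleanup missed.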

Comment on lines +124 to +128
+    if (value.length > 1) {
+      previousValue = value;
+      index += 1;
+      continue;
+    }

[MEDIUM] [LOGIC-BUG] Multi-character graphemes silently dropped from TTS text (Confidence: 85)

When value.length > 1 (true for emoji like 👋, composed characters like é via combining marks, and other multi-codeunit graphemes), the code sets previousValue and increments index but never appends value to buffer. These characters are silently lost from the TTS output.

This is especially impactful for CJK users and emoji-heavy conversations.

Suggested fix — add the value to the buffer before continuing:

if (value.length > 1) {
  buffer += value;        // ← preserve the grapheme
  previousValue = value;
  index += 1;
  continue;
}

Alternatively, consider whether multi-char graphemes should also increment chunkWordsCount (they likely represent a visible word/symbol).

Labels

  • area/backend: Touches backend (FastAPI/Python)
  • area/frontend: Touches frontend (Vue/TS)
  • needs-review: Needs careful review (large/complex changes)
  • size/XL: PR size: >= 1000 lines changed
