Roadmap note: convert vague bullets without acceptance criteria into checkbox tasks. Format:
- [ ] <Task> (Target: <quarter/year>).
v1.1.0 – Production-ready voice assistant system. VoiceAssistant orchestrator with Whisper-based STT, llama.cpp TTS/LLM integration, session management, phone call transcription, meeting protocol generation, real-time browser WebSocket streaming, voice biometric authentication, and telephony bridge (SIP/WebRTC) are all implemented.
- VoiceAssistant – central coordinator for all voice interaction
- STT processing via Whisper AI (speaker diarization, timestamps)
- LLM integration via EmbeddedLLM / LlamaWrapper (intent recognition, query generation, response generation)
- TTS synthesis with audio format output
- Voice command processing pipeline (audio → STT → LLM → TTS → audio)
- Session state and conversation history management
- Context-aware conversational AI
- Phone call recording and transcription
- Meeting protocol generation
- Voice-based database query interface
- Storage and retrieval of voice session data
- Key point and summary extraction
- Real-time streaming STT (word-by-word transcription as audio arrives) (Issue: #2496)
- Wake-word detection for hands-free activation (Issue: #2365)
- Voice biometric authentication (speaker verification) (Issue: #2494)
- Multi-speaker diarization improvements (Issue: #2497)
- WebSocket audio streaming endpoint for browser clients (Issue: #2350)
- Integration with telephony systems — SIP call sessions, WebRTC peer connections, IVR engine, TelephonyBridge coordinator (Issue: #2495)
- Federated learning for on-device voice model personalisation (Target: Q3 2026)
- GPU-accelerated noise suppression and codec processing (Target: Q4 2026)
- Voice command macros (user-defined shortcuts to AQL queries)
- Language detection and automatic locale switching
- Noise suppression preprocessing (RNNoise integration)
- Voice session playback and search in stored transcripts
- Multi-language TTS (German, French, Spanish voices)
- Emotion / sentiment detection from voice tone
- Real-time meeting transcription with action-item extraction (Target: Q1 2026)
- Unit test coverage > 80% (Issue: #2355) — `test_voice_assistant.cpp`, `test_voice_coverage.cpp`, `test_voice_production.cpp` (496+ tests); focused targets: `VoiceProductionFocusedTests`, `VoiceCoverageFocusedTests`
- Integration tests (full pipeline: audio in → transcription → AQL → audio out) (Issue: #2356) — `VoiceProductionFocusedTests`
- Performance benchmarks (STT latency, TTS generation speed) (Issue: #2357) — `benchmarks/bench_voice_assistant.cpp`
- [I] Security audit (audio data storage, transcription PII handling) (Issue: #2358)
- [I] Documentation complete (Issue: #2359)
- API stability guaranteed (Issue: #2360) — VoiceAssistant session API stable since v1.x; new v1.1.0 APIs (telephony, biometric, browser streaming) marked stable
- Standalone focused test targets registered in `tests/CMakeLists.txt`: `VoiceProductionFocusedTests`, `VoiceCoverageFocusedTests`, `VoiceAssistantFocusedTests` (LLM-gated), `VoiceBrowserStreamingFocusedTests`, `VoiceTelephonyFocusedTests`
- CI workflow registered — `.github/workflows/voice-module-ci.yml` (VoiceProductionFocusedTests, VoiceCoverageFocusedTests, VoiceBrowserStreamingFocusedTests, VoiceTelephonyFocusedTests)
- Streaming STT operates in sliding-window mode (3 s window, 1 s step); true sample-by-sample streaming requires Whisper.cpp via the `THEMIS_ENABLE_WHISPER` build flag.
- Wake-word detection uses energy-based VAD gating and acoustic feature scoring (density, spectral centroid, crest factor). A neural wake-word model backend (e.g. Porcupine, openWakeWord) can be plugged in via `WakeWordDetector::scorePhrase()` without API changes.
- Multi-speaker diarization uses k-means++ clustering on sub-band acoustic features (RMS + ZCR). Accuracy degrades with more than 4 simultaneous speakers; a neural embedding backend (e.g. pyannote-style x-vector) can be substituted via `diarizeSegments()` without API changes.
- TTS voice quality depends on the llama.cpp model in use.
- Voice biometric authentication uses acoustic sub-band features (no external model required). A neural i-vector/x-vector backend can be plugged in via `VoiceBiometricAuthenticator`'s internal `extractFeatures()` without changing the public API. Liveness detection is heuristic-based (crest factor, spectral flatness, ZCR variability); a neural anti-spoofing model is recommended for production.
- VoiceAssistant session API is stable from v1.x.
- Audio format configuration (sample rate, encoding) may gain new options in v1.5.0; backward-compatible.
As of 2026-04-20 – source: src/UNUSED_FUNCTIONS_REPORT.md
- `NoiseSuppressor` – RNNoise-based noise suppression; currently exercised only by the voice production test. Action: add a ROADMAP ticket for production integration or mark as CANDIDATE_FOR_REMOVAL.
- `processRNNoiseFrames` – runs audio frames through the RNNoise model.
- `applyRNNoiseSuppression` – applies RNNoise to an entire audio buffer.
Action: for each symbol, decide to (1) wire it up, (2) test it, or (3) schedule it as CANDIDATE_FOR_REMOVAL.