
Roadmap note: convert vague bullets without acceptance criteria into checkbox tasks. Format: - [ ] <Task> (Target: <Q/Year>).

Voice Module Roadmap

Current Status

v1.1.0 – Production-ready voice assistant system. The VoiceAssistant orchestrator is implemented, together with Whisper-based STT, llama.cpp TTS/LLM integration, session management, phone call transcription, meeting protocol generation, real-time browser WebSocket streaming, voice biometric authentication, and a telephony bridge (SIP/WebRTC).

Completed ✅

  • VoiceAssistant – central coordinator for all voice interaction
  • STT processing via Whisper AI (speaker diarization, timestamps)
  • LLM integration via EmbeddedLLM / LlamaWrapper (intent recognition, query generation, response generation)
  • TTS synthesis with audio format output
  • Voice command processing pipeline (audio → STT → LLM → TTS → audio)
  • Session state and conversation history management
  • Context-aware conversational AI
  • Phone call recording and transcription
  • Meeting protocol generation
  • Voice-based database query interface
  • Storage and retrieval of voice session data
  • Key point and summary extraction
  • Real-time streaming STT (word-by-word transcription as audio arrives) (Issue: #2496)
  • Wake-word detection for hands-free activation (Issue: #2365)
  • Voice biometric authentication (speaker verification) (Issue: #2494)
  • Multi-speaker diarization improvements (Issue: #2497)
  • WebSocket audio streaming endpoint for browser clients (Issue: #2350)
  • Integration with telephony systems — SIP call sessions, WebRTC peer connections, IVR engine, TelephonyBridge coordinator (Issue: #2495)

In Progress 🚧

  • (none)

Planned Features 📋

Long-term (6-12 months)

  • Federated learning for on-device voice model personalisation (Target: Q3 2026)
  • GPU-accelerated noise suppression and codec processing (Target: Q4 2026)

Implementation Phases

Phase 1: Voice Pipeline & Session Management (Status: Completed ✅)

  • VoiceAssistant – central coordinator for all voice interaction
  • STT processing via Whisper AI (speaker diarization, timestamps)
  • LLM integration via EmbeddedLLM / LlamaWrapper (intent recognition, query generation, response generation)
  • TTS synthesis with audio format output
  • Voice command processing pipeline (audio → STT → LLM → TTS → audio)
  • Session state and conversation history management
  • Context-aware conversational AI
  • Phone call recording and transcription
  • Meeting protocol generation
  • Voice-based database query interface
  • Storage and retrieval of voice session data
  • Key point and summary extraction

Phase 2: Streaming STT & Wake-Word Detection (Status: Completed ✅)

  • Real-time streaming STT (word-by-word transcription as audio arrives)
  • Wake-word detection for hands-free activation
  • Multi-speaker diarization improvements

Phase 3: Voice Macros & Browser Streaming (Status: Completed ✅)

  • Voice command macros (user-defined shortcuts to AQL queries)
  • Language detection and automatic locale switching
  • Noise suppression preprocessing (RNNoise integration)
  • WebSocket audio streaming endpoint for browser clients (Issue: #2350)
  • Voice session playback and search in stored transcripts

Phase 4: Multi-Language TTS & Biometric Authentication (Status: Completed ✅)

  • Multi-language TTS (German, French, Spanish voices)
  • Emotion / sentiment detection from voice tone
  • Voice biometric authentication (speaker verification)
  • Real-time meeting transcription with action-item extraction (Target: Q1 2026)
  • Integration with telephony systems (SIP / WebRTC) (Issue: #2495)

Production Readiness Checklist

  • Unit tests coverage > 80% (Issue: #2355) — test_voice_assistant.cpp, test_voice_coverage.cpp, test_voice_production.cpp (496+ tests); focused targets: VoiceProductionFocusedTests, VoiceCoverageFocusedTests
  • Integration tests (full pipeline: audio in → transcription → AQL → audio out) (Issue: #2356) — VoiceProductionFocusedTests
  • Performance benchmarks (STT latency, TTS generation speed) (Issue: #2357) — benchmarks/bench_voice_assistant.cpp
  • [I] Security audit (audio data storage, transcription PII handling) (Issue: #2358)
  • [I] Documentation complete (Issue: #2359)
  • API stability guaranteed (Issue: #2360) — VoiceAssistant session API stable from v1.x; new v1.1.0 APIs (telephony, biometric, browser streaming) marked stable
  • Standalone focused test targets registered in tests/CMakeLists.txt: VoiceProductionFocusedTests, VoiceCoverageFocusedTests, VoiceAssistantFocusedTests (LLM-gated), VoiceBrowserStreamingFocusedTests, VoiceTelephonyFocusedTests
  • CI workflow registered — .github/workflows/voice-module-ci.yml (VoiceProductionFocusedTests, VoiceCoverageFocusedTests, VoiceBrowserStreamingFocusedTests, VoiceTelephonyFocusedTests)

Known Issues & Limitations

  • Streaming STT operates in sliding-window mode (3 s window, 1 s step); true sample-by-sample streaming requires a Whisper.cpp build with the THEMIS_ENABLE_WHISPER flag.
  • Wake-word detection uses energy-based VAD gating and acoustic feature scoring (density, spectral centroid, crest factor). A neural wake-word model backend (e.g. Porcupine, openWakeWord) can be plugged in via WakeWordDetector::scorePhrase() without API changes.
  • Multi-speaker diarization uses k-means++ clustering on sub-band acoustic features (RMS + ZCR). Accuracy degrades with more than 4 simultaneous speakers; a neural embedding backend (e.g., pyannote-style x-vector) can be substituted via diarizeSegments() without API changes.
  • TTS voice quality depends on the llama.cpp model in use.
  • Voice biometric authentication uses acoustic sub-band features (no external model required). A neural i-vector/x-vector backend can be plugged in via VoiceBiometricAuthenticator's internal extractFeatures() without changing the public API. Liveness detection is heuristic-based (crest factor, spectral flatness, ZCR variability); a neural anti-spoofing model is recommended for production.

Breaking Changes

  • VoiceAssistant session API is stable from v1.x.
  • Audio format configuration (sample rate, encoding) may gain new options in v1.5.0; backward-compatible.

Latent Symbols (Unused-Functions Audit)

As of 2026-04-20 – Source: src/UNUSED_FUNCTIONS_REPORT.md

🧪 NUR_TESTS (implemented, no production caller)

  • NoiseSuppressor – RNNoise-based noise suppression; exercised only in the voice production test

    Action: add a ROADMAP ticket for production integration, or mark as CANDIDATE_FOR_REMOVAL.

🟡 UNGENUTZT (no test, no external caller)

  • processRNNoiseFrames – processes audio frames through the RNNoise model
  • applyRNNoiseSuppression – applies RNNoise to an entire audio buffer

    Action: decide per symbol: (1) wire it up, (2) test it, or (3) schedule it as CANDIDATE_FOR_REMOVAL.