Echo Discord Bot - Gemini Live Voice Integration
This document describes the voice chat system: its architecture, the responsibilities of each component, and the implementation, moving from high-level to low-level components.
- High-Level Overview
- Component Breakdown
- Data Flow
- File Responsibilities
- Current Concerns
- Resilience & Recovery
- Future Improvements
Enable real-time voice conversations between Discord users and the Gemini Live API, so that users can speak naturally and receive AI-generated audio responses.
Crucial for User Trust: Echo treats voice data as ephemeral.
- Stream: Audio is streamed in real-time chunks to Gemini.
- Process: It is converted to semantic vectors instantly.
- Discard: The raw audio buffer is overwritten/discarded immediately. WE DO NOT STORE AUDIO FILES. This architecture ensures we can offer safety features without becoming a surveillance tool.
┌─────────────────────────────────────────────────────────────┐
│                         USER LAYER                          │
│                   (Discord Voice Channel)                   │
└─────────────────────────────────────────────────────────────┘
                               ↓ Opus audio (Ephemeral)
┌─────────────────────────────────────────────────────────────┐
│                        COMMAND LAYER                        │
│                      commands/join.js                       │
│  • Entry point for voice sessions                           │
│  • Validates permissions and user state                     │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                     ORCHESTRATION LAYER                     │
│                   voiceSessionManager.js                    │
│  • Manages session lifecycle                                │
│  • Coordinates all components                               │
│  • Handles Gemini Live WebSocket                            │
└─────────────────────────────────────────────────────────────┘
        ↙                   ↓                      ↘
┌──────────────┐ ┌────────────────────┐ ┌─────────────────────┐
│ INPUT        │ │ OUTPUT             │ │ UTILITIES           │
│ UserAudio    │ │ AudioStream        │ │ voice/utils.js      │
│ Receiver.js  │ │ Manager.js         │ │                     │
│              │ │                    │ │ • PCM resampling    │
│ • Opus decode│ │ • Stream lifecycle │ │ • FFmpeg transcode  │
│ • Silence    │ │ • De-duplication   │ │ • Format conversion │
│ • Speaker    │ │ • Backpressure     │ │                     │
│   switching  │ │                    │ │                     │
└──────────────┘ └────────────────────┘ └─────────────────────┘
        ↓                   ↑                      ↑
        └───────────────────┴──────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                       EXTERNAL LAYER                        │
│  • Gemini Live API (WebSocket)                              │
│  • Discord Voice Gateway (UDP)                              │
│  • @discordjs/voice library                                 │
└─────────────────────────────────────────────────────────────┘
Level: High
Role: User-facing entry point
Responsibilities:
- Validates user is in voice channel
- Checks bot permissions (Connect, Speak)
- Prevents duplicate sessions
- Parses voice preset selection
- Initiates voice session via `startVoiceSession()`
- Provides user feedback
Key Functions:
- `execute(interaction)` - Main command handler
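The validation steps of the join command can be sketched as a pure function. This is a minimal illustration under assumptions: `validateJoinRequest` and its input fields are hypothetical and are not taken from `commands/join.js`; in the real command these values would be read from the Discord interaction.

```javascript
// Hypothetical helper: field names are illustrative, not the bot's actual API.
const REQUIRED_PERMISSIONS = ['Connect', 'Speak'];

function validateJoinRequest({ voiceChannelId, botPermissions, activeSessionChannelIds }) {
  // 1. The user must be in a voice channel.
  if (!voiceChannelId) {
    return { ok: false, reason: 'You need to join a voice channel first.' };
  }
  // 2. The bot needs both Connect and Speak in that channel.
  const missing = REQUIRED_PERMISSIONS.filter((name) => !botPermissions.includes(name));
  if (missing.length > 0) {
    return { ok: false, reason: `Missing permissions: ${missing.join(', ')}` };
  }
  // 3. Only one session per voice channel.
  if (activeSessionChannelIds.has(voiceChannelId)) {
    return { ok: false, reason: 'A voice session is already active in this channel.' };
  }
  return { ok: true };
}
```

Returning a reason string alongside the result keeps the user-feedback step separate from the checks themselves.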
Level: High-Medium
Role: Central coordinator for voice sessions
Responsibilities:
- Session Lifecycle:
  - Creates and destroys voice sessions
  - Manages session state (active/destroyed)
  - Tracks sessions per voice channel
- Connection Management:
  - Joins Discord voice channels
  - Connects to Gemini Live WebSocket
  - Handles connection errors and reconnection
- Component Coordination:
  - Initializes `AudioStreamManager` for output
  - Initializes `UserAudioReceiver` for input
  - Routes audio between Discord ↔ Gemini
- Business Logic:
  - Hangup detection (via summary keyword)
  - Idle timeout (60 seconds of silence)
  - Stage channel permission handling
  - Graceful goodbye (waits for audio to finish)
Key Classes:
- `VoiceSession` - Main session orchestrator
Key Functions:
- `startVoiceSession()` - Creates new session
- `endVoiceSession()` - Destroys existing session
- `getVoiceSession()` - Retrieves active session
- `handleServerMessage()` - Processes Gemini messages
- `forwardPcmChunk()` - Sends user audio to Gemini
- `signalAudioStreamEnd()` - Signals end of user speech
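Per-channel session tracking could look roughly like the sketch below. The function names come from the list above, but their signatures and the `Map`-based registry are assumptions; the real `VoiceSession` also owns the Discord connection and the Gemini WebSocket, which are omitted here.

```javascript
// Sketch only: a registry keyed by voice channel ID (assumed design).
const sessions = new Map();

class VoiceSession {
  constructor(channelId) {
    this.channelId = channelId;
    this.destroyed = false;
  }

  destroy() {
    // The real cleanup would also close streams, timers and connections.
    this.destroyed = true;
  }
}

function startVoiceSession(channelId) {
  if (sessions.has(channelId)) {
    throw new Error(`Voice session already active in channel ${channelId}`);
  }
  const session = new VoiceSession(channelId);
  sessions.set(channelId, session);
  return session;
}

function getVoiceSession(channelId) {
  return sessions.get(channelId) ?? null;
}

function endVoiceSession(channelId) {
  const session = sessions.get(channelId);
  if (!session) return false;
  session.destroy();
  sessions.delete(channelId); // removing the entry prevents orphaned sessions
  return true;
}
```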
Level: Medium
Role: Discord audio input processor
Responsibilities:
- Audio Capture:
  - Subscribes to Discord user audio streams
  - Decodes Opus → PCM (48kHz mono)
  - Forwards PCM to Gemini
- Silence Detection:
  - 750ms timer per utterance
  - Triggers `onSilence()` callback when user stops speaking
  - Signals Gemini to generate response
- Speaker Management:
  - Enforces single active speaker
  - Detects speaker switches
  - Interrupts current playback on switch
- Error Handling:
  - Graceful Opus decode errors (packet loss)
  - Auto-reconnection on stream failures
  - Error rate tracking (diagnostics)
Key Classes:
- `UserAudioReceiver`
Key Functions:
- `start()` - Begins listening to voice events
- `subscribe(userId)` - Subscribes to user's audio
- `unsubscribe(userId)` - Cleans up user's audio
- `handleChunk(userId, pcmBuffer)` - Processes audio chunk
- `handleSpeakerStart(userId)` - Manages speaker switching
Audio Flow:
Discord User → Opus Stream → Opus Decoder → PCM (48kHz mono) → Gemini
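The 750ms silence timer behaves like a per-user debounce: every audio chunk restarts the countdown, and `onSilence()` fires only when no chunk arrives in time. The sketch below illustrates that idea; the `SilenceDetector` class and its injectable timer functions are hypothetical and not the actual `UserAudioReceiver` code.

```javascript
// Hypothetical debounce-style silence detector (not the bot's real class).
class SilenceDetector {
  constructor(onSilence, { timeoutMs = 750, setTimer = setTimeout, clearTimer = clearTimeout } = {}) {
    this.onSilence = onSilence;
    this.timeoutMs = timeoutMs;
    this.setTimer = setTimer;
    this.clearTimer = clearTimer;
    this.timers = new Map(); // userId -> pending timer handle
  }

  // Call for every decoded PCM chunk; restarts the countdown for that user.
  handleChunk(userId) {
    this.cancel(userId);
    const { setTimer } = this;
    const handle = setTimer(() => {
      this.timers.delete(userId);
      this.onSilence(userId);
    }, this.timeoutMs);
    this.timers.set(userId, handle);
  }

  // Call on unsubscribe so no callback fires for a user who has left.
  cancel(userId) {
    if (this.timers.has(userId)) {
      const { clearTimer } = this;
      clearTimer(this.timers.get(userId));
      this.timers.delete(userId);
    }
  }
}
```

Injecting the timer functions is only a testing convenience; with the defaults the class uses the global `setTimeout` and `clearTimeout`.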
Level: Medium
Role: Discord audio output processor
Responsibilities:
- Stream Lifecycle:
  - Creates PassThrough stream per Gemini "turn"
  - Writes audio chunks to stream
  - Closes stream on `turnComplete`
- De-duplication:
  - MD5 hashing of first 1KB
  - Detects and skips duplicate chunks
  - Maintains hash history (last 10)
- Adaptive Backpressure:
  - Tracks write latency
  - Enables backpressure when latency > 50ms
  - Disables when latency < 20ms
  - Balances smoothness vs. latency
- Format Handling:
  - Delegates to `utils.js` for conversion
  - Supports PCM, WAV, MP3, OGG via FFmpeg
Key Classes:
- `AudioStreamManager`
Key Functions:
- `createStream()` - Creates new playback stream
- `writeChunk(buffer, mimeType)` - Writes audio chunk
- `closeStream()` - Ends current stream
- `flush()` - Emergency stop (interruption)
- `updateLatencyMetrics()` - Adaptive backpressure logic
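The de-duplication step (MD5 over the first 1KB, with a history of the last 10 hashes) might be sketched as follows. `ChunkDeduplicator` is a hypothetical name, and how the real `AudioStreamManager` stores its history is an assumption.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper illustrating prefix-hash de-duplication.
class ChunkDeduplicator {
  constructor({ historySize = 10, prefixBytes = 1024 } = {}) {
    this.historySize = historySize;
    this.prefixBytes = prefixBytes;
    this.recentHashes = []; // oldest first
  }

  // Returns true if a chunk with the same leading bytes was seen recently.
  isDuplicate(buffer) {
    const prefix = buffer.subarray(0, this.prefixBytes);
    const hash = createHash('md5').update(prefix).digest('hex');
    if (this.recentHashes.includes(hash)) return true;
    this.recentHashes.push(hash);
    if (this.recentHashes.length > this.historySize) this.recentHashes.shift();
    return false;
  }
}
```

Hashing only a prefix keeps the check cheap, at the cost that two different chunks sharing the same first 1KB would be treated as duplicates.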
Audio Flow:
Gemini → PCM/WAV/MP3 → Format Conversion → Discord Audio Player
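The two backpressure thresholds (enable above 50ms, disable below 20ms) form a hysteresis band, which avoids toggling when latency hovers around a single cut-off. A minimal sketch, assuming the controller is fed one latency value per write; `BackpressureController` is a hypothetical name, and whether the real `updateLatencyMetrics()` uses raw or averaged latency is not stated in this document.

```javascript
// Hypothetical hysteresis controller for adaptive backpressure.
class BackpressureController {
  constructor({ enableAboveMs = 50, disableBelowMs = 20 } = {}) {
    this.enableAboveMs = enableAboveMs;
    this.disableBelowMs = disableBelowMs;
    this.enabled = false;
  }

  // Feed the latest write latency; returns whether backpressure is active.
  update(latencyMs) {
    if (!this.enabled && latencyMs > this.enableAboveMs) {
      this.enabled = true;
    } else if (this.enabled && latencyMs < this.disableBelowMs) {
      this.enabled = false;
    }
    // Between the two thresholds the previous state is kept.
    return this.enabled;
  }
}
```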
Level: Low
Role: Audio processing primitives
Responsibilities:
- PCM Processing:
  - Sample rate conversion (e.g., 24kHz → 48kHz)
  - Channel conversion (mono → stereo)
  - Linear interpolation resampling
- Format Transcoding:
  - FFmpeg integration for WAV/MP3/OGG
  - Safe argument building (command injection prevention)
  - Stream-based processing
- Security:
  - MIME type sanitization
  - Prevents shell injection attacks
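One common way to get safe argument building is to map the MIME type through an allowlist and pass FFmpeg an argument array (for example to `child_process.spawn` without a shell), so no user-controlled string is ever interpolated into a command line. The sketch below assumes that approach; the allowlist contents and the exact flags are illustrative and may differ from what `voice/utils.js` does.

```javascript
// Assumed allowlist: MIME type -> FFmpeg input format name.
const FORMAT_BY_MIME = new Map([
  ['audio/wav', 'wav'],
  ['audio/x-wav', 'wav'],
  ['audio/mpeg', 'mp3'],
  ['audio/mp3', 'mp3'],
  ['audio/ogg', 'ogg'],
]);

// Returns the normalised base MIME type, or null if it is not allowlisted.
function sanitizeMimeType(mimeType) {
  // Drop parameters such as ";rate=24000" and normalise case.
  const base = String(mimeType ?? '').split(';')[0].trim().toLowerCase();
  return FORMAT_BY_MIME.has(base) ? base : null;
}

// Builds an argument array; only allowlisted constants end up in it.
function buildFfmpegArgs(mimeType) {
  const safe = sanitizeMimeType(mimeType);
  if (safe === null) throw new Error('Unsupported MIME type');
  return [
    '-hide_banner', '-loglevel', 'error',
    '-f', FORMAT_BY_MIME.get(safe), '-i', 'pipe:0', // read input from stdin
    '-f', 's16le', '-ar', '48000', '-ac', '2',      // 48kHz stereo 16-bit PCM
    'pipe:1',                                       // write output to stdout
  ];
}
```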
Key Functions:
- `ensurePcm48kStereo()` - Converts to Discord format
- `resampleMonoPcm()` - Resamples audio
- `downmixToMono()` - Multi-channel → mono
- `duplicateMonoToStereo()` - Mono → stereo
- `writeChunkToPlayback()` - Entry point for conversion
- `transcodeChunkToStream()` - FFmpeg transcoding
- `buildFfmpegArgs()` - Safe FFmpeg args
- `sanitizeMimeType()` - Security filter
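Linear-interpolation resampling and mono-to-stereo duplication can be illustrated on 16-bit little-endian PCM buffers. The function names match the list above, but the signatures and the sample format are assumptions made for this sketch.

```javascript
// Resamples 16-bit LE mono PCM by linear interpolation (assumed signature).
function resampleMonoPcm(input, fromRate, toRate) {
  const inSamples = Math.floor(input.length / 2);
  if (inSamples === 0 || fromRate === toRate) return Buffer.from(input);
  const outSamples = Math.floor((inSamples * toRate) / fromRate);
  const output = Buffer.alloc(outSamples * 2);
  for (let i = 0; i < outSamples; i++) {
    const pos = (i * fromRate) / toRate;             // position in input samples
    const left = Math.floor(pos);
    const right = Math.min(left + 1, inSamples - 1); // clamp at the last sample
    const frac = pos - left;
    const a = input.readInt16LE(left * 2);
    const b = input.readInt16LE(right * 2);
    output.writeInt16LE(Math.round(a + (b - a) * frac), i * 2);
  }
  return output;
}

// Copies each mono sample into both left and right channels.
function duplicateMonoToStereo(mono) {
  const samples = Math.floor(mono.length / 2);
  const stereo = Buffer.alloc(samples * 4);
  for (let i = 0; i < samples; i++) {
    const sample = mono.readInt16LE(i * 2);
    stereo.writeInt16LE(sample, i * 4);     // left
    stereo.writeInt16LE(sample, i * 4 + 2); // right
  }
  return stereo;
}
```

Chaining the two, `duplicateMonoToStereo(resampleMonoPcm(chunk, 24000, 48000))`, yields 48kHz stereo from 24kHz mono input.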
1. User speaks in Discord
↓
2. Discord sends Opus packets (UDP)
↓
3. @discordjs/voice receives packets
↓
4. UserAudioReceiver subscribes to stream
↓
5. prism-media decodes Opus → PCM
↓
6. UserAudioReceiver forwards PCM to VoiceSession
↓
7. VoiceSession.forwardPcmChunk() sends to Gemini
↓
8. Gemini Live WebSocket receives audio
↓
9. After 750ms silence, UserAudioReceiver triggers onSilence()
↓
10. VoiceSession.signalAudioStreamEnd() sent to Gemini
↓
11. Gemini processes and generates response
1. Gemini sends audio chunks via WebSocket
↓
2. VoiceSession.handleServerMessage() receives
↓
3. VoiceSession.handleModelPart() extracts audio
↓
4. AudioStreamManager.writeChunk() called
↓
5. Hash-based de-duplication check
↓
6. utils.writeChunkToPlayback() converts format
↓
7. PCM written to PassThrough stream
↓
8. Discord audio player plays stream
↓
9. On turnComplete, AudioStreamManager.closeStream()
1. User says "goodbye" or similar
↓
2. Audio sent to Gemini as normal
↓
3. Gemini includes "<<<TERMINATE_SESSION>>>" in text summary
↓
4. VoiceSession.handleTextPart() detects keyword
↓
5. Sets sessionEndRequested flag
↓
6. On turnComplete, checks flag
↓
7. VoiceSession.scheduleDestroyAfterPlayback() called
↓
8. Polls audio player until idle
↓
9. VoiceSession.destroy() called
↓
10. Cleanup: streams, timers, connections
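The hangup steps above can be sketched as a small state tracker plus a polling helper. `HangupTracker` and `destroyAfterPlayback` are hypothetical names, the 100ms poll interval is an assumption, and the sketch assumes the marker always arrives within a single text part.

```javascript
const TERMINATE_MARKER = '<<<TERMINATE_SESSION>>>';

// Hypothetical tracker for the "session end requested" flag.
class HangupTracker {
  constructor() {
    this.sessionEndRequested = false;
  }

  // Detects the marker and returns the text with the marker removed.
  handleTextPart(text) {
    if (text.includes(TERMINATE_MARKER)) {
      this.sessionEndRequested = true;
    }
    return text.split(TERMINATE_MARKER).join('').trim();
  }
}

// Polls until the audio player is idle, then destroys the session,
// so the goodbye audio is not cut off.
async function destroyAfterPlayback(isPlayerIdle, destroy, { pollMs = 100 } = {}) {
  while (!isPlayerIdle()) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  destroy();
}
```

On `turnComplete`, the caller would check `tracker.sessionEndRequested` and, if it is set, call `destroyAfterPlayback` with callbacks for the player state and the session teardown.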
| File | Layer | Primary Role | Lines of Code |
|---|---|---|---|
| `commands/join.js` | Command | User entry point | ~100 |
| `voiceSessionManager.js` | Orchestration | Session lifecycle & coordination | ~780 |
| `voice/UserAudioReceiver.js` | Input | Discord audio capture | ~290 |
| `voice/AudioStreamManager.js` | Output | Discord audio playback | ~250 |
| `voice/utils.js` | Utilities | Audio format conversion | ~320 |
commands/join.js
↓ (requires)
voiceSessionManager.js
↓ (requires)
├─ voice/UserAudioReceiver.js
│ ↓ (requires)
│ └─ @discordjs/voice
│ └─ prism-media
│
├─ voice/AudioStreamManager.js
│ ↓ (requires)
│ ├─ @discordjs/voice
│ └─ voice/utils.js
│ ↓ (requires)
│ ├─ prism-media (FFmpeg)
│ └─ node:stream
│
└─ gemini-live.js (Gemini WebSocket client)
Severity: High
Impact: Random disconnections (code 1008)
The Gemini Live Preview API is unstable and can randomly close connections. This is an external issue beyond our control. The bot handles it gracefully by destroying the session, but it disrupts user experience.
Evidence:
- Frequent `code 1008` errors in logs
- "Operation is not implemented, or supported, or enabled."
Mitigation:
- Graceful session cleanup on disconnect
- Clear user messaging about beta status
- Session state tracking prevents orphaned sessions
Severity: Medium
Impact: Occasional audio loss (1-2 seconds)
Discord packet loss causes Opus decoder corruption. The decoder cannot recover and must be restarted, resulting in brief audio gaps.
Evidence:
- `TypeError: The compressed data passed is corrupted`
- Occurs in `prism-media` library (external)
Mitigation:
- Auto-reconnection via `unsubscribe()` → new decoder
- Error rate tracking (diagnostics)
- Conditional logging (reduces console spam)
Goal: Auto-reconnect on Gemini Live disconnections instead of destroying session
Design:
- Exponential Backoff: 1s, 2s, 4s, 8s (max 30s)
- Retry Logic: Max 3 attempts before giving up
- Context Preservation: Preserves conversation context across reconnections so the bot "remembers" what was just said
- User Feedback: Notifies users of reconnection attempts via Discord messages
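The backoff schedule and retry limit can be sketched as below. `backoffDelayMs` and `reconnectWithBackoff` are hypothetical names, and `connect` and `onAttempt` stand in for the real Gemini reconnect and the Discord notification. Note that with a limit of 3 attempts only the 1s, 2s and 4s delays are reached; the 8s step and the 30s cap would only matter if the limit were raised.

```javascript
// Delay before the given attempt (0-based): 1s, 2s, 4s, 8s, ... capped at 30s.
function backoffDelayMs(attempt, { baseMs = 1000, maxMs = 30000 } = {}) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Hypothetical retry loop; `connect` must return a promise.
async function reconnectWithBackoff(connect, { maxAttempts = 3, onAttempt = () => {} } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    onAttempt(attempt + 1, maxAttempts); // e.g. post "Reconnecting (1/3)..." to Discord
    try {
      return await connect();
    } catch (error) {
      lastError = error;
    }
  }
  throw new Error(`Reconnect failed after ${maxAttempts} attempts`, { cause: lastError });
}
```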
Benefits:
- Better user experience during API instability
- Maintains conversation state
- Reduces need to manually `/join` again
Risks & Mitigation:
- Risk: Session state inconsistencies
- Mitigation: Strict state tracking in `voiceSessionManager.js`
Priority: Medium
Complexity: High
Integrate noise suppression library (e.g., RNNoise) to filter background noise before sending to Gemini.
Benefits:
- Cleaner audio input
- Better Gemini transcription accuracy
- More professional sound
Implementation:
- Add RNNoise wrapper in `voice/utils.js`
- Process PCM in `UserAudioReceiver` before forwarding
- Optional toggle via `/join` parameter
Priority: Low
Complexity: Very High
Prevent bot's own audio from being captured by users' microphones.
Challenge:
- Requires acoustic echo cancellation (AEC)
- Complex signal processing
- May not be necessary (Discord handles per-user audio)
Priority: Medium
Complexity: High
Goal: Allow multiple users to speak simultaneously without interruption
Current Limitation:
- Single active speaker enforced
- Speaker switches interrupt current playback
Design:
- Audio mixing in `UserAudioReceiver`
- Speaker identification via metadata
- Gemini context: "User A: ..., User B: ..."
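As a rough idea of what the mixing step could involve, the sketch below sums time-aligned 16-bit PCM buffers sample by sample and clamps the result. `mixPcm16` is a hypothetical helper; a real mixer would also have to align streams in time and handle cross-talk, which this ignores.

```javascript
// Hypothetical mixer: sums 16-bit LE PCM buffers with clamping.
function mixPcm16(buffers) {
  // Output is as long as the longest input (whole samples only).
  const length = Math.max(0, ...buffers.map((b) => b.length - (b.length % 2)));
  const mixed = Buffer.alloc(length);
  for (let offset = 0; offset < length; offset += 2) {
    let sum = 0;
    for (const buffer of buffers) {
      if (offset + 2 <= buffer.length) sum += buffer.readInt16LE(offset);
    }
    // Clamp to the int16 range to avoid wrap-around distortion.
    mixed.writeInt16LE(Math.max(-32768, Math.min(32767, sum)), offset);
  }
  return mixed;
}
```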
Challenges:
- Gemini context window limits
- Audio mixing complexity
- Cross-talk handling
Benefits:
- Natural group conversations
- Better Discord voice chat experience
Priority: Low
Complexity: Low
Goal: Adapt silence timeout based on speech patterns
Current: Fixed 750ms timeout
Design:
- Track user's average pause length
- Adjust timeout: Fast speakers (500ms), slow speakers (1000ms)
- Per-user calibration
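One possible way to derive the timeout from observed pauses is an exponential moving average scaled by a safety margin and clamped to the 500ms to 1000ms range. Everything in this sketch (the class name, the smoothing factor, the margin) is an assumption rather than a settled design; per-user calibration would keep one instance per user ID.

```javascript
// Hypothetical adaptive timeout based on a moving average of pause lengths.
class AdaptiveSilenceTimeout {
  constructor({ minMs = 500, maxMs = 1000, initialMs = 750, smoothing = 0.2, margin = 1.5 } = {}) {
    this.minMs = minMs;
    this.maxMs = maxMs;
    this.smoothing = smoothing;
    this.margin = margin;
    // Start so that the first timeout equals the current fixed value.
    this.averagePauseMs = initialMs / margin;
  }

  // Record the length of a pause observed inside an utterance.
  recordPause(pauseMs) {
    this.averagePauseMs += this.smoothing * (pauseMs - this.averagePauseMs);
  }

  // Timeout = average pause plus margin, clamped to [minMs, maxMs].
  get timeoutMs() {
    const target = Math.round(this.averagePauseMs * this.margin);
    return Math.min(this.maxMs, Math.max(this.minMs, target));
  }
}
```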
Benefits:
- More natural conversation flow
- Fewer false "end of utterance" triggers
Priority: Medium
Complexity: Medium
Goal: Track and expose audio quality metrics for debugging
Metrics:
- Packet loss rate (Opus errors / total packets)
- Average write latency
- Duplicate detection rate
- Silence detection accuracy
Implementation:
- Dashboard endpoint: `/api/voice/metrics`
- Periodic logging (every 5 minutes)
- Per-session statistics
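A per-session collector for the first three metrics could be as small as the sketch below. `VoiceMetrics` and its method names are hypothetical; the snapshot object is the kind of payload the metrics endpoint or the periodic log line might expose.

```javascript
// Hypothetical per-session metrics collector.
class VoiceMetrics {
  constructor() {
    this.totalPackets = 0;
    this.opusErrors = 0;
    this.totalChunks = 0;
    this.duplicateChunks = 0;
    this.writtenChunks = 0;
    this.totalWriteLatencyMs = 0;
  }

  // Call once per received Opus packet.
  recordPacket({ decodeError = false } = {}) {
    this.totalPackets += 1;
    if (decodeError) this.opusErrors += 1;
  }

  // Call once per Gemini audio chunk; duplicates are skipped, not written.
  recordChunk({ duplicate = false, writeLatencyMs = 0 } = {}) {
    this.totalChunks += 1;
    if (duplicate) {
      this.duplicateChunks += 1;
    } else {
      this.writtenChunks += 1;
      this.totalWriteLatencyMs += writeLatencyMs;
    }
  }

  snapshot() {
    const ratio = (part, whole) => (whole === 0 ? 0 : part / whole);
    return {
      packetLossRate: ratio(this.opusErrors, this.totalPackets),
      averageWriteLatencyMs: ratio(this.totalWriteLatencyMs, this.writtenChunks),
      duplicateRate: ratio(this.duplicateChunks, this.totalChunks),
    };
  }
}
```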
Benefits:
- Identify quality issues
- Tune backpressure thresholds
- Validate improvements
Priority: High
Complexity: Medium
Goal: Improve silence detection with proper VAD algorithm
Current: Timer-based (750ms)
Design:
- Use WebRTC VAD or similar library
- Analyze audio energy levels
- More accurate "end of speech" detection
Benefits:
- Faster response times (no fixed delay)
- Better handling of natural pauses
- Reduces false triggers
Implementation:
- Add VAD library to `UserAudioReceiver`
- Replace timer-based logic
- Configurable sensitivity
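To make "analyze audio energy levels" concrete, here is a plain energy-threshold check on a 16-bit PCM frame. This is a much simpler technique than WebRTC VAD and is shown only as an illustration; `rmsLevel`, `isSpeech` and the default threshold are assumptions, and the threshold plays the role of the configurable sensitivity.

```javascript
// Root-mean-square level of a 16-bit LE PCM frame, normalised to 0..1.
function rmsLevel(pcm) {
  const samples = Math.floor(pcm.length / 2);
  if (samples === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples; i++) {
    const value = pcm.readInt16LE(i * 2) / 32768;
    sumSquares += value * value;
  }
  return Math.sqrt(sumSquares / samples);
}

// Simple energy threshold; lower values make detection more sensitive.
function isSpeech(pcm, threshold = 0.02) {
  return rmsLevel(pcm) >= threshold;
}
```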
Priority: Low
Complexity: Medium
Goal: Persist voice conversation summaries to database
Design:
- Store Gemini text summaries per turn
- Link to user ID and timestamp
- Queryable via `/history voice` command
Benefits:
- User can review past conversations
- Analytics on usage patterns
- Debugging conversation flow
Schema:
voice_turns (
  id, session_id, user_id, timestamp,
  summary_text, duration_ms
)
Priority: Low
Complexity: High
Goal: Allow users to wake bot with custom phrase
Current: Bot always listening when in channel
Design:
- Wake word detection library (e.g., Porcupine)
- User configures: `/voice wake-word "Hey Echo"`
- Bot only forwards audio after wake word
Benefits:
- Privacy (bot not always listening)
- Better channel etiquette
- Optional "push-to-talk" alternative
Challenges:
- Wake word detection accuracy
- Latency overhead
- Multiple wake words per channel
Priority: Low
Complexity: Low
Goal: Allow users to switch voice preset mid-conversation
Current: Voice selected at /join, fixed for session
Design:
- New command: `/voice preset Charon`
- Reconnects Gemini with new voice
- Maintains conversation context
Implementation:
- Add `changeVoicePreset()` to `VoiceSession`
- Disconnect current, reconnect with new voice
- Preserve conversation history
Priority: Low
Complexity: Medium
Goal: Allow users to record voice sessions
Design:
- `/voice record start` command
- Save raw PCM or transcode to MP3
- Upload to Discord or S3
- `/voice record stop` to end
Legal Considerations:
- Require explicit user consent
- Notify all participants
- Compliance with recording laws
Implementation:
- Add recording stream in `AudioStreamManager`
- Tee audio to file writer
- Metadata: timestamp, participants
The voice system is a modular design that bridges Discord voice chat with the Gemini Live API. Its known limitations (Gemini Live API instability and Opus decode errors) are handled through graceful session cleanup and decoder restarts, and the separation into command, orchestration, input, output, and utility layers leaves room for the enhancements listed above.
Implemented Features:
- ✅ Reconnection strategy with exponential backoff
- ✅ Context preservation across reconnections
- ✅ Cross-modal awareness (voice ↔ text)
- ✅ Hangup detection via keywords
Next Steps:
- Monitor production stability
- Gather user feedback on voice quality
- Evaluate VAD integration (high impact)
Last updated: 2026-02-03