A pure MCP (Model Context Protocol) server for Discord voice channel transcription, written in Go. Control your Discord bot entirely through Claude Desktop or other MCP clients - no Discord commands needed.
Component | Details |
---|---|
Docker Image | ~12 MB (minimal) / ~50 MB (with ffmpeg) / ~500 MB (whisper with GPU) |
Binary Size | ~15 MB |
Memory Usage | ~10-20 MB (base) / ~200-500 MB (with Whisper) |
Language | Go 1.25 |
MCP SDK | v0.2.0 (official Go SDK) |
GPU Support | CUDA, ROCm, Vulkan (auto-detected) |
- Create a Discord Bot at https://discord.com/developers/applications
- Get your Discord User ID (Enable Developer Mode in Discord settings → Right-click your username → Copy User ID)
- Invite bot to your server with the following permissions:
Permission | Why It's Needed |
---|---|
View Channels | See available voice channels |
Connect | Join voice channels |
Speak | Transmit audio in voice channels |
Use Voice Activity | Detect when users are speaking |
Minimum permission integer: 3145728 (for the OAuth2 URL generator)
- Go to Discord Developer Portal
- Create a new application and bot
- Copy the bot token
- Generate an invite link:
  - Go to OAuth2 → URL Generator
  - Select scopes: `bot`
  - Select permissions: `View Channels`, `Connect`, `Speak`, `Use Voice Activity`
  - Or use this template URL (replace `YOUR_CLIENT_ID`):
    https://discord.com/api/oauth2/authorize?client_id=YOUR_CLIENT_ID&permissions=3145728&scope=bot
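The permissions value in that URL is a standard Discord permission bitmask: each permission is a single bit, and the integer is their bitwise OR. A small Go sketch using Discord's documented flag values (the constant names are illustrative):

```go
package main

import "fmt"

// Discord permission bit flags (values from Discord's permissions documentation).
const (
	PermViewChannels     = 1 << 10 // View Channels
	PermConnect          = 1 << 20 // Connect (voice)
	PermSpeak            = 1 << 21 // Speak (voice)
	PermUseVoiceActivity = 1 << 25 // Use Voice Activity
)

func main() {
	// Connect | Speak already yields the quoted minimum of 3145728; the other
	// two permissions add their own bits if you include them in the invite.
	fmt.Println(PermConnect | PermSpeak)                                           // 3145728
	fmt.Println(PermViewChannels | PermConnect | PermSpeak | PermUseVoiceActivity) // 36701184
}
```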
# Run the MCP server with your user ID
docker run -i --rm \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
ghcr.io/fankserver/discord-voice-mcp:latest
Add to your Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):
{
"mcpServers": {
"discord-voice": {
"command": "docker",
"args": [
"run", "-i", "--rm",
"-e", "DISCORD_TOKEN=your-bot-token",
"-e", "DISCORD_USER_ID=your-discord-user-id",
"ghcr.io/fankserver/discord-voice-mcp:latest"
]
}
}
}
# Windows
GOOS=windows GOARCH=amd64 go build -o discord-voice-mcp.exe
# macOS
GOOS=darwin GOARCH=amd64 go build -o discord-voice-mcp-mac
# Linux ARM (Raspberry Pi)
GOOS=linux GOARCH=arm64 go build -o discord-voice-mcp-arm
This is a pure MCP server that connects to Discord. All control is through MCP tools - no Discord commands.
cmd/discord-voice-mcp/
└── main.go - Entry point, MCP server startup
internal/
├── mcp/
│ └── server.go - MCP tool implementations
├── bot/
│ └── bot.go - Discord voice connection handler
├── audio/
│ └── processor.go - Audio capture & processing
└── session/
└── manager.go - Transcript session management
pkg/
└── transcriber/
└── transcriber.go - Transcription provider interface
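The split above keeps the transcription backend swappable behind `pkg/transcriber`. The actual interface in `transcriber.go` is not reproduced here, but a provider abstraction of this kind typically looks roughly like the following sketch (method names are assumptions, not the project's exact API):

```go
package transcriber

import "context"

// Transcriber is a sketch of the provider interface that the mock, Whisper,
// and Google backends would implement. Method names are illustrative and may
// not match transcriber.go exactly.
type Transcriber interface {
	// Transcribe turns a buffer of decoded PCM audio samples into text.
	Transcribe(ctx context.Context, pcm []int16) (string, error)
	// Close releases models and other resources held by the provider.
	Close() error
}
```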
- MCP-First: All control through MCP tools, no Discord text commands
- User-Centric: Tools work with "your channel" via DISCORD_USER_ID
- Auto-Follow: Bot can automatically follow you between channels (see the sketch after this list)
- Stateless Commands: Each MCP tool call is independent
- Session-Based: Transcripts organized by voice sessions
- GPU Acceleration: Automatic detection of NVIDIA/AMD/Intel GPUs for 5-10x faster transcription
- Universal Image: Single Docker image works on any hardware (GPU or CPU)
- Lightweight: 12MB minimal Docker image, 50MB with ffmpeg, 500MB with full GPU support
- Fast Startup: Sub-second initialization
- Cross-Platform: Compile for Windows, macOS, Linux, ARM
- Concurrent: Go's goroutines handle multiple audio streams efficiently
- Clean Shutdown: Proper resource cleanup with context cancellation
- Structured Logging: Configurable log levels for debugging
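To make the auto-follow idea concrete, the sketch below shows how a voice-state event handler could move the bot with the tracked user. It assumes the bot is built on `bwmarrin/discordgo`; the function and variable names are illustrative, not the project's actual code.

```go
package bot

import (
	"log"

	"github.com/bwmarrin/discordgo"
)

// RegisterFollowHandler sketches auto-follow: whenever the tracked user
// (DISCORD_USER_ID) moves to another voice channel, the bot moves too.
// Synchronization around followEnabled is omitted for brevity.
func RegisterFollowHandler(s *discordgo.Session, trackedUserID string, followEnabled *bool) {
	s.AddHandler(func(sess *discordgo.Session, vsu *discordgo.VoiceStateUpdate) {
		if !*followEnabled || vsu.UserID != trackedUserID {
			return
		}
		if vsu.ChannelID == "" {
			return // user left voice entirely; keep the current connection
		}
		// Join or move to the user's new channel. mute=false, deaf=false so
		// the bot can still receive audio for transcription.
		if _, err := sess.ChannelVoiceJoin(vsu.GuildID, vsu.ChannelID, false, false); err != nil {
			log.Printf("auto-follow failed: %v", err)
		}
	})
}
```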
- Go 1.25+
- FFmpeg (for audio processing with the standard Docker image)
- Discord Bot Token
- (Optional) Whisper.cpp and model file for real transcription
# Get dependencies
go mod download
# Run tests
go test ./...
# Build with optimizations
go build -ldflags="-w -s" -o discord-voice-mcp
# Check binary size
ls -lh discord-voice-mcp
# -rwxr-xr-x 1 user staff 15M discord-voice-mcp
Variable | Required | Description | Example |
---|---|---|---|
`DISCORD_TOKEN` | ✅ | Bot token from the Discord Developer Portal | `MTIz...` |
`DISCORD_USER_ID` | ✅ | Your Discord user ID for "my channel" commands | `123456789012345678` |
`LOG_LEVEL` | ❌ | Logging verbosity (default: `info`) | `debug`, `info`, `warn`, `error` |
`TRANSCRIBER_TYPE` | ❌ | Transcription provider (default: `mock`) | `mock`, `whisper`, `google` |
`WHISPER_MODEL_PATH` | ❌ | Path to Whisper model (required if using `whisper`) | `/models/ggml-base.en.bin` |
`AUDIO_BUFFER_DURATION_SEC` | ❌ | Buffer duration trigger (default: `2`) | `1`, `2`, `5` |
`AUDIO_SILENCE_TIMEOUT_MS` | ❌ | Silence detection timeout (default: `1500`) | `500`, `1500`, `3000` |
`AUDIO_MIN_BUFFER_MS` | ❌ | Minimum audio before transcription (default: `100`) | `50`, `100`, `200` |
`WHISPER_USE_GPU` | ❌ | Enable GPU acceleration (default: `true`) | `true`, `false` |
`CUDA_VISIBLE_DEVICES` | ❌ | Select NVIDIA GPU (default: `0`) | `0`, `1`, `all` |
`HIP_VISIBLE_DEVICES` | ❌ | Select AMD GPU (default: `0`) | `0`, `1` |
Tool | Description | Parameters |
---|---|---|
`join_my_voice_channel` | Join the voice channel where you are | None |
`follow_me` | Auto-follow you between voice channels | `enabled`: boolean |
`join_specific_channel` | Join a specific channel by ID | `guildId`, `channelId` |
`leave_voice_channel` | Leave current voice channel | None |
`get_bot_status` | Get bot connection status | None |
`list_sessions` | List all transcription sessions | None |
`get_transcript` | Get transcript for a session | `sessionId` |
`export_session` | Export session to JSON | `sessionId` |
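For illustration, the heart of `join_my_voice_channel` is "find the configured user's current voice state, then join that channel". A rough sketch with `bwmarrin/discordgo` (the library choice and helper name are assumptions, not the project's actual implementation):

```go
package bot

import (
	"fmt"

	"github.com/bwmarrin/discordgo"
)

// JoinUsersChannel scans the guilds the bot can see for the tracked user's
// voice state and joins that channel. Illustrative sketch only.
func JoinUsersChannel(s *discordgo.Session, userID string) (*discordgo.VoiceConnection, error) {
	for _, guild := range s.State.Guilds {
		for _, vs := range guild.VoiceStates {
			if vs.UserID == userID && vs.ChannelID != "" {
				// mute=false, deaf=false: the bot must receive audio to transcribe it.
				return s.ChannelVoiceJoin(guild.ID, vs.ChannelID, false, false)
			}
		}
	}
	return nil, fmt.Errorf("user %s is not in any visible voice channel", userID)
}
```

In practice you never call this directly; Claude invokes the MCP tool from prompts like the ones below.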
# Join your current voice channel
"Use the join_my_voice_channel tool"
# Enable auto-follow so bot follows you
"Enable follow_me to track my movements"
# Check bot status
"What's the bot status?"
# Get transcripts
"List all sessions and show me the latest transcript"
The server runs with mock transcription by default, which shows audio is being captured but doesn't transcribe actual content.
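Conceptually, the mock provider just acknowledges the audio it receives. Against the interface sketched earlier, it could look like this (illustrative only; the real mock's output format may differ):

```go
package transcriber

import (
	"context"
	"fmt"
)

// MockTranscriber is an illustrative stand-in for the default mock provider:
// it proves audio is being captured without doing any speech recognition.
type MockTranscriber struct{}

func (m *MockTranscriber) Transcribe(_ context.Context, pcm []int16) (string, error) {
	// Report how much audio arrived so users can confirm the pipeline works.
	return fmt.Sprintf("[mock] received %d PCM samples", len(pcm)), nil
}

func (m *MockTranscriber) Close() error { return nil }
```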
The Whisper Docker image (`ghcr.io/fankserver/discord-voice-mcp:whisper`) includes built-in GPU acceleration for NVIDIA (CUDA), AMD (ROCm), and Intel/other GPUs (Vulkan). The image automatically detects and uses available hardware acceleration, falling back to CPU if no GPU is available; a simplified sketch of the detection order follows the list below.
- NVIDIA GPUs: CUDA acceleration (5-10x faster)
- AMD GPUs: ROCm acceleration (5-10x faster)
- Intel/Other GPUs: Vulkan acceleration (3-5x faster)
- CPU Fallback: OpenBLAS acceleration (2-3x faster than baseline)
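The exact probing logic lives inside the image, but conceptually it tries the fastest available backend first and falls back down the list. A much simplified sketch of that order (the device paths used as probes here are assumptions, not the image's actual detection script):

```go
package main

import (
	"fmt"
	"os"
)

// detectAccelerator sketches the fallback order: CUDA, then ROCm, then Vulkan,
// then CPU (OpenBLAS). The probes are simplified placeholders.
func detectAccelerator() string {
	if _, err := os.Stat("/dev/nvidia0"); err == nil {
		return "cuda"
	}
	if _, err := os.Stat("/dev/kfd"); err == nil {
		return "rocm"
	}
	if _, err := os.Stat("/dev/dri"); err == nil {
		return "vulkan"
	}
	return "cpu"
}

func main() {
	fmt.Println("using accelerator:", detectAccelerator())
}
```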
# For multilingual support (recommended for non-English):
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin -O models/ggml-base.bin
# For German language specifically, use the multilingual models:
# - ggml-base.bin (142 MB) - good balance, supports 99 languages
# - ggml-small.bin (466 MB) - better accuracy for German
# - ggml-medium.bin (1.5 GB) - high accuracy
# - ggml-large-v3.bin (3.1 GB) - best accuracy
# For English-only (faster but no German support):
# - ggml-base.en.bin (142 MB) - English only
# - ggml-tiny.en.bin (39 MB) - fastest, English only
NVIDIA GPU:
docker run -i --rm --gpus all \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e TRANSCRIBER_TYPE="whisper" \
-e WHISPER_MODEL_PATH="/models/ggml-base.bin" \
-v $(pwd)/models:/models:ro \
ghcr.io/fankserver/discord-voice-mcp:whisper
AMD GPU:
docker run -i --rm \
--device=/dev/kfd --device=/dev/dri --group-add video \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e TRANSCRIBER_TYPE="whisper" \
-e WHISPER_MODEL_PATH="/models/ggml-base.bin" \
-v $(pwd)/models:/models:ro \
ghcr.io/fankserver/discord-voice-mcp:whisper
Intel/Other GPUs (via Vulkan):
docker run -i --rm --device=/dev/dri \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e TRANSCRIBER_TYPE="whisper" \
-e WHISPER_MODEL_PATH="/models/ggml-base.bin" \
-v $(pwd)/models:/models:ro \
ghcr.io/fankserver/discord-voice-mcp:whisper
CPU-Only (with OpenBLAS acceleration):
docker run -i --rm \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e TRANSCRIBER_TYPE="whisper" \
-e WHISPER_MODEL_PATH="/models/ggml-base.bin" \
-v $(pwd)/models:/models:ro \
ghcr.io/fankserver/discord-voice-mcp:whisper
The Google Speech-to-Text transcriber is a stub implementation that returns "Google transcription not implemented in PoC". Full implementation requires Google Cloud credentials integration.
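Against the provider interface sketched earlier, the stub is little more than a backend that always returns the placeholder text (illustrative sketch, not the actual source):

```go
package transcriber

import "context"

// GoogleTranscriber sketches the current stub: it returns a fixed placeholder
// instead of calling Google Cloud Speech-to-Text.
type GoogleTranscriber struct{}

func (g *GoogleTranscriber) Transcribe(_ context.Context, _ []int16) (string, error) {
	return "Google transcription not implemented in PoC", nil
}

func (g *GoogleTranscriber) Close() error { return nil }
```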
The Whisper Docker image includes automatic GPU detection and acceleration:
Hardware | Real-Time Factor | 10s Audio Processing Time | Speedup |
---|---|---|---|
CPU (no acceleration) | 0.5x | ~5 seconds | Baseline |
CPU (OpenBLAS) | 0.2x | ~2 seconds | 2-3x |
Intel GPU (Vulkan) | 0.1x | ~1 second | 5x |
AMD GPU (ROCm) | 0.05x | ~0.5 seconds | 10x |
NVIDIA GPU (CUDA) | 0.05x | ~0.5 seconds | 10x |
Lower Real-Time Factor is better. 0.1x means 10x faster than real-time.
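Put differently, expected latency is roughly RTF × audio duration. A quick illustration using the approximate factors from the table above:

```go
package main

import "fmt"

func main() {
	// Processing time ≈ real-time factor × audio duration.
	audioSeconds := 10.0
	rtfs := map[string]float64{
		"CPU (OpenBLAS)":     0.2,
		"Intel GPU (Vulkan)": 0.1,
		"NVIDIA GPU (CUDA)":  0.05,
	}
	for hw, rtf := range rtfs {
		fmt.Printf("%-18s ~%.1f s to transcribe %.0f s of audio\n", hw, rtf*audioSeconds, audioSeconds)
	}
}
```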
# Build universal GPU support (Vulkan - works on ALL GPUs)
docker build -f Dockerfile.whisper -t discord-voice-mcp:whisper .
# Build NVIDIA-optimized version (CUDA - maximum performance)
docker build -f Dockerfile.whisper-cuda -t discord-voice-mcp:whisper-cuda .
# Build standard version (no GPU acceleration)
docker build -f Dockerfile -t discord-voice-mcp:latest .
The most common cause of poor transcription is audio being split into chunks that are too small, causing loss of context. For example, "und meinen zwei Bären" (and my two bears) might be split into "und meinen zwei" and "Bären", causing Whisper to misinterpret "Bären" as "wären" (would be) without context.
Solution: Increase the buffer duration to capture complete sentences:
-e AUDIO_BUFFER_DURATION_SEC="5" # Default is 2, use 5-10 for better context
-e AUDIO_SILENCE_TIMEOUT_MS="2000" # Default is 1500, increase for natural pauses
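These knobs feed a simple flush decision in the audio processor: a chunk is handed to Whisper once the buffer is long enough or the speaker has paused long enough, but never before a minimum amount of audio has accumulated. A simplified sketch of that rule (the real processor works on PCM frames and per-user streams; names here are illustrative):

```go
package audio

import "time"

// Config mirrors the three AUDIO_* environment variables.
type Config struct {
	BufferDuration time.Duration // AUDIO_BUFFER_DURATION_SEC
	SilenceTimeout time.Duration // AUDIO_SILENCE_TIMEOUT_MS
	MinBuffer      time.Duration // AUDIO_MIN_BUFFER_MS
}

// shouldFlush reports whether the buffered audio should be transcribed now.
func shouldFlush(buffered, sinceLastVoice time.Duration, cfg Config) bool {
	if buffered < cfg.MinBuffer {
		return false // too little context; Whisper would likely mishear it
	}
	return buffered >= cfg.BufferDuration || sinceLastVoice >= cfg.SilenceTimeout
}
```

With a larger `BufferDuration`, whole sentences stay in one chunk, which is exactly what preserves context for words like "Bären".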
If you're experiencing poor transcription accuracy with German or other non-English languages (e.g., "Bär" being transcribed as "Bild"), follow these recommendations:
- Use a multilingual model (not the `.en` variants):

  # Download a multilingual model (small recommended for German)
  wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin -O models/ggml-small.bin

- Explicitly set the language:

  -e WHISPER_LANGUAGE="de"   # For German

- Use higher beam size for better accuracy:

  -e WHISPER_BEAM_SIZE="5"   # Default is 1 for speed, 5 for accuracy

- Complete example for German transcription:

  docker run -i --rm --gpus all \
    -e DISCORD_TOKEN="your-bot-token" \
    -e DISCORD_USER_ID="your-discord-user-id" \
    -e TRANSCRIBER_TYPE="whisper" \
    -e WHISPER_MODEL_PATH="/models/ggml-small.bin" \
    -e WHISPER_LANGUAGE="de" \
    -e WHISPER_BEAM_SIZE="5" \
    -e AUDIO_BUFFER_DURATION_SEC="5" \
    -e AUDIO_SILENCE_TIMEOUT_MS="2000" \
    -v $(pwd)/models:/models:ro \
    ghcr.io/fankserver/discord-voice-mcp:whisper-cuda
Important: The longer buffer (5 seconds) allows Whisper to maintain context across complete sentences, significantly improving accuracy for languages like German where word order and context are crucial.
Use Case | Model | Size | Languages | Accuracy |
---|---|---|---|---|
German/Multilingual | ggml-small.bin | 466 MB | 99 | Good |
German/Multilingual (Best) | ggml-medium.bin | 1.5 GB | 99 | High |
English Only | ggml-base.en.bin | 142 MB | 1 | Good |
Fast Testing | ggml-tiny.bin | 39 MB | 99 | Low |
Production German | ggml-large-v3.bin | 3.1 GB | 99 | Best |
The audio processing behavior can be customized using environment variables:
Variable | Default | Description |
---|---|---|
`AUDIO_BUFFER_DURATION_SEC` | `2` | Buffer duration in seconds before triggering transcription |
`AUDIO_SILENCE_TIMEOUT_MS` | `1500` | Silence duration in milliseconds that triggers transcription |
`AUDIO_MIN_BUFFER_MS` | `100` | Minimum audio duration in milliseconds before transcription |
`WHISPER_LANGUAGE` | `auto` | Language code for Whisper transcription (e.g., "en", "de", "es", "auto") |
`WHISPER_THREADS` | CPU cores | Number of threads for Whisper processing (defaults to `runtime.NumCPU()`) |
`WHISPER_BEAM_SIZE` | `1` | Beam size for Whisper (1 = fastest, 5 = most accurate) |
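For reference, mapping these variables to code is straightforward; a sketch of how the defaults in the table could be applied (the helper names are illustrative, not the project's actual config code):

```go
package config

import (
	"os"
	"runtime"
	"strconv"
)

// envInt reads an integer environment variable, falling back to def when it
// is unset or malformed.
func envInt(key string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(key)); err == nil {
		return v
	}
	return def
}

// LoadWhisperSettings shows how the documented defaults map to code.
func LoadWhisperSettings() (language string, threads, beamSize int) {
	language = os.Getenv("WHISPER_LANGUAGE")
	if language == "" {
		language = "auto"
	}
	threads = envInt("WHISPER_THREADS", runtime.NumCPU()) // default: all CPU cores
	beamSize = envInt("WHISPER_BEAM_SIZE", 1)             // 1 = fastest, 5 = most accurate
	return language, threads, beamSize
}
```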
Quick transcription with short pauses:
# Trigger after 1 second buffer or 500ms silence
docker run -i --rm \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e AUDIO_BUFFER_DURATION_SEC="1" \
-e AUDIO_SILENCE_TIMEOUT_MS="500" \
-e AUDIO_MIN_BUFFER_MS="50" \
ghcr.io/fankserver/discord-voice-mcp:latest
Longer recordings with natural pauses:
# Allow 3 second pauses, 5 second buffer
docker run -i --rm \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e AUDIO_BUFFER_DURATION_SEC="5" \
-e AUDIO_SILENCE_TIMEOUT_MS="3000" \
-e AUDIO_MIN_BUFFER_MS="200" \
ghcr.io/fankserver/discord-voice-mcp:latest
Multilingual transcription (preserve original language):
# Auto-detect and preserve original language
docker run -i --rm \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e WHISPER_LANGUAGE="auto" \
ghcr.io/fankserver/discord-voice-mcp:latest
Force specific language (recommended for better accuracy):
# Force German transcription with optimized settings
docker run -i --rm --gpus all \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e TRANSCRIBER_TYPE="whisper" \
-e WHISPER_MODEL_PATH="/models/ggml-small.bin" \
-e WHISPER_LANGUAGE="de" \
-e WHISPER_BEAM_SIZE="5" \
-e AUDIO_BUFFER_DURATION_SEC="5" \
-e AUDIO_SILENCE_TIMEOUT_MS="2000" \
-v $(pwd)/models:/models:ro \
ghcr.io/fankserver/discord-voice-mcp:whisper-cuda
# Other language codes: en (English), es (Spanish), fr (French), it (Italian), etc.
Optimize for faster transcription (reduce delay):
# Use more threads and smaller beam size for speed
docker run -i --rm \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e WHISPER_THREADS="8" \
-e WHISPER_BEAM_SIZE="1" \
-e AUDIO_SILENCE_TIMEOUT_MS="1000" \
ghcr.io/fankserver/discord-voice-mcp:whisper
Optimize for accuracy (slower but better quality):
# Use default threads but larger beam size
docker run -i --rm \
-e DISCORD_TOKEN="your-bot-token" \
-e DISCORD_USER_ID="your-discord-user-id" \
-e WHISPER_THREADS="4" \
-e WHISPER_BEAM_SIZE="5" \
ghcr.io/fankserver/discord-voice-mcp:whisper
- Meeting Transcription - Record Discord voice meetings
- Study Groups - Capture study session discussions
- Gaming Sessions - Document strategy discussions
- Podcast Recording - Transcribe Discord podcasts
- Resource Efficiency - Runs on Raspberry Pi or small VPS
- Fast Deployment - 12-50MB images deploy instantly
- Cost Efficiency - Small container footprint (12-50MB images)
- Cross-Platform - Single binary for any OS
- Claude Integration - Native MCP support
- ✅ Pure MCP Control - No Discord text commands needed
- ✅ User-Centric Tools - "Join my channel" functionality
- ✅ Auto-Follow Mode - Bot follows you automatically
- ✅ GPU Acceleration - CUDA, ROCm, Vulkan support with auto-detection
- ✅ Minimal Docker Images - 12MB minimal, 50MB with ffmpeg, 500MB with GPU
- ✅ Voice Connection - Stable Discord voice handling
- ✅ Session Management - Organized transcript storage
- ✅ Audio Pipeline - Real-time PCM processing
- ✅ MCP SDK Integration - Using official Go SDK v0.2.0
- ✅ Whisper Transcription - Complete implementation with whisper.cpp + GPU acceleration
- 🚧 Google Speech Integration - Currently stub implementation
- 🚧 Real-time Updates - Live transcript streaming
- 🚧 Multi-user Support - Track multiple speakers
- Integrate whisper.cpp for offline transcription (completed)
- Add Google Cloud Speech-to-Text (stub exists)
- Implement real-time streaming transcripts
- Speaker diarization (who said what)
- Sentiment analysis
- Keyword detection and alerts
- Multi-language support
- Kubernetes deployment manifests
- Multi-guild support
- Webhook integrations
- Transcript search API
Contributions are welcome! Areas of interest:
- Transcription provider implementations (Whisper, Google Speech)
- Additional MCP tools and features
- Performance optimizations
- Documentation improvements
Please ensure all tests pass before submitting PRs:
go test ./...
MIT