LIMA includes optional native GPU-accelerated whisper servers for faster transcription during development.
Terminology: "Docker whisper" refers to Speaches (faster-whisper in a container), which is what starts when you run make up. "Native whisper" runs directly on your machine using GPU acceleration.
| Scenario | Recommendation |
|---|---|
| First-time setup | Docker (Speaches) - just works, no extra config |
| Production/consistency | Docker - predictable cold starts, no warmup needed |
| Development iteration | Native GPU - 3-5x faster after warmup |
| macOS Apple Silicon | Native MLX - fastest option, but needs warmup handling |
| Linux with NVIDIA GPU | Native CUDA - significant speedup over Docker |
Bottom line: Start with Docker. Switch to native if you're processing many recordings during development and want faster iteration.
# Start native whisper in background (auto-detects platform)
make whisper-native
# Check if running
make whisper-native-status
# View logs
make whisper-native-logs
# Stop
make whisper-native-stop

The native server runs on port 9001 by default (configurable via NATIVE_WHISPER_PORT in .env).
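
To confirm from a script that something is listening on the native whisper port, a minimal sketch along these lines may help. It assumes NATIVE_WHISPER_PORT is exported in the shell environment (it does not parse .env), and the helper name `port_is_open` is hypothetical, not part of LIMA. A successful TCP connection only shows that the port is open, not that the server is healthy.

```python
import os
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 9001 is the documented default; NATIVE_WHISPER_PORT overrides it.
    port = int(os.environ.get("NATIVE_WHISPER_PORT", "9001"))
    state = "reachable" if port_is_open("127.0.0.1", port) else "not reachable"
    print(f"native whisper port {port}: {state}")
```

For the server's own view of its state, make whisper-native-status remains the supported check.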
| Platform | Technology | Requirements |
|---|---|---|
| macOS (Apple Silicon) | Lightning Whisper MLX | M1/M2/M3/M4 chip |
| Linux | faster-whisper CUDA | NVIDIA GPU + drivers (nvidia-smi to verify) |
| Windows | faster-whisper CUDA | NVIDIA GPU + drivers |
Linux/Windows: Verify NVIDIA drivers
Run nvidia-smi in a terminal (Linux) or PowerShell/cmd (Windows):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:09:00.0 On | Off |
| 0% 48C P8 18W / 450W | 1775MiB / 24564MiB | 19% Default |
+-----------------------------------------+------------------------+----------------------+
If this fails, install or update your NVIDIA drivers.
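
The same driver check can be scripted, for example as a preflight step before starting the native server. This is a sketch only: the function name `nvidia_driver_ok` is hypothetical, and it treats "nvidia-smi is on PATH and exits with status 0" as a sufficient signal, which is an assumption rather than anything LIMA itself does.

```python
import shutil
import subprocess


def nvidia_driver_ok(binary: str = "nvidia-smi") -> bool:
    """Return True if `binary` is on PATH and exits with status 0."""
    path = shutil.which(binary)
    if path is None:
        # The tool is not installed or not on PATH.
        return False
    try:
        result = subprocess.run([path], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if nvidia_driver_ok():
        print("NVIDIA driver detected")
    else:
        print("nvidia-smi missing or failing; install or update your NVIDIA drivers")
```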
Windows note: Run the native whisper server in PowerShell or cmd, not WSL. CUDA requires direct access to Windows GPU drivers. WSL2 can access GPUs but requires additional configuration that's beyond LIMA's default setup.
Benchmark: 42-minute audio file
| Platform | Speed | Notes |
|---|---|---|
| macOS M4 Pro (MLX) | 166x realtime (~15s) | 5.3x faster than Docker, slow cold start |
| Linux RTX 4090 (CUDA) | 71x realtime (~36s) | 4.3x faster than Docker |
| Windows RTX 4090 (CUDA) | 39x realtime (~66s) | 2.8x faster than Docker |
| Docker Speaches | 14-33x realtime | Consistent, no warmup needed |
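
The Speed column relates to wall-clock time by a simple division: audio duration divided by the realtime factor. The sketch below reproduces that arithmetic for the 42-minute benchmark file. The function name is illustrative, and because the published factors are rounded, results can differ from the table's times by a second or two.

```python
def processing_seconds(audio_minutes: float, realtime_factor: float) -> float:
    """Estimate transcription time from audio length and realtime factor."""
    return audio_minutes * 60 / realtime_factor


if __name__ == "__main__":
    # Realtime factors taken from the benchmark table above.
    benchmarks = [
        ("macOS M4 Pro (MLX)", 166),
        ("Linux RTX 4090 (CUDA)", 71),
        ("Windows RTX 4090 (CUDA)", 39),
    ]
    for label, factor in benchmarks:
        print(f"{label}: ~{processing_seconds(42, factor):.0f}s")
```

By the same formula, the Docker range of 14-33x realtime corresponds to roughly 76 to 180 seconds for this file.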
- Docker Speaches: Fast cold start, consistent performance
- Native MLX (macOS): First request is slow (model loading), subsequent requests are very fast
- Native CUDA: Moderate cold start, faster than Docker after warmup
Set these in .env before starting:
| Variable | Default | Purpose |
|---|---|---|
| NATIVE_WHISPER_HOST | 0.0.0.0 | Bind address |
| NATIVE_WHISPER_PORT | 9001 | Server port |
| WHISPER_MODEL | Systran/faster-whisper-base | Model to use |
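
To illustrate how these variables and their defaults combine, here is a minimal sketch that overlays KEY=VALUE lines from .env-style text onto the documented defaults. How LIMA actually loads .env is not shown here, so the parser and the `load_settings` name are assumptions for illustration only.

```python
# Defaults as documented in the table above.
DEFAULTS = {
    "NATIVE_WHISPER_HOST": "0.0.0.0",
    "NATIVE_WHISPER_PORT": "9001",
    "WHISPER_MODEL": "Systran/faster-whisper-base",
}


def load_settings(env_text: str) -> dict[str, str]:
    """Overlay KEY=VALUE lines from .env-style text onto the defaults."""
    settings = dict(DEFAULTS)
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in settings:
            # Drop optional surrounding quotes around the value.
            settings[key] = value.strip().strip('"').strip("'")
    return settings


if __name__ == "__main__":
    example = "NATIVE_WHISPER_PORT=9100\n# unrelated settings are ignored\nOTHER=1\n"
    print(load_settings(example))
```

Variables that are absent from the text keep their default values, and unrelated keys are ignored.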
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | ~75MB | Fastest | Lower |
| base | ~150MB | Fast | Good |
| small | ~500MB | Medium | Better |
| medium | ~1.5GB | Slower | High |
| large-v3 | ~3GB | Slowest | Highest |
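
The default WHISPER_MODEL value, Systran/faster-whisper-base, suggests that the short names in the table map to model identifiers of the same form. The sketch below builds such an identifier, assuming every size follows the default's naming pattern. Whether each native backend (in particular the MLX server) accepts these identifiers is not stated here, so check services/whisper-server/README.md before relying on it. The helper name is hypothetical.

```python
# Short model names from the table above.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v3")


def whisper_model_id(size: str) -> str:
    """Build a WHISPER_MODEL value, assuming the default's naming pattern."""
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    return f"Systran/faster-whisper-{size}"


if __name__ == "__main__":
    # Prints a line that could be pasted into .env.
    print(f"WHISPER_MODEL={whisper_model_id('small')}")
```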
The Voice Memo workflow can be configured to use native whisper instead of Docker Speaches. Set NATIVE_WHISPER_PORT in .env before running make seed to configure the workflow automatically.
For platform-specific installation instructions, troubleshooting, and advanced configuration, see:
services/whisper-server/README.md
- Audio Processing Guide - Chunking strategies for long recordings
- Troubleshooting - Whisper-specific issues