Real-time dictation, zero-shot voice cloning, and cinematic video dubbing — all on your desktop.
Open-source, no API keys, fully local. 646 languages.
Warning
OmniVoice Studio is in active beta. Things may break between releases. For the latest features and fixes, clone the repo and run from source rather than using pre-built installers. Bug reports and PRs are very welcome — open an issue or join Discord.
- 3-second clip → mirror any voice.
- Gender, age, accent, pitch, speed, …
- YouTube URL or file → transcribe → …
- Demucs-powered. Splits speech …
- Pyannote + WhisperX.
- Drop 50 videos, walk away.
- Use OmniVoice from Claude, …
- AudioSeal (Meta). Invisible, …
- No keys, no cloud, no accounts.
- CUDA · MPS · ROCm · CPU.
- Subclass …
Pick your path — from zero-install to full developer setup:
Pre-built installers (~6–8 MB) are on the Releases page. Download, install, launch. The app bootstraps a Python environment and downloads model weights automatically — the splash screen shows progress.
macOS — "app is damaged and can't be opened"
macOS quarantines apps downloaded outside the App Store. After dragging to /Applications:
xattr -cr /Applications/OmniVoice\ Studio.app
Open normally after. One-time fix.
Windows — first launch takes 5–10 minutes
The app bootstraps a Python virtual environment, installs dependencies, and downloads ffmpeg on first run. The splash screen shows each step. Subsequent launches start in seconds.
Linux — AppImage needs FUSE
If FUSE isn't available, use the .deb package or extract-and-run:
chmod +x OmniVoice.Studio_*.AppImage
./OmniVoice.Studio_*.AppImage --appimage-extract-and-run
Linux — White screen on Fedora 44 / Ubuntu 24.04
Some newer distros ship a WebKit/GTK version with compositing issues. Try:
WEBKIT_DISABLE_COMPOSITING_MODE=1 ./OmniVoice.Studio_*.AppImage
If that doesn't help, use the .deb package or run from source instead.
Installation fails behind a firewall / in Russia
The desktop app downloads Python from GitHub during first launch. If your network blocks GitHub:
- Install Python 3.11 manually from python.org
- Set UV_PYTHON_PREFERENCE=system before launching, or run from source with bun run dev
- For PyPI mirrors: set UV_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/
Pull the pre-built image from GitHub Container Registry:
docker pull ghcr.io/debpalash/omnivoice-studio:latest
Run it:
# CPU mode
docker run -d --name omnivoice \
-p 127.0.0.1:3900:3900 \
-v omnivoice-data:/app/omnivoice_data \
ghcr.io/debpalash/omnivoice-studio:latest
# NVIDIA GPU mode
docker run -d --name omnivoice --gpus all \
-p 127.0.0.1:3900:3900 \
-v omnivoice-data:/app/omnivoice_data \
ghcr.io/debpalash/omnivoice-studio:latest
Or use Docker Compose (recommended):
# CPU mode
docker compose -f deploy/docker-compose.yml --profile cpu up -d
# GPU mode (NVIDIA)
docker compose -f deploy/docker-compose.yml --profile gpu up -d
Open localhost:3900 once the health check passes. First run downloads ~4 GB of model weights; follow progress with docker compose logs -f.
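A script that needs to wait for the container before opening localhost:3900 can poll until the server answers. The following is a minimal sketch, assuming the root URL returns a successful HTTP response once startup has finished; the actual health endpoint is not documented here, and wait_until_up and http_probe are hypothetical helper names, not part of OmniVoice.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if the URL answers with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_up(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage (assumption: the web UI on port 3900 responds once the backend is up):
#   wait_until_up("http://localhost:3900/")
```

The probe and sleep functions are injectable so the loop can be exercised without a running container. The generous default timeout reflects the multi-gigabyte first-run download.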
NVIDIA GPU setup prerequisites
GPU mode requires the NVIDIA Container Toolkit:
# Arch / CachyOS
sudo pacman -S nvidia-container-toolkit
# Ubuntu / Debian
sudo apt-get install -y nvidia-container-toolkit
# Then configure and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify with docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi.
Build from source instead of pulling
# CPU
docker compose -f deploy/docker-compose.yml --profile cpu up --build -d
# GPU
docker compose -f deploy/docker-compose.yml --profile gpu up --build -d
Network access: the host-side port mapping binds to 127.0.0.1 only, and the backend itself defaults to OMNIVOICE_BIND_HOST=127.0.0.1 (loopback). The shipped docker-compose.yml sets OMNIVOICE_BIND_HOST=0.0.0.0 inside the container so the host mapping can forward traffic in; the 127.0.0.1:3900:3900 mapping is what enforces loopback-only on the host. To expose on your LAN, change the host port mapping to "0.0.0.0:3900:3900". Running the backend directly (not under Docker)? Set OMNIVOICE_BIND_HOST=0.0.0.0 to listen on all interfaces. OmniVoice ships no authentication, so put it behind a reverse proxy with auth (Caddy basic_auth, nginx + htpasswd, Tailscale, etc.).
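The difference between the two bind addresses can be checked with Python's standard ipaddress module. In this small sketch, the OMNIVOICE_BIND_HOST variable and its 127.0.0.1 default come from the paragraph above, while describe_bind_host is a hypothetical helper and not OmniVoice code.

```python
import ipaddress
import os


def describe_bind_host(host: str) -> str:
    """Classify a bind address as loopback-only, all-interfaces, or specific."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP (for example a hostname); this sketch does not resolve it.
        return "hostname (resolve it to see which interface it maps to)"
    if addr.is_loopback:
        return "loopback only (reachable from this machine)"
    if addr.is_unspecified:
        return "all interfaces (reachable from the LAN; add auth in front)"
    return "one specific interface"


host = os.environ.get("OMNIVOICE_BIND_HOST", "127.0.0.1")
print(host, "->", describe_bind_host(host))
```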
git clone https://github.com/debpalash/OmniVoice-Studio.git && cd OmniVoice-Studio
bun install && bun run dev
Open localhost:3901 and start cloning voices. Hot-reload enabled for both frontend and backend.
bun run desktop # Build the native desktop app from source

| Service | URL | Stack |
|---|---|---|
| Backend | localhost:3900 | FastAPI · 97 endpoints · WhisperX · Demucs · OmniVoice |
| Frontend | localhost:3901 | React · Vite · Waveform timeline · Glassmorphism UI |
| API Docs | localhost:3900/docs | Scalar (interactive API reference) |
Note
First run downloads model weights (~2.4 GB). No account needed. For faster downloads, optionally set HF_TOKEN=hf_... in your environment (get a free token here).
Having issues? Join our Discord for setup help and troubleshooting.
ElevenLabs charges $5–$330/mo and processes your audio on their servers. OmniVoice Studio runs on your hardware, with no usage limits.
| | ElevenLabs | OmniVoice Studio |
|---|---|---|
| Pricing | $5–$330/mo, per-character billing | Free for personal use · Commercial license for business |
| Voice Cloning | ✅ 3s clip | ✅ 3s clip, zero-shot |
| Voice Design | ✅ Gender, age | ✅ Gender, age, accent, pitch, style, dialect |
| Languages | 32 | 646 |
| Video Dubbing | ✅ Cloud-only | ✅ Fully local |
| Data Privacy | Audio sent to cloud | Nothing leaves your machine |
| API Keys | Required | Not needed |
| GPU Support | N/A (cloud) | CUDA · Apple Silicon · ROCm · CPU |
| Desktop App | ❌ | ✅ macOS · Windows · Linux |
| Customizable | ❌ Closed | ✅ Fork it, extend it, ship it |
OmniVoice Studio gives you professional-grade AI tools without the subscription or the cloud.
| | Minimum | Recommended |
|---|---|---|
| OS | Windows 10, macOS 12+, Ubuntu 20.04+ | Any modern 64-bit OS |
| RAM | 8 GB | 16 GB+ |
| VRAM (GPU) | 4 GB (auto-offloads TTS to CPU) | 8 GB+ (NVIDIA RTX 3060+) |
| Disk | 10 GB free (models + cache) | 20 GB+ SSD |
| Python | 3.10+ (managed by uv) | 3.11–3.12 |
| GPU | Optional — CPU works | NVIDIA CUDA · Apple Silicon MPS · AMD ROCm |
Tip
On GPUs with ≤8 GB VRAM, OmniVoice automatically offloads TTS to CPU during transcription — no config needed. A dedicated GPU is not required; the entire pipeline runs on CPU (just slower).
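The offloading rule in the tip can be written as a small decision function. This only illustrates the stated behavior and is not OmniVoice's actual implementation: plan_placement and its return keys are hypothetical, and the 8 GB threshold is taken from the tip. The FAQ describes 8 GB and up as running everything on the GPU, so treat the exact boundary as an assumption.

```python
from typing import Optional


def plan_placement(vram_gb: Optional[float]) -> dict:
    """Decide where ASR and TTS run while a transcription is in progress.

    vram_gb is the GPU's total memory, or None when no GPU is present.
    """
    if vram_gb is None:
        # No dedicated GPU: the whole pipeline runs on CPU (slower).
        return {"asr": "cpu", "tts": "cpu"}
    if vram_gb <= 8:
        # Small GPU: keep transcription on the GPU, park TTS on the CPU.
        return {"asr": "gpu", "tts": "cpu"}
    # Enough memory to keep both models resident on the GPU.
    return {"asr": "gpu", "tts": "gpu"}


for vram in (None, 4, 8, 12):
    print(vram, plan_placement(vram))
```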
OmniVoice ships a multi-engine TTS backend. The default engine (OmniVoice) is always available; additional engines are opt-in and auto-detected. Switch engines in Settings → TTS Engine or via the OMNIVOICE_TTS_BACKEND env var.
| Engine | Languages | Clone | Instruct | Linux | macOS ARM | Windows | License |
|---|---|---|---|---|---|---|---|
| OmniVoice (default) | 600+ | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Built-in |
| CosyVoice 3 | 9 + 18 dialects | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| MLX-Audio (Kokoro, Qwen3-TTS, CSM, Dia, …) | Multi | Varies | Varies | ❌ | ✅ Native | ❌ | Varies |
| VoxCPM2 | 30 | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| MOSS-TTS-Nano | 20 | ✅ | ❌ | ✅ CUDA/CPU | ✅ CPU | ✅ CUDA/CPU | Apache-2.0 |
| KittenTTS | English | ❌ | ❌ | ✅ CPU | ✅ CPU | ✅ CPU | MIT |
CUDA = GPU-accelerated · MPS = Apple Silicon Metal · CPU = runs everywhere, slower for large models · KittenTTS and MOSS-TTS-Nano run realtime on CPU · MLX-Audio is Apple Silicon only.
┌─────────────────────────────────────────────────┐
│ Frontend (React) │
│ DubTab · VoicePreview · BatchQueue · Gallery │
├─────────────────────────────────────────────────┤
│ Backend (FastAPI) │
│ 97 API endpoints · SSE streaming · SQLite │
├──────────┬──────────┬──────────┬────────────────┤
│ WhisperX │ Demucs │OmniVoice │ Pyannote │
│ ASR │ Source │ TTS │ Diarization │
│ │ Sep. │ │ │
└──────────┴──────────┴──────────┴────────────────┘
CUDA / MPS / ROCm / CPU (auto-detected)
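Accelerator auto-detection of the kind shown at the bottom of the diagram is commonly done by asking PyTorch what it can see. Below is a hedged sketch of that general approach, not the project's actual detection code; pick_device is a hypothetical name, and the function falls back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "rocm", "mps", or "cpu" based on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # Without PyTorch installed, this sketch just reports CPU.

    if torch.cuda.is_available():
        # ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API
        # and set torch.version.hip to a version string.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```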
| Category | Features |
|---|---|
| Dubbing | Full pipeline (transcribe→translate→synthesize→mux), scene-aware splitting, lip-sync scoring, streaming TTS |
| Voice | Zero-shot cloning, voice design, A/B comparison, voice preview widget, gallery with favorites/tags |
| Audio | Demucs vocal isolation, per-segment gain, selective track export, stem/SRT/VTT/MP3 export |
| Multi-Lang | Multi-language batch picker, batch dubbing queue with sequential GPU execution |
| Diarization | Pyannote ML diarization, auto speaker clone extraction, per-speaker voice assignment |
| Infra | Docker deployment, CUDA/MPS/ROCm auto-detect, cuDNN 8 compat, VRAM-aware model offloading |
| AI Provenance | AudioSeal invisible watermarking (SynthID-like), video logo overlay, watermark detection API |
| UX | Undo/redo, keyboard shortcuts, drag-and-drop, session persistence, glassmorphism design system |
| Real-time Events | WebSocket event bus — instant sidebar refresh on data mutations, exponential backoff reconnect |
| State Management | Zustand store migration — uiSlice, pillSlice, dubSlice, generateSlice, prefsSlice, glossarySlice |
| Desktop | Cross-platform Tauri installers (macOS DMG, Windows MSI, Linux deb/AppImage), auto-update infrastructure |
| Windows Hardening | Cross-platform log paths, Triton workaround, HF symlink bypass, 300s health check timeout |
| Dictation | Global system-wide hotkey (⌘+⇧+Space), frameless floating widget, streaming ASR via WebSocket, auto-paste |
| Batch Pipeline | Full batch TTS: extract → transcribe → translate → generate → mix → export, with live progress tracking |
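The "exponential backoff reconnect" mentioned in the Real-time Events row refers to a standard pattern: wait a little longer after each failed reconnect attempt, up to a cap. The sketch below shows the delay schedule only. The base, factor, and cap values are illustrative assumptions, not the values OmniVoice uses, and backoff_delays is a hypothetical name.

```python
import random
from typing import Iterator


def backoff_delays(
    base_s: float = 1.0,
    factor: float = 2.0,
    max_s: float = 30.0,
    jitter: float = 0.0,
) -> Iterator[float]:
    """Yield reconnect delays: base, base*factor, ... capped at max_s."""
    delay = base_s
    while True:
        # Optional jitter spreads out clients that disconnected together.
        yield delay + random.uniform(0, jitter * delay)
        delay = min(delay * factor, max_s)


delays = backoff_delays()
print([next(delays) for _ in range(7)])  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

A client would sleep for the next delay after each failed connection attempt and create a fresh generator once a connection succeeds.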
- 🎬 Lip-sync v2 — visual speech timing with wav2lip
- 📖 Audiobook Editor — chapter-aware long-form narration
- 🌐 Hosted Demo — try OmniVoice without installing anything
- 🔌 Plugin Marketplace — community-contributed TTS engines and effects
| Channel | What happens there |
|---|---|
| #showcase | Members share their dubs, clones, and voice designs |
| #help | Setup issues, GPU troubleshooting, model questions |
| #feature-requests | Vote on what gets built next |
| #dev | Architecture discussions, PR reviews, engine integrations |
| #announcements | Release notes, breaking changes, early access |
→ Join the Discord — we respond to setup questions within hours, not days.
We welcome contributions of all kinds — bug fixes, new TTS engine adapters, UI improvements, docs, and translations.
- 📖 Read the Contributing Guide for setup, code style, and PR workflow
- 🐛 Browse good first issues
- 💬 Join our Discord to discuss ideas or ask for help
Is this really as good as ElevenLabs?
For voice cloning and dubbing, yes — OmniVoice uses a state-of-the-art diffusion TTS model with 646 languages (ElevenLabs supports 32). Quality is comparable for most use cases. Where ElevenLabs wins is in their polished cloud API and pre-made voice library. OmniVoice wins on privacy, cost, language coverage, and customizability.
Does it work on Apple Silicon (M1/M2/M3/M4)?
Yes. MPS acceleration is auto-detected. MLX-optimized Whisper models are available for faster transcription on Apple hardware.
How much VRAM do I need?
4 GB minimum. With ≤8 GB, the TTS model is automatically offloaded to CPU during transcription. With 8+ GB, everything runs on GPU simultaneously. No GPU at all? CPU mode works — just slower (~3× for TTS).
Can I use this commercially?
Personal, educational, internal-team, and non-commercial use is free under FSL-1.1-ALv2. Building a competing product or service on top of OmniVoice Studio requires a commercial license — see License. Pricing tiers coming soon. Each release converts to Apache 2.0 two years after publication.
What languages are supported?
646 languages for TTS via the OmniVoice model. Transcription (WhisperX) supports 99 languages. Translation coverage depends on the target language pair.
Can I add my own TTS engine?
Yes. OmniVoice uses a built-in backend registry. To add an engine in ~50 lines, subclass TTSBackend in backend/services/tts_backend.py and add it to the _REGISTRY dictionary at the bottom. Six engines are built in: OmniVoice, CosyVoice, MLX-Audio (14+ sub-engines), VoxCPM2, MOSS-TTS-Nano, and KittenTTS. See the TTS Engines section for details.
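The subclass-and-register pattern described above can be sketched in isolation. Everything here is a stand-in: the real TTSBackend in backend/services/tts_backend.py may require different methods and signatures, and SilenceBackend, synthesize, and get_backend are hypothetical names used only to show the shape of a registry.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class TTSBackend(ABC):
    """Stand-in base class; the real one may require different methods."""

    name: str = "base"

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return audio bytes for `text` (hypothetical signature)."""


class SilenceBackend(TTSBackend):
    """Toy engine: emits 16-bit PCM silence, 10 ms per character at 16 kHz."""

    name = "silence"

    def synthesize(self, text: str) -> bytes:
        samples = 160 * len(text)  # 160 samples = 10 ms at 16 kHz
        return b"\x00\x00" * samples


# Name -> class mapping, mirroring the _REGISTRY dictionary mentioned above.
_REGISTRY: Dict[str, Type[TTSBackend]] = {
    SilenceBackend.name: SilenceBackend,
}


def get_backend(name: str) -> TTSBackend:
    """Instantiate a registered engine, failing loudly on unknown names."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(
            f"unknown TTS backend {name!r}; known: {sorted(_REGISTRY)}"
        ) from None


audio = get_backend("silence").synthesize("hello")
print(len(audio))  # 1600 bytes: 5 chars * 160 samples * 2 bytes
```

Keeping the registry as a plain dictionary means that adding an engine is a one-line change once the subclass exists, which matches the workflow the answer describes.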
OmniVoice Studio is source-available under the Functional Source License (FSL-1.1-ALv2).
Free for personal, educational, research, internal team, and non-commercial use. Each release converts to Apache 2.0 automatically two years after publication.
Business / enterprise users building a competing product or service on top of OmniVoice Studio need a commercial license. Pricing tiers coming soon. For inquiries in the meantime, reach out at OmniVoice@palash.dev.
See LICENSE for the full terms.
OmniVoice Studio is built on the shoulders of exceptional open-source work:
| Project | Role |
|---|---|
| OmniVoice (k2-fsa) | Zero-shot diffusion TTS engine — the core voice synthesis model |
| WhisperX | Word-level speech recognition and alignment |
| Demucs (Meta) | Music source separation for vocal isolation |
| Pyannote | Speaker diarization — who said what |
| CTranslate2 | Optimized Transformer inference on CPU and GPU |
| AudioSeal (Meta) | Invisible neural audio watermarking for AI provenance |
| Tauri | Native desktop app framework |
If you read this far, you're our kind of person.
⭐ Star this repo so others can find it too.
💬 Join the Discord to share what you build.







