debpalash/OmniVoice-Studio

OmniVoice Studio

The open-source ElevenLabs alternative.

Real-time dictation, zero-shot voice cloning, and cinematic video dubbing — all on your desktop.
Open-source, no API keys, fully local. 646 languages.

Quickstart · Features · Why OVS · TTS Engines · ASR Engines · Donate · Contributing · Discord · 简体中文


Warning

OmniVoice Studio is in active beta. Things may break between releases. For the latest features and fixes, clone the repo and run from source rather than using pre-built installers. Bug reports and PRs are very welcome — open an issue or join Discord.


Join Discord
Get setup help · Share your dubs · Vote on the roadmap · Early access to new engines

Features

🎙️ Voice Cloning

3-second clip → mirror any voice.
646 languages, zero-shot.

🎨 Voice Design

Gender, age, accent, pitch, speed,
emotion, dialect — dial it in.

🎬 Video Dubbing

YouTube URL or file → transcribe →
translate → re-voice → MP4.

📖 Audiobook Editor

Import text, EPUB, or PDF. Auto-chapter,
loudnorm, metadata. Export .m4b.

🎭 Stories

Multi-voice editor. Assign voices
per-line, preview, export full cast.

⌨️ Dictation Widget

⌘+⇧+Space from any app.
Transcribes, auto-pastes, disappears.

🔊 Vocal Isolation

Demucs-powered. Splits speech
from music, keeps the background.

👥 Speaker Diarization

Pyannote + WhisperX.
Auto-identifies who said what.

📦 Batch Queue

Drop 50 videos, walk away.
Progress bars per job.

🤖 MCP Server

Use OmniVoice from Claude,
Cursor, or any MCP client.

🛡️ AI Watermark

AudioSeal (Meta). Invisible,
survives compression.

🔬 Diagnostics

Self-check, error journal,
scrubbed diagnostic bundle.

🔐 100% Local

No keys, no cloud, no accounts.
Your machine only.

⚡ GPU Auto-Detect

CUDA · MPS · ROCm · CPU.
≤8 GB? Auto-offloads.

🧩 Extensible

Subclass TTSBackend,
add any engine in ~50 lines.

🧭 Engine Routing

Preflight GPU check per engine.
No silent CPU fallback.

🎒 Portable Personas

Export voices as .ovsvoice
bundles — identity + watermark.

♾️ Unlimited TTS

Sentence-chunked generation.
No length cap. Streaming via WS.

🌐 Remote Backend

Point UI at a remote server.
Tailscale-friendly. Bearer auth.

🧠 Dictation + LLM

Local LLM cleanup of transcripts.
Optional echo cancellation.
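
The "Unlimited TTS" card above mentions sentence-chunked generation. As a rough illustration of that idea (not OmniVoice's actual implementation), the sketch below splits text at sentence boundaries and groups whole sentences into chunks that a TTS engine could synthesize one at a time. The names `split_sentences`, `chunk_text`, and the `max_chars` limit are hypothetical.

```python
import re
from typing import Iterator, List

# Split after ., ! or ? when followed by whitespace. Deliberately simple:
# abbreviations such as "e.g." will also trigger a split.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Return the non-empty sentences found in text."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_text(text: str, max_chars: int = 200) -> Iterator[str]:
    """Yield chunks made of whole sentences, up to max_chars where possible.

    A single sentence longer than max_chars is yielded on its own, so no
    sentence is ever cut in the middle.
    """
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            yield current
            current = sentence
        else:
            current = candidate
    if current:
        yield current
```

Because each chunk is independent, the input length is bounded only by how long the caller keeps iterating, and audio for early chunks can be streamed while later ones are still being generated.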


Quickstart

Download: macOS DMG · Windows MSI · Linux AppImage · Debian .deb
macOS: first launch needs a one-time approval — right-click → Open (or System Settings → Privacy & Security → "Open Anyway" on macOS 15). No Terminal needed. Why?

Per-OS install guides — pick yours and follow it end-to-end:

Stuck? Run the built-in self-check first — Settings → About → "Run self-check" in the app, or uv run python backend/main.py --diagnose from a checkout (--deep also test-loads the active engine). Then see docs/install/troubleshooting.md for the top 10 install errors. The in-app error UI deeplinks to those entries when something breaks at runtime, and Settings → About → "Save diagnostic bundle" packages scrubbed logs + the self-check report for bug reports.

For Hugging Face token setup, see docs/setup/huggingface-token.md. For diarization-specific gating, see docs/features/diarization.md. For download speed, the ⚡ fast-download (Xet) status, and restricted-network / mirror options, see docs/downloading-models.md.

Screenshots

Voice Clone
Drop a 3-second clip → mirror any voice. 646 languages, zero-shot.
Voice Design
Build new voices from scratch: gender, age, accent, pitch, style.
Video Dubbing
Upload or paste a YouTube URL. Transcribe, translate, re-voice, export.
Voice Gallery
Search YouTube, browse categories, download clips, build your library.
Settings → Models
15 models. One-click install. Auto-detects your platform (CUDA / MPS / CPU).
Projects
Dub projects, voice profiles, generation history, exports, all searchable.
Settings → Logs
Live backend, frontend, and Tauri runtime logs. Filter, refresh, clear.

Why OVS?

ElevenLabs charges $5–$330/mo and processes your audio on their servers. OmniVoice Studio runs on your hardware, with no usage limits.

| | ElevenLabs | OmniVoice Studio |
|---|---|---|
| Pricing | $5–$330/mo, per-character billing | Free & open-source (AGPL-3.0) · Commercial license for proprietary use |
| Voice Cloning | ✅ 3s clip | ✅ 3s clip, zero-shot |
| Voice Design | ✅ Gender, age | ✅ Gender, age, accent, pitch, style, dialect |
| Audiobook / Stories | | ✅ Full audiobook editor + multi-voice stories (EPUB/PDF import, .m4b export) |
| Languages | 32 | 646 |
| Video Dubbing | ✅ Cloud-only | ✅ Fully local |
| Data Privacy | Audio sent to cloud | Nothing leaves your machine |
| API Keys | Required | Not needed |
| GPU Support | N/A (cloud) | CUDA · Apple Silicon · ROCm · CPU |
| Desktop App | | ✅ macOS · Windows · Linux |
| TTS Engines | 1 | 11 (OmniVoice, CosyVoice 3, GPT-SoVITS, VoxCPM2, MOSS-TTS-Nano, KittenTTS, MLX-Audio, Sherpa-ONNX, IndexTTS 2, OmniVoice GGUF, Supertonic 3) |
| ASR Engines | 1 | 9 (WhisperX, Faster-Whisper, MLX Whisper, PyTorch Whisper, Parakeet, Moonshine, FunASR, isolated Faster-Whisper, sherpa-onnx live dictation) |
| MCP Server | | ✅ Use from Claude, Cursor, any MCP client |
| Self-check | | ✅ Diagnostics suite, error journal, scrubbed debug bundles |
| Customizable | ❌ Closed | ✅ Fork it, extend it, ship it |

OmniVoice Studio gives you professional-grade AI tools without the subscription or the cloud.


Convinced? Come build with us.
Join Discord


System Requirements

| | Minimum | Recommended |
|---|---|---|
| OS | Windows 10, macOS 12+, Ubuntu 20.04+ | Any modern 64-bit OS |
| RAM | 8 GB | 16 GB+ |
| VRAM (GPU) | 4 GB (auto-offloads TTS to CPU) | 8 GB+ (NVIDIA RTX 3060+) |
| Disk | 10 GB free (models + cache) | 20 GB+ SSD |
| Python | 3.10+ (managed by uv) | 3.11–3.12 |
| GPU | Optional (CPU works) | NVIDIA CUDA · Apple Silicon MPS · AMD ROCm |

Tip

On GPUs with ≤8 GB VRAM, OmniVoice automatically offloads TTS to CPU during transcription — no config needed. A dedicated GPU is not required; the entire pipeline runs on CPU (just slower).
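
The offload rule in the tip above can be pictured as a small decision function. This is only a sketch under stated assumptions: `plan_devices`, its return keys, and `OFFLOAD_THRESHOLD_GB` are hypothetical names rather than OmniVoice's API, and the 8 GB threshold is taken directly from the tip.

```python
from typing import Dict, Optional

# Threshold taken from the tip above. The function and key names are
# hypothetical and are not part of OmniVoice's actual code.
OFFLOAD_THRESHOLD_GB = 8.0


def plan_devices(vram_gb: Optional[float]) -> Dict[str, str]:
    """Decide where ASR and TTS live while a transcription is running.

    vram_gb is the GPU memory in GB, or None when no GPU is present.
    """
    if vram_gb is None:
        # No dedicated GPU: the whole pipeline runs on CPU (slower).
        return {"asr": "cpu", "tts": "cpu"}
    if vram_gb <= OFFLOAD_THRESHOLD_GB:
        # Small GPU: keep ASR on the GPU and park TTS on the CPU.
        return {"asr": "gpu", "tts": "cpu"}
    # Larger GPU: both models stay resident on the GPU.
    return {"asr": "gpu", "tts": "gpu"}
```

For example, `plan_devices(6)` keeps transcription on the GPU and moves TTS to the CPU, while `plan_devices(None)` puts everything on the CPU.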

TTS Engines

OmniVoice ships a multi-engine TTS backend. The default engine (OmniVoice) is always available; additional engines are opt-in and auto-detected. Switch engines in Settings → TTS Engine or via the OMNIVOICE_TTS_BACKEND env var.

Engine Languages Clone Instruct Linux macOS ARM Windows License
OmniVoice (default) 600+ ✅ CUDA/CPU ✅ MPS ✅ CUDA/CPU Built-in
CosyVoice 3 9 + 18 dialects ✅ CUDA/CPU ✅ MPS ✅ CUDA/CPU Apache-2.0
GPT-SoVITS 5 ✅ CUDA/CPU ✅ CUDA/CPU MIT
VoxCPM2 30 ✅ CUDA/CPU ✅ MPS ✅ CUDA/CPU Apache-2.0
MOSS-TTS-Nano 20 ✅ CUDA/CPU ✅ CPU ✅ CUDA/CPU Apache-2.0
KittenTTS English ✅ CPU ✅ CPU ✅ CPU MIT
MLX-Audio (Kokoro, Qwen3-TTS, CSM, Dia, …) Multi Varies Varies ✅ Native Varies
Sherpa-ONNX 20+ ✅ CUDA/CPU ✅ CPU ✅ CUDA/CPU Apache-2.0
IndexTTS 2 Multi ✅ CUDA ✅ CUDA Apache-2.0
OmniVoice GGUF 600+ ✅ CPU ✅ CPU ✅ CPU Built-in
Supertonic 3 31 ✅ CPU ✅ CPU ✅ CPU OpenRAIL-M
MOSS-TTS-v1.5 ⚡ (8B) 31 ✅ CUDA/CPU ✅ CPU ✅ CUDA/CPU Apache-2.0
dots.tts ⚡ (2B) 24 ✅ CUDA/CPU ✅ CPU Apache-2.0

CUDA = GPU-accelerated · MPS = Apple Silicon Metal · CPU = runs everywhere, slower for large models · KittenTTS and MOSS-TTS-Nano run realtime on CPU · MLX-Audio is Apple Silicon only · ⚡ = lazy-registered (installed on first use)

MOSS-TTS-v1.5 (8B, ~16 GB weights) and dots.tts (2B, ~9 GB weights) are heavyweight opt-in engines that run in their own isolated venv from a local clone — see MOSS-TTS-v1.5 and dots.tts. Neither claims Apple-Silicon MPS (upstream is CUDA/CPU only; on a Mac they run on CPU). dots.tts upstream is Linux/macOS only — no Windows path.

ASR Engines

OmniVoice ships a multi-engine ASR (speech-to-text) backend that powers dictation, video dubbing, and subtitle generation — all fully local. WhisperX is the cross-platform default; the rest are opt-in and auto-detected. Switch in Settings → ASR Engine or via the OMNIVOICE_ASR_BACKEND env var.

| Engine | OMNIVOICE_ASR_BACKEND | Languages | Best for |
|---|---|---|---|
| WhisperX (default) | whisperx | ~100 | Dubbing & subtitles: word-level timing via wav2vec2 forced alignment |
| Faster-Whisper | faster-whisper | ~100 | Fast transcription on Linux / macOS / Windows (CTranslate2) |
| Faster-Whisper (isolated) | faster-whisper-isolated | ~100 | Same as Faster-Whisper but crash-isolated in a subprocess, so an ASR crash won't take down the app |
| MLX Whisper | mlx-whisper | ~100 | Native Apple Silicon speed (Apple MLX / Metal) |
| PyTorch Whisper | pytorch-whisper | ~100 | CUDA / CPU fallback via 🤗 Transformers (no cuDNN 8 needed) |
| Parakeet TDT | nemo-parakeet | English + 25 EU | SOTA English accuracy, auto language detection (NVIDIA NeMo, GPU only) |
| Moonshine | moonshine | English | Edge / low-latency, ONNX |
| FunASR | funasr | 50+ | All-in-one multilingual: built-in VAD + inline speaker diarization (SenseVoice) |
| sherpa-onnx (live dictation) | sherpa-onnx-asr | 25 EU + 90+ | Live, faster-than-real-time dictation: small streaming/offline ONNX models (Parakeet TDT v3/v2, streaming Zipformer & Paraformer, Whisper Tiny), CPU, identical on macOS / Windows / Linux. Picked per-model in Settings → Voice. |

Whisper-family engines cover ~100 languages; FunASR / SenseVoice adds an all-in-one multilingual path with built-in voice-activity detection and inline speaker diarization. sherpa-onnx powers the live dictation model picker — you talk and text appears as you speak. Every engine runs on-device — no API keys, no cloud.
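
If you script around the backend, it can help to validate OMNIVOICE_ASR_BACKEND before launching. The sketch below checks the variable against the identifiers listed in the ASR engine table and falls back to the documented default. `resolve_asr_backend` is a hypothetical helper; how OmniVoice itself parses the variable is not described here.

```python
import os
from typing import Mapping, Optional

# Identifiers copied from the OMNIVOICE_ASR_BACKEND column of the engine table.
KNOWN_ASR_BACKENDS = {
    "whisperx", "faster-whisper", "faster-whisper-isolated", "mlx-whisper",
    "pytorch-whisper", "nemo-parakeet", "moonshine", "funasr", "sherpa-onnx-asr",
}
DEFAULT_ASR_BACKEND = "whisperx"  # documented cross-platform default


def resolve_asr_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the requested ASR backend id, defaulting to WhisperX.

    Hypothetical helper: unknown values are rejected rather than guessed,
    so a typo fails early instead of silently selecting another engine.
    """
    env = os.environ if env is None else env
    value = env.get("OMNIVOICE_ASR_BACKEND", "").strip().lower()
    if not value:
        return DEFAULT_ASR_BACKEND
    if value not in KNOWN_ASR_BACKENDS:
        known = ", ".join(sorted(KNOWN_ASR_BACKENDS))
        raise ValueError(f"Unknown ASR backend {value!r}; expected one of: {known}")
    return value
```

Passing a plain dictionary instead of the real environment, as in `resolve_asr_backend({"OMNIVOICE_ASR_BACKEND": "funasr"})`, keeps the check easy to test.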

GPU without efficient float16? On older NVIDIA GPUs (Maxwell/Pascal, GTX 16xx) or after a CTranslate2/cuDNN mismatch, the CTranslate2 ASR engines (WhisperX, Faster-Whisper) can't run float16, so OmniVoice automatically retries on int8 with no config needed. If transcription still fails, pin the compute type with the ASR_COMPUTE_TYPE env var as an escape hatch: set ASR_COMPUTE_TYPE=int8 (or float32 for CPU) and restart the backend.
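
The retry behavior described above follows a common pattern: try float16 first, fall back to int8, and let a pinned ASR_COMPUTE_TYPE override both. The sketch below shows that pattern in isolation. It is not OmniVoice's code: `load_with_fallback` and `compute_type_candidates` are hypothetical names, `load` stands in for whatever constructs the ASR model, and the caught exception types are an assumption.

```python
import os
from typing import Callable, List, Mapping, Optional, TypeVar

T = TypeVar("T")


def compute_type_candidates(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the compute types to try, in order.

    A pinned ASR_COMPUTE_TYPE is tried alone; otherwise float16 is tried
    first and int8 is the automatic retry.
    """
    env = os.environ if env is None else env
    pinned = env.get("ASR_COMPUTE_TYPE", "").strip()
    return [pinned] if pinned else ["float16", "int8"]


def load_with_fallback(
    load: Callable[[str], T], env: Optional[Mapping[str, str]] = None
) -> T:
    """Call load(compute_type) for each candidate until one succeeds.

    `load` is a hypothetical stand-in for the real model constructor.
    """
    last_error: Optional[Exception] = None
    for compute_type in compute_type_candidates(env):
        try:
            return load(compute_type)
        except (ValueError, RuntimeError) as exc:
            # For example, float16 is not supported efficiently on this GPU.
            last_error = exc
    raise RuntimeError("No usable ASR compute type") from last_error
```

Pinning the variable disables the automatic retry on purpose: if the pinned type fails, the error surfaces immediately instead of being masked by a second attempt.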


Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Frontend (React)                          │
│  DubTab · VoiceConsole · Stories · Audiobook · Gallery     │
│  Dictation · BatchQueue · Diagnostics · MCP Client          │
├─────────────────────────────────────────────────────────────┤
│                  Backend (FastAPI)                           │
│  100+ API endpoints · SSE+WSS streaming · SQLite            │
├──────────┬──────────┬──────────┬──────────┬────────────────┤
│ WhisperX │  Demucs  │OmniVoice │ Pyannote │ Engine Routing  │
│  (+8 ASR │  Source  │  (+10    │ Diariz-  │ ↳ GPU preflight │
│ engines) │  Sep.    │  TTS)    │ ation    │ ↳ No silent CPU │
└──────────┴──────────┴──────────┴──────────┴────────────────┘
         CUDA / MPS / ROCm / CPU (auto-detected + routed)

Roadmap

✅ Shipped

| Category | Features |
|---|---|
| Longform | Audiobook editor (text/EPUB/PDF → chaptered .m4b), Stories multi-voice editor, two-pass loudnorm mastering, crash-resume for interrupted renders, pronunciation control + SSML-lite prosody |
| Dubbing | Full pipeline (transcribe→translate→synthesize→mux), scene-aware splitting, lip-sync scoring, streaming TTS, per-speaker voice assignment, Smart Fit timing + second-pass QC, dedicated Dub home |
| Voice | Zero-shot cloning, voice design, A/B comparison, voice preview widget, gallery with favorites/tags, portable persona bundles (.ovsvoice), voice console workspace |
| Audio | Demucs vocal isolation, per-segment gain, selective track export, stem/SRT/VTT/MP3 export, unlimited-length TTS via sentence-chunked generation |
| Multi-Lang | Multi-language batch picker, batch dubbing queue with sequential GPU execution |
| Diarization | Pyannote ML diarization, auto speaker clone extraction, per-speaker voice assignment |
| ASR | 9 engines (WhisperX, Faster-Whisper, isolated Faster-Whisper, MLX Whisper, PyTorch Whisper, Parakeet TDT, Moonshine, FunASR/SenseVoice, sherpa-onnx live dictation), crash-isolated subprocess backend |
| TTS | 11 engines (OmniVoice, CosyVoice 3, GPT-SoVITS, VoxCPM2, MOSS-TTS-Nano, KittenTTS, MLX-Audio, Sherpa-ONNX, + lazy: IndexTTS 2, OmniVoice GGUF, Supertonic 3), engine routing with GPU preflight |
| Infra | Docker deployment, CUDA/MPS/ROCm auto-detect, cuDNN 8 compat, VRAM-aware model offloading, engine routing (no silent CPU fallback), diagnostics suite & error journal, restricted-network mirror support |
| AI Provenance | AudioSeal invisible watermarking (SynthID-like), video logo overlay, watermark detection API |
| UX | Undo/redo, keyboard shortcuts, drag-and-drop, session persistence, glassmorphism design system, UI scale fix for Linux/WebKitGTK |
| Real-time Events | WebSocket event bus: instant sidebar refresh on data mutations, exponential backoff reconnect |
| State Management | Zustand store migration: uiSlice, pillSlice, dubSlice, generateSlice, prefsSlice, glossarySlice |
| Desktop | Cross-platform Tauri installers (macOS DMG/Intel, Windows MSI, Linux deb/AppImage), auto-update infrastructure, single-instance enforcement, close-to-tray, macOS Gatekeeper fix |
| Dictation | Global system-wide hotkey (⌘+⇧+Space), frameless floating widget, streaming ASR via WebSocket, auto-paste, customizable hotkey, local-LLM transcript refinement |
| Batch Pipeline | Full batch TTS: extract → transcribe → translate → generate → mix → export, with live progress tracking |
| MCP Server | OmniVoice as a local TTS/STT provider for Claude, Cursor, and any MCP client |
| Remote Backend | Point the desktop UI at a remote backend URL with bearer auth (Tailscale-documented) |
| Reliability | Stall watchdog on bootstrap splash, per-engine GPU compatibility matrix, actionable errors for non-executable engine binaries, setuptools auto-repair |

🔜 Up Next

  • 🎬 Lip-sync v2 — visual speech timing with wav2lip
  • 🌐 Hosted Demo — try OmniVoice without installing anything
  • 🔌 Plugin Marketplace — community-contributed TTS engines and effects
  • 🎵 Real-time Voice Changer — live microphone transformation during calls

Sponsor / Donate

OmniVoice Studio is built by one developer using Claude Code and AI agents — and the agent bills are real. Over the last three months I've spent thousands of dollars on Claude subscriptions to keep the features shipping, the bugs fixed, and your issues answered. If OmniVoice has created value for you, helping cover those bills means I can keep developing full-time.

This month's agent bill fund

$10 / $200 raised



Ko-fi    PayPal


Every dollar goes directly to agent bills — keeping OmniVoice development continuous.

Community

| Channel | What happens there |
|---|---|
| #showcase | Members share their dubs, clones, and voice designs |
| #help | Setup issues, GPU troubleshooting, model questions |
| #feature-requests | Vote on what gets built next |
| #dev | Architecture discussions, PR reviews, engine integrations |
| #announcements | Release notes, breaking changes, early access |

→ Join the Discord — we respond to setup questions within hours, not days.


Contributing

We welcome contributions of all kinds — bug fixes, new TTS engine adapters, UI improvements, docs, and translations.


FAQ

Is this really as good as ElevenLabs?
For voice cloning and dubbing, yes — OmniVoice uses a state-of-the-art diffusion TTS model with 646 languages (ElevenLabs supports 32). Quality is comparable for most use cases. Where ElevenLabs wins is in their polished cloud API and pre-made voice library. OmniVoice wins on privacy, cost, language coverage, and customizability.
Does it work on Apple Silicon (M1/M2/M3/M4)?
Yes. MPS acceleration is auto-detected. MLX-optimized Whisper models are available for faster transcription on Apple hardware.
How much VRAM do I need?
4 GB minimum. With ≤8 GB, the TTS model is automatically offloaded to CPU during transcription. With 8+ GB, everything runs on GPU simultaneously. No GPU at all? CPU mode works — just slower (~3× for TTS).
Can I use this commercially?
Yes, commercial use is free. OmniVoice Studio is free and open-source under the GNU AGPL-3.0, so personal, educational, research, and commercial / business use are all free: run it, sell the audio you make with it, dub your own or a client's videos, deploy it across your team. Because AGPL is a network copyleft license, if you modify OmniVoice Studio and make that modified version available to others over a network, you must offer those users the source of your modified version under the same AGPL terms. Want to embed OmniVoice in a closed-source or proprietary product without those obligations? A commercial license is available; see License.
What languages are supported?
646 languages for TTS via the OmniVoice model. Transcription (WhisperX) supports 99 languages. Translation coverage depends on the target language pair.
Can I add my own TTS engine?
Yes. OmniVoice uses a built-in backend registry. To add an engine in ~50 lines, subclass TTSBackend in backend/services/tts_backend.py and add it to the _REGISTRY dictionary. Eleven engines are built in: OmniVoice, CosyVoice 3, GPT-SoVITS, MLX-Audio (14+ sub-engines), VoxCPM2, MOSS-TTS-Nano, KittenTTS, Sherpa-ONNX, plus lazy-registered IndexTTS 2, OmniVoice GGUF, and Supertonic 3. See the TTS Engines section for details.

License

OmniVoice Studio is free and open-source software under the GNU Affero General Public License v3.0 (AGPL-3.0).

Free for any use — including commercial and internal business use. Run it, sell the audio you produce with it, dub your own or clients' videos, roll it out across your team — all free, no license needed. As a network copyleft license, AGPL adds one obligation: if you modify OmniVoice Studio and offer that modified version to others over a network, you must make the complete corresponding source of your modified version available to them under the same AGPL-3.0 terms.

A commercial license is available for organizations that want to embed OmniVoice Studio in a closed-source or proprietary product or service without the AGPL-3.0 copyleft obligations. Pricing tiers coming soon. Inquiries: OmniVoice@palash.dev.

The bundled omnivoice/ TTS model by Han Zhu remains Apache-2.0 upstream. See LICENSE for the full, binding terms.


Acknowledgments

OmniVoice Studio is built on the shoulders of exceptional open-source work:

| Project | Role |
|---|---|
| OmniVoice (k2-fsa) | Zero-shot diffusion TTS engine, the core voice synthesis model |
| WhisperX | Word-level speech recognition and alignment |
| Demucs (Meta) | Music source separation for vocal isolation |
| Pyannote | Speaker diarization (who said what) |
| CTranslate2 | Optimized Transformer inference on CPU and GPU |
| AudioSeal (Meta) | Invisible neural audio watermarking for AI provenance |
| Tauri | Native desktop app framework |
| Supertone / Supertonic 3 | ONNX TTS engine: 31 languages, CPU-efficient |
| Sherpa-ONNX | WASM-ready universal TTS/ASR runtime |
| GPT-SoVITS | Zero-shot TTS engine: 5 languages, RTF 0.014 |


If you read this far, you're our kind of person.
⭐ Star this repo so others can find it too.
💬 Join the Discord to share what you build.
❤️ Support development — fund the AI agent bills that keep OmniVoice shipping.
