OmniVoice Studio

The open-source ElevenLabs alternative.

Real-time dictation, zero-shot voice cloning, and cinematic video dubbing — all on your desktop.
No accounts. No API keys. No cloud. Everything runs on your machine. Open-source, 646 languages.

Quickstart · Features · vs Others · Engines · API · Donate · Contributing · Discord · X · 简体中文

Your voice is the most personal data you have. So why rent it back from a cloud? Every mainstream voice tool ships your audio to someone else's server and bills you monthly for the privilege. OmniVoice Studio flips that: clone, design, dub, and dictate on your own hardware — 646 languages, no meter running, nothing leaving your machine.

Warning

Active beta. Things may break between releases — for the newest fixes, run from source. Bug reports and PRs are very welcome: open an issue or join Discord.

📸 See it in action

Studio _{Generate & clone in one workspace — a 3-second clip mirrors any voice, 646 languages, zero-shot.}	Voice Design _{Build new voices from scratch — gender, age, accent, pitch, emotion, dialect.}
Voice Gallery _{Browse ready-made archetype voices with language filters, or build your own — then pick any of them in Studio, Audiobook, Stories, and Dubbing.}	Video Dubbing _{A real dub, end to end: 37 segments transcribed, translated to Bengali, re-voiced, and timed — ready to export as MP4.}
Settings → Engines _{The engine compatibility matrix — 14 TTS engines with per-engine GPU preflight, no silent CPU fallback.}	Settings → Models _{One-click model store — auto-detects your platform (CUDA / MPS / CPU) and recommends the right models.}

✨ Features

Three flagships, five more headliners, and a dozen under the fold.


🎙️ Voice Cloning _{3-sec clip → any voice · 646 languages · zero-shot}	🎨 Voice Design _{Describe it — gender, age, accent, emotion}	🎬 Video Dubbing _{Transcribe → translate → re-voice → MP4}

📖 Audiobook _{EPUB/PDF → .m4b, multi-voice cast}	🎭 Stories _{Multi-voice script editor}	⌨️ Dictation Widget _{⌘⇧Space in any app}	🔐 100% Local _{No keys, no cloud, no accounts}	🤖 MCP Server _{Use from Claude, Cursor, …}

…and 12 more — isolation, diarization, batch, watermarking, diagnostics, and friends

🔊 Vocal Isolation — Demucs-powered: splits speech from music and keeps the background bed.
👥 Speaker Diarization — Pyannote + WhisperX auto-identify who said what.
📦 Batch Queue — drop 50 videos, walk away; per-job progress bars.
🛡️ AI Watermark — AudioSeal (Meta): invisible, survives compression.
🔬 Diagnostics — self-check suite, error journal, scrubbed diagnostic bundles.
⚡ GPU Auto-Detect — CUDA · MPS · ROCm (Linux, opt-in) · CPU; ≤8 GB VRAM auto-offloads.
🧭 Engine routing — preflight GPU check per engine; no silent CPU fallback.
🧩 Extensible — subclass TTSBackend, add any engine in ~50 lines.
🎒 Portable personas — export voices as .ovsvoice bundles: identity + watermark.
♾️ Unlimited TTS — sentence-chunked generation, no length cap, streaming via WebSocket.
🌐 Remote backend — point the UI at a remote server; Tailscale-friendly, bearer auth.
🧠 Dictation + LLM — local-LLM cleanup of transcripts, optional echo cancellation.
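
The "Unlimited TTS" item above rests on sentence-chunked generation. A minimal sketch of that idea follows, assuming a plain punctuation-based splitter and a hypothetical `max_chars` budget; the app's actual chunker is not shown in this README and may work differently.

```python
import re
from typing import Iterator


def chunk_sentences(text: str, max_chars: int = 200) -> Iterator[str]:
    """Yield chunks made of whole sentences, each at most max_chars where possible.

    max_chars is a hypothetical budget for illustration. A single sentence
    longer than max_chars is emitted whole rather than cut mid-sentence.
    """
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            yield current          # budget exceeded: flush what we have
            current = sentence
        else:
            current = candidate    # still fits: keep accumulating
    if current:
        yield current


for chunk in chunk_sentences("One. Two! Three? Four.", max_chars=10):
    print(chunk)
```

Because each chunk is synthesized on its own, total length is bounded only by how many chunks you are willing to generate, and each finished chunk can be streamed out while the next one renders.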

⚡ Quickstart

_{macOS: first launch needs a one-time approval — right-click → Open (or System Settings → Privacy & Security → "Open Anyway" on macOS 15). No Terminal needed. Why? · Intel Macs: local backend unsupported (#889) — details.}

Install guide: 🍎 macOS · 🪟 Windows · 🐧 Linux · 🐳 Docker

🧰 Troubleshooting · slow generation · HF tokens · restricted networks

Something broke? Run the self-check from Settings → About → "Run self-check" (or `uv run python backend/main.py --diagnose --deep`), then check the top 10 install errors. "Save diagnostic bundle" packages scrubbed logs for a bug report.
Feels slow? docs/performance.md — where the time goes and how to tune it.
Want breaths, laughter, emotion? docs/expressive-speech.md — what each engine can do today.
HF tokens, diarization, and download speed / mirrors each have their own guide: tokens · diarization · downloads.
Coming from Real-Time-Voice-Cloning? Migration guide.

⚖️ vs Others

ElevenLabs charges $5–$330/mo and processes your audio on their servers. OmniVoice Studio runs on your hardware, with no usage limits.

	ElevenLabs	OmniVoice Studio
Pricing	$5–$330/mo, per-character billing	Free & open-source (AGPL-3.0) · Commercial license for proprietary use
Voice Cloning	✅ 3s clip	✅ 3s clip, zero-shot
Voice Design	✅ Gender, age	✅ Gender, age, accent, pitch, style, dialect
Audiobook / Stories	❌	✅ Full audiobook editor + multi-voice stories (EPUB/PDF import, .m4b export)
Languages	32	646
Video Dubbing	✅ Cloud-only	✅ Fully local
Data Privacy	Audio sent to cloud	Nothing leaves your machine
API Keys	Required	Not needed
GPU Support	N/A (cloud)	CUDA · Apple Silicon · ROCm (Linux) · CPU
Desktop App	❌	✅ macOS · Windows · Linux
TTS Engines	1	14 — full matrix
ASR Engines	1	11 — full lineup
MCP Server	❌	✅ Use from Claude, Cursor, any MCP client
Self-check	❌	✅ Diagnostics suite, error journal, scrubbed debug bundles
Customizable	❌ Closed	✅ Fork it, extend it, ship it

Professional-grade voice AI, minus the subscription and the cloud.

Convinced? Come build with us.

🖥️ System Requirements

	Minimum	Recommended
OS	Windows 10, macOS 13.3+ (Apple Silicon), Ubuntu 24.04+ (glibc 2.39+)	Any modern 64-bit OS
RAM	8 GB	16 GB+
VRAM (GPU)	4 GB (auto-offloads TTS to CPU)	8 GB+ (NVIDIA RTX 3060+)
Disk	10 GB free (models + cache)	20 GB+ SSD
Python	3.10+ (managed by `uv`)	3.11–3.12
GPU	Optional — CPU works	NVIDIA CUDA · Apple Silicon MPS · AMD ROCm (Linux only)

Note

A GPU is optional: the whole pipeline runs on CPU (just slower), and with ≤8 GB VRAM, TTS auto-offloads to CPU. Caveats: AMD ROCm is Linux-only and opt-in (see the Linux guide); AMD/Ryzen AI on Windows is CPU-only (see the Windows guide); Intel Macs can't run the local backend, so point the app at a remote one (#889, macOS guide).

🗣️ TTS Engines

14 engines, one picker. OmniVoice (default, 600+ languages) is always available; seven more are opt-in and auto-detected (CosyVoice 3, GPT-SoVITS, VoxCPM2, MOSS-TTS-Nano, KittenTTS, MLX-Audio, Sherpa-ONNX), plus six lazy-installed heavyweights (IndexTTS 2, OmniVoice GGUF, Supertonic 3, MOSS-TTS-v1.5, dots.tts, Confucius4-TTS). Switch in Settings → TTS Engine; the choice applies everywhere synthesis happens.

📊 The full matrix — 14 engines × platform × clone/instruct × license

Engine	Languages	Clone	Instruct	Linux	macOS ARM	Windows	License
OmniVoice (default)	600+	✅	✅	✅ CUDA/CPU	✅ MPS	✅ CUDA/CPU	Built-in
CosyVoice 3	9 + 18 dialects	✅	✅	✅ CUDA/CPU	✅ MPS	✅ CUDA/CPU	Apache-2.0
GPT-SoVITS	5	✅	—	✅ CUDA/CPU	—	✅ CUDA/CPU	MIT
VoxCPM2	30	✅	✅	✅ CUDA/CPU	✅ MPS	✅ CUDA/CPU	Apache-2.0
MOSS-TTS-Nano	20	✅	—	✅ CUDA/CPU	✅ CPU	✅ CUDA/CPU	Apache-2.0
KittenTTS	English	—	—	✅ CPU	✅ CPU	✅ CPU	MIT
MLX-Audio (Kokoro, Qwen3-TTS, CSM, Dia, …)	Multi	Varies	Varies	❌	✅ Native	❌	Varies
Sherpa-ONNX	20+	—	—	✅ CUDA/CPU	✅ CPU	✅ CUDA/CPU	Apache-2.0
IndexTTS 2 ⚡	Multi	✅	—	✅ CUDA	—	✅ CUDA	Apache-2.0
OmniVoice GGUF ⚡	600+	✅	✅	✅ CPU	✅ CPU	✅ CPU	Built-in
Supertonic 3 ⚡	31	—	—	✅ CPU	✅ CPU	✅ CPU	OpenRAIL-M
MOSS-TTS-v1.5 ⚡ (8B)	31	✅	—	✅ CUDA/CPU	✅ CPU	✅ CUDA/CPU	Apache-2.0
dots.tts ⚡ (2B)	24	✅	—	✅ CUDA/CPU	✅ CPU	❌	Apache-2.0
Confucius4-TTS ⚡	14	✅	—	✅ CUDA/CPU	✅ CPU	✅ CUDA/CPU	Apache-2.0

CUDA = GPU-accelerated · MPS = Apple Silicon Metal · CPU = runs everywhere, slower for large models · KittenTTS and MOSS-TTS-Nano run realtime on CPU · MLX-Audio is Apple Silicon only · ⚡ = lazy-registered (installed on first use)

Clone matters beyond single-clip generation: Video Dubbing (and any Batch job with a pinned voice) needs reference-audio cloning to preserve speaker identity, so picking a Clone-less engine (KittenTTS, Sherpa-ONNX, Supertonic 3) as the active engine fails those jobs up front with an actionable message instead of silently falling back to OmniVoice.
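
The fail-up-front behavior described above can be sketched as a small preflight check. This is an illustration only: the capability table is a subset transcribed from the matrix, while `CloneUnsupportedError` and `preflight_clone` are hypothetical names, not the app's real API.

```python
# Clone support per engine, transcribed from the matrix above (subset).
SUPPORTS_CLONE = {
    "OmniVoice": True,
    "CosyVoice 3": True,
    "GPT-SoVITS": True,
    "VoxCPM2": True,
    "KittenTTS": False,
    "Sherpa-ONNX": False,
    "Supertonic 3": False,
}


class CloneUnsupportedError(RuntimeError):
    """Hypothetical error type; the app's real exception is not documented here."""


def preflight_clone(engine: str, needs_clone: bool) -> None:
    """Raise before any work starts if the job needs cloning and the engine can't.

    Engines missing from the table are treated as clone-less (an assumption,
    chosen so the check errs on the side of failing early).
    """
    if needs_clone and not SUPPORTS_CLONE.get(engine, False):
        capable = sorted(name for name, ok in SUPPORTS_CLONE.items() if ok)
        raise CloneUnsupportedError(
            f"{engine} cannot clone a reference voice; "
            f"switch to one of: {', '.join(capable)}"
        )


preflight_clone("VoxCPM2", needs_clone=True)        # passes silently
try:
    preflight_clone("KittenTTS", needs_clone=True)  # a dubbing job, for example
except CloneUnsupportedError as err:
    print(err)
```

Checking before the job starts is what makes the message actionable: the user learns which engines would work, instead of getting a dub in the wrong voice.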

MOSS-TTS-v1.5 (8B, ~16 GB), dots.tts (2B, ~9 GB), and Confucius4-TTS are heavyweight opt-ins that run in their own isolated venv from a local clone. None claims Apple-Silicon MPS (CPU on Macs); dots.tts has no Windows path; Confucius4 wants CUDA (CPU works, ~17× realtime). Details: MOSS-TTS-v1.5 · dots.tts · Confucius4-TTS.

🎧 ASR Engines

11 engines — they power dictation, video dubbing, and subtitles. WhisperX is the cross-platform default (~100 languages, word-level timing); the rest are opt-in and auto-detected. Switch in Settings → Engines. Ten run fully on-device; the eleventh (OpenAI-compatible) is an optional remote client for Qwen3-ASR or any compatible server.

📊 The full lineup — 11 engines, what each is best at, and compute-type notes

Engine	`OMNIVOICE_ASR_BACKEND`	Languages	Best for
WhisperX (default)	`whisperx`	~100	Dubbing & subtitles — word-level timing via wav2vec2 forced alignment
Faster-Whisper	`faster-whisper`	~100	Fast transcription on Linux / macOS / Windows (CTranslate2)
Faster-Whisper (isolated)	`faster-whisper-isolated`	~100	Same as Faster-Whisper but crash-isolated in a subprocess — an ASR crash won't take down the app
MLX Whisper	`mlx-whisper`	~100	Native Apple Silicon speed (Apple MLX / Metal)
PyTorch Whisper	`pytorch-whisper`	~100	CUDA / CPU fallback via 🤗 Transformers (no cuDNN 8 needed)
Parakeet TDT	`nemo-parakeet`	English + 25 EU	SOTA accuracy at ~10× realtime even on CPU, auto language detection (NVIDIA NeMo, CUDA/CPU)
Parakeet TDT v3 (MLX)	`parakeet-mlx`	25 EU	The Parakeet tier for Apple Silicon — TDT word timestamps, ~2 GB unified memory, dictation-grade speed on the GPU via MLX. Install the model from Settings → Models and dictation prefers it automatically when your system language is one of its 25 (European) languages; other languages (CJK, Arabic, …) keep the multilingual Whisper engine so dictation coverage never regresses.
Moonshine	`moonshine`	English	Edge / low-latency, ONNX
FunASR	`funasr`	50+	All-in-one multilingual — built-in VAD + inline speaker diarization (SenseVoice)
sherpa-onnx (live dictation)	`sherpa-onnx-asr`	25 EU + 90+	Live, faster-than-real-time dictation — small streaming/offline ONNX models (Parakeet TDT v3/v2, streaming Zipformer & Paraformer, Whisper Tiny), CPU, identical on macOS / Windows / Linux. Picked per-model in Settings → Voice.
OpenAI-compatible ⚠️ remote	`openai-compat-asr`	Server-dependent	A path to Qwen3-ASR today (self-hosted server, no transformers wait), any OpenAI-compatible transcription endpoint, or OpenAI's own API — no install, configure + test the connection in Settings → Engines (ASR tab). Audio leaves your machine to whatever server you point it at; see docs/engines/openai-compatible-asr.md.
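
The `OMNIVOICE_ASR_BACKEND` column lists the identifier for each engine. As a sketch, assuming the value is read from an environment variable of that name and that an unset value means the WhisperX default, a launcher script could validate it as below. `resolve_asr_backend` is a hypothetical helper, and how the app itself treats unknown values is not documented here.

```python
import os
from typing import Mapping

# Identifiers copied from the OMNIVOICE_ASR_BACKEND column above.
KNOWN_ASR_BACKENDS = frozenset({
    "whisperx", "faster-whisper", "faster-whisper-isolated", "mlx-whisper",
    "pytorch-whisper", "nemo-parakeet", "parakeet-mlx", "moonshine",
    "funasr", "sherpa-onnx-asr", "openai-compat-asr",
})
DEFAULT_ASR_BACKEND = "whisperx"  # the cross-platform default named above


def resolve_asr_backend(env: Mapping[str, str] = os.environ) -> str:
    """Return the requested backend identifier, or the default when unset."""
    value = env.get("OMNIVOICE_ASR_BACKEND", "").strip().lower()
    if not value:
        return DEFAULT_ASR_BACKEND
    if value not in KNOWN_ASR_BACKENDS:
        raise ValueError(
            f"unknown ASR backend {value!r}; "
            f"expected one of: {', '.join(sorted(KNOWN_ASR_BACKENDS))}"
        )
    return value


print(resolve_asr_backend({"OMNIVOICE_ASR_BACKEND": "parakeet-mlx"}))
```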

Whisper-family engines cover ~100 languages; FunASR / SenseVoice adds an all-in-one multilingual path with built-in voice-activity detection and inline speaker diarization. sherpa-onnx powers the live dictation model picker, so text appears as you speak. Every engine except the optional OpenAI-compatible remote client runs on-device, with no API keys and no cloud.

GPU without efficient float16? On older NVIDIA GPUs (Maxwell/Pascal, GTX 16xx), or after a CTranslate2/cuDNN mismatch, the CTranslate2 ASR engines (WhisperX, Faster-Whisper) can't run float16, so OmniVoice automatically retries on int8 with no config needed. If transcription still fails, pin the compute type as an escape hatch: set the ASR_COMPUTE_TYPE env var to int8 (or float32 on CPU) and restart the backend.
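
The retry-then-pin behavior can be pictured as follows. This is a sketch under stated assumptions, not the backend's code: `load_with_fallback` and the `load` callable are hypothetical, and the exception types caught here are a guess at what a failed float16 load raises.

```python
import os
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


def load_with_fallback(
    load: Callable[[str], T],
    env: Mapping[str, str] = os.environ,
) -> tuple[T, str]:
    """Try float16 first, then int8, unless ASR_COMPUTE_TYPE pins a single type.

    `load` stands in for whatever builds the ASR model for a compute type.
    Returns the loaded model together with the compute type that worked.
    """
    pinned = env.get("ASR_COMPUTE_TYPE")
    candidates = [pinned] if pinned else ["float16", "int8"]
    last_error: Optional[Exception] = None
    for compute_type in candidates:
        try:
            return load(compute_type), compute_type
        except (RuntimeError, ValueError) as exc:  # assumed failure types
            last_error = exc
    raise RuntimeError(f"no usable compute type among {candidates}") from last_error


def fake_load(compute_type: str) -> str:
    """Pretend to be a GPU without efficient float16 support."""
    if compute_type == "float16":
        raise ValueError("float16 is not supported on this device")
    return f"model[{compute_type}]"


print(load_with_fallback(fake_load, env={}))
```

Pinning skips the probing entirely, which is why it works as an escape hatch when the automatic retry itself is what fails.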

🏗️ Architecture

A Tauri v2 desktop shell (Rust) wraps a React UI and a bundled Python/FastAPI backend that runs as a local sidecar on localhost:3900. Nothing external — every layer is on your machine.

┌────────────────────────────────────────────────────────────────────┐
│  Tauri v2 shell — Rust                                             │
│  window state · global dictation hotkey · system tray ·           │
│  signed auto-updater (stable/preview) · single-instance ·         │
│  first-run bootstrap (installs uv + Python venv) · blank guard    │
├────────────────────────────────────────────────────────────────────┤
│  Frontend — React + Vite                                          │
│  Studio · Dub · Stories · Audiobook · Gallery · Dictation ·       │
│  Batch · Diagnostics · MCP client    —   Zustand store · WS bus   │
│                          ▲  IPC  /  HTTP + WS                      │
├──────────────────────────┼─────────────────────────────────────────┤
│  Backend — FastAPI sidecar @ localhost:3900                       │
│  100+ REST endpoints · SSE + WebSocket streaming ·               │
│  SQLite + Alembic (omnivoice_data/) · OpenAI-compatible API       │
├───────────┬───────────┬───────────┬───────────┬────────────────────┤
│  TTS ×14  │  ASR ×11  │  Demucs   │ Pyannote  │  AudioSeal         │
│  clone /  │  WhisperX │  vocal    │  speaker  │  watermark         │
│  design   │  +10 more │  isolation│  diariz.  │  embed / detect    │
├───────────┴───────────┴───────────┴───────────┴────────────────────┤
│  Engine routing — per-engine GPU preflight, no silent CPU fallback │
│  Hardware:  CUDA · MPS · ROCm (Linux) · CPU   (auto-detected)      │
└────────────────────────────────────────────────────────────────────┘

Shell (Rust) — native OS integration: the system-wide dictation hotkey, tray, signed auto-updater (stable + preview channels), single-instance lock, and the first-run bootstrap that installs uv and a Python 3.11 venv.
Frontend (React) — every workspace tab over a Zustand store, with a WebSocket event bus that live-refreshes the UI when backend data changes.
Backend (FastAPI) — the bundled Python sidecar: 100+ endpoints, SSE/WebSocket streaming, a SQLite DB migrated by Alembic, and the OpenAI-compatible API surface.
Engines — 14 TTS + 11 ASR, plus Demucs (isolation), Pyannote (diarization), and AudioSeal (watermark), all behind routing that GPU-preflights each engine and refuses to silently fall back to CPU.

🔌 OpenAI-compatible API

Drop-in replacement for OpenAI / ElevenLabs audio. Change one line; no key and no other code changes are needed:

```diff
- base_url="https://api.openai.com/v1"
+ base_url="http://localhost:3900/v1"
```

Your existing scripts, agents, and OpenAI/ElevenLabs SDK calls now run locally on whatever engine you have active. What the cloud can't do: voice takes your own cloned-voice profile IDs, and model can pin a specific engine per request.

Endpoint	What it does
`POST /v1/audio/speech`	TTS — text in; `mp3` / `opus` / `aac` / `flac` / `wav` / `pcm` out. `model`: `tts-1`/`tts-1-hd` (active engine) or a specific one (`voxcpm2`, `cosyvoice`, `kittentts`, …). `voice`: a cloned profile ID, `default`, or an OpenAI name (`alloy`, …). `speed` supported.
`POST /v1/audio/transcriptions`	STT — audio file in; `json` / `text` / `verbose_json` / `srt` / `vtt` out (`verbose_json` adds word-level timings). `whisper-1` maps to your active ASR engine.
`GET /v1/audio/voices`	OmniVoice extension — lists every voice profile and engine, so clients can discover your clones.

Speak with your own cloned voice — list the IDs, then pass one as voice:

```bash
# 1. Find a cloned voice's profile ID
curl -s http://localhost:3900/v1/audio/voices | jq '.voices[] | select(.type=="profile") | {voice_id, name}'

# 2. Synthesize with it
curl http://localhost:3900/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","voice":"<profile-id>","input":"Made on my own hardware.","response_format":"wav"}' \
  --output speech.wav
```

```python
from openai import OpenAI

# Any api_key string works: the local backend does not check it by default.
client = OpenAI(base_url="http://localhost:3900/v1", api_key="none")

# TTS with your cloned voice (or "alloy" / "default"; model= can pin a specific engine).
# response_format="wav" matches the .wav file name used below.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="<profile-id>",
    input="Made on my own hardware.",
    response_format="wav",
) as response:
    response.stream_to_file("speech.wav")

# STT: whisper-1 maps to your active ASR engine.
with open("clip.wav", "rb") as audio_file:
    print(client.audio.transcriptions.create(model="whisper-1", file=audio_file).text)
```
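
The OpenAI SDK has no call for the `/v1/audio/voices` extension, so profile discovery from Python needs a plain HTTP request. Below is a standard-library sketch that mirrors the `jq` filter from step 1. The field names (`voices`, `type`, `voice_id`, `name`) are taken from that filter; the sample payload is made up for illustration, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:3900/v1"


def cloned_profiles(payload: dict) -> list[dict]:
    """Keep only cloned-voice profiles, like `select(.type=="profile")` in jq."""
    return [
        {"voice_id": voice["voice_id"], "name": voice["name"]}
        for voice in payload.get("voices", [])
        if voice.get("type") == "profile"
    ]


def fetch_cloned_profiles(base_url: str = BASE_URL) -> list[dict]:
    """Query the running backend; requires OmniVoice Studio to be up locally."""
    with urlopen(f"{base_url}/audio/voices", timeout=10) as response:
        return cloned_profiles(json.load(response))


# Offline demonstration with a made-up payload of the assumed shape.
sample = {
    "voices": [
        {"type": "profile", "voice_id": "example-id", "name": "My clone"},
        {"type": "engine", "voice_id": "default", "name": "Default"},
    ]
}
print(cloned_profiles(sample))
```

Any `voice_id` returned this way can be passed as `voice=` in the speech call.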

Want the whole surface (100+ endpoints)? The full REST API reference is embedded in the app — Settings → OpenAPI Reference (Scalar-powered), or the {} button in the footer.

Calling the backend from another machine (LAN, Tailscale, behind a proxy)? It's loopback-only and unauthenticated by default; to reach it remotely you set a share PIN or an API key. docs/api-auth.md covers the exact headers, query params, 401/403/429 meanings, and the OMNIVOICE_TRUSTED_NETWORKS exemption.

📓 Run on Google Colab

No local GPU? The official notebook boots the full app — web UI included — on a free Colab T4, then walks the whole feature surface (TTS, cloning, design, transcription, dubbing, audiobook, watermarking, the OpenAI-compatible API) as a guided tour with inline playback. No tunnels, no API keys.

🤝 Agent Skills

Teach your coding agent to speak and listen through your local OmniVoice — one command, works with Claude Code, Codex, Cursor, Grok, Kimi, opencode, and any skills.sh-compatible agent:

npx skills add debpalash/omnivoice-studio

Ships two skills:

omnivoice — generate speech (including your cloned voices) and transcribe audio from any agent, free and fully offline via your local install.
oss-maintainer — the maintainer methodology this project is run with, for anyone running their own OSS project with an agent.

🗺️ Roadmap

🔜 Up Next

🎬 Lip-sync v2 — visual speech timing with wav2lip
🌐 Hosted Demo — try OmniVoice without installing anything
🔌 Plugin Marketplace — community-contributed TTS engines and effects
🎵 Real-time Voice Changer — live microphone transformation during calls

✅ Everything shipped so far — the receipts, by category

Category	Features
Longform	Audiobook editor (text/EPUB/PDF → chaptered .m4b) with multi-voice cast, expressive controls, live per-chapter progress + Stop, and a one-click sample; Stories multi-voice editor, two-pass loudnorm mastering, crash-resume for interrupted renders, pronunciation control + SSML-lite prosody
Dubbing	Full pipeline (transcribe→translate→synthesize→mux), scene-aware splitting, lip-sync scoring, streaming TTS, per-speaker voice assignment, Smart Fit timing + second-pass QC, paste-in translations from any external tool, dedicated Dub home
Voice	Zero-shot cloning, voice design, A/B comparison, voice preview widget, gallery with favorites/tags (its voices selectable in every picker — Studio, Audiobook, Stories, Dubbing), portable persona bundles (`.ovsvoice`), voice console workspace
Audio	Demucs vocal isolation, per-segment gain, selective track export, stem/SRT/VTT/MP3 export, unlimited-length TTS via sentence-chunked generation
Multi-Lang	Multi-language batch picker, batch dubbing queue with sequential GPU execution
Diarization	Pyannote ML diarization, auto speaker clone extraction, per-speaker voice assignment
ASR	11 engines (WhisperX, Faster-Whisper, isolated Faster-Whisper, MLX Whisper, PyTorch Whisper, Parakeet TDT, Parakeet TDT v3 MLX, Moonshine, FunASR/SenseVoice, sherpa-onnx live dictation, OpenAI-compatible remote), crash-isolated subprocess backend
TTS	14 engines (OmniVoice, CosyVoice 3, GPT-SoVITS, VoxCPM2, MOSS-TTS-Nano, KittenTTS, MLX-Audio, Sherpa-ONNX, + lazy: IndexTTS 2, OmniVoice GGUF, Supertonic 3, MOSS-TTS-v1.5, dots.tts, Confucius4-TTS), engine routing with GPU preflight
Infra	Docker deployment, CUDA/MPS/ROCm auto-detect, cuDNN 8 compat, VRAM-aware model offloading, engine routing (no silent CPU fallback), diagnostics suite & error journal, restricted-network mirror support
AI Provenance	AudioSeal invisible watermarking (SynthID-like), video logo overlay, watermark detection API
UX	Undo/redo, keyboard shortcuts, drag-and-drop, session persistence, glassmorphism design system, UI scale fix for Linux/WebKitGTK
Real-time Events	WebSocket event bus — instant sidebar refresh on data mutations, exponential backoff reconnect
State Management	Zustand store migration — `uiSlice`, `pillSlice`, `dubSlice`, `generateSlice`, `prefsSlice`, `glossarySlice`
Desktop	Cross-platform Tauri installers (macOS DMG — Apple Silicon; Intel unsupported for the local backend, #889 — Windows MSI, Linux deb/AppImage), auto-update infrastructure, single-instance enforcement, close-to-tray, macOS Gatekeeper fix
Dictation	Global system-wide hotkey (`⌘+⇧+Space`), frameless floating widget, streaming ASR via WebSocket, auto-paste, customizable hotkey, local-LLM transcript refinement
Batch Pipeline	Full batch TTS: extract → transcribe → translate → generate → mix → export, with live progress tracking
MCP Server	OmniVoice as a local TTS/STT provider for Claude, Cursor, and any MCP client
Remote Backend	Point the desktop UI at a remote backend URL with bearer auth (Tailscale-documented)
Reliability	Stall watchdog on bootstrap splash, per-engine GPU compatibility matrix, actionable errors for non-executable engine binaries, setuptools auto-repair

💜 Sponsor / Donate

One developer, real AI-agent bills. If OmniVoice is useful to you, chipping in keeps development full-time — every dollar goes straight to the bills.

This month's agent-bill fund: $10 / $200

_{Also from the maker: Opal 💠 · memxt 🧠 — a ⭐ helps too.}

🌟 Sponsors

OmniVoice is free and AGPL-3.0 — no paid tier, no SaaS revenue. Sponsors keep development going, and in return get a logo slot here, in the app, and (for top tiers) on the project website. It's a thank-you, never a paywall. See tiers & become a sponsor →

Your logo here — become a sponsor

_{💡 GitHub also shows a Sponsor button at the top of this repo, wired to the same links via .github/FUNDING.yml.}

💬 Community

_{We respond to setup questions within hours, not days.}

What happens in there

Channel	What happens there
`#announcements`	Release news and the big moments — new versions land here first
`#releases` + `#changelog`	Every build and exactly what's inside it
`#issues`	Bug reports as forum posts — triaged straight into GitHub issues
`#ideas`	Feature requests, discussed and voted on
`#discuss-ideas`	Design talk before things get built
`#general`	Setup help, GPU troubleshooting, and showing off your dubs

🤝 Contributing

Yes please — bug fixes, new TTS engine adapters, UI improvements, docs, translations. All of it.

📖 Read the Contributing Guide for setup, code style, and PR workflow
🐛 Browse good first issues
💬 Join our Discord to discuss ideas or ask for help
𝕏 Follow @idebpalash for updates and what's being built next

❓ FAQ

Is this really as good as ElevenLabs?

Honest answer: it depends on what you're doing.

Where OmniVoice is genuinely competitive: voice cloning from a clean reference clip (state-of-the-art open diffusion TTS), language coverage (646 languages vs. their 32), and everything structural — no per-character billing, no usage caps, no audio leaving your machine, full pipeline customizability (14 TTS engines, 11 ASR engines, your choice of translation).

Where ElevenLabs still wins: out-of-the-box consistency and polish, especially for English TTS. Their one model is heavily tuned; our quality depends on which engine you pick, your hardware, and — for cloning — the reference audio (a dry, close-mic clip clones dramatically better than a noisy or echoey one).

For dubbing specifically: a dub is a chain — transcription → translation → cloning → synthesis — only as good as its weakest link on your source material. If parts come out incoherent, check the segment table's original text first: when the transcription is already wrong, switch the ASR engine or use cleaner source audio — that's usually the fix, not the voice.

Try it on your real material — it's free and takes one download. Many users replace ElevenLabs outright; some keep both. Both outcomes are fine with us.

Why doesn't a longer reference clip sound more like me?

Because OmniVoice's cloning is zero-shot: your clip is a prompt the model conditions on at generation time — it is never trained on. Feeding it 2 hours doesn't teach it your voice; past a short window the extra audio is simply not used. The dubbing pipeline's reference builder targets ~8 s and hard-caps at 15 s (backend/services/speaker_clone.py), and engines cap the prompt themselves (VoxCPM2 trims references to 30 s). This is different from ElevenLabs Professional Voice Cloning, which fine-tunes a model on hours of your audio — that's a training job, not a bigger prompt.

What actually moves clone quality is the clip, not its length. Zero-shot cloning mirrors the acoustics and delivery of the prompt, so: record 5–15 seconds (~8 s is the sweet spot) of continuous natural speech, close to the mic, in a quiet room with no reverb or music — an echoey clip clones echoey. One speaker only, and read in the tone and pace you want the output to have, because the clone copies your delivery, not just your timbre. Recording a few candidate clips and comparing results beats any amount of extra footage.
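
To check candidate clips against that guidance before importing them, a small standard-library helper is enough. This is a sketch for uncompressed WAV files only; `trim_reference` and `in_recommended_window` are hypothetical names, and this is not the logic in backend/services/speaker_clone.py.

```python
import wave

MAX_SECONDS = 15.0  # the hard cap quoted above


def in_recommended_window(seconds: float) -> bool:
    """True when a clip falls in the 5 to 15 second range recommended above."""
    return 5.0 <= seconds <= MAX_SECONDS


def trim_reference(src: str, dst: str, max_seconds: float = MAX_SECONDS) -> float:
    """Copy src to dst keeping at most max_seconds of audio; return seconds kept."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)     # frame count in the header is fixed up on close
        writer.writeframes(frames)
    return keep / params.framerate
```

Trimming only removes audio the model would ignore anyway; it does not fix reverb, noise, or a flat delivery, which matter far more.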

Want audiobook-grade, trained-on-your-voice fidelity? That path exists, but it's offline fine-tuning, not an in-app button: prepare a dataset of your recordings (docs/data_preparation.md) and fine-tune the bundled checkpoint via init_from_checkpoint (docs/training.md). Fair warning — it's a technical, command-line workflow that needs a capable GPU and hours of transcribed audio. In-app fine-tuning / long-reference "professional" cloning is on the roadmap as research only; no promised date.

Does it work on Apple Silicon (M1/M2/M3/M4)?

Yes. MPS acceleration is auto-detected. MLX-optimized Whisper models are available for faster transcription on Apple hardware. Intel Macs are not supported: the app UI installs, but the local Python backend cannot run because PyTorch no longer ships Intel-Mac wheels (#889) — an Intel Mac can only be used with a remote backend.

How much VRAM do I need?

4 GB minimum. With ≤8 GB, the TTS model is automatically offloaded to CPU during transcription. With 8+ GB, everything runs on GPU simultaneously. No GPU at all? CPU mode works — just slower (~3× for TTS).
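
Those tiers amount to a small decision table. The sketch below is illustrative only: `plan_placement` and its labels are hypothetical, and it assumes exactly 8 GB falls in the offloading tier, following the "≤8 GB" wording.

```python
from typing import Optional


def plan_placement(vram_gb: Optional[float]) -> dict[str, str]:
    """Map available VRAM (None means no GPU) to where ASR and TTS run."""
    if vram_gb is None:
        # No GPU at all: everything runs on CPU, just slower.
        return {"asr": "cpu", "tts": "cpu"}
    if vram_gb <= 8:
        # Small GPU: TTS is moved to CPU while transcription holds the GPU.
        return {"asr": "gpu", "tts": "offloaded-to-cpu-during-transcription"}
    # Enough memory for both models to stay resident at once.
    return {"asr": "gpu", "tts": "gpu"}


for vram in (None, 6, 12):
    print(vram, plan_placement(vram))
```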

Can I use this commercially?

Yes — commercial use is free under the AGPL-3.0: run it, sell the audio you make, dub client videos, deploy it across your team. One obligation: if you modify OmniVoice and offer the modified version to others over a network, you must share that modified source under the same terms. Embedding it in a closed-source product instead? A commercial license is available — see License.

What languages are supported?

646 languages for TTS via the OmniVoice model. Transcription (WhisperX) supports 99 languages. Translation coverage depends on the target language pair.

Can I add my own TTS engine?

Yes. Subclass TTSBackend in backend/services/tts_backend.py and add it to the _REGISTRY dictionary — ~50 lines. The fourteen built-in engines all work this way; see TTS Engines.

Does OmniVoice collect any data about me?

Not unless you explicitly say yes. On first run the app asks — one screen, two equal-weight buttons, no pre-ticked box — and until you answer yes, OmniVoice sends nothing: no analytics, no telemetry, no accounts, no phone-home. Skipping the question means no. Your text, audio, voices, and projects never leave your machine either way.

If you do opt in (also togglable anytime under Settings → Privacy → "Help improve OmniVoice"), what's sent is anonymous, content-free usage stats: generations (engine, language, generation time, character count, error type), plus app lifecycle events: an install ping, updates (version-to-version), crashes (error class and a bucketed uptime, never logs), error types (capped, deduplicated), and a single uninstall ping if you remove it. Your text, audio, file names, and anything identifying are never sent; this is enforced in code by a property allowlist (backend/core/analytics.py), not just a promise. Every build (installer, Docker, or built from source) asks the same first-run question and stays off unless you say yes. Events go to PostHog using its publishable, write-only client key. Your own numbers live in Settings → Usage, computed locally, sent nowhere.

How do I uninstall it / remove all its data?

OmniVoice is fully local — uninstalling is just deleting the app plus the folders it wrote (model cache, Python env, your voices/projects, config). Run scripts/uninstall.sh (macOS/Linux) or scripts\uninstall.ps1 (Windows) — it prints every folder with its size as a dry-run first, then deletes on --yes. The full per-platform path list and app-removal steps are in docs/install/uninstall.md.

📜 License

OmniVoice Studio is free and open-source software under the GNU Affero General Public License v3.0 (AGPL-3.0).

Free for any use — including commercial and internal business use. Run it, sell the audio you produce with it, dub your own or clients' videos, roll it out across your team — all free, no license needed. As a network copyleft license, AGPL adds one obligation: if you modify OmniVoice Studio and offer that modified version to others over a network, you must make the complete corresponding source of your modified version available to them under the same AGPL-3.0 terms.

A commercial license is available for organizations that want to embed OmniVoice Studio in a closed-source or proprietary product or service without the AGPL-3.0 copyleft obligations. Pricing tiers coming soon. Inquiries: OmniVoice@palash.dev.

The bundled omnivoice/ TTS model by Han Zhu remains Apache-2.0 upstream. See LICENSE for the full, binding terms, and LICENSE-NOTICE.md for the plain-language summary and scope.

🙏 Acknowledgments

OmniVoice Studio is built on the shoulders of exceptional open-source work:

Project	Role
OmniVoice (k2-fsa)	Zero-shot diffusion TTS engine — the core voice synthesis model
WhisperX	Word-level speech recognition and alignment
Demucs (Meta)	Music source separation for vocal isolation
Pyannote	Speaker diarization — who said what
CTranslate2	Optimized Transformer inference on CPU and GPU
AudioSeal (Meta)	Invisible neural audio watermarking for AI provenance
Tauri	Native desktop app framework
Supertone / Supertonic 3	ONNX TTS engine — 31 languages, CPU-efficient
Sherpa-ONNX	WASM-ready universal TTS/ASR runtime
GPT-SoVITS	Zero-shot TTS engine — 5 languages, RTF 0.014

🧰 More local open-source from the maker

Like the local-first philosophy? It runs in the family — same maker, same rule: your data stays on your machine.

Opal 💠

Play everything. The media player for the AI era.

_{Video, anime, comics, torrents, Jellyfin & Plex — one player for all of it, with local AI memory and context built in. Written in Zig, runs on macOS & Windows.}

memxt 🧠

The fastest benchmarked open-source AI memory system.

_{Local long-term memory for Claude Code and coding agents — an MCP server on SQLite + embeddings, 100% on your machine. Your agent finally remembers yesterday.}

If you read this far, you're our kind of person.
⭐ Star this repo so others can find it too.
💬 Join the Discord to share what you build.
❤️ Support development — fund the AI agent bills that keep OmniVoice shipping.
