Break language barriers with cinema-quality video translation — privately, on your own hardware.
Transform any video into a professional multilingual production with natural voice cloning, lip-sync, and on-screen text translation. No cloud APIs, no subscriptions, no data leaving your machine.
📝 Enjoying the project? Please star it ⭐️! It helps me gauge interest and keep working on new features.
🎥 Click the image above to watch the full demo on YouTube!
Traditional dubbing is expensive, time-consuming, and requires professional studios. AI Video Translator democratizes video localization by bringing Hollywood-grade technology to your desktop:
- 🎬 Content Creators: Expand your audience globally without hiring voice actors
- 🎓 Educators: Make training content accessible in any language
- 📰 Journalists & Documentarians: Localize footage for international audiences
- 🎮 Game Developers: Dub cutscenes and trailers cost-effectively
- 🏢 Businesses: Translate corporate videos, presentations, and webinars
- 🔒 Privacy-Focused Users: Keep sensitive content 100% local
Upload a video, select your target language, and let the AI handle everything:
📹 Input Video (English) → 🤖 AI Pipeline → 📹 Output Video (French, with cloned voice & synced lips)
The full pipeline includes:
- Vocal Separation — Isolates speech from music/sound effects
- Transcription — Converts speech to text with word-level precision
- Translation — Translates text using local LLMs or Google Translate
- Voice Cloning — Regenerates speech in the target language with the original speaker's voice
- Lip-Sync — Adjusts mouth movements to match the new audio
- Visual Text Translation — Detects and replaces on-screen text (subtitles, signs, etc.)
- Audio Enhancement — Cleans and restores generated speech for broadcast quality
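To make the flow concrete, here is the kind of FFmpeg call the first stage boils down to (a minimal sketch with placeholder filenames; the app runs this for you):

```python
# Minimal sketch of stage 1: pull a mono 16 kHz WAV out of a video with
# FFmpeg. Filenames are placeholders; the app handles this internally.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "input.mp4",       # source video
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
        "-ar", "16000",          # 16 kHz, a common ASR sample rate
        "-ac", "1",              # mono
        "audio.wav",
    ],
    check=True,
)
```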
| Feature | Technology | Description |
|---|---|---|
| Vocal Separation | HDemucs (Meta) | Cleanly separates speech from background music/sfx with GPU chunking for long videos |
| Transcription | Faster-Whisper (Large v3 Turbo) | 30-50% faster with Silero VAD preprocessing and word-level confidence filtering |
| Speaker Diarization | NeMo MSDD / SpeechBrain | Identifies individual speakers for multi-voice dubbing |
| EQ Spectral Matching | Custom | Applies original voice tonal characteristics to TTS output |
| Voice Enhancement | VoiceFixer | Restores degraded speech and removes noise (optional) |
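As an example of the separation step, below is a minimal sketch using torchaudio's bundled HDemucs weights. This is an assumption about the wiring, not the project's exact code, and it skips the GPU chunking the app applies to long inputs:

```python
# Hedged sketch: vocal separation with torchaudio's bundled HDemucs.
# Suitable for short clips only; the app chunks long audio on the GPU.
import torch
import torchaudio

bundle = torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS
model = bundle.get_model().eval()

wav, sr = torchaudio.load("audio.wav")
if wav.shape[0] == 1:
    wav = wav.repeat(2, 1)                      # model expects stereo
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.no_grad():
    sources = model(wav.unsqueeze(0))[0]        # (source, channel, time)
vocals = sources[model.sources.index("vocals")]
torchaudio.save("vocals.wav", vocals, bundle.sample_rate)
```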
| Model | Type | Best For |
|---|---|---|
| Google Translate | Online | Fast, reliable everyday translation |
| Tencent HY-MT1.5 | Local (1.8B) | Better context preservation |
| Llama 3.1 8B Instruct | Local | Nuanced, human-like translations |
| ALMA-R 7B | Local | State-of-the-art translation quality |
All local models support context-aware mode using full-transcript context for superior coherence.
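As a rough illustration of context-aware mode, the sketch below feeds the full transcript into a local Llama 3.1 prompt via Hugging Face `transformers`. The prompt wording and generation settings are assumptions, not the project's actual implementation:

```python
# Illustrative only: passing full-transcript context to a local LLM so
# each segment is translated coherently. Requires HF_TOKEN (gated model)
# and enough VRAM for the 8B weights.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",
)

def translate_segment(segment: str, transcript: str, target: str) -> str:
    prompt = (
        f"Full transcript for context:\n{transcript}\n\n"
        f"Translate the following segment into {target}, keeping names "
        f"and terminology consistent with the context. Segment:\n{segment}"
    )
    out = generator(prompt, max_new_tokens=256, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()
```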
| Model | Type | Highlights |
|---|---|---|
| Edge-TTS | Online | Natural Microsoft voices, zero GPU needed |
| Piper TTS | Local | Robust offline neural TTS (auto-downloaded) |
| XTTS-v2 | Local | High-fidelity voice cloning with emotion control (Happy, Sad, Angry) |
| F5-TTS | Local | Ultra-fast zero-shot voice cloning with Sway Sampling |
| VibeVoice | Local | Microsoft's frontier long-form multi-speaker TTS (1.5B/7B) |
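For a taste of the online option, a minimal Edge-TTS call (using the `edge-tts` package and one of Microsoft's published voice names) looks like this:

```python
# Minimal Edge-TTS example: online, no GPU needed. Any voice from
# `edge-tts --list-voices` works here.
import asyncio
import edge_tts

async def main() -> None:
    tts = edge_tts.Communicate("Bonjour tout le monde !", voice="fr-FR-DeniseNeural")
    await tts.save("hello_fr.mp3")

asyncio.run(main())
```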
| Feature | Technology | Description |
|---|---|---|
| Lip-Sync (Fast) | Wav2Lip-GAN | Smooth, blended lip synchronization |
| Lip-Sync (HD) | Wav2Lip + GFPGAN | Face restoration eliminates blurriness |
| Lip-Sync (Cinema) | LivePortrait | State-of-the-art cinematic lip sync with natural facial animation |
| Visual Text Translation | PaddleOCR / EasyOCR | Detects and replaces on-screen text with OpenCV inpainting |
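Visual text replacement boils down to erase-then-redraw. Below is a hedged OpenCV sketch of that idea; the box coordinates and replacement string are placeholders, and real detection comes from PaddleOCR or EasyOCR:

```python
# Hedged sketch of the erase-and-redraw idea behind visual text
# translation. Box coordinates and text below are placeholders.
import cv2
import numpy as np

frame = cv2.imread("frame.png")
x, y, w, h = 40, 40, 220, 32                 # hypothetical detected text box

mask = np.zeros(frame.shape[:2], dtype=np.uint8)
mask[y:y + h, x:x + w] = 255                 # mark the region to erase
clean = cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)

cv2.putText(clean, "Texte traduit", (x, y + h - 8),
            cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
cv2.imwrite("frame_translated.png", clean)
```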
Video Translator supports a wide range of languages for both source and target translation.

| Language | Code |
|---|---|
| Auto Detect | auto |
| English | en |
| Spanish | es |
| French | fr |
| German | de |
| Italian | it |
| Portuguese | pt |
| Polish | pl |
| Turkish | tr |
| Russian | ru |
| Dutch | nl |
| Czech | cs |
| Arabic | ar |
| Chinese (Simplified) | zh |
| Japanese | ja |
| Korean | ko |
| Hindi | hi |
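For scripting, the table above maps naturally onto a small lookup (illustrative; the UI exposes the same choices):

```python
# The language table as a lookup, handy for scripting against the app
LANGUAGES = {
    "Auto Detect": "auto", "English": "en", "Spanish": "es", "French": "fr",
    "German": "de", "Italian": "it", "Portuguese": "pt", "Polish": "pl",
    "Turkish": "tr", "Russian": "ru", "Dutch": "nl", "Czech": "cs",
    "Arabic": "ar", "Chinese (Simplified)": "zh", "Japanese": "ja",
    "Korean": "ko", "Hindi": "hi",
}
```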
- 🖥️ Friendly Gradio UI — Easy drag-and-drop interface
- 🎛️ Fine-Grained Control — Beam size, VAD settings, voice selection, and more
- 👤 LivePortrait Lip-Sync — State-of-the-art lip synchronizer with TensorRT acceleration support
- 🖼️ Visual Text Translation — Detects, translates, and seamlessly replaces text in video frames (cached for speed)
- 📝 Auto-Generated Subtitles — Exports `.srt` files alongside translated videos
- 🔄 Smart Segment Merging — Combines choppy phrases into natural sentences
- ⏳ Real-time Progress & ETA — Track detailed progress with estimated time remaining
- 🧹 VoiceFixer Enhancement — Restores and cleans up generated audio for studio quality
- ⚡ GPU Optimized — One-model-at-a-time policy for maximum VRAM efficiency
- 🛡️ Global CPU Fallback — Automatically switches to CPU if GPU fails
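The CPU-fallback idea from the last bullet can be sketched in a few lines of PyTorch (illustrative only; the app's actual logic may differ):

```python
# Illustrative sketch of a global CPU fallback
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        try:
            torch.zeros(1, device="cuda")    # probe: does the GPU respond?
            return torch.device("cuda")
        except RuntimeError:
            pass                             # GPU failed, fall through
    return torch.device("cpu")

print(pick_device())
```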
"I have 50 English tutorials and want to reach Spanish speakers."
Upload each video, select English → Spanish, and export professional dubs with your cloned voice. No re-recording needed!
"Our compliance training is in English but we have offices in 12 countries."
Batch-translate training videos while maintaining the presenter's voice for authenticity. Export with or without subtitles.
"I want my indie film to premiere at international festivals."
Use LivePortrait (Cinema) lip-sync for cinema-quality dubbing that doesn't look like a bad overdub.
"We need our 30-second ad in French, German, and Japanese by tomorrow."
Process multiple language versions simultaneously with local LLM translation for brand-appropriate messaging.
"Our video contains confidential product demos."
Everything runs locally — no data leaves your machine. Perfect for legal teams, medical content, or proprietary information.

| Requirement | Details |
|---|---|
| Python | 3.10+ (3.10 recommended) |
| PyTorch | 2.5.1+ with CUDA 12.4+ |
| GPU | NVIDIA GPU recommended (RTX 30/40/50 series supported) |
| VRAM | 8GB minimum, 12GB+ recommended for HD lip-sync |
| FFmpeg | Must be in system PATH |
| Rubberband | Recommended for high-quality audio time-stretching |
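A quick way to sanity-check these requirements is a short Python probe (an illustrative helper, not part of the project):

```python
# Quick sanity check for the requirements above
import shutil
import sys

print("Python    :", sys.version.split()[0])
print("ffmpeg    :", shutil.which("ffmpeg") or "NOT FOUND - add to PATH")
print("rubberband:", shutil.which("rubberband") or "not found (optional)")
try:
    import torch
    print("PyTorch   :", torch.__version__, "| CUDA:", torch.cuda.is_available())
except ImportError:
    print("PyTorch   : not installed")
```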
📥 FFmpeg Installation
Windows (Option 1):

```bash
winget install ffmpeg
# Restart terminal after installation
```

Windows (Option 2 - Manual):

- Download from ffmpeg.org/download (Windows builds → gyan.dev)
- Extract to `C:\ffmpeg`
- Add `C:\ffmpeg\bin` to your system PATH
- Restart terminal and verify: `ffmpeg -version`

Linux:

```bash
sudo apt install ffmpeg
```

macOS:

```bash
brew install ffmpeg
```

📥 Rubberband Installation

Download from Rubberband Releases. Extract and add to PATH, or place `rubberband-program.exe` in the project folder.
```bash
# Clone the repository
git clone https://github.com/overcrash66/video-translator.git
cd video-translator

# Create virtual environment (Python 3.10 recommended)
py -3.10 -m venv venv
.\venv\Scripts\activate      # Windows
# source venv/bin/activate   # Linux/macOS

# Install dependencies
pip install -r requirements.txt
```

For consistency with the Docker deployment, Linux and macOS users can use the Docker requirements file, which contains tested, stable dependency versions:
```bash
# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate

# Install PyTorch with CUDA (Linux) or Metal (macOS)
# Linux with CUDA:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# macOS (Apple Silicon):
pip install torch torchvision torchaudio

# Install project dependencies
pip install -r deploy/docker/requirements.docker.txt
```

Note for macOS users:

- `paddlepaddle-gpu` is Linux-only. Install `paddlepaddle` instead: `pip install paddlepaddle`
- Replace `onnxruntime-gpu` with `onnxruntime`: `pip install onnxruntime`
- Some GPU-accelerated features may have reduced performance on Apple Silicon

| Feature | Requirement |
|---|---|
| NeMo Diarization | nemo_toolkit[asr] |
| Wav2Lip | Model file at models/wav2lip/wav2lip_gan.pth |
| F5-TTS | f5-tts package (GPU recommended) |
| Enhanced Lip-Sync | gfpgan and basicsr (included in requirements) |
| LivePortrait | ~2GB VRAM, auto-downloads to models/live_portrait |
| Llama 3.1 / NeMo | HuggingFace token (HF_TOKEN env variable) |
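For the HF_TOKEN requirement, one common pattern is loading it from `.env` with `python-dotenv` (an assumption; the app may read the variable differently):

```python
# Illustrative: read HF_TOKEN from .env via python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()                        # reads .env from the working directory
if not os.getenv("HF_TOKEN"):
    raise SystemExit("Set HF_TOKEN in .env to use Llama 3.1 / NeMo diarization")
```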
You can run the application in a Docker container with NVIDIA GPU support.
Prerequisites:
- NVIDIA Driver (compatible with CUDA 12.x)
- Docker Desktop (Windows) or Docker Engine (Linux)
- NVIDIA Container Toolkit (for GPU access inside Docker)
- Build the image (run from the project root):

  ```bash
  docker build -f deploy/docker/Dockerfile -t video-translator .
  ```

- Run the container:

  ```bash
  # For PowerShell:
  docker run --gpus all -p 7860:7860 -v ${PWD}/output:/app/output --name video-translator video-translator

  # For Command Prompt (CMD):
  docker run --gpus all -p 7860:7860 -v %cd%/output:/app/output --name video-translator video-translator
  ```

  Note: If you encounter a "Ports are not available" error (common on Windows with Hyper-V), try mapping to a different port like 7950:

  ```bash
  docker run --gpus all -p 7950:7860 -v ${PWD}/output:/app/output --name video-translator video-translator
  ```

- Access the app: Open your browser to `http://localhost:7860` (or `http://localhost:7950` if you used the alternate port).
"Ports are not available" / "Access is denied" On Windows, Hyper-V or WinNAT often reserves large ranges of ports (including 7860).
- Solution: Use a different host port (like 7950 or 8080) as shown in the note above.
- Check reserved ranges: Run
netsh interface ipv4 show excludedportrange protocol=tcpto see which ports are blocked.
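To probe candidate ports programmatically before launching the container, a small helper like this works (illustrative, not part of the project):

```python
# Check whether a host port is actually bindable before mapping it;
# catches Windows reserved-range errors as well as ports already in use.
import socket

def port_available(port: int) -> bool:
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("0.0.0.0", port))
        return True
    except OSError:
        return False

for candidate in (7860, 7950, 8080):
    print(candidate, "free" if port_available(candidate) else "blocked/in use")
```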
```bash
# Activate environment
.\venv\Scripts\activate

# Launch the application
python app.py
```

Open your browser to `http://127.0.0.1:7860`.
- Upload Video — Drag & drop MP4, MKV, or MOV files
- Select Languages — Source (or Auto-detect) → Target
- Choose Models:
- Translation: Google (fast) / Llama 3.1 (quality) / ALMA-R (best)
- TTS: Edge (online) / F5-TTS (fast cloning) / XTTS (emotion control)
- Enable Features (optional):
- ✅ Speaker Diarization — Multi-speaker videos
- ✅ Lip-Sync — Select quality level (Fast/HD/Cinema)
- ✅ Visual Text Translation — Replace on-screen text
- ✅ Audio Enhancement — VoiceFixer post-processing
- Click "Process Video" and monitor progress
```
video-translator/
├── temp/     # Intermediate files (auto-cleaned)
├── output/   # Final translated videos
├── models/   # Downloaded model weights
└── .env      # Environment variables (HF_TOKEN, etc.)
```

Example `.env`:

```
HF_TOKEN=your_huggingface_token   # Required for Llama 3.1, NeMo
```

The end-to-end pipeline:

```mermaid
flowchart TD
Video[Input Video] --> Extract[Extract Audio via FFmpeg]
Extract --> Separator{"Audio Separator<br/>(HDemucs)"}
Separator -->|Vocals| Vocals[Vocal Track]
Separator -->|Accompaniment| Background[Background Track]
Vocals --> VAD{"VAD Preprocessing<br/>(Silero VAD)"}
VAD --> Transcribe{"Transcribe<br/>(Faster-Whisper Turbo)"}
Transcribe --> Segments[Text Segments]
Segments --> Merge{Smart Segment Merging}
Vocals -.-> Diarize{"Diarize<br/>(NeMo / SpeechBrain)"}
Diarize -.-> SpeakerProfiling[Speaker Profiling]
Merge --> Translate{"Translate<br/>(Llama 3.1 / ALMA / HY-MT)"}
Translate --> SRT[Export .SRT Subtitles]
Translate --> TTS{"Neural TTS<br/>(F5-TTS / XTTS / Edge)"}
SpeakerProfiling -.-> TTS
TTS --> TTSAudio[Generated Speech Clips]
TTSAudio --> EQ{EQ Spectral Matching}
EQ --> Sync{"Synchronize<br/>(PyRubberband)"}
Sync --> MergedSpeech[Merged Speech Track]
MergedSpeech -.-> VoiceFixer{"Voice Enhancement<br/>(VoiceFixer)"}
VoiceFixer -.-> Mix
MergedSpeech --> Mix{Mix Audio}
Background --> Mix
Mix --> FinalAudio[Final Audio Track]
Video --> VisualTrans{"Visual Translation<br/>(PaddleOCR / EasyOCR)"}
VisualTrans --> LipSync{"Lip-Sync<br/>(LivePortrait / Wav2Lip / GFPGAN)"}
MergedSpeech -.-> LipSync
LipSync --> Mux{"Merge with Video<br/>(FFmpeg)"}
FinalAudio --> Mux
Mux --> Output[Translated Output Video]
```
Contributions are welcome! Areas where help is appreciated:
- Additional language support and voice models
- Performance optimizations
- Bug fixes and stability improvements
- Documentation and tutorials
This project is for educational and personal use. Please respect the licenses of underlying models and technologies.
🌟 Star this repo if you find it useful! 🌟
Made with ❤️ for content creators worldwide

