An AI-powered tool that automatically translates and dubs YouTube videos into other languages, dynamically adjusting video speed so the picture stays aligned with the new audio. The project combines speech recognition, translation, and voice cloning to produce natural-sounding dubbed videos.
- Automatic Video Processing: Downloads YouTube videos using yt-dlp and extracts audio automatically
- Speech Recognition: Uses Whisper AI for accurate speech-to-text transcription
- Voice Separation: Splits original audio into vocal and instrumental tracks using Spleeter
- Neural Translation: Supports high-quality translation through the DeepL API
- Voice Cloning: Uses XTTS v2 for natural-sounding voice synthesis that matches the original speaker
- Intelligent Video Speed Adjustment: Automatically adjusts video speed per speech segment to maintain lip-sync
- Background Music Preservation: Maintains original background music and sound effects
- Multi-language Support: Can translate and dub into multiple target languages
- Python 3.8+
- CUDA-capable GPU (recommended for faster processing)
- FFmpeg installed and added to system PATH
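The prerequisites above can be checked before a first run. The following is a minimal sketch, not part of the project: the function name `check_prerequisites` is hypothetical, and it only covers the Python version and FFmpeg. Checking for a CUDA-capable GPU would need the deep learning framework installed, so it is left out here.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []

    # The README asks for Python 3.8 or newer.
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ is required")

    # FFmpeg must be installed and reachable through the system PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg was not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

An empty result only means these two checks passed; it says nothing about the Python packages listed in requirements.txt.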
- Clone the repository:
git clone https://github.com/frrobledo/AutoDub.git
cd AutoDub
- Install required packages:
pip install -r requirements.txt
- Install additional dependencies:
sudo apt-get install ffmpeg # for Debian-based systems
For other operating systems, refer to the FFmpeg installation guide.
- Set up API keys:
- Create a DeepL API account and add your API key to the configuration
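The README does not say where the configuration lives, so the following is only a sketch of one common approach: keeping the key out of the source tree and reading it from an environment variable. The variable name `DEEPL_API_KEY` and the helper `load_deepl_key` are assumptions for illustration, not names used by the project.

```python
import os


def load_deepl_key(env=None):
    """Read the DeepL API key from an environment mapping.

    DEEPL_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("DEEPL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("DEEPL_API_KEY is not set or is empty")
    return key


if __name__ == "__main__":
    try:
        load_deepl_key()
        print("DeepL API key found")
    except RuntimeError as error:
        print("Configuration problem:", error)
```

With this approach the key would be exported in the shell before running the script, and it never needs to be committed to the repository.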
├── tools/
│ ├── audio_synthesis.py # Voice cloning and audio processing
│ ├── transcriber.py # Speech recognition and translation
│ ├── video_editing.py # Video speed adjustment and editing
│ ├── video_downloader.py # YouTube video downloading
│ ├── audio_splitter_ffmpeg.py # Audio separation
│ └── logger.py # Logging utilities
├── main.py # Main execution script
└── README.md
- Run the main script:
python main.py
- Enter the YouTube URL when prompted.
- The script will automatically:
  - Download the video
  - Extract and transcribe the audio
  - Separate speech from background audio
  - Translate the speech
  - Clone the voice in the target language
  - Adjust video speed for lip-sync
  - Combine everything into the final video
- Find the output video in the final_output/ directory.
- Video Processing:
  - Downloads the YouTube video using yt-dlp
  - Extracts the audio track
  - Separates vocals from background using Spleeter
- Speech Processing:
  - Transcribes speech using Whisper AI
  - Detects the spoken language automatically
  - Translates text using the DeepL API
- Voice Synthesis:
  - Clones the original voice using XTTS v2
  - Generates speech in the target language
  - Matches the timing of the original speech segments
- Video Adjustment:
  - Analyzes the duration of original vs. translated speech
  - Adjusts video speed per segment for lip-sync
  - Preserves the original background audio
  - Combines all elements into the final video
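The per-segment speed adjustment described above comes down to comparing two durations. The sketch below is one way to express that idea, not the code in video_editing.py: the `Segment` class, the function names, and the clamping bounds of 0.5 and 2.0 are all assumptions made for illustration. The `setpts` video filter is a standard FFmpeg filter, where dividing the presentation timestamps by a factor above 1 speeds the video up.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float            # segment start in the original video, in seconds
    end: float              # segment end in the original video, in seconds
    dubbed_duration: float  # length of the synthesized speech, in seconds


def speed_factor(segment, min_factor=0.5, max_factor=2.0):
    """Playback speed that makes the video segment last as long as the dub.

    A factor below 1 slows the video down (the dub is longer than the
    original speech); a factor above 1 speeds it up. The bounds are
    hypothetical limits that keep extreme segments watchable.
    """
    original = segment.end - segment.start
    if original <= 0 or segment.dubbed_duration <= 0:
        return 1.0  # nothing sensible to adjust, keep the original speed
    factor = original / segment.dubbed_duration
    return max(min_factor, min(max_factor, factor))


def setpts_expression(factor):
    """Build an FFmpeg video filter string for the given speed factor."""
    return f"setpts=PTS/{factor:.4f}"


if __name__ == "__main__":
    seg = Segment(start=12.0, end=16.0, dubbed_duration=5.0)
    factor = speed_factor(seg)
    print(factor, setpts_expression(factor))
```

In this example the original speech lasts 4 seconds and the dub lasts 5, so the video segment would be slowed to 0.8 times its normal speed. When a factor is clamped, the video and dubbed audio no longer match exactly for that segment.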
The project creates several directories for processing:
- downloads/: Downloaded YouTube videos
- original_audios/: Extracted audio files
- output_audio/: Processed audio segments
- final_output/: Final dubbed videos
- logs/: Processing logs
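The working directories listed above could be created up front as in the sketch below. The directory names come from this README; the helper `ensure_work_dirs` and the idea of creating them all in one place are assumptions, and the project may create each directory on demand instead.

```python
from pathlib import Path

# Directory names as listed in this README.
WORK_DIRS = ("downloads", "original_audios", "output_audio", "final_output", "logs")


def ensure_work_dirs(root="."):
    """Create the working directories under root if they do not exist yet."""
    created = []
    for name in WORK_DIRS:
        path = Path(root) / name
        # exist_ok=True makes repeated runs harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    for path in ensure_work_dirs():
        print("Ready:", path)
```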
- Output video quality depends on the quality of the source YouTube video
- For some languages, audio generation can produce artifacts, and some segments may end up noticeably slowed down or sped up
- Processing time varies with video length and hardware
- Results are better for some languages than for others
Contributions are welcome! Please feel free to submit pull requests or create issues for bugs and feature requests.
- Whisper AI for speech recognition
- XTTS v2 for voice cloning
- Spleeter for audio separation
- DeepL for neural translation
- yt-dlp for video downloading
For questions or support, please create an issue in the GitHub repository.