- About project
  - 1.1. Project structure
  - 1.2. Main features
  - 1.3. Output examples
  - 1.4. To do
- Fine tuning XTTS-v2
  - 2.1. Dataset
  - 2.2. Fine-tuned model
- Docs
  - 3.1. Installation
  - 3.2. Video dubbing module usage
  - 3.3. Streamlit demo app
This project dubs English videos into Russian.
The main purpose is to build a solution that can be used locally, without any API services. To achieve this, I have developed video dubbing pipelines that use a pretrained model for each stage of dubbing: voice activity detection (`vad`), speech recognition (`asr`), machine translation (`mt`), and speech synthesis (`tts`).
The project consists of two parts:
- Video-dubbing Python module with pipeline implementation and wrappers over models.
- Simple Streamlit demo app for a user-friendly video-dubbing interface.
Note
The module lets you flexibly customize the pipeline through the config system, and it makes it easy to add wrappers over new models.
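To make the idea of a config-driven pipeline with model wrappers more concrete, here is a minimal sketch. It is not the module's actual implementation: `REGISTRY`, `register`, `build_pipeline`, `run_pipeline`, and the toy `UpperCaseProcessor` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry mapping a model name from the config to a wrapper class.
# This is an illustration, not the video_dubbing module's real API.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str):
    """Decorator that makes a wrapper class available under a config name."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


@register("UpperCaseProcessor")
class UpperCaseProcessor:
    """Toy wrapper standing in for a real model wrapper."""

    def __init__(self, suffix: str = ""):
        self.suffix = suffix

    def __call__(self, data: str) -> str:
        return data.upper() + self.suffix


def build_pipeline(config: Dict[str, Any]) -> List[Callable[[Any], Any]]:
    """Instantiate one wrapper per pipeline entry, passing its params."""
    return [
        REGISTRY[step["model"]](**step.get("params", {}))
        for step in config["pipeline"]
    ]


def run_pipeline(stages: List[Callable[[Any], Any]], data: Any) -> Any:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data
```

In this sketch, adding a new model means writing one more class and registering it under the name used in the config.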
```
root/
├── app/                  # streamlit demo app
├── notebooks/
├── scripts/              # scripts for datasets processing
└── video_dubbing/
    ├── core/             # basic classes
    ├── pipelines/        # implementation of all stages of dubbing
    ├── utils/            # extra useful things, configs for example
    └── video_dubber.py   # API for video dubbing
```
- The resulting dubbing is synchronized with the original video.
- Easy pipeline configuration via configuration files.
- Several options for voice synthesis: voice cloning (XTTS-v2); using a specific voice that differs from the original speaker (SileroTTS, XTTS-v2).
- For faster inference on the CPU, you can use SileroTTS. The video-dubbing module also supports GPU usage.
- It is easy to use your own checkpoints for any model in the module. All you need is to provide the paths to your model weights in the config file (see the default configs in `video_dubbing/configs/`).
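The first feature above, keeping the dubbed speech in sync with the original video, can be illustrated with a small calculation. The README does not describe how the module synchronizes audio, so the following is only one common approach, stated as an assumption: speed up a synthesized segment when it is longer than the original speech segment, up to a cap. The function name `speed_factor` and the default cap of 1.5 are hypothetical.

```python
def speed_factor(
    original_duration_s: float,
    synthesized_duration_s: float,
    max_factor: float = 1.5,
) -> float:
    """Return the playback speed-up needed to fit synthesized speech
    into the original segment.

    Assumption (not taken from the project): segments are never slowed
    down, and the speed-up is capped at max_factor to keep speech natural.
    """
    if original_duration_s <= 0 or synthesized_duration_s <= 0:
        raise ValueError("durations must be positive")
    factor = synthesized_duration_s / original_duration_s
    return min(max(factor, 1.0), max_factor)
```

For example, a 2.5 s synthesized phrase that has to fit a 2.0 s original segment would need a speed-up of 1.25.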
| Original (English) | Voice Cloning (XTTS-v2) | SileroTTS (xenia) |
|---|---|---|
| input-1.mp4 | output-1-1.mp4 | output-1-2.mp4 |
| input-2.mp4 | output-2-1.mp4 | output-2-2.mp4 |
- Logging ✍️
- Reading and processing configs 📚
- Batch processing ⬆️
- Diarization for multi-speaker dubbing
- OpenAI API for high quality translation
- OpenAI API for processing transcription before translation (numbers -> words)
The default XTTS-v2 has a few problems with the Russian language:
- wrong accents
- inexpressive intonation
- occasional artifacts at the end of synthesized audio

To get better TTS performance on Russian, fine-tuning is recommended.
You can find an XTTS-v2 fine-tuning example in my notebook: notebooks/fine-tune-xtts.ipynb
For fine-tuning I used the RUSLAN dataset:
- original: https://ruslan-corpus.github.io/
- kaggle: https://www.kaggle.com/datasets/freezerainml/ruslan
I preprocessed it with the script /scripts/process_ruslan_dataset.py.
You can download my preprocessed version from kaggle: https://www.kaggle.com/datasets/maksimgoncharovskiy/ruslan-preprocessed
Note
You can use any other dataset, but it must be in the LJSpeech format.
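For reference, an LJSpeech-style dataset is a folder with a `wavs/` directory and a `metadata.csv` file whose lines have three `|`-separated fields: the clip ID (the wav file name without extension), the transcription, and the normalized transcription. The sketch below writes such a file and checks that every listed clip has a wav file. The helper names `write_metadata` and `missing_wavs` are hypothetical and are not part of this repository.

```python
from pathlib import Path
from typing import List, Tuple


def write_metadata(dataset_dir: Path, rows: List[Tuple[str, str, str]]) -> Path:
    """Write metadata.csv in the LJSpeech layout: id|text|normalized text."""
    (dataset_dir / "wavs").mkdir(parents=True, exist_ok=True)
    metadata = dataset_dir / "metadata.csv"
    with metadata.open("w", encoding="utf-8") as f:
        for file_id, text, normalized in rows:
            # The pipe is the field separator, so it must not appear in the text.
            if "|" in text or "|" in normalized:
                raise ValueError(f"'|' is not allowed in transcriptions: {file_id}")
            f.write(f"{file_id}|{text}|{normalized}\n")
    return metadata


def missing_wavs(dataset_dir: Path) -> List[str]:
    """Return clip IDs listed in metadata.csv that have no wavs/<id>.wav file."""
    missing = []
    with (dataset_dir / "metadata.csv").open(encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            file_id = line.split("|", 1)[0]
            if not (dataset_dir / "wavs" / f"{file_id}.wav").exists():
                missing.append(file_id)
    return missing
```

Running `missing_wavs` before training is a cheap way to catch mismatches between the metadata and the audio files.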
You can download and use my checkpoint from kaggle: https://www.kaggle.com/models/maksimgoncharovskiy/xtts-v2_ruslan_134566
After you download a checkpoint, you can use it in the video-dubbing pipeline. All you need is to provide the path to the checkpoint in the config file.
- Clone the repository:

  ```bash
  git clone https://github.com/Maksim-Goncharovskiy/video-dubbing.git
  ```

- Make a virtual environment for Python 3.10.18. You can use Miniconda:

  ```bash
  conda create -n project_env_name python=3.10
  ```

- Go to the repository directory and install the requirements:

  ```bash
  pip install -r requirements.txt
  ```

- Install the video dubbing module:

  ```bash
  pip install -e .
  ```

```python
from video_dubbing import VideoDubber
from video_dubbing.utils import cpu_config, gpu_config

dubber = VideoDubber(config=cpu_config)  # or gpu_config for voice cloning with XTTS

input_video_path = "input.mp4"
output_video_path = "output.mp4"

dubber(input_video_path, output_video_path)
```

Let's imagine that you want to use the large-v3 Whisper version and you also have a fine-tuned version of XTTS-v2. You can easily build your own video_dubbing pipeline by creating a custom config file. The config file should have the same format as the default config files in video_dubbing/configs:
```yaml
# This is our custom config file
pipeline:
  - stage: vad
    model: SileroVADProcessor
    params:
      model_path: "./models/SileroVAD/snakers4-silero-vad"  # provide path if you've already downloaded model weights
      threshold: 0.5
      min_silence_duration_ms: 1000
      min_speech_duration_ms: 1000

  - stage: asr
    model: FasterWhisperProcessor
    params:
      model_size_or_path: "large-v3"  # use new model size. Model weights will be downloaded.
      device: cuda
      compute_type: float32

  - stage: mt
    model: HelsinkiEnRuProcessor
    params:
      model_path: ""
      device: cpu

  - stage: tts
    model: XTTSProcessor
    params:
      model_path: "./models/XTTS/XTTS_ft_ruslan/"  # provide path to your fine-tuned model
      device: cuda
      speaker: "./ruslan-wavs/reference.wav"  # if you provide path to reference audio, the dubbing will be done by the voice of the selected speaker

temp-dir: "./video-dubbing-temp-dir"
```

Important
If the `speaker` argument is not provided in the config while using the XTTS model, dubbing will be done with voice cloning.
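The rule in the note above can be written down as a small check on a parsed config. This is only an illustration of the described behaviour, not code from the module: the function name `tts_voice_mode` is hypothetical, and the config is assumed to be a plain dictionary with the same keys as the YAML file.

```python
from typing import Any, Dict


def tts_voice_mode(config: Dict[str, Any]) -> str:
    """Describe which voice the tts stage of a parsed config would use.

    Mirrors the rule from the note: XTTS without a `speaker` means voice
    cloning; a `speaker` value means dubbing with that selected voice.
    """
    for step in config["pipeline"]:
        if step["stage"] != "tts":
            continue
        speaker = step.get("params", {}).get("speaker")
        if speaker:
            return f"fixed speaker: {speaker}"
        if step["model"] == "XTTSProcessor":
            return "voice cloning"
        # The note only covers XTTS, so nothing is claimed for other models.
        return "unspecified"
    raise ValueError("config has no tts stage")
```

For the custom config shown earlier, this returns `fixed speaker: ./ruslan-wavs/reference.wav`; removing the `speaker` line would give `voice cloning`.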
To use the custom config file, load it with the `load_config` function:
```python
from video_dubbing import VideoDubber
from video_dubbing.utils import load_config

config_path = "custom_config.yaml"
custom_config = load_config(config_path)

dubber = VideoDubber(config=custom_config)  # pass the loaded custom config

input_video_path = "input.mp4"
output_video_path = "output.mp4"

dubber(input_video_path, output_video_path)
```

You can also construct the config directly in Python:
```python
from video_dubbing import VideoDubber
from video_dubbing.utils import ProcessorConfig
# Assumption: the processor classes are importable from video_dubbing.pipelines;
# adjust this import to wherever they are defined in your checkout.
from video_dubbing.pipelines import (
    FasterWhisperProcessor,
    HelsinkiEnRuProcessor,
    SileroTTSProcessor,
    SileroVADProcessor,
)

config = {
    "pipeline": [
        ProcessorConfig(
            stage="vad",
            model=SileroVADProcessor,
            params={
                "model_path": "./Models/SileroVAD/snakers4-silero-vad",
                "threshold": 0.5,
                "min_silence_duration_ms": 1000,
                "min_speech_duration_ms": 1000,
            },
        ),
        ProcessorConfig(
            stage="asr",
            model=FasterWhisperProcessor,
            params={
                "model_size_or_path": "./Models/FasterWhisper/tiny-en",
                "device": "cpu",
                "compute_type": "int8",
            },
        ),
        ProcessorConfig(
            stage="mt",
            model=HelsinkiEnRuProcessor,
            params={
                "model_path": "./Models/OpusEnRu",
                "device": "cpu",
            },
        ),
        ProcessorConfig(
            stage="tts",
            model=SileroTTSProcessor,
            params={
                "model_path": "./Models/SileroModels",
                "device": "cpu",
                "speaker": "xenia",
            },
        ),
    ],
    "temp-dir": "./video-dubbing-temp-dir",
}

dubber = VideoDubber(config=config)
```

If you are interested in adding your own models to the pipeline, or you want to find out how the video_dubbing module works, please read the tutorial (not done yet).
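Because a `VideoDubber` instance is called with an input path and an output path, several videos can be dubbed with a plain loop over a folder. The helper below is a sketch under that assumption only: `dub_folder` and the `_ru` file name suffix are hypothetical and not part of the module.

```python
from pathlib import Path
from typing import Callable, List


def dub_folder(
    dubber: Callable[[str, str], object],
    input_dir: str,
    output_dir: str,
) -> List[Path]:
    """Dub every .mp4 file in input_dir and return the output paths.

    `dubber` is any callable taking (input_path, output_path), for example
    a configured VideoDubber instance.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = []
    for video in sorted(Path(input_dir).glob("*.mp4")):
        target = out / f"{video.stem}_ru.mp4"  # hypothetical naming scheme
        dubber(str(video), str(target))
        results.append(target)
    return results
```

Usage would look like `dub_folder(dubber, "./videos", "./dubbed")`, with `dubber` created as in the examples above.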
Important
For now, the demo app uses the default CPU config for video dubbing, which means it uses the SileroTTS model.
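- Build the Docker image:

  ```bash
  docker build -t demo-app .
  ```

- Run the Docker container:

  ```bash
  docker run -p 8000:8501 --name demo demo-app
  ```

With this port mapping (host port 8000 to container port 8501), the app should be reachable at http://localhost:8000.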