Video dubbing

About project

This project dubs English-language videos into Russian.

The main goal is a solution that runs locally, without calling any API services. To achieve this, I have developed video dubbing pipelines built from pretrained models for each stage of dubbing (voice activity detection, speech recognition, machine translation, and speech synthesis).

The project consists of two parts:

A video-dubbing Python module with the pipeline implementation and wrappers over models.
A simple Streamlit demo app that provides a user-friendly video-dubbing interface.

Note

The module lets you flexibly customize the pipeline through the config system and makes it easy to add wrappers over new models.

Project structure

root/
├── app/  # streamlit demo app
├── notebooks/
├── scripts/  # scripts for datasets processing
└── video_dubbing/
    ├── core/  # basic classes
    ├── pipelines/  # implementation of all stages of dubbing
    ├── utils/  # extra utilities, configs for example
    └── video_dubber.py  # API for video dubbing

Main Features

The resulting dubbing is synchronized with the original video.
Easy pipeline configuration via configuration files.
Several options for voice synthesis: voice cloning (XTTS-v2), or a specific voice that differs from the original speaker (SileroTTS, XTTS-v2).
For faster inference on CPU, you can use SileroTTS. The video-dubbing module also supports GPU usage.
It is easy to use your own checkpoints for any model in the module: just provide the paths to your model weights in the config file (see the default configs in video_dubbing/configs/).
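
How the module keeps the dubbed audio in sync is not described in this section. One common approach, shown here purely as an illustration, is to speed up each synthesized segment so that it fits the time slot of the original speech segment. The names `tempo_factor` and `max_speedup` are hypothetical and are not part of the video_dubbing module.

```python
def tempo_factor(slot_start: float, slot_end: float,
                 synth_duration: float, max_speedup: float = 1.5) -> float:
    """Return the speed-up factor needed to fit synthesized speech into a slot.

    1.0 means the audio already fits; values above 1.0 mean it must be
    played faster. The result is capped at max_speedup so that speech
    does not become unintelligible.
    """
    slot = slot_end - slot_start
    if slot <= 0 or synth_duration <= 0:
        raise ValueError("slot and synthesized duration must be positive")
    factor = synth_duration / slot
    # Never slow audio down (factor < 1.0) and never exceed the cap.
    return min(max(factor, 1.0), max_speedup)


if __name__ == "__main__":
    # A 3.0 s Russian phrase has to fit a 2.5 s English segment.
    print(tempo_factor(10.0, 12.5, 3.0))
```

A segment that would need more than `max_speedup` still overflows its slot, so a real pipeline would also need a fallback such as shortening the translation or borrowing time from the following pause.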

Output examples

Original (english)	Voice Cloning (XTTS-v2)	SileroTTS (xenia)
input-1.mp4	output-1-1.mp4	output-1-2.mp4
input-2.mp4	output-2-1.mp4	output-2-2.mp4

To do

Logging ✍️
Reading and processing configs 📚
Batch processing ⬆️
Diarization for multi-speaker dubbing
OpenAI API for high quality translation
OpenAI API for processing transcription before translation (numbers -> words)

Fine-tuning XTTS-v2

The default XTTS-v2 has a few problems with Russian:

wrong accents
inexpressive intonation
sometimes there are artifacts at the end of the synthesized audio

For better TTS performance on Russian, fine-tuning is recommended.

You can find an XTTS-v2 fine-tuning example in my notebook: notebooks/fine-tune-xtts.ipynb

Dataset

For fine-tuning I used the RUSLAN dataset:

original: https://ruslan-corpus.github.io/
kaggle: https://www.kaggle.com/datasets/freezerainml/ruslan

It was preprocessed with the script /scripts/process_ruslan_dataset.py.

You can download my preprocessed version from Kaggle: https://www.kaggle.com/datasets/maksimgoncharovskiy/ruslan-preprocessed

Note

You can use any other dataset, but it must be in the LJSpeech format.
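
For reference, an LJSpeech-style dataset is a folder with a `wavs/` directory and a `metadata.csv` file whose lines have the form `file_id|transcription|normalized transcription`. The sketch below writes and reads such a file. The helper names and the sample row are made up for illustration, and the exact columns expected by the fine-tuning notebook should be checked there.

```python
from pathlib import Path


def write_metadata(dataset_dir: Path, rows: list[tuple[str, str]]) -> Path:
    """Write an LJSpeech-style metadata.csv (file_id|text|normalized_text)."""
    # Audio files are expected at wavs/<file_id>.wav next to metadata.csv.
    (dataset_dir / "wavs").mkdir(parents=True, exist_ok=True)
    lines = []
    for file_id, text in rows:
        if "|" in text or "\n" in text:
            raise ValueError(f"unsupported character in transcription of {file_id}")
        # No separate normalization is done here, so the text is repeated
        # in the third column.
        lines.append(f"{file_id}|{text}|{text}")
    path = dataset_dir / "metadata.csv"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def read_metadata(path: Path) -> list[tuple[str, str, str]]:
    """Read metadata.csv back into (file_id, text, normalized_text) tuples."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        file_id, text, normalized = line.split("|")
        rows.append((file_id, text, normalized))
    return rows
```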

Fine-tuned model

You can download and use my checkpoint from Kaggle: https://www.kaggle.com/models/maksimgoncharovskiy/xtts-v2_ruslan_134566

After downloading the checkpoint, you can use it in the video-dubbing pipeline. All you need to do is provide the path to the checkpoint in the config file.

Docs

I. Installation

Clone repository

git clone https://github.com/Maksim-Goncharovskiy/video-dubbing.git

Create a virtual environment for Python 3.10.18. You can use Miniconda:

conda create -n project_env_name python=3.10
conda activate project_env_name

Go to repository dir and install requirements:

pip install -r requirements.txt

Install video dubbing module

pip install -e .

II. Video dubbing module usage

Quick start

from video_dubbing import VideoDubber
from video_dubbing.utils import cpu_config, gpu_config

dubber = VideoDubber(config=cpu_config) # or gpu_config for voice cloning with XTTS

input_video_path = "input.mp4"
output_video_path = "output.mp4"

dubber(input_video_path, output_video_path)
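
Batch processing is still on the to-do list, but because the dubber object is called as `dubber(input_path, output_path)`, a folder of videos can be processed with a plain loop. The sketch below is not part of the module: `dub_directory` and the `dubbed-` file prefix are hypothetical, and the function accepts any callable with that two-argument form.

```python
from pathlib import Path
from typing import Callable


def dub_directory(dubber: Callable[[str, str], object],
                  input_dir: str, output_dir: str) -> list[str]:
    """Dub every .mp4 file in input_dir and return the output paths."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    # Sort for a stable, reproducible processing order.
    for video in sorted(Path(input_dir).glob("*.mp4")):
        target = out_dir / f"dubbed-{video.name}"
        dubber(str(video), str(target))  # same call form as in the quick start
        outputs.append(str(target))
    return outputs
```

With the quick start objects, a call could look like `dub_directory(dubber, "videos", "videos-ru")`.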

Using your own config

Suppose you want to use the large-v3 Whisper model and you also have a fine-tuned version of XTTS-v2. You can build your own video dubbing pipeline by creating a custom config file. The config file should have the same format as the default config files in video_dubbing/configs:

# This is our custom config file
pipeline:
  - stage: vad
    model: SileroVADProcessor
    params:
      model_path: "./models/SileroVAD/snakers4-silero-vad" # provide path if you've already downloaded model weights
      threshold: 0.5
      min_silence_duration_ms: 1000
      min_speech_duration_ms: 1000

  - stage: asr
    model: FasterWhisperProcessor
    params:
      model_size_or_path: "large-v3" # use new model size. Model weights will be downloaded.
      device: cuda
      compute_type: float32

  - stage: mt
    model: HelsinkiEnRuProcessor
    params:
      model_path: ""
      device: cpu
      
  - stage: tts
    model: XTTSProcessor
    params:
      model_path: "./models/XTTS/XTTS_ft_ruslan/" # provide path to your fine-tuned model
      device: cuda
      speaker: "./ruslan-wavs/reference.wav" # if you provide path to reference audio, the dubbing will be done by the voice of the selected speaker


temp-dir: "./video-dubbing-temp-dir"

Important

If the speaker argument is not provided in the config while using the XTTS model, dubbing will be done with voice cloning.
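
The rule above can be written down as a small check. This is only an illustration of the documented behavior, not the module's actual code; `xtts_voice_mode` is a hypothetical helper that operates on the `params` mapping of the tts stage.

```python
def xtts_voice_mode(params: dict) -> str:
    """Describe which voice XTTS would use for the given tts-stage params."""
    speaker = params.get("speaker")
    if not speaker:
        # No reference audio configured: the original speaker's voice is cloned.
        return "voice cloning"
    # A reference audio path is configured: that speaker's voice is used instead.
    return f"reference speaker: {speaker}"
```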

To use the custom config file, load it with the load_config function:

from video_dubbing import VideoDubber
from video_dubbing.utils import load_config

config_path = "custom_config.yaml"
custom_config = load_config(config_path)

dubber = VideoDubber(config=custom_config) # pass the loaded custom config

input_video_path = "input.mp4"
output_video_path = "output.mp4"

dubber(input_video_path, output_video_path)

You can also construct the config directly in Python:

from video_dubbing import VideoDubber
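from video_dubbing.utils import ProcessorConfig
# The processor classes used below must be imported as well. The import path
# here is an assumption based on the project structure; adjust it if needed.
from video_dubbing.pipelines import (
    FasterWhisperProcessor,
    HelsinkiEnRuProcessor,
    SileroTTSProcessor,
    SileroVADProcessor,
)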

config = {
    "pipeline": [
        ProcessorConfig(
            stage="vad",
            model=SileroVADProcessor,
            params={
                "model_path": "./Models/SileroVAD/snakers4-silero-vad", 
                "threshold": 0.5,  
                "min_silence_duration_ms": 1000, 
                "min_speech_duration_ms": 1000
            }
        ),
        ProcessorConfig(
            stage="asr",
            model=FasterWhisperProcessor,
            params={
                "model_size_or_path": "./Models/FasterWhisper/tiny-en", 
                "device": "cpu", 
                "compute_type": "int8"
            }
        ), 
        ProcessorConfig(
            stage="mt",
            model=HelsinkiEnRuProcessor,
            params={
                "model_path": "./Models/OpusEnRu", 
                "device": 'cpu'
            }
        ),  
        ProcessorConfig(
            stage="tts",
            model=SileroTTSProcessor,
            params={
                "model_path": "./Models/SileroModels", 
                "device": 'cpu', 
                "speaker": "xenia"
            }
        )
    ],
    "temp-dir": "./video-dubbing-temp-dir",
}

dubber = VideoDubber(config=config)
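
When a config is assembled by hand, a quick sanity check before constructing the dubber can catch a missing or misordered stage. The sketch below assumes that the pipeline must contain the four stages used in the examples (vad, asr, mt, tts) in that order, and that each pipeline item exposes a `stage` attribute. Both are assumptions about the module rather than documented behavior, and `check_stages` is a hypothetical helper.

```python
from types import SimpleNamespace

EXPECTED_STAGES = ["vad", "asr", "mt", "tts"]  # order taken from the example configs


def check_stages(config: dict) -> None:
    """Raise ValueError if the pipeline stages differ from EXPECTED_STAGES."""
    stages = [step.stage for step in config["pipeline"]]
    if stages != EXPECTED_STAGES:
        raise ValueError(f"expected stages {EXPECTED_STAGES}, got {stages}")


# Stand-in objects used only for this demonstration; a real config would
# hold ProcessorConfig items instead.
demo = {
    "pipeline": [SimpleNamespace(stage=s) for s in EXPECTED_STAGES],
    "temp-dir": "./video-dubbing-temp-dir",
}
check_stages(demo)  # passes silently
```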

If you are interested in adding your own models to the pipeline or want to find out how the video_dubbing module works, please read the tutorial (not done yet).

III. Demo streamlit app

Important

For now, the demo app uses the default CPU config for video dubbing, which means it uses the SileroTTS model.

Run demo app with Docker

Build docker image

docker build -t demo-app .

Run docker container

docker run -p 8000:8501 --name demo demo-app

With this port mapping, the app should be reachable at http://localhost:8000 (host port 8000 is forwarded to port 8501 in the container, which is Streamlit's default port).
