The complete pipeline requires two machines: one client and one server.
For the 2D avatar, RAG runs on the server and all other components
run on the client.
For the 3D avatar, RAG and lip sync
(SAiD) run on the server
and the other components run on the client.
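The client/server split above can be summarized as a small lookup table. This is an illustrative sketch only, not part of the project code; the component names used as dictionary keys are assumptions.

```python
# Where each server-side component runs, per avatar mode (names are assumptions).
# Anything not listed under "server" is assumed to run on the client.
PLACEMENT = {
    "2d": {"server": ["rag"]},
    "3d": {"server": ["rag", "said_lipsync"]},
}

def runs_on_server(avatar: str, component: str) -> bool:
    """Return True if the component runs on the server for the given avatar mode."""
    return component in PLACEMENT[avatar]["server"]
```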
Both the Windows and Ubuntu machines should be equipped with one Intel Arc A770 16GB GPU.
The guide assumes the use of Visual Studio Code and Conda-forge. However, you can use any other source code editor you prefer.
Download the Visual Studio Code installer and install the tool. Then, set up Git in VS Code.
Download the Conda-forge installer and install the tool. You need to add the tool to the system's PATH if you want to use it in the VS Code terminal.
After the configuration, you can proceed with the setup using the VS Code terminal.
Organize the project code accordingly and make sure the downloaded third-party models are within the file structure presented below.
- Follow the MuseTalk documentation and place all the models under the `resource/musetalk_models` directory:

  ```
  resource/musetalk_models/
  ├── musetalk
  │   ├── musetalk.json
  │   └── pytorch_model.bin
  ├── dwpose
  │   └── dw-ll_ucoco_384.pth
  ├── face-parse-bisent
  │   ├── 79999_iter.pth
  │   └── resnet18-5c106cde.pth
  ├── sd-vae-ft-mse
  │   ├── config.json
  │   └── diffusion_pytorch_model.bin
  └── whisper
      └── tiny.pt
  ```

- Convert the MuseTalk model to OpenVINO IR (Intermediate Representation):

  ```
  python -m da.util.musetalk_torch2ov
  ```

  After a successful conversion, the OpenVINO IR will be saved to the following directory:

  ```
  resource/musetalk_models/
  └── musetalk
      ├── unet-vae-b4.bin
      └── unet-vae-b4.xml
  ```
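To catch missing downloads before running the conversion, a small check like the following can verify the MuseTalk model layout. This is an optional helper sketch, not part of the project code; the file list mirrors the tree above.

```python
from pathlib import Path

# Files expected under resource/musetalk_models/, mirroring the tree above.
REQUIRED = [
    "musetalk/musetalk.json",
    "musetalk/pytorch_model.bin",
    "dwpose/dw-ll_ucoco_384.pth",
    "face-parse-bisent/79999_iter.pth",
    "face-parse-bisent/resnet18-5c106cde.pth",
    "sd-vae-ft-mse/config.json",
    "sd-vae-ft-mse/diffusion_pytorch_model.bin",
    "whisper/tiny.pt",
]

def missing_models(root: str) -> list[str]:
    """Return the relative paths of required model files absent under root."""
    base = Path(root)
    return [rel for rel in REQUIRED if not (base / rel).is_file()]
```

Running `missing_models("resource/musetalk_models")` before the conversion should return an empty list; any entries it returns are files still to be downloaded.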
- Use Git to download the FunASR `speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch` model, used as the automatic speech recognition (ASR) model in the pipeline:

  ```
  cd resource/funasr_models
  git clone https://www.modelscope.cn/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git
  cd ../..
  ```
- Convert the FunASR model to ONNX and apply quantization using the command below:

  ```
  python -m da.util.funasr_torch2onnx
  ```

  After a successful conversion, the quantized ONNX model will be saved to the following directory:

  ```
  resource/funasr_models/
  └── speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
      └── model.onnx
  ```
Run the commands below to create the Python environment and install the required dependencies:

```
conda create --name da python=3.10 ffmpeg webrtcvad
conda activate da
pip install -r requirements.txt
mim install -r requirements-mim.txt
```

To install Docker, refer to the official website.
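Because `ffmpeg` and the other command-line tools must be visible from the same shell you run the pipeline in, a quick sanity check can confirm they are on `PATH` before continuing. This is an optional sketch, not part of the project code, and the suggested tool list is an assumption.

```python
import shutil

def missing_tools(tools: list[str]) -> list[str]:
    """Return the tools from the list that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Example: tools this guide relies on (assumed list):
# missing_tools(["git", "ffmpeg", "conda", "docker"])
```

An empty result means all listed tools are reachable from the current shell.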
- Organize the project code accordingly:
  - In Ubuntu, the required code should be in the `said_docker` folder.
  - SAiD models are used for 3D lip sync. `SAiD.pth` should be placed under the `said_docker/said_models` directory:

    ```
    said_docker
    └── said_models
        └── SAiD.pth
    ```

  - Downloaded third-party models should also be saved within a similar file structure.
- Follow "SAiD on A770" to build and set up a server.
To set up a Retrieval-Augmented Generation (RAG) pipeline, refer to the guide.