
VoxCPM WebUI - Advanced Text-to-Speech with Voice Cloning


Never written a Gradio project before; the code is a bit rough since it was all written by AI. Please bear with it for now.



Overview

VoxCPM WebUI is a user-friendly web interface powered by Gradio for the VoxCPM Text-to-Speech model from OpenBMB/ModelBest. This project provides easy access to VoxCPM's powerful speech generation capabilities with advanced voice cloning features and intuitive controls.

🎨 System Preview

System Preview 1 System Preview 2

🎯 Key Features

This project adds tools for voice cloning and polishes a number of UX details:

  • Advanced Audio Tools - Built-in audio/video extraction tool for easy prompt preparation
  • Preset Voice Management - Save and manage multiple voice presets for quick synthesis

📋 Requirements

  • Python 3.10+
  • CUDA 11.8+ (for GPU acceleration) or CPU mode
  • FFmpeg (for audio/video processing)
  • 8GB+ VRAM recommended (for GPU)

🚀 Quick Start

0. Clone the Repository

git clone https://github.com/rainow/voxcpm-webui.git
cd voxcpm-webui

1. Create Virtual Environment

# Using Conda
conda create -n voxcpm python=3.10
conda activate voxcpm

# Or using venv
python3.10 -m venv voxcpm_env
source voxcpm_env/bin/activate  # On Windows: voxcpm_env\Scripts\activate

2. Install VoxCPM

Follow the installation instructions at https://github.com/OpenBMB/VoxCPM, and install voxcpm into the virtual environment created in step 1.

3. Install Dependencies

# Basic installation
pip install -r requirements.txt

# For GPU support (CUDA 11.8)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For GPU support (CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

4. Install FFmpeg

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Windows (using chocolatey)
choco install ffmpeg

# Or download from: https://ffmpeg.org/download.html
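Before launching the app, it can save a debugging round-trip to confirm FFmpeg is actually visible on your PATH. A minimal check (a sketch, not part of this repo):

```python
import shutil
import subprocess

def check_ffmpeg() -> bool:
    """Return True if an ffmpeg binary is found on the system PATH and runs."""
    path = shutil.which("ffmpeg")
    if path is None:
        return False
    # Running "ffmpeg -version" is a cheap sanity check that the binary works.
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    print("ffmpeg found" if check_ffmpeg() else "ffmpeg NOT found - install it first")
```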

5. Download Models (Optional)

Models will be automatically downloaded on first run. To download in advance:

python3 << 'EOF'
from huggingface_hub import snapshot_download
from modelscope import snapshot_download as ms_snapshot_download

# VoxCPM Model
snapshot_download("openbmb/VoxCPM-0.5B", local_dir="./models/openbmb__VoxCPM-0.5B")

# ZipEnhancer (for audio enhancement)
ms_snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')

# SenseVoice (for speech recognition)
ms_snapshot_download('iic/SenseVoiceSmall')
EOF

6. Run the Application

python app2.py

Make sure the virtual environment from step 1 is activated before running the script (conda activate voxcpm or source voxcpm_env/bin/activate).

The web interface will be available at: http://localhost:7860

For network access (LAN):

# app2.py already launches with LAN access enabled;
# edit its run_demo call or use environment variables to change the host/port
python app2.py
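In Gradio, network binding is controlled by arguments passed to launch(). The snippet below only illustrates the relevant parameters; how app2.py actually wires them into run_demo is not shown here, so treat the names as assumptions:

```python
# Typical Gradio launch arguments for LAN access (illustrative sketch only;
# check app2.py's run_demo call for the actual configuration).
launch_kwargs = {
    "server_name": "0.0.0.0",  # bind to all interfaces so other LAN hosts can connect
    "server_port": 7860,       # default Gradio port, matching http://localhost:7860
}
# Inside the app this would be applied roughly as: demo.launch(**launch_kwargs)
```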

📖 Usage Guide

Basic Text-to-Speech

  1. Enter Target Text - Input the text you want synthesized
  2. Adjust Parameters:
    • CFG Value (1.0-3.0): Controls adherence to prompt style
    • Inference Timesteps (4-30): Quality vs. speed tradeoff
    • Text Normalization: Enable for general text processing
  3. Click "Generate Speech" - Wait for synthesis to complete

Voice Cloning

  1. Provide Prompt Speech - Upload or record a reference audio clip (5-40 seconds)
  2. Add Prompt Text - Enter the corresponding transcript (auto-recognition available)
  3. Optional Enhancements:
    • Enable "Prompt Speech Enhancement" to denoise reference audio
  4. Generate - Synthesize new text with the cloned voice

Audio/Video Extraction Tool

  1. Upload File - Support for audio/video formats
  2. Preview & Set Time - Mark segment start and end times (5-40 seconds)
  3. Extract - Generate audio segment
  4. Use as Prompt - Automatically set as prompt speech with text recognition
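The extraction step above maps onto a plain FFmpeg invocation. The sketch below builds an equivalent command line; the 16 kHz mono output format is an assumption for illustration, not taken from this repo:

```python
def build_extract_cmd(src: str, start: float, end: float, dst: str) -> list[str]:
    """Build an ffmpeg command that cuts the [start, end] second window of audio from src."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ss", str(start),  # segment start (seconds)
        "-to", str(end),    # segment end (seconds)
        "-vn",              # drop any video stream
        "-ac", "1",         # mono
        "-ar", "16000",     # 16 kHz sample rate (assumed prompt format)
        dst,
    ]

print(build_extract_cmd("movie.mp4", 5.0, 35.0, "prompt.wav"))
```

Run the returned list with subprocess.run() to perform the cut outside the UI.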

Voice Presets

  1. Manage Presets - Access "Voice Configuration Guide"
  2. Save Configurations - Store current settings as reusable presets
  3. Load Presets - Quickly switch between saved voice configurations
  4. Edit Config - Modify voice_presets.json for advanced setup

⚙️ Configuration

Edit voice_presets.json to customize voice presets:

{
  "default_preset": "example",
  "default_text": "Default synthesis text",
  "max_chars_per_segment": 200,
  "segment_pause_duration": 0.5,
  "presets": {
    "custom_voice": {
      "name": "Custom Voice",
      "description": "A custom voice configuration",
      "prompt_speech": "./presets/custom_voice.wav",
      "prompt_text": "Reference text for this voice"
    }
  }
}
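When editing the file by hand, a quick sanity check helps catch missing fields before the UI loads it. A minimal sketch (field names follow the example above; the validation logic itself is not part of this repo):

```python
import json

# Fields each preset entry is expected to carry, per the example config above.
REQUIRED_KEYS = {"name", "description", "prompt_speech", "prompt_text"}

def validate_presets(config_text: str) -> list[str]:
    """Parse a voice_presets.json document and report presets with missing fields."""
    config = json.loads(config_text)
    problems = []
    for preset_id, preset in config.get("presets", {}).items():
        missing = REQUIRED_KEYS - preset.keys()
        if missing:
            problems.append(f"{preset_id}: missing {sorted(missing)}")
    return problems

sample = """{
  "default_preset": "example",
  "presets": {
    "custom_voice": {
      "name": "Custom Voice",
      "description": "A custom voice configuration",
      "prompt_speech": "./presets/custom_voice.wav",
      "prompt_text": "Reference text for this voice"
    }
  }
}"""
print(validate_presets(sample))  # → [] when every preset is complete
```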

🔧 Advanced Options

  • Text Normalization: Disable for phoneme input or special symbols
  • Prompt Speech Enhancement: Denoise reference audio for cleaner results
  • CFG Value: Lower for creativity, higher for prompt adherence
  • Long Text Support: Automatically splits text into segments for coherent synthesis
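The long-text behavior can be pictured as a greedy splitter that packs sentences into segments under a max_chars_per_segment budget. This is a sketch of the idea only; the actual segmentation lives in voxcpm_demo.py and may differ:

```python
import re

def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack sentences into segments of at most max_chars characters."""
    # Split on whitespace that follows sentence-ending punctuation (Latin and CJK).
    sentences = [s for s in re.split(r"(?<=[.!?。!?])\s+", text) if s]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)   # current segment is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments

print(split_text("First sentence. Second sentence. Third sentence.", max_chars=20))
```

The segment_pause_duration setting from voice_presets.json would then control the silence inserted between synthesized segments.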

📁 Project Structure

VoxCPM/
├── app2.py                          # Main entry point; named app2.py because VoxCPM itself ships an app.py, so this project can be copied straight into the VoxCPM directory without conflicts
├── app_utils.py                     # Utility functions
├── ui_assets.py                     # UI resources
├── voxcpm_demo.py                   # Business logic
├── voice_presets.json               # Voice presets config
├── voice_presets.json.example       # Config template
├── requirements.txt                 # Dependencies
├── assets/                          # Logo and icons
├── static/                          # Frontend JavaScript
├── models/openbmb__VoxCPM-0.5B/     # Model weights (Optional, will download on first run)
├── presets/                         # Saved voice presets
├── examples/                        # Example audio
└── [working directories]/           # denoised_audio, extracted_audio, etc.; created under ./ by default

🐛 Troubleshooting

Issue: FFmpeg not found

  • Solution: Install FFmpeg and ensure it's in system PATH

Issue: CUDA out of memory

  • Solution: Reduce inference timesteps or use CPU mode (slower)

Issue: Audio extraction fails

  • Solution: Verify FFmpeg is installed and file format is supported

🙏 Acknowledgments


