I have never written a Gradio project before, and the code is a bit ugly because it was all written by AI. Please bear with it for now.
VoxCPM WebUI is a user-friendly web interface powered by Gradio for the VoxCPM Text-to-Speech model from OpenBMB/ModelBest. This project provides easy access to VoxCPM's powerful speech generation capabilities with advanced voice cloning features and intuitive controls.
This project mainly adds voice cloning tools and refines several UX details.
- Advanced Audio Tools - Built-in audio/video extraction tool for easy prompt preparation
- Preset Voice Management - Save and manage multiple voice presets for quick synthesis
- Python 3.10+
- CUDA 11.8+ (for GPU acceleration) or CPU mode
- FFmpeg (for audio/video processing)
- 8GB+ VRAM recommended (for GPU)
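
The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of this project: `check_prerequisites` is a hypothetical helper name, and it only reports what it finds (Python version, whether `ffmpeg` is on the PATH, and whether PyTorch sees a CUDA device if PyTorch happens to be installed already).

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Report which of the listed requirements appear to be satisfied.

    Hypothetical helper for illustration; the WebUI does not ship this function.
    """
    report = {
        "python_ok": sys.version_info >= min_python,
        "ffmpeg_path": shutil.which("ffmpeg"),  # None if ffmpeg is not on PATH
        "cuda_available": None,  # stays None when PyTorch is not installed
    }
    try:
        import torch  # optional: only present after the dependency step

        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

A `cuda_available` value of `False` means the installed PyTorch build cannot see a GPU, in which case CPU mode is the fallback.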
git clone https://github.com/rainow/voxcpm-webui.git
cd voxcpm-webui

# Using Conda
conda create -n voxcpm python=3.10
conda activate voxcpm
# Or using venv
python3.10 -m venv voxcpm_env
source voxcpm_env/bin/activate # On Windows: voxcpm_env\Scripts\activate

To install VoxCPM itself, see https://github.com/OpenBMB/VoxCPM and install it into the virtual environment created in the previous step.
# Basic installation
pip install -r requirements.txt
# For GPU support (CUDA 11.8)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For GPU support (CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Windows (using chocolatey)
choco install ffmpeg
# Or download from: https://ffmpeg.org/download.html

Models will be automatically downloaded on first run. To download in advance:
python3 << 'EOF'
from huggingface_hub import snapshot_download
from modelscope import snapshot_download as ms_snapshot_download
# VoxCPM Model
snapshot_download("openbmb/VoxCPM-0.5B", local_dir="./models/openbmb__VoxCPM-0.5B")
# ZipEnhancer (for audio enhancement)
ms_snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')
# SenseVoice (for speech recognition)
ms_snapshot_download('iic/SenseVoiceSmall')
EOF

Start the application:
python app2.py

Make sure you have activated the virtual environment before running the script (conda activate voxcpm or source voxcpm_env/bin/activate).
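
Whether an environment is active can also be checked from Python itself. This is a small sketch under stated assumptions: `active_environment` is a hypothetical helper, it relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets, and it compares `sys.prefix` with `sys.base_prefix` to detect a venv.

```python
import os
import sys


def active_environment(environ=None):
    """Return a short label for the active environment, or None if there is none.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    conda_env = environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        # Note: this is also set (to "base") when only conda's base env is active.
        return f"conda:{conda_env}"
    if sys.prefix != sys.base_prefix:
        # Inside a venv, sys.prefix points at the venv directory.
        return f"venv:{sys.prefix}"
    return None


if __name__ == "__main__":
    label = active_environment()
    print(label or "No virtual environment detected; activate one before running app2.py")
```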
The web interface will be available at: http://localhost:7860
For network access (LAN):
# Modify app2.py run_demo call or use environment variable
python app2.py # Already configured for network access

- Enter Target Text - Input the text you want synthesized
- Adjust Parameters:
  - CFG Value (1.0-3.0): Controls adherence to prompt style
  - Inference Timesteps (4-30): Quality vs. speed tradeoff
  - Text Normalization: Enable for general text processing
- Click "Generate Speech" - Wait for synthesis to complete
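
The parameter ranges listed above can be enforced in code before a synthesis request is made. The sketch below is illustrative only: `SynthesisParams` and `make_params` are hypothetical names, and the default values (CFG 2.0, 10 timesteps) are assumptions rather than this project's documented defaults. Only the ranges come from the list above.

```python
from dataclasses import dataclass

CFG_RANGE = (1.0, 3.0)    # range stated for "CFG Value"
TIMESTEP_RANGE = (4, 30)  # range stated for "Inference Timesteps"


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass(frozen=True)
class SynthesisParams:
    text: str
    cfg_value: float
    inference_timesteps: int
    normalize: bool


def make_params(text, cfg_value=2.0, inference_timesteps=10, normalize=True):
    """Build a validated parameter set; defaults here are assumed, not official."""
    text = text.strip()
    if not text:
        raise ValueError("target text must not be empty")
    return SynthesisParams(
        text=text,
        cfg_value=float(clamp(cfg_value, *CFG_RANGE)),
        inference_timesteps=int(clamp(inference_timesteps, *TIMESTEP_RANGE)),
        normalize=bool(normalize),
    )
```

Clamping instead of raising keeps out-of-range values from reaching the model; raising an error would be an equally reasonable choice.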
- Provide Prompt Speech - Upload or record a reference audio clip (5-40 seconds)
- Add Prompt Text - Enter the corresponding transcript (auto-recognition available)
- Optional Enhancements:
  - Enable "Prompt Speech Enhancement" to denoise reference audio
- Generate - Synthesize new text with the cloned voice
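
Since the reference clip should be 5 to 40 seconds long, it can help to check the duration before uploading. The following sketch uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only; the function names are hypothetical and other formats would first need converting (for example with FFmpeg).

```python
import wave

MIN_SECONDS = 5.0   # lower bound stated for reference audio
MAX_SECONDS = 40.0  # upper bound stated for reference audio


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_prompt_clip(path):
    """True if the clip length falls inside the recommended 5-40 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```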
- Upload File - Support for audio/video formats
- Preview & Set Time - Mark segment start and end times (5-40 seconds)
- Extract - Generate audio segment
- Use as Prompt - Automatically set as prompt speech with text recognition
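
For reference, an equivalent extraction can be done outside the UI by calling FFmpeg directly. This is a sketch of one way to do it, not the command the WebUI actually runs: the function names are hypothetical, and the mono 16 kHz output is an assumption chosen for illustration.

```python
import subprocess


def build_extract_command(source, output, start, end, sample_rate=16000):
    """Build an ffmpeg command that cuts [start, end] seconds to a mono WAV file."""
    if start < 0 or end <= start:
        raise ValueError("start must be >= 0 and end must be greater than start")
    duration = end - start
    if not 5 <= duration <= 40:
        raise ValueError("segment should be 5-40 seconds long")
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(source),         # input audio or video file
        "-ss", f"{start:.3f}",     # seek to the start time
        "-t", f"{duration:.3f}",   # keep this many seconds
        "-vn",                     # drop any video stream
        "-ac", "1",                # mono (assumed)
        "-ar", str(sample_rate),   # sample rate (assumed)
        str(output),
    ]


def extract_segment(source, output, start, end):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_extract_command(source, output, start, end), check=True)
```

For example, `extract_segment("talk.mp4", "prompt.wav", 10, 25)` would write a 15-second clip, assuming `ffmpeg` is on the PATH.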
- Manage Presets - Access "Voice Configuration Guide"
- Save Configurations - Store current settings as reusable presets
- Load Presets - Quickly switch between saved voice configurations
- Edit Config - Modify voice_presets.json for advanced setup
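
When editing the preset file by hand, a quick consistency check can catch typos before the UI is started. The sketch below is an assumption-laden illustration: `load_presets` is a hypothetical helper, the field names (`default_preset`, `presets`, `prompt_speech`, `prompt_text`) are those used in this README's sample configuration, and resolving audio paths relative to the config file is an assumption. Whether the application itself treats these findings as errors is not stated here.

```python
import json
from pathlib import Path


def load_presets(config_path):
    """Load a preset file and return (config, list of human-readable problems)."""
    config_path = Path(config_path)
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)

    presets = config.get("presets", {})
    problems = []

    default = config.get("default_preset")
    if default is not None and default not in presets:
        problems.append(f"default_preset {default!r} is not defined under 'presets'")

    for key, preset in presets.items():
        for field in ("prompt_speech", "prompt_text"):
            if not preset.get(field):
                problems.append(f"preset {key!r} is missing {field!r}")
        audio = preset.get("prompt_speech")
        # Assumption: relative audio paths are resolved from the config file's folder.
        if audio and not (config_path.parent / audio).exists():
            problems.append(f"preset {key!r}: audio file {audio!r} not found")

    return config, problems


if __name__ == "__main__":
    _, issues = load_presets("voice_presets.json")
    print("\n".join(issues) or "No problems found")
```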
Edit voice_presets.json to customize voice presets:
{
  "default_preset": "example",
  "default_text": "Default synthesis text",
  "max_chars_per_segment": 200,
  "segment_pause_duration": 0.5,
  "presets": {
    "custom_voice": {
      "name": "Custom Voice",
      "description": "A custom voice configuration",
      "prompt_speech": "./presets/custom_voice.wav",
      "prompt_text": "Reference text for this voice"
    }
  }
}

- Text Normalization: Disable for phoneme input or special symbols
- Prompt Speech Enhancement: Denoise reference audio for cleaner results
- CFG Value: Lower for creativity, higher for prompt adherence
- Long Text Support: Automatically splits text into segments for coherent synthesis
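
To illustrate the long text support, here is one way text could be split into segments no longer than `max_chars_per_segment`. This is a sketch of the general idea, not the project's actual splitting logic: `split_text` is a hypothetical function that breaks at sentence-ending punctuation (Chinese and English), packs sentences greedily, and hard-wraps any single sentence that is still too long.

```python
import re

# Split after CJK/Latin sentence punctuation, after a period followed by
# whitespace, or at line breaks.
_SENTENCE_BREAK = re.compile(r"(?<=[。!?;!?;])\s*|(?<=\.)\s+|\n+")


def split_text(text, max_chars=200):
    """Split text into segments of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    sentences = [s.strip() for s in _SENTENCE_BREAK.split(text) if s and s.strip()]

    segments, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current segment
        else:
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments
```

Each segment would then be synthesized separately and joined with a pause such as the `segment_pause_duration` value from the configuration. Sentences are joined with a space here, which suits English; for Chinese text an empty joiner may be preferable.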
VoxCPM/
├── app2.py # Main application entry; named app2.py because the VoxCPM project already has an app.py, so these files can be copied into the VoxCPM directory without conflict
├── app_utils.py # Utility functions
├── ui_assets.py # UI resources
├── voxcpm_demo.py # Business logic
├── voice_presets.json # Voice presets config
├── voice_presets.json.example # Config template
├── requirements.txt # Dependencies
├── assets/ # Logo and icons
├── static/ # Frontend JavaScript
├── models/openbmb__VoxCPM-0.5B/ # Model weights (optional; downloaded on first run)
├── presets/ # Saved voice presets
├── examples/ # Example audio
└── [working directories]/ # denoised_audio, extracted_audio, etc.; created under ./ by default
Issue: FFmpeg not found
- Solution: Install FFmpeg and ensure it's in system PATH
Issue: CUDA out of memory
- Solution: Reduce inference timesteps or use CPU mode (slower)
Issue: Audio extraction fails
- Solution: Verify FFmpeg is installed and file format is supported
- VoxCPM model: OpenBMB

