
VoxCPM WebUI - Advanced Text-to-Speech with Voice Cloning


Never written a Gradio project before; the code is a bit rough since it was all written by AI. Please bear with it for now.



Overview

VoxCPM WebUI is a user-friendly web interface powered by Gradio for the VoxCPM Text-to-Speech model from OpenBMB/ModelBest. This project provides easy access to VoxCPM's powerful speech generation capabilities with advanced voice cloning features and intuitive controls.

🎨 System Preview

System Preview 1 System Preview 2

🎯 Key Features

This project adds tools for voice cloning and polishes a number of UX details:

  • Advanced Audio Tools - Built-in audio/video extraction tool for easy prompt preparation
  • Preset Voice Management - Save and manage multiple voice presets for quick synthesis

📋 Requirements

  • Python 3.10+
  • CUDA 11.8+ (for GPU acceleration) or CPU mode
  • FFmpeg (for audio/video processing)
  • 8GB+ VRAM recommended (for GPU)

🚀 Quick Start

0. Clone the Repository

git clone https://github.com/rainow/voxcpm-webui.git
cd voxcpm-webui

1. Create Virtual Environment

# Using Conda
conda create -n voxcpm python=3.10
conda activate voxcpm

# Or using venv
python3.10 -m venv voxcpm_env
source voxcpm_env/bin/activate  # On Windows: voxcpm_env\Scripts\activate

2. Install VoxCPM

Follow the installation instructions at https://github.com/OpenBMB/VoxCPM, and install voxcpm into the virtual environment created in step 1.

3. Install Dependencies

# Basic installation
pip install -r requirements.txt

# For GPU support (CUDA 11.8)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For GPU support (CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

4. Install FFmpeg

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Windows (using chocolatey)
choco install ffmpeg

# Or download from: https://ffmpeg.org/download.html
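Before launching the app, it can save a debugging round-trip to confirm FFmpeg is actually visible on your PATH. A minimal check (a sketch, not part of this repo):

```python
import shutil
import subprocess

def check_ffmpeg() -> bool:
    """Return True if an ffmpeg binary is found on the system PATH and runs."""
    path = shutil.which("ffmpeg")
    if path is None:
        return False
    # Running "ffmpeg -version" is a cheap sanity check that the binary works.
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    print("ffmpeg found" if check_ffmpeg() else "ffmpeg NOT found - install it first")
```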

5. Download Models (Optional)

Models will be automatically downloaded on first run. To download in advance:

python3 << 'EOF'
from huggingface_hub import snapshot_download
from modelscope import snapshot_download as ms_snapshot_download

# VoxCPM Model
snapshot_download("openbmb/VoxCPM-0.5B", local_dir="./models/openbmb__VoxCPM-0.5B")

# ZipEnhancer (for audio enhancement)
ms_snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')

# SenseVoice (for speech recognition)
ms_snapshot_download('iic/SenseVoiceSmall')
EOF

6. Run the Application

python app2.py

Make sure the virtual environment from step 1 is activated before running the script (conda activate voxcpm or source voxcpm_env/bin/activate).

The web interface will be available at: http://localhost:7860

For network access (LAN):

# app2.py already launches with LAN access enabled;
# edit its run_demo call or use environment variables to change the host/port
python app2.py
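In Gradio, network binding is controlled by arguments passed to launch(). The snippet below only illustrates the relevant parameters; how app2.py actually wires them into run_demo is not shown here, so treat the names as assumptions:

```python
# Typical Gradio launch arguments for LAN access (illustrative sketch only;
# check app2.py's run_demo call for the actual configuration).
launch_kwargs = {
    "server_name": "0.0.0.0",  # bind to all interfaces so other LAN hosts can connect
    "server_port": 7860,       # default Gradio port, matching http://localhost:7860
}
# Inside the app this would be applied roughly as: demo.launch(**launch_kwargs)
```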

📖 Usage Guide

Basic Text-to-Speech

  1. Enter Target Text - Input the text you want synthesized
  2. Adjust Parameters:
    • CFG Value (1.0-3.0): Controls adherence to prompt style
    • Inference Timesteps (4-30): Quality vs. speed tradeoff
    • Text Normalization: Enable for general text processing
  3. Click "Generate Speech" - Wait for synthesis to complete

Voice Cloning

  1. Provide Prompt Speech - Upload or record a reference audio clip (5-40 seconds)
  2. Add Prompt Text - Enter the corresponding transcript (auto-recognition available)
  3. Optional Enhancements:
    • Enable "Prompt Speech Enhancement" to denoise reference audio
  4. Generate - Synthesize new text with the cloned voice

Audio/Video Extraction Tool

  1. Upload File - Support for audio/video formats
  2. Preview & Set Time - Mark segment start and end times (5-40 seconds)
  3. Extract - Generate audio segment
  4. Use as Prompt - Automatically set as prompt speech with text recognition
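The extraction step above maps onto a plain FFmpeg invocation. The sketch below builds an equivalent command line; the 16 kHz mono output format is an assumption for illustration, not taken from this repo:

```python
def build_extract_cmd(src: str, start: float, end: float, dst: str) -> list[str]:
    """Build an ffmpeg command that cuts the [start, end] second window of audio from src."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ss", str(start),  # segment start (seconds)
        "-to", str(end),    # segment end (seconds)
        "-vn",              # drop any video stream
        "-ac", "1",         # mono
        "-ar", "16000",     # 16 kHz sample rate (assumed prompt format)
        dst,
    ]

print(build_extract_cmd("movie.mp4", 5.0, 35.0, "prompt.wav"))
```

Run the returned list with subprocess.run() to perform the cut outside the UI.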

Voice Presets

  1. Manage Presets - Access "Voice Configuration Guide"
  2. Save Configurations - Store current settings as reusable presets
  3. Load Presets - Quickly switch between saved voice configurations
  4. Edit Config - Modify voice_presets.json for advanced setup

⚙️ Configuration

Edit voice_presets.json to customize voice presets:

{
  "default_preset": "example",
  "default_text": "Default synthesis text",
  "max_chars_per_segment": 200,
  "segment_pause_duration": 0.5,
  "presets": {
    "custom_voice": {
      "name": "Custom Voice",
      "description": "A custom voice configuration",
      "prompt_speech": "./presets/custom_voice.wav",
      "prompt_text": "Reference text for this voice"
    }
  }
}
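When editing the file by hand, a quick sanity check helps catch missing fields before the UI loads it. A minimal sketch (field names follow the example above; the validation logic itself is not part of this repo):

```python
import json

# Fields each preset entry is expected to carry, per the example config above.
REQUIRED_KEYS = {"name", "description", "prompt_speech", "prompt_text"}

def validate_presets(config_text: str) -> list[str]:
    """Parse a voice_presets.json document and report presets with missing fields."""
    config = json.loads(config_text)
    problems = []
    for preset_id, preset in config.get("presets", {}).items():
        missing = REQUIRED_KEYS - preset.keys()
        if missing:
            problems.append(f"{preset_id}: missing {sorted(missing)}")
    return problems

sample = """{
  "default_preset": "example",
  "presets": {
    "custom_voice": {
      "name": "Custom Voice",
      "description": "A custom voice configuration",
      "prompt_speech": "./presets/custom_voice.wav",
      "prompt_text": "Reference text for this voice"
    }
  }
}"""
print(validate_presets(sample))  # → [] when every preset is complete
```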

🔧 Advanced Options

  • Text Normalization: Disable for phoneme input or special symbols
  • Prompt Speech Enhancement: Denoise reference audio for cleaner results
  • CFG Value: Lower for creativity, higher for prompt adherence
  • Long Text Support: Automatically splits text into segments for coherent synthesis
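The long-text behavior can be pictured as a greedy splitter that packs sentences into segments under a max_chars_per_segment budget. This is a sketch of the idea only; the actual segmentation lives in voxcpm_demo.py and may differ:

```python
import re

def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack sentences into segments of at most max_chars characters."""
    # Split on whitespace that follows sentence-ending punctuation (Latin and CJK).
    sentences = [s for s in re.split(r"(?<=[.!?。!?])\s+", text) if s]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)   # current segment is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments

print(split_text("First sentence. Second sentence. Third sentence.", max_chars=20))
```

The segment_pause_duration setting from voice_presets.json would then control the silence inserted between synthesized segments.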

📁 Project Structure

VoxCPM/
├── app2.py                          # Main entry point; named app2.py because VoxCPM itself ships an app.py, so this project can be copied straight into the VoxCPM directory without conflicts
├── app_utils.py                     # Utility functions
├── ui_assets.py                     # UI resources
├── voxcpm_demo.py                   # Business logic
├── voice_presets.json               # Voice presets config
├── voice_presets.json.example       # Config template
├── requirements.txt                 # Dependencies
├── assets/                          # Logo and icons
├── static/                          # Frontend JavaScript
├── models/openbmb__VoxCPM-0.5B/     # Model weights (Optional, will download on first run)
├── presets/                         # Saved voice presets
├── examples/                        # Example audio
└── [working directories]/           # denoised_audio, extracted_audio, etc.; created under ./ by default

🐛 Troubleshooting

Issue: FFmpeg not found

  • Solution: Install FFmpeg and ensure it's in system PATH

Issue: CUDA out of memory

  • Solution: Reduce inference timesteps or use CPU mode (slower)

Issue: Audio extraction fails

  • Solution: Verify FFmpeg is installed and file format is supported

🙏 Acknowledgments


