HakusaiTH/Qwen3_TTS_Voice_Clone_Full

🎙️ AI Voice Clone with Qwen3-TTS

Clone any voice, design new voices, or use preset speakers — powered by Qwen3-TTS.


✨ Features

| Mode | Description |
| --- | --- |
| 🔁 Voice Cloning | Clone any voice from a 3–30 second reference audio |
| 🎨 Voice Design | Create a new voice from a natural language description |
| 🗣️ Custom Voice | Use 9 built-in high-quality preset voices |
| 🌏 Multilingual | Supports 10 languages with cross-lingual cloning |

🚀 Quick Start

Option 1 — Google Colab (Recommended)

  1. Open Qwen3_TTS_Voice_Clone_Full.ipynb in Google Colab
  2. Go to Runtime → Change runtime type → T4 GPU
  3. Run cells from top to bottom

Option 2 — Local Installation

Install the dependencies:

pip install qwen-tts soundfile librosa

Then load the Base model and clone a voice:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    'Qwen/Qwen3-TTS-12Hz-1.7B-Base',
    device_map='cuda:0',
    dtype=torch.bfloat16,
)

wavs, sr = model.generate_voice_clone(
    text="Hello! This is a cloned voice.",
    ref_audio='path/to/reference.wav',
    ref_text='Transcript of the reference audio.',
)

sf.write('output.wav', wavs[0], sr)

📋 Requirements

  • Python 3.9+
  • CUDA GPU (8 GB+ VRAM for 1.7B, 4 GB+ for 0.6B)
  • PyTorch 2.0+

🧠 Model Sizes

| Model | VRAM | Quality | Speed |
| --- | --- | --- | --- |
| 1.7B (recommended) | ~8 GB | ⭐⭐⭐⭐⭐ | Slower |
| 0.6B (lightweight) | ~4 GB | ⭐⭐⭐⭐ | Faster |
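
The table above can be turned into a small selection helper. This is a minimal sketch, not part of the project: `pick_model_size` and `model_id` are hypothetical names, and the 8 GB / 4 GB thresholds are simply the approximate figures from the table.

```python
def pick_model_size(vram_gb: float) -> str:
    """Return the largest model size whose approximate VRAM need fits.

    Thresholds are the rough figures from the Model Sizes table,
    not measured requirements.
    """
    if vram_gb >= 8:
        return "1.7B"
    if vram_gb >= 4:
        return "0.6B"
    raise RuntimeError(
        f"{vram_gb:.1f} GB of VRAM is below the ~4 GB needed for the 0.6B model"
    )


def model_id(size: str, variant: str = "Base") -> str:
    """Build a Hugging Face model ID following the naming used in this README."""
    return f"Qwen/Qwen3-TTS-12Hz-{size}-{variant}"


print(model_id(pick_model_size(16)))
```

With PyTorch installed, the available memory can be read as `torch.cuda.get_device_properties(0).total_memory / 1024**3`. A card sold as "8 GB" usually reports slightly less than 8, so you may want to lower the thresholds a little.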

🔁 Voice Cloning

Clone a voice by providing a reference audio file and its transcript. This uses the Base model loaded in Quick Start.

wavs, sr = model.generate_voice_clone(
    text="Text you want to synthesize.",
    ref_audio='reference.wav',   # Audio to clone (3–30 seconds)
    ref_text='Exact transcript of reference.wav',  # Must match the audio
)

Important: ref_text must be the exact transcript of what is spoken in ref_audio. Mismatched text will reduce quality significantly.

Reference Audio Tips

  • ✅ 3–10 seconds — ideal
  • ✅ 10–30 seconds — works well
  • ⚠️ 30+ seconds — may cause out-of-memory errors
  • 🎯 Use clean audio with minimal background noise
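
The length guidance above can be checked before a clip is passed to the model. The sketch below uses only the standard-library `wave` module, so it assumes a PCM `.wav` file; `wav_duration_seconds` and `classify_reference` are hypothetical helper names, and the "too short" label for clips under 3 seconds is inferred from the 3–30 second range stated in Features.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def classify_reference(seconds: float) -> str:
    """Map a clip length onto the tiers listed in Reference Audio Tips."""
    if seconds < 3:
        return "too short"   # below the 3-second minimum
    if seconds <= 10:
        return "ideal"       # 3-10 seconds
    if seconds <= 30:
        return "ok"          # 10-30 seconds
    return "too long"        # 30+ seconds, may run out of memory
```

For example, `classify_reference(wav_duration_seconds('reference.wav'))` gives a quick verdict before any GPU time is spent.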

🎨 Voice Design

Create a new voice from a text description.

import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    'Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign',
    device_map='cuda:0',
    dtype=torch.bfloat16,
)

wavs, sr = model.generate_voice_design(
    text="Welcome to the show!",
    instruct="A warm and energetic female voice with a clear and engaging tone.",
)

🗣️ Custom Voice (Preset Speakers)

Use one of 9 built-in voices with optional style control.

import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice',
    device_map='cuda:0',
    dtype=torch.bfloat16,
)

wavs, sr = model.generate_custom_voice(
    text="Hello! Nice to meet you.",
    language='English',
    speaker='Vivian',
    instruct='Speak warmly and clearly.',  # Optional
)

Available Speakers

Vivian · Serena · Uncle_Fu · Dylan · Eric · Ryan · Aiden · Ono_Anna · Sohee

Supported Languages

Chinese · English · Japanese · Korean · German · French · Russian · Portuguese · Spanish · Italian


🌏 Cross-lingual Cloning

You can clone a voice in one language and generate speech in another. This uses the Base model, as in Voice Cloning.

# Reference audio in English → output in Japanese
# The Japanese text below means: "Hello! This is a voice cloning demo."
wavs, sr = model.generate_voice_clone(
    text="こんにちは!音声クローニングのデモです。",
    ref_audio='english_voice.wav',
    ref_text='Transcript of the English reference audio.',
)

💾 Reuse a Cloned Voice Across Projects

Qwen3-TTS uses zero-shot cloning — there is no separate voice model file to save. To reuse a cloned voice, keep the reference audio file and its transcript, then pass them into every generation call.

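import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel


class VoiceCloner:
    """Holds one loaded Base model plus the reference pair that defines a voice."""

    def __init__(self, model_size='1.7B'):
        # model_size is '1.7B' or '0.6B' (see Model Sizes)
        self.model = Qwen3TTSModel.from_pretrained(
            f'Qwen/Qwen3-TTS-12Hz-{model_size}-Base',
            device_map='cuda:0',
            dtype=torch.bfloat16,
        )
        self.ref_audio = None
        self.ref_text = None

    def set_voice(self, ref_audio_path, ref_text):
        # Store the reference pair; every speak() call reuses it.
        self.ref_audio = ref_audio_path
        self.ref_text = ref_text

    def speak(self, text, output_path='output.wav'):
        # Fail early with a clear message if no voice has been set.
        if self.ref_audio is None or self.ref_text is None:
            raise ValueError('Call set_voice() before speak().')
        wavs, sr = self.model.generate_voice_clone(
            text=text,
            ref_audio=self.ref_audio,
            ref_text=self.ref_text,
        )
        sf.write(output_path, wavs[0], sr)
        return output_path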

Key point: The only things you need to keep are the reference audio file (for example reference.wav) and its transcript, the ref_text string. Together, these two items define the voice identity.
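
One way to keep the reference pair together across projects is to store it in a small JSON file next to the audio. This is a sketch under assumptions: `VoiceProfile`, `save_profile`, and `load_profile` are hypothetical names and the JSON layout is made up for illustration; none of it comes from Qwen3-TTS or this repository.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class VoiceProfile:
    """The two items that define a cloned voice."""
    ref_audio: str  # path to the reference audio file
    ref_text: str   # exact transcript of that audio


def save_profile(profile: VoiceProfile, path: str) -> None:
    """Write the profile as UTF-8 JSON (keeps non-ASCII transcripts readable)."""
    Path(path).write_text(
        json.dumps(asdict(profile), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )


def load_profile(path: str) -> VoiceProfile:
    """Read a profile previously written by save_profile()."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return VoiceProfile(ref_audio=data["ref_audio"], ref_text=data["ref_text"])
```

A loaded profile can then be handed to the class above with `cloner.set_voice(profile.ref_audio, profile.ref_text)`.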


🛠️ Troubleshooting

| Problem | Solution |
| --- | --- |
| CUDA out of memory | Switch to the 0.6B model or reduce the reference audio length |
| Poor cloning quality | Use cleaner audio; ensure ref_text exactly matches ref_audio |
| Slow generation | Install flash-attn: pip install flash-attn --no-build-isolation |
| Model download fails | Check your internet connection; models are 1–3.5 GB |
| No GPU in Colab | Go to Runtime → Change runtime type → T4 GPU |
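
For the out-of-memory case, shortening the reference clip can be done without extra dependencies. The following is a minimal sketch using the standard-library `wave` module, so it assumes a PCM `.wav` input; `trim_wav` is a hypothetical helper, and the 10-second default is taken from the "ideal" range in Reference Audio Tips.

```python
import wave


def trim_wav(src: str, dst: str, max_seconds: float = 10.0) -> float:
    """Copy at most the first max_seconds of src into dst.

    Returns the duration (in seconds) of the file that was written.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        # Number of frames to keep: the whole clip if it is already short enough.
        keep = min(reader.getnframes(), int(max_seconds * rate))
        frames = reader.readframes(keep)

    with wave.open(dst, "wb") as writer:
        # Same channels, sample width, and rate as the source;
        # the frame count in the header is corrected when the file is closed.
        writer.setparams(params)
        writer.writeframes(frames)

    return keep / rate
```

After trimming, shorten ref_text as well so that it still matches exactly what is spoken in the shorter clip.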

📁 Project Structure

.
├── Qwen3_TTS_Voice_Clone_Full.ipynb  # Full Google Colab notebook
├── voice_cloner.py                   # Reusable VoiceCloner class
├── api.py                            # FastAPI server (optional)
├── README.md
└── samples/
    └── reference.wav                 # Your reference audio files

📄 License

This project uses Qwen3-TTS, which is released under the Apache 2.0 License.
