Clone any voice, design new voices, or use preset speakers — powered by Qwen3-TTS.
| Mode | Description |
|---|---|
| 🔁 Voice Cloning | Clone any voice from a 3–30 second reference audio |
| 🎨 Voice Design | Create a new voice from a natural language description |
| 🗣️ Custom Voice | Use 9 built-in high-quality preset voices |
| 🌏 Multilingual | Supports 10 languages with cross-lingual cloning |
- Open Qwen3_TTS_Voice_Clone_Full.ipynb in Google Colab
- Go to Runtime → Change runtime type → T4 GPU
- Run cells from top to bottom
pip install qwen-tts soundfile librosa

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    'Qwen/Qwen3-TTS-12Hz-1.7B-Base',
    device_map='cuda:0',
    dtype=torch.bfloat16,
)

wavs, sr = model.generate_voice_clone(
    text="Hello! This is a cloned voice.",
    ref_audio='path/to/reference.wav',
    ref_text='Transcript of the reference audio.',
)
sf.write('output.wav', wavs[0], sr)

- Python 3.9+
- CUDA GPU (8 GB+ VRAM for 1.7B, 4 GB+ for 0.6B)
- PyTorch 2.0+
| Model | VRAM | Quality | Speed |
|---|---|---|---|
| 1.7B (recommended) | ~8 GB | ⭐⭐⭐⭐⭐ | Slower |
| 0.6B (lightweight) | ~4 GB | ⭐⭐⭐⭐ | Faster |
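The table above can be turned into a small selection helper. The sketch below is illustrative only: `pick_model_size`, `detect_vram_gb`, and `model_id_for` are hypothetical names, not part of `qwen_tts`, and the thresholds are the approximate figures from the table. A card sold as "8 GB" may report slightly less usable memory, so treat the cut-offs as nominal.

```python
def pick_model_size(vram_gb: float) -> str:
    """Return the largest model size from the table that fits in `vram_gb`."""
    # Thresholds mirror the approximate VRAM figures in the table (nominal GB).
    if vram_gb >= 8:
        return "1.7B"
    if vram_gb >= 4:
        return "0.6B"
    raise ValueError(
        f"{vram_gb:.1f} GB of VRAM is below the ~4 GB listed for the 0.6B model"
    )


def detect_vram_gb(device_index: int = 0) -> float:
    """Total memory of a CUDA device in GB, or 0.0 without a GPU or PyTorch."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(device_index).total_memory / 1024**3


def model_id_for(size: str) -> str:
    """Build the Base checkpoint name used elsewhere in this README."""
    return f"Qwen/Qwen3-TTS-12Hz-{size}-Base"
```

For example, `model_id_for(pick_model_size(16))` gives the 1.7B Base checkpoint name, which can then be passed to `Qwen3TTSModel.from_pretrained`.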
Clone a voice by providing a reference audio file and its transcript.
wavs, sr = model.generate_voice_clone(
    text="Text you want to synthesize.",
    ref_audio='reference.wav',                     # Audio to clone (3–30 seconds)
    ref_text='Exact transcript of reference.wav',  # Must match the audio
)

Important: `ref_text` must be the exact transcript of what is spoken in `ref_audio`. Mismatched text will reduce quality significantly.
- ✅ 3–10 seconds: ideal
- ✅ 10–30 seconds: works well
- ⚠️ 30+ seconds: may cause out-of-memory errors
- 🎯 Use clean audio with minimal background noise
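The length guidance above can be checked before a clip is handed to the model. This is a minimal sketch using only the standard library `wave` module, so it reads uncompressed PCM WAV files only. The function names are hypothetical, and the "too short" label for clips under 3 seconds is an inference from the 3–30 second range stated in this README.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def classify_reference_length(seconds: float) -> str:
    """Map a clip duration onto the guidance list in this README."""
    if seconds < 3:
        return "too short"   # below the documented 3-second minimum
    if seconds <= 10:
        return "ideal"
    if seconds <= 30:
        return "ok"
    return "too long"        # may cause out-of-memory errors
```

Calling `classify_reference_length(wav_duration_seconds('reference.wav'))` before `generate_voice_clone` gives an early warning about clips that are likely to fail or clone poorly.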
Create a new voice from a text description.
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    'Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign',
    device_map='cuda:0',
    dtype=torch.bfloat16,
)

wavs, sr = model.generate_voice_design(
    text="Welcome to the show!",
    instruct="A warm and energetic female voice with a clear and engaging tone.",
)

Use one of 9 built-in voices with optional style control.
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice',
    device_map='cuda:0',
    dtype=torch.bfloat16,
)

wavs, sr = model.generate_custom_voice(
    text="Hello! Nice to meet you.",
    language='English',
    speaker='Vivian',
    instruct='Speak warmly and clearly.',  # Optional
)

Vivian · Ryan · Ava · Liam · Emma · Noah · Sophia · Oliver · Isabella
Chinese · English · Japanese · Korean · German · French · Russian · Portuguese · Spanish · Italian
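A typo in `speaker` or `language` is easier to catch before the model is called. The sketch below copies the two lists from this README and normalises casing. `check_custom_voice_args` is a hypothetical helper, and it assumes `generate_custom_voice` expects the names spelled exactly as listed here.

```python
# Values copied from the speaker and language lists in this README.
PRESET_SPEAKERS = (
    "Vivian", "Ryan", "Ava", "Liam", "Emma",
    "Noah", "Sophia", "Oliver", "Isabella",
)
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)


def _canonical(value: str, allowed: tuple, kind: str) -> str:
    """Return the listed spelling of `value`, ignoring case and outer spaces."""
    lookup = {name.lower(): name for name in allowed}
    key = value.strip().lower()
    if key not in lookup:
        raise ValueError(f"Unknown {kind} {value!r}; expected one of {', '.join(allowed)}")
    return lookup[key]


def check_custom_voice_args(language: str, speaker: str) -> tuple:
    """Validate both arguments and return them in their listed spelling."""
    return (
        _canonical(language, SUPPORTED_LANGUAGES, "language"),
        _canonical(speaker, PRESET_SPEAKERS, "speaker"),
    )
```

The returned pair can be passed straight through as `language=` and `speaker=`.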
You can clone a voice in one language and generate speech in another.
# Reference audio in English → output in Japanese
wavs, sr = model.generate_voice_clone(
    text="こんにちは!音声クローニングのデモです。",
    ref_audio='english_voice.wav',
    ref_text='Transcript of the English reference audio.',
)

Qwen3-TTS uses zero-shot cloning, so there is no separate voice model file to save. To reuse a cloned voice, keep the reference audio file and its transcript, then pass them into every generation call.
import soundfile as sf
import torch
from qwen_tts import Qwen3TTSModel


class VoiceCloner:
    """Holds one loaded model plus the reference clip that defines the voice."""

    def __init__(self, model_size='1.7B'):
        self.model = Qwen3TTSModel.from_pretrained(
            f'Qwen/Qwen3-TTS-12Hz-{model_size}-Base',
            device_map='cuda:0',
            dtype=torch.bfloat16,
        )
        self.ref_audio = None
        self.ref_text = None

    def set_voice(self, ref_audio_path, ref_text):
        self.ref_audio = ref_audio_path
        self.ref_text = ref_text

    def speak(self, text, output_path='output.wav'):
        # Fail early with a clear message if no reference voice was set.
        if self.ref_audio is None or self.ref_text is None:
            raise ValueError('Call set_voice() before speak().')
        wavs, sr = self.model.generate_voice_clone(
            text=text,
            ref_audio=self.ref_audio,
            ref_text=self.ref_text,
        )
        sf.write(output_path, wavs[0], sr)
        return output_path

Key point: The only things you need to keep are `reference.wav` and the `ref_text` string. These two items define the voice identity.
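Because a voice is just an audio path plus its transcript, the pair can be stored in a small JSON file next to the clip. The following is a sketch of one way to do that. `VoiceProfile` and its file layout are hypothetical, not something `qwen_tts` defines.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class VoiceProfile:
    """The two items that define a cloned voice."""
    ref_audio: str  # path to the reference clip
    ref_text: str   # exact transcript of that clip

    def save(self, path: str) -> None:
        # ensure_ascii=False keeps non-English transcripts readable on disk.
        payload = json.dumps(asdict(self), ensure_ascii=False, indent=2)
        Path(path).write_text(payload, encoding="utf-8")

    @classmethod
    def load(cls, path: str) -> "VoiceProfile":
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        return cls(ref_audio=data["ref_audio"], ref_text=data["ref_text"])
```

A loaded profile plugs into the class above with `cloner.set_voice(profile.ref_audio, profile.ref_text)`.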
| Problem | Solution |
|---|---|
| CUDA out of memory | Switch to 0.6B model or reduce reference audio length |
| Poor cloning quality | Use cleaner audio; ensure `ref_text` exactly matches `ref_audio` |
| Slow generation | Install flash-attn: `pip install flash-attn --no-build-isolation` |
| Model download fails | Check internet connection — models are 1–3.5 GB |
| No GPU in Colab | Go to Runtime → Change runtime type → T4 GPU |
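For the out-of-memory case, shortening the reference clip can be done without extra dependencies. The sketch below keeps only the first few seconds of a PCM WAV file using the standard library. `trim_wav` is a hypothetical helper, and the 10-second default follows the "ideal" range given earlier in this README.

```python
import wave


def trim_wav(src_path: str, dst_path: str, max_seconds: float = 10.0) -> float:
    """Copy at most `max_seconds` from the start of a PCM WAV file.

    Returns the duration, in seconds, of the file that was written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_to_keep = min(src.getnframes(), int(max_seconds * src.getframerate()))
        frames = src.readframes(frames_to_keep)

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and rate; the frame count in the
        # header is corrected automatically once the data is written.
        dst.setparams(params)
        dst.writeframes(frames)

    return frames_to_keep / params.framerate
```

After trimming, `ref_text` has to be shortened as well so that it still matches exactly what is spoken in the new clip.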
.
├── Qwen3_TTS_Voice_Clone_Full.ipynb # Full Google Colab notebook
├── voice_cloner.py # Reusable VoiceCloner class
├── api.py # FastAPI server (optional)
├── README.md
└── samples/
    └── reference.wav                   # Your reference audio files
This project uses Qwen3-TTS, which is released under the Apache 2.0 License.