Free voice cloning for creators using Qwen3-TTS on Google Colab.
Clone your voice from as little as 3–20 seconds of audio for consistent narration and voiceovers.
Complete guide to build your own notebook. Apache 2.0 licensed.
Qwen3-TTS is a multilingual text-to-speech system with high-quality zero-shot voice cloning capabilities.
It uses a discrete speech-token language-model architecture combined with a flow-matching decoder to generate natural speech that preserves speaker identity, cadence, and tone from a short reference clip.
Unlike many creator-facing TTS systems, Qwen3-TTS is fully open-source (Apache 2.0), produces unwatermarked audio, and does not require external APIs or paid inference services.
This repository focuses primarily on voice cloning workflows for creators. Qwen3-TTS also supports preset voices and text-designed speakers which we've included for additional testing.
Note: Qwen3-TTS is a recent release, so this project is more experimental than our prior work with AI-Voice-Clone-with-Coqui-XTTS-v2, although early test results have been promising. It extends our ongoing exploration of Colab-based voice cloning notebooks.
Qwen3-TTS uses a discrete speech-token language-model architecture and supports a dual-track hybrid streaming mode for low-latency synthesis. The process differs depending on whether you're using a preset voice or cloning:
Preset voices use the CustomVoice model. You provide text and a speaker name, and optionally a style instruction (e.g. "speak with calm authority"). The model generates speech matching that speaker's characteristics and your style direction.
Voice cloning uses the Base model. You provide a reference audio clip and its transcription. The model extracts acoustic features — pitch, tone, cadence, speaking style — and encodes them into a speaker embedding. That embedding is then used to synthesize new text in your voice. The transcription (ref_text) gives the model a phonetic alignment target, which improves accuracy over audio alone.
- Audio Analysis — The model analyzes your reference audio to extract acoustic features such as pitch, tone, cadence, and speaking style.
- Speaker Encoding — These features are encoded into a speaker embedding that represents your vocal identity.
- Text-to-Speech Generation — Given new text, the model synthesizes speech that matches the learned speaker characteristics.
- Waveform Synthesis — The generated speech tokens are decoded into a high-quality waveform suitable for narration and post-production.
Providing the transcription of your reference clip (ref_text) improves phonetic alignment and cloning accuracy, especially with very short samples.
VoiceDesign (1.7B only) generates a brand new speaker from a text description. No reference audio needed — just describe the voice characteristics you want.
- Model: Qwen3-TTS (0.6B and 1.7B variants)
- Framework: PyTorch
- Inference: Runs on Google Colab’s free T4 GPU (16 GB VRAM)
- Sample Rate: 24 kHz output
- Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Qwen3-TTS benefits from GPU-accelerated inference, which supports practical voice cloning workflows and faster iteration.
Google Colab provides free GPU-accelerated computing, and a single T4 GPU is sufficient to run all Qwen3-TTS models used in this guide. No local hardware, local Python installation, paid cloud service, or complex setup is needed: everything runs in the notebook.
All inference runs inside your Colab session — no audio is uploaded to third-party APIs.
- Consistent narration across multiple recording sessions
- Editing specific audio segments without re-recording the whole piece
- Voiceovers when recording conditions aren't ideal
- Generating placeholder audio for video editing workflows
- Exploring preset voices and style control for creative projects
All weights are on HuggingFace. The qwen-tts package downloads them automatically on first load.
| Model | Size | Purpose |
|---|---|---|
| Qwen3-TTS-12Hz-0.6B-CustomVoice | 2.52 GB | 9 preset voices with style control |
| Qwen3-TTS-12Hz-0.6B-Base | 2.52 GB | Voice cloning from your audio |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 4.54 GB | Preset voices, higher quality |
| Qwen3-TTS-12Hz-1.7B-Base | 4.54 GB | Voice cloning, higher quality |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 4.54 GB | Create voices from text descriptions |
All fit in a Colab T4's 16 GB VRAM. Start with 0.6B, step up to 1.7B for better quality.
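The table above can be captured as a small lookup so a notebook cell can switch models by size and mode. This is a minimal sketch: `MODEL_IDS` and `pick_model_id` are hypothetical names introduced here, and the IDs assume the `Qwen/` prefix used in the loading cells of this guide.

```python
# Hypothetical lookup: (size, mode) -> Hugging Face model ID, taken from the table above.
MODEL_IDS = {
    ("0.6B", "CustomVoice"): "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("0.6B", "Base"): "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    ("1.7B", "CustomVoice"): "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("1.7B", "Base"): "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    ("1.7B", "VoiceDesign"): "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
}


def pick_model_id(mode: str, size: str = "0.6B") -> str:
    """Return the model ID for a mode and size, or raise if the table has no such row."""
    try:
        return MODEL_IDS[(size, mode)]
    except KeyError:
        raise ValueError(f"No {size} model listed for mode {mode!r}") from None
```

Defaulting to 0.6B mirrors the advice to start small; asking for VoiceDesign without `size="1.7B"` raises an error because that mode is listed only at 1.7B.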
- Google account (for Google Colab and Google Drive)
- For voice cloning: 3–20 seconds of clean reference audio in WAV
- Best results: clear speech, minimal background noise
- Mix of scripted and natural speaking recommended
- Google Colab with T4 GPU runtime (available with free plan but subject to usage limits)
- No Python installation needed locally (runs in Colab)
- .wav or .mp3 sample audio file uploaded to your Google Drive
- 3-20 seconds in length
- 16-bit or 24-bit, 44.1kHz or 48kHz sample rate recommended
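Before uploading, it can help to confirm that a clip meets the length requirement listed above. The sketch below uses Python's standard `wave` module, which reads only uncompressed PCM WAV files, so it does not cover .mp3 input. The function name `check_reference_wav` is a hypothetical helper, not part of the notebook.

```python
import wave


def check_reference_wav(path, min_seconds=3.0, max_seconds=20.0):
    """Return (ok, duration_seconds, bits_per_sample) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        bits = wav_file.getsampwidth() * 8  # sample width is reported in bytes
    duration = frames / float(rate)
    ok = min_seconds <= duration <= max_seconds
    return ok, duration, bits
```

For example, `check_reference_wav("your_reference.wav")` returns a tuple whose first element is `False` when the clip is shorter than 3 seconds or longer than 20 seconds.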
macOS (built-in tool):
# afconvert comes pre-installed on macOS
afconvert -f WAVE -d LEI16 input.m4a output.wav
macOS/Linux/Windows (use ffmpeg):
Install ffmpeg first:
# macOS with MacPorts
sudo port install ffmpeg
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Windows (with Chocolatey)
choco install ffmpeg
Convert audio:
ffmpeg -i input.m4a -ar 24000 output.wav
Supported input formats: .m4a, .mp3, .mp4, .mov, and most audio/video formats.
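If you prefer to run the conversion from Python (for instance inside a notebook cell), the same ffmpeg command can be built as an argument list. This is a sketch that assumes ffmpeg is on the PATH. `build_ffmpeg_command` and `convert_to_wav` are hypothetical helper names, and the `-y` flag (overwrite the output without prompting) is an addition not present in the command above.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=24000):
    """Build the ffmpeg argument list that converts src to dst at sample_rate Hz."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate), dst]


def convert_to_wav(src, dst, sample_rate=24000):
    """Run ffmpeg; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.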
This repository was created as a companion to the YouTube video covering:
- Qwen3-TTS setup with Google Colab
- Enable GPU: Runtime → Change runtime type → GPU (T4)
- Preset/CustomVoice
  - Run Cells 1–3 in order (environment setup)
  - Run Cells 4–8 for preset voices
- Voice Cloning
  - Run Cells 1–3 (environment setup)
  - Skip Cells 4–8
  - Run Cells 9–13 for voice cloning
  - Upload a short reference audio clip (3–20 seconds) for voice cloning
  - Enter the transcription of your clip and the text you want generated
  - Generate and download your cloned voice!
- VoiceDesign
  - Run Cells 1–3 (environment setup)
  - Skip Cells 4–13
  - Run Cells 14–16 for voice design
Important: Only one model (CustomVoice, Base, or VoiceDesign) should be loaded per session. Restart the Colab runtime when switching model families to free GPU memory.
In Google Colab:
- Runtime → Change runtime type
- Set Hardware accelerator to GPU (T4)
Then verify CUDA is available:
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
Install the system packages required for audio loading, conversion, and saving:
!apt-get update -qq
!apt-get install -y -qq ffmpeg sox libsndfile1
Install the official Python package. Model weights are downloaded automatically on first use.
!pip install -U qwen-tts
Start with the 0.6B CustomVoice model for faster iteration. You can switch to 1.7B later for higher quality.
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)
print("Model loaded")
print("Supported speakers:", model.get_supported_speakers())
print("Supported languages:", model.get_supported_languages())
Note: You'll see two warnings after this cell runs: flash-attn not installed and HF_TOKEN not set. Both are expected on Colab's free T4 GPU and can be ignored. FlashAttention 2 is recommended by upstream but requires an Ampere-class or newer GPU (for example, an A100), and Colab's T4 is an older Turing-generation card. Qwen3-TTS falls back to standard PyTorch attention and runs correctly.
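To see why the flash-attn warning is expected, you can compare the GPU's CUDA compute capability against the Ampere threshold (8.0). The sketch below is illustrative only: `supports_flash_attention_2` and `flash_attn_installed` are hypothetical helpers, and the threshold reflects FlashAttention 2's documented hardware requirement rather than anything checked by the notebook itself.

```python
import importlib.util


def supports_flash_attention_2(compute_capability):
    """Return True if a (major, minor) CUDA compute capability is Ampere (8.0) or newer."""
    return tuple(compute_capability) >= (8, 0)


def flash_attn_installed():
    """Return True if the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


# Reference values: a T4 reports (7, 5) (Turing); an A100 reports (8, 0) (Ampere).
```

On a Colab GPU runtime, `torch.cuda.get_device_capability(0)` returns the tuple to pass to `supports_flash_attention_2`.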
Edit the text and speaker as desired.
from IPython.display import Audio, display
text = "This is a test of Qwen3-TTS running on Google Colab."
speaker = model.get_supported_speakers()[0]
wavs, sr = model.generate_custom_voice(
    text=text,
    language="English",
    speaker=speaker,
)
output_path = "qwen3_tts_test.wav"
sf.write(output_path, wavs[0], sr)
display(Audio(output_path))
print(f"Saved: {output_path}")
from google.colab import files
files.download("qwen3_tts_test.wav")
Generate long scripts in segments for better control and stability.
segments = [
    "This is the first segment of narration.",
    "This is the second segment, generated separately.",
    "This approach works well for long-form content.",
]
for i, segment in enumerate(segments):
    wavs, sr = model.generate_custom_voice(
        text=segment,
        language="English",
        speaker=speaker,
    )
    path = f"segment_{i:02d}.wav"
    sf.write(path, wavs[0], sr)
    print(f"Saved: {path}")
(Repeat for any batch segments you want to download.)
from google.colab import files
# Download every segment generated above
for i in range(len(segments)):
    files.download(f"segment_{i:02d}.wav")
Once the pipeline works, switch to the 1.7B models by changing the model name:
"Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
Generation will be slower but more natural and closer to real speech.
Upload your reference audio to Drive first, then mount it here.
from google.colab import drive
drive.mount('/content/drive')
This step swaps the CustomVoice model for Base. If you already ran Cells 4–8, restart the runtime to free VRAM, then re-run Cells 1–3 before continuing.
Choose one of the following Base models:
Fastest to load and useful for pipeline testing and short experiments. Cloning quality is functional but may lack speaker nuance compared to larger models (though results may improve significantly with clean, higher-quality reference samples and slightly longer durations).
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)
print("Base model (0.6B) loaded, ready for voice cloning")
Higher-fidelity voice cloning with noticeably improved tone, cadence, and speaker identity. Larger download (≈4.5 GB), but strongly recommended for best overall results.
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)
print("Base model (1.7B) loaded, ready for voice cloning")
Edit the three variables: your audio path, its transcription, and the text you want generated.
from IPython.display import Audio, display
# --- Edit these ---
ref_audio_path = "/content/drive/MyDrive/your_reference.wav"
ref_text = "This is exactly what was said in the reference clip."
target_text = "This is the new text you want generated in your voice."
# -----------------
wavs, sr = model.generate_voice_clone(
    text=target_text,
    language="English",
    ref_audio=ref_audio_path,
    ref_text=ref_text,
)
output_path = "cloned_voice.wav"
sf.write(output_path, wavs[0], sr)
display(Audio(output_path))
print(f"Saved: {output_path}")
For multiple segments, cache the speaker embedding first. This avoids re-extracting it from your reference audio on every iteration, which can be a significant speedup for longer scripts.
# Cache the speaker embedding once
prompt = model.create_voice_clone_prompt(
    ref_audio=ref_audio_path,
    ref_text=ref_text,
)
segments = [
    "This is the first segment of your narration.",
    "This is the second segment, same voice, no re-upload.",
    "Batch cloning keeps things consistent across the whole script.",
]
for i, segment in enumerate(segments):
    wavs, sr = model.generate_voice_clone(
        text=segment,
        language="English",
        voice_clone_prompt=prompt,
    )
    path = f"clone_segment_{i:02d}.wav"
    sf.write(path, wavs[0], sr)
    print(f"Saved: {path}")
from google.colab import files
files.download("cloned_voice.wav")
(Repeat for any batch segments.)
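Instead of downloading each segment separately, the batch output could be joined into a single file first. The sketch below uses only the standard `wave` module and assumes the segments are PCM WAV files with the same channel count, sample width, and sample rate. To our understanding that is what `sf.write` produces by default for .wav paths, but treat it as an assumption. `concatenate_wavs` is a hypothetical helper name.

```python
import wave


def concatenate_wavs(paths, output_path):
    """Join same-format PCM WAV files into output_path; return the number of frames written."""
    if not paths:
        raise ValueError("No input files given")
    total_frames = 0
    expected = None  # (channels, sample width in bytes, frame rate) of the first file
    with wave.open(output_path, "wb") as out:
        for path in paths:
            with wave.open(path, "rb") as segment:
                fmt = (segment.getnchannels(), segment.getsampwidth(), segment.getframerate())
                if expected is None:
                    expected = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError(f"{path} has a different audio format: {fmt} != {expected}")
                out.writeframes(segment.readframes(segment.getnframes()))
                total_frames += segment.getnframes()
    return total_frames
```

With the batch loop above, a call such as `concatenate_wavs([f"clone_segment_{i:02d}.wav" for i in range(len(segments))], "clone_full.wav")` would produce one file to download; the output name is illustrative.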
If you can't transcribe the reference clip, this mode still works but quality may drop:
wavs, sr = model.generate_voice_clone(
    text=target_text,
    language="English",
    ref_audio=ref_audio_path,
    x_vector_only_mode=True,
)
VoiceDesign generates a new synthetic speaker from a text description (no reference audio needed).
Requires the 1.7B VoiceDesign model.
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)
print("VoiceDesign model loaded, ready to generate new speakers")
from IPython.display import Audio, display
# --- Edit these ---
voice_description = (
"Warm and conversational adult voice, calm pacing, lightly textured tone, "
"clear articulation, friendly but not overly energetic."
)
target_text = "This is a VoiceDesign test. The voice you hear was generated from a text description."
instruct = "Please speak naturally and clearly."
# -----------------
wavs, sr = model.generate_voice_design(
    text=target_text,
    language="English",
    voice_description=voice_description,
    instruct=instruct,
)
output_path = "voicedesign.wav"
sf.write(output_path, wavs[0], sr)
display(Audio(output_path))
print(f"Saved: {output_path}")
from google.colab import files
files.download("voicedesign.wav")
- GPU recommended: Best results are achieved on Colab’s T4-class GPUs.
- Preset voices: Preset speakers are strongest in their native accents/languages—try a few for best English results. 
- Long narration: Generate long scripts in segments (e.g., 30–90 seconds) for smoother iteration.
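One way to produce segments of roughly that length is to group whole sentences up to a word budget. The sketch below assumes a speaking rate of about 150 words per minute, so 150 words is roughly 60 seconds; that rate is a rough assumption and not a measured property of Qwen3-TTS. `split_script` is a hypothetical helper, and its sentence splitting is a simple punctuation-based heuristic.

```python
import re


def split_script(script, max_words=150):
    """Group sentences into segments of at most max_words words.

    A single sentence longer than max_words is kept whole rather than cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]
    segments, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Start a new segment when adding this sentence would exceed the budget.
        if current and count + words > max_words:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        segments.append(" ".join(current))
    return segments
```

The returned list can be used directly as the `segments` list in the batch generation cells.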
Note: Our local development platform has been upgraded to Apple Silicon, but MPS support is experimental, so this guide focuses on Colab for consistency.
Recommended for best results while recording audio samples:
- USB audio interface (we used an Arturia MiniFuse 2)
- Condenser or shotgun microphone (we used an Audio-Technica AT875R)
- Quiet recording environment
Acceptable minimum:
- Smartphone (e.g., iPhone 8+) in a quiet room
- USB microphone with cardioid pattern
- Desktop/Laptop built-in mic in very quiet environment (quality will be lower)
Background noise:
- More important than mic quality. Record in a quiet space.
This workflow uses PyTorch for model loading and inference. When running on Google Colab, PyTorch automatically detects and uses the available GPU runtime.
No manual CUDA configuration or version pinning is required for the steps described in this guide. Model weights and dependencies are resolved automatically by the qwen-tts package and the Colab environment.
If a GPU runtime is unavailable, inference may fall back to CPU, but GPU-backed runtimes are recommended for practical voice cloning workflows.
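A notebook cell can make the GPU check explicit before loading a model. This is a minimal sketch: `pick_device` is a hypothetical helper, and whether the resulting string can be passed as `device_map` for CPU-only loading has not been verified in this guide.

```python
import importlib.util


def pick_device():
    """Return "cuda:0" when PyTorch can see a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so there is no GPU to probe
    import torch

    return "cuda:0" if torch.cuda.is_available() else "cpu"
```

If this returns "cpu" on Colab, the runtime type is probably not set to GPU (T4).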
This repository's code and documentation: Apache 2.0
This project builds upon:
- Qwen3-TTS - Open-source models and framework
- Google Colab - Free GPU infrastructure
- PyTorch - Deep learning framework
We're grateful to the open-source community for making voice cloning accessible to all creators.
Colab Free Plan Limitations:
- In the free version of Colab, notebooks can run for at most 12 hours, depending on availability and usage patterns.
- Colab Pro and Pay As You Go offer increased compute availability based on your compute unit balance.
- If a GPU is unavailable, wait 12+ hours or consider Colab Pro ($9.99/month) for increased access.
Questions? Check the video tutorial or open an issue!
