High-quality text-to-speech for Manim animations using Qwen3-TTS
A manim-voiceover plugin that integrates Alibaba's state-of-the-art Qwen3-TTS models, bringing natural-sounding voiceovers to your mathematical animations.
| Feature | Description |
|---|---|
| Voice Cloning | Clone any voice from a 3+ second audio sample |
| Voice Design | Create custom voices from natural language descriptions |
| Preset Voices | 9 premium built-in voices with emotion/style control |
| Multi-language | Support for 10 languages including English, Chinese, Japanese, Korean |
| Caching | Automatic audio caching for fast re-renders |
| Multiple Characters | Easy voice switching for dialogue scenes |
If you already have manim and manim-voiceover installed:
pip install manim-voiceover-qwen3-tts

Set up a complete environment from scratch using UV:
# Install UV if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install system dependencies for manim (Ubuntu/Debian)
sudo apt-get install -y libcairo2-dev libpango1.0-dev ffmpeg libsndfile1
# Create project directory
mkdir my-manim-project && cd my-manim-project
# Initialize UV project
uv init
# Add numba first (ensures correct dependency resolution)
uv add numba
# Add manim packages
uv add manim manim-voiceover manim-voiceover-qwen3-tts
# Copy the Quick Start example from the README (Option 1: Preset Voices section below)
# and save it as scene.py, then run:
uv run manim -pql scene.py QuickStart

To install from source (for development):

git clone https://github.com/DurhamSmith/manim-voiceover-qwen3-tts.git
cd manim-voiceover-qwen3-tts
pip install -e .

For faster inference (requires compatible GPU):

pip install flash-attn --no-build-isolation

Requirements:

- Python 3.10+
- CUDA-capable GPU (recommended, ~4GB VRAM for 1.7B models; see the quick check after this list)
- manim >= 0.18.0
- manim-voiceover >= 0.3.0
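To confirm the GPU requirement is met before the first render, a quick sanity check with PyTorch (which the underlying Qwen3-TTS models require) looks roughly like this:

```python
# Optional sanity check: verify PyTorch can see a CUDA device and report its VRAM.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print('No CUDA device found; pass device="cpu" to the services (much slower).')
```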
Manim requires some system libraries. On Ubuntu/Debian:
sudo apt-get install -y libcairo2-dev libpango1.0-dev ffmpeg libsndfile1

On macOS:

brew install cairo pango ffmpeg libsndfile

Use Qwen3's built-in premium voices - no setup required:
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3PresetVoiceService
class QuickStart(VoiceoverScene):
    def construct(self):
        self.set_speech_service(
            Qwen3PresetVoiceService(
                speaker="Ryan",
                language="English",
                use_flash_attention=False,  # Set True if flash-attn installed
            )
        )

        circle = Circle(color=BLUE)
        with self.voiceover(text="Let's draw a circle!") as tracker:
            self.play(Create(circle), run_time=tracker.duration)

Available Preset Speakers:
| Language | Speakers |
|---|---|
| English | Ryan, Aiden |
| Chinese | Vivian, Serena, Uncle_Fu, Dylan, Eric |
| Japanese | Ono_Anna |
| Korean | Sohee |
Create any voice by describing it in natural language:
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3VoiceDesignService
class VoiceDesignDemo(VoiceoverScene):
    def construct(self):
        self.set_speech_service(
            Qwen3VoiceDesignService(
                voice_description="A warm, friendly female voice with a slight "
                "British accent, speaking clearly and professionally.",
                language="English",
                use_flash_attention=False,  # Set True if flash-attn installed
            )
        )

        title = Text("Welcome!")
        with self.voiceover(text="Welcome to our tutorial!") as tracker:
            self.play(Write(title), run_time=tracker.duration)

Clone any voice from a short audio sample (3+ seconds):
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3VoiceCloningService, VoiceProfile
# Define a voice profile
narrator = VoiceProfile(
    name="narrator",
    ref_audio="voices/narrator_sample.wav",  # Your audio file
    ref_text="This is a sample of the narrator speaking clearly.",  # Transcript
    language="English",
)

class VoiceCloneDemo(VoiceoverScene):
    def construct(self):
        self.set_speech_service(
            Qwen3VoiceCloningService(
                voices=[narrator],
                default_voice="narrator",
                use_flash_attention=False,  # Set True if flash-attn installed
            )
        )

        with self.voiceover(text="Hello! My voice was cloned from a short sample.") as tracker:
            self.wait(tracker.duration)

Perfect for educational videos with multiple speakers:
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3VoiceCloningService, VoiceProfile
# Define character voices
alice = VoiceProfile(
    name="alice",
    ref_audio="voices/alice.wav",
    ref_text="Hi, I'm Alice and I love explaining math concepts!",
)
bob = VoiceProfile(
    name="bob",
    ref_audio="voices/bob.wav",
    ref_text="Hey there, I'm Bob. Let me ask you a question.",
)

class DialogueScene(VoiceoverScene):
    def construct(self):
        self.set_speech_service(Qwen3VoiceCloningService(
            voices=[alice, bob],
            use_flash_attention=False,  # Set True if flash-attn installed
        ))

        # Visual setup
        alice_label = Text("Alice", color=BLUE).to_edge(LEFT)
        bob_label = Text("Bob", color=RED).to_edge(RIGHT)
        self.add(alice_label, bob_label)

        # Dialogue
        with self.voiceover(text="Hi Bob! Want to learn about vectors?", voice="alice"):
            self.play(Indicate(alice_label))

        with self.voiceover(text="Sure Alice! That sounds interesting.", voice="bob"):
            self.play(Indicate(bob_label))

        with self.voiceover(text="Great! A vector has both magnitude and direction.", voice="alice"):
            arrow = Arrow(LEFT, RIGHT, color=YELLOW)
            self.play(Create(arrow))

Use Qwen3's premium preset voices with optional emotion/style control.
Qwen3PresetVoiceService(
    speaker="Ryan",                    # Preset speaker name
    language="English",                # Language for synthesis
    instruct="Speak with enthusiasm",  # Optional: style instruction
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",  # Model ID
    device="cuda:0",                   # Device (cuda:0, cpu)
    dtype="bfloat16",                  # Weight dtype
    use_flash_attention=True,          # Use FlashAttention 2
    output_format="mp3",               # Output format (mp3/wav)
)

Create custom voices from natural language descriptions.
Qwen3VoiceDesignService(
    voice_description="Description of desired voice characteristics",
    language="English",
    model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device="cuda:0",
    dtype="bfloat16",
    use_flash_attention=True,
    output_format="mp3",
)

Clone voices from reference audio samples.
Qwen3VoiceCloningService(
    voices=[voice_profile1, voice_profile2],  # List of VoiceProfile objects
    default_voice="narrator",                 # Default voice name
    model="Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device="cuda:0",
    dtype="bfloat16",
    use_flash_attention=True,
    output_format="mp3",
)

Define a voice for cloning.
VoiceProfile(
    name="character_name",          # Unique identifier for this voice
    ref_audio="path/to/audio.wav",  # Reference audio file (3+ seconds)
    ref_text="Transcript of audio", # Exact transcript of the reference audio
    language="Auto",                # Language ("Auto" for auto-detection)
)

Override any setting for individual voiceover calls:
# Override speaker
with self.voiceover(text="Hello!", speaker="Aiden") as tracker:
    ...

# Override voice (for cloning service)
with self.voiceover(text="Hello!", voice="bob") as tracker:
    ...

# Override style instruction
with self.voiceover(text="Wow!", instruct="Speak with excitement") as tracker:
    ...

# Override language
with self.voiceover(text="Bonjour!", language="French") as tracker:
    ...

| Model | Parameters | Use Case | VRAM |
|---|---|---|---|
| `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | 1.7B | Preset voices | ~4GB |
| `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | 1.7B | Voice design | ~4GB |
| `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | 1.7B | Voice cloning | ~4GB |
| `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | 0.6B | Lightweight preset | ~2GB |
| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | 0.6B | Lightweight cloning | ~2GB |
All services support 10 languages (a sketch of switching languages within one scene follows the list):
- English
- Chinese (Mandarin)
- Japanese
- Korean
- German
- French
- Russian
- Portuguese
- Spanish
- Italian
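The `language` argument works both at service construction and per voiceover call (see the overrides above). As a rough sketch, one preset-voice service can narrate in more than one language within a single scene; whether a given speaker sounds natural in every language is something to test for your use case:

```python
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3PresetVoiceService

class MultiLanguageDemo(VoiceoverScene):
    def construct(self):
        # Default language set on the service
        self.set_speech_service(
            Qwen3PresetVoiceService(speaker="Ryan", language="English")
        )

        square = Square(color=GREEN)
        with self.voiceover(text="Here is a square.") as tracker:
            self.play(Create(square), run_time=tracker.duration)

        # Per-call language override
        with self.voiceover(text="Voici un carré.", language="French") as tracker:
            self.play(square.animate.shift(RIGHT), run_time=tracker.duration)
```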
Significantly faster inference on compatible GPUs:
pip install flash-attn --no-build-isolation

For faster generation with acceptable quality:
Qwen3PresetVoiceService(
    model="Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ...
)

manim-voiceover automatically caches generated audio. Re-renders with unchanged text are instant.
For voice cloning, the service automatically caches voice prompts. The first generation with a new voice takes longer, but subsequent uses are fast.
If you run out of GPU memory, there are two options.

Option 1: Use a smaller model:

Qwen3PresetVoiceService(
    model="Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
)

Option 2: Run on CPU (slower):

Qwen3PresetVoiceService(
    device="cpu",
    dtype="float32",
    use_flash_attention=False,
)

If flash-attn is not installed or fails to load, disable FlashAttention explicitly:

Qwen3PresetVoiceService(
    use_flash_attention=False,
)

- Ensure reference audio is at least 3 seconds long for voice cloning
- Use high-quality reference audio (clear speech, minimal background noise)
- Verify the transcript exactly matches the reference audio
Models are downloaded from HuggingFace on first use. Ensure you have:
- Stable internet connection
- Sufficient disk space (~7GB for 1.7B models)
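If you'd rather fetch the weights ahead of time (for example on a machine that will later render offline), the standard `huggingface_hub` client can pre-populate the local cache; a minimal sketch:

```python
# Pre-download a model into the local Hugging Face cache so the first render
# doesn't block on the network. Pick the repo ID matching the service you use.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
```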
You may see deprecation warnings when running. These come from upstream dependencies, not this package:
UserWarning: pkg_resources is deprecated as an API...
FutureWarning: librosa.core.audio.__audioread_load Deprecated...
UserWarning: PySoundFile failed. Trying audioread instead.
| Warning | Source | Status |
|---|---|---|
| `pkg_resources` deprecated | manim-voiceover | Upstream issue - awaiting fix |
| `librosa.__audioread_load` deprecated | qwen-tts → librosa | Upstream issue - awaiting fix |
| PySoundFile failed | qwen-tts → librosa | Install libsndfile (see below) |
To reduce warnings:

- Install the system audio library:

      # Ubuntu/Debian
      sudo apt-get install libsndfile1
      # macOS
      brew install libsndfile

- Suppress warnings in your script (optional):

      import warnings
      warnings.filterwarnings("ignore", category=DeprecationWarning)
      warnings.filterwarnings("ignore", category=FutureWarning)
These warnings don't affect functionality - your videos will render correctly.
- Duration: 3-10 seconds is ideal
- Quality: Clear audio without background noise
- Content: Natural speech, not whispered or shouted
- Format: WAV or MP3 supported
The transcript must exactly match what's said in the reference audio. This helps the model understand the voice characteristics.
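As a quick sanity check, you can verify a reference clip against these guidelines before building a `VoiceProfile`; a minimal sketch using the `soundfile` package (an assumption here - any audio library that exposes duration metadata works):

```python
# Check that a reference clip is long enough before using it for cloning.
import soundfile as sf

info = sf.info("voices/narrator_sample.wav")  # path from the cloning example above
print(f"duration: {info.duration:.1f} s, sample rate: {info.samplerate} Hz")
assert info.duration >= 3.0, "Reference audio should be at least 3 seconds long"
```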
For projects with multiple characters, organize your voices:
project/
├── voices/
│   ├── narrator/
│   │   ├── sample.wav
│   │   └── metadata.json
│   ├── teacher/
│   │   ├── sample.wav
│   │   └── metadata.json
│   └── student/
│       ├── sample.wav
│       └── metadata.json
├── scenes/
│   └── my_scene.py
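One way to wire this layout up is to keep each character's transcript in its `metadata.json` and build the profiles programmatically. The JSON keys below (`ref_text`, `language`) are an assumption for this sketch, not something the package prescribes:

```python
# Build VoiceProfile objects from the voices/ layout shown above.
import json
from pathlib import Path

from manim_voiceover_qwen3_tts import VoiceProfile

def load_voices(voices_dir: str = "voices") -> list[VoiceProfile]:
    profiles = []
    for character_dir in sorted(Path(voices_dir).iterdir()):
        if not character_dir.is_dir():
            continue
        meta = json.loads((character_dir / "metadata.json").read_text())
        profiles.append(
            VoiceProfile(
                name=character_dir.name,                 # e.g. "narrator", "teacher"
                ref_audio=str(character_dir / "sample.wav"),
                ref_text=meta["ref_text"],               # assumed metadata.json key
                language=meta.get("language", "Auto"),   # assumed metadata.json key
            )
        )
    return profiles
```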
See the examples/ directory for complete working examples:
| Example | Service | Description |
|---|---|---|
| `preset_voices.py` | `Qwen3PresetVoiceService` | Preset speakers + languages: switches between built-in voices (Ryan, Vivian, Ono_Anna, Sohee) across English, Chinese, Japanese, Korean |
| `emotion_showcase.py` | `Qwen3PresetVoiceService` | One voice, many emotions: same speaker (Ryan), varying `instruct` per line (happy, sad, angry, excited, calm, etc.) |
| `voice_design.py` | `Qwen3VoiceDesignService` | Many designed voices: same content delivered by 4 different voices created from text descriptions |
| `storytelling_scene.py` | `Qwen3VoiceDesignService` | Multi-character story: narrator, hero, mentor, villain each with unique designed voices |
| `voice_cloning.py` | `Qwen3VoiceCloningService` | Clone from audio: clone voices from reference .wav files, switch between multiple cloned voices |

Note: `voice_cloning.py` includes a sample narrator voice (`voices/narrator.wav`). To add your own voices, create additional `VoiceProfile` entries with:

- `ref_audio`: path to your .wav file (3+ seconds of clear speech)
- `ref_text`: exact transcript of what's spoken in the audio
# Preset voices (built-in speakers, multiple languages)
manim -pql examples/preset_voices.py PresetVoicesDemo
# Emotion control (same voice, different emotions via instruct)
manim -pql examples/emotion_showcase.py EmotionShowcase
# Voice design (create voices from descriptions)
manim -pql examples/voice_design.py VoiceDesignDemo
# Storytelling (multi-character with designed voices)
manim -pql examples/storytelling_scene.py StorytellingScene
# Voice cloning (requires your own .wav files)
manim -pql examples/voice_cloning.py VoiceCloningDemo

Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Qwen3-TTS - The underlying TTS model by Alibaba
- manim-voiceover - The voiceover framework this plugin extends
- Manim Community - The amazing animation library
If you use this project in your research or videos, please consider citing:
@software{manim_voiceover_qwen3_tts,
title = {manim-voiceover-qwen3-tts: Qwen3-TTS Integration for Manim},
url = {https://github.com/DurhamSmith/manim-voiceover-qwen3-tts},
year = {2026}
}