High-quality text-to-speech for Manim animations using Qwen3-TTS
A manim-voiceover plugin that integrates Alibaba's state-of-the-art Qwen3-TTS models, bringing natural-sounding voiceovers to your mathematical animations.
| Feature | Description |
|---|---|
| Voice Cloning | Clone any voice from a 3+ second audio sample |
| Voice Design | Create custom voices from natural language descriptions |
| Preset Voices | 9 premium built-in voices with emotion/style control |
| Multi-language | Support for 10 languages including English, Chinese, Japanese, Korean |
| Caching | Automatic audio caching for fast re-renders |
| Multiple Characters | Easy voice switching for dialogue scenes |
If you already have manim and manim-voiceover installed:
pip install manim-voiceover-qwen3-tts

Set up a complete environment from scratch using UV:
# Install UV if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install system dependencies for manim (Ubuntu/Debian)
sudo apt-get install -y libcairo2-dev libpango1.0-dev ffmpeg libsndfile1
# Create project directory
mkdir my-manim-project && cd my-manim-project
# Initialize UV project
uv init
# Add numba first (ensures correct dependency resolution)
uv add numba
# Add manim packages
uv add manim manim-voiceover manim-voiceover-qwen3-tts
# Copy the Quick Start example from the README (Option 1: Preset Voices section below)
# and save it as scene.py, then run:
uv run manim -pql scene.py QuickStart

To install from source (for development):

git clone https://github.com/DurhamSmith/manim-voiceover-qwen3-tts.git
cd manim-voiceover-qwen3-tts
pip install -e .

For faster inference (requires compatible GPU):

pip install flash-attn --no-build-isolation

Requirements:

- Python 3.10+
- CUDA-capable GPU (recommended, ~4GB VRAM for 1.7B models; see the quick check after this list)
- manim >= 0.18.0
- manim-voiceover >= 0.3.0
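To confirm the GPU requirement is met before the first render, a quick sanity check with PyTorch (which the underlying Qwen3-TTS models require) looks roughly like this:

```python
# Optional sanity check: verify PyTorch can see a CUDA device and report its VRAM.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print('No CUDA device found; pass device="cpu" to the services (much slower).')
```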
Manim requires some system libraries. On Ubuntu/Debian:
sudo apt-get install -y libcairo2-dev libpango1.0-dev ffmpeg libsndfile1

On macOS:

brew install cairo pango ffmpeg libsndfile

Use Qwen3's built-in premium voices - no setup required:
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3PresetVoiceService
class QuickStart(VoiceoverScene):
    def construct(self):
        self.set_speech_service(
            Qwen3PresetVoiceService(
                speaker="Ryan",
                language="English",
                use_flash_attention=False,  # Set True if flash-attn installed
            )
        )

        circle = Circle(color=BLUE)
        with self.voiceover(text="Let's draw a circle!") as tracker:
            self.play(Create(circle), run_time=tracker.duration)

Available Preset Speakers:
| Language | Speakers |
|---|---|
| English | Ryan, Aiden |
| Chinese | Vivian, Serena, Uncle_Fu, Dylan, Eric |
| Japanese | Ono_Anna |
| Korean | Sohee |
Create any voice by describing it in natural language:
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3VoiceDesignService
class VoiceDesignDemo(VoiceoverScene):
    def construct(self):
        self.set_speech_service(
            Qwen3VoiceDesignService(
                voice_description="A warm, friendly female voice with a slight "
                "British accent, speaking clearly and professionally.",
                language="English",
                use_flash_attention=False,  # Set True if flash-attn installed
            )
        )

        title = Text("Welcome!")
        with self.voiceover(text="Welcome to our tutorial!") as tracker:
            self.play(Write(title), run_time=tracker.duration)

Clone any voice from a short audio sample (3+ seconds):
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3VoiceCloningService, VoiceProfile
# Define a voice profile
narrator = VoiceProfile(
    name="narrator",
    ref_audio="voices/narrator_sample.wav",  # Your audio file
    ref_text="This is a sample of the narrator speaking clearly.",  # Transcript
    language="English",
)

class VoiceCloneDemo(VoiceoverScene):
    def construct(self):
        self.set_speech_service(
            Qwen3VoiceCloningService(
                voices=[narrator],
                default_voice="narrator",
                use_flash_attention=False,  # Set True if flash-attn installed
            )
        )

        with self.voiceover(text="Hello! My voice was cloned from a short sample.") as tracker:
            self.wait(tracker.duration)

Perfect for educational videos with multiple speakers:
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3VoiceCloningService, VoiceProfile
# Define character voices
alice = VoiceProfile(
    name="alice",
    ref_audio="voices/alice.wav",
    ref_text="Hi, I'm Alice and I love explaining math concepts!",
)
bob = VoiceProfile(
    name="bob",
    ref_audio="voices/bob.wav",
    ref_text="Hey there, I'm Bob. Let me ask you a question.",
)

class DialogueScene(VoiceoverScene):
    def construct(self):
        self.set_speech_service(Qwen3VoiceCloningService(
            voices=[alice, bob],
            use_flash_attention=False,  # Set True if flash-attn installed
        ))

        # Visual setup
        alice_label = Text("Alice", color=BLUE).to_edge(LEFT)
        bob_label = Text("Bob", color=RED).to_edge(RIGHT)
        self.add(alice_label, bob_label)

        # Dialogue
        with self.voiceover(text="Hi Bob! Want to learn about vectors?", voice="alice"):
            self.play(Indicate(alice_label))

        with self.voiceover(text="Sure Alice! That sounds interesting.", voice="bob"):
            self.play(Indicate(bob_label))

        with self.voiceover(text="Great! A vector has both magnitude and direction.", voice="alice"):
            arrow = Arrow(LEFT, RIGHT, color=YELLOW)
            self.play(Create(arrow))

Use Qwen3's premium preset voices with optional emotion/style control.
Qwen3PresetVoiceService(
    speaker="Ryan",                    # Preset speaker name
    language="English",                # Language for synthesis
    instruct="Speak with enthusiasm",  # Optional: style instruction
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",  # Model ID
    device="cuda:0",                   # Device (cuda:0, cpu)
    dtype="bfloat16",                  # Weight dtype
    use_flash_attention=True,          # Use FlashAttention 2
    output_format="mp3",               # Output format (mp3/wav)
)

Create custom voices from natural language descriptions.
Qwen3VoiceDesignService(
    voice_description="Description of desired voice characteristics",
    language="English",
    model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device="cuda:0",
    dtype="bfloat16",
    use_flash_attention=True,
    output_format="mp3",
)

Clone voices from reference audio samples.
Qwen3VoiceCloningService(
    voices=[voice_profile1, voice_profile2],  # List of VoiceProfile objects
    default_voice="narrator",                 # Default voice name
    model="Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device="cuda:0",
    dtype="bfloat16",
    use_flash_attention=True,
    output_format="mp3",
)

Define a voice for cloning.
VoiceProfile(
    name="character_name",          # Unique identifier for this voice
    ref_audio="path/to/audio.wav",  # Reference audio file (3+ seconds)
    ref_text="Transcript of audio", # Exact transcript of the reference audio
    language="Auto",                # Language ("Auto" for auto-detection)
)

Override any setting for individual voiceover calls:
# Override speaker
with self.voiceover(text="Hello!", speaker="Aiden") as tracker:
    ...

# Override voice (for cloning service)
with self.voiceover(text="Hello!", voice="bob") as tracker:
    ...

# Override style instruction
with self.voiceover(text="Wow!", instruct="Speak with excitement") as tracker:
    ...

# Override language
with self.voiceover(text="Bonjour!", language="French") as tracker:
    ...

| Model | Parameters | Use Case | VRAM |
|---|---|---|---|
| `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | 1.7B | Preset voices | ~4GB |
| `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | 1.7B | Voice design | ~4GB |
| `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | 1.7B | Voice cloning | ~4GB |
| `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | 0.6B | Lightweight preset | ~2GB |
| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | 0.6B | Lightweight cloning | ~2GB |
All services support 10 languages (a sketch of switching languages within one scene follows the list):
- English
- Chinese (Mandarin)
- Japanese
- Korean
- German
- French
- Russian
- Portuguese
- Spanish
- Italian
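The `language` argument works both at service construction and per voiceover call (see the overrides above). As a rough sketch, one preset-voice service can narrate in more than one language within a single scene; whether a given speaker sounds natural in every language is something to test for your use case:

```python
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover_qwen3_tts import Qwen3PresetVoiceService

class MultiLanguageDemo(VoiceoverScene):
    def construct(self):
        # Default language set on the service
        self.set_speech_service(
            Qwen3PresetVoiceService(speaker="Ryan", language="English")
        )

        square = Square(color=GREEN)
        with self.voiceover(text="Here is a square.") as tracker:
            self.play(Create(square), run_time=tracker.duration)

        # Per-call language override
        with self.voiceover(text="Voici un carré.", language="French") as tracker:
            self.play(square.animate.shift(RIGHT), run_time=tracker.duration)
```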
Significantly faster inference on compatible GPUs:
pip install flash-attn --no-build-isolation

For faster generation with acceptable quality:
Qwen3PresetVoiceService(
    model="Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ...
)

manim-voiceover automatically caches generated audio. Re-renders with unchanged text are instant.
For voice cloning, the service automatically caches voice prompts. The first generation with a new voice takes longer, but subsequent uses are fast.
If you run out of GPU memory, there are two options.

Option 1: Use a smaller model:

Qwen3PresetVoiceService(
    model="Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
)

Option 2: Run on CPU (slower):

Qwen3PresetVoiceService(
    device="cpu",
    dtype="float32",
    use_flash_attention=False,
)

If flash-attn is not installed or fails to load, disable FlashAttention explicitly:

Qwen3PresetVoiceService(
    use_flash_attention=False,
)

- Ensure reference audio is at least 3 seconds long for voice cloning
- Use high-quality reference audio (clear speech, minimal background noise)
- Verify the transcript exactly matches the reference audio
Models are downloaded from HuggingFace on first use. Ensure you have:
- Stable internet connection
- Sufficient disk space (~7GB for 1.7B models)
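If you'd rather fetch the weights ahead of time (for example on a machine that will later render offline), the standard `huggingface_hub` client can pre-populate the local cache; a minimal sketch:

```python
# Pre-download a model into the local Hugging Face cache so the first render
# doesn't block on the network. Pick the repo ID matching the service you use.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
```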
You may see deprecation warnings when running. These come from upstream dependencies, not this package:
UserWarning: pkg_resources is deprecated as an API...
FutureWarning: librosa.core.audio.__audioread_load Deprecated...
UserWarning: PySoundFile failed. Trying audioread instead.
| Warning | Source | Status |
|---|---|---|
| `pkg_resources` deprecated | manim-voiceover | Upstream issue - awaiting fix |
| `librosa.__audioread_load` deprecated | qwen-tts → librosa | Upstream issue - awaiting fix |
| PySoundFile failed | qwen-tts → librosa | Install libsndfile (see below) |
To reduce warnings:

- Install the system audio library:

      # Ubuntu/Debian
      sudo apt-get install libsndfile1
      # macOS
      brew install libsndfile

- Suppress warnings in your script (optional):

      import warnings
      warnings.filterwarnings("ignore", category=DeprecationWarning)
      warnings.filterwarnings("ignore", category=FutureWarning)
These warnings don't affect functionality - your videos will render correctly.
- Duration: 3-10 seconds is ideal
- Quality: Clear audio without background noise
- Content: Natural speech, not whispered or shouted
- Format: WAV or MP3 supported
The transcript must exactly match what's said in the reference audio. This helps the model understand the voice characteristics.
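As a quick sanity check, you can verify a reference clip against these guidelines before building a `VoiceProfile`; a minimal sketch using the `soundfile` package (an assumption here - any audio library that exposes duration metadata works):

```python
# Check that a reference clip is long enough before using it for cloning.
import soundfile as sf

info = sf.info("voices/narrator_sample.wav")  # path from the cloning example above
print(f"duration: {info.duration:.1f} s, sample rate: {info.samplerate} Hz")
assert info.duration >= 3.0, "Reference audio should be at least 3 seconds long"
```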
For projects with multiple characters, organize your voices:
project/
├── voices/
│   ├── narrator/
│   │   ├── sample.wav
│   │   └── metadata.json
│   ├── teacher/
│   │   ├── sample.wav
│   │   └── metadata.json
│   └── student/
│       ├── sample.wav
│       └── metadata.json
├── scenes/
│   └── my_scene.py
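One way to wire this layout up is to keep each character's transcript in its `metadata.json` and build the profiles programmatically. The JSON keys below (`ref_text`, `language`) are an assumption for this sketch, not something the package prescribes:

```python
# Build VoiceProfile objects from the voices/ layout shown above.
import json
from pathlib import Path

from manim_voiceover_qwen3_tts import VoiceProfile

def load_voices(voices_dir: str = "voices") -> list[VoiceProfile]:
    profiles = []
    for character_dir in sorted(Path(voices_dir).iterdir()):
        if not character_dir.is_dir():
            continue
        meta = json.loads((character_dir / "metadata.json").read_text())
        profiles.append(
            VoiceProfile(
                name=character_dir.name,                 # e.g. "narrator", "teacher"
                ref_audio=str(character_dir / "sample.wav"),
                ref_text=meta["ref_text"],               # assumed metadata.json key
                language=meta.get("language", "Auto"),   # assumed metadata.json key
            )
        )
    return profiles
```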
See the examples/ directory for complete working examples:
| Example | Service | Description |
|---|---|---|
| `preset_voices.py` | `Qwen3PresetVoiceService` | Preset speakers + languages: switches between built-in voices (Ryan, Vivian, Ono_Anna, Sohee) across English, Chinese, Japanese, Korean |
| `emotion_showcase.py` | `Qwen3PresetVoiceService` | One voice, many emotions: same speaker (Ryan), varying `instruct` per line (happy, sad, angry, excited, calm, etc.) |
| `voice_design.py` | `Qwen3VoiceDesignService` | Many designed voices: same content delivered by 4 different voices created from text descriptions |
| `storytelling_scene.py` | `Qwen3VoiceDesignService` | Multi-character story: narrator, hero, mentor, villain each with unique designed voices |
| `voice_cloning.py` | `Qwen3VoiceCloningService` | Clone from audio: clone voices from reference .wav files, switch between multiple cloned voices |

Note: `voice_cloning.py` includes a sample narrator voice (`voices/narrator.wav`). To add your own voices, create additional `VoiceProfile` entries with:

- `ref_audio`: path to your .wav file (3+ seconds of clear speech)
- `ref_text`: exact transcript of what's spoken in the audio
# Preset voices (built-in speakers, multiple languages)
manim -pql examples/preset_voices.py PresetVoicesDemo
# Emotion control (same voice, different emotions via instruct)
manim -pql examples/emotion_showcase.py EmotionShowcase
# Voice design (create voices from descriptions)
manim -pql examples/voice_design.py VoiceDesignDemo
# Storytelling (multi-character with designed voices)
manim -pql examples/storytelling_scene.py StorytellingScene
# Voice cloning (requires your own .wav files)
manim -pql examples/voice_cloning.py VoiceCloningDemo

Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Qwen3-TTS - The underlying TTS model by Alibaba
- manim-voiceover - The voiceover framework this plugin extends
- Manim Community - The amazing animation library
If you use this project in your research or videos, please consider citing:
@software{manim_voiceover_qwen3_tts,
title = {manim-voiceover-qwen3-tts: Qwen3-TTS Integration for Manim},
url = {https://github.com/DurhamSmith/manim-voiceover-qwen3-tts},
year = {2026}
}