English | 中文版
ComfyUI custom nodes for speech synthesis, voice cloning, and voice design, based on the open-source Qwen3-TTS project by the Alibaba Qwen team.
- 2026-02-04: Feature Update: Added Global Pause Control (`QwenTTSConfigNode`) and `extra_model_paths.yaml` support (update.md)
- 2026-01-29: Feature Update: Support for loading custom fine-tuned models & speakers (update.md)
- Note: Fine-tuning is currently experimental; zero-shot cloning is recommended for best results.
- 2026-01-27: UI Optimization: Sleek LoadSpeaker UI; fixed PyTorch 2.6+ compatibility (update.md)
- 2026-01-26: Functional Update: New voice persistence system (SaveVoice / LoadSpeaker) (update.md)
- 2026-01-24: Added attention mechanism selection & model memory management features (update.md)
- 2026-01-24: Added generation parameters (top_p, top_k, temperature, repetition_penalty) to all TTS nodes (update.md)
- 2026-01-23: Dependency compatibility & Mac (MPS) support, New nodes: VoiceClonePromptNode, DialogueInferenceNode (update.md)
- Qwen3-TTS Multi-Role Multi-Round Dialogue Generation Workflow:
- Qwen3-TTS 3-in-1 (Clone, Design, Custom) Workflow:
- 🎵 Speech Synthesis: High-quality text-to-speech conversion.
- 🎤 Voice Cloning: Zero-shot voice cloning from short reference audio.
- 🎨 Voice Design: Create custom voice characteristics based on natural language descriptions.
- 🚀 Efficient Inference: Supports both 12Hz and 25Hz speech tokenizer architectures.
- 🎯 Multilingual: Native support for 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian).
- ⚡ Integrated Loading: No separate loader nodes required; model loading is managed on-demand with global caching.
- ⏱️ Ultra-Low Latency: Supports high-fidelity speech reconstruction with low-latency streaming.
- 🔧 Attention Mechanism Selection: Choose from multiple attention implementations (sage_attn, flash_attn, sdpa, eager) with auto-detection and graceful fallback.
- 💾 Memory Management: Optional model unloading after generation to free GPU memory for users with limited VRAM.
Generate unique voices based on text descriptions.
- Inputs:
  - `text`: Target text to synthesize.
  - `instruct`: Description of the voice (e.g., "A gentle female voice with a high pitch").
  - `model_choice`: Currently locked to 1.7B for VoiceDesign features.
  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).
  - `unload_model_after_generate`: Unload the model from memory after generation to free GPU memory.
- Capabilities: Best for creating "imaginary" voices or specific character archetypes.
Clone a voice from a reference audio clip.
- Inputs:
  - `ref_audio`: A short (5-15 s) audio clip to clone.
  - `ref_text`: Text spoken in the `ref_audio` (helps improve quality).
  - `target_text`: The new text you want the cloned voice to say.
  - `model_choice`: Choose between 0.6B (fast) or 1.7B (high quality).
  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).
  - `unload_model_after_generate`: Unload the model from memory after generation to free GPU memory.
Standard TTS using preset speakers.
- Inputs:
  - `text`: Target text.
  - `speaker`: Selection from preset voices (Aiden, Eric, Serena, etc.).
  - `instruct`: Optional style instructions.
  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).
  - `unload_model_after_generate`: Unload the model from memory after generation to free GPU memory.
Collect and manage multiple voice prompts for dialogue generation.
- Inputs:
  - Up to 8 roles, each with:
    - `role_name_N`: Name of the role (e.g., "Alice", "Bob", "Narrator")
    - `prompt_N`: Voice clone prompt from `VoiceClonePromptNode`
- Capabilities: Creates a named voice registry for use in `DialogueInferenceNode`. Supports up to 8 different voices per bank.
Extract and reuse voice features from reference audio.
- Inputs:
  - `ref_audio`: A short (5-15 s) audio clip to extract features from.
  - `ref_text`: Text spoken in the `ref_audio` (highly recommended for better quality).
  - `model_choice`: Choose between 0.6B (fast) or 1.7B (high quality).
  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).
  - `unload_model_after_generate`: Unload the model from memory after generation to free GPU memory.
- Capabilities: Extract a "prompt item" once and use it multiple times across different `VoiceCloneNode` instances for faster and more consistent generation.
Synthesize complex dialogues with multiple speakers.
- Inputs:
  - `script`: Dialogue script in the format "RoleName: Text".
  - `role_bank`: Role bank from `RoleBankNode` containing voice prompts.
  - `model_choice`: Choose between 0.6B (fast) or 1.7B (high quality).
  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).
  - `unload_model_after_generate`: Unload the model from memory after generation to free GPU memory.
  - `pause_seconds`: Silence duration between sentences.
  - `merge_outputs`: Merge all dialogue segments into a single long audio clip.
  - `batch_size`: Number of lines to process in parallel (larger = faster, but uses more VRAM).
- Capabilities: Handles multi-role speech synthesis in a single node, ideal for audiobook narration or roleplay scenarios.
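For example, a short script in the required "RoleName: Text" format (the role names here are illustrative and must match the names registered in the role bank):

```text
Alice: Did you finish the recording?
Bob: Almost. I just need one more take.
Narrator: They worked late into the night.
```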
Load saved voice features and metadata with zero configuration.
- Capabilities: Enables a "Select & Play" experience by auto-loading pre-computed features and metadata.
Persist extracted voice features and metadata to disk for future use.
- Capabilities: Build a permanent voice library for reuse via `LoadSpeakerNode`.
Define global pause durations for punctuation to control speech rhythm.
- Inputs:
  - `pause_linebreak`: Silence after line breaks.
  - `period_pause`: Silence after periods (.).
  - `comma_pause`: Silence after commas (,).
  - `question_pause`: Silence after question marks (?).
  - `hyphen_pause`: Silence after hyphens (-).
- Usage: Connect the output to the `config` input of other TTS nodes.
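As a rough illustration of how punctuation-driven pauses work, here is a minimal sketch that inserts silence after each synthesized chunk. The sample rate, pause values, and helper name are assumptions for illustration, not the node's actual internals:

```python
import numpy as np

SAMPLE_RATE = 24000  # assumed output sample rate

# Hypothetical pause settings in seconds, mirroring the node's inputs.
PAUSES = {"\n": 0.8, ".": 0.5, ",": 0.2, "?": 0.6, "-": 0.3}

def append_with_pause(chunks: list, audio: np.ndarray, trailing_char: str) -> None:
    """Append a synthesized chunk, then silence for its trailing punctuation."""
    chunks.append(audio)
    pause_s = PAUSES.get(trailing_char, 0.0)
    if pause_s > 0:
        chunks.append(np.zeros(int(SAMPLE_RATE * pause_s), dtype=np.float32))
```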
All nodes support multiple attention implementations with automatic detection and graceful fallback:
| Mechanism | Description | Speed | Installation |
|---|---|---|---|
| `sage_attn` | SAGE attention implementation | ⚡⚡⚡ Fastest | `pip install sage_attn` |
| `flash_attn` | Flash Attention 2 | ⚡⚡ Fast | `pip install flash_attn` |
| `sdpa` | Scaled Dot Product Attention (PyTorch built-in) | ⚡ Medium | Built-in (no installation) |
| `eager` | Standard attention (fallback) | 🐢 Slowest | Built-in (no installation) |
| `auto` | Automatically selects the best available option | Varies | N/A |
When `attention: "auto"` is selected, the system checks in this order:
- `sage_attn` → if installed, use SAGE attention (fastest)
- `flash_attn` → if installed, use Flash Attention 2
- `sdpa` → always available (PyTorch built-in)
- `eager` → always available (fallback, slowest)
The selected mechanism is logged to the console for transparency.
If you select an attention mechanism that is not available, the node:
- Falls back to `sdpa` (if available)
- Falls back to `eager` (as a last resort)
- Logs the fallback decision with a warning message
- Models are cached with attention-specific keys
- Changing the attention mechanism automatically clears the cache and reloads the model
- The same model can coexist in the cache under different attention mechanisms
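The selection and fallback logic described above could be sketched roughly as follows. This is an illustrative approximation, not the extension's actual code; the mechanism names follow the table above, and the cache-key example at the end is hypothetical:

```python
import importlib.util

# Preference order used by "auto", per the table above.
AUTO_ORDER = ["sage_attn", "flash_attn", "sdpa", "eager"]

def resolve_attention(requested: str) -> str:
    """Pick an available attention mechanism, falling back gracefully."""
    def available(name: str) -> bool:
        if name in ("sdpa", "eager"):
            return True  # PyTorch built-ins are always available
        return importlib.util.find_spec(name) is not None

    order = AUTO_ORDER if requested == "auto" else [requested, "sdpa", "eager"]
    for name in order:
        if available(name):
            if requested not in ("auto", name):
                print(f"[Qwen3-TTS] '{requested}' unavailable; falling back to '{name}'")
            return name
    return "eager"  # unreachable in practice, but a safe default

# Hypothetical cache key: the same checkpoint loaded under different
# attention mechanisms gets its own cache entry.
cache_key = ("Qwen3-TTS-12Hz-1.7B-Base", resolve_attention("auto"))
```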
The `unload_model_after_generate` toggle is available on all nodes:
- Enabled: Clears model cache, GPU memory, and runs garbage collection after generation
- Disabled: Model remains in cache for faster subsequent generations (default)
When to use:
- ✅ Enable if you have limited VRAM (< 8GB)
- ✅ Enable if you need to run multiple different models sequentially
- ✅ Enable if you're done with generation and want to free memory
- ❌ Disable if you're generating multiple clips with the same model (faster)
Console Output:
```
🗑️ [Qwen3-TTS] Unloading 1 cached model(s)...
✅ [Qwen3-TTS] Model cache and GPU memory cleared
```
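Under the hood, the toggle's behavior amounts to something like the following sketch (the cache variable and function name are assumptions; only the behavior described above is guaranteed):

```python
import gc
import torch

MODEL_CACHE = {}  # hypothetical global cache keyed by (model_name, attention)

def unload_models() -> None:
    """Drop cached models and reclaim GPU memory, matching the log lines above."""
    print(f"🗑️ [Qwen3-TTS] Unloading {len(MODEL_CACHE)} cached model(s)...")
    MODEL_CACHE.clear()
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print("✅ [Qwen3-TTS] Model cache and GPU memory cleared")
```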
Ensure you have the required dependencies:
```bash
pip install torch torchaudio transformers librosa accelerate
```

ComfyUI-Qwen-TTS automatically searches for models in the following priority:
```
ComfyUI/
├── models/
│   └── qwen-tts/
│       ├── Qwen/Qwen3-TTS-12Hz-1.7B-Base/
│       ├── Qwen/Qwen3-TTS-12Hz-0.6B-Base/
│       ├── Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign/
│       ├── Qwen/Qwen3-TTS-Tokenizer-12Hz/
│       └── voices/   (saved presets: .wav / .qvp)
```
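A minimal sketch of that lookup order, assuming hypothetical helper names (the real node code may differ); extra search roots can come from the `extra_model_paths.yaml` override described below:

```python
from pathlib import Path
from typing import Optional

def find_local_model(model_name: str, comfy_root: Path,
                     extra_roots: list) -> Optional[Path]:
    """Return the first local directory containing the model.

    Falls through to None, in which case the weights would be
    downloaded from Hugging Face instead.
    """
    candidates = [comfy_root / "models" / "qwen-tts" / model_name]
    candidates += [Path(root) / model_name for root in extra_roots]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```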
Note: You can also use `extra_model_paths.yaml` to define a custom model path:

```yaml
qwen-tts: D:\MyModels\Qwen
```

- Cloning: Use clean, noise-free reference audio (5-15 seconds).
- Reference Text: Providing the text spoken in the reference audio significantly improves quality.
- Language: Select the correct language for best pronunciation and prosody.
- VRAM: Use `bf16` precision to save significant memory with minimal quality loss.
- Attention: Use `attention: "auto"` to automatically select the fastest available mechanism.
- Model Unloading: Enable `unload_model_after_generate` if you have limited VRAM (< 8GB) or need to run multiple different models.
- Local Models: Pre-download weights to `models/qwen-tts/` to prioritize local loading and avoid Hugging Face timeouts.
- Best Performance: Install `sage_attn` or `flash_attn` for a 2-3x speedup over `sdpa`.
- Compatibility: Use `sdpa` (default) for maximum compatibility; no installation required.
- Low VRAM: Use `eager` with smaller models (0.6B) if other mechanisms cause OOM errors.
- Batch Size: Increase `batch_size` for faster generation (at the cost of more VRAM).
- Pauses: Adjust `pause_seconds` to control the timing between dialogue segments.
- Merge: Enable `merge_outputs` for one continuous dialogue track; disable it to get separate clips.
- Qwen3-TTS: Official open-source repository by Alibaba Qwen team.
- This project is licensed under the Apache License 2.0.
- Model weights are subject to the Qwen3-TTS License Agreement.
