TTS Model Architecture (Qwen3-TTS)

Reference for the Swift MLX port, based on Qwen3-TTS-12Hz-0.6B. The speech tokenizer decoder is based on Mimi (Kyutai).

For the separate 48 kHz MiniCPM-4-based multilingual stack, see VoxCPM2.

Overview

Qwen3-TTS has four components: a Talker (main LM), Code Predictor (residual codebooks), Speech Tokenizer (neural audio codec), and Speaker Encoder (voice cloning). The Swift port implements the Talker, Code Predictor, Speech Tokenizer Decoder, and Speaker Encoder. The Speech Tokenizer Encoder (for ICL voice cloning) is not yet ported.

Text input
    |
    v
+--------------------+
|  Text Embedding    |   Qwen2 BPE -> Embedding(151936, 2048) -> MLP projection -> 1024
+--------+-----------+
         |
         v
+--------------------+
|  Talker            |   28-layer Qwen3 transformer (MRoPE, GQA, SwiGLU)
|  1024 hidden dim   |   Generates first codebook autoregressively
+--------+-----------+
         |  hidden states + first codebook tokens
         v
+--------------------+
|  Code Predictor    |   5-layer transformer (standard RoPE)
|  1024 hidden dim   |   Predicts remaining 15 codebooks sequentially
+--------+-----------+
         |  16 codebook indices per frame
         v
+--------------------+
|  Speech Tokenizer  |   Mimi-based neural codec
|  Decoder only      |   RVQ decode -> Transformer -> Upsampling convs -> waveform
+--------+-----------+
         |
         v
    Audio (24kHz)

Component A: Talker

The primary autoregressive transformer. Generates the first codebook of speech tokens from text.

Parameter	0.6B	1.7B
Hidden size	1024	2048
Layers	28	28
Attention heads (Q)	16	16
KV heads (GQA)	8	8
Head dimension	128	128
Intermediate size	3072	6144
Codec vocab size	3072	3072
Text vocab size	151936	151936
RoPE type	MRoPE (3D sections [24,20,20], interleaved)	MRoPE
RoPE base	1,000,000	1,000,000
Quantization	4-bit or 8-bit	8-bit or bf16
Q/K normalization	RMSNorm per head	RMSNorm per head

MRoPE (Multimodal RoPE): Unlike ASR's standard 1D RoPE, the Talker uses 3D position encoding with sections [24, 20, 20] across the 64 rotation dimensions (head_dim/2 = 64). Positions are interleaved as [T, H, W, T, H, W, ...] across the dimension.
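The section layout can be sketched as below. This is a minimal illustration, not the port's code: it assumes the interleaving follows the pattern used by interleaved MRoPE in Qwen3-VL's reference implementation, where H and W take every third slot among the first 60 rotation dimensions and T takes everything else. The function name `mrope_axis_per_dim` is hypothetical.

```python
def mrope_axis_per_dim(sections=(24, 20, 20)):
    """Return which position axis (T, H or W) drives each rotation dimension."""
    n_dims = sum(sections)      # 64 = head_dim / 2
    axes = ["T"] * n_dims       # T is the default axis for every slot
    for offset, name in ((1, "H"), (2, "W")):
        count = sections[offset]
        # Assumption: H (offset 1) and W (offset 2) each claim every third
        # slot, for as many slots as their section size allows.
        for i in range(offset, count * 3, 3):
            axes[i] = name
    return axes


axes = mrope_axis_per_dim()
print("".join(axes[:9]), "...", "".join(axes[-4:]))
print({name: axes.count(name) for name in "THW"})
```

Under this assumption T keeps the last four slots, which is how it ends up with 24 dimensions while H and W get 20 each. When all three position components are equal, every dimension sees the same position and the scheme reduces to standard 1D RoPE.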

Two embedding tables:

text_embedding: 151936 tokens, dim 2048 (projected to 1024 via MLP)
codec_embedding: per-codebook, dim 1024

Special codec tokens (0.6B):

Token	ID
`codec_pad`	2148
`codec_bos`	2149
`codec_eos`	2150
Language: English	2050
Language: German	2052
Language: Spanish	2054
Language: Chinese	2055
Language: Japanese	2058
Language: French	2061
Language: Korean	2064
Language: Russian	2069
Language: Italian	2070

Transformer block (identical to the ASR model's block except for RoPE):

x -> RMSNorm -> Attention(Q/K projections, Q/K RMSNorm, MRoPE, GQA via SDPA) -> + residual
  -> RMSNorm -> SwiGLU MLP(gate_proj, up_proj, down_proj)                     -> + residual
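The linear-layer shapes implied by the Talker table can be worked out as follows. This is a rough sketch: `TalkerConfig` and `projection_shapes` are hypothetical names, and the attention projection names (`q_proj`, `k_proj`, `v_proj`, `o_proj`) are assumed from common Qwen-style checkpoints rather than taken from this document.

```python
from dataclasses import dataclass


@dataclass
class TalkerConfig:
    hidden_size: int
    intermediate_size: int
    num_q_heads: int = 16
    num_kv_heads: int = 8
    head_dim: int = 128


def projection_shapes(cfg: TalkerConfig) -> dict:
    """Return (in_features, out_features) for each linear layer in one block."""
    q_out = cfg.num_q_heads * cfg.head_dim     # total width of the query heads
    kv_out = cfg.num_kv_heads * cfg.head_dim   # total width of the key/value heads
    return {
        "q_proj": (cfg.hidden_size, q_out),
        "k_proj": (cfg.hidden_size, kv_out),
        "v_proj": (cfg.hidden_size, kv_out),
        "o_proj": (q_out, cfg.hidden_size),
        "gate_proj": (cfg.hidden_size, cfg.intermediate_size),
        "up_proj": (cfg.hidden_size, cfg.intermediate_size),
        "down_proj": (cfg.intermediate_size, cfg.hidden_size),
    }


small = TalkerConfig(hidden_size=1024, intermediate_size=3072)   # 0.6B column
large = TalkerConfig(hidden_size=2048, intermediate_size=6144)   # 1.7B column
for name, cfg in (("0.6B", small), ("1.7B", large)):
    group = cfg.num_q_heads // cfg.num_kv_heads   # query heads sharing one KV head
    print(name, projection_shapes(cfg), "queries per KV head:", group)
```

Note that for the 0.6B model the query width (16 x 128 = 2048) is larger than the hidden size (1024), so the attention output projection maps 2048 back down to 1024.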

Component B: Code Predictor

A smaller transformer that predicts the remaining 15 codebooks given the first.

Parameter	Value
Hidden size	1024
Layers	5
Attention heads (Q)	16
KV heads (GQA)	8
Head dimension	128
Intermediate size	3072
Vocab size	2048 (per codebook)
RoPE type	Standard 1D
LM heads	15 (one per remaining codebook)

Sequential prediction: For each time step, the Code Predictor predicts codebook 2 from the Talker's hidden state and first-codebook token, then codebook 3, and so on through codebook 16.
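The control flow of that per-frame loop might look like the sketch below. The heads here are random stubs standing in for the 5-layer transformer, its 15 LM heads, and sampling; `predict_frame` and `make_stub_head` are hypothetical names, and feeding the previously predicted codes into each head is an assumption about how "sequentially" is realised.

```python
import random

NUM_CODEBOOKS = 16
CODEBOOK_VOCAB = 2048   # per-codebook vocab size from the table above


def predict_frame(first_code, talker_hidden, heads):
    """Fill in codebooks 2..16 for one frame, one codebook at a time."""
    codes = [first_code]
    for head in heads:                      # 15 heads, one per remaining codebook
        codes.append(head(talker_hidden, codes))
    return codes


def make_stub_head(seed):
    """Stand-in for a real LM head: returns a random valid code index."""
    rng = random.Random(seed)

    def head(hidden, previous_codes):
        return rng.randrange(CODEBOOK_VOCAB)

    return head


heads = [make_stub_head(i) for i in range(NUM_CODEBOOKS - 1)]
frame = predict_frame(first_code=123, talker_hidden=[0.0] * 1024, heads=heads)
print(len(frame), frame[:4])
```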

Component C: Speech Tokenizer (Mimi Codec)

A neural audio codec that converts between waveforms and discrete multi-codebook tokens at 12.5 Hz.

Encoder (Not Yet Implemented)

The encoder converts audio to codebook tokens and is used for voice cloning input. It is not yet ported to Swift.

Audio (24kHz) -> SeanetEncoder (Conv1d downsampling, residual blocks)
             -> ProjectedTransformer (8 layers, causal, RoPE)
             -> ConvDownsample1d (to 12.5 Hz)
             -> SplitResidualVectorQuantizer
                  - 1 semantic quantizer (4096 codebook)
                  - 15 acoustic quantizers (2048 codebook each)
             -> 16 codebook indices per frame

Parameter	Value
Input sample rate	24,000 Hz
Frame rate	12.5 Hz
Downsample rate	1920x (24000 / 12.5)
Encoder hidden size	512
Transformer layers	8
Transformer heads	8
Num quantizers	16 (1 semantic + 15 acoustic)
Codebook size	2048 (acoustic), 4096 (semantic)
Codebook dim	256

Decoder (tokens -> audio, used for synthesis output)

16 codebook indices -> SplitRVQ.decode() (sum embeddings)       -> [T, 512]
                    -> Pre-conv (CausalConv1d k=3)              -> [T, 1024]
                    -> Pre-transformer (8 layers, causal RoPE,
                       1024->512 bottleneck, SwiGLU+LayerScale) -> [T, 1024]
                    -> Pre-upsample (TransposedConv1d 2x + ConvNeXt) x2 -> [4T, 1024]
                    -> Input conv                               -> [4T, 1536]
                    -> SEANet decoder (8x,5x,4x,3x = 480x):
                       SnakeBeta + TransposedConv1d + 3x dilated residual units
                    -> SnakeBeta + CausalConv1d(7,1) + clip     -> [T*1920, 1]
                    -> Audio waveform (24kHz)

Parameter	Value
Output sample rate	24,000 Hz
Decoder dim	1536
Latent dim	1024
Transformer layers	8
Transformer heads	16
Upsample rates	[2, 2] pre-upsample (4x) then [8, 5, 4, 3] SEANet decoder (480x) = 1920x
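The rate arithmetic in the two tables can be checked with a few lines. The constant and function names below are illustrative only.

```python
from math import prod

SAMPLE_RATE = 24_000
PRE_UPSAMPLE = [2, 2]            # two 2x transposed-conv stages
SEANET_UPSAMPLE = [8, 5, 4, 3]   # SEANet decoder stages
NUM_CODEBOOKS = 16

samples_per_frame = prod(PRE_UPSAMPLE) * prod(SEANET_UPSAMPLE)
frame_rate = SAMPLE_RATE / samples_per_frame
codes_per_second = frame_rate * NUM_CODEBOOKS


def audio_seconds(num_frames):
    """Duration of the waveform decoded from `num_frames` codec frames."""
    return num_frames * samples_per_frame / SAMPLE_RATE


print(samples_per_frame, frame_rate, codes_per_second)
print(audio_seconds(125))
```

Each frame therefore covers 1920 samples (80 ms). Assuming the Talker emits one first-codebook token per frame, the 8192-token limit listed under Generation Config would correspond to roughly 655 seconds of audio.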

SnakeBeta activation: x + (1/b) * sin^2(a * x), a periodic activation with learnable parameters a (frequency) and b (magnitude), used in the decoder upsampling blocks for high-quality audio reconstruction.
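A scalar version of the formula, as a minimal sketch. In a real decoder a and b would be learned per channel; the small `eps` guard against division by zero is an assumption added here, and some implementations also store the parameters in log scale, which this sketch does not model.

```python
import math


def snake_beta(x, a, b, eps=1e-9):
    """x + (1/b) * sin^2(a * x) for a single value."""
    return x + (1.0 / (b + eps)) * math.sin(a * x) ** 2


for x in (-1.0, 0.0, 0.5, math.pi / 2):
    print(x, snake_beta(x, a=1.0, b=1.0))
```

Because sin^2 is non-negative, the activation never falls below the identity when b is positive, and it passes through the origin.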

Model Variants

Qwen3-TTS ships in two variants with identical architecture (Talker + Code Predictor + Speech Tokenizer). The difference is fine-tuning and how speaker identity is provided. Both variants come in 0.6B and 1.7B sizes. The 0.6B ships 4-bit and 8-bit; the 1.7B ships 8-bit and bf16 — its 4-bit was dropped because it degraded badly (silent or garbled output on some inputs).

Variant	Size	Quantization	HuggingFace ID	Speaker Selection
Base	0.6B	4-bit	`aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit`	None (single default voice)
Base	0.6B	8-bit	`aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-8bit`	None (single default voice)
Base	1.7B	8-bit	`aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-8bit`	None (single default voice)
Base	1.7B	bf16	`aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-bf16`	None (single default voice)
CustomVoice	0.6B	4-bit	`aufklarer/Qwen3-TTS-12Hz-0.6B-CustomVoice-MLX-4bit`	9 preset voices + instruction control

CustomVoice Speakers

The CustomVoice model includes 9 preset voices. Each speaker is selected by appending its speaker token ID to the codec prefix (see Codec Prefix Construction).

Speaker	Language	Codec Token ID	Description
Vivian	Chinese (Mandarin)	3065	Bright young female
Serena	Chinese (Mandarin)	3066	Warm, gentle young female
Uncle_Fu	Chinese (Mandarin)	3010	Seasoned male, mellow timbre
Dylan	Chinese (Beijing dialect)	2878	Youthful Beijing male
Eric	Chinese (Sichuan dialect)	2875	Lively Chengdu male
Ryan	English	3061	Dynamic male with rhythm
Aiden	English	2861	Sunny American male
Ono_Anna	Japanese	2873	Playful female
Sohee	Korean	2864	Warm female

Dialect handling: Dylan and Eric have dialect overrides in `spk_is_dialect`. When either is selected, the language ID is replaced with the corresponding dialect token instead of the standard Chinese token.

Supported Languages

Both variants support these language IDs in the codec prefix:

Language	Codec Token ID
English	2050
German	2052
Spanish	2054
Chinese	2055
Japanese	2058
French	2061
Korean	2064
Russian	2069
Italian	2070

Note: The CustomVoice preset speakers are each trained on specific languages (see the CustomVoice Speakers table). Using a speaker with a mismatched language (e.g., Ryan with Chinese text) will still produce output, but quality may degrade.
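The speaker and language tables can be combined into a small lookup that also flags a mismatch. This is only a sketch: the `resolve` helper is hypothetical, and it ignores the dialect overrides for Dylan and Eric because their dialect token IDs are not listed in this document.

```python
LANGUAGE_IDS = {
    "English": 2050, "German": 2052, "Spanish": 2054, "Chinese": 2055,
    "Japanese": 2058, "French": 2061, "Korean": 2064, "Russian": 2069,
    "Italian": 2070,
}

# speaker name -> (training language, codec token ID), from the speaker table
SPEAKERS = {
    "Vivian": ("Chinese", 3065), "Serena": ("Chinese", 3066),
    "Uncle_Fu": ("Chinese", 3010), "Dylan": ("Chinese", 2878),
    "Eric": ("Chinese", 2875), "Ryan": ("English", 3061),
    "Aiden": ("English", 2861), "Ono_Anna": ("Japanese", 2873),
    "Sohee": ("Korean", 2864),
}


def resolve(speaker, language):
    """Look up both token IDs and report whether the pairing is mismatched."""
    native_language, speaker_token = SPEAKERS[speaker]
    return {
        "speaker_token": speaker_token,
        "language_token": LANGUAGE_IDS[language],
        "mismatch": native_language != language,
    }


print(resolve("Ryan", "English"))
print(resolve("Ryan", "Chinese"))   # allowed, but quality may degrade
```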

Instruction Control (CustomVoice only)

CustomVoice accepts a natural language instruct string to control tone, emotion, and prosody. The instruction is prepended to the text input in ChatML format and interpreted by the Talker transformer.

Token format:

<|im_start|>user\n{instruct}<|im_end|>\n

These tokens are placed before the role/text embeddings in the prefill sequence:

[instruct_embeddings | role_embed | text_embeddings | trailing_tokens]

Default instruct: When no instruct is provided and the CustomVoice model is loaded, "Speak naturally." is applied automatically. This prevents the model from producing rambling or unfocused output, especially for short texts.
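The string-level behaviour described above could be sketched like this. The function name `instruct_prefix` and its `is_custom_voice` flag are hypothetical; tokenization and embedding of the result are out of scope.

```python
DEFAULT_INSTRUCT = "Speak naturally."


def instruct_prefix(instruct, is_custom_voice):
    """Return the ChatML-formatted instruct block, or "" for the Base model."""
    if not is_custom_voice:
        return ""                                  # Base ignores the parameter
    text = instruct if instruct else DEFAULT_INSTRUCT
    return f"<|im_start|>user\n{text}<|im_end|>\n"


print(repr(instruct_prefix(None, is_custom_voice=True)))
print(repr(instruct_prefix("Whisper this softly", is_custom_voice=True)))
print(repr(instruct_prefix("Whisper this softly", is_custom_voice=False)))
```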

Example instructions:

Instruction	Effect
`"Speak naturally."`	Default — neutral, clear speech
`"Speak in a cheerful, upbeat tone"`	Happy, energetic delivery
`"Read this slowly and solemnly"`	Slow pacing, serious tone
`"Whisper this softly"`	Quiet, breathy voice
`"Speak with excitement and energy"`	Fast, enthusiastic delivery
`"Read this as a news anchor"`	Professional, measured cadence
`"Speak gently, as if to a child"`	Soft, warm, simple phrasing

Notes:

The Base model does not support instruct — the parameter is ignored
Instruct works with all synthesis modes: basic, streaming, and batch
Different instructions produce measurably different audio (verified in tests)
Keep instructions concise (1-2 sentences) for best results

Codec Prefix Construction

Base model (6 tokens):

[think, think_bos, language_id, think_eos, pad, bos]

CustomVoice with speaker (7 tokens):

[think, think_bos, language_id, think_eos, pad, bos, speaker_token]
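Both layouts can be assembled by one helper, sketched below. The IDs for `think`, `think_bos`, and `think_eos` are not given in this document, so they are passed in as arguments and the values used in the example are obvious placeholders. Treating `pad` and `bos` as `codec_pad` (2148) and `codec_bos` (2149) is also an assumption.

```python
CODEC_PAD = 2148   # assumed to be the `pad` slot in the prefix
CODEC_BOS = 2149   # assumed to be the `bos` slot in the prefix


def build_codec_prefix(language_id, think, think_bos, think_eos, speaker_token=None):
    """Return the 6-token Base prefix, or the 7-token CustomVoice prefix."""
    prefix = [think, think_bos, language_id, think_eos, CODEC_PAD, CODEC_BOS]
    if speaker_token is not None:
        prefix.append(speaker_token)   # CustomVoice: speaker token comes last
    return prefix


# Placeholder think-token IDs (NOT real values).
THINK, THINK_BOS, THINK_EOS = -1, -2, -3

base = build_codec_prefix(2050, THINK, THINK_BOS, THINK_EOS)
ryan = build_codec_prefix(2050, THINK, THINK_BOS, THINK_EOS, speaker_token=3061)
print(base)
print(ryan)
```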

Component D: Speaker Encoder

ECAPA-TDNN network extracting speaker embeddings from reference audio for x-vector voice cloning. The CustomVoice model does not use the speaker encoder — voices are selected by token ID.

Parameter	Value
Input	128-bin mel spectrogram (24kHz, n_fft=1024, hop=256)
Output	1024-dim speaker embedding
Architecture	TDNN(128→512) → 3x SE-Res2Net(512) → MFA(1536) → ASP → FC(3072→1024)
Channels	[512, 512, 512, 512, 1536]
Weights	76 tensors in `speaker_encoder.*` from model.safetensors
File	`Sources/Qwen3TTS/SpeakerEncoder.swift`

The speaker embedding is injected between the think tokens and pad/bos in the codec prefix:

Without cloning: [think, think_bos, lang_id, think_eos, pad, bos]
With cloning:    [think, think_bos, lang_id, think_eos, SPEAKER_EMBED, pad, bos]
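At the embedding level, the injection amounts to splicing one extra vector into the prefix after the four think-block positions. A minimal sketch, using string labels in place of real 1024-dim vectors; `inject_speaker_embedding` is a hypothetical name.

```python
def inject_speaker_embedding(prefix_embeddings, speaker_embedding, think_len=4):
    """Insert the speaker vector between think_eos and pad."""
    head = prefix_embeddings[:think_len]    # think, think_bos, lang_id, think_eos
    tail = prefix_embeddings[think_len:]    # pad, bos
    return head + [speaker_embedding] + tail


without_cloning = ["think", "think_bos", "lang_id", "think_eos", "pad", "bos"]
with_cloning = inject_speaker_embedding(without_cloning, "SPEAKER_EMBED")
print(with_cloning)
```

Unlike the CustomVoice speaker token, the injected item here is a continuous vector from the Speaker Encoder rather than a row looked up in the codec embedding table.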

Generation Config

Parameter	Talker	Code Predictor
Sampling	Yes	Yes
Temperature	0.9	0.9
Top-k	50	50
Top-p	1.0	1.0
Repetition penalty	1.05	-
Max tokens	8192	-
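How these settings interact during one sampling step can be sketched as follows. This is not the port's sampler: the order of operations (repetition penalty, then temperature, then top-k, then top-p) and the penalty rule (divide positive logits, multiply negative ones) follow common Hugging Face-style practice and are assumptions here. `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, history=(), temperature=0.9, top_k=50, top_p=1.0,
                 repetition_penalty=1.0, rng=random):
    """Draw one token index from raw logits using the table's settings."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(history):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # Temperature scaling.
    logits = [value / temperature for value in logits]
    # Top-k: keep the k highest-scoring candidates, best first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the kept candidates (shifted by the max for stability).
    peak = logits[order[0]]
    weights = [math.exp(logits[i] - peak) for i in order]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_tokens, kept_probs, cumulative = [], [], 0.0
    for i, w in zip(order, weights):
        p = w / total
        kept_tokens.append(i)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    return rng.choices(kept_tokens, weights=kept_probs, k=1)[0]


# Talker-style call: penalty 1.05 applied to previously generated tokens.
token = sample_token([0.2, 1.5, -0.3, 0.9], history=[1], repetition_penalty=1.05,
                     rng=random.Random(0))
print(token)
```

With top-p at 1.0, as in the table, the nucleus step keeps every top-k candidate, so only temperature and top-k shape the distribution. The Code Predictor column has no repetition penalty, which corresponds to leaving `repetition_penalty` at 1.0.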

Weight Files

File	Unquantized	4-bit Quantized	Purpose
`model.safetensors`	1.83 GB	977 MB	Talker + Code Predictor + Speaker Encoder + text embeddings
`speech_tokenizer/model.safetensors`	682 MB	651 MB (float32)	Audio codec (encoder + decoder + RVQ)
`config.json`	4.5 kB	4.5 kB	Main model config
`speech_tokenizer/config.json`	2.3 kB	2.3 kB	Codec config
`vocab.json` + `merges.txt`	4.5 MB	4.5 MB	BPE tokenizer

Total: ~2.5 GB unquantized, ~1.6 GB 4-bit quantized (speech tokenizer stays float32).

Pre-converted MLX weights: aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit
