Architecture reference for the Swift MLX port of Qwen3-TTS, based on Qwen3-TTS-12Hz-0.6B. The speech tokenizer decoder is based on Mimi (Kyutai).
For the separate 48 kHz MiniCPM-4-based multilingual stack, see VoxCPM2.
Qwen3-TTS has four components: a Talker (main LM), Code Predictor (residual codebooks), Speech Tokenizer (neural audio codec), and Speaker Encoder (voice cloning). The Swift port implements the Talker, Code Predictor, Speech Tokenizer Decoder, and Speaker Encoder. The Speech Tokenizer Encoder (for ICL voice cloning) is not yet ported.
Text input
|
v
+--------------------+
| Text Embedding | Qwen2 BPE -> Embedding(151936, 2048) -> MLP projection -> 1024
+--------+-----------+
|
v
+--------------------+
| Talker | 28-layer Qwen3 transformer (MRoPE, GQA, SwiGLU)
| 1024 hidden dim | Generates first codebook autoregressively
+--------+-----------+
| hidden states + first codebook tokens
v
+--------------------+
| Code Predictor | 5-layer transformer (standard RoPE)
| 1024 hidden dim | Predicts remaining 15 codebooks sequentially
+--------+-----------+
| 16 codebook indices per frame
v
+--------------------+
| Speech Tokenizer | Mimi-based neural codec
| Decoder only | RVQ decode -> Transformer -> Upsampling convs -> waveform
+--------+-----------+
|
v
Audio (24kHz)
The primary autoregressive transformer. Generates the first codebook of speech tokens from text.
| Parameter | 0.6B | 1.7B |
|---|---|---|
| Hidden size | 1024 | 2048 |
| Layers | 28 | 28 |
| Attention heads (Q) | 16 | 16 |
| KV heads (GQA) | 8 | 8 |
| Head dimension | 128 | 128 |
| Intermediate size | 3072 | 6144 |
| Codec vocab size | 3072 | 3072 |
| Text vocab size | 151936 | 151936 |
| RoPE type | MRoPE (3D sections [24,20,20], interleaved) | MRoPE |
| RoPE base | 1,000,000 | 1,000,000 |
| Quantization | 4-bit or 8-bit | 8-bit or bf16 |
| Q/K normalization | RMSNorm per head | RMSNorm per head |
MRoPE (Multimodal RoPE): Unlike ASR's standard 1D RoPE, the Talker uses 3D position encoding with sections [24, 20, 20] across the 64 rotation dimensions (head_dim/2 = 64). Positions are interleaved as [T, H, W, T, H, W, ...] across the dimension.
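The section layout can be made concrete with a small sketch. It assumes the interleaving rule used by Qwen3-VL-style interleaved MRoPE (H and W each take every third slot until their sections are used up, and T fills the rest); this document does not confirm that exact rule, and `mrope_axis_layout` is a hypothetical helper name.

```python
from collections import Counter


def mrope_axis_layout(sections=(24, 20, 20)):
    """Assign each rotation dimension to the T, H, or W position axis."""
    total = sum(sections)      # 64 rotation dimensions = head_dim / 2
    layout = ["T"] * total     # T is the default axis
    # Assumption: H starts at offset 1 and W at offset 2, both with stride 3,
    # and each covers exactly sections[axis] slots.
    for axis, (name, offset) in enumerate((("H", 1), ("W", 2)), start=1):
        for i in range(offset, sections[axis] * 3, 3):
            layout[i] = name
    return layout


layout = mrope_axis_layout()
print("".join(layout))   # axis letter for each of the 64 rotation dimensions
print(Counter(layout))   # how many dimensions each axis owns
```

Under this assumption the pattern is strictly T, H, W for the first 60 dimensions, and the last four dimensions fall back to T because the T section is larger than the other two.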
Two embedding tables:
- text_embedding: 151936 tokens, dim 2048 (projected to 1024 via MLP)
- codec_embedding: per-codebook, dim 1024
Special codec tokens (0.6B):
| Token | ID |
|---|---|
| `codec_pad` | 2148 |
| `codec_bos` | 2149 |
| `codec_eos` | 2150 |
| Language: English | 2050 |
| Language: German | 2052 |
| Language: Spanish | 2054 |
| Language: Chinese | 2055 |
| Language: Japanese | 2058 |
| Language: French | 2061 |
| Language: Korean | 2064 |
| Language: Russian | 2069 |
| Language: Italian | 2070 |
Transformer block (identical to ASR except RoPE):
x -> RMSNorm -> Attention(Q/K projections, Q/K RMSNorm, MRoPE, GQA via SDPA) -> + residual
-> RMSNorm -> SwiGLU MLP(gate_proj, up_proj, down_proj) -> + residual
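The projection shapes implied by the parameter table can be worked out with a short sketch. It assumes bias-free linear layers and ignores the norm weights; the names `q_proj`, `k_proj`, `v_proj`, and `o_proj` are assumed for illustration (only `gate_proj`, `up_proj`, and `down_proj` are named above).

```python
def talker_layer_shapes(hidden, intermediate, q_heads=16, kv_heads=8, head_dim=128):
    """Weight shapes as (out_features, in_features) for one transformer block."""
    return {
        "q_proj": (q_heads * head_dim, hidden),
        "k_proj": (kv_heads * head_dim, hidden),   # GQA: 8 KV heads serve 16 Q heads
        "v_proj": (kv_heads * head_dim, hidden),
        "o_proj": (hidden, q_heads * head_dim),
        "gate_proj": (intermediate, hidden),
        "up_proj": (intermediate, hidden),
        "down_proj": (hidden, intermediate),
    }


def count_params(shapes):
    """Total weight count for a dict of 2-D shapes."""
    return sum(rows * cols for rows, cols in shapes.values())


for name, (hidden, intermediate) in {"0.6B": (1024, 3072), "1.7B": (2048, 6144)}.items():
    shapes = talker_layer_shapes(hidden, intermediate)
    print(name, shapes["q_proj"], shapes["k_proj"], count_params(shapes))
```

Note that in the 0.6B model the concatenated query width (16 x 128 = 2048) is twice the hidden size, so `o_proj` maps 2048 back down to 1024.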
A smaller transformer that predicts the remaining 15 codebooks given the first.
| Parameter | Value |
|---|---|
| Hidden size | 1024 |
| Layers | 5 |
| Attention heads (Q) | 16 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Intermediate size | 3072 |
| Vocab size | 2048 (per codebook) |
| RoPE type | Standard 1D |
| LM heads | 15 (one per remaining codebook) |
Sequential prediction: For each time step, predicts codebook 2 from codebook 1's hidden state, then codebook 3, and so on through codebook 16.
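The per-frame loop can be sketched as follows. `predict_next` is a hypothetical callable standing in for the 5-layer transformer plus the LM head of the codebook being predicted; the real conditioning and caching are not described here, so this only illustrates the ordering.

```python
def predict_frame(first_code, talker_hidden, predict_next, num_codebooks=16):
    """Fill in codebooks 2..16 for one frame, one codebook at a time."""
    codes = [first_code]                      # codebook 1 comes from the Talker
    for codebook_index in range(1, num_codebooks):
        # Each step sees the Talker hidden state and all codes predicted so far.
        codes.append(predict_next(talker_hidden, codes, codebook_index))
    return codes


# Toy stand-in so the sketch runs; a real predictor would sample from logits.
def toy_predictor(hidden, codes, codebook_index):
    return (codes[-1] + codebook_index) % 2048


frame = predict_frame(first_code=5, talker_hidden=None, predict_next=toy_predictor)
print(len(frame), frame)
```

The loop runs 15 times per frame, which matches the 15 LM heads listed in the table.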
A neural audio codec that converts between waveforms and discrete multi-codebook tokens at 12.5 Hz.
The encoder converts audio to codebook tokens and is used for voice cloning input. It is not yet ported to Swift.
Audio (24kHz) -> SeanetEncoder (Conv1d downsampling, residual blocks)
-> ProjectedTransformer (8 layers, causal, RoPE)
-> ConvDownsample1d (to 12.5 Hz)
-> SplitResidualVectorQuantizer
- 1 semantic quantizer (4096 codebook)
- 15 acoustic quantizers (2048 codebook each)
-> 16 codebook indices per frame
| Parameter | Value |
|---|---|
| Input sample rate | 24,000 Hz |
| Frame rate | 12.5 Hz |
| Downsample rate | 1920x (24000 / 12.5) |
| Encoder hidden size | 512 |
| Transformer layers | 8 |
| Transformer heads | 8 |
| Num quantizers | 16 (1 semantic + 15 acoustic) |
| Codebook size | 2048 (acoustic), 4096 (semantic) |
| Codebook dim | 256 |
16 codebook indices -> SplitRVQ.decode() (sum embeddings) -> [T, 512]
-> Pre-conv (CausalConv1d k=3) -> [T, 1024]
-> Pre-transformer (8 layers, causal RoPE,
1024->512 bottleneck, SwiGLU+LayerScale) -> [T, 1024]
-> Pre-upsample (TransposedConv1d 2x + ConvNeXt) x2 -> [4T, 1024]
-> Input conv -> [4T, 1536]
-> SEANet decoder (8x,5x,4x,3x = 480x):
SnakeBeta + TransposedConv1d + 3x dilated residual units
-> SnakeBeta + CausalConv1d(7,1) + clip -> [T*1920, 1]
-> Audio waveform (24kHz)
| Parameter | Value |
|---|---|
| Output sample rate | 24,000 Hz |
| Decoder dim | 1536 |
| Latent dim | 1024 |
| Transformer layers | 8 |
| Transformer heads | 16 |
| Upsample rates | [2, 2] pre-upsample (4x) then [8, 5, 4, 3] SEANet decoder (480x) = 1920x |
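The upsampling arithmetic can be checked with a few lines. This sketch assumes each stage multiplies the sequence length exactly by its rate, with no trimming or padding from the causal convolutions; `decoder_lengths` is a hypothetical helper.

```python
import math

PRE_UPSAMPLE = [2, 2]            # pre-upsample stages
SEANET_UPSAMPLE = [8, 5, 4, 3]   # SEANet decoder stages
SAMPLE_RATE = 24_000


def decoder_lengths(num_frames):
    """Sequence length after each upsampling stage for `num_frames` codec frames."""
    lengths = [num_frames]
    for rate in PRE_UPSAMPLE + SEANET_UPSAMPLE:
        lengths.append(lengths[-1] * rate)
    return lengths


total_rate = math.prod(PRE_UPSAMPLE + SEANET_UPSAMPLE)   # samples per codec frame
frame_rate = SAMPLE_RATE / total_rate                    # codec frames per second
lengths = decoder_lengths(25)                            # 25 frames is 2 s at 12.5 Hz
print(total_rate, frame_rate)
print(lengths, lengths[-1] / SAMPLE_RATE)
```

With 16 codebook indices per frame at 12.5 Hz, the codec carries 200 indices per second of audio.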
SnakeBeta activation: x + (1/b) * sin^2(a * x) — learnable periodic activation used in the decoder upsampling blocks for high-quality audio reconstruction.
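A direct transcription of the formula, as a minimal sketch for a single channel with scalar parameters. How the port stores `a` and `b` (for example per channel, or in log scale) is not stated here, so the plain-value form below is an assumption.

```python
import math


def snake_beta(x, alpha, beta):
    """SnakeBeta for one channel: x + (1 / beta) * sin^2(alpha * x)."""
    return [v + (1.0 / beta) * math.sin(alpha * v) ** 2 for v in x]


samples = [-1.0, 0.0, 0.5, math.pi / 2]
print(snake_beta(samples, alpha=1.0, beta=2.0))
```

Here `alpha` sets the frequency of the periodic term and `beta` scales its magnitude; the identity term `x` keeps the activation monotone-friendly for gradients.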
Qwen3-TTS ships in two variants with identical architecture (Talker + Code Predictor + Speech Tokenizer). The difference is fine-tuning and how speaker identity is provided. Both variants come in 0.6B and 1.7B sizes. The 0.6B ships 4-bit and 8-bit; the 1.7B ships 8-bit and bf16 — its 4-bit was dropped because it degraded badly (silent or garbled output on some inputs).
| Variant | Size | Quantization | HuggingFace ID | Speaker Selection |
|---|---|---|---|---|
| Base | 0.6B | 4-bit | `aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit` | None (single default voice) |
| Base | 0.6B | 8-bit | `aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-8bit` | None (single default voice) |
| Base | 1.7B | 8-bit | `aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-8bit` | None (single default voice) |
| Base | 1.7B | bf16 | `aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-bf16` | None (single default voice) |
| CustomVoice | 0.6B | 4-bit | `aufklarer/Qwen3-TTS-12Hz-0.6B-CustomVoice-MLX-4bit` | 9 preset voices + instruction control |
The CustomVoice model includes 9 preset voices. Each speaker is selected by including its speaker token ID in the codec prefix.
| Speaker | Language | Codec Token ID | Description |
|---|---|---|---|
| Vivian | Chinese (Mandarin) | 3065 | Bright young female |
| Serena | Chinese (Mandarin) | 3066 | Warm, gentle young female |
| Uncle_Fu | Chinese (Mandarin) | 3010 | Seasoned male, mellow timbre |
| Dylan | Chinese (Beijing dialect) | 2878 | Youthful Beijing male |
| Eric | Chinese (Sichuan dialect) | 2875 | Lively Chengdu male |
| Ryan | English | 3061 | Dynamic male with rhythm |
| Aiden | English | 2861 | Sunny American male |
| Ono_Anna | Japanese | 2873 | Playful female |
| Sohee | Korean | 2864 | Warm female |
Dialect handling: Dylan and Eric have dialect overrides in spk_is_dialect — when selected, the language ID is replaced with the corresponding dialect token instead of standard Chinese.
Both variants support these language IDs in the codec prefix:
| Language | Codec Token ID |
|---|---|
| English | 2050 |
| German | 2052 |
| Spanish | 2054 |
| Chinese | 2055 |
| Japanese | 2058 |
| French | 2061 |
| Korean | 2064 |
| Russian | 2069 |
| Italian | 2070 |
Note: The CustomVoice preset speakers are each trained on specific languages (see table above). Using a speaker with a mismatched language (e.g., Ryan with Chinese text) will produce output but quality may degrade.
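The speaker and language tables can be combined into a small lookup that flags a mismatch before synthesis. This is an illustrative sketch: `resolve_speaker` is a hypothetical helper, and it does not model the dialect override for Dylan and Eric, because the dialect token IDs are not listed here.

```python
# Speaker -> (trained language, codec token ID), copied from the speaker table.
PRESET_SPEAKERS = {
    "Vivian": ("Chinese", 3065),
    "Serena": ("Chinese", 3066),
    "Uncle_Fu": ("Chinese", 3010),
    "Dylan": ("Chinese", 2878),
    "Eric": ("Chinese", 2875),
    "Ryan": ("English", 3061),
    "Aiden": ("English", 2861),
    "Ono_Anna": ("Japanese", 2873),
    "Sohee": ("Korean", 2864),
}

# Language -> codec token ID, copied from the language table.
LANGUAGE_TOKENS = {
    "English": 2050, "German": 2052, "Spanish": 2054,
    "Chinese": 2055, "Japanese": 2058, "French": 2061,
    "Korean": 2064, "Russian": 2069, "Italian": 2070,
}


def resolve_speaker(speaker, text_language):
    """Return the token IDs for a request and whether the language is a mismatch."""
    speaker_language, speaker_token = PRESET_SPEAKERS[speaker]
    return {
        "speaker_token": speaker_token,
        "language_token": LANGUAGE_TOKENS[text_language],
        "language_mismatch": speaker_language != text_language,
    }


print(resolve_speaker("Ryan", "Chinese"))
print(resolve_speaker("Sohee", "Korean"))
```

A caller could log a warning when `language_mismatch` is true rather than rejecting the request, since mismatched combinations still produce output.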
CustomVoice accepts a natural language instruct string to control tone, emotion, and prosody. The instruction is prepended to the text input in ChatML format and interpreted by the Talker transformer.
Token format:
<|im_start|>user\n{instruct}<|im_end|>\n
This is prepended before the role/text embeddings in the prefill sequence:
[instruct_embeddings | role_embed | text_embeddings | trailing_tokens]
Default instruct: When no instruct is provided and the CustomVoice model is loaded, "Speak naturally." is applied automatically. This prevents the model from producing rambling or unfocused output, especially for short texts.
Example instructions:
| Instruction | Effect |
|---|---|
| `"Speak naturally."` | Default: neutral, clear speech |
| `"Speak in a cheerful, upbeat tone"` | Happy, energetic delivery |
| `"Read this slowly and solemnly"` | Slow pacing, serious tone |
| `"Whisper this softly"` | Quiet, breathy voice |
| `"Speak with excitement and energy"` | Fast, enthusiastic delivery |
| `"Read this as a news anchor"` | Professional, measured cadence |
| `"Speak gently, as if to a child"` | Soft, warm, simple phrasing |
Notes:
- The Base model does not support instruct — the parameter is ignored
- Instruct works with all synthesis modes: basic, streaming, and batch
- Different instructions produce measurably different audio (verified in tests)
- Keep instructions concise (1-2 sentences) for best results
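The instruct handling described above can be sketched as a string-level helper. `instruct_prefix` is a hypothetical name, and the sketch stops at the ChatML text; tokenization and embedding are left to the model code.

```python
DEFAULT_INSTRUCT = "Speak naturally."


def instruct_prefix(instruct, variant):
    """Build the ChatML instruct block that precedes the role/text tokens."""
    if variant != "CustomVoice":
        return ""                      # the Base model ignores instruct
    text = instruct if instruct else DEFAULT_INSTRUCT
    return f"<|im_start|>user\n{text}<|im_end|>\n"


print(repr(instruct_prefix(None, "CustomVoice")))
print(repr(instruct_prefix("Whisper this softly", "CustomVoice")))
print(repr(instruct_prefix("Whisper this softly", "Base")))
```

Treating an empty string the same as a missing instruct keeps the default behavior consistent for callers that pass `""`.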
Base model (6 tokens):
[think, think_bos, language_id, think_eos, pad, bos]
CustomVoice with speaker (7 tokens):
[think, think_bos, language_id, think_eos, pad, bos, speaker_token]
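The two layouts can be assembled with one helper. The IDs for `think`, `think_bos`, and `think_eos` are not listed in this document, so the sketch takes them as a caller-supplied tuple (for example, read from the model config); `build_codec_prefix` is a hypothetical name.

```python
CODEC_PAD = 2148
CODEC_BOS = 2149


def build_codec_prefix(language_token, think_ids, speaker_token=None):
    """Return the codec prefix token IDs for Base (6) or CustomVoice (7)."""
    think, think_bos, think_eos = think_ids   # supplied by the caller, not known here
    prefix = [think, think_bos, language_token, think_eos, CODEC_PAD, CODEC_BOS]
    if speaker_token is not None:
        prefix.append(speaker_token)          # CustomVoice adds one speaker token
    return prefix
```

For example, English text with the Ryan preset would pass `language_token=2050` and `speaker_token=3061`, yielding a 7-token prefix.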
ECAPA-TDNN network extracting speaker embeddings from reference audio for x-vector voice cloning. The CustomVoice model does not use the speaker encoder — voices are selected by token ID.
| Parameter | Value |
|---|---|
| Input | 128-bin mel spectrogram (24kHz, n_fft=1024, hop=256) |
| Output | 1024-dim speaker embedding |
| Architecture | TDNN(128→512) → 3x SE-Res2Net(512) → MFA(1536) → ASP → FC(3072→1024) |
| Channels | [512, 512, 512, 512, 1536] |
| Weights | 76 tensors in speaker_encoder.* from model.safetensors |
| File | Sources/Qwen3TTS/SpeakerEncoder.swift |
The speaker embedding is injected between the think tokens and pad/bos in the codec prefix:
Without cloning: [think, think_bos, lang_id, think_eos, pad, bos]
With cloning: [think, think_bos, lang_id, think_eos, SPEAKER_EMBED, pad, bos]
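The injection amounts to a list splice at a fixed position. In this sketch each element stands for one embedding vector of the prefill sequence; `inject_speaker_embedding` is a hypothetical helper, and the real code would operate on arrays rather than Python lists.

```python
def inject_speaker_embedding(prefix_embeddings, speaker_embedding):
    """Insert the speaker embedding after think_eos and before pad/bos."""
    if len(prefix_embeddings) != 6:
        raise ValueError("expected the 6-position prefix without cloning")
    # Positions 0..3 are [think, think_bos, lang_id, think_eos]; 4..5 are [pad, bos].
    return prefix_embeddings[:4] + [speaker_embedding] + prefix_embeddings[4:]


names = ["think", "think_bos", "lang_id", "think_eos", "pad", "bos"]
print(inject_speaker_embedding(names, "SPEAKER_EMBED"))
```

Because the speaker encoder outputs a 1024-dim vector, it can occupy one sequence position next to the 1024-dim codec embeddings of the 0.6B model without a projection; whether a projection is used for other sizes is not stated here.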
| Parameter | Talker | Code Predictor |
|---|---|---|
| Sampling | Yes | Yes |
| Temperature | 0.9 | 0.9 |
| Top-k | 50 | 50 |
| Top-p | 1.0 | 1.0 |
| Repetition penalty | 1.05 | - |
| Max tokens | 8192 | - |
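A minimal pure-Python sketch of one sampling step with these defaults. The order of operations (repetition penalty, then temperature, then top-k, then top-p) and the sign-aware form of the penalty are assumptions borrowed from common LM samplers, not confirmed for this port; `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, history=(), temperature=0.9, top_k=50, top_p=1.0,
                 repetition_penalty=1.05, rng=random):
    """Sample one token ID from raw logits."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(history):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    logits = [value / temperature for value in logits]
    # Top-k: keep the k highest-scoring candidates, best first.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = logits[ranked[0]]
    weights = [math.exp(logits[i] - peak) for i in ranked]   # stable softmax numerators
    # Top-p: keep the smallest prefix whose probability mass reaches top_p.
    total = sum(weights)
    kept_ids, kept_weights, mass = [], [], 0.0
    for i, w in zip(ranked, weights):
        kept_ids.append(i)
        kept_weights.append(w)
        mass += w / total
        if mass >= top_p:
            break
    return rng.choices(kept_ids, weights=kept_weights, k=1)[0]


rng = random.Random(0)
talker_token = sample_token([0.3, 2.0, -1.0, 1.5], history=[1], rng=rng)
# The Code Predictor row lists no repetition penalty, so 1.0 disables it.
predictor_token = sample_token([0.3, 2.0, -1.0, 1.5], repetition_penalty=1.0, rng=rng)
print(talker_token, predictor_token)
```

With `top_p=1.0` the nucleus filter keeps essentially every top-k candidate, so the effective constraint in the table is top-k 50 at temperature 0.9.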
| File | Unquantized | 4-bit Quantized | Purpose |
|---|---|---|---|
| `model.safetensors` | 1.83 GB | 977 MB | Talker + Code Predictor + Speaker Encoder + text embeddings |
| `speech_tokenizer/model.safetensors` | 682 MB | 651 MB (float32) | Audio codec (encoder + decoder + RVQ) |
| `config.json` | 4.5 kB | 4.5 kB | Main model config |
| `speech_tokenizer/config.json` | 2.3 kB | 2.3 kB | Codec config |
| `vocab.json` + `merges.txt` | 4.5 MB | 4.5 MB | BPE tokenizer |
Total: ~2.5 GB unquantized, ~1.6 GB 4-bit quantized (speech tokenizer stays float32).
Pre-converted MLX weights: aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit