Reference for Swift MLX port. Based on Qwen3-TTS-12Hz-0.6B. Speech tokenizer decoder based on Mimi (Kyutai).
For the tokenizer-free 48 kHz multilingual backend, see VoxCPM2.
Text -> [Prepare] -> [Talker] -> [Code Predictor] -> [Codec Decode] -> Audio (24kHz)
<1% ~55% ~40% ~5%
Input text + language tag
|
v
Chat template formatting:
<|im_start|>assistant\n{text}<|im_end|>\n<|im_start|>assistant\n
|
v
Qwen2 BPE tokenizer -> token IDs
|
v
text_embedding(token_ids) [151936, 2048]
|
v
Text projection MLP: Linear(2048, 2048) -> SiLU -> Linear(2048, 1024)
|
v
Text hidden states [B, T_text, 1024]
When an instruct string is provided (or the default "Speak naturally." is applied), it is tokenized and prepended to the prefill embeddings:
Instruct text (e.g. "Speak in a cheerful, upbeat tone")
|
v
ChatML format:
<|im_start|>user\n{instruct}<|im_end|>\n
token IDs: [151644, 872, 198, ...instruct tokens..., 151645, 198]
|
v
text_embedding -> text_projection
|
v
Instruct hidden states [B, T_instruct, 1024]
The instruct embeddings are concatenated at the start of the prefill sequence:
Without instruct: [role_embed | text_embeddings | trailing_tokens]
With instruct: [instruct_embeddings | role_embed | text_embeddings | trailing_tokens]
This positions the style instruction early in the context so the Talker's attention can condition all subsequent generation on it.
Before autoregressive generation, a 6-token codec prefix is constructed:
[think, think_bos, language_id, think_eos, pad, bos]
codec_think: 2151,codec_think_bos: 2153,codec_think_eos: 2154codec_pad: 2148,codec_bos: 2149language_id: e.g. 2050 (English), 2052 (German), 2054 (Spanish), 2055 (Chinese), 2058 (Japanese), 2061 (French), 2064 (Korean), 2069 (Russian), 2070 (Italian)
CustomVoice model: When a speaker is selected, a 7th token (the speaker token ID) is appended to the codec prefix. See
tts-model.md→ Model Variants for the full speaker list and language mappings.
Combined embeddings = [codec_prefix | text_embeddings | trailing_tokens]
|
v
28-layer Qwen3 transformer (autoregressive, MRoPE, GQA, KV cache)
|
v
codec_head (Linear -> 3072 vocab) -> logits
|
v
Sampling: temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05
NOTE: Higher temperature (0.8–0.9) is critical for non-English quality.
Low temperature (≤0.3) produces degenerate/looping output.
|
v
First codebook token sequence (until codec_eos = 2150)
MRoPE position tracking: Three separate position counters (temporal, height, width) are maintained and incremented according to the token type (text vs codec).
For each generated first-codebook token:
Hidden state from Talker
|
v
For codebook_group in 2..16:
5-layer transformer (standard RoPE)
|
v
lm_head[group] -> sample token for this codebook
Feed predicted embedding back for next codebook group
|
v
Result: 16 codebook tokens per time step
16 x T codebook indices
|
v
SplitRVQ.decode():
semantic_quantizer.decode(codebook_1) -> [T, 512]
acoustic_quantizer.decode(codebooks_2-16) -> [T, 512]
sum embeddings -> [T, 512]
|
v
Pre-conv (CausalConv1d, k=3) -> [T, 1024]
|
v
Pre-transformer (8 layers, causal, RoPE, SwiGLU+LayerScale):
input_proj: 1024 -> 512 bottleneck
8x DecoderTransformerLayer (hidden=512)
output_proj: 512 -> 1024 -> [T, 1024]
|
v
Pre-upsample (TransposedConv1d 2x + ConvNeXt) x2 = 4x -> [4T, 1024]
|
v
Input conv -> [4T, 1536]
|
v
SEANet decoder blocks (SnakeBeta + TransposedConv1d + residual units):
[4T, 1536] -> 8x -> 5x -> 4x -> 3x = 480x
Total: 4 * 480 = 1920x upsample (12.5 Hz -> 24000 Hz)
|
v
Audio waveform [1, T*1920, 1] at 24kHz
| Qwen3-TTS (release) | Apple TTS | |
|---|---|---|
| RTF (long text) | ~0.7 | ~0.02 |
| Latency (6s audio) | 3.9s | 0.17s |
| Speech quality | Natural, expressive | Robotic, monotone |
| Voice cloning | Yes (x-vector) | No |
| Languages | EN/ZH/DE/JA/ES/FR/KO/RU/IT | 60+ |
| On-device | Yes (MLX) | Yes (AVFoundation) |
| Model size | ~1.7 GB | Built-in |
- Chunked codec decoding — Codec frames processed in overlapping chunks (
chunkSize=25, leftContext=10), reducing O(T²) attention to O(chunk²) - Batch embedding lookups — All 15 codebook group embeddings summed in one call per step
- Bulk float extraction — Waveform extracted via single
.asArray(Float.self)call - Causal mask in decoder transformer — Additive causal mask for pre-transformer attention (required for chunked decoding correctness)
- Lazy code predictor chain — 15 sequential codebook predictions use lazy MLXArray (Gumbel-max trick) with a single
eval()at the end, reducing 15 GPU sync barriers to 1 per timestep - Compiled talker + code predictor —
compile(shapeless: true)for the 28-layer talker (growing KV cache);compile(shapeless: false)for the 5-layer code predictor (14 fixed cache sizes)
The Talker, Code Predictor, and Mimi decoder are all fully causal, enabling chunk-by-chunk audio emission during generation.
[Prefill] → emit first chunk → [Generate + emit subsequent chunks] → final chunk
~25ms ~120ms 53ms/step
synthesizeStream() returns AsyncThrowingStream<AudioChunk> (from AudioCommon):
- First chunk — emitted after
firstChunkFramestokens (default 3, or 1 for low-latency) - Subsequent chunks — emitted every
chunkFramestokens (default 25 = 2s audio) - Codec decode — each chunk runs the Mimi decoder with left-context overlap for quality
When firstChunkFrames < 4 (the codec's minimum input size due to ConvNeXt kernel=7 after 2x pre-upsample), the decoder input is zero-padded on the left. The decoder is fully causal, so zero frames produce silence. Output is trimmed from the right to keep exactly realChunkFrames × 1920 samples.
| Config | First Chunk | Latency |
|---|---|---|
| Default (3 frames) | 240ms audio | ~225ms |
| Low-latency (1 frame) | 80ms audio | ~120ms |
Two modes are possible. The x-vector mode is implemented; ICL mode is planned.
Extracts a speaker embedding from reference audio and injects it into the codec prefix:
Reference audio (24kHz)
|
+---> 128-bin Mel Spectrogram (n_fft=1024, hop=256, fmin=0, fmax=12000)
|
+---> Speaker Encoder (ECAPA-TDNN) -> 1024-dim x-vector
(injected between think tokens and pad/bos in codec prefix)
Codec prefix with speaker embedding:
[think, think_bos, lang_id, think_eos, SPEAKER_EMBED, pad, bos]
^^^^^^^^^^^^^^
raw 1024-dim vector
.build/release/speech speak "Hello world" --voice-sample reference.wav --output cloned.wavIn-Context Learning mode for higher quality voice cloning. Encodes reference audio into codec tokens via the Mimi speech tokenizer encoder and prepends them with the reference transcript.
Reference Audio (24kHz)
+---> Mimi Encoder (SEANet downsample → transformer → RVQ encode) → [1, 16, T_ref]
+---> Speaker Encoder (ECAPA-TDNN) → 1024-dim x-vector
Prefill: [role] [codec_prefix + speaker] [ref_text + target_text + codec_pad] [codec_bos + ref_codecs]
+---> Talker (autoregressive) → new codec tokens
+---> Code Predictor (15 remaining codebooks)
+---> Mimi Decoder → waveform
let (model, encoder) = try await Qwen3TTSModel.fromPretrainedWithEncoder()
let audio = model.synthesizeWithVoiceCloneICL(
text: "Target text",
referenceAudio: refSamples,
referenceSampleRate: 24000,
referenceText: "Exact transcript of reference.",
language: "english",
codecEncoder: encoder
)ICL fixes EOS failure on short texts and non-English languages (e.g. German) that occur with x-vector-only mode.
Both synthesizeWithVoiceClone (x-vector) and synthesizeWithVoiceCloneICL (ICL) cache their per-reference preprocessing across calls on the same model instance:
- x-vector mode caches the ECAPA-TDNN speaker embedding
[1, 1024]. - ICL mode additionally caches the Mimi codec encoder output
[1, 16, T_ref].
The cache is content-addressed — keyed by a hash of the raw sample buffer and its sample rate. Repeated synthesis against the same reference waveform skips the mel + speaker encoder pass (and the codec encoder pass for ICL). Capacity is a bounded LRU (default 4 entries) to keep memory predictable when cycling through multiple reference voices.
let tts = try await Qwen3TTSModel.fromPretrained()
// First call: runs ECAPA-TDNN, caches the embedding
_ = tts.synthesizeWithVoiceClone(text: "Hello", referenceAudio: ref, ...)
// Subsequent calls with the same reference: cache hit (logs "Speaker embedding: cache hit")
_ = tts.synthesizeWithVoiceClone(text: "How are you?", referenceAudio: ref, ...)
// Explicit eviction (rarely needed — LRU handles capacity)
tts.clearReferenceAudioCache()The CoreML backend uses 6 ANE-optimized models for on-device TTS inference:
Text -> [TextProjector] -> embeddings
+ CodeEmbedder(codec_tokens) ─> [CodeDecoder] -> CB0 logits
|
v
[MultiCodeDecoder] -> CB1-15
|
v
[SpeechDecoder] -> 24kHz audio
| Model | Description | I/O | Quantization |
|---|---|---|---|
| TextProjector | text token → embedding | int32 → [1,1024,1,1] |
W16A16 |
| CodeEmbedder | codec token → embedding | int32 → [1,1024,1,1] |
W16A16 |
| MultiCodeEmbedder | CB1-15 token → embedding (linearized) | int32 → [1,1024,1,1] |
W16A16 |
| CodeDecoder | 28-layer transformer, scatter-write KV | NCHW, KV [1,28672,1,256] |
W8A16 |
| MultiCodeDecoder | 5-layer CP transformer, 15 lm_heads | NCHW, KV [1,5120,1,16] |
W8A16 |
| SpeechDecoder | Batch vocoder (T=125) | [1,16,125] → audio |
W8A16 |
- NCHW layout: All embeddings and transformer I/O use
[batch, channels, 1, seq]format for ANE - Scatter-write KV cache: Fixed-size pre-allocated cache with one-hot position mask
- Non-streaming prefill: All text tokens processed in prefill, decode only feeds codec+pad
- Speaker embedding: Required 1024-dim x-vector for voice identity (critical for FP16 quality)
- Actual model layers: Converted via
torch.jit.traceon the real model, not reimplementation
python scripts/convert_qwen3_tts_coreml.py \
--model-id Qwen/Qwen3-TTS-12Hz-0.6B-Base \
--output-dir models/Qwen3-TTS-CoreML \
--quantize-w8let model = try await Qwen3TTSCoreMLModel.fromPretrained()
let audio = try model.synthesize(text: "Hello world", language: "english")speech speak "Hello world" --engine coreml --output hello.wav