Skip to content

Latest commit

 

History

History
337 lines (263 loc) · 12 KB

File metadata and controls

337 lines (263 loc) · 12 KB

TTS Inference Pipeline (Qwen3-TTS)

Reference for Swift MLX port. Based on Qwen3-TTS-12Hz-0.6B. Speech tokenizer decoder based on Mimi (Kyutai).

For the tokenizer-free 48 kHz multilingual backend, see VoxCPM2.

Overview

Text -> [Prepare] -> [Talker] -> [Code Predictor] -> [Codec Decode] -> Audio (24kHz)
         <1%          ~55%         ~40%                ~5%

Stage 1: Text Preparation

Input text + language tag
    |
    v
Chat template formatting:
    <|im_start|>assistant\n{text}<|im_end|>\n<|im_start|>assistant\n
    |
    v
Qwen2 BPE tokenizer -> token IDs
    |
    v
text_embedding(token_ids)  [151936, 2048]
    |
    v
Text projection MLP:  Linear(2048, 2048) -> SiLU -> Linear(2048, 1024)
    |
    v
Text hidden states [B, T_text, 1024]

Stage 1b: Instruct Preparation (CustomVoice only)

When an instruct string is provided (or the default "Speak naturally." is applied), it is tokenized and prepended to the prefill embeddings:

Instruct text (e.g. "Speak in a cheerful, upbeat tone")
    |
    v
ChatML format:
    <|im_start|>user\n{instruct}<|im_end|>\n
    token IDs: [151644, 872, 198, ...instruct tokens..., 151645, 198]
    |
    v
text_embedding -> text_projection
    |
    v
Instruct hidden states [B, T_instruct, 1024]

The instruct embeddings are concatenated at the start of the prefill sequence:

Without instruct: [role_embed | text_embeddings | trailing_tokens]
With instruct:    [instruct_embeddings | role_embed | text_embeddings | trailing_tokens]

This positions the style instruction early in the context so the Talker's attention can condition all subsequent generation on it.

Stage 2: Codec Prefix

Before autoregressive generation, a 6-token codec prefix is constructed:

[think, think_bos, language_id, think_eos, pad, bos]
  • codec_think: 2151, codec_think_bos: 2153, codec_think_eos: 2154
  • codec_pad: 2148, codec_bos: 2149
  • language_id: e.g. 2050 (English), 2052 (German), 2054 (Spanish), 2055 (Chinese), 2058 (Japanese), 2061 (French), 2064 (Korean), 2069 (Russian), 2070 (Italian)

CustomVoice model: When a speaker is selected, a 7th token (the speaker token ID) is appended to the codec prefix. See tts-model.md → Model Variants for the full speaker list and language mappings.

Stage 3: Talker Generation (First Codebook)

Combined embeddings = [codec_prefix | text_embeddings | trailing_tokens]
    |
    v
28-layer Qwen3 transformer (autoregressive, MRoPE, GQA, KV cache)
    |
    v
codec_head (Linear -> 3072 vocab) -> logits
    |
    v
Sampling: temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05
    NOTE: Higher temperature (0.8–0.9) is critical for non-English quality.
          Low temperature (≤0.3) produces degenerate/looping output.
    |
    v
First codebook token sequence (until codec_eos = 2150)

MRoPE position tracking: Three separate position counters (temporal, height, width) are maintained and incremented according to the token type (text vs codec).

Stage 4: Code Predictor (Remaining 15 Codebooks)

For each generated first-codebook token:

Hidden state from Talker
    |
    v
For codebook_group in 2..16:
    5-layer transformer (standard RoPE)
        |
        v
    lm_head[group] -> sample token for this codebook
    Feed predicted embedding back for next codebook group
    |
    v
Result: 16 codebook tokens per time step

Stage 5: Codec Decode (Speech Tokenizer)

16 x T codebook indices
    |
    v
SplitRVQ.decode():
    semantic_quantizer.decode(codebook_1)     -> [T, 512]
    acoustic_quantizer.decode(codebooks_2-16) -> [T, 512]
    sum embeddings                            -> [T, 512]
    |
    v
Pre-conv (CausalConv1d, k=3) -> [T, 1024]
    |
    v
Pre-transformer (8 layers, causal, RoPE, SwiGLU+LayerScale):
    input_proj: 1024 -> 512 bottleneck
    8x DecoderTransformerLayer (hidden=512)
    output_proj: 512 -> 1024                  -> [T, 1024]
    |
    v
Pre-upsample (TransposedConv1d 2x + ConvNeXt) x2 = 4x -> [4T, 1024]
    |
    v
Input conv -> [4T, 1536]
    |
    v
SEANet decoder blocks (SnakeBeta + TransposedConv1d + residual units):
    [4T, 1536] -> 8x -> 5x -> 4x -> 3x = 480x
    Total: 4 * 480 = 1920x upsample (12.5 Hz -> 24000 Hz)
    |
    v
Audio waveform [1, T*1920, 1] at 24kHz

vs Apple AVSpeechSynthesizer (M2 Max, 64 GB)

Qwen3-TTS (release) Apple TTS
RTF (long text) ~0.7 ~0.02
Latency (6s audio) 3.9s 0.17s
Speech quality Natural, expressive Robotic, monotone
Voice cloning Yes (x-vector) No
Languages EN/ZH/DE/JA/ES/FR/KO/RU/IT 60+
On-device Yes (MLX) Yes (AVFoundation)
Model size ~1.7 GB Built-in

Implementation Notes

  • Chunked codec decoding — Codec frames processed in overlapping chunks (chunkSize=25, leftContext=10), reducing O(T²) attention to O(chunk²)
  • Batch embedding lookups — All 15 codebook group embeddings summed in one call per step
  • Bulk float extraction — Waveform extracted via single .asArray(Float.self) call
  • Causal mask in decoder transformer — Additive causal mask for pre-transformer attention (required for chunked decoding correctness)
  • Lazy code predictor chain — 15 sequential codebook predictions use lazy MLXArray (Gumbel-max trick) with a single eval() at the end, reducing 15 GPU sync barriers to 1 per timestep
  • Compiled talker + code predictorcompile(shapeless: true) for the 28-layer talker (growing KV cache); compile(shapeless: false) for the 5-layer code predictor (14 fixed cache sizes)

Streaming Synthesis

The Talker, Code Predictor, and Mimi decoder are all fully causal, enabling chunk-by-chunk audio emission during generation.

[Prefill] → emit first chunk → [Generate + emit subsequent chunks] → final chunk
    ~25ms        ~120ms                53ms/step

synthesizeStream() returns AsyncThrowingStream<AudioChunk> (from AudioCommon):

  1. First chunk — emitted after firstChunkFrames tokens (default 3, or 1 for low-latency)
  2. Subsequent chunks — emitted every chunkFrames tokens (default 25 = 2s audio)
  3. Codec decode — each chunk runs the Mimi decoder with left-context overlap for quality

Zero-Pad Decode

When firstChunkFrames < 4 (the codec's minimum input size due to ConvNeXt kernel=7 after 2x pre-upsample), the decoder input is zero-padded on the left. The decoder is fully causal, so zero frames produce silence. Output is trimmed from the right to keep exactly realChunkFrames × 1920 samples.

First-Packet Latency (M2 Max, release)

Config First Chunk Latency
Default (3 frames) 240ms audio ~225ms
Low-latency (1 frame) 80ms audio ~120ms

Voice Cloning

Two modes are possible. The x-vector mode is implemented; ICL mode is planned.

X-Vector Mode (Implemented)

Extracts a speaker embedding from reference audio and injects it into the codec prefix:

Reference audio (24kHz)
    |
    +---> 128-bin Mel Spectrogram (n_fft=1024, hop=256, fmin=0, fmax=12000)
    |
    +---> Speaker Encoder (ECAPA-TDNN) -> 1024-dim x-vector
          (injected between think tokens and pad/bos in codec prefix)

Codec prefix with speaker embedding:

[think, think_bos, lang_id, think_eos, SPEAKER_EMBED, pad, bos]
                                        ^^^^^^^^^^^^^^
                                        raw 1024-dim vector
.build/release/speech speak "Hello world" --voice-sample reference.wav --output cloned.wav

ICL Mode

In-Context Learning mode for higher quality voice cloning. Encodes reference audio into codec tokens via the Mimi speech tokenizer encoder and prepends them with the reference transcript.

Reference Audio (24kHz)
    +---> Mimi Encoder (SEANet downsample → transformer → RVQ encode) → [1, 16, T_ref]
    +---> Speaker Encoder (ECAPA-TDNN) → 1024-dim x-vector

Prefill: [role] [codec_prefix + speaker] [ref_text + target_text + codec_pad] [codec_bos + ref_codecs]
    +---> Talker (autoregressive) → new codec tokens
    +---> Code Predictor (15 remaining codebooks)
    +---> Mimi Decoder → waveform
let (model, encoder) = try await Qwen3TTSModel.fromPretrainedWithEncoder()
let audio = model.synthesizeWithVoiceCloneICL(
    text: "Target text",
    referenceAudio: refSamples,
    referenceSampleRate: 24000,
    referenceText: "Exact transcript of reference.",
    language: "english",
    codecEncoder: encoder
)

ICL fixes EOS failure on short texts and non-English languages (e.g. German) that occur with x-vector-only mode.

Reference Audio Caching

Both synthesizeWithVoiceClone (x-vector) and synthesizeWithVoiceCloneICL (ICL) cache their per-reference preprocessing across calls on the same model instance:

  • x-vector mode caches the ECAPA-TDNN speaker embedding [1, 1024].
  • ICL mode additionally caches the Mimi codec encoder output [1, 16, T_ref].

The cache is content-addressed — keyed by a hash of the raw sample buffer and its sample rate. Repeated synthesis against the same reference waveform skips the mel + speaker encoder pass (and the codec encoder pass for ICL). Capacity is a bounded LRU (default 4 entries) to keep memory predictable when cycling through multiple reference voices.

let tts = try await Qwen3TTSModel.fromPretrained()

// First call: runs ECAPA-TDNN, caches the embedding
_ = tts.synthesizeWithVoiceClone(text: "Hello", referenceAudio: ref, ...)

// Subsequent calls with the same reference: cache hit (logs "Speaker embedding: cache hit")
_ = tts.synthesizeWithVoiceClone(text: "How are you?", referenceAudio: ref, ...)

// Explicit eviction (rarely needed — LRU handles capacity)
tts.clearReferenceAudioCache()

CoreML Backend (Neural Engine)

The CoreML backend uses 6 ANE-optimized models for on-device TTS inference:

Architecture

Text -> [TextProjector] -> embeddings
                            + CodeEmbedder(codec_tokens) ─> [CodeDecoder] -> CB0 logits
                                                                |
                                                                v
                                                        [MultiCodeDecoder] -> CB1-15
                                                                |
                                                                v
                                                        [SpeechDecoder] -> 24kHz audio

6 CoreML Models

Model Description I/O Quantization
TextProjector text token → embedding int32 → [1,1024,1,1] W16A16
CodeEmbedder codec token → embedding int32 → [1,1024,1,1] W16A16
MultiCodeEmbedder CB1-15 token → embedding (linearized) int32 → [1,1024,1,1] W16A16
CodeDecoder 28-layer transformer, scatter-write KV NCHW, KV [1,28672,1,256] W8A16
MultiCodeDecoder 5-layer CP transformer, 15 lm_heads NCHW, KV [1,5120,1,16] W8A16
SpeechDecoder Batch vocoder (T=125) [1,16,125] → audio W8A16

Key Design Patterns

  • NCHW layout: All embeddings and transformer I/O use [batch, channels, 1, seq] format for ANE
  • Scatter-write KV cache: Fixed-size pre-allocated cache with one-hot position mask
  • Non-streaming prefill: All text tokens processed in prefill, decode only feeds codec+pad
  • Speaker embedding: Required 1024-dim x-vector for voice identity (critical for FP16 quality)
  • Actual model layers: Converted via torch.jit.trace on the real model, not reimplementation

Conversion

python scripts/convert_qwen3_tts_coreml.py \
    --model-id Qwen/Qwen3-TTS-12Hz-0.6B-Base \
    --output-dir models/Qwen3-TTS-CoreML \
    --quantize-w8

Usage (Swift)

let model = try await Qwen3TTSCoreMLModel.fromPretrained()
let audio = try model.synthesize(text: "Hello world", language: "english")

CLI

speech speak "Hello world" --engine coreml --output hello.wav