Commit 18542bf

docs(moss-tts): document MOSS-TTS-Local variant in cookbook
Add the MOSS-TTS-Local-Transformer-v1.5 variant to the Prerequisites and Server Configuration sections of the MOSS-TTS cookbook: download command, the trust_remote_code codec note (MOSS-Audio-Tokenizer-v2), the two-GPU default launch (AR engine on cuda:0, codec on cuda:1), and the single-GPU colocated config option.
1 parent 8e272bc commit 18542bf

1 file changed

Lines changed: 39 additions & 4 deletions

docs/cookbook/moss_tts.md

@@ -9,21 +9,40 @@ duration control, and the vocoder reconstructs 24 kHz speech. In SGLang-Omni it
 `preprocessing → tts_engine → vocoder` pipeline and is served through the OpenAI-compatible
 `/v1/audio/speech` endpoint.
 
+SGLang-Omni serves two MOSS-TTS checkpoints:
+
+- **`MOSS-TTS-v1.5`** (default) — the delay-pattern model described above; a single-GPU
+`preprocessing → tts_engine → vocoder` pipeline.
+- **`MOSS-TTS-Local-Transformer-v1.5`** — the "Local" variant, which pairs the Qwen3 backbone
+with a local frame transformer and the `MOSS-Audio-Tokenizer-v2` codec. Its pipeline defaults
+to **two GPUs** (the ~1B-param codec runs on a second device so it does not starve the AR
+engine). It exposes the same `/v1/audio/speech` request shape and generation parameters as the
+delay model, so the synthesis examples below apply to both.
+
 ## Prerequisites
 
 Install `sglang-omni` by following [Installation](../get_started/installation.md), then
-download the model (public, no token required):
+download the checkpoint for the variant you want (both are public, no token required):
 
 ```bash
+# Delay model (single GPU)
 hf download OpenMOSS-Team/MOSS-TTS-v1.5
+
+# Local variant (defaults to two GPUs)
+hf download OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5
 ```
 
-The processor ships with the checkpoint, so no extra TTS package is needed. Decoding base64
-(data-URI) reference audio additionally requires `soundfile` (`uv pip install soundfile`).
+The processor ships with each checkpoint, so no extra TTS package is needed. The Local variant
+loads its codec (`OpenMOSS-Team/MOSS-Audio-Tokenizer-v2`) through `trust_remote_code`, which is
+fetched automatically on first launch. Decoding base64 (data-URI) reference audio additionally
+requires `soundfile` (`uv pip install soundfile`).
 
 ## Server Configuration
 
-The pipeline is `preprocessing → tts_engine → vocoder`.
+Both variants serve the same `preprocessing → tts_engine → vocoder` pipeline; pick the config
+that matches the checkpoint.
+
+### MOSS-TTS-v1.5 (delay model, single GPU)
 
 ```bash
 sgl-omni serve \
@@ -32,6 +51,22 @@ sgl-omni serve \
 --port 8000
 ```
 
+### MOSS-TTS-Local-Transformer-v1.5 (two GPUs)
+
+The Local config places the AR engine on `cuda:0` and the codec on `cuda:1`, so launch it on a
+host with at least two visible GPUs:
+
+```bash
+sgl-omni serve \
+--model-path OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 \
+--config examples/configs/moss_tts_local.yaml \
+--port 8000
+```
+
+To run the Local variant on a single GPU, colocate the codec with the AR engine by setting
+`config_cls: MossTTSLocalColocatedPipelineConfig` in a copy of `moss_tts_local.yaml` (this packs
+the codec and AR engine onto `cuda:0`, at the cost of throughput under concurrency).
+
 ## Synthesizing Speech
 
 ### Basic Speech
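
The choice the new section describes for the Local variant (two-GPU default config versus a colocated single-GPU copy) can be sketched as a small helper. This is a minimal illustration, not part of the commit: the file name `moss_tts_local_colocated.yaml` and the helper names `colocate` and `serve_command` are hypothetical, and the sketch assumes `config_cls` sits on its own line in `moss_tts_local.yaml`. Only the `sgl-omni serve` flags, the model ID, the two-GPU config path, and the class name come from the diff.

```python
import re
import shlex

MODEL = "OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5"
TWO_GPU_CONFIG = "examples/configs/moss_tts_local.yaml"
# Hypothetical path for the single-GPU copy; the commit does not name one.
COLOCATED_CONFIG = "moss_tts_local_colocated.yaml"
COLOCATED_CLS = "MossTTSLocalColocatedPipelineConfig"


def colocate(yaml_text: str) -> str:
    """Return yaml_text with the first config_cls value set to the colocated class.

    Assumption: the YAML holds a line of the form `config_cls: <name>`.
    """
    new_text, count = re.subn(
        r"^([ \t]*config_cls:[ \t]*).*$",
        lambda m: m.group(1) + COLOCATED_CLS,
        yaml_text,
        count=1,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise ValueError("no config_cls line found")
    return new_text


def serve_command(visible_gpus: int, port: int = 8000) -> list[str]:
    """Build the sgl-omni argv for the Local variant given the visible GPU count."""
    if visible_gpus < 1:
        raise ValueError("the Local variant needs at least one GPU")
    # Two or more GPUs: default config (AR engine on cuda:0, codec on cuda:1).
    # One GPU: the colocated copy, which packs both onto cuda:0.
    config = TWO_GPU_CONFIG if visible_gpus >= 2 else COLOCATED_CONFIG
    return [
        "sgl-omni", "serve",
        "--model-path", MODEL,
        "--config", config,
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(serve_command(visible_gpus=2)))
```

On a single-GPU host, one would write `colocate(...)` applied to the contents of `moss_tts_local.yaml` into the copy before launching with `serve_command(1)`. The GPU count is passed in explicitly here so the sketch does not depend on any particular CUDA library.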
