@@ -9,21 +9,40 @@ duration control, and the vocoder reconstructs 24 kHz speech. In SGLang-Omni it
 `preprocessing → tts_engine → vocoder` pipeline and is served through the OpenAI-compatible
 `/v1/audio/speech` endpoint.
 
+SGLang-Omni serves two MOSS-TTS checkpoints:
+
+- **`MOSS-TTS-v1.5`** (default): the delay-pattern model described above, served as a single-GPU
+  `preprocessing → tts_engine → vocoder` pipeline.
+- **`MOSS-TTS-Local-Transformer-v1.5`**: the "Local" variant, which pairs the Qwen3 backbone
+  with a local frame transformer and the `MOSS-Audio-Tokenizer-v2` codec. Its pipeline defaults
+  to **two GPUs** (the ~1B-param codec runs on a second device so it does not starve the AR
+  engine). It exposes the same `/v1/audio/speech` request shape and generation parameters as the
+  delay model, so the synthesis examples below apply to both.
+
 ## Prerequisites
 
 Install `sglang-omni` by following [Installation](../get_started/installation.md), then
-download the model (public, no token required):
+download the checkpoint for the variant you want (both are public, no token required):
 
 ```bash
+# Delay model (single GPU)
 hf download OpenMOSS-Team/MOSS-TTS-v1.5
+
+# Local variant (defaults to two GPUs)
+hf download OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5
 ```
 
-The processor ships with the checkpoint, so no extra TTS package is needed. Decoding base64
-(data-URI) reference audio additionally requires `soundfile` (`uv pip install soundfile`).
+The processor ships with each checkpoint, so no extra TTS package is needed. The Local variant
+loads its codec (`OpenMOSS-Team/MOSS-Audio-Tokenizer-v2`) with `trust_remote_code`; the codec is
+fetched automatically on first launch. Decoding base64 (data-URI) reference audio additionally
+requires `soundfile` (`uv pip install soundfile`).
 
 ## Server Configuration
 
-The pipeline is `preprocessing → tts_engine → vocoder`.
+Both variants serve the same `preprocessing → tts_engine → vocoder` pipeline; pick the config
+that matches the checkpoint.
+
+### MOSS-TTS-v1.5 (delay model, single GPU)
 
 ```bash
 sgl-omni serve \
@@ -32,6 +51,22 @@ sgl-omni serve \
   --port 8000
 ```
 
+### MOSS-TTS-Local-Transformer-v1.5 (two GPUs)
+
+The Local config places the AR engine on `cuda:0` and the codec on `cuda:1`, so launch it on a
+host with at least two visible GPUs:
+
+```bash
+sgl-omni serve \
+  --model-path OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 \
+  --config examples/configs/moss_tts_local.yaml \
+  --port 8000
+```
+
+To run the Local variant on a single GPU, set `config_cls: MossTTSLocalColocatedPipelineConfig`
+in a copy of `moss_tts_local.yaml`. This colocates the codec with the AR engine on `cuda:0`,
+at the cost of throughput under concurrency.
+
 ## Synthesizing Speech
 
 ### Basic Speech
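
The single-GPU note in the second hunk asks for a copy of `moss_tts_local.yaml` with `config_cls` changed. A minimal sketch of that step follows. It assumes the file contains a `config_cls:` line, since the YAML layout is not shown in this diff. The function name `make_colocated_config` and the output filename are hypothetical, not part of the PR.

```shell
# Sketch only: derive a single-GPU (colocated) config from the two-GPU Local config.
# Assumption: the source YAML has a line of the form "config_cls: <ClassName>".
# Usage: make_colocated_config <source.yaml> <destination.yaml>
make_colocated_config() {
  # Stop early if the key this sketch relies on is missing.
  if ! grep -Eq '^[[:space:]]*config_cls:' "$1"; then
    echo "no config_cls key found in $1" >&2
    return 1
  fi
  # Rewrite only the value and keep the indentation; the source file is left untouched.
  sed -E 's/^([[:space:]]*config_cls:[[:space:]]*).*$/\1MossTTSLocalColocatedPipelineConfig/' "$1" > "$2"
}
```

Calling `make_colocated_config examples/configs/moss_tts_local.yaml moss_tts_local_colocated.yaml` and passing the new file to `--config` would then match the launch command in the diff. Any other keys in the file are copied unchanged, so device placement set elsewhere in the YAML may still need review by hand.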