docs(moss-tts): document MOSS-TTS-Local variant in cookbook #782

Open: xinlij wants to merge 1 commit into main from docs/moss-tts-local-cookbook

Conversation

xinlij (Collaborator) commented Jun 14, 2026

Summary

Adds the MOSS-TTS-Local-Transformer-v1.5 variant to the MOSS-TTS cookbook (docs/cookbook/moss_tts.md), which previously only documented the delay model (MOSS-TTS-v1.5).

Prerequisites

  • Distinguishes the two checkpoints (single-GPU delay model vs. two-GPU Local variant).
  • Adds the hf download OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 command.
  • Notes the codec (MOSS-Audio-Tokenizer-v2, 48 kHz output) is loaded via the checkpoint's remote code, so trust_remote_code must be enabled.
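The download step above can be scripted. The following is a minimal sketch, assuming the Hugging Face CLI (`hf`) is installed and on `PATH`; the repository ID is the one quoted in this PR, while the helper names `build_download_command` and `download_checkpoint` are hypothetical and not part of the cookbook.

```python
import shutil
import subprocess

# Repository ID as quoted in the PR description.
LOCAL_REPO = "OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5"


def build_download_command(repo_id: str) -> list[str]:
    """Return the argv for `hf download <repo_id>` without running it."""
    return ["hf", "download", repo_id]


def download_checkpoint(repo_id: str) -> None:
    """Run the download; fails early if the `hf` CLI is missing (hypothetical helper)."""
    if shutil.which("hf") is None:
        raise RuntimeError("the `hf` CLI was not found on PATH")
    # check=True raises CalledProcessError if the download exits non-zero.
    subprocess.run(build_download_command(repo_id), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_download_command(LOCAL_REPO)))
```

Keeping the argv construction separate from the subprocess call makes the command easy to inspect or log before anything is fetched.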

Server Configuration

  • Splits into two subsections, one per checkpoint.
  • Documents the Local launch with examples/configs/moss_tts_local.yaml and its two-GPU default (AR engine on cuda:0, codec on cuda:1).
  • Notes the single-GPU option via config_cls: MossTTSLocalColocatedPipelineConfig.
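The relationship between the two-GPU default and the colocated option can be illustrated with a small sketch. Only the device strings (`cuda:0`, `cuda:1`) and the name `MossTTSLocalColocatedPipelineConfig` come from this PR; the classes, the `VARIANTS` table, and `lookup_config_cls` below are hypothetical stand-ins and do not reproduce the actual code in `sglang_omni`.

```python
from dataclasses import dataclass


@dataclass
class LocalPipelineSketch:
    """Hypothetical stand-in for the two-GPU default placement."""

    ar_device: str = "cuda:0"     # AR engine device, per the PR description
    codec_device: str = "cuda:1"  # codec device in the two-GPU default


@dataclass
class ColocatedPipelineSketch(LocalPipelineSketch):
    """Hypothetical stand-in for the single-GPU (colocated) variant."""

    codec_device: str = "cuda:0"  # codec shares the AR engine's GPU


# Hypothetical name-to-class table, mimicking the idea of a variants lookup.
VARIANTS: dict[str, type[LocalPipelineSketch]] = {
    "LocalPipelineSketch": LocalPipelineSketch,
    "MossTTSLocalColocatedPipelineConfig": ColocatedPipelineSketch,
}


def lookup_config_cls(name: str) -> type[LocalPipelineSketch]:
    """Resolve a config class from its name, with a readable error on a miss."""
    try:
        return VARIANTS[name]
    except KeyError:
        known = ", ".join(sorted(VARIANTS))
        raise ValueError(f"unknown config_cls {name!r}; known: {known}") from None


if __name__ == "__main__":
    cfg = lookup_config_cls("MossTTSLocalColocatedPipelineConfig")()
    print(cfg.ar_device, cfg.codec_device)
```

In this sketch the colocated variant differs from the default only in `codec_device`, which is why a single `config_cls` key is enough to switch between the two placements.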

The request shape and generation parameters are identical to the delay model, so the existing synthesis examples apply to both variants. Note that the output sample rate differs: 48 kHz for the Local variant vs. 24 kHz for the delay model.
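The sample-rate difference matters for clients that hard-code a rate when saving or playing audio. The sketch below uses only the Python standard library and assumes the client holds raw mono 16-bit PCM; the actual response format is not described in this PR, and the helper names are hypothetical.

```python
import io
import wave

# Output sample rates per variant, as stated in the PR description.
SAMPLE_RATES = {"local": 48_000, "delay": 24_000}


def duration_seconds(num_samples: int, variant: str) -> float:
    """Duration of a clip with num_samples frames at the variant's sample rate."""
    return num_samples / SAMPLE_RATES[variant]


def pcm16_to_wav_bytes(pcm: bytes, variant: str) -> bytes:
    """Wrap raw PCM in a WAV container tagged with the variant's sample rate.

    Assumption: mono, 16-bit little-endian PCM.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # assumed mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(SAMPLE_RATES[variant])
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    one_second_of_silence = b"\x00\x00" * SAMPLE_RATES["local"]
    wav_bytes = pcm16_to_wav_bytes(one_second_of_silence, "local")
    print(len(wav_bytes), duration_seconds(48_000, "local"))
```

Tagging 48 kHz samples with a 24 kHz header would make the clip play at half speed and twice the length, so the rate should be chosen per variant and not shared across both.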

Verified against the codebase

  • Architecture (36-layer Qwen3 backbone + 1-layer frame-local transformer): sglang_omni/models/moss_tts_local/sglang_model.py.
  • Two-GPU default (codec_device="cuda:1") vs. colocated (codec_device="cuda:0"): sglang_omni/models/moss_tts_local/config.py.
  • config_cls: MossTTSLocalColocatedPipelineConfig resolves via the Variants lookup in sglang_omni/models/registry.py (get_config_cls_by_name).
  • 48 kHz output: sglang_omni/models/moss_tts_local/payload_types.py (sample_rate = 48000).

Add the MOSS-TTS-Local-Transformer-v1.5 variant to the Prerequisites and
Server Configuration sections of the MOSS-TTS cookbook: download command,
the remote-code codec note (MOSS-Audio-Tokenizer-v2, 48 kHz output), the
two-GPU default launch (AR engine on cuda:0, codec on cuda:1), and the
single-GPU colocated config option.
xinlij force-pushed the docs/moss-tts-local-cookbook branch from 18542bf to 93fe10c, June 14, 2026 05:41
xinlij removed the request for review from yuan-luo, June 14, 2026 05:50
Comment thread docs/cookbook/moss_tts.md
@@ -1,2 +1,2 @@
# MOSS-TTS

Collaborator

For readability and discoverability, we might consider splitting this section into two dedicated subsections from the beginning: one for the Delay Pattern model and one for the Local Transformer model. Each subsection could describe the corresponding architecture, serving pipeline, hardware assumptions, and usage examples. This would make the tradeoff between the two token modeling patterns more explicit and avoid mixing model-specific details in a single flow.

2 participants