Hello MOSS-TTS team,
I am currently fully fine-tuning MOSS-TTS-8B on a 5,000-hour Arabic dataset. My goal is to build a foundational, highly fluent Arabic TTS model with robust zero-shot voice cloning.
My Journey & Problem:
Initially, I trained on 2–30-second clips, but I ran into the classic issue where the model clips early, stopping generation before finishing longer text prompts.
To fix this, I stitched my dataset into longer paragraphs (with durations randomized from 2 seconds up to 200 seconds). I am training with batch_size=1, gradient_accumulation_steps=128, and truncating max_seq_len at 10,000 tokens to avoid OOM on an H200.
This partially solved the early clipping, but the model still clips at ~50 seconds.
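For context, here is roughly how I do the stitching: consecutive clips are greedily packed into paragraphs until a randomly drawn target duration is reached. This is a simplified sketch of my own preprocessing (all names are illustrative, nothing here is from the MOSS-TTS repo):

```python
import random

def stitch_clips(clips, min_len=2.0, max_len=200.0):
    """Greedily pack consecutive (duration_sec, path) clips into paragraphs.

    A fresh target duration is drawn uniformly from [min_len, max_len]
    for each paragraph, so paragraph lengths span the whole range.
    A paragraph may overshoot its target by at most one clip.
    """
    paragraphs, current, total = [], [], 0.0
    target = random.uniform(min_len, max_len)
    for dur, path in clips:
        current.append(path)
        total += dur
        if total >= target:
            paragraphs.append((total, current))
            current, total = [], 0.0
            target = random.uniform(min_len, max_len)
    if current:  # flush the last partial paragraph
        paragraphs.append((total, current))
    return paragraphs
```

The text transcripts are concatenated in the same order, so each training sample remains a matched (long text, long audio) pair.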
My Questions regarding your "1 Hour" generation claim:
In the README, it states the model supports "continuous long-form speech generation for up to one hour."
- Training Sequence Length: Did you actually train on ultra-long sequences to achieve this (and if so, how did you handle the immense VRAM requirements)? Or was the model trained on shorter chunks (e.g., 10–30 s), relying on Qwen3's RoPE context extrapolation at inference time?
- Inference Strategy for Long Speech: Is the 1-hour generation achieved by passing the entire massive text prompt at once, or do you recommend an inference-time chunking strategy (e.g., generating paragraph-by-paragraph and using the trailing audio as the reference for the next chunk)?
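To make the second question concrete, the chunking strategy I have in mind looks like the sketch below. `tts_generate(text, ref_audio)` is a hypothetical stand-in for the real model call, and `tail_sec`/`sr` are assumed parameters, not values from your codebase:

```python
def synthesize_long(text_chunks, ref_audio, tts_generate,
                    tail_sec=3.0, sr=24000):
    """Chunked long-form synthesis sketch.

    Each paragraph is generated separately; the trailing `tail_sec`
    seconds of the previous output become the voice reference for the
    next chunk, so timbre and prosody carry over between chunks.
    `tts_generate(text, ref)` is a placeholder for the actual model.
    """
    pieces, ref = [], ref_audio
    for text in text_chunks:
        audio = tts_generate(text, ref)  # one paragraph at a time
        pieces.append(audio)
        tail = int(tail_sec * sr)
        ref = audio[-tail:] if len(audio) > tail else audio
    # concatenate all paragraph waveforms into one long output
    out = []
    for p in pieces:
        out.extend(p)
    return out
```

Is something along these lines what you recommend, or does the model handle the entire text prompt in a single forward pass?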
- Silent Rubbish Loops: Have you encountered the model hallucinating endless silence or noise on very long generations? Do you recommend strict repetition penalties, or forcing EOS via specific text-prompt structures?
Thank you for open-sourcing this. Any insight into your training length distribution versus your inference strategy would be massively helpful!