Hello MOSS-TTS team,
I am currently fully fine-tuning MOSS-TTS-8B on a 5,000-hour Arabic dataset. My goal is to build a foundational, highly fluent Arabic TTS model with robust zero-shot voice cloning.
My Journey & Problem:
Initially, I trained on 2–30-second clips, but I ran into the classic issue where the model clips early, stopping generation before finishing longer text prompts.
To fix this, I stitched my dataset into longer paragraphs (with durations randomized from 2 seconds up to 200 seconds). I am training with batch_size=1, gradient_accumulation_steps=128, and truncating max_seq_len at 10,000 tokens to avoid OOM on an H200.
This partially solved the early clipping, but the model still clips at ~50 seconds.
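For context, here is roughly how I do the stitching: consecutive clips are greedily packed into paragraphs until a randomly drawn target duration is reached. This is a simplified sketch of my own preprocessing (all names are illustrative, nothing here is from the MOSS-TTS repo):

```python
import random

def stitch_clips(clips, min_len=2.0, max_len=200.0):
    """Greedily pack consecutive (duration_sec, path) clips into paragraphs.

    A fresh target duration is drawn uniformly from [min_len, max_len]
    for each paragraph, so paragraph lengths span the whole range.
    A paragraph may overshoot its target by at most one clip.
    """
    paragraphs, current, total = [], [], 0.0
    target = random.uniform(min_len, max_len)
    for dur, path in clips:
        current.append(path)
        total += dur
        if total >= target:
            paragraphs.append((total, current))
            current, total = [], 0.0
            target = random.uniform(min_len, max_len)
    if current:  # flush the last partial paragraph
        paragraphs.append((total, current))
    return paragraphs
```

The text transcripts are concatenated in the same order, so each training sample remains a matched (long text, long audio) pair.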
My Questions regarding your "1 Hour" generation claim:
In the README, it states the model supports "continuous long-form speech generation for up to one hour."
- Training Sequence Length: Did you actually train on ultra-long sequences to achieve this (and if so, how did you handle the immense VRAM requirements)? Or was the model trained on shorter chunks (e.g., 10–30 s), relying on Qwen3's RoPE context extrapolation at inference time?
- Inference Strategy for Long Speech: Is the 1-hour generation achieved by passing the entire massive text prompt at once, or do you recommend an inference-time chunking strategy (e.g., generating paragraph-by-paragraph and using the trailing audio as the reference for the next chunk)?
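To make the second question concrete, the chunking strategy I have in mind looks like the sketch below. `tts_generate(text, ref_audio)` is a hypothetical stand-in for the real model call, and `tail_sec`/`sr` are assumed parameters, not values from your codebase:

```python
def synthesize_long(text_chunks, ref_audio, tts_generate,
                    tail_sec=3.0, sr=24000):
    """Chunked long-form synthesis sketch.

    Each paragraph is generated separately; the trailing `tail_sec`
    seconds of the previous output become the voice reference for the
    next chunk, so timbre and prosody carry over between chunks.
    `tts_generate(text, ref)` is a placeholder for the actual model.
    """
    pieces, ref = [], ref_audio
    for text in text_chunks:
        audio = tts_generate(text, ref)  # one paragraph at a time
        pieces.append(audio)
        tail = int(tail_sec * sr)
        ref = audio[-tail:] if len(audio) > tail else audio
    # concatenate all paragraph waveforms into one long output
    out = []
    for p in pieces:
        out.extend(p)
    return out
```

Is something along these lines what you recommend, or does the model handle the entire text prompt in a single forward pass?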
- Silent Rubbish Loops: Have you encountered the model hallucinating endless silence or noise on very long generations? Do you recommend strict repetition penalties, or forcing EOS via specific text-prompt structures?
Thank you for open-sourcing this. Any insight into your training length distribution versus your inference strategy would be massively helpful!