📝 Background
This bounty is for bringing up the CosyVoice model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
CosyVoice is a multi-lingual large voice generation model from Alibaba's FunAudioLLM that provides full-stack TTS capabilities. Key features include:
Multi-lingual support: Chinese, English, Japanese, Cantonese, and Korean
Multiple inference modes: SFT (supervised fine-tuning), zero-shot TTS, cross-lingual, instruct-based, and voice conversion
Scalable architecture: 300M parameter model with streaming capability
Supervised semantic tokens: Novel approach for controllable speech synthesis
LLM-based generation: Leverages large language model capabilities for speech
High quality: Competitive performance on Seed-TTS Eval and ESD benchmarks
Full deployment stack: Includes TensorRT-LLM acceleration and production-ready runtime
Apache 2.0 license: Fully open source and commercially usable
The model achieves strong performance with WER 2.28 and speaker similarity 65.49 on the Seed-TTS Eval benchmark.
The goal is to enable this model to run on TT hardware for high-throughput, low-latency multilingual speech synthesis across diverse applications including virtual assistants, audiobooks, content creation, and voice cloning.
🎯 What Success Looks Like
A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.
Stage 1 — Bring-Up
- Implement CosyVoice-300M using TTNN APIs (Python)
- Implement the full generation pipeline:
- LLM backbone for semantic token generation
- Flow-based decoder for acoustic modeling
- Vocoder for waveform generation
- Model runs on either N150 or N300 Tenstorrent hardware with no errors
- Supports multiple generation modes:
- SFT mode: Generate speech with predefined speakers
- Zero-shot mode: Generate speech with reference audio (voice cloning)
- Cross-lingual mode: Generate speech in a different language from the reference audio
- Instruct mode: Generate expressive speech with instructions
- Produces valid audio output on sample texts (5 languages: Chinese, English, Japanese, Cantonese, Korean)
- Output is verifiable (audio quality assessment and comparison against the PyTorch reference; see the measurement sketch after this list)
- Achieves baseline throughput target:
- At least 30 tokens/second for semantic token generation
- Real-time factor (RTF, i.e. synthesis time divided by generated audio duration) < 0.5 for typical sentences
- Accuracy evaluation: Token-level accuracy > 95% against PyTorch reference
- Audio quality: WER < 3.0, speaker similarity > 60 on the test set
- Clear instructions for setup and running the model
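For concreteness, the throughput and accuracy targets above could be measured with something like the sketch below. `synthesize` is a hypothetical wrapper around the end-to-end TTNN pipeline, and the 22.05 kHz sample rate is an assumption (CosyVoice-300M's usual output rate; verify against the checkpoint you use):

```python
# Minimal sketch of the Stage 1 metrics. `synthesize` is a hypothetical
# wrapper around the TTNN pipeline: text in, 1-D waveform tensor out.
import time
import torch

def real_time_factor(synthesize, text: str, sample_rate: int = 22050) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio (target < 0.5)."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (audio.shape[-1] / sample_rate)

def token_accuracy(tt_tokens: torch.Tensor, ref_tokens: torch.Tensor) -> float:
    """Fraction of semantic tokens matching the PyTorch reference (target > 0.95)."""
    n = min(tt_tokens.shape[-1], ref_tokens.shape[-1])
    return (tt_tokens[..., :n] == ref_tokens[..., :n]).float().mean().item()
```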
Stage 2 — Basic Optimizations
- Use optimal sharded/interleaved memory configs for LLM layers (see the sketch after this list)
- Implement efficient sharding strategy for:
- Token embeddings (text + semantic tokens)
- Transformer layers in LLM backbone
- Multi-head attention mechanisms
- Flow-based decoder layers
- Fuse simple ops where possible (e.g., layer normalization, attention patterns, activation functions)
- Store intermediate activations in L1 where beneficial
- Use recommended TTNN/tt-metal LLM flows
- Leverage the TT library of fused ops for attention and MLP blocks
- Optimize flow-based decoder (normalizing flows)
- Efficient KV-cache management for autoregressive generation
- Optimize vocoder integration
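As a starting point for the memory configs above, TTNN provides `ttnn.create_sharded_memory_config`. The tensor shape and core grid below are illustrative placeholders, not tuned values for CosyVoice:

```python
# Illustrative only: height-shard an activation across an 8x8 core grid so
# each core holds a 32 x 1024 row block in its local L1.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Host tensor standing in for a transformer-layer activation
torch_x = torch.randn(1, 1, 2048, 1024, dtype=torch.bfloat16)
x = ttnn.from_torch(torch_x, layout=ttnn.TILE_LAYOUT, device=device)

sharded_cfg = ttnn.create_sharded_memory_config(
    shape=(2048, 1024),                     # full (height, width) being sharded
    core_grid=ttnn.CoreGrid(y=8, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,     # split rows across cores
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)
x = ttnn.to_memory_config(x, sharded_cfg)   # interleaved DRAM -> sharded L1

ttnn.close_device(device)
```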
Stage 3 — Deeper Optimization
- Maximize the core count used per inference
- Implement deeper TT-specific optimizations:
- Efficient KV-cache management for long sequences
- Flash Attention or equivalent for attention layers
- Minimize token generation latency
- Batch processing for multiple utterances
- Efficient sampling strategies (temperature, top-p, top-k; see the sketch after this list)
- Pipeline semantic generation with acoustic modeling
- Optimize flow-based decoder computation
- Minimize memory and TM (tensor manipulation) overheads
- Explore speculative decoding or other acceleration techniques
- Document any advanced tuning, known limitations, or trade-offs
- Target stretch goals:
- 60+ tokens/second generation speed
- RTF < 0.2 for real-time applications
- Support for streaming inference
- Efficient multi-lingual switching (5 languages)
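A minimal host-side reference for the sampling chain mentioned above (temperature, then top-k, then top-p); the parameter defaults are illustrative, and sampling can stay on host initially and move on-device once the rest of the decode loop is optimized:

```python
import torch

def sample_token(logits: torch.Tensor,
                 temperature: float = 0.8,
                 top_k: int = 25,
                 top_p: float = 0.9) -> int:
    """Sample one token id from a 1-D logits vector."""
    logits = logits / max(temperature, 1e-5)

    # Top-k: mask everything below the k-th largest logit.
    top_k = min(top_k, logits.shape[-1])
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p (nucleus): keep the smallest prefix of sorted probs whose
    # cumulative mass exceeds p, always retaining the most likely token.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = cumulative > top_p
    cutoff[1:] = cutoff[:-1].clone()
    cutoff[0] = False
    sorted_probs = sorted_probs.masked_fill(cutoff, 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum()

    return int(sorted_idx[torch.multinomial(sorted_probs, 1)])
```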
🧭 Guidance & Starting Points
- Use the TTNN model bring-up tech report as your primary reference
- Reference LLM implementations in tt-metal for LLM-based model patterns
- Use the official CosyVoice repository for model architecture details
- Refer to the CosyVoice paper (arXiv:2407.05407) for technical details
- Refer to the CosyVoice 2 paper (arXiv:2412.10117) for streaming capabilities
- Refer to TT Fused ops PR #29236 for optimization opportunities
- The model architecture consists of (sketched in code after this list):
- LLM backbone: Transformer for semantic token prediction
- Flow-based decoder: Normalizing flows for acoustic modeling
- Vocoder: HiFi-GAN or similar for waveform generation
- Semantic tokens: Supervised token representations
- Key challenges:
- Flow-based decoder optimization (normalizing flows)
- Multi-lingual token vocabulary
- Autoregressive generation latency
- Vocoder integration
- Multiple inference modes (SFT, zero-shot, cross-lingual, instruct)
- Ask for help or file issues if ops are missing in TTNN
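As a mental model for the architecture above, a conceptual pipeline sketch follows. The class and attribute names here are illustrative, not the CosyVoice API (in the official repo the components roughly correspond to the `llm`, `flow`, and `hift` submodules):

```python
import torch

class CosyVoiceStylePipeline:
    """Conceptual three-stage TTS pipeline: text -> semantic tokens -> mel -> waveform.
    Attribute names are illustrative; map them onto the actual CosyVoice submodules
    when porting."""

    def __init__(self, llm_backbone, flow_decoder, vocoder):
        self.llm = llm_backbone    # autoregressive transformer over semantic tokens
        self.flow = flow_decoder   # normalizing-flow acoustic model: tokens -> mel
        self.vocoder = vocoder     # HiFi-GAN-style network: mel -> waveform

    @torch.no_grad()
    def __call__(self, text_tokens: torch.Tensor) -> torch.Tensor:
        semantic = self.llm.generate(text_tokens)  # port to TTNN first
        mel = self.flow(semantic)                  # then the flow decoder
        return self.vocoder(mel)                   # vocoder last (or kept on host)
```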
🔎 Possible Approaches
- Start from the official CosyVoice repository and port components sequentially:
- LLM backbone for semantic tokens
- Flow-based decoder
- Vocoder integration
- Multi-mode inference logic (SFT, zero-shot, etc.)
- End-to-end pipeline integration
- Validate each submodule's output against the PyTorch reference before integration (e.g. via a PCC check; see the sketch after this list)
- For the LLM backbone:
- Start with standard transformer layers
- Optimize attention mechanisms
- Implement efficient KV-cache
- Handle multi-lingual token vocabulary
- For the flow-based decoder:
- Understand normalizing flow operations
- Optimize iterative refinement process
- Consider approximations for faster inference
- Experiment with different sharding strategies
- Use TTNN profiling tools to identify bottlenecks
- Test diverse use cases:
- Plain TTS in 5 languages
- Zero-shot voice cloning
- Cross-lingual generation
- Instruct-based expressive speech
- Open a draft PR early to get feedback on your approach
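For the per-submodule validation step above, tt-metal model tests conventionally compare TTNN outputs against the PyTorch golden with a Pearson correlation coefficient (PCC) check; a self-contained version:

```python
import torch

def pcc(golden: torch.Tensor, actual: torch.Tensor) -> float:
    """Pearson correlation between flattened reference and TTNN outputs."""
    g = golden.flatten().to(torch.float32)
    a = actual.flatten().to(torch.float32)
    g = g - g.mean()
    a = a - a.mean()
    return float((g @ a) / (g.norm() * a.norm() + 1e-12))

def assert_close(golden: torch.Tensor, actual: torch.Tensor, min_pcc: float = 0.99):
    score = pcc(golden, actual)
    assert score >= min_pcc, f"PCC {score:.5f} below threshold {min_pcc}"
```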
📊 Result Submission Guidelines
Beyond the model implementation itself, contributors must submit the following material as proof of work.
Feel free to open a PR at any time if you want us to check that you are on the right track; just understand that payout is only made after all three stages are completed.
Deliverables:
- Functional model implementation
- Validation logs (output correctness)
- Performance report + header for final review
📚 Resources
Model Resources
- CosyVoice Official Repository
- CosyVoice Paper (arXiv:2407.05407)
- CosyVoice 2 Paper (arXiv:2412.10117)
- CosyVoice 3 Paper (arXiv:2505.17589)
- CosyVoice Website
- HuggingFace Model Hub
Evaluation Resources
TT-Metal Resources