
[Bounty $1500] CosyVoice bring-up using TTNN APIs #32178

@tvardhineniTT

Description

📝 Background

This bounty is for bringing up the CosyVoice model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
CosyVoice is a multi-lingual large voice generation model from Alibaba's FunAudioLLM that provides full-stack TTS capabilities. Key features include:

  • Multi-lingual support: Chinese, English, Japanese, Cantonese, and Korean
  • Multiple inference modes: SFT (supervised fine-tuning), zero-shot TTS, cross-lingual, instruct-based, and voice conversion
  • Scalable architecture: 300M-parameter model with streaming capability
  • Supervised semantic tokens: a novel approach for controllable speech synthesis
  • LLM-based generation: leverages large language model capabilities for speech
  • High quality: competitive performance on the Seed-TTS Eval and ESD benchmarks
  • Full deployment stack: includes TensorRT-LLM acceleration and a production-ready runtime
  • Apache 2.0 license: fully open source and commercially usable

The model achieves strong performance on the Seed-TTS Eval benchmark, with a WER of 2.28 and a speaker similarity of 65.49.
The goal is to enable this model to run on TT hardware for high-throughput, low-latency multilingual speech synthesis across diverse applications including virtual assistants, audiobooks, content creation, and voice cloning.

🎯 What Success Looks Like

A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.

Stage 1 — Bring-Up

  • Implement CosyVoice-300M using TTNN APIs (Python)
  • Implement the full generation pipeline:
    • LLM backbone for semantic token generation
    • Flow-based decoder for acoustic modeling
    • Vocoder for waveform generation
  • Model runs on either N150 or N300 Tenstorrent hardware with no errors
  • Supports multiple generation modes:
    • SFT mode: Generate speech with predefined speakers
    • Zero-shot mode: Generate speech with reference audio (voice cloning)
    • Cross-lingual mode: Generate speech in a different language from the reference
    • Instruct mode: Generate expressive speech with instructions
  • Produces valid audio output on sample texts (5 languages: Chinese, English, Japanese, Cantonese, Korean)
  • Output is verifiable (audio-quality assessment and comparison against the PyTorch reference)
  • Achieves baseline throughput target:
    • At least 30 tokens/second for semantic token generation
    • Real-time factor (RTF) < 0.5 for typical sentences
  • Accuracy evaluation: Token-level accuracy > 95% against the PyTorch reference (see the validation sketch after this list)
  • Audio quality: WER < 3.0, speaker similarity > 60 on the test set
  • Clear instructions for setting up and running the model
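
One way to read the acceptance criteria above as an executable check is the minimal sketch below. `tt_model`, `ref_model`, `generate_semantic_tokens`, and `synthesize` are hypothetical wrappers (not CosyVoice or TTNN APIs), and greedy decoding is assumed on both sides so the token sequences are directly comparable:

```python
import time

import torch

def validate_sample(tt_model, ref_model, text, sample_rate=22050):
    # Token-level accuracy: exact-match rate of semantic tokens against
    # the PyTorch reference (greedy decoding assumed on both sides).
    ref_tokens = torch.tensor(ref_model.generate_semantic_tokens(text))
    t0 = time.perf_counter()
    tt_tokens = torch.tensor(tt_model.generate_semantic_tokens(text))
    token_time = time.perf_counter() - t0

    n = min(len(ref_tokens), len(tt_tokens))
    accuracy = (tt_tokens[:n] == ref_tokens[:n]).float().mean().item()
    tokens_per_s = len(tt_tokens) / token_time

    # RTF = wall-clock synthesis time / duration of the produced audio.
    t0 = time.perf_counter()
    audio = tt_model.synthesize(text)  # 1-D waveform tensor (assumed API)
    rtf = (time.perf_counter() - t0) / (audio.numel() / sample_rate)

    print(f"accuracy={accuracy:.3f} tokens/s={tokens_per_s:.1f} RTF={rtf:.2f}")
    return accuracy > 0.95 and tokens_per_s >= 30 and rtf < 0.5
```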

Stage 2 — Basic Optimizations

  • Use optimal sharded/interleaved memory configs for LLM layers (a memory-config sketch follows this list)
  • Implement efficient sharding strategy for:
    • Token embeddings (text + semantic tokens)
    • Transformer layers in LLM backbone
    • Multi-head attention mechanisms
    • Flow-based decoder layers
  • Fuse simple ops where possible (e.g., layer normalization, attention patterns, activation functions)
  • Store intermediate activations in L1 where beneficial
  • Use the recommended TTNN/tt-metal LLM flows
  • Leverage the TT library of fused ops for attention and MLP blocks
  • Optimize flow-based decoder (normalizing flows)
  • Efficient KV-cache management for autoregressive generation
  • Optimize vocoder integration
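
As a concrete illustration of the interleaved-vs-sharded choice, here is a minimal sketch using the public ttnn Python API (check exact signatures against the ttnn docs); the tensor shape and 8x8 core grid are illustrative placeholders, not the real CosyVoice layer dimensions:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Illustrative activation: [seq_len, hidden] for one transformer layer.
torch_x = torch.randn(2048, 1024)

# Height-shard the activation across an 8x8 core grid so each core owns a
# contiguous band of rows and the data stays in L1 next to the compute.
sharded_cfg = ttnn.create_sharded_memory_config(
    shape=(2048, 1024),
    core_grid=ttnn.CoreGrid(y=8, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)

# Weights and cold tensors can stay interleaved in DRAM ...
x = ttnn.from_torch(torch_x, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
                    device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG)
# ... while hot intermediates move into sharded L1 for the next op.
x_sharded = ttnn.to_memory_config(x, sharded_cfg)
y = ttnn.layer_norm(x_sharded, memory_config=sharded_cfg)

ttnn.close_device(device)
```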

Stage 3 — Deeper Optimization

  • Maximize the number of cores used per inference
  • Implement deeper TT-specific optimizations:
    • Efficient KV-cache management for long sequences
    • Flash Attention or equivalent for attention layers
    • Minimize token generation latency
    • Batch processing for multiple utterances
    • Efficient sampling strategies (temperature, top-p, top-k; see the sampling sketch after this list)
    • Pipeline semantic generation with acoustic modeling
    • Optimize flow-based decoder computation
  • Minimize memory and TM (tensor manipulation) overheads
  • Explore speculative decoding or other acceleration techniques
  • Document any advanced tuning, known limitations, or trade-offs
  • Target stretch goals:
    • 60+ tokens/second generation speed
    • RTF < 0.2 for real-time applications
    • Support for streaming inference
    • Efficient multi-lingual switching (5 languages)
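
For the sampling-strategies item above, a minimal host-side sketch in plain PyTorch (assuming the current step's logits are read back to host; fusing sampling on device would be a further optimization):

```python
import torch

def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Temperature + top-k + top-p (nucleus) sampling over one logits row."""
    logits = logits.float() / max(temperature, 1e-5)

    # Top-k: mask out everything below the k-th largest logit.
    k = min(top_k, logits.numel())
    kth = torch.topk(logits, k).values[-1]
    logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p: drop the tail once cumulative probability mass exceeds top_p
    # (the check uses the mass *before* each token, so at least one survives).
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])
```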

🧭 Guidance & Starting Points

  • Use the TTNN model bring-up tech report as your primary reference
  • Reference LLM implementations in tt-metal for LLM-based model patterns
  • Use the official CosyVoice repository for model architecture details
  • Refer to the CosyVoice paper (arXiv:2407.05407) for technical details
  • Refer to the CosyVoice 2 paper (arXiv:2412.10117) for streaming capabilities
  • Refer to TT Fused ops PR #29236 for optimization opportunities
  • The model architecture consists of the following components (a pipeline skeleton follows this list):
    • LLM backbone: Transformer for semantic token prediction
    • Flow-based decoder: Normalizing flows for acoustic modeling
    • Vocoder: HiFi-GAN or similar for waveform generation
    • Semantic tokens: Supervised token representations
  • Key challenges:
    • Flow-based decoder optimization (normalizing flows)
    • Multi-lingual token vocabulary
    • Autoregressive generation latency
    • Vocoder integration
    • Multiple inference modes (SFT, zero-shot, cross-lingual, instruct)
  • Ask for help or file issues if ops are missing in TTNN
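
A hypothetical skeleton of how those components compose end to end; the class and method names are placeholders, not the official CosyVoice API:

```python
from dataclasses import dataclass

import torch

@dataclass
class CosyVoicePipeline:
    """Hypothetical composition of the CosyVoice stages; each field would
    wrap a TTNN implementation (or a PyTorch fallback during bring-up)."""
    llm: object      # transformer backbone: text -> semantic tokens
    flow: object     # flow-based decoder: semantic tokens -> mel frames
    vocoder: object  # HiFi-GAN-style vocoder: mel frames -> waveform

    def tts(self, text: str, speaker: torch.Tensor) -> torch.Tensor:
        tokens = self.llm.generate(text, speaker)  # autoregressive, KV-cached
        mel = self.flow.decode(tokens, speaker)    # iterative refinement
        return self.vocoder(mel)                   # final waveform
```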

🔎 Possible Approaches

  • Start from the official CosyVoice repository and port components sequentially:
    1. LLM backbone for semantic tokens
    2. Flow-based decoder
    3. Vocoder integration
    4. Multi-mode inference logic (SFT, zero-shot, etc.)
    5. End-to-end pipeline integration
  • Validate each submodule's output against the PyTorch reference before integration (a PCC-check sketch follows this list)
  • For the LLM backbone:
    • Start with standard transformer layers
    • Optimize attention mechanisms
    • Implement efficient KV-cache
    • Handle multi-lingual token vocabulary
  • For the flow-based decoder:
    • Understand normalizing flow operations
    • Optimize iterative refinement process
    • Consider approximations for faster inference
  • Experiment with different sharding strategies
  • Use TTNN profiling tools to identify bottlenecks
  • Test diverse use cases:
    • Plain TTS in 5 languages
    • Zero-shot voice cloning
    • Cross-lingual generation
    • Instruct-based expressive speech
  • Open a draft PR early to get feedback on your approach
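
For the per-submodule validation step above, a small sketch of the PCC (Pearson correlation coefficient) check conventionally used in tt-metal model tests; `tt_forward` is a hypothetical callable wrapping the TTNN submodule (from_torch, device ops, to_torch):

```python
import torch

def pcc(golden: torch.Tensor, actual: torch.Tensor) -> float:
    """Pearson correlation coefficient between two flattened tensors."""
    stacked = torch.stack([golden.flatten().float(), actual.flatten().float()])
    return float(torch.corrcoef(stacked)[0, 1])

def check_submodule(name, torch_module, tt_forward, sample_input, threshold=0.99):
    # Compare the TTNN submodule against its PyTorch counterpart on the
    # same input before wiring it into the end-to-end pipeline.
    golden = torch_module(sample_input)
    actual = tt_forward(sample_input)  # hypothetical TTNN wrapper
    score = pcc(golden, actual)
    print(f"{name}: PCC={score:.5f} {'PASS' if score >= threshold else 'FAIL'}")
    return score >= threshold
```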

📊 Result Submission Guidelines

Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is made only after all three stages are completed.

Deliverables:

  • Functional model implementation
  • Validation logs (output correctness)
  • Performance report + header for final review

📚 Resources

Model Resources

Evaluation Resources

TT-Metal Resources
