
[Bounty $1500] CosyVoice bring-up using TTNN APIs #32178

@tvardhineniTT

Description

📝 Background

This bounty is for bringing up the CosyVoice model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
CosyVoice is a multi-lingual large voice generation model from Alibaba's FunAudioLLM that provides full-stack TTS capabilities. Key features include:

  • Multi-lingual support: Chinese, English, Japanese, Cantonese, and Korean
  • Multiple inference modes: SFT (supervised fine-tuning), zero-shot TTS, cross-lingual, instruct-based, and voice conversion
  • Scalable architecture: 300M-parameter model with streaming capability
  • Supervised semantic tokens: a novel approach for controllable speech synthesis
  • LLM-based generation: leverages large language model capabilities for speech
  • High quality: competitive performance on the Seed-TTS Eval and ESD benchmarks
  • Full deployment stack: includes TensorRT-LLM acceleration and a production-ready runtime
  • Apache 2.0 license: fully open source and commercially usable

The model achieves strong performance on the Seed-TTS Eval benchmark, with a WER of 2.28 and a speaker similarity of 65.49.
The goal is to enable this model to run on TT hardware for high-throughput, low-latency multilingual speech synthesis across diverse applications including virtual assistants, audiobooks, content creation, and voice cloning.

🎯 What Success Looks Like

A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.

Stage 1 — Bring-Up

  • Implement CosyVoice-300M using TTNN APIs (Python)
  • Implement the full generation pipeline:
    • LLM backbone for semantic token generation
    • Flow-based decoder for acoustic modeling
    • Vocoder for waveform generation
  • Model runs on either N150 or N300 Tenstorrent hardware with no errors
  • Supports multiple generation modes:
    • SFT mode: Generate speech with predefined speakers
    • Zero-shot mode: Generate speech with reference audio (voice cloning)
    • Cross-lingual mode: Generate speech in a different language from the reference
    • Instruct mode: Generate expressive speech with instructions
  • Produces valid audio output on sample texts (5 languages: Chinese, English, Japanese, Cantonese, Korean)
  • Output is verifiable (audio-quality assessment and comparison against the PyTorch reference)
  • Achieves baseline throughput target:
    • At least 30 tokens/second for semantic token generation
    • Real-time factor (RTF) < 0.5 for typical sentences
  • Accuracy evaluation: Token-level accuracy > 95% against the PyTorch reference (see the validation sketch after this list)
  • Audio quality: WER < 3.0, speaker similarity > 60 on the test set
  • Clear instructions for setting up and running the model
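
One way to read the acceptance criteria above as an executable check is the minimal sketch below. `tt_model`, `ref_model`, `generate_semantic_tokens`, and `synthesize` are hypothetical wrappers (not CosyVoice or TTNN APIs), and greedy decoding is assumed on both sides so the token sequences are directly comparable:

```python
import time

import torch

def validate_sample(tt_model, ref_model, text, sample_rate=22050):
    # Token-level accuracy: exact-match rate of semantic tokens against
    # the PyTorch reference (greedy decoding assumed on both sides).
    ref_tokens = torch.tensor(ref_model.generate_semantic_tokens(text))
    t0 = time.perf_counter()
    tt_tokens = torch.tensor(tt_model.generate_semantic_tokens(text))
    token_time = time.perf_counter() - t0

    n = min(len(ref_tokens), len(tt_tokens))
    accuracy = (tt_tokens[:n] == ref_tokens[:n]).float().mean().item()
    tokens_per_s = len(tt_tokens) / token_time

    # RTF = wall-clock synthesis time / duration of the produced audio.
    t0 = time.perf_counter()
    audio = tt_model.synthesize(text)  # 1-D waveform tensor (assumed API)
    rtf = (time.perf_counter() - t0) / (audio.numel() / sample_rate)

    print(f"accuracy={accuracy:.3f} tokens/s={tokens_per_s:.1f} RTF={rtf:.2f}")
    return accuracy > 0.95 and tokens_per_s >= 30 and rtf < 0.5
```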

Stage 2 — Basic Optimizations

  • Use optimal sharded/interleaved memory configs for LLM layers (a memory-config sketch follows this list)
  • Implement efficient sharding strategy for:
    • Token embeddings (text + semantic tokens)
    • Transformer layers in LLM backbone
    • Multi-head attention mechanisms
    • Flow-based decoder layers
  • Fuse simple ops where possible (e.g., layer normalization, attention patterns, activation functions)
  • Store intermediate activations in L1 where beneficial
  • Use the recommended TTNN/tt-metal LLM flows
  • Leverage the TT library of fused ops for attention and MLP blocks
  • Optimize flow-based decoder (normalizing flows)
  • Efficient KV-cache management for autoregressive generation
  • Optimize vocoder integration
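
As a concrete illustration of the interleaved-vs-sharded choice, here is a minimal sketch using the public ttnn Python API (check exact signatures against the ttnn docs); the tensor shape and 8x8 core grid are illustrative placeholders, not the real CosyVoice layer dimensions:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Illustrative activation: [seq_len, hidden] for one transformer layer.
torch_x = torch.randn(2048, 1024)

# Height-shard the activation across an 8x8 core grid so each core owns a
# contiguous band of rows and the data stays in L1 next to the compute.
sharded_cfg = ttnn.create_sharded_memory_config(
    shape=(2048, 1024),
    core_grid=ttnn.CoreGrid(y=8, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)

# Weights and cold tensors can stay interleaved in DRAM ...
x = ttnn.from_torch(torch_x, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
                    device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG)
# ... while hot intermediates move into sharded L1 for the next op.
x_sharded = ttnn.to_memory_config(x, sharded_cfg)
y = ttnn.layer_norm(x_sharded, memory_config=sharded_cfg)

ttnn.close_device(device)
```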

Stage 3 — Deeper Optimization

  • Maximize the number of cores used per inference
  • Implement deeper TT-specific optimizations:
    • Efficient KV-cache management for long sequences
    • Flash Attention or equivalent for attention layers
    • Minimize token generation latency
    • Batch processing for multiple utterances
    • Efficient sampling strategies (temperature, top-p, top-k; see the sampling sketch after this list)
    • Pipeline semantic generation with acoustic modeling
    • Optimize flow-based decoder computation
  • Minimize memory and TM (tensor manipulation) overheads
  • Explore speculative decoding or other acceleration techniques
  • Document any advanced tuning, known limitations, or trade-offs
  • Target stretch goals:
    • 60+ tokens/second generation speed
    • RTF < 0.2 for real-time applications
    • Support for streaming inference
    • Efficient multi-lingual switching (5 languages)
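
For the sampling-strategies item above, a minimal host-side sketch in plain PyTorch (assuming the current step's logits are read back to host; fusing sampling on device would be a further optimization):

```python
import torch

def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Temperature + top-k + top-p (nucleus) sampling over one logits row."""
    logits = logits.float() / max(temperature, 1e-5)

    # Top-k: mask out everything below the k-th largest logit.
    k = min(top_k, logits.numel())
    kth = torch.topk(logits, k).values[-1]
    logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p: drop the tail once cumulative probability mass exceeds top_p
    # (the check uses the mass *before* each token, so at least one survives).
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])
```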

🧭 Guidance & Starting Points

  • Use the TTNN model bring-up tech report as your primary reference
  • Reference LLM implementations in tt-metal for LLM-based model patterns
  • Use the official CosyVoice repository for model architecture details
  • Refer to the CosyVoice paper (arXiv:2407.05407) for technical details
  • Refer to the CosyVoice 2 paper (arXiv:2412.10117) for streaming capabilities
  • Refer to TT Fused ops PR #29236 for optimization opportunities
  • The model architecture consists of the following components (a pipeline skeleton follows this list):
    • LLM backbone: Transformer for semantic token prediction
    • Flow-based decoder: Normalizing flows for acoustic modeling
    • Vocoder: HiFi-GAN or similar for waveform generation
    • Semantic tokens: Supervised token representations
  • Key challenges:
    • Flow-based decoder optimization (normalizing flows)
    • Multi-lingual token vocabulary
    • Autoregressive generation latency
    • Vocoder integration
    • Multiple inference modes (SFT, zero-shot, cross-lingual, instruct)
  • Ask for help or file issues if ops are missing in TTNN
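
A hypothetical skeleton of how those components compose end to end; the class and method names are placeholders, not the official CosyVoice API:

```python
from dataclasses import dataclass

import torch

@dataclass
class CosyVoicePipeline:
    """Hypothetical composition of the CosyVoice stages; each field would
    wrap a TTNN implementation (or a PyTorch fallback during bring-up)."""
    llm: object      # transformer backbone: text -> semantic tokens
    flow: object     # flow-based decoder: semantic tokens -> mel frames
    vocoder: object  # HiFi-GAN-style vocoder: mel frames -> waveform

    def tts(self, text: str, speaker: torch.Tensor) -> torch.Tensor:
        tokens = self.llm.generate(text, speaker)  # autoregressive, KV-cached
        mel = self.flow.decode(tokens, speaker)    # iterative refinement
        return self.vocoder(mel)                   # final waveform
```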

🔎 Possible Approaches

  • Start from the official CosyVoice repository and port components sequentially:
    1. LLM backbone for semantic tokens
    2. Flow-based decoder
    3. Vocoder integration
    4. Multi-mode inference logic (SFT, zero-shot, etc.)
    5. End-to-end pipeline integration
  • Validate each submodule's output against the PyTorch reference before integration (a PCC-check sketch follows this list)
  • For the LLM backbone:
    • Start with standard transformer layers
    • Optimize attention mechanisms
    • Implement efficient KV-cache
    • Handle multi-lingual token vocabulary
  • For the flow-based decoder:
    • Understand normalizing flow operations
    • Optimize iterative refinement process
    • Consider approximations for faster inference
  • Experiment with different sharding strategies
  • Use TTNN profiling tools to identify bottlenecks
  • Test diverse use cases:
    • Plain TTS in 5 languages
    • Zero-shot voice cloning
    • Cross-lingual generation
    • Instruct-based expressive speech
  • Open a draft PR early to get feedback on your approach
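
For the per-submodule validation step above, a small sketch of the PCC (Pearson correlation coefficient) check conventionally used in tt-metal model tests; `tt_forward` is a hypothetical callable wrapping the TTNN submodule (from_torch, device ops, to_torch):

```python
import torch

def pcc(golden: torch.Tensor, actual: torch.Tensor) -> float:
    """Pearson correlation coefficient between two flattened tensors."""
    stacked = torch.stack([golden.flatten().float(), actual.flatten().float()])
    return float(torch.corrcoef(stacked)[0, 1])

def check_submodule(name, torch_module, tt_forward, sample_input, threshold=0.99):
    # Compare the TTNN submodule against its PyTorch counterpart on the
    # same input before wiring it into the end-to-end pipeline.
    golden = torch_module(sample_input)
    actual = tt_forward(sample_input)  # hypothetical TTNN wrapper
    score = pcc(golden, actual)
    print(f"{name}: PCC={score:.5f} {'PASS' if score >= threshold else 'FAIL'}")
    return score >= threshold
```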

📊 Result Submission Guidelines

Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is made only after all three stages are completed.

Deliverables:

  • Functional model implementation
  • Validation logs (output correctness)
  • Performance report + header for final review

📚 Resources

Model Resources

Evaluation Resources

TT-Metal Resources
