
[Bounty $1500] OpenVoice V2 bring up using TTNN APIs #32182


📝 Background

This bounty is for bringing up the OpenVoice V2 model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).

OpenVoice V2 is an instant voice cloning model from MyShell.ai that provides accurate tone color cloning and flexible voice style control. It was released in April 2024; key features include:

  • Accurate tone color cloning: Clones a reference voice and generates speech in multiple languages and accents
  • Flexible voice style control: Granular control over emotion, accent, rhythm, pauses, and intonation
  • Zero-shot cross-lingual voice cloning: Clones a voice from any language and speaks in any other language
  • Native multi-lingual support: English, Spanish, French, Chinese, Japanese, and Korean
  • Better audio quality: V2 uses an improved training strategy for superior audio quality
  • Instant cloning: Fast voice cloning from a short reference audio clip
  • MIT License: Free for commercial use
  • MeloTTS integration: Leverages MeloTTS for high-quality synthesis

OpenVoice V2 enables voice cloning where the reference speech can be in any language, and the generated speech can be in any of the supported languages, with full control over voice style parameters.

The goal is to enable this model to run on TT hardware for high-throughput, low-latency voice cloning across diverse applications.

🎯 What Success Looks Like

A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.

Stage 1 — Bring-Up

  • Implement OpenVoice V2 using TTNN APIs (Python)
  • Implement the full generation pipeline:
    • Tone color converter (voice cloning module)
    • Base TTS model (MeloTTS integration)
    • Style control module
  • Model runs on either N150 or N300 Tenstorrent hardware with no errors
  • Supports multiple generation modes:
    • Tone color cloning: Clone voice from reference audio
    • Cross-lingual cloning: Clone voice and speak in different language
    • Style control: Control emotion, accent, rhythm, pauses, intonation
  • Produces valid audio output with cloned voices (6 languages: English, Spanish, French, Chinese, Japanese, Korean)
  • Output is verifiable (audio quality assessment, compare with PyTorch reference)
  • Achieves baseline throughput targets:
    • At least 25 tokens/second for speech generation
    • Real-time factor (RTF) < 0.6 for typical sentences, where RTF = synthesis time ÷ output audio duration (see the measurement sketch after this list)
    • Cloning latency < 2 seconds from reference audio
  • Accuracy evaluation:
    • Speaker similarity > 70% against the reference
    • Intelligibility: word error rate (WER) < 3.0
    • Token-level accuracy > 95% against PyTorch reference
  • Audio quality: Natural prosody and accurate tone color cloning
  • Clear instructions for setup and running the model
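
As a rough illustration of how the Stage 1 performance targets could be measured, here is a minimal sketch. The `extract_tone_color` and `synthesize` functions are hypothetical stand-ins for your TTNN pipeline's entry points, and the `soundfile` package is assumed to be available for audio I/O; adapt the names to your implementation.

```python
import time

import soundfile as sf  # assumed available for audio I/O

# Hypothetical entry points into the TTNN pipeline (names are illustrative):
#   extract_tone_color(ref_audio, sample_rate) -> speaker embedding
#   synthesize(text, speaker_embedding) -> (audio_samples, sample_rate, num_tokens)

def measure_metrics(text, reference_path):
    ref_audio, ref_sr = sf.read(reference_path)

    # Cloning latency: time to extract the tone color from reference audio.
    t0 = time.perf_counter()
    speaker_embedding = extract_tone_color(ref_audio, ref_sr)
    cloning_latency = time.perf_counter() - t0

    # Generation throughput and real-time factor.
    t0 = time.perf_counter()
    audio, sr, num_tokens = synthesize(text, speaker_embedding)
    synthesis_time = time.perf_counter() - t0

    audio_duration = len(audio) / sr
    return {
        "cloning_latency_s": cloning_latency,             # target: < 2 s
        "rtf": synthesis_time / audio_duration,           # target: < 0.6
        "tokens_per_second": num_tokens / synthesis_time, # target: >= 25
    }
```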

Stage 2 — Basic Optimizations

  • Use optimal sharded/interleaved memory configs for model layers (see the sharding sketch after this list)
  • Implement efficient sharding strategy for:
    • Tone color converter layers
    • MeloTTS base model components
    • Style embedding layers
    • Attention mechanisms
  • Fuse simple ops where possible (e.g., layer normalization, activation functions)
  • Store intermediate activations in L1 where beneficial
  • Use recommended TTNN/tt-metal flows for audio models
  • Leverage the TT library of fused ops for attention and MLP blocks
  • Optimize tone color extraction from reference audio
  • Efficient style parameter conditioning
  • Optimize MeloTTS integration
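
As one concrete example of the memory-config work, the sketch below places an intermediate activation into a height-sharded L1 configuration via ttnn. The tensor shape and core grid are illustrative only, and this assumes a recent tt-metal/ttnn build where these helpers exist; the right choice of height, width, or block sharding versus interleaved DRAM depends on each layer's shape and the consuming op's support.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Illustrative activation from a converter layer; real shapes come from the model.
torch_act = torch.randn(1, 1, 1024, 512)

# Height-shard across 32 cores (8x4 grid): 1024 rows / 32 cores = 32 rows
# per core, i.e. one tile-aligned shard per core, kept in L1.
sharded_cfg = ttnn.create_sharded_memory_config(
    shape=(1, 1, 1024, 512),
    core_grid=ttnn.CoreGrid(y=8, x=4),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)

act = ttnn.from_torch(
    torch_act,
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=sharded_cfg,
)

ttnn.close_device(device)
```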

Stage 3 — Deeper Optimization

  • Maximize core counts used per inference
  • Implement deeper TT-specific optimizations:
    • Minimize voice cloning latency (reference audio processing)
    • Efficient multi-style conditioning
    • Batch processing for multiple voice clones
    • Pipeline tone color extraction with synthesis
    • Optimize cross-lingual token mapping
    • Efficient accent and emotion control
  • Minimize memory and TM (tensor manipulation) overheads
  • Explore caching strategies for frequently used voices (a minimal caching sketch follows this list)
  • Document any advanced tuning, known limitations, or trade-offs
  • Target stretch goals:
    • 50+ tokens/second generation speed
    • RTF < 0.3 for real-time applications
    • Cloning latency < 1 second
    • Support for 10+ concurrent voice clones
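
For the voice-caching idea above, one simple host-side approach is to key tone color embeddings by a hash of the reference audio, so repeat requests for the same voice skip the extraction pass entirely. `extract_tone_color` is again a hypothetical stand-in for your extraction entry point.

```python
import hashlib

# Hypothetical extraction entry point, as in the earlier sketches:
#   extract_tone_color(audio_bytes) -> speaker embedding

_embedding_cache = {}

def get_speaker_embedding(reference_audio_bytes):
    # Key the cache on the content of the reference audio so the same
    # voice is only ever extracted once.
    key = hashlib.sha256(reference_audio_bytes).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = extract_tone_color(reference_audio_bytes)
    return _embedding_cache[key]
```

A production version would add an eviction policy (e.g., LRU with a size cap) so the cache does not grow without bound.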

🧭 Guidance & Starting Points

  • Use the TTNN model bring-up tech report as your primary reference
  • Reference audio model implementations in tt-metal for audio model patterns
  • Use the official OpenVoice repository for model architecture details
  • Refer to the OpenVoice V2 on HuggingFace for checkpoints and demos
  • Reference MeloTTS for base TTS model details
  • Refer to TT Fused ops PR #29236 for optimization opportunities
  • The model architecture consists of the following components (a skeleton sketch follows this list):
    • Tone color converter: Extracts and transfers voice characteristics
    • Base TTS model: MeloTTS for multi-lingual speech synthesis
    • Style encoder: Encodes prosody, emotion, and accent information
    • Reference encoder: Processes reference audio for voice cloning
  • Key challenges:
    • Tone color extraction and conversion
    • Cross-lingual voice transfer
    • Style parameter conditioning
    • MeloTTS integration
    • Real-time voice cloning latency
    • Multi-lingual token handling (6 languages)
  • Ask for help or file issues if ops are missing in TTNN
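
Putting the four components above together, a plausible high-level data flow looks like the following sketch. All class and function names are hypothetical; they only illustrate how the pieces connect, not the actual OpenVoice API.

```python
# Hypothetical pipeline skeleton reflecting the architecture described above.

class OpenVoiceV2Pipeline:
    def __init__(self, base_tts, tone_color_converter, style_encoder, reference_encoder):
        self.base_tts = base_tts                    # MeloTTS (CPU first, TTNN later)
        self.converter = tone_color_converter       # TTNN tone color converter
        self.style_encoder = style_encoder          # prosody/emotion/accent embedding
        self.reference_encoder = reference_encoder  # processes reference audio

    def clone(self, text, language, reference_audio, style_params):
        # 1. Encode the reference audio into a tone color embedding.
        tone_color = self.reference_encoder(reference_audio)
        # 2. Encode the requested style (emotion, accent, rhythm, ...).
        style = self.style_encoder(style_params)
        # 3. Synthesize base speech in the target language.
        base_audio = self.base_tts(text, language, style)
        # 4. Transfer the tone color onto the base speech.
        return self.converter(base_audio, tone_color)
```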

🔎 Possible Approaches

  • Start from the official OpenVoice repository and port components sequentially
  • Validate each submodule's output against PyTorch reference before integration (see the PCC sketch after this list)
  • For tone color conversion:
    • Optimize reference audio encoding
    • Efficient tone color feature extraction
    • Fast tone color transfer to target speech
  • For MeloTTS integration:
    • Initial: Run MeloTTS on CPU
    • Advanced: Port MeloTTS components to TTNN if time permits
    • Optimize multi-lingual token processing
  • For style control:
    • Efficient style embedding lookup/generation
    • Optimize style parameter conditioning
    • Handle emotion, accent, rhythm parameters
  • Experiment with different sharding strategies
  • Use TTNN profiling tools to identify bottlenecks in:
    • Reference audio processing
    • Tone color conversion
    • MeloTTS synthesis
    • Style conditioning overhead
  • Test diverse use cases:
    • Same-language voice cloning (6 languages)
    • Cross-lingual cloning (clone Chinese voice, speak English, etc.)
    • Style variations (different emotions, accents)
    • Short vs. long reference audio
  • Open a draft PR early to get feedback on your approach
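
For the per-submodule validation step, one simple correctness check is the Pearson correlation coefficient (PCC) between a TTNN submodule's output and its PyTorch reference; tt-metal model tests commonly gate on a PCC threshold around 0.99. Below is a self-contained sketch; `ttnn_forward` is a placeholder for your wrapper around the ported submodule.

```python
import torch

def pcc(golden: torch.Tensor, actual: torch.Tensor) -> float:
    """Pearson correlation coefficient between flattened tensors."""
    g = golden.flatten().float()
    a = actual.flatten().float()
    return torch.corrcoef(torch.stack([g, a]))[0, 1].item()

def validate_submodule(torch_module, ttnn_forward, example_input, threshold=0.99):
    # Golden output from the PyTorch reference.
    golden = torch_module(example_input)
    # Output from the TTNN port (ttnn_forward is a hypothetical wrapper
    # that handles from_torch / to_torch conversion internally).
    actual = ttnn_forward(example_input)
    score = pcc(golden, actual)
    assert score >= threshold, f"PCC {score:.4f} below threshold {threshold}"
    return score
```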

📊 Result Submission Guidelines

Beyond the model implementation itself, contributors must submit the following material as proof of work.
That said, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is only made after all three stages are completed.

Deliverables:

  • Functional model implementation
  • Validation logs (output correctness)
  • Performance report + header for final review


📚 Resources

Model Resources

TT-Metal Resources
