
[Bounty $1500] Higgs Audio v2 bring up using TTNN APIs #32068

@tvardhineniTT

Description


📝 Background

This bounty is for bringing up the Higgs Audio v2 model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
Higgs Audio v2 is a cutting-edge text-audio foundation model from Boson AI that redefines expressiveness in audio generation. Released in 2025, it offers state-of-the-art capabilities:
  • Exceptional expressiveness: Industry-leading performance on emotion and prosody benchmarks
  • Multi-speaker dialog: Natural multi-speaker conversations with distinct voices
  • Voice cloning: Zero-shot voice cloning from reference audio
  • Diverse audio generation: Speech, sound effects, music, and environmental sounds
  • Massive training scale: Trained on 10 million hours of audio data (the AudioVerse dataset)
  • Custom tokenizer: Unified semantic and acoustic tokenization
  • DualFFN architecture: Enhanced LLM for acoustic modeling with minimal overhead
  • Apache 2.0 license: Fully open source and commercially usable
The model achieves 75.71% win-rate on EmergentTTS-Eval emotions and best-in-class multi-speaker generation quality.
The goal is to enable this model to run on TT hardware for high-throughput, low-latency audio generation across diverse use cases including virtual assistants, audiobooks, content creation, and interactive media.

🎯 What Success Looks Like

A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.

Stage 1 — Bring-Up

  • Implement Higgs Audio v2 using TTNN APIs (Python)
  • Implement the full generation pipeline:
    • LLM backbone with DualFFN architecture
    • Audio tokenizer (semantic + acoustic features)
    • Audio decoder (token-to-waveform conversion)
  • Run on either N150 or N300 Tenstorrent hardware with no errors
  • Support multiple generation modes:
    • Text-to-speech: generate speech from text only
    • Voice cloning: generate speech conditioned on reference audio
    • Multi-speaker dialog: generate conversations with distinct speakers
  • Produce valid audio output on sample texts (English and multilingual)
  • Make output verifiable (audio quality assessment; comparison against the PyTorch reference)
  • Achieve the baseline throughput targets (see the measurement sketch after this list):
    • At least 30 tokens/second for autoregressive generation
    • Real-time factor (RTF) < 0.5 for typical sentences
  • Accuracy evaluation: token-level accuracy > 95% against the PyTorch reference
  • Audio quality: passes intelligibility and expressiveness tests
  • Provide clear instructions for setup and running the model
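
For reference, a minimal sketch of how the throughput, RTF, and token-accuracy targets above could be measured. `generate_fn` and its return values are placeholders for your own pipeline entry point, not an API from the Higgs Audio or tt-metal repos:

```python
import time

def measure_throughput_and_rtf(generate_fn, text, sample_rate=24_000):
    # `generate_fn` is a placeholder for your TTNN generation entry point;
    # assumed to return (audio_token_ids, waveform_samples).
    start = time.perf_counter()
    token_ids, waveform = generate_fn(text)
    elapsed = time.perf_counter() - start

    tokens_per_s = len(token_ids) / elapsed        # target: >= 30
    audio_seconds = len(waveform) / sample_rate    # 24 kHz decoder output
    rtf = elapsed / audio_seconds                  # target: < 0.5
    return tokens_per_s, rtf

def token_accuracy(tt_ids, ref_ids):
    # Token-level match rate against the PyTorch reference (target: > 95%).
    matches = sum(int(a == b) for a, b in zip(tt_ids, ref_ids))
    return matches / max(len(ref_ids), 1)
```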

Stage 2 — Basic Optimizations

  • Use optimal sharded/interleaved memory configs for LLM layers (see the sketch after this list)
  • Implement an efficient sharding strategy for:
    • Token embeddings (text + audio tokens)
    • DualFFN transformer layers
    • Multi-head attention mechanisms
    • Audio tokenizer encoder/decoder
  • Fuse simple ops where possible (e.g., layer normalization, attention patterns, activation functions)
  • Store intermediate activations in L1 where beneficial
  • Use the recommended TTNN/tt-metal LLM flows
  • Leverage the TT library of fused ops for attention and MLP blocks
  • Optimize the DualFFN architecture (dual feed-forward networks)
  • Implement efficient KV-cache management for autoregressive generation
  • Optimize audio tokenizer integration
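
As one illustration of the memory-config items above, a minimal sketch of a width-sharded FFN block kept in L1. It assumes the ttnn Python API of recent tt-metal releases (`create_sharded_memory_config`, `to_memory_config`, `linear` with a fused activation); check the TTNN docs for the exact signatures on your version. The shapes and the `gelu` activation are illustrative, not the model's actual MLP:

```python
import ttnn

def sharded_ffn(x, w1, b1, w2, b2):
    # Width-shard the activation across an 8x8 core grid so the matmuls
    # read their operand from L1 rather than DRAM.
    mem_cfg = ttnn.create_sharded_memory_config(
        shape=x.shape,  # e.g. [1, 1, 32, 4096] (hypothetical)
        core_grid=ttnn.CoreGrid(y=8, x=8),
        strategy=ttnn.ShardStrategy.WIDTH,
        orientation=ttnn.ShardOrientation.ROW_MAJOR,
    )
    x = ttnn.to_memory_config(x, mem_cfg)
    # Fused matmul + activation keeps the intermediate activation in L1.
    h = ttnn.linear(x, w1, bias=b1, activation="gelu",
                    memory_config=ttnn.L1_MEMORY_CONFIG)
    return ttnn.linear(h, w2, bias=b2, memory_config=ttnn.L1_MEMORY_CONFIG)
```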

Stage 3 — Deeper Optimization

  • Maximize the number of cores used per inference
  • Implement deeper TT-specific optimizations:
    • Efficient KV-cache management for long sequences
    • Optimized DualFFN computation (parallel FFN paths)
    • Flash Attention or equivalent for attention layers
  • Minimize token generation latency
  • Batch processing for multiple utterances/speakers
  • Efficient sampling strategies (temperature, top-p, top-k; see the sketch after this list)
  • Pipeline audio encoding/decoding with LLM generation
  • Minimize memory and TM (tensor manipulation) overheads
  • Explore speculative decoding or other acceleration techniques
  • Document any advanced tuning, known limitations, or trade-offs
  • Stretch goals:
    • 60+ tokens/second generation speed
    • RTF < 0.2 for real-time applications
  • Support longer contexts (multi-turn dialog)
  • Efficient multi-speaker handling
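
For the sampling item above, a host-side sketch of temperature, top-k, and top-p filtering over a single logit row (plain PyTorch; on device, the same filtering could be staged with ttnn ops before a single readback):

```python
import torch

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95):
    # logits: 1-D tensor over the combined (text + audio) vocabulary.
    logits = logits / max(temperature, 1e-5)

    # Top-k: keep only the k largest logits.
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p (nucleus): drop the tail once cumulative probability passes p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    exclusive_cum = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    sorted_probs[exclusive_cum > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```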

🧭 Guidance & Starting Points

  • Use the TTNN model bring-up tech report as your primary reference
  • Reference LLM implementations in tt-metal for LLM-based model patterns
  • Use the official Higgs Audio v2 repository for model architecture details
  • Refer to the Higgs Audio v2 blog for technical details
  • Refer to tokenizer blog for audio tokenizer details
  • Refer to DualFFN architecture blog for architecture innovations
  • Refer to TT Fused ops PR #29236 for optimization opportunities
  • The model architecture consists of:
    • Audio tokenizer: Unified semantic and acoustic feature extraction
    • LLM backbone: Transformer with DualFFN architecture
    • DualFFN layers: dual feed-forward networks for text and audio tokens (see the sketch after this list)
    • Audio decoder: Token-to-waveform conversion (24 kHz output)
  • Key challenges:
    • DualFFN architecture optimization (dual FFN paths)
    • Large vocabulary (text + audio tokens)
    • Autoregressive generation latency
    • Audio tokenizer integration
    • Multi-speaker context management
  • Ask for help or file issues if ops are missing in TTNN
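
To make the DualFFN challenge concrete, a minimal PyTorch sketch of the idea as publicly described: each decoder layer routes audio tokens through a dedicated FFN while text tokens keep the original FFN, adding acoustic capacity with little extra compute per token. Module names, dimensions, and the plain SiLU MLP are hypothetical, not the reference implementation:

```python
import torch
import torch.nn as nn

class DualFFN(nn.Module):
    def __init__(self, d_model=2048, d_ff=8192):
        super().__init__()
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.audio_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, hidden, is_audio):
        # hidden: [batch, seq, d_model]; is_audio: [batch, seq] bool mask.
        # Audio positions take the audio FFN path, text positions the
        # text path; both paths share the layer's attention output.
        out = torch.empty_like(hidden)
        out[~is_audio] = self.text_ffn(hidden[~is_audio])
        out[is_audio] = self.audio_ffn(hidden[is_audio])
        return out
```

On device, computing both FFN paths unconditionally and selecting per position with a mask may map better to tiled execution than gather/scatter; which wins is an optimization question for Stages 2-3.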

🔎 Possible Approaches

  • Start from the official Higgs Audio v2 repository and port components sequentially, finishing with end-to-end pipeline integration
  • Validate each submodule's output against the PyTorch reference before integration (see the PCC sketch after this list)
  • For example, for the LLM backbone:
    • Start with standard transformer layers
    • Add the DualFFN architecture (the key innovation)
    • Optimize attention mechanisms
    • Implement an efficient KV-cache
  • Experiment with different sharding strategies
  • Use TTNN profiling tools to identify bottlenecks
  • Test diverse use cases
  • Open a draft PR early to get feedback on your approach
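
A per-submodule validation sketch using the Pearson correlation coefficient (PCC), the metric commonly used in tt-metal model tests; the 0.99 threshold and the commented usage are illustrative:

```python
import torch

def pcc(a: torch.Tensor, b: torch.Tensor) -> float:
    # Pearson correlation between flattened reference and TTNN outputs.
    a, b = a.flatten().float(), b.flatten().float()
    return torch.corrcoef(torch.stack([a, b]))[0, 1].item()

# Illustrative usage for one ported submodule:
#   ref_out = torch_layer(x)
#   tt_out = ttnn.to_torch(tt_layer(ttnn.from_torch(x, layout=ttnn.TILE_LAYOUT)))
#   assert pcc(ref_out, tt_out) >= 0.99
```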

📊 Result Submission Guidelines

Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is only made after all three stages are completed.

Deliverables:

  • Functional model implementation
  • Validation logs (output correctness)
  • Performance report + header for final review
